RRedon:Protocols/Variation pipeline
get Reference Genome
Download the hg18/build36 from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes
get SamTools
http://samtools.sourceforge.net/
Index the genome for samtools
samtools faidx hg18.fa
Align FASTQs vs the reference
With BWA
See main article: BWA
With MAQ
index the reference genome
maq fasta2bfa hg18.fa hg18.bfa
- split fastq files in to chunks of 1 million PE reads ← Fix this! why?
- foreach fastq file :
maq fastq2bfq $fq $bfq maq map -e 70 -a max_insert_size_for_lane -u $unmapped $mapped $bfa $bfq_1 $bfq_2
custom convert $unmapped > $unmapped.sam
samtools view $unmapped.sam > $unmapped.bam samtools view $mapped > $mapped.bam
← Fix this! what shall we do with unmapped ? look for indels ?
merge @bam using picardtools v1.08 MergeSamFiles.jar > $final.bam
samtools sort -n $final.bam $final.nameSort samtools fixmate $final.nameSort.bam $final.fm.bam samtools sort $final.fm.bam $final.fm_coordSort
Recalibrate
(citing)"After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome." ← Fix this! parameters, command line ?↓TODO
Remove Duplicates
(From Biostars:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations. MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:
← Fix this! command line
java -jar MarkDuplicates.jar I=chr1.sorted.bam O=chr.markdup.bam METRICS_FILE=jeter.metrics
SNP Calling
↓TODO
Create the VCF
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa -varout markdup.bam.vcf -vf VCF -pl SOLEXA
View the content of a BAM
samtools view output.sorted.bam chr1 | more
Consequences
Simple tool developed by Pierre.
Internal Sanger tool.
KG
Ensembl
- problems with track at UCSC: see https://lists.soe.ucsc.edu/pipermail/genome/2010-May/022391.html
SIFT
Polyphen
With SamTools
samtools pileup -vcf hg18.fa markdup.bam
Abbreviations
- PTR: Primary target region: exons. Regions that we wanted to target
- CTR: Capture target region (baits). Regions actually covered by baits
References
InDels
- ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data. Khiabanian H, Van Vlierberghe P, Palomero T, Ferrando AA, Rabadan R. BMC Res Notes. 2010 May 27;3(1):147. PMID: 20507604