Latest revision as of 01:58, 18 June 2010

Home Contact Internal Lab Members Protocols Publications Research Talks

get Reference Genome

See main article: RRedon:Protocols/Variation_pipeline/Reference_genome

Tools

SAMTOOLS http://samtools.sourceforge.net/
MAQ
BWA

Align FASTQs vs the reference

Mapping quality =30 = GOOD Mapping quality is

  p(alignment is incorrect)* phred_qual)

function of

repeats on refseq
base qual of the read
alignment parameters
paired read are better

BWA vs MAQ

repeat : (citing Biostarts) "Maq maps a repeat read randomly"
Mapping quality: (citing [ Biostars]):"If you want to find the SNPs, you do not really need to care about this. Maq will consider the mapping quality in genotype calling. If you want to pinpoint the structual variations with paired end reads, you should only pick up abnormal pairs with high mapping qualities (30, for example). If you are analysing ChIP-Seq data, setting a threshold on mapping quality may also be necessary."
dans le papier de Li and Durbin "Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform" il est dit que MAQ surestime la mapping quality = proba(alignement incorrect) . BWA=a true hit can always be found. ← Fix this! english

With BWA

See main article: BWA

With MAQ

See main article about MAQ

Recalibrate

(citing)"After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome." ← Fix this! parameters, command line ?↓TODO

Remove Duplicates

(From Biostars:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations. MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:

← Fix this! command line

java -jar MarkDuplicates.jar I=chr1.sorted.bam  O=chr.markdup.bam METRICS_FILE=jeter.metrics

Coverage

← Fix this! With GATK ?

← Fix this! 'min depth': how do we use it , when ?

SNP Calling

↓TODO

samtools

 samtools pileup -vcf ${HG18}.fa file.bam |\
  samtools/misc/samtools.pl varFilter -d 4 -D 1200 -Q 25

d: Minimum depth is 4x (around 8x is recommended)
D: Maximum depth is 1200x (but no more than around 3x mean is recommended)
Genotype quality score >= 20 ← Fix this! in varFilter ?
Snp quality score >= 25 ← Fix this! in varFilter ?
Q: RMS mapping quality >= 25 (Root Mean Square)

varFilter can also be used to keep one SNP in a 10pb window: cf. option in samtools/misc/samtools.pl varFilter

Varscan

see http://varscan.sourceforge.net/using-varscan.html

Create the VCF

 java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa  -varout markdup.bam.vcf -vf VCF  -pl SOLEXA

Indels

View the content of a BAM

   samtools view output.sorted.bam chr1 | more

Consequences

Simple tool developed by Pierre.

Internal Sanger tool.

KG

PhdSNP

Panther

Ensembl

problems with track at UCSC: see https://lists.soe.ucsc.edu/pipermail/genome/2010-May/022391.html

SIFT

See main article SIFT

Polyphen

See main article Polyphen

With SamTools

  samtools pileup -vcf hg18.fa  markdup.bam

Abbreviations

PTR: Primary target region: exons. Regions that we wanted to target
CTR: Capture target region (baits). Regions actually covered by baits
PCCR : Union of CTR and PTR regions
CTRplus :Capture Target Regions ± 100bp flank

References

InDels

ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data. Khiabanian H, Van Vlierberghe P, Palomero T, Ferrando AA, Rabadan R. BMC Res Notes. 2010 May 27;3(1):147. PMID: 20507604

Other tools

FASTX-Toolkit:a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing

@@ Line 1: / Line 1: @@
 {{RRedon}}
 =get Reference Genome=
-Download the hg18/build36 from UCSC: [http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/ http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes]
-=get SamTools=
+See '''main article''': [[RRedon:Protocols/Variation_pipeline/Reference_genome]]
-[http://samtools.sourceforge.net/ http://samtools.sourceforge.net/]
-=Recalibration=
+=Tools=
-([http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration citing])"After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome."
+* SAMTOOLS [http://samtools.sourceforge.net/ http://samtools.sourceforge.net/]
-{{fix-this|should be processed after alignment on reference ?}}{{to-do|}}
+* [[RRedon:Protocols/Variation_pipeline/MAQ|MAQ]]
+* [[RRedon:Protocols/Variation_pipeline/BWA|BWA]]
 =Align FASTQs vs the reference=
+'''Mapping quality''' =30 = '''GOOD'''
+Mapping quality is
+   p(alignment is incorrect)* phred_qual)
+function of
+* repeats on refseq
+* base qual of the read
+* alignment parameters
+* paired read are better
+==BWA vs MAQ==
+* '''repeat''' : (citing [http://biostar.stackexchange.com/questions/661 Biostarts]) "Maq maps a repeat read randomly"
+* '''Mapping quality''': (citing [ Biostars]):"If you want to find the SNPs, you do not really need to care about this. Maq will consider the mapping quality in genotype calling. If you want to pinpoint the structual variations with paired end reads, you should only pick up abnormal pairs with high mapping qualities (30, for example). If you are analysing ChIP-Seq data, setting a threshold on mapping quality may also be necessary."
+* dans le papier de Li and Durbin "Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform" il est dit que MAQ surestime la mapping quality = proba(alignement incorrect) . BWA=a true hit can always be found. {{fix-this|english}}
 ==With BWA==
-* [http://bio-bwa.sourceforge.net http://bio-bwa.sourceforge.net]
+See main article: [[RRedon:Protocols/Variation_pipeline/BWA|BWA]]
-Merge all the reference sequences into one fasta file hg18.fasta (? {{fix-this|need merge ?}})
+==With MAQ==
-Index the reference genome:
+See main article about [[RRedon:Protocols/Variation pipeline/MAQ|MAQ]]
-   bwa index -a bwtsw hg18.fasta
-===Pre mapping===
+=Recalibrate=
-{{fix-this|why do we need a pre-mapping ?}}
+([http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration citing])"After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome."
-* extract every 500th read from fastq file
+{{fix-this|parameters, command line ?}}{{to-do|}}
-* Align one fastq files
-'''-l'''  Take the first INT subsequence as seed
-'''-q'''  Parameter for read trimming.
-  bwa aln -l 32 -q 15 -foutput1.aln hg18.fasta file1.fastq.gz
-  bwa aln -l 32 -q 15 -foutput2.aln hg18.fasta file2.fastq.gz
-* Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
+=Remove Duplicates=
+(From [http://biostar.stackexchange.com/questions/789/about-paired-end-sequencing/790#790 Biostars]:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations.
+MarkDuplicates finds '''sequence pairs that map to the same position''', marking or removing the duplicates so you''' can work with unique pairs in downstream analyses'''. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:
-  bwa sampe hg18.fasta output1.aln output2.aln file1.fastq.gz file2.fastq.gz >  output.sam
-* export to bam
+{{fix-this|command line}}
-  samtools view output.sam >  output.bam
+ java -jar MarkDuplicates.jar I=chr1.sorted.bam  O=chr.markdup.bam METRICS_FILE=jeter.metrics
-* sort bam
-  samtools sort output.bam sorted_prefix
+=Coverage=
+{{fix-this|With GATK ?}}
-do insert size stats e.g. 99.8 percentile for MAQ max insert size {{fix-this|what does that mean ?}}
+{{fix-this|'min depth': how do we use it , when ?}}
+=SNP Calling=
+{{to-do|}}
-==With MAQ==
+==samtools==
-* [http://maq.sourceforge.net/ http://maq.sourceforge.net/]
+  samtools pileup -vcf ${HG18}.fa file.bam |\
+   samtools/misc/samtools.pl varFilter -d 4 -D 1200 -Q 25
+* d: Minimum depth is 4x    (around 8x is recommended)
+* D: Maximum depth is 1200x (but no more than around 3x mean is recommended)
+* Genotype quality score >= 20 {{fix-this|in varFilter ?}}
+* Snp quality score >= 25 {{fix-this|in varFilter ?}}
+* Q: RMS mapping quality >= 25 (Root Mean Square)
-maq fasta2bfa $ref_fa $ref_bfa
+varFilter can also be used to keep one SNP in a 10pb window: cf. option in samtools/misc/samtools.pl varFilter
+==Varscan==
+see http://varscan.sourceforge.net/using-varscan.html
-* split fastq files in to chunks of 1 million PE reads {{fix-this|why?}}
+==Create the VCF==
-** foreach fastq file :
-   maq fastq2bfq $fq $bfq
+   java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa  -varout markdup.bam.vcf -vf VCF  -pl SOLEXA
-  maq map -e 70 -a max_insert_size_for_lane -u $unmapped $mapped $bfa $bfq_1 $bfq_2
+==Indels==
+* http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0
+* [[RRedon:Protocols/PINDEL|PINDEL]]
+* [[RRedon:Protocols/DINDEL|Dindel]] http://sites.google.com/site/keesalbers/soft/dindel
-custom convert $unmapped > $unmapped.sam
+=View the content of a BAM=
-   samtools view $unmapped.sam > $unmapped.bam
+    samtools view output.sorted.bam chr1 | more
-   samtools view $mapped > $mapped.bam
-{{fix-this|what shall we do with unmapped ? look for indels ?}}
+=Consequences=
-merge @bam using picardtools v1.08 MergeSamFiles.jar > $final.bam
+Simple tool developed by Pierre.
-  samtools sort -n $final.bam $final.nameSort
+Internal Sanger tool.
-  samtools fixmate $final.nameSort.bam $final.fm.bam
-  samtools sort $final.fm.bam $final.fm_coordSort
-=Remove Duplicates=
-(From [http://biostar.stackexchange.com/questions/789/about-paired-end-sequencing/790#790 Biostars]:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations.
-MarkDuplicates finds '''sequence pairs that map to the same position''', marking or removing the duplicates so you''' can work with unique pairs in downstream analyses'''. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:
+==KG==
-{{fix-this|command line}}
+==PhdSNP==
- java -jar MarkDuplicates.jar I=chr1.sorted.bam  O=chr.markdup.bam METRICS_FILE=jeter.metrics
+==Panther==
+==Ensembl==
+* problems with track at UCSC: see [https://lists.soe.ucsc.edu/pipermail/genome/2010-May/022391.html https://lists.soe.ucsc.edu/pipermail/genome/2010-May/022391.html]
+==SIFT==
+See main article [[RRedon:Protocols/SIFT|SIFT]]
-=SNP Calling=
+==Polyphen==
-{{to-do|}}
+See main article [[RRedon:Protocols/Polyphen|Polyphen]]
 ==With SamTools==
     samtools pileup -vcf hg18.fa  markdup.bam
-==Create the VCF==
-  java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa  -varout markdup.bam.vcf -vf VCF  -pl SOLEXA
 =Abbreviations=
 * '''PTR''': Primary target region: exons.   Regions that we wanted to target
 * '''CTR''': Capture target region (baits). Regions actually covered by baits
+* '''PCCR''' : Union of CTR and PTR regions
+* '''CTRplus''' :Capture Target Regions ± 100bp flank
+=References=
+==InDels==
+* ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data. Khiabanian H, Van Vlierberghe P, Palomero T, Ferrando AA, Rabadan R. BMC Res Notes. 2010 May 27;3(1):147. [http://www.ncbi.nlm.nih.gov/pubmed/20507604 PMID: 20507604]
 =Other tools=
 * [http://hannonlab.cshl.edu/fastx_toolkit/ FASTX-Toolkit:a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing]

RRedon:Protocols/Variation pipeline: Difference between revisions

Latest revision as of 01:58, 18 June 2010

Contents

get Reference Genome

Tools

Align FASTQs vs the reference

BWA vs MAQ

With BWA

With MAQ

Recalibrate

Remove Duplicates

Coverage

SNP Calling

samtools

Varscan

Create the VCF

Indels

View the content of a BAM

Consequences

KG

PhdSNP

Panther

Ensembl

SIFT

Polyphen

With SamTools

Abbreviations

References

InDels

Other tools

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools