RRedon:Protocols/Variation pipeline: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
m (1)
Line 52: Line 52:


=SNP Calling=
=SNP Calling=
==With SamTools==
  samtools pileup -vcf hg18.fa  markdup.bam
==Create the VCF==
  java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa  -varout markdup.bam.vcf -vf VCF  -pl SOLEXA


=Abbreviations=
=Abbreviations=

Revision as of 02:44, 31 May 2010

Home        Contact        Internal        Lab Members        Protocols        Publications        Research        Talks       


get Reference Genome

Download the hg18/build36 from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes

get SamTools

http://samtools.sourceforge.net/

Align FASTQs vs the reference

With BWA

Merge all the reference sequences into one fasta file hg18.fasta (? ← Fix this! need merge ?) Index the reference genome:

  bwa index -a bwtsw hg18.fasta 

Pre mapping

← Fix this! why do we need a pre-mapping ?

  • extract every 500th read from fastq file
  • Align one fastq files

-l Take the first INT subsequence as seed

-q Parameter for read trimming.

 bwa aln -l 32 -q 15 -foutput1.aln hg18.fasta file1.fastq.gz
 bwa aln -l 32 -q 15 -foutput2.aln hg18.fasta file2.fastq.gz
  • Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
 bwa sampe hg18.fasta output1.aln output2.aln file1.fastq.gz file2.fastq.gz >  output.sam 
  • export to bam
 samtools view output.sam >  output.bam 
  • sort bam
 samtools sort output.bam sorted_prefix 

do insert size stats e.g. 99.8 percentile for MAQ max insert size ← Fix this! what does that mean ?


With MAQ

Remove Duplicates

(From Biostars:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations. MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:


← Fix this! command line

java -jar MarkDuplicates.jar I=chr1.sorted.bam  O=chr.markdup.bam METRICS_FILE=jeter.metrics


SNP Calling

With SamTools

  samtools pileup -vcf hg18.fa  markdup.bam

Create the VCF

 java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa  -varout markdup.bam.vcf -vf VCF  -pl SOLEXA

Abbreviations

  • PTR: Primary target region: exons. Regions that we wanted to target
  • CTR: Capture target region (baits). Regions actually covered by baits

Other tools