Home Contact Internal Lab Members Protocols Publications Research Talks

get Reference Genome

Download the hg18/build36 from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes

get SamTools

http://samtools.sourceforge.net/

Index the genome for samtools

 samtools faidx hg18.fa

Align FASTQs vs the reference

With BWA

http://bio-bwa.sourceforge.net

Merge all the reference sequences into one fasta file hg18.fasta (? ← Fix this! need merge ?)

Index the reference genome:

  bwa index -a bwtsw hg18.fasta

Pre mapping

← Fix this! why do we need a pre-mapping ?

extract every 500th read from fastq file
Align one fastq files

-l Take the first INT subsequence as seed

-q Parameter for read trimming.

 bwa aln -l 32 -q 15 -foutput1.aln hg18.fasta file1.fastq.gz
 bwa aln -l 32 -q 15 -foutput2.aln hg18.fasta file2.fastq.gz

Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.

 bwa sampe hg18.fasta output1.aln output2.aln file1.fastq.gz file2.fastq.gz >  output.sam

export to bam ?

 samtools view output.sam >  output.bam

← Fix this! I used this ? Use the reference genome indexed by samtools

 samtools import hg18.fa.fai output.sam output.bam
 samtools sort output.bam output.bam.sorted
 samtools index chr1.sorted.bam

sort bam

 samtools sort output.bam sorted_prefix

do insert size stats e.g. 99.8 percentile for MAQ max insert size ← Fix this! what does that mean ?

With MAQ

http://maq.sourceforge.net/

index the reference genome

 maq fasta2bfa hg18.fa hg18.bfa

split fastq files in to chunks of 1 million PE reads ← Fix this! why?
- foreach fastq file :

 maq fastq2bfq $fq $bfq
 maq map -e 70 -a max_insert_size_for_lane -u $unmapped $mapped $bfa $bfq_1 $bfq_2

custom convert $unmapped > $unmapped.sam

  samtools view $unmapped.sam > $unmapped.bam
  samtools view $mapped > $mapped.bam

← Fix this! what shall we do with unmapped ? look for indels ?

merge @bam using picardtools v1.08 MergeSamFiles.jar > $final.bam

 samtools sort -n $final.bam $final.nameSort
 samtools fixmate $final.nameSort.bam $final.fm.bam
 samtools sort $final.fm.bam $final.fm_coordSort

Recalibrate

(citing)"After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome." ← Fix this! parameters, command line ?↓TODO

Remove Duplicates

(From Biostars:)Removing duplicates refers to multiple reads that match at the same position in the genome. This is different than one read (or read pair) mapping to multiple genome locations. MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program:

← Fix this! command line

java -jar MarkDuplicates.jar I=chr1.sorted.bam  O=chr.markdup.bam METRICS_FILE=jeter.metrics

SNP Calling

↓TODO

With SamTools

  samtools pileup -vcf hg18.fa  markdup.bam

Create the VCF

 java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -I markdup.bam -R hg18.fa  -varout markdup.bam.vcf -vf VCF  -pl SOLEXA

View the content of a BAM

   samtools view output.sorted.bam chr1 | more

Abbreviations

PTR: Primary target region: exons. Regions that we wanted to target
CTR: Capture target region (baits). Regions actually covered by baits

Other tools

FASTX-Toolkit:a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing

RRedon:Protocols/Variation pipeline

Contents

get Reference Genome

get SamTools

Index the genome for samtools

Align FASTQs vs the reference

With BWA

Pre mapping

With MAQ

Recalibrate

Remove Duplicates

SNP Calling

With SamTools

Create the VCF

View the content of a BAM

Abbreviations

Other tools

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools