User:Jarle Pahr/Sequencing: Difference between revisions
Latest revision as of 11:53, 27 January 2014
http://nucleicacids.bitesizebio.com/articles/how-to-get-great-dna-sequencing-results/
http://barricklab.org/twiki/bin/view/Lab/ProceduresPrimerDesign
http://www.ki.se/kiseq/KIGene%20troubleshooting.pdf
http://nextgenseek.com/2012/12/evolution-of-next-gen-sequencing-development/
Nature focus issue - sequencing technology: http://www.nature.com/nbt/journal/v30/n11/index.html
Companies
Applied Biosystems:
https://www2.appliedbiosystems.com/about/presskit/pdfs/celebrating_25_years_aln_article.pdf?
Pacific Biosciences: http://bio.pgp.jhu.edu/~jgreene/NextGen/presentations/PacBio_SMRT_Sequencing_Oct_2012.pdf
Technologies
For a comparison of next-generation sequencing methods, see http://en.wikipedia.org/wiki/Dna_sequencing#Next-generation_methods
See also:
SeqAnswers.com Tech summaries: http://seqanswers.com/index.php?pageid=summaries
Sanger sequencing (chain termination method)
http://users.ugent.be/~avierstr/principles/seq.html
http://www.ibt.lt/sc/files/DNASeqCG.pdf
Pyrosequencing ("454 sequencing")
Pyrosequencing is a "sequence by synthesis" method developed by Mostafa Ronaghi and Pål Nyrén at the Royal Institute of Technology, Stockholm. Sequences are determined by observation of light emission upon addition of a nucleotide complementary to the first unpaired nucleotide of the template.
Quote from Wikipedia:Pyrosequencing:
"ssDNA template is hybridized to a sequencing primer and incubated with the enzymes DNA polymerase, ATP sulfurylase, luciferase and apyrase, and with the substrates adenosine 5´ phosphosulfate (APS) and luciferin."
Sequencing proceeds as follows:
- Addition of one of the four dNTPs (dATPαS is substituted for dATP, as the former is not a substrate for luciferase). If the dNTP is complementary, DNA polymerase incorporates the nucleotide, releasing pyrophosphate (PPi).
- ATP sulfurylase catalyzes the reaction of PPi and adenosine 5' phosphosulfate to create ATP.
- ATP fuels the luciferase-catalyzed conversion of luciferin to oxyluciferin, generating visible light.
- Unincorporated nucleotides and ATP are degraded by apyrase.
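The cycle above can be illustrated with a toy simulation. This is an idealized sketch that makes assumptions not stated in the text: nucleotides are dispensed in a fixed cyclic order (`TACG` is an arbitrary choice), the light signal is exactly proportional to the number of nucleotides incorporated in a flow, and apyrase clears everything between flows. The function names are hypothetical.

```python
def flowgram(synthesized, flow_order="TACG", max_flows=100):
    """Simulate idealized pyrosequencing flows.

    `synthesized` is the strand being built (the complement of the template).
    Returns a list of (dispensed nucleotide, signal) pairs, where the signal
    is the number of nucleotides incorporated during that flow.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A complementary nucleotide is incorporated once per position of a
        # homopolymer run, so the signal grows with the run length.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Convert a flowgram back into a base sequence."""
    return "".join(base * count for base, count in signals)


signals = flowgram("TTAG")
print(signals)
print(call_bases(signals))
```

In real data the signal for long homopolymer runs is noisy rather than an exact integer, which this sketch does not model.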
454 sequencing performs massively parallel pyrosequencing. Library DNA containing adapter sequences is adsorbed to DNA-capturing beads. The DNA bound to each bead is then amplified by emulsion PCR, in which the beads with bound DNA are mixed with PCR reagents and emulsion oil to create a water-in-oil emulsion containing many "microreactors" consisting of beads surrounded by water. Following PCR amplification, the DNA-binding beads are isolated and deposited into the wells of a microtiter plate. Beads with pyrosequencing enzymes are then added to the plate. Finally, pyrosequencing is performed by processing the plate in a sequencing machine. 400 000+ DNA fragments/beads can be processed per plate.
Using "multiplex identifiers", different genomic libraries can be bar-coded, facilitating sequencing of several libraries in the same sequencing run.
Platforms:
Platform | Throughput (bases/run) | Time per run | Average (a)/mode (m) read length (nt) | Accuracy | Introduced (year) |
---|---|---|---|---|---|
GS FLX+ | 700 Mbp | 23 h | Up to 1000 | 700 bp (m) | |
GS Junior | 35 Mbp | 12 h | 400 | 400 bp (a) at Phred20/read | |
GS FLX:
References:
Introductory paper, 454 sequencing: http://www.ncbi.nlm.nih.gov/pubmed/16056220?dopt=Abstract&holding=npg
http://www.wellcome.ac.uk/Education-resources/Education-and-learning/animations/dna/wtx056046.htm
The development and impact of 454 sequencing
Overview of 454 sequencing: http://classes.soe.ucsc.edu/bme215/Spring09/PPT/BME%20215-5.pdf
Illumina (Solexa) sequencing
http://www.illumina.com/technology/sequencing_technology.ilmn
Platform | Throughput (bases/run) (maximum) | Time per run | Read length (nt) | Accuracy | Features | Introduced (year) |
---|---|---|---|---|---|---|
MiSeq Personal Sequencer | Up to 8.5 Gbp | 4 - 48 h | 250 | >70% bases higher than Q30 at read length 2 x 300 bp | | |
HiSeq 2500/1500 | 600 Gb | | 2 x 100 | >80 % higher than Q30 | | |
HiSeq 2000/1000 | 300 Gb | | 2 x 100 | >80 % higher than Q30 | | |
Genome Analyzer IIx | 95 Gb | | 2 x 150 | >80 % higher than Q30 | | |
MiSeq datasheet: http://www.illumina.com/documents/products/datasheets/datasheet_miseq.pdf
Side by side comparison of Illumina sequencers: http://www.illumina.com/systems/sequencing.ilmn
Illumina - an introduction to NGS: http://www.illumina.com/Documents/products/Illumina_Sequencing_Introduction.pdf
Ion semiconductor sequencing
Ion Torrent: http://www.invitrogen.com/site/us/en/home/brands/Ion-Torrent.html?cid=fl-iontorrent
Platforms:
Platform | Throughput (bases/run) | Time per run | Typical read length | Accuracy | Introduced (year) |
---|---|---|---|---|---|
Ion PGM sequencer | 10 Mb to 1 Gb | 90 min+ | 35-400 bp | | |
Ion Proton sequencer | 1 human genome | 2 h+ | 100 bp | | |
Nanopore sequencing
- Two main nanopore types: biological nanopores (protein pores embedded in lipid membranes) and solid-state nanopores.
Biological nanopores:
Solid-state nanopores:
- Potentially easier shipping/handling (more robust) and integration with electronics.
- Technology development less advanced than for biological nanopores
Oxford Nanopore: http://www.nanoporetech.com/
http://oldwww.phys.washington.edu/groups/nanopore/
Manrao et al. 2012. Reading DNA at single-nucleotide resolution with a mutant MspA nanopore and phi29 DNA polymerase: http://211.144.68.84:9998/91keshi/Public/File/49/30-4/pdf/nbt.2171.pdf
http://www.nature.com/nnano/journal/vaop/ncurrent/full/nnano.2013.71.html
- Too good to be true? Violating laws of physics??
http://www.upenn.edu/pennnews/news/penn-research-makes-advance-nanotech-gene-sequencing-technique
Differentiation of Short, Single-Stranded DNA Homopolymers in Solid-State Nanopores: http://pubs.acs.org/doi/abs/10.1021/nn4014388
Single molecule real time sequencing (Pacific Biosciences)
Microscopic wells on a chip (zero-mode waveguides, ZMWs) each contain a single DNA polymerase enzyme bound to the bottom of the well, which accepts a single DNA molecule as template. Fluorescently labelled dNTPs are used for DNA synthesis. Upon incorporation of a dNTP, the fluorescent tag is cleaved from the nucleotide and diffuses out of the observation area within the ZMW. The sequence is determined optically by observing incorporation events.
http://www.pacificbiosciences.com/
Platforms:
PacBio RS:
http://www.pacificbiosciences.com/products/
http://www.pacificbiosciences.com/brochure
http://www.pacificbiosciences.com/pdf/Software_and_Analysis_Brochure.pdf
SOLiD sequencing (Applied Biosystems)
DNA nanoball sequencing
http://www.completegenomics.com/services/technology/
Platforms
Qiagen GeneReader
Opgen Argus: http://www.opgen.com/products-services/argus-system
Comparisons and reviews
http://link.springer.com/content/pdf/10.1007%2Fs00439-013-1321-4.pdf
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0030087
http://www.molecularecologist.com/next-gen-table-3c/
Illumina HiSeq
HiSeq 2000
HiSeq 2500
Ion Torrent
MiSeq
Ion Proton
Capillary sequencers
Applied Biosystems 3730xl : http://www.harlowscientific.com/Sequencers-ABI-3730xl-DNA-Sequencer-Harlow-Scientific
http://www6.appliedbiosystems.com/products/abi3730xlspecs.cfm
List price: $357,000.00
ABI Prism 3700: Released 1999.
Lowest observed used price: $250
ABI Prism 310:
ABI Prism 377: Released in 1995.
See also http://en.wikipedia.org/wiki/Applied_Biosystems
Concepts
K-mer
High-throughput sequence assemblers often break the produced reads into shorter sub-sequences of length k (k-mers) during assembly. A set of 100-nt reads, for example, cannot be expected to contain every 100-mer present in the genome.
For sufficiently small k, however, the k-mers obtained by breaking up the reads often represent nearly all k-mers of the genome, which is a prerequisite for assembly using de Bruijn graphs (http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html#bx2).
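The effect of k on k-mer coverage can be shown with a small example. The genome, the reads and the function names below are made up for illustration only.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_coverage(genome, reads, k):
    """Fraction of the genome's distinct k-mers that occur in at least one read."""
    genome_kmers = kmers(genome, k)
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    return len(genome_kmers & read_kmers) / len(genome_kmers)


genome = "ATGGCGTGCA"
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # three overlapping 6-nt reads

print(kmer_coverage(genome, reads, 6))  # the reads miss some genomic 6-mers
print(kmer_coverage(genome, reads, 3))  # every genomic 3-mer is present
```

With k equal to the read length, only the 6-mers that happen to start at a read start are seen; with k = 3 the same reads contain every 3-mer of the toy genome.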
Automated K-mer selection: http://perso.eleves.bretagne.ens-cachan.fr/~chikhi/2013-july-20-hitseq.pdf
De Bruijn graph
http://en.wikipedia.org/wiki/De_Bruijn_graph
See also Compeau et al. 2011, Nature Biotechnology - How to apply de Bruijn graphs to genome assembly: http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html
- Finding a Hamiltonian cycle, which visits every node of a graph exactly once, is computationally expensive (NP-complete).
- Finding a cycle that visits every edge of a graph exactly once (an Eulerian cycle) is much easier.
- Ergo: instead of assigning each k-mer to a node, we can assign each k-mer to an edge, allowing construction of a De Bruijn graph (http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html#bx2).
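The edge-based construction can be sketched as follows. This is a minimal illustration, not the implementation of any particular assembler: each k-mer becomes an edge from its (k-1)-nt prefix to its (k-1)-nt suffix, and an Eulerian path is found with Hierholzer's algorithm. The sketch assumes error-free k-mers and that an Eulerian path exists; all names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Build a De Bruijn graph: one edge per k-mer, from prefix to suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Return a path that uses every edge exactly once (Hierholzer's algorithm)."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (node for node in graph if len(graph[node]) - in_degree[node] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finished
    return path[::-1]


def spell(path):
    """Spell the sequence given by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


genome = "ATGGCGTGCA"
k = 3
genome_kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
assembled = spell(eulerian_path(de_bruijn(genome_kmers)))
print(assembled)
```

The assembled string always contains exactly the input k-mers, but it need not equal the original sequence: when repeats allow more than one Eulerian path, the k-mers alone cannot tell the alternatives apart.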
http://homolog.us/Tutorials/index.php?p=2.1&s=1
http://www.pnas.org/content/early/2012/07/25/1121464109.abstract
http://alexbowe.com/succinct-debruijn-graphs/
Bridge amplification
http://seq.molbiol.ru/sch_clon_ampl.html
RNA-Seq
http://blog.sbgenomics.com/history-of-rna-seq/
http://www.ncbi.nlm.nih.gov/pubmed/23716638?dopt=Abstract
http://en.wikipedia.org/wiki/RNA-Seq
http://seqanswers.com/forums/showpost.php?p=102911&postcount=60
SeqAnswers - posts tagged RNA seq:
http://seqanswers.com/forums/tags.php?tag=rna-seq
http://genome.cshlp.org/content/early/2011/09/07/gr.124321.111
http://www.illumina.com/technology/mrna_seq.ilmn
RNA-Seq: a revolutionary tool for transcriptomics.: http://www.ncbi.nlm.nih.gov/pubmed/19015660
Direct RNA Sequencing:
Software:
Velvet: http://en.wikipedia.org/wiki/Velvet_%28algorithm%29
Tophat: http://tophat.cbcb.umd.edu/
Cufflinks: http://cufflinks.cbcb.umd.edu/
(See also Tuxedo suite)
Genotyping by Sequencing (GBS)
http://www.maizegenetics.net/gbs-overview
ROC
See http://en.wikipedia.org/wiki/Receiver_operating_characteristic
Edit distance
See http://en.wikipedia.org/wiki/Levenshtein_distance
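The Levenshtein distance is the minimum number of single-character substitutions, insertions and deletions needed to turn one string into another. A minimal dynamic-programming sketch (function name chosen here for illustration):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("ACGT", "AGT"))
```

Only two rows of the table are kept in memory, so the space used grows with the length of `b` rather than with the product of both lengths.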
Color Space/2-base encoding
See
http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space.html
http://www.biostars.org/p/43855/
http://marketing.appliedbiosystems.com/images/Product_Microsites/Solid_Knowledge_MS/pdf/CSHL_Fu.pdf
See also
http://en.wikipedia.org/wiki/2_Base_Encoding
Targeted sequencing
Targeted capture kits can be used to sequence only a subset of genomic DNA. The human exome (as defined by the Consensus CDS (CCDS) project) totals about 38 Mb, covering about 1.22 % of the human genome.
(The SureSelect Human All Exon Kit )
See also: http://massgenomics.org/2011/10/major-exome-platforms-compared.html
Scaffolding
http://genome.jgi-psf.org/help/scaffolds.html
http://seqanswers.com/wiki/How-to/scaffolding
http://bioinformatics.oxfordjournals.org/content/early/2012/04/05/bioinformatics.bts175
http://www.scfbm.org/content/7/1/4
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Paired-end reads
N50 Statistic
N50 length: For a collection of contigs, the largest length L such that the contigs of length L or longer together contain at least half of the total length of all contigs.
NG50: As N50, except that the threshold is half of the genome size rather than half of the total contig length.
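Both statistics can be computed with the same loop. The sketch below follows the definitions above; the function name and the `target_total` parameter are hypothetical.

```python
def n50(lengths, target_total=None):
    """Return the N50 of a list of contig lengths.

    If target_total is given (e.g. the genome size), the result is the NG50.
    Returns None if the contigs never reach half of target_total.
    """
    if target_total is None:
        target_total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= target_total:  # at least half of the target covered
            return length
    return None


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # total length 54
print(n50(contigs))                      # N50
print(n50(contigs, target_total=80))     # NG50 for an assumed genome size of 80
```

Because NG50 uses a fixed genome size, it allows comparison between assemblies of the same genome even when their total assembled lengths differ.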
http://en.wikipedia.org/wiki/N50_statistic
http://seqanswers.com/forums/showthread.php?p=41420
Haplotypes
See also:
http://hapmap.ncbi.nlm.nih.gov/originhaplotype.html.en
http://en.wikipedia.org/wiki/Haplotype
http://en.wikipedia.org/wiki/Haplogroup
Loss of Heterozygosity
http://en.wikipedia.org/wiki/Loss_of_heterozygosity
Copy number variants (CNVs)
Short Tandem Repeats (STRs)
Genotyping of STRs is used to produce forensic DNA profiles. See http://massgenomics.org/2013/01/identifying-samples-genomic-data.html
http://www.biology.arizona.edu/human_bio/activities/blackett2/str_codis.html
http://www.cstl.nist.gov/strbase/fbicore.htm
Databases
http://www.ncbi.nlm.nih.gov/gap
Sequence Read Archive: http://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive: http://www.ebi.ac.uk/ena/
Assembly and mapping
Alignment to multiple reference sequences: http://bioinformatics.oxfordjournals.org/content/29/13/i361.full
http://www.nature.com/nmeth/journal/v10/n6/full/nmeth.2474.html
http://denovoassembler.sourceforge.net/
https://github.com/sebhtml/ray
http://dskernel.blogspot.no/2013/06/open-access-doctoral-theses-on-de-novo.html
Compendium of HTS mappers: http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Comparison of assemblers: http://lh3lh3.users.sourceforge.net/alnROC.shtml
SeqAnswers:Software packages for next gen sequence analysis: http://seqanswers.com/forums/showthread.php?t=43 (Thread closed since 2009)
A5: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0042304
BWA: http://bio-bwa.sourceforge.net/
BWA-MEM: http://arxiv.org/abs/1303.3997
Bowtie - An ultrafast memory-efficient short read aligner: http://bowtie-bio.sourceforge.net/index.shtml
Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
http://www.ncbi.nlm.nih.gov/pubmed/20211242
Primers and reviews:
http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf
NCBI primer on genome assembly methods: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/assembly.shtml
Nature Biotechnology Primer - How to map billions of short reads onto genomes: http://www.nature.com/nbt/journal/v27/n5/full/nbt0509-455.html
Bioinformatics, 2012: Tools for mapping high-throughput sequencing data: http://bioinformatics.oxfordjournals.org/content/28/24/3169
A survey of sequence alignment algorithms for next-generation sequencing: http://bib.oxfordjournals.org/content/11/5/473.full
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0019175
De novo assembly:
Optimal Assembly for High Throughput Shotgun Sequencing: http://arxiv.org/abs/1301.0068
Counterintuitively, too high coverage can be problematic: http://seqanswers.com/forums/showthread.php?t=24965
https://github.com/lexnederbragt/denovo-assembly-tutorial/tree/master/scripts
Sequencing services
Service | Sample specification | Primer specification | Ship to | Link |
---|---|---|---|---|
GATC LightRun | Add 5 uL DNA (80-100 ng/uL plasmid or 20-80 ng/uL purified PCR product) + 5 uL 5 uM (5 pmol/uL) primer to the same tube | Tm 52-58 C, 17-19 bp, (8-9 G+C for 18-mer) G or C at 3' end (max 3 Gs or Cs), maximum 4 bp run. | GATC Biotech AG. European Custom Sequencing Centre. Gottfried-Hagen-Strasse 20. 51105 Köln. | http://www.gatc-biotech.com/en/lp4/new-lightrun-sequencing.html |
Macrogen Single-pass | Add 20 uL DNA (100 ng/uL plasmid or 50 ng/uL purified PCR product) to one tube. Add 20 uL primer (10 pmol/uL) to a separate tube. | 18-25 bp, 40-60 % GC, Tm 55-60 C | Macrogen Europe, IWO, Kamer IA3-195, Meibergdreef 39, 1105 AZ Amsterdam Zuid-oost. Netherlands. Attention: J.S. Park. | http://dna.macrogen.com/eng/support/seq/seq_submission.jsp |
Variant calling
Sequencing-based techniques
ChIP-sequencing
RNA-seq
Single-cell sequencing
http://trap.it/#!traps/id/f294f009-bb0f-4f14-9e59-e84cf36d2560/jump/6GoIaq61z002q6RDiaaY
Sequencing/genomics centres
http://openwetware.org/wiki/BioMicroCenter:Sequencing
New York Genome Center: http://nygenome.org/
The Genome Analysis Centre (UK): http://jobs.tgac.ac.uk/
Norwegian Cancer Genomics Consortium: http://www.cancergenomics.no/
See also: http://omicsmaps.com/
Sequencing facilities in Norway:
(Incomplete)
Oslo:
Akershus University Hospital (Ahus): 1 x Ion Torrent
Norwegian High-Throughput Sequencing Centre (NSC) Oslo, Norway: 2 x Roche/454, 1 x Illumina HiSeq, 1 x PacBio, 1 x Ion Torrent, 1 x Illumina MiSeq
Helse Sør-Øst/University of Oslo Genomics Core Facility Oslo, Norway: 1 x Illumina GA2, 1 x MiSeq, 1 x HiSeq
NTNU Genomics Core Facility Sør-Trøndelag, Norway: 1 x HiSeq
Telemark Hospital Telemark, Norway: 1 x Illumina HiSeq
Bergen:
Trondheim:
UNN:
http://www.unn.no/dna-sequencing/category11734.html
Contact persons:
Lex Nederbragt: http://contig.wordpress.com/about/
Dr. Leonardo A. Meza-Zepeda Head Helse Sør-Øst/ Univ. of Oslo Genomics Core Facility
Kjetill S. Jakobsen
Professor, Group Leader (CEES node)
Dag Erik Undlien
Professor, Group Leader (IMG node)
Other groups which employ HTS:
CIGENE, UMB. See https://sites.google.com/site/seqomics
Primers
Custom primers
Name | Length (bp) | Sequence | Tm (C) [calculated] | Tm (C) [Analytical] | GC (% / bp) | Comment |
---|---|---|---|---|---|---|
pJP-1_seq5 | 18 | CAGCGTGCGAGTGATTAT | 53.9/60.6 (2)/52.6 (3) | 50 | Binds upstream of XylS region in pSB-M1g | |
pJP-1_seq6 | 18 | AGACCACATGGTCCTTCT | 57.5° (2)/52.8 ºC(3) | 53.9 | 50 | Binds near end of GFPmut3 in pSB-M1g |
SeqMG1 | AGCAGATCCACATCCTTGAA | 62.7 (2)/53.7 (3) | Binds at nt 5672 of pSB-M1g, upstream of AgeI site. Designed to Macrogen sequencing primer criteria. | |||
pSB-SeqA | 18 | TGCAAGAAGCGGATACAG | 56 / 60.7°C (2)/52.3 ºC (3) | 50 | Binds at nt 7729 of pSB-M1g, upstream of Pm promoter and PciI site. |
Universal primers
http://www.generi-biotech.com/sequencing-universal-seguencing-primers/
http://www.synthesisgene.com/tools/Universal-Primers.pdf
http://www.genewiz.com/public/universalprimers.aspx
https://secure.eurogentec.com/product/research-universal-primers.html
Tm calculations:
1: CloneManager
2: Thermo Scientific
3: IDT Oligoanalyzer
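As a rough cross-check of the table values, GC content and an approximate Tm can be computed directly from the primer sequences. Note the assumptions: the sketch uses the simple Wallace rule (2 C per A/T, 4 C per G/C), which is only a coarse estimate for short oligos and is not the method used by any of the three tools listed above. The function names are hypothetical.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Approximate melting temperature (C) by the Wallace rule (assumption)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Sequences taken from the custom primer table above.
primers = {
    "pJP-1_seq5": "CAGCGTGCGAGTGATTAT",
    "pJP-1_seq6": "AGACCACATGGTCCTTCT",
    "SeqMG1": "AGCAGATCCACATCCTTGAA",
    "pSB-SeqA": "TGCAAGAAGCGGATACAG",
}

for name, seq in primers.items():
    print(name, len(seq), gc_percent(seq), wallace_tm(seq))
```

Different Tm models (nearest-neighbour thermodynamics, salt corrections) give noticeably different values, which is consistent with the spread between the three calculated Tm columns in the table.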
A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers
http://www.biomedcentral.com/1471-2164/13/341
Software
Consed graphical editor: http://bioinformatics.oxfordjournals.org/content/early/2013/08/31/bioinformatics.btt515.abstract
HTseq. Sequencing analysis with Python: http://seqanswers.com/forums/showthread.php?t=4805
Bamformatics: http://sourceforge.net/projects/bamformatics/?source=directory
http://www.digitalbiologist.com/2013/06/python-next-gen-sequencing.html
DISCOVAR: http://www.broadinstitute.org/software/discovar/blog/
ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
Ray Cloud demo: http://browser.cloud.boisvert.info/client/?map=0&section=3&region=41&location=0
http://debian-med.alioth.debian.org/tasks/bio-ngs
Isaac / Illumina Open source software: https://github.com/sequencing
https://www.broad.harvard.edu/crd/wiki/index.php/Main_Page
Chromatogram viewers: http://www.dnaseq.co.uk/chrom_view.html
CodonCode aligner: http://www.codoncode.com/aligner/
BioEdit: http://www.mbio.ncsu.edu/BioEdit/bioedit.html
VCF view: http://www.easih.ac.uk/software.php
FinchTV: http://www.geospiza.com/Products/finchtv.shtml
About SCF (sequence chromatogram format) files: http://staden.sourceforge.net/manual/formats_unix_2.html
https://wiki.nci.nih.gov/display/TCGA/Sequence+trace+files
http://code.google.com/p/seqtrace/
http://www.phrap.com/background.htm
http://en.wikipedia.org/wiki/Phrap
http://www.ncbi.nlm.nih.gov/books/NBK47537/
http://www.bio.net/bionet/mm/autoseq/1999-April/001368.html
High-throughput sequencing tools:
SAM tools: http://samtools.sourceforge.net/
Burrows-Wheeler Aligner (BWA): http://bio-bwa.sourceforge.net/
http://seqanswers.com/wiki/BWA
Maq: Mapping and Assembly with Qualities
See also http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
The Genome Analysis Center - software: https://github.com/TGAC
Genome Analysis Toolkit (GATK): http://www.broadinstitute.org/gatk/
Sequencing quality and standards:
http://www.bio.net/bionet/mm/autoseq/1999-April/001366.html
http://en.wikipedia.org/wiki/Phred_quality_score
Sequencing projects
http://www.microbe.net/undergraduate-research-built-environment-genomes/
File formats
FASTG: http://fastg.sourceforge.net/
Sequence Alignment/Map (SAM) format: "A generic format for storing large nucleotide sequence alignments". Tab-delimited text format consisting of a header section (optional) and an alignment section.
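The two-section layout can be read with a few lines of code. The sketch below is for illustration only (for real work, use SAMtools or a dedicated library). It relies on header lines starting with `@` and on the 11 mandatory tab-separated alignment fields defined in the SAM specification; the example text and the `TAGS` key are made up here.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam(text):
    """Split SAM text into header lines and a list of alignment records."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):  # header section
            header.append(line)
            continue
        columns = line.split("\t")
        record = dict(zip(SAM_FIELDS, columns[:11]))
        for name in INT_FIELDS:
            record[name] = int(record[name])
        record["TAGS"] = columns[11:]  # optional fields, left unparsed
        alignments.append(record)
    return header, alignments


example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\n"
)
header, alignments = parse_sam(example)
print(len(header), alignments[0]["RNAME"], alignments[0]["POS"])
```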
http://samtools.sourceforge.net/
http://samtools.sourceforge.net/SAM1.pdf
See also:
http://compbio.soe.ucsc.edu/sam.html
http://www.ncbi.nlm.nih.gov/pubmed/19505943
http://seqanswers.com/wiki/SAM
Binary Alignment/Map (BAM) format (binary, compressed SAM):
Binary, compressed file format containing the same information as SAM files.
From https://wiki.nci.nih.gov/display/TCGA/Binary+Alignment+Map : "Centers align sequence reads to a reference genome to produce a Sequence Alignment Map (SAM) format file. The SAM file is then converted into a binary form, or Binary-sequence Alignment Format (BAM) file"
See also http://genome.ucsc.edu/goldenPath/help/bam.html
Variant Call Format (VCF):
Standard created by the 1000 Genomes Project.
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
From http://www.ensembl.org/info/website/upload/large.html :
"The VCF format is a tab delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to large scale insertions and deletions."
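A single VCF data line can be split into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) as sketched below. This is a simplified illustration: the function names and the `GENOTYPES` key are hypothetical, the example line is made up in the style of the specification, and symbolic alleles such as `<DEL>` are not handled.

```python
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse one tab-delimited VCF data line (not a '#' header line)."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIELDS, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles allowed
    info = {}
    for item in record["INFO"].split(";"):
        if item in ("", "."):  # missing INFO
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # keys without a value are flags
    record["INFO"] = info
    record["GENOTYPES"] = columns[8:]  # FORMAT and per-sample columns, unparsed
    return record


def variant_type(ref, alt):
    """Classify a REF/ALT pair by allele length (simplified)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ\t0|0:48"
record = parse_vcf_line(line)
print(record["CHROM"], record["POS"], variant_type(record["REF"], record["ALT"][0]))
```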
ABI (Applied Biosystems) format:
FASTQ:
FASTQ files encode identified nucleotides together with their corresponding quality scores. The interpretation of the quality scores may vary depending on the source of the sequence, but the most used is the "Sanger format" (Phred quality scores).
http://en.wikipedia.org/wiki/FASTQ_format
http://maq.sourceforge.net/fastq.shtml
http://www.bioperl.org/wiki/FASTQ_sequence_format
The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants: http://nar.oxfordjournals.org/content/38/6/1767.full
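In the Sanger FASTQ variant, each quality character encodes a Phred score Q as the ASCII code minus 33, and Q relates to the estimated error probability P by Q = -10 log10(P). A minimal sketch of reading records and decoding qualities follows; it assumes unwrapped four-line records, and the function names and the example read are made up.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into (name, sequence, quality) tuples."""
    lines = [line for line in text.splitlines() if line]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], seq, qual))
    return records


def phred_scores(qual, offset=33):
    """Decode a quality string; offset 33 is the Sanger convention."""
    return [ord(char) - offset for char in qual]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


example = "@read1\nGATTACA\n+\nIIII555\n"
for name, seq, qual in parse_fastq(example):
    print(name, seq, phred_scores(qual))
```

Older Solexa/Illumina variants use a different offset (and, for Solexa, a different score definition), so the offset should be checked against the source of the file before decoding.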
SRA:
Bibliography
What is next generation sequencing? http://ep.bmj.com/content/early/2013/08/28/archdischild-2013-304340.long
SeqAnswers Literature watch: http://seqanswers.com/forums/forumdisplay.php?f=10
nar.oxfordjournals.org/content/41/1/e1.full?sid=e66b42ac-a309-47cf-8cd1-94e1229a098e#ref-12
Assembly of large genomes using second-generation sequencing.: http://www.ncbi.nlm.nih.gov/pubmed/20508146?dopt=Abstract&holding=npg
http://online.liebertpub.com/doi/full/10.1089/cmb.2011.0201
Comparison of variant-calling software
http://www.nature.com/nmeth/journal/v6/n11s/abs/nmeth.1376.html
http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1935.html
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data: http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2474.html
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
GAGE: A critical evaluation of genome assemblies and assembly algorithms: http://genome.cshlp.org/content/22/3/557
2011
Miller 2011 - Assembly Algorithms for Next-Generation Sequencing Data: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646/
2012
An Integrated Pipeline for de Novo Assembly of Microbial Genomes : http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0042304
2013
Genome sequencing and next-generation sequence data analysis: A comprehensive compilation of bioinformatics tools and databases: http://www.scirp.org/journal/PaperInformation.aspx?PaperID=30744
Harnessing Virtual Machines to simplify next generation DNA sequencing analysis: http://bioinformatics.oxfordjournals.org/content/early/2013/06/20/bioinformatics.btt352.abstract
High-throughput sequencing for biology and medicine: http://www.nature.com/msb/journal/v9/n1/full/msb201261.html
DNA sequencing using electrical conductance measurements of a DNA polymerase: http://www.nature.com/nnano/journal/vaop/ncurrent/full/nnano.2013.71.html
Li et al.: Memory Efficient Minimum Substring Partitioning: http://www.vldb.org/pvldb/vol6/p169-li.pdf
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.: http://www.ncbi.nlm.nih.gov/pubmed/23644548
Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly : http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0062856
Commentary
The state of NGS variant calling - Don't panic: http://blog.goldenhelix.com/?p=1725
Assemblies: The good, the bad and the ugly: http://www.nature.com/nmeth/journal/v8/n1/full/nmeth0111-59.html
A tale of three next generation sequencers: http://www.biomedcentral.com/content/pdf/1471-2164-13-341.pdf
http://core-genomics.blogspot.no/2012_08_01_archive.html
http://ivory.idyll.org/blog/thoughts-on-assemblathon-2.html
http://tidsskriftet.no/article/2928729
Nextgenseek: http://nextgenseek.com/
Slides
Courses
http://www.forbio.uio.no/events/courses/2013/radseq.html
http://genomics.no/oslo/index.php?page=courses
http://ged.msu.edu/angus/bioinformatics-courses.html
Procedures and troubleshooting
Results/success rates:
Sample | DNA | Primer | Result |
---|---|---|---|
Barcode | - | - |
Sanger sequencing:
http://seqcore.brcf.med.umich.edu/doc/dnaseq/trouble/badseq.html
Depending on economy and available sample amounts, consider sequencing each sample twice to more easily discern between possible sequencing errors and actual mutations.
Suggested work procedure when receiving Sanger sequencing results (plasmids, etc.):
- First, open the chromatogram file to assess the read length and overall quality.
- If applicable, compare the automatically trimmed sequence (.fas) file and the expected sequence using BLAST or another sequence alignment tool. OR: consider using raw sequence copied from a chromatogram viewer.
- If no hit is found, make sure that the most permissive algorithm (blastn or similar) is used. If there is still no hit, manually inspect the chromatogram (.abi) file using a chromatogram viewer. If the trimmed file is small compared to the raw sequence (low chromatogram quality) and the remainder appears sensible, redo the search using "raw" called bases (copied directly from the chromatogram viewer). When making notes on sequence results, always record which sequence (PHRED-generated, or "raw" sequence from a chromatogram viewer) was used for a given analysis (e.g. a BLAST search). Otherwise, confusion may ensue: the note says 100 % match, but a repeated BLAST search gives no match or a bad match, etc.
- As a quick check, the sequence file can be searched for a short portion of the expected sequence, while allowing for some mismatches (which may be present because of sequencing errors).
- If discrepancies occur, inspect the chromatogram at the relevant positions.
- If a hit is found for the desired sequence, check that the sequence is in the right position, and that the flanking sequences are correct.
- Be aware that alignment may produce suboptimal results (indicating a worse fit than is actually the case), especially when aligning to circular sequences.
- If the chromatogram yields no sequence, note/report this as "no usable data".
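The quick check mentioned in the procedure above (searching the read for a short portion of the expected sequence while tolerating a few mismatches) can be sketched as a sliding-window comparison. The function name and the example sequences are hypothetical; the sketch only counts substitutions (uncalled "N" bases count as mismatches), so indels and reverse-complement hits are not found.

```python
def find_approximate(read, probe, max_mismatches=2):
    """Return (position, mismatches) for every window of the read that
    matches the probe with at most max_mismatches substitutions."""
    read, probe = read.upper(), probe.upper()
    hits = []
    for start in range(len(read) - len(probe) + 1):
        window = read[start:start + len(probe)]
        mismatches = sum(1 for a, b in zip(window, probe) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


read = "AAACGTNGCATTT"   # made-up read containing one uncalled base
probe = "CGTAGCA"        # made-up portion of the expected sequence
print(find_approximate(read, probe, max_mismatches=1))
```

Any position reported with one or more mismatches is a candidate for inspection in the chromatogram.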
Three main "concerns" may appear:
- Base differs from expected.
- Base is uncalled ("n")
- Indel/Gap
In all cases inspecting the chromatogram may resolve the issue. Automatically generated sequences should be considered a best guess by the computer.
Chromatogram interpretation:
http://peter.unmack.net/molecular/data/chromatogram.editing.html
http://www.sci.sdsu.edu/dnacore/InterpretData.html
http://cancer-seqbase.uchicago.edu/traces.html
http://seqcore.brcf.med.umich.edu/doc/dnaseq/interpret.html
Common causes of bad data from Sanger sequencing:
- Salt/alcohol/other contamination
- GC-rich or palindromic regions.
- Double priming
- Suppression of signal after a strong signal: Happens most commonly for G's after A's, and often for G's after C's. Most often, weak G signals follow after multiple A's.
Common causes of mis-called bases:
- Unevenly spaced peaks in the chromatogram may lead the program to insert a non-existing, ambiguous base ("n"). Some sequencing machines (http://seqcore.brcf.med.umich.edu/doc/dnaseq/interpret.html) have been known to give excess spacing between the peaks in "GA".
- In the beginning portion of the sequence (~first 50 bases), two bases are often called as one (http://peter.unmack.net/molecular/data/chromatogram.editing.html).
Template preparation:
http://www.sci.sdsu.edu/dnacore/tempprep.html
Misc
Blueseq online sequencing guide: http://www.blueseq.com/
http://lycofs01.lycoming.edu/~gcat-seek/
Bitesize bio NGS channel: http://nxseq.bitesizebio.com/
http://nxseq.bitesizebio.com/articles/a-short-history-of-sequencing-part-2-the-first-of-the-next/
Genome in a bottle consortium: http://genomeinabottle.org/
SEQanswers: http://seqanswers.com/
SEQanswers wiki: http://seqanswers.com/wiki/SEQanswers
SEQanswers - how to: http://seqanswers.com/wiki/How-to
Genome Reference Consortium: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
List of NGS blogs: http://seqanswers.com/forums/showthread.php?t=5024
NGS Necropolis: http://blueseq.com/knowledgebank/ngs-necropolis/
Rob Carlson's blog: http://synthesis.cc/
http://titojankowski.com/the-500000-dna-sequencer-tear-down/
Raw data
Guides and instructional material
PRACTICAL: Genome sequencing of Bacteroides isolates. http://www.nematodes.org/teaching/gg3/index.shtml
ANGUS: http://ged.msu.edu/angus/
See also http://ivory.idyll.org/blog/ngs-course-with-aws.html
MSU NGS analysis workshop 2012: http://ged.msu.edu/angus/tutorials-2012/index.html
http://ged.msu.edu/angus/tutorials-2012/files/lecture3-mapping.pptx.pdf
MSU NGS analysis workshop 2013: http://ged.msu.edu/angus/tutorials-2013/index.html
Homolog.us tutorials: http://www.homolog.us/Tutorials/index.php?p=1.1&s=1
http://ged.msu.edu/angus/tutorials-2013/files/rayan-2013-june-18-msu.pdf
NGS WikiBook: http://en.wikibooks.org/wiki/Next_Generation_Sequencing_%28NGS%29
GCAT SEEK: http://lycofs01.lycoming.edu/~gcat-seek/index.html
Mason lab NGS workshop: http://chagall.med.cornell.edu/NGScourse/
Results
Quality control, error sources and error detection
http://www.citeulike.org/user/cisevol/tag/sequencing_error
Churchill & Waterman 1991. The Accuracy of DNA Sequences: Estimating Sequence Quality: http://www.cmb.usc.edu/papers/msw_papers/msw-107.pdf
Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation: http://nar.oxfordjournals.org/content/early/2013/01/08/nar.gks1443.full.pdf?keytype=ref&ijkey=suYBLqdsrc7kH7G
10 rules of thumb in genomics: http://genomeinformatician.blogspot.co.uk/2011/07/10-rules-of-thumb-in-genomics.html
Bioplanet GCAT: http://www.bioplanet.com/gcat
QUAST: http://bioinf.spbau.ru/quast
HtSeq-Qa: http://www-huber.embl.de/users/anders/HTSeq/doc/qa.html
Economy and costs
http://www.genome.gov/sequencingcosts/
Read simulation
http://sourceforge.net/projects/readsim/?source=directory