Revision as of 05:02, 20 September 2014

Before we start

Teacher

Darek Kedra (darek dot kedra at gmail dot com)

I work at CRG (Centre for Genomic Regulation) in Barcelona, Spain, in Cedric Notredame (author of t-coffee ligner) group. Our work place looks like this: CRG image

My current research topics include RNASeq, genome assembly and annotation, SNP calling, lncRNA discovery. As for the parasites I am working with sequences from L.donovani, L.major and T.brucei.

Organization etc.

lets turn out /silence the cell phones
some exposure to command line interface / Unix is assumed (cut and paste of commands will be OK)
in case of Unix command line problems check the page from the other course [here]
this page will stay online after the course "forever", and we will improve it over time (old version will be accessible through History link on the top)
you are welcome to register as a wiki user and comment in the Talk page/fix any errors/ ask for enhancements
in the boxed parts of the page lines starting with # are comments, and what below are examples of commands. After typing equivalents of these in your terminal, press Enter

Why Next Generation Sequencing

Typical uses cases of NGS data include:

de novo genome assembly
mapping reads to a known genome
1. whole genome random reads (DNA)
2. targeted resequencing: regions selected by long range PCR/hybridisation
3. exome mapping: DNA fragments are preselected using microarrays/beads with attached exonic sequences, so only fragments hybridising to them are selected
4. RNASeq: just RNA, but specific protocols may target just 5- or 3-prime ends of transcripts or miRNA with small size selection. RNASeq is the application in which stranded libraries are used, i.e. to differentiate between sense and antisense transcripts
5. ChIP-Seq: chromatin immunoprecipitation combined with sequencing of DNA fragments bound by histones/transcription factors.
6. bisulfite sequencing (non-methylated cytosines converted to uracil, read as Ts on the non-methylated sequences)
metagenomics: sequencing populations of microorganisms to investigate the diversity of species/strains

Whole Genome Sequencing

Tips before you start

if possible, use haploid genome (see bee genome project where they used haploid drones)
next best thing: highly inbred, nearly homozygous lines

Sometimes you have unexpected treasures: There is probably (almost)-homozygous cow: http://en.wikipedia.org/wiki/Chillingham_cattle

there are cases where extra-chromosomal DNA (40-70 chloroplasts per plant cell are often in 150kb range, with 40-70 copies per organelle) contributes non-trivial portion of total DNA. Select tissues/stages with less multiple copy DNAs

Technologies to consider

Simple bacteria can be sequenced using just Illumina data (still preferably paired 2x100bp or more).
If you sequence anything in the range of 30kb (fungi /kinetoplasts/etc.) often containing repeats exceeding length of reads, are unlikely to give you satisfactory assembly (we tried it with 2x100bp at >700x coverage => shattered genome, unusable for annotation
the likely solution is to use much larger read lengths (even 454 @400bp can help, but longer sequences are much better)
1. PacBio is a current leader, giving you bulk of your reads in range 3-4kbs, good enough for smaller genome
2. Moleculo from Illumina (not sure about availability in your region) is able to get high 10kb fragments using modification of Illumina technology
on the low cost front, one can:
1. optimize preparation of Illumina sequencing protocol targeting i.e. low GC genome (Plasmodium falciparum, 25%GC)
2. use overlapping ends long reads Illumina library (i.e 2x 100bp with 180bp inserts).
3. since the longest to date illumina reads are from low throughput MiSeq machine:
  1. protocol V3, 2 × 300 bp ~55 hrs 13.2-15 Gb
it is possible to greatly improve scaffolding (which contigs comes after another and in which orientation) using lower coverage 2k, 5k 10k or sequencing fosmid ends
one can try to reduce complexity of the assembly by subselecting parts of the genome
1. use flow sorted chromosomes as input DNA
2. sequence pools of fosmid clones, likely delivered from different parts of the genome

Bioinformatics of genome assembly

The results of Asemblathon2, an attempt to evaluate performance of number of assemblers and teams were rather unsatisfactory. There is no clear winner, performing best on all 3 species used (fish, parrot and a snake), and the programs seem to make different choices: N50 vs accuracy of the assembly. You can not have both at this time.

This reflects rather poor quality of multiple sequenced genomes, from 30Mbp Leishmania to vertebrates. And this is often made worse by over-enthusiastic annotation (the "hypothetical unlikely" genes, retrotransposons, long ORFs in ribosomal DNA.

The idea is to perform multiple assemblies using different k-mer lengths, programs, error correction programs and then evaluate what makes most sense. Keep in mind that simply getting a longer assembly with longer N50 make lead to removing some coding parts, like was case of "improvement" of highly polymorphic sea squirt Ciona intestinalis genome.

A good shot from biology point of view, is determining to which level the set of extremely conserved genes is present in the different assemblies. This was pioneered by Ian Korf with program called CEGMA using ca 400 supposedly conserved in all Eukaryotes. Probably the better approach would be to create lineage specific conserved genes (i.e. genes having clear orthologues in all sequenced fungi) .

There is expansion of tasks performed by "assemblers", doing often their own read filtering, sometimes error correction, and scaffolding. Since these may expected intact FASTQ files as input, there is no one clear path what you should do with your sequences prior to assembly.

Interesting approach is to normalize the k-mer frequencies present in your reads prior to assembly, as proposed by Titus Brown with his khmer program: http://khmer.readthedocs.org/en/v1.1/

The existing assemblers often fail for regions showing either very high or very low coverage. By making the k-mer numbers more uniform seems to improve some assemblies.

According to http://bioinformatics.oxfordjournals.org/content/early/2014/07/08/bioinformatics.btu391.full the recommended approach is to use:

ABySS assembler with Illumina PE data
ALLPATHS-LG or SPAdes assembler with Illumina PE-MP

Just check the exact library requirements of these assemblers (ALLPATHS requires overlapping ends l, plus long inserts libraries).

NGS file formats overview

There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:

text based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG)
binary (BAM, BCF, SFF(454 sequencer data))

In principle, we can view and manipulate text based formats without special tools, but we will need these to access and view binary formats. To make things a bit more complicated, the text-based format are often compressed to save space making them de facto binary, but still easy to read by eye using standard Unix tools. Also despite that one can read values in several columns and from tens of rows, we still need dedicated programs to make sense of millions of rows and i.e. encoded columns.

On the top of these data/results files, some programs require that for faster access we need a companion file (often called index). See i.e.

FASTA (.fa & .fai)
BAM and BAI formats (suffixes .bam & .bai),
VCF (.vcf & .vcf.idx).

The important thing for all files containing positions on chromosomes (mappings, features) is their numbering. To make things complicated, we have formats starting with 1, or with 0.

1 based formats: GFF/GFT, SAM/BAM, WIG,
0 BED, CN (copy number data, not covered)

Programs automatically deduce the correct numbering scheme, but whenever you are converting from one format to another watch for this "feature".

More info on file formats used by IGV: http://www.broadinstitute.org/igv/?q=book/export/html/16

Fasta (just few tricks)

#count number of sequences in multiple fasta
grep -c "^>" multi_fasta.fa

#list sequence names in fasta, display screen by screen with "q" to escape:
grep "^>" | less

While most likely mature NGS programs will handle complex FASTA sequence names correctly, you may sometimes have problems with various scripts. To fix it and just keep the first, hopefully unique single word sequence name:

#!/usr/bin/env python
"""
name: fix_fasta_header.py
usage: ./fix_fasta_header.py input_fasta.fa > output_fasta.fa

CAVEAT: it does not check if the resulting sequence names are unique
"""
import sys

fn = sys.argv[1]

for line in open(fn).readlines():
	if line[0] == ">":
		line = line.split()[0]
		print line
	else:
		print line,

For reformatting sequences so that all sequences will have 70 columns you can use this program from a package called exonerate from: https://github.com/nathanweeks/exonerate

fastareformat input.fa > output.fa

For extracting sequences from your FASTA file you can use pyfasta library (or samtools, see below):

#single contig/chromosome
pyfasta extract --header --fasta Lmex_genome.sort.fa LmxM.01 > LmxM.01_single.fa

#list of sequences in the desired order
#input seqids.txt file (just two lines with sequence names):
LmxM.01
LmxM.00

#commad to extract
pyfasta extract --header --fasta Lmex_genome.sort.fa --file seqids.txt  > LmxM.two_contigs.fa

We can obtain fasta index (which contains contig sizes):

#index
samtools faidx Lmex_genome.sort.fa

#once we created .fai file for a given .fa file, we can extract a sequence(s):
#whole single chromosome:
samtools faidx Lmex_genome.sort.fa LmxM.01 > LmxM.01_single.fa

#some region

>LmxM.01:10000-20000 with the coorrect fasta sequence name (>LmxM.01:10000-20000)
samtools  faidx Lmex_genome.sort.fa LmxM.01:10000-20000 > LmxM.01_10000-20000.fa 

#Above faidx tip thanks to: XY Student (name to be found)

#get the chromosomal sizes file:
awk '{print $1"\t"$1} ' Lmex_genome.sort.fa.fai > Lmex_genome.sort.sizes

FASTQ

Format and quality encoding checks

Already in the 90ties when all sequencing was being done using Sanger method, the big breakthrough in genome assembly was when individual bases in the reads (ACTG) were assigned some quality values. In short, some parts of sequences had multiple bases with a lower probability of being called right. So it makes sense that matches between high quality bases are given a higher score, be it during assembly or mapping that i.e. end of the reads with multiple doubtful / unreliable calls. This concept was borrowed by Next Generation Sequencing. While we can hardly read by eye the individual bases in some flowgrams, it is still possible for the Illumina/454/etc. software to calculate base qualities. The FASTQ format, (usually files have suffixes .fq or .fastq) contains nowadays 4 lines per sequence:

sequence name (should be unique in the file)
sequence string itself with ACTG and N
extra line starting with "+" sign, which contained repeated sequence name in the past
string of quality values (one letter/character per base) where each letter is translated in a number by the downstream programs

Here it is how it looks:

@SRR867768.249999 HWUSI-EAS1696_0025_FC:3:1:2892:17869/1
CAGCAAGTTGATCTCTCACCCAGAGAGAAGTGTTTCATGCTAAGTGGCACTGGTGCAGAACAGTTCTGCAATGG
+
IHIIHDHIIIHIIIIIIHIIIDIIHGGIIIEIIIIIIIIIIIIGGGHIIIHIIIIIIBBIEDGGFHHEIHGIGE

CAVEAT: When planning to do SNP calling using GATK, do not try to modify part of the name describing location on the sequencing lane:

HWUSI-EAS1696_0025_FC:3:1:2892:17869

This part is used for looking for optical replicates.

Unfortunately Solexa/Illumina did not follow the same quality encoding as people doing Sanger sequencing, so there are few iterations of the standard, with quality encodings containing different characters. For the inquisitive: http://en.wikipedia.org/wiki/FASTQ_format#Quality

What we need to remember from it, that we must know which quality encoding we have in our data, because this is an information required by mappers, and getting it wrong will make our mappings either impossible (some mappers may quit when encountering wrong quality value) or at best unreliable.

There are two main quality encodings: Sanger and Two other terms, offset 33 and offset 64 are also being used for describing quality encodings:

offset 33 == Sanger / Illumina 1.9
offset 64 == Illumina 1.3+ to Illumina 1.7

For that, if we do not have direct information from the sequencing facility which version of the Illumina software was used, we can still find it out if we investigate the FASTQ files themselves. Instead of going by eye, we use a program FastQC. For the best results/full report we need to use the whole FASTQ file as an input, but for quick and dirty quality encoding recognition using 100K of reads is enough:

head -400000 my_reads.fastq > 100K_head_my_reads.fastq 
fastqc 100K_head_my_reads.fastq
#we got here 100K_head_my_reads.fastq_fastqc/ directory

grep Encoding 100K_head_my_reads.fastq_fastqc/fastqc_data.txt

#output: 
Encoding	Sanger / Illumina 1.9

CAVEAT: all this works only on unfiltered FASTQ files. Once you remove the lower quality bases/reads containing them, guessing which encoding format is present in your files is problematic.

Here is a bash script containing awk oneliner to detect quality encoding in both gzip-ed and not-compressed FASTQ files.

#!/bin/bash
file=$1

if [[ $file ]]; then
    command="cat"
    if [[ $file =~ .*\.gz ]];then
        command="zcat"
    fi
    command="$command $file | "
fi

command="${command}awk 'BEGIN{for(i=1;i<=256;i++){ord[sprintf(\"%c\",i)]=i;}}NR%4==0{split(\$0,a,\"\");for(i=1;i<=length(a);i++){if(ord[a[i]]<59){print \"Offset 33\";exit 0}if(ord[a[i]]>74){print \"Offs
et 64\";exit 0}}}'"

eval $command

Types of data

read length

from 35bp in some old Illumina reads to 250+ in MiSeq. The current sweet spot is between 70-150bp.

single vs paired

Just one side of the insert sequenced or sequencing is done from both ends. Single ones are cheaper and faster to produce, but paired reads allow for more accurate mapping, detection of large insertions/deletions in the genome.

Most of the time forward and reverse reads facing each other end-to-end are

insert length

With the standard protocol, the inserts are anywhere between 200-500bp. Sometimes especially for de novo sequencing, insert sizes can be smaller (160-180bp) with 100bp long reads allowing for overlap between ends of the reads. This can improve the genome assembly (i.e. when using Allpaths-LG assembler requiring such reads). Also with some mappers (LAST) using longer reads used to give better mappings (covering regions not unique enough for shorter reads) than 2x single end mapping. With paired end mappings the effects are modest.

Program for combining overlapping reads: FLASH: http://ccb.jhu.edu/software/FLASH/ To run it:

flash SRR611712_1.fastq SRR611712_2.fastq -o SRR611712_12.flash

For improving the assembly or improving the detection of larger genome rearrangements there are other libraries with various insert sizes, such as 2.5-3kb or 5kb and more. Often sequencing yields from such libs are lower than from the conventional ones.

stranded vs unstranded (RNASeq only)

We can obtain reads just from a given strand using special Illumina wet lab kits. This is of a great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.

quality checking (FastQC)

It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze average quality, over-represented sequences, number of Ns along the read and many other parameters. The program to use is FastQC, and it can be run in command line or GUI mode.

good quality report:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

bad quality FastQC report

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

trimming & filtering

Depending on the application, we can try to improve the quality of our data set by removing bad quality reads, clipping the last few problematic bases, or search for sequencing artifacts, as Illumina adapters. All this makes much sense for de novo sequencing, were genome assemblies can be improved by data clean up. It has a low priority for mapping, especially when we have high coverage. Bad quality reads etc. will simply be discarded by the mapper.

You can read more about quality trimming for genome assembly in the two blog posts by Nick Loman:

http://pathogenomics.bham.ac.uk/blog/2013/04/adaptor-trim-or-die-experiences-with-nextera-libraries/

Trimmomatic

web: http://www.usadellab.org/cms/index.php?page=trimmomatic

version (Sep 2014): 0.32

From the manual:

Paired End:

java -jar /path/to/trimmomatic-0.32.jar PE --phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters Remove leading low quality or N bases (below quality 3) Remove trailing low quality or N bases (below quality 3) Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 Drop reads below the 36 bases long

Single End:

java -jar /path/to/trimmomatic-0.32.jar SE --phred33 \
input.fq.gz output.fq.gz \
ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the same steps, using the single-ended adapter file

Tagdust (for simple unpaired reads)

Tagdust is a program for removing Illumina adapter sequences from the reads containing them. Such reads containing 6-8 bases not from genome will be impossible to map using typical mappers having often just 2 mismatch base limit. Tagdust works in an unpaired mode, so when using paired reads we have to "mix and match" two outputs to allow for paired mappings.

tagdust -o my_reads.clean.out.fq -a my_reads.artifact.out.fq adapters.fasta my_reads.input.fq

Error correction

For some applications, like de novo genome assembly, one can correct the sequencing errors in the reads by comparing them with other reads with almost identical sequence. One of the programs which do perform this and are relatively easy to install and make it running is Coral.

Coral

web site: http://www.cs.helsinki.fi/u/lmsalmel/coral/ version: 1.4

It requires large RAM machine for correcting individual Illumina files (run it on 96GB RAM).

#Illumina reads
./coral -fq input.fq -o output.fq  -illumina  

#454 reads
./coral -fq input.454.fq -o output.454.fq  -454  

#correcting 454 reads with Illumina reads
Coral can not use more than one input file, therefore one has to combine Illumina & 454 reads into one FASTQ file, noting the number of Illumina reads used. To prepare such file:

cat 10Millions_illumina_reads.fq > input_4_coral.fq
cat some_number_of_454_reads.fq >> input_4_coral.fq

## run coral with the command:
coral -454 -fq  input_4_coral.fq -o  input_4_coral.corrected.fq-i 10000000 -j 10000000

#This will correct just the 454 reads,not the Illumina ones
#In real life you have to count the number of reads in your Illumina FASTQ file, i.e. (assuming you do not have wrapped sequence/qualities FASTQ 1 read=4lines ) :

wc -l illumina_reads.fq | awk '{print $1/4}' 

#if in doubt, use fastqc to get the numbers

Public FASTQ data: Short Read Archive vs ENA

While we will often have our data sequenced in house/provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely we can do something new and useful with already published data, even if treating it as a control to our pipeline. Also doing exactly the same thing, say assembling genes from RNASeq data but with a newer versions of the software and or more data will likely improve on the results of previous studies. There are two main places to get such data sets:

NCBI Short Read Archive / Taxonomy Browser:
- SRA: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj

For both of them it is a good idea to install program Aspera Connect to reduce download times:

http://asperasoft.com/software/transfer-clients/connect-web-browser-plug-in/

Click on Resources, download and install the web plugin.

* go there 
* put Leishmania major RNA
* we get "SRA Experiments	6"	
* click on "6"
* one of them is: "Transcriptome Analysis of Leishmania major strain Friedlin", ERX190352

Taxonomy Browser http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

* go there
* put Leishmania major
* click on the checkbox SRA Experiments
* click on Display button
* we got 58 public experiment, 7of which are RNA
* click on RNA (7) on the left
* note 
"Whole Genome Sequencing of Leishmania major strain Friedlin"
Accession: SRX203187

CAVEAT: not everybody submits the right descriptions of their experiments. In case of doubt, download and map.

Extracting the FASTQ from SRA archive is easy with sra-toolkit installed on your computer You get it here: https://github.com/ncbi/sratoolkit

fastq-dump SRR1016916.sra

#Result:
SRR1016916.fastq

You can also get the multiple SRA files in one step and without 20 clicks from NCBI with configured ASPERA by writing a shell script (one line per file), preferably using a (Python/Perl/Ruby) script and a list of files to get:


ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012154/  ./

ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012155/  ./

Which one to use? ENA may be easier as you get gzipped fastq files directly. But NCBI tools may have better interface at times, so you can search for interesting data set at NCBI, then store the names of experiments and download fastq.gz from ENA.

SAM and BAM file formats

The SAM file format serves to store information about result of mapping of reads to the genome. It starts with a header, describing the format version, sorting order (SO) of the reads, genomic sequences to which the reads were mapped. Numbering starts from base 1, contains header, and it is line oriented (one sequence per line).

The minimal version looks like this:

@HD	VN:1.0	SO:unsorted
@SQ	SN:1	LN:171001
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0

It can contain both mapped and unmapped reads (we are mostly interested in mapped ones). Here is the example:

#unmapped reads:
SRR594632.1	4	*	0	0	*	*	0	0	NTGAGGACAAAGCTCCGCTGCTGCTCGCTTTCCCCT	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SRR594632.2	4	*	0	0	*	*	0	0	NTGAGCGAACTATTGATATTCGGCATAGAACAAAAA	BQKKHNMKOR___T_TTRWWVTVTVVTTOV___QQQ
SRR594632.3	4	*	0	0	*	*	0	0	NTGTTATTCAATCTGAGCATCGTTATCTTGACCATG	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SRR

#mapped reads
SRR594632.20	0	LmxM.20	3139684	25	36M	*	0	0	NTGAAAATAAAGAACATCTGAACATTTCTCTGTAAG	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBXT:A:U	NM:i:2	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:0A0A34
SRR594632.21	16	LmxM.22	312482	25	36M	*	0	0	ATACAAGGCACAAAAATAAAGCAGTCGAGAGAGCAN	BBBBBBBBBBBBBBBB__________RRRRRLLLQBXT:A:U	NM:i:2	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:34T0G0
SRR594632.22	16	LmxM.01	33973	25	36M	*	0	0	AACACTGTGCGTGTGCGCTTGTCTAAGATGAGCCAN	_WTRWWTOTVTTVPTT__________PPLOPJJJKBXT:A:U	NM:i:2	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:34T0A0

In short, it is a complex format, where in each line we have detailed information about the mapped read, its quality mapped position(s), strand, etc. The exact description of it takes (with BAM and examples) 15 pages: http://samtools.sourceforge.net/SAMv1.pdf Use BAMs instead of SAM files (speed of access, size, compatibility with tools.There are multiple tools to process and extract information from SAM and its compressed form, BAM files. The few used on daily/basis:

samtools
Picard
bamtools
bedtools

Typical tasks:

#view the BAM file
samtools view your_bam | less

#view the header of BAM file
samtools view -H your_bam | less

#sort the BAM/SAM file
java -jar /path/to/SortSam.jar I=mapped_reads.unsorted.sam O=mapped_reads.sorted.bam \
SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

#merge (sorted) BAM files:
java -jar /path/to/MergeSamFiles.jar \
I=mapped_reads.set1.bam \
I=mapped_reads.set2.bam \
I=mapped_reads.set3.bam \
O=merged_mapped_reads.sets1_2_3.bam
SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true ASSUME_SORTED=true

#extracting as BAM reads mapped to certain chromosome (LmxM.01):
bamtools filter -in ERR307343_12.Lmex.bwa_mem.Lmex.bwa.sorted.bam  \
-out chr_1_ERR307343_12.Lmex.bwa_mem.Lmex.bam -region LmxM.01

GTF/GFF

The most commonly used sequence annotation format with few flavors.

GTF Fields:

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

Sample GFF output from Ensembl export:

X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.	
X	Ensembl	Variation	2413805	2413805	.	+	.

And from TriTrypDB (gff3)

LmjF.01 TriTrypDB       CDS     12073   12642   .       -       0       ID=cds_LmjF.01.0040-1;Name=cds;description=.;size=570;Parent=rna_LmjF.01.0040-1
LmjF.01 TriTrypDB       exon    12073   12642   .       -       .       ID=exon_LmjF.01.0040-1;Name=exon;description=exon;size=570;Parent=rna_LmjF.01.0040-1
LmjF.01 TriTrypDB       gene    15025   17022   .       -       .       ID=LmjF.01.0050;Name=LmjF.01.0050;description=carboxylase%2C+putative;size=1998;web_id=LmjF.01.0050;locus_tag=LmjF.01.0050;size=1998;Alias=321438056,389592315,LmjF1.0050,LmjF01.0050,LmjF.01.0050,LmjF01.0050:pep,LmjF01.0050:mRNA

The detailed description of GFF3: http://www.sequenceontology.org/gff3.shtml

IGV-centered CAVEAT: GFF files downloaded from various sources often contain more annotations than a single track of IGV can handle. Typical offenders are chromosome/scaffold lines, causing non-visibility of gene annotations. Since the authors are quite inventive in naming their tracks, plus key words like "chromosome" may be in column 9, you can:

#see what is there:
grep -v "^#" TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff | awk '{print $3}' | sort | uniq -c | sort -n 

#result TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff:
     1 random_sequence
     16 rRNA
     34 chromosome
     83 tRNA
    714 transcript
   8250 mRNA
   8336 CDS
   9063 gene
   9149 exon
 908081

As you can see, we can try to remove just 34+1 (chromosome + random_sequence) and check that we will get 908081-35 entries when running it the same command on the result. Just remember that we removed the header, which we may want to keep.

BED

Another text annotation file format from UCSC, compatible with its Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway

Full format description: http://genome.ucsc.edu/FAQ/FAQformat.html#format1

CAVEAT: first base in contig/chromosome is numbered as 0

Essential 3 columns

1: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2: chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3: chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

#just regular genomic intervals for i.e GC calculations 
LmxM.01	0	10000
LmxM.01	10000	20000
LmxM.01	20000	30000
LmxM.01	30000	40000

Often used next 3 columns

4: name - Defines the name of the BED line. I

5: score - A score between 0 and 1000.

6: strand - Defines the strand - either '+' or '-'.

#just regular genomic intervals for i.e GC calculations 
LmxM.01 210     720     Rep1    100     -
LmxM.01 800     1234    ABC|cds 1000    +
LmxM.01 2100    3030    DEF|gen	100	+

Last 6 columns for extended features

7: thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.

8: thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

9: itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line.

10: blockCount - The number of blocks (exons) in the BED line.

11: blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

12: blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

VCF

Stands for Variant Call Format. Text file format for storing information about SNPs, insertions and deletions. It is rather complex, see detailed description: http://www.1000genomes.org/node/101

Obtaining variants listed in such form is a multistep procedure, but well standardised. See GATK section below.

Here are the examples from our data:

#simple SNP
LmxM.01	17218	.	G	A	1818.77	.	AC=2;AF=1.00;AN=2;BaseQRankSum=2.761;DP=55

#insertion
LmxM.01	38177	.	C	CTG	367.73	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.082;DP=34

#deletion
LmxM.01	99318	.	GCACACACA	G	1774.73	.	AC=2;AF=1.00;AN=2;DP=30

For using it with IGV, we have to have VCF index, created by GATK.

WIG

An another line oriented format for storing and displaying numerical information along the genome for intervals. The generic form of WIG:

line defining type of this section, chr_position, start, step and more
number1
number2

The intervals can be regular, as when we compute some measure across the whole genome. This is fixedStep type:

fixedStep chrom=LmxM.01 start=0 step=10000 span=10000
0.572100
0.612700
0.637300
0.628000
0.637200
0.643700
0.642800

Above fixedStep WIG is from GC calculations in a given window in the genome. Procedure described here (we start from "create a N-base wide step file": http://wiki.bits.vib.be/index.php/Create_a_GC_content_track

When the calculated number different from 0 are sparse in our genomic intervals, it is simpler to use variableStep

variableStep	chrom=LmxM.17	span=41
1	0.142857
variableStep	chrom=LmxM.17	span=249
43	0.142857
variableStep	chrom=LmxM.17	span=4
292	0.166667
variableStep	chrom=LmxM.17	span=7
296	0.142857

This variableStep example comes from gem-mapability program from GEM-Mapper.

#index the genome (fix FASTA headers first to one word sequence names)
gem-indexer -i Lmex_genome.fa -o Lmex_genome

#calculate mapability for 100bp fragments
gem-mappability -I Lmex_genome.gem -o Lmex_genome -l 100 

#convert to WIG
gem-2-wig -I Lmex_genome.gem -i Lmex_genome.mappability -o Lmex_genome.mappability

#Result (loadable in IGV):
Lmex_genome.mappability.wig

Wikiomics:EMBO Tunis 1: Difference between revisions