Revision as of 21:15, 7 October 2010

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.

THE WILKE LAB

Home Contact People Research Publications Materials

The wonders of Biopython

Virtually all of this information comes from the excellent but tome-like Biopython Tutorial. Consult it to make your wishes come true.

Parsing Genbank Files

Biopython is an amazing resource if you don't feel like figuring out how to parse a bunch of different idiosyncratic sequence formats (fasta,fastq,genbank, etc).

Here I focus on parsing Genbank files; SeqIO can be used to parse a bunch of different formats, but the structure of the parsed data will vary.

Let's say you want to go through every gene in an annotated genome and pull out all the genes with some specific characteristic (say, we have no idea what they do). This problem is pretty easy once you know how to use Biopython's data structures.

For this example I will be using the E.coli K12 genome, which clocks in at around 13 mbytes.

To begin, we need to load the parser and parse the genbank file. It should only take a couple seconds

from Bio import SeqIO
genome=SeqIO.read('CP000948.gbk','genbank') #you MUST tell SeqIO what format is being read

Use SeqIO.read if there is only one genome (or sequence) in the file, and SeqIO.parse if there are multiple sequences. Since we're using genbank files, there typically (I think) only be a single giant sequence of the genome.

There are a bunch of data objects associated to the parsed file. The main one of interest will be the features object, which is a list of all the annotated features in the genome file.

>>>genome.features[:10] #prints a short description of the first ten features
[SeqFeature(FeatureLocation(ExactPosition(0),ExactPosition(4686137)), type='source', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(189),ExactPosition(255)), type='gene', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(189),ExactPosition(255)), type='CDS', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(2799)), type='gene', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(2799)), type='CDS', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(2790)), type='misc_feature', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(1224)), type='misc_feature', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(348),ExactPosition(1134)), type='misc_feature', location_operator='order', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(348),ExactPosition(1035)), type='misc_feature', location_operator='order', strand=1),
 SeqFeature(FeatureLocation(ExactPosition(447),ExactPosition(717)), type='misc_feature', location_operator='order', strand=1)]

Features themselves have lots of information as well. Annotation data is contained in a dictionary called qualifiers; typical information will be 'product' (for genes), 'gene' (name) , and 'note' for misc. crap.

ie say you want info on the 10th feature:

feature=genome.features[10]
annotation=feature.qualifiers

To select only coding sequences (type='CDS'), all you need to do is

for feature in genome.features:
    if feature.type=='CDS':
        DO STUFF

A simple example for selecting specific types of genes.

predicted_genes=[]
for i,feature in enumerate(genome.features):
    if feature.type=='CDS':
        if 'product' in feature.qualifiers: #verify it codes for a real protein (not pseudogene)
            tot_genes+=1
            #data in qualifiers are all lists, even if only 1 string, so [0] converts to string
            #use lower() to make sure weirdly capitilized strings get selected as well
            product=feature.qualifiers['product'][0].lower()

            if product=='predicted protein':
                num_predicted+=1
                predicted_genes.append(feature)

Finally, grabbing the sequence associated with a feature is now pretty easy. You previously had to do extra work if the gene was on the opposite strand.

 
seq=feature.extract(genome.seq)
protein=seq.translate() #can also use optional switches like cds=True to raise errors if not a proper coding sequence (lacks start and stop codons)

Pitfalls encountered thus far

As of Biopython ???, feature.extract(genome.seq) incorporates strandedness. Thus, older version of Biopython or sequence slices obtained other than the extract function will give garbled information.

With a little extra work you can use the location information associated with each feature to see what to do. Biopython sometimes seems to be designed to emulate a Russian nesting doll, so there are objects within objects that you need to mess with for this part.

sequence=feature.extract(genome.seq) # simple if you just want the gene sequence; works on antisense strand genes
#things get trickier if you want information outside of gene
location=feature.location
#position gives the actual python slice location of the start and end. need to reverse if on negative strand
#could use to get upstream or downstream regions
start=location.start.position
end=location.end.position

Grabbing genomes from Genbank

You can use Biopython's Entrez module to grab individual genomes. You MUST provide your email so Entrez can email you if you start overloading their servers before they block you. The id used can be pretty much any identifier, such as the acession, the accession version, the genbank id, etc.

Before you go hog wild, read the rules!

Here they are, reproduced:

   * Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests.
   * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not the standard NCBI Web address.
   * Make no more than 3 requests every 1 second.
   * Use the URL parameter email, and tool for distributed software, so that we can track your project and contact you if there is a problem.
   * NCBI's Disclaimer and Copyright notice must be evident to users of your service.

The big one is the first one. Biopython by default complies with rules 2,3 and 4.

The id used can be pretty much any identifier, such as the acession, the accession version, the genbank id, etc. From the eFetch documentation : Record Identifier Current values:

   * NCBI sequence number (GI)
   * accession
   * accession.version
   * fasta
   * GeneID
   * genome ID
   * seqid

from Bio import Entrez
Entrez.email = 'whatev@mail.com'
handle=Entrez.efetch(db='genome',id='NC_009925',rettype='gb') # Accession id works, returns genbank format, looks in the 'genome' database
#store locally
local_file=open('NC_009925','w')
local_file.write(handle.read())
handle.close()
local_file.close()

Wilke:Parsing Genbank files with Biopython: Difference between revisions

Revision as of 21:15, 7 October 2010

Contents

The wonders of Biopython

Parsing Genbank Files

Pitfalls encountered thus far

Grabbing genomes from Genbank

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools