Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-2-27

From OpenWetWare
Revision as of 21:42, 26 February 2007 by ChristopherNabel (talk | contribs) (update assignment)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

Assignment Due 2/27

Code and Embedded Comments

#!/usr/bin/env python



# PART 1: import the GenBank tools necessary to complete this analysis

# To import the GenBank and Clustal Modules
import os
from Bio import GenBank, Seq, Clustalw
from Bio.Align import AlignInfo
from sets import Set


# To parse sequence data from the GenBank Entry
seq_parser = GenBank.FeatureParser()

# To interface to Genbank
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = seq_parser)



# PART 2: load our ApoE as a reference string for sequence comparison

# Download ApoE's record
parsed_ref = ncbi_dict['178850']

# Extract the sequence and save as a string
ref = parsed_ref.seq.tostring()
print "ApoE has been loaded as the reference gene"



# PART 3: Input other sequences for comparison
# How many sequences to import?
x = int(raw_input("How many sequences would you like to upload for comparison?  "))
# Now import them into a list...
comparison_seqs = []
for i in range(0,x):
    g = int(raw_input("Please enter the GenBank ID for a sequence of interest  "))
    entry = ncbi_dict[str(g)]
    entry_seq = entry.seq.tostring()
    comparison_seqs.append(entry_seq)

#####################################
# As I revise this code, I think I may want to rethink conversion of GenBank ID to string
# to fasta file.  As this website shows (http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html)
# I can probably go straight from GenBank to Fasta.  I could probably write some algorithm
# that takes this list of GenBank IDs and processes them into Fasta files.  With my existing code,
# the only revision I see necessary is the extraction of the ApoE record into string form, rather than
# a Fasta file.


                 

# PART 4: Align the Sequences
# As I continue to revise this script, I will need to change the following line to tolerate new inputs
cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'Apoe.fasta'))
cline.set_output('test.aln')
alignment = Clustalw.do_alignment(cline)



# PART 5: Analyze and Print the results
# Determine insertions, deletions, and point mutations in the DNA Sequence
for i in range(alignment.get_alignment_length()):
        column = alignment.get_column(i)
        colset = Set()
        for j in range(len(column)):
                colset.add(column[j])
                card = len(colset)
                if card != 1:
                        if '-' in colset:
                            if '-' in column[0]:
                                print "Insertion at basepair ", i, column
                            else:
                                print "Deletion at basepair ", i, column
                        else:
                                print "Point Mutation at basepair ", i, column


# I'm not quite sure how I would design a code that would identify an insertion.  An insertion means that you
# enter a relationship where one strand is lagging behind another one, but there is still sequence conservation
# between the two sequences.  I'm pretty sure that, in the case of insertion, our reference strand would show a
# deletion compared to the sequence of interest.  So, to this end, if '-' were the first character inserted into
# our set, we would know that we'd have an insertion.  

# Also, I have not had the opportunity to learn translation quite yet, so the issue of translational consequences
# is not addressed in my code.  I will plan to tackle this at the tutorial on Wednesday.  The other issue that
# remains unresolved is the establishment of a reading frame and length.  This issue as well will be resolved
# Wednesday.