Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-2-27
From OpenWetWare
Assignment Due 2/27
Code and Embedded Comments
#!/usr/bin/env python # PART 1: import the GenBank tools necessary to complete this analysis # To import the GenBank and Clustal Modules import os from Bio import GenBank, Seq, Clustalw from Bio.Align import AlignInfo from sets import Set # To parse sequence data from the GenBank Entry seq_parser = GenBank.FeatureParser() # To interface to Genbank ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = seq_parser) # PART 2: load our ApoE as a reference string for sequence comparison # Download ApoE's record parsed_ref = ncbi_dict['178850'] # Extract the sequence and save as a string ref = parsed_ref.seq.tostring() print "ApoE has been loaded as the reference gene" # PART 3: Input other sequences for comparison # How many sequences to import? x = int(raw_input("How many sequences would you like to upload for comparison? ")) # Now import them into a list... comparison_seqs = [] for i in range(0,x): g = int(raw_input("Please enter the GenBank ID for a sequence of interest ")) entry = ncbi_dict[str(g)] entry_seq = entry.seq.tostring() comparison_seqs.append(entry_seq) ##################################### # As I revise this code, I think I may want to rethink conversion of GenBank ID to string # to fasta file. As this website shows (http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html) # I can probably go straight from GenBank to Fasta. I could probably write some algorithm # that takes this list of GenBank IDs and processes them into Fasta files. With my existing code, # the only revision I see necessary is the extraction of the ApoE record into string form, rather than # a Fasta file. # PART 4: Align the Sequences # As I continue to revise this script, I will need to change the following line to tolerate new inputs cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'Apoe.fasta')) cline.set_output('test.aln') alignment = Clustalw.do_alignment(cline) # PART 5: Analyze and Print the results # Determine insertions, deletions, and point mutations in the DNA Sequence for i in range(alignment.get_alignment_length()): column = alignment.get_column(i) colset = Set() for j in range(len(column)): colset.add(column[j]) card = len(colset) if card != 1: if '-' in colset: if '-' in column[0]: print "Insertion at basepair ", i, column else: print "Deletion at basepair ", i, column else: print "Point Mutation at basepair ", i, column # I'm not quite sure how I would design a code that would identify an insertion. An insertion means that you # enter a relationship where one strand is lagging behind another one, but there is still sequence conservation # between the two sequences. I'm pretty sure that, in the case of insertion, our reference strand would show a # deletion compared to the sequence of interest. So, to this end, if '-' were the first character inserted into # our set, we would know that we'd have an insertion. # Also, I have not had the opportunity to learn translation quite yet, so the issue of translational consequences # is not addressed in my code. I will plan to tackle this at the tutorial on Wednesday. The other issue that # remains unresolved is the establishment of a reading frame and length. This issue as well will be resolved # Wednesday.