Harvard:Biophysics 101/2007/Notebook:Resmi Charalel/2007-2-20

From OpenWetWare

Jump to: navigation, search

Script

 #!/usr/bin/env python
 
 import os
 from Bio import GenBank, Seq
 from Bio.Seq import Seq,translate
 from Bio import Clustalw
 from Bio.Clustalw import MultipleAlignCL
 from Bio.Align import AlignInfo
 from sets import Set
 
 mut = []
 gap =[]
 cold=[]
 colm=[]
 
 cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'apoe.fasta'))
 cline.set_output('test.aln')
 alignment = Clustalw.do_alignment(cline)
 
 summary_align = AlignInfo.SummaryInfo(alignment)
 
 for i in range(alignment.get_alignment_length()):
     col = alignment.get_column(i)
     s = Set() # create a new set
     for c in range(len(col)):
         s.add(col[c]) # add each column element to the set
     if '-' in s:
         gap.append(i)
         cold.append(col)
         ss=s.copy()
         ss.remove('-')
         if len(ss)>1:
             mut.append(i)
             colm.append(col)
     elif len(s)>1:
         mut.append(i)
         colm.append(col)
     
 for i in range(len(gap)):
     print 'Deletion', cold[i], 'at', gap[i]
 for i in range(len(mut)):
     print 'Mismatch', colm[i], 'at', mut[i]
 
 numseq=[]
 orfs=[]
 proteins=[]
 seq=[]
 numseq=alignment.get_all_seqs()
 
 record_parser = GenBank.FeatureParser()
 
 ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = record_parser)  
 
 parsed_record = ncbi_dict['178850']
 s = parsed_record.seq.tostring()
 t=0
 
 for i in range(len(numseq)):
     se=numseq[i]
     seq.append(se.seq.tostring())
     start = seq[i].find('ATG')
     orf = 
     c=start
 
     for x in range(len(s)-start-4):
         orf = orf + s[c]
         c= c +1
         length = c-start
         remainder=length%3
         if remainder == 0:
             codon=s[c]+s[c+1]+s[c+2]
             if codon== 'TAA' or codon=='TAG' or codon=='TGA':
                 orf=orf+s[c+1]+s[c+2]
                 break
         orfs.append(orf)
         proteins.append(translate(orfs[t]))
         t=t+1
 
 sil=0
 cha=0
 
 for p in range((len(proteins))-1):
     if proteins[p+1]==proteins[0]:
         sil=sil+1
     else:
         cha=cha+1
         
 print 'There are', sil, 'amino acids that are contain either no mutations or only silent mutations.'
 print 'There are', cha, 'nonsilent mutations that lead to amino acid changes.'

Output

 Deletion A-C at 802
 Deletion C-C at 803
 Deletion G-G at 804
 Deletion A-A at 805
 Deletion G-G at 806
 Mismatch GGA at 658
 Mismatch A-C at 802
 There are 5 amino acids that are contain either no mutations or only silent mutations.
 There are 2844 nonsilent mutations that lead to amino acid changes.

Notes

  • There were also a few errors in the second half of my code which compares protein sequences, but I was not sure exactly what was causing these errors and would love any insight as to what went wrong. Thanks!
  • Just so you know, I updated my program so there are no errors when it runs now, but there may be a few nonintuitive parts to it that could be improved. Please let me know!
Personal tools