Harvard:Biophysics 101/2007/Notebook:CChi/2007-2-27

From OpenWetWare
Jump to navigationJump to search

Assignments due 2/20 and 2/27

This script compares an input sequences to a reference sequence and reports:

ORFs

  • gives start and end index in terms of reference sequence
  • does not give nested ORFs

Deletions and Insertions

  • index and nucleotide of deletion or insertion
  • whether deletion or insertion caused a frameshift

Mutations

  • index and nucleotide of mutation
  • amino acids involved
  • whether mutation was silent or not


Problems I ran into (and how I may have attempted to solve them):

  • insertions make the reference sequence different in the .aln output. This causes trouble when comparing pretty much any codon back to the reference, so I had to create a separate, running consensus sequence to reference.
  • I did not like the idea of a nested ORF, so I didn't use them.
  • indexes given for deletions, insertions, and mutations represent the index of the input sequence, not the representative index from the reference sequence.
  • have not yet addressed the specifics of the amino acid changes (hydro philic/phobic, etc.)


Code

#!/usr/bin/env python

import os
from Bio import Clustalw, GenBank, Seq
from Bio.Align import AlignInfo
from sets import Set
from Bio.Seq import Seq,translate

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'Apoe2.fasta'))
cline.set_output('test.aln') 
alignment = Clustalw.do_alignment(cline)

deletions = []  # list of deletion indices
insertions = [] # list of insertion indices
mutations = []  # list of mutation indices
mutcodon = ''   # temp codon with mutation of interest
mutconsAA = []  # consensus amino acid where mutation of interest occurred
mutAA = []      # amino acid after mutation
ORFstarts = []  # list of ORF start sites
ORFends = []    # list of ORF end sites
ORF = 0         # boolean for whether we are currently in an ORF
mut = 0         # boolean for whether the current codon contains a mutation
codon = ''      # temp codon variable
consensus = []  # consensus sequence without the pain of insertion '-'s
nucwithindex = []# nested list within consensus list [index, nucleotide]

for i in range(alignment.get_alignment_length()):
    col = alignment.get_column(i)
    s = Set() # create a new set
    for c in range(len(col)):
        s.add(col[c]) # add each column element to the set
    if col[0] != '-': # add to consensus whenever there's no insertion
        nucwithindex = [i, col[0]]
        consensus.append(nucwithindex) 
    if len(s) > 1: # multiple elements in s indicate a mismatch
        if col[1] == '-': # deletions
            deletions.append(i)
        else:
            if col[0] == '-': # insertions
                insertions.append(i)
            else: # mutations
                mutations.append(i)
                if ORF == 1 and mut != 1:
                # if there are 2 mutations in one codon, we only need to consider 1 codon unit
                    mut = 1
                else:
                # append dummy '-' to amino acids to avoid later confusion with indices
                    mutAA.append('-')
                    mutconsAA.append('-')
        
    # When not in ORF, we scan triplets for ATG, when in ORF, we look at codons
    if ORF == 0 and i>1 and col[0] != '-': # look for ORF's
        # attempt to look beyond insertions at the consensus
        if len(consensus) >=2:
            codon = consensus[len(consensus)-3][1] + consensus[len(consensus)-2][1] + col[0]
        if codon=='ATG':
            ORFstarts.append(consensus[len(consensus)-3][0])
            ORF = 1
            codon = ''
            print "ORF start", consensus[len(consensus)-3][0]
    else: # ORF's
        if len(codon) != 3:
            if col[0] != '-':
                codon = codon + col[0]        
        else:
            if mut == 1: # add consensus and mutated amino acids to list
                mutcodon = alignment.get_column(i-1)[1] + alignment.get_column(i-2)[1] + alignment.get_column(i-1)[1]
                mutAA.append(translate(mutcodon))
                mutconsAA.append(translate(codon))
                mut = 0
            if codon=='TAG' or codon=='TGA' or codon=='TAA': # end ORF
                ORFends.append(consensus[len(consensus)-3][0])
                ORF = 0
                print "ORF end", consensus[len(consensus)-3][0]
            codon = ''
    
inORF = 0 # boolean for whether it's in ORF to help with printing

print "\nDELETIONS"
for i in range(len(deletions)):
    col = alignment.get_column(deletions[i])
    for j in range(len(ORFends)): # check whether the deletion was in an ORF
        if deletions[i] <= ORFends[j] and deletions[i] >= ORFstarts[j]:
            print "Index: ", deletions[i], col[0], "Frameshift!"
            inORF = 1
    if inORF == 0:
        print "Index: ", deletions[i], col[0], "deletion in intron"
    inORF = 0

print "\nINSERTIONS"
for i in range(len(insertions)):
    col = alignment.get_column(insertions[i])
    for j in range(len(ORFends)): # check whether the insertion was in an ORF
        if insertions[i] <= ORFends[j] and insertions[i] >= ORFstarts[j]:
            print "Index: ", insertions[i], col[1], "Frameshift!"
            inORF = 1
    if inORF == 0:
        print "Index: ", insertions[i], col[1], "insertion in intron"
    inORF = 0
    
print "\nMUTATIONS"
for i in range(len(mutations)):
    col = alignment.get_column(mutations[i])
    if mutAA[i] == '-':
        print "Index: ", mutations[i], col[0], "to", col[1]
    else:
        if mutAA[i] == mutconsAA[i]: # check whether mutation was silent
            print "Index: ", mutations[i], col[0], "to", col[1], "silent mutation, amino acid remains", mutAA[i]
        else:
            print "Index: ", mutations[i], col[0], "to", col[1], "changed amino acid", mutconsAA[i], "to", mutAA[i]

Output

ORF start 61
ORF end 1014
ORF start 1034

DELETIONS
Index:  17 A deletion in intron
Index:  236 A Frameshift!
Index:  1108 C deletion in intron

INSERTIONS
Index:  34 T insertion in intron
Index:  380 G Frameshift!
Index:  1093 A insertion in intron

MUTATIONS
Index:  534 A to C
Index:  535 C to A changed amino acid H to T
Index:  537 T to C
Index:  538 G to T changed amino acid C to L
Index:  549 G to A changed amino acid V to I
Index:  660 G to A changed amino acid W to G
Index:  797 C to T silent mutation, amino acid remains R
Index:  804 A to C changed amino acid W to G
Index:  1077 T to A changed amino acid C to R