Harvard:Biophysics 101/2007/Notebook:Michael Wang/2007-2-20

From OpenWetWare
Revision as of 02:19, 22 February 2007 by Mdwang (talk | contribs)

For anyone still trying to get clustalw working on a PC after reading the link here, the key seems to be making sure that clustalw works from the command line first. Even if you set it up properly, any problem in the actual call will give you the same error as if you hadn't set it up at all, because the only thing Python cares about is whether or not the output file was created.
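A quick way to separate "clustalw isn't installed properly" from "my call is wrong" is to check both things independently. This is only a rough sketch, written for Python 3 with the standard library rather than the Biopython wrapper used further down. The helper names `find_clustalw` and `run_and_check` are made up for illustration, and the executable names `clustalw` and `clustalw2` are assumptions about what the program is called on your machine.

```python
import os
import shutil
import subprocess


def find_clustalw(candidates=("clustalw", "clustalw2")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def run_and_check(command, expected_output):
    """Run a command (list of arguments) and report whether the output file now exists."""
    if os.path.exists(expected_output):
        os.remove(expected_output)  # don't be fooled by a stale file from an earlier run
    subprocess.call(command)
    return os.path.isfile(expected_output)


if __name__ == "__main__":
    exe = find_clustalw()
    if exe is None:
        print("clustalw was not found on PATH; fix the installation first")
    else:
        print("found clustalw at " + exe)
```

If `find_clustalw()` returns a path but `run_and_check` still returns False for the exact command you would type at the prompt, the installation is fine and the problem is in the arguments of the call.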

The current version of my code is not very intelligent on the analysis side. It currently sucks up all the FASTA files in the ./input folder of the current directory and compiles them into a single file (all.fasta). This file is then passed to clustalw for alignment.

Features I'm still working on implementing

  1. Use regular expressions to identify ORFs in all sequences.
     a. Will be implemented by capturing sequences between AUG and UAA, UGA, or UAG.
  2. Using the ORFs of the reference sequence as a, well...reference...translate the sequences and find relevant mutations.
     a. Call standard_translate to translate the sequences.
     b. Align the translated sequences and look for differences in the translated product.
  3. Detect different types of mutations, hopefully all implemented using regex.
     a. Mismatch: in the alignment's match line, search for *, blank, *.
     b. Insertion/deletion: a string of blanks.
        i. Telling the two apart will depend on identifying a gap in the original sequence by comparing it to the reference.
     c. Frameshift: a string of blanks that extends to the end of the ORF in the protein alignment.
     d. Silent: the translated ORF remains untouched.
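The first item in the list above could be sketched roughly as follows. This is not the planned implementation, just one possible regex, written for Python 3. The function name `find_orfs` is hypothetical, and since the test files are DNA, the pattern uses ATG/TAA/TGA/TAG and converts any U to T before searching. It expects ungapped sequences, so alignment dashes would have to be stripped first.

```python
import re

# Lazy match: a start codon, then whole codons, up to the first in-frame stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TGA|TAG)")


def find_orfs(dna):
    """Return (start, end, sequence) for each non-overlapping ORF, scanning left to right."""
    dna = dna.upper().replace("U", "T")  # accept RNA-style input as well
    return [(m.start(), m.end(), m.group()) for m in ORF_PATTERN.finditer(dna)]


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```

Because `finditer` returns non-overlapping matches, an ORF nested inside another one (or in a different reading frame of the same stretch) is skipped. Whether that matters depends on how the reference ORFs end up being used.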

I hope to get this implemented sometime on the 22nd, as I'm still trying to figure out the intricacies of regex in Python and the Seq class...

#!/usr/bin/env python

import os
from Bio import Clustalw

#This first section of code merges all fasta files located in the input folder of curdir
#into a single file called all.fasta
input_dir = os.path.join(os.curdir, 'input')
input_list = os.listdir(input_dir)
print input_list
merged_path = os.path.join(os.curdir, 'all.fasta')
merged_file = open(merged_path, "w")
print merged_path
for i in input_list:
        #os.path.join adds the right separator, so no hard-coded backslash is needed
        current_path = os.path.join(input_dir, i)
        print "loading ", current_path
        current_file = open(current_path, "r")
        all_lines = current_file.readlines()
        merged_file.writelines(all_lines)
        current_file.close()
        #blank lines keep the last sequence of one file apart from the next header
        merged_file.write("\n\n")
print "done making file"
merged_file.close()

#Once the merged file has been created, it is passed into the alignment program
cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'all.fasta'))
cline.set_output('test.aln')
alignment = Clustalw.do_alignment(cline)
all_records = alignment.get_all_seqs()

print alignment 

I have yet to write code to count, say, how many frameshift mutations there are. For now it just prints the raw alignment.
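As a rough idea of what the counting could look like, here is a sketch in Python 3 that compares one aligned row against the reference row and groups the differences into runs with a regex. It works on plain strings rather than on the Biopython alignment object, and the function name `classify_columns` and the one-letter column codes are made up for illustration. Frameshift and silent calls are left out because they need the protein alignment.

```python
import re
from collections import Counter

KIND_BY_CODE = {"X": "mismatch", "D": "deletion", "I": "insertion"}


def classify_columns(reference, other):
    """Compare two aligned rows; return a list of (kind, start, length) events."""
    codes = []
    for r, o in zip(reference, other):
        if r == o:
            codes.append("=")  # identical column (a '*' in the match line)
        elif o == "-":
            codes.append("D")  # gap in the other sequence: deletion relative to reference
        elif r == "-":
            codes.append("I")  # gap in the reference: insertion relative to reference
        else:
            codes.append("X")  # two different bases
    events = []
    for m in re.finditer(r"X+|D+|I+", "".join(codes)):
        events.append((KIND_BY_CODE[m.group()[0]], m.start(), len(m.group())))
    return events


if __name__ == "__main__":
    events = classify_columns("GGAGCAAGCGG", "GGAG--GGCGG")
    print(events)  # [('deletion', 4, 2), ('mismatch', 6, 1)]
    print(Counter(kind for kind, start, length in events))
```

The example rows are a short slice of the alignment output below, where the second sequence has a two-base gap followed by a changed base.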

Using the test files apoemod.fasta and Copy of apoe.fasta, the following output is generated.

loading  .\input\apoe.fasta
loading  .\input\Copy of apoe.fasta
done making file
CLUSTAL X (1.81) multiple sequence alignment


gi|178350|gb|K00296.1|HUMAPOE3      CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|189350|gb|K10296.1|HUMAPOE3      CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178850|gb|K00396.1|HUMAPOE3      CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178843|gb|K06396.1|HUMAPOE3      CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|189350|gb|K10296.1|HUMAPOE3      AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178850|gb|K00396.1|HUMAPOE3      AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178843|gb|K06396.1|HUMAPOE3      AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|189350|gb|K10296.1|HUMAPOE3      CAGGATGCCAGGCCAAGGTGGAG--GGCGGTGGAGACAGAGCCGGAGCCC
gi|178850|gb|K00396.1|HUMAPOE3      CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|178843|gb|K06396.1|HUMAPOE3      CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
                                    ***********************   ************************

gi|178350|gb|K00296.1|HUMAPOE3      GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|189350|gb|K10296.1|HUMAPOE3      GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178850|gb|K00396.1|HUMAPOE3      GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178843|gb|K06396.1|HUMAPOE3      GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|189350|gb|K10296.1|HUMAPOE3      ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|178850|gb|K00396.1|HUMAPOE3      ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
gi|178843|gb|K06396.1|HUMAPOE3      ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
                                    *********************  ***************************

gi|178350|gb|K00296.1|HUMAPOE3      GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|189350|gb|K10296.1|HUMAPOE3      GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178850|gb|K00396.1|HUMAPOE3      GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178843|gb|K06396.1|HUMAPOE3      GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|189350|gb|K10296.1|HUMAPOE3      CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178850|gb|K00396.1|HUMAPOE3      CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178843|gb|K06396.1|HUMAPOE3      CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|189350|gb|K10296.1|HUMAPOE3      GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178850|gb|K00396.1|HUMAPOE3      GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178843|gb|K06396.1|HUMAPOE3      GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|189350|gb|K10296.1|HUMAPOE3      GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178850|gb|K00396.1|HUMAPOE3      GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178843|gb|K06396.1|HUMAPOE3      GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|189350|gb|K10296.1|HUMAPOE3      GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178850|gb|K00396.1|HUMAPOE3      GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178843|gb|K06396.1|HUMAPOE3      GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|189350|gb|K10296.1|HUMAPOE3      AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178850|gb|K00396.1|HUMAPOE3      AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178843|gb|K06396.1|HUMAPOE3      AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|189350|gb|K10296.1|HUMAPOE3      TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178850|gb|K00396.1|HUMAPOE3      TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178843|gb|K06396.1|HUMAPOE3      TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|189350|gb|K10296.1|HUMAPOE3      ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178850|gb|K00396.1|HUMAPOE3      ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178843|gb|K06396.1|HUMAPOE3      ACCAT-------CCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
                                    ****        **************************************

gi|178350|gb|K00296.1|HUMAPOE3      GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|189350|gb|K10296.1|HUMAPOE3      GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178850|gb|K00396.1|HUMAPOE3      GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178843|gb|K06396.1|HUMAPOE3      GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|189350|gb|K10296.1|HUMAPOE3      GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178850|gb|K00396.1|HUMAPOE3      GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178843|gb|K06396.1|HUMAPOE3      GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|189350|gb|K10296.1|HUMAPOE3      AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178850|gb|K00396.1|HUMAPOE3      AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178843|gb|K06396.1|HUMAPOE3      AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|189350|gb|K10296.1|HUMAPOE3      CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178850|gb|K00396.1|HUMAPOE3      CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178843|gb|K06396.1|HUMAPOE3      CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|189350|gb|K10296.1|HUMAPOE3      GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178850|gb|K00396.1|HUMAPOE3      GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178843|gb|K06396.1|HUMAPOE3      GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|189350|gb|K10296.1|HUMAPOE3      AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178850|gb|K00396.1|HUMAPOE3      AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178843|gb|K06396.1|HUMAPOE3      AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|189350|gb|K10296.1|HUMAPOE3      CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178850|gb|K00396.1|HUMAPOE3      CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178843|gb|K06396.1|HUMAPOE3      CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|189350|gb|K10296.1|HUMAPOE3      CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178850|gb|K00396.1|HUMAPOE3      CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178843|gb|K06396.1|HUMAPOE3      CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|189350|gb|K10296.1|HUMAPOE3      CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178850|gb|K00396.1|HUMAPOE3      CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178843|gb|K06396.1|HUMAPOE3      CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|189350|gb|K10296.1|HUMAPOE3      CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178850|gb|K00396.1|HUMAPOE3      CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178843|gb|K06396.1|HUMAPOE3      CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
                                    **************************************************

gi|178350|gb|K00296.1|HUMAPOE3      TTTCACGT
gi|189350|gb|K10296.1|HUMAPOE3      TTTCACGT
gi|178850|gb|K00396.1|HUMAPOE3      TTTCACGC
gi|178843|gb|K06396.1|HUMAPOE3      TTTCACGC
                                    *******

Each of the two files contains two sequences (I made fake changes to each). Yes...I realize that this is not particularly interesting.