Harvard:Biophysics 101/2007/Notebook:Michael Wang/2007-2-13
For anyone still trying to get clustalw working on a PC after reading the link here, the key seems to be making sure that clustalw works from the command line. Even if you set it up properly, any problem in the actual call gives the same error as if you hadn't set it up at all. The only thing Python cares about is whether or not the output file was created.
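One way to check this before involving Biopython is to look for the executable on the PATH and, after running a command, test for the output file yourself. The sketch below is written for Python 3 (the script in this entry is Python 2). The helper names and the executable name `clustalw` are my assumptions, so adjust them to your install.

```python
import os
import shutil
import subprocess


def find_executable(name="clustalw"):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


def run_and_check(command, expected_output):
    """Run `command` (a list of arguments) and report whether
    `expected_output` exists afterwards."""
    try:
        subprocess.run(command, check=False)
    except OSError:
        # The executable itself could not be started.
        return False
    return os.path.isfile(expected_output)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("clustalw was not found on the PATH")
    else:
        print("clustalw found at", path)
```

If `find_executable()` returns None, the problem is the setup rather than the call. If it returns a path but `run_and_check` is False, the problem is probably in the arguments.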
The current version of my code is not very intelligent on the analysis side. It currently sucks up all the files in the ./input folder of the current directory and compiles them into a single file, all.fasta. This file is passed into clustalw for alignment.
#!/usr/bin/env python
import os
from Bio import Clustalw

#This first section of code merges all fasta files located in the input folder of curdir
#into a single file called all.fasta
input_list = os.listdir(os.path.join(os.curdir, 'input'))
print input_list
merged_path = os.path.join(os.curdir, 'all.fasta')
merged_file = open(merged_path, "w")
print merged_path
for i in input_list:
    #let os.path.join supply the separator instead of hard-coding a backslash
    current_path = os.path.join(os.curdir, 'input', i)
    print "loading ", current_path
    current_file = open(current_path, "r")
    all_lines = current_file.readlines()
    merged_file.writelines(all_lines)
    current_file.close()
    #blank lines keep the last record of one file from running into the next
    merged_file.write("\n\n")
print "done making file"
merged_file.close()

#Once the merged file has been created, it is passed into the alignment program
cline = Clustalw.MultipleAlignCL(merged_path)
cline.set_output('test.aln')
alignment = Clustalw.do_alignment(cline)
all_records = alignment.get_all_seqs()
print alignment
I have yet to write code to count, say, how many frameshift mutations there are. For now it just prints the raw alignment.
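As a starting point for that kind of count, here is a minimal sketch (Python 3, not part of the script above). It assumes each aligned sequence is available as one string with `-` for gaps, and it treats any gap run whose length is not a multiple of three as a possible frameshift. That is a rough heuristic: it ignores where the coding region starts and whether the gap is really an insertion in the other sequences. The function names are mine.

```python
import re


def gap_runs(aligned_seq):
    """Return (start, length) for every run of '-' in an aligned sequence."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", aligned_seq)]


def count_possible_frameshifts(aligned_seq):
    """Count gap runs whose length is not a multiple of three."""
    return sum(1 for _, length in gap_runs(aligned_seq) if length % 3 != 0)


if __name__ == "__main__":
    # Two gapped rows copied from the alignment output in this entry.
    row_with_short_gap = "CAGGATGCCAGGCCAAGGTGGAG--GGCGGTGGAGACAGAGCCGGAGCCC"
    row_with_long_gap = "ACCAT-------CCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC"
    print(gap_runs(row_with_short_gap))
    print(count_possible_frameshifts(row_with_long_gap))
```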
Using the uploaded test files Media:apoe.fasta and Media:Copy of apoe.fasta, the following output is generated.
loading .\input\apoe.fasta
loading .\input\Copy of apoe.fasta
done making file
CLUSTAL X (1.81) multiple sequence alignment

gi|178350|gb|K00296.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|189350|gb|K10296.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178850|gb|K00396.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
gi|178843|gb|K06396.1|HUMAPOE3 CGCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGACTGGCCAATCAC
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|189350|gb|K10296.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178850|gb|K00396.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
gi|178843|gb|K06396.1|HUMAPOE3 AGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATTCCTGG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|189350|gb|K10296.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAG--GGCGGTGGAGACAGAGCCGGAGCCC
gi|178850|gb|K00396.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
gi|178843|gb|K06396.1|HUMAPOE3 CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC
                               ***********************   ************************

gi|178350|gb|K00296.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|189350|gb|K10296.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178850|gb|K00396.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
gi|178843|gb|K06396.1|HUMAPOE3 GAGCTGCGCCAGCAGACCGAGTGGCAGAGCGGCCAGCGCTGGGAACTGGC
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|189350|gb|K10296.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTAATCCTGCGCTGGGTGCAGACACTGTCTGA
gi|178850|gb|K00396.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
gi|178843|gb|K06396.1|HUMAPOE3 ACTGGGTCGCTTTTGGGATTA--CCTGCGCTGGGTGCAGACACTGTCTGA
                               *********************  ***************************

gi|178350|gb|K00296.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|189350|gb|K10296.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178850|gb|K00396.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
gi|178843|gb|K06396.1|HUMAPOE3 GCAGGTGCAGGAGGAGCTGCTCAGCTCCCAGGTCACCCAGGAACTGAGGG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|189350|gb|K10296.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178850|gb|K00396.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
gi|178843|gb|K06396.1|HUMAPOE3 CGCTGATGGACGAGACCATGAAGGAGTTGAAGGCCTACAAATCGGAACTG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|189350|gb|K10296.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178850|gb|K00396.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
gi|178843|gb|K06396.1|HUMAPOE3 GAGGAACAACTGACCCCGGTGGCGGAGGAGACGCGGGCACGGCTGTCCAA
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|189350|gb|K10296.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178850|gb|K00396.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
gi|178843|gb|K06396.1|HUMAPOE3 GGAGCTGCAGGCGGCGCAGGCCCGGCTGGGCGCGGACATGGAGGACGTGT
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|189350|gb|K10296.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178850|gb|K00396.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
gi|178843|gb|K06396.1|HUMAPOE3 GCGGCCGCCTGGTGCAGTACCGCGGCGAGGTGCAGGCCATGCTCGGCCAG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|189350|gb|K10296.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178850|gb|K00396.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
gi|178843|gb|K06396.1|HUMAPOE3 AGCACCGAGGAGCTGCGGGTGCGCCTCGCCTCCCACCTGCGCAAGCTGCG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|189350|gb|K10296.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178850|gb|K00396.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
gi|178843|gb|K06396.1|HUMAPOE3 TAAGCGGCTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGT
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|189350|gb|K10296.1|HUMAPOE3 ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178850|gb|K00396.1|HUMAPOE3 ACCAGGCCGGGGCCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
gi|178843|gb|K06396.1|HUMAPOE3 ACCAT-------CCCGCGAGGGCGCCGAGCGCGGCCTCAGCGCCATCCGC
                               ****        **************************************

gi|178350|gb|K00296.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|189350|gb|K10296.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178850|gb|K00396.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
gi|178843|gb|K06396.1|HUMAPOE3 GAGCGCCTGGGGCCCCTGGTGGAACAGGGCCGCGTGCGGGCCGCCACTGT
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|189350|gb|K10296.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178850|gb|K00396.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
gi|178843|gb|K06396.1|HUMAPOE3 GGGCTCCCTGGCCGGCCAGCCGCTACAGGAGCGGGCCCAGGCCTGGGGCG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|189350|gb|K10296.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178850|gb|K00396.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
gi|178843|gb|K06396.1|HUMAPOE3 AGCGGCTGCGCGCGCGGATGGAGGAGATGGGCAGCCGGACCCGCGACCGC
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|189350|gb|K10296.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178850|gb|K00396.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
gi|178843|gb|K06396.1|HUMAPOE3 CTGGACGAGGTGAAGGAGCAGGTGGCGGAGGTGCGCGCCAAGCTGGAGGA
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|189350|gb|K10296.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178850|gb|K00396.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
gi|178843|gb|K06396.1|HUMAPOE3 GCAGGCCCAGCAGATACGCCTGCAGGCCGAGGCCTTCCAGGCCCGCCTCA
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|189350|gb|K10296.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178850|gb|K00396.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
gi|178843|gb|K06396.1|HUMAPOE3 AGAGCTGGTTCGAGCCCCTGGTGGAAGACATGCAGCGCCAGTGGGCCGGG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|189350|gb|K10296.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178850|gb|K00396.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
gi|178843|gb|K06396.1|HUMAPOE3 CTGGTGGAGAAGGTGCAGGCTGCCGTGGGCACCAGCGCCGCCCCTGTGCC
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|189350|gb|K10296.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178850|gb|K00396.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
gi|178843|gb|K06396.1|HUMAPOE3 CAGCGACAATCACTGAACGCCGAAGCCTGCAGCCATGCGACCCCACGCCA
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|189350|gb|K10296.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178850|gb|K00396.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
gi|178843|gb|K06396.1|HUMAPOE3 CCCCGTGCCTCCTGCCTCCGCGCAGCCTGCAGCGGGAGACCCTGTCCCCG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|189350|gb|K10296.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178850|gb|K00396.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
gi|178843|gb|K06396.1|HUMAPOE3 CCCCAGCCGTCCTCCTGGGGTGGACCCTAGTTTAATAAAGATTCACCAAG
                               **************************************************

gi|178350|gb|K00296.1|HUMAPOE3 TTTCACGT
gi|189350|gb|K10296.1|HUMAPOE3 TTTCACGT
gi|178850|gb|K00396.1|HUMAPOE3 TTTCACGC
gi|178843|gb|K06396.1|HUMAPOE3 TTTCACGC
                               *******
Each of the two files contains two sequences (I made fake changes to each).
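The row of asterisks under each block marks the columns where all four sequences agree, which is how the fake changes show up as blank spots. As a rough illustration (Python 3, with a function name of my own choosing), the same line can be recomputed from the rows of one block. This sketch only handles the `*` marker seen in this nucleotide output and treats any gap as a non-match.

```python
def conservation_line(rows):
    """Return '*' for columns where every row has the same non-gap
    character and ' ' for all other columns."""
    if len(set(len(row) for row in rows)) > 1:
        raise ValueError("all rows must have the same length")
    marks = []
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    # The third block of the alignment output in this entry.
    block = [
        "CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC",
        "CAGGATGCCAGGCCAAGGTGGAG--GGCGGTGGAGACAGAGCCGGAGCCC",
        "CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC",
        "CAGGATGCCAGGCCAAGGTGGAGCAAGCGGTGGAGACAGAGCCGGAGCCC",
    ]
    print(conservation_line(block))
```

Counting the blanks in each such line gives a quick tally of how many columns differ in a block.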