Talk:Harvard:Biophysics 101/2007/02/13: Difference between revisions
From OpenWetWare
Jump to navigationJump to search
(additional info on extracting alignment stuff) |
No edit summary |
||
Line 81: | Line 81: | ||
==More on extracting alignment info== | ==More on extracting alignment info== | ||
One thing I (Xiaodi) have found to be helpful is to programmatically break down the problem into comparing each sequence to the reference itemwise, and using the returned, aligned data to find insertions and deletions. It allows for a more intuitive, I think, manner of handling the problem than aligning things vertically across more than two sequences and making determinations that way. Don't know if others will see it that way, but the code in my notebook works on this premise. | One thing I (Xiaodi) have found to be helpful is to programmatically break down the problem into comparing each sequence to the reference itemwise, and using the returned, aligned data to find insertions and deletions. It allows for a more intuitive, I think, manner of handling the problem than aligning things vertically across more than two sequences and making determinations that way. Don't know if others will see it that way, but the code in my notebook works on this premise. | ||
==Silent and Non-Silent Mutations due to Wobble Position Point Mutations== | |||
* I ([[User:TChan|Tiff]]) thought it'd be interesting to be able to tell when point mutations at the third "wobble" position in a codon resulted - or did not result, thus making the point mutation a '''silent''' one - in a change in the amino acid. | |||
** Sadly, at this time, my computer is dead and I had to use a comp lab computer to code this. Since the comp lab computer did not want to give me administrator access to allow changes to the path variables for ClustalW to work, and I for some reason couldn't figure out how to work it by importing a path, I just coded this without actually being able to check to see if it would run. Sorry :(. I'll change it once I get my own computer up and running again. | |||
** Also, sorry for the non-conciseness of this program, I will revise when my comp decides to work again. | |||
<pre> | |||
#!/usr/bin/env python | |||
import os | |||
from Bio import Clustalw | |||
import sys | |||
sys.path.append("/nfs/home/c/h/chan2/") | |||
cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'apoe.fasta')) | |||
cline.set_output('test.aln') | |||
alignment = Clustalw.do_alignment(cline) | |||
summary_align = AlignInfo.SummaryInfo(alignment) | |||
consensus = summary_align.dumb_consensus() | |||
my_pssm = summary_align.pos_specific_score_matrix(consensus, | |||
chars_to_ignore = ['N']) | |||
num_of_seqs = my_pssm[0]['A'] + my_pssm[0]['C'] + my_pssm[0]['G'] + my_pssm[0]['T'] + my_pssm[0]['N'] | |||
codon_holder = '' | |||
non_silent_mut = 0 | |||
silen_mut = 0 | |||
for i in range(len(consensus)): | |||
codon_holder.join(consensus[i]) | |||
# if on a wobble-position | |||
if pseudo_codon %3 == 0: | |||
# if any of the sequences does not match the consensus sequence at that position (ie. if there is a mutation in any of the sequences) | |||
if my_pssm[i]['A'] < num_of_seqs or my_pssm[i]['C'] < num_of_seqs or my_pssm[i]['G'] < num_of_seqs or my_pssm[i]['T'] < num_of_seqs: | |||
pseudo_codon_seq = Seq(pseudo_codon) | |||
pseudo_codon_prot = translate(pseudo_codon_seq) | |||
# print the number of occurences of As, Cs, Gs, and Ts at the wobble position | |||
print '%4d -> %s: %d, %s: %d, %s: %d, %s: %d' % (i, 'A', my_pssm[i]['A'], 'C', my_pssm[i]['C'], 'G', my_pssm[i]['G'], 'T', my_pssm[i]['T']) | |||
# The following code translates the pseudo-codon into the amino acid it'd code for if another base had been in the wobble position | |||
pseudo_codon_minus = pseudo_codon[0:1] | |||
pseudo_codon_A = pseudo_codon_minus + 'A' | |||
pseudo_codon_Aseq = Seq(pseudo_codon_A) | |||
pseudo_codon_Aprot = translate(pseudo_codon_Aseq) | |||
if pseudo_codon == pseudo_codon_A and my_pssm[i]['A'] != 0: | |||
silent_mut += 1 | |||
elif pseudo_codon != pseudo_codon_A and my_pssm[i]['A'] != 0: | |||
non_silent_mut +=1 | |||
pseudo_codon_C = pseudo_codon_minus + 'C' | |||
pseudo_codon_Cseq = Seq(pseudo_codon_C) | |||
pseudo_codon_Cprot = translate(pseudo_codon_Cseq) | |||
if pseudo_codon == pseudo_codon_C and my_pssm[i]['C'] != 0: | |||
silent_mut += 1 | |||
elif pseudo_codon != pseudo_codon_C and my_pssm[i]['C'] != 0: | |||
non_silent_mut +=1 | |||
pseudo_codon_G = pseudo_codon_minus + 'G' | |||
pseudo_codon_Gseq = Seq(pseudo_codon_G) | |||
pseudo_codon_Gprot = translate(pseudo_codon_Gseq) | |||
if pseudo_codon == pseudo_codon_G and my_pssm[i]['G'] != 0: | |||
silent_mut += 1 | |||
elif pseudo_codon != pseudo_codon_G and my_pssm[i]['G'] != 0: | |||
non_silent_mut +=1 | |||
pseudo_codon_T = pseudo_codon_minus + 'T' | |||
pseudo_codon_Tseq = Seq(pseudo_codon_T) | |||
pseudo_codon_Tprot = translate(pseudo_codon_Tseq) | |||
if pseudo_codon == pseudo_codon_T and my_pssm[i]['T'] != 0: | |||
silent_mut += 1 | |||
elif pseudo_codon != pseudo_codon_T and my_pssm[i]['T'] != 0: | |||
non_silent_mut +=1 | |||
# Directly under the As, Cs, Gs, or Ts at the wobble position, print the amino acid that results from the point mutation | |||
print '%4s -> %s: %s, %s: %s, %s: %s, %s: %s'(pseudo_codon_prot, 'A', pseudo_codon_Aprot, 'C', pseudo_codon_Cprot, 'G', pseudo_codon_Gprot, | |||
'T', pseudo_codon_Tprot) | |||
codon_holder = '' | |||
print 'Total silent mutations due to wobble position mutations: ', silent_mut | |||
print 'Total non-silent mutations at wobble position mutations: ', non_silent_mut | |||
</pre> |
Revision as of 00:27, 20 February 2007
Sequence-level differences to consider
- Point mutations
- silent/synonymous
- missense
- nonsense
- translational stability (re: codon bias)?
- Deletions (and insertions)
- frame shift
- downstream effects?
- Add more here...
- Coding vs nonCoding seq
- We should be able to determine AUG and stop codons...
- Copy-number variants
- This can be viewed as a type of insertion or deletion.
General coding questions / ideas
- How should we handle file input?
- How should we format the output?
- Generation of a set of test inputs.
- How can the identification of a class of differences be identified with a set of implementable conditions (ie how do you seperate out a duplication or an insertion from a rearrangement.
Using clustalw for sequence alignment
- clustalw can be used for performing multiple sequence alignments
- BioPython provides a wrapper for clustalw through the Bio.Clustalw package. However, clustalw needs to be downloaded and installed before it is accessible using your biopython scripts.
- As Zsun noticed, the BioPython cookbook has some outdated example code for accessing clustalw. An online Python Course in Bioinformatics by Katja Schuerer and Catherine Letondal has several better examples with correct code.
- Once you have clustalw installed and accessible via python, download an example apoe.fasta and try running the following code.
#!/usr/bin/env python import os from Bio import Clustalw cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'apoe.fasta')) cline.set_output('test.aln') alignment = Clustalw.do_alignment(cline)
- This should give you an output file, test.aln, which you should inspect.
- Try modifying apoe.fasta, and upload new versions as you improve it to include all of the test cases enumerated above.
- Note from Kay: If you are creating/modifying your fasta file in Python, you must close the file before attempting to pass it into Clustalw. Otherwise, it won't load properly.
Extracting information from an alignment
- Look at Bio/Align/Generic.py for ideas about what you can do with an alignment
- Once you have obtained an alignment, one way to parse it is to look at each column and check for differences
- Here is some initial code to give you some ideas.
#!/usr/bin/env python import os from Bio import Clustalw from Bio.Align import AlignInfo from sets import Set cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'apoe.fasta')) cline.set_output('test.aln') alignment = Clustalw.do_alignment(cline) for i in range(alignment.get_alignment_length()): col = alignment.get_column(i) s = Set() # create a new set for c in range(len(col)): s.add(col[c]) # add each column element to the set if len(s) > 1: # multiple elements in s indicate a mismatch print i, col # Here is another way to locate sequence differences # it is still necessary to look at columns to determine type diffs = [] stars = alignment._star_info for i in range(len(stars)): if not stars[i] == '*': diffs.append(i) print diffs
- You may elaborate on this code to start handling specific cases of sequence mismatches.
- Try to think about how differences in the alignment correspond to the cases enumerated above.
- One thing I (katie) noticed you can do is generate a consensus sequence pretty easily. Click here and scroll down to find out how.
More on extracting alignment info
One thing I (Xiaodi) have found to be helpful is to programmatically break down the problem into comparing each sequence to the reference itemwise, and using the returned, aligned data to find insertions and deletions. It allows for a more intuitive, I think, manner of handling the problem than aligning things vertically across more than two sequences and making determinations that way. Don't know if others will see it that way, but the code in my notebook works on this premise.
Silent and Non-Silent Mutations due to Wobble Position Point Mutations
- I (Tiff) thought it'd be interesting to be able to tell when point mutations at the third "wobble" position in a codon resulted - or did not result, thus making the point mutation a silent one - in a change in the amino acid.
- Sadly, at this time, my computer is dead and I had to use a comp lab computer to code this. Since the comp lab computer did not want to give me administrator access to allow changes to the path variables for ClustalW to work, and I for some reason couldn't figure out how to work it by importing a path, I just coded this without actually being able to check to see if it would run. Sorry :(. I'll change it once I get my own computer up and running again.
- Also, sorry for the non-conciseness of this program, I will revise when my comp decides to work again.
#!/usr/bin/env python import os from Bio import Clustalw import sys sys.path.append("/nfs/home/c/h/chan2/") cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'apoe.fasta')) cline.set_output('test.aln') alignment = Clustalw.do_alignment(cline) summary_align = AlignInfo.SummaryInfo(alignment) consensus = summary_align.dumb_consensus() my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['N']) num_of_seqs = my_pssm[0]['A'] + my_pssm[0]['C'] + my_pssm[0]['G'] + my_pssm[0]['T'] + my_pssm[0]['N'] codon_holder = '' non_silent_mut = 0 silen_mut = 0 for i in range(len(consensus)): codon_holder.join(consensus[i]) # if on a wobble-position if pseudo_codon %3 == 0: # if any of the sequences does not match the consensus sequence at that position (ie. if there is a mutation in any of the sequences) if my_pssm[i]['A'] < num_of_seqs or my_pssm[i]['C'] < num_of_seqs or my_pssm[i]['G'] < num_of_seqs or my_pssm[i]['T'] < num_of_seqs: pseudo_codon_seq = Seq(pseudo_codon) pseudo_codon_prot = translate(pseudo_codon_seq) # print the number of occurences of As, Cs, Gs, and Ts at the wobble position print '%4d -> %s: %d, %s: %d, %s: %d, %s: %d' % (i, 'A', my_pssm[i]['A'], 'C', my_pssm[i]['C'], 'G', my_pssm[i]['G'], 'T', my_pssm[i]['T']) # The following code translates the pseudo-codon into the amino acid it'd code for if another base had been in the wobble position pseudo_codon_minus = pseudo_codon[0:1] pseudo_codon_A = pseudo_codon_minus + 'A' pseudo_codon_Aseq = Seq(pseudo_codon_A) pseudo_codon_Aprot = translate(pseudo_codon_Aseq) if pseudo_codon == pseudo_codon_A and my_pssm[i]['A'] != 0: silent_mut += 1 elif pseudo_codon != pseudo_codon_A and my_pssm[i]['A'] != 0: non_silent_mut +=1 pseudo_codon_C = pseudo_codon_minus + 'C' pseudo_codon_Cseq = Seq(pseudo_codon_C) pseudo_codon_Cprot = translate(pseudo_codon_Cseq) if pseudo_codon == pseudo_codon_C and my_pssm[i]['C'] != 0: silent_mut += 1 elif pseudo_codon != pseudo_codon_C and my_pssm[i]['C'] != 0: non_silent_mut +=1 pseudo_codon_G = pseudo_codon_minus + 'G' pseudo_codon_Gseq = Seq(pseudo_codon_G) pseudo_codon_Gprot = translate(pseudo_codon_Gseq) if pseudo_codon == pseudo_codon_G and my_pssm[i]['G'] != 0: silent_mut += 1 elif pseudo_codon != pseudo_codon_G and my_pssm[i]['G'] != 0: non_silent_mut +=1 pseudo_codon_T = pseudo_codon_minus + 'T' pseudo_codon_Tseq = Seq(pseudo_codon_T) pseudo_codon_Tprot = translate(pseudo_codon_Tseq) if pseudo_codon == pseudo_codon_T and my_pssm[i]['T'] != 0: silent_mut += 1 elif pseudo_codon != pseudo_codon_T and my_pssm[i]['T'] != 0: non_silent_mut +=1 # Directly under the As, Cs, Gs, or Ts at the wobble position, print the amino acid that results from the point mutation print '%4s -> %s: %s, %s: %s, %s: %s, %s: %s'(pseudo_codon_prot, 'A', pseudo_codon_Aprot, 'C', pseudo_codon_Cprot, 'G', pseudo_codon_Gprot, 'T', pseudo_codon_Tprot) codon_holder = '' print 'Total silent mutations due to wobble position mutations: ', silent_mut print 'Total non-silent mutations at wobble position mutations: ', non_silent_mut