Harvard:Biophysics 101/2007/Notebook:Xiaodi Wu/2007-3-20: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
No edit summary
(reformatted page to make it readable)
Line 1: Line 1:
Hey all, here's a sketch of the algo/pseudocode I propose for this...feel free to flesh it out and modify/edit/discuss
Hey all, here's a sketch of the algo/pseudocode I propose for this...feel free to flesh it out and modify/edit/discuss


1. take sequence: look up on Blast (see http://www.dalkescientific.com/writings/NBN/blast_searching.html); perhaps call this function sequence_lookup(str), returning some sort of object. for now, let's say the object includes best_gene_hit and best_genomic_position_hit, which includes chr (chromosome) and chrpos (position on chromosome), and the p-values for each match (or whatever they call them -- I think it might be called 'expected value')
===Blast original query===
*Take sequence: look up on Blast (see http://www.dalkescientific.com/writings/NBN/blast_searching.html);
*Perhaps call this function sequence_lookup(str), returning some sort of object.
*For now, let's say the object includes best_gene_hit and best_genomic_position_hit, which includes chr (chromosome) and chrpos (position on chromosome), and the p-values for each match (or whatever they call them -- I think it might be called 'expected value')


Mike) Maybe we should call the object a blastcomparison.  It would consist of a list of Match objects which each in turn contain the chromosome, chrpos, adjacent_genes (maybe itself a list) and p-values.  Calling sequence_lookup(str) will then return a blastcomparison.  (Yeah...I'm a C/java guy too...)
:Maybe we should call the object a blastcomparison.  It would consist of a list of Match objects which each in turn contain the chromosome, chrpos, adjacent_genes (maybe itself a list) and p-values.  Calling sequence_lookup(str) will then return a blastcomparison.  (Yeah...I'm a C/java guy too...) --'''[[User:Mdwang|Mike]]'''


2. if there is a gene hit (i.e. if sequence_lookup(str).best_gene_hit exists and has a p-value (expected value) above some threshold, then consider the given sequence to be part of a gene. then:
===Parse blast results===
'''Case 1: Direct gene hits'''
*If there is a gene hit (i.e. if sequence_lookup(str).best_gene_hit exists and has a p-value (expected value) above some threshold, then consider the given sequence to be part of a gene. Then...


2a). translate gene (call this function translate_in_frame(str); I have an algo that goes through all the frames and finds the most likely ORF; works beautifully but a little slowly, but it will do -- we don't have to write this part of the algo), and locate mutations (locate_mutations(str, ref_str), returning a list (what in C would be an array -- I might slip into C lingo every so often so this is what I mean) containing the type of mutation (point mutation, insertion, deletion) in both a.a. and DNA sequence; again, I think we all have an algo for this)
*Translate gene (call this function translate_in_frame(str); I have an algo that goes through all the frames and finds the most likely ORF; works beautifully but a little slowly, but it will do -- we don't have to write this part of the algo), and locate mutations (locate_mutations(str, ref_str), returning a list (what in C would be an array -- I might slip into C lingo every so often so this is what I mean) containing the type of mutation (point mutation, insertion, deletion) in both a.a. and DNA sequence; again, I think we all have an algo for this)


Mike)I have something that searches for orfs and then returns an object orf_list which specifies their positions in the original search string and parses into codons.  In other words, each orf is a list of codon objects which in turn contain the sequence and Aa translation of that codon.  My mutation detection function does pairwise comparisons of strings to identify types of mutations when one is considered a refseq.  It then returns a list of Mutation objects (...I went object crazy...)
:*I have something that searches for orfs and then returns an object orf_list which specifies their positions in the original search string and parses into codons.  In other words, each orf is a list of codon objects which in turn contain the sequence and Aa translation of that codon.  My mutation detection function does pairwise comparisons of strings to identify types of mutations when one is considered a refseq.  It then returns a list of Mutation objects (...I went object crazy...) --'''[[User:Mdwang|Mike]]'''
::*I also have code that can take a sequence compare it against a reference and then return the found mutations (of the type of interest for the March 13th assignment) within the nucleic acid seuence as well as the protein sequence.  The code could also be easily modified to determine whether a given amino acid change is 'significant' (hydrophilic->hydrophobic, etc.) --'''[[User:RCharalel|Resmi]]'''


Resmi) I also have code that can take a sequence compare it against a reference and then return the found mutations (of the type of interest for the March 13th assignment) within the nucleic acid seuence as well as the protein sequence.  The code could also be easily modified to determine whether a given amino acid change is 'significant' (hydrophilic->hydrophobic, etc.)
'''Case 2: non-gene Blast hits'''
*If there is no gene hit (like the example of 13 March, which was non-coding, supposedly), take the best_genomic_position


2b). look up these mutations for the gene on OMIM (call this function omim_gene_search(genbank_id, muts), where muts is a list of mutations from 2a to look for; for genes they are listed in OMIM in the format {amino acid}{position}{amino acid} instead of {nucleotide}{position}{nucleotide}; see http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html for info on how to search all NCBI db's)
*Again, locate mutations, but only in the nucleotide sequence (locate_noncoding_mutations(str, ref_str)) and also maybe do a tblastx (or just blastx) [hrm...is this too much?]
*This is not trivial; I could not find any way to program a link to OMIM, and there are no references online. The best I could find was the ability to download the OMIM database at: ftp://ftp.ncbi.nih.gov/repository/OMIM/ , thereby creating a 50MB text file which can then be parsed (that would be '''painful'''). Otherwise, there is a MySQL script which does the same, but that requires we know MySQL... http://search.cpan.org/~BIRNEY/bioperl-1.2.3/Bio/Phenotype/OMIM/OMIMparser.pm . --[[User:Zsun|Zsun]] 00:00, 18 March 2007 (EDT)
Mike) Yeah....I'm pretty against working with OMIM unless we can get a fast server to run this on and just let it sit...  I still have to look at the eutils but that looks promising (Zach, you said it was already implemented in Biopython?)
*Well, EUtils does look like the most promising path. I'm sorry for not highlighting that earlier, but the link that I pointed to above (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) has a good amount of info on how this works, though I can't say that I've looked at it closely yet --[[User:Wuxiaodi|wuxiaodi]] 03:12, 18 March 2007 (EDT)


3. if there is no gene hit (like the example of 13 March, which was non-coding, supposedly), take the best_genomic_position
*Look up the chromosome position in dbSNP and also this database, potentially: http://projects.tcag.ca/variation/


3a). again, locate mutations, but only in the nucleotide sequence (locate_noncoding_mutations(str, ref_str)) and also maybe do a tblastx (or just blastx) [hrm...is this too much?]
:*This is looking better: At http://lists.open-bio.org/pipermail/biopython/2006-March/002971.html there is a modified biopython script to deal with SNP lookup... directly gives us the code, which is good.--'''[[User:Zsun|Zsun]] 00:06, 18 March 2007 (EDT)'''


3b). look up the chromosome position in dbSNP and also this database, potentially: http://projects.tcag.ca/variation/
*Fnd the IDs of known SNPs and CNVs, compare to what we have about our own sequence, and then search OMIM with this info (call the function omim_noncoding_search, with parameters TBD)
*This is looking better: At http://lists.open-bio.org/pipermail/biopython/2006-March/002971.html there is a modified biopython script to deal with SNP lookup... directly gives us the code, which is good.--[[User:Zsun|Zsun]] 00:06, 18 March 2007 (EDT)
:*I think this would be an easier way to deal with OMIM since presumably if we parse the OMIM database into entries, they will be organized in this fashion → so less searching through entries until we find the specific one we are interested in. --'''[[User:RCharalel|Resmi]]'''
3c). find the IDs of known SNPs and CNVs, compare to what we have about our own sequence, and then search OMIM with this info (call the function omim_noncoding_search, with parameters TBD)
 
*Resmi) I think this would be an easier way to deal with OMIM since presumably if we parse the OMIM database into entries, they will be organized in this fashion -> so less searching through entries until we find the specific one we are interested in.
===Perform OMIM search===
*Look up these mutations for the gene on OMIM (call this function omim_gene_search(genbank_id, muts), where muts is a list of mutations from 2a to look for; for genes they are listed in OMIM in the format {amino acid}{position}{amino acid} instead of {nucleotide}{position}{nucleotide}; see http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html for info on how to search all NCBI db's)
 
:*This is not trivial; I could not find any way to program a link to OMIM, and there are no references online. The best I could find was the ability to download the OMIM database at: ftp://ftp.ncbi.nih.gov/repository/OMIM/ , thereby creating a 50MB text file which can then be parsed (that would be '''painful'''). Otherwise, there is a MySQL script which does the same, but that requires we know MySQL... http://search.cpan.org/~BIRNEY/bioperl-1.2.3/Bio/Phenotype/OMIM/OMIMparser.pm . --'''[[User:Zsun|Zsun]] 00:00, 18 March 2007 (EDT)'''
::*Yeah....I'm pretty against working with OMIM unless we can get a fast server to run this on and just let it sit...  I still have to look at the eutils but that looks promising (Zach, you said it was already implemented in Biopython?)
:::*Well, EUtils does look like the most promising path. I'm sorry for not highlighting that earlier, but the link that I pointed to above (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) has a good amount of info on how this works, though I can't say that I've looked at it closely yet --'''[[User:Wuxiaodi|wuxiaodi]] 03:12, 18 March 2007 (EDT)'''

Revision as of 08:44, 18 March 2007

Hey all, here's a sketch of the algo/pseudocode I propose for this...feel free to flesh it out and modify/edit/discuss

Blast original query

  • Take sequence: look up on Blast (see http://www.dalkescientific.com/writings/NBN/blast_searching.html);
  • Perhaps call this function sequence_lookup(str), returning some sort of object.
  • For now, let's say the object includes best_gene_hit and best_genomic_position_hit, which includes chr (chromosome) and chrpos (position on chromosome), and the p-values for each match (or whatever they call them -- I think it might be called 'expected value')
Maybe we should call the object a blastcomparison. It would consist of a list of Match objects which each in turn contain the chromosome, chrpos, adjacent_genes (maybe itself a list) and p-values. Calling sequence_lookup(str) will then return a blastcomparison. (Yeah...I'm a C/java guy too...) --Mike

Parse blast results

Case 1: Direct gene hits

  • If there is a gene hit (i.e. if sequence_lookup(str).best_gene_hit exists and has a p-value (expected value) above some threshold, then consider the given sequence to be part of a gene. Then...
  • Translate gene (call this function translate_in_frame(str); I have an algo that goes through all the frames and finds the most likely ORF; works beautifully but a little slowly, but it will do -- we don't have to write this part of the algo), and locate mutations (locate_mutations(str, ref_str), returning a list (what in C would be an array -- I might slip into C lingo every so often so this is what I mean) containing the type of mutation (point mutation, insertion, deletion) in both a.a. and DNA sequence; again, I think we all have an algo for this)
  • I have something that searches for orfs and then returns an object orf_list which specifies their positions in the original search string and parses into codons. In other words, each orf is a list of codon objects which in turn contain the sequence and Aa translation of that codon. My mutation detection function does pairwise comparisons of strings to identify types of mutations when one is considered a refseq. It then returns a list of Mutation objects (...I went object crazy...) --Mike
  • I also have code that can take a sequence compare it against a reference and then return the found mutations (of the type of interest for the March 13th assignment) within the nucleic acid seuence as well as the protein sequence. The code could also be easily modified to determine whether a given amino acid change is 'significant' (hydrophilic->hydrophobic, etc.) --Resmi

Case 2: non-gene Blast hits

  • If there is no gene hit (like the example of 13 March, which was non-coding, supposedly), take the best_genomic_position
  • Again, locate mutations, but only in the nucleotide sequence (locate_noncoding_mutations(str, ref_str)) and also maybe do a tblastx (or just blastx) [hrm...is this too much?]
  • Fnd the IDs of known SNPs and CNVs, compare to what we have about our own sequence, and then search OMIM with this info (call the function omim_noncoding_search, with parameters TBD)
  • I think this would be an easier way to deal with OMIM since presumably if we parse the OMIM database into entries, they will be organized in this fashion → so less searching through entries until we find the specific one we are interested in. --Resmi

Perform OMIM search

  • Look up these mutations for the gene on OMIM (call this function omim_gene_search(genbank_id, muts), where muts is a list of mutations from 2a to look for; for genes they are listed in OMIM in the format {amino acid}{position}{amino acid} instead of {nucleotide}{position}{nucleotide}; see http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html for info on how to search all NCBI db's)
  • Yeah....I'm pretty against working with OMIM unless we can get a fast server to run this on and just let it sit... I still have to look at the eutils but that looks promising (Zach, you said it was already implemented in Biopython?)