Harvard:Biophysics 101/2007/Notebook:Xiaodi Wu/2007-3-20

From OpenWetWare
Revision as of 21:06, 17 March 2007 by Zsun

Hey all, here's a sketch of the algorithm/pseudocode I propose for this. Feel free to flesh it out and modify/edit/discuss.

1. take sequence: look it up on BLAST (see http://www.dalkescientific.com/writings/NBN/blast_searching.html); perhaps call this function sequence_lookup(str), returning some sort of object. for now, let's say the object includes best_gene_hit and best_genomic_position_hit; the latter includes chr (chromosome) and chrpos (position on chromosome). each hit also carries the significance score for its match: BLAST reports this as an E-value ('expect value') rather than a true p-value, and smaller E-values mean better matches
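To make the shape of that return object concrete, here is a minimal sketch in Python. It does not call BLAST at all: it assumes the hits have already been parsed into simple records. Apart from best_gene_hit, best_genomic_position_hit, chr and chrpos, every name here (Hit, LookupResult, summarize_hits, is_significant, the is_gene flag, the default threshold) is a placeholder I made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One parsed BLAST match (field names are hypothetical)."""
    accession: str                # identifier of the matched database sequence
    e_value: float                # BLAST expect value: smaller means a better match
    is_gene: bool                 # True for a gene record, False for a genomic one
    chr: Optional[str] = None     # chromosome, for genomic hits
    chrpos: Optional[int] = None  # position of the match on the chromosome


@dataclass
class LookupResult:
    best_gene_hit: Optional[Hit]
    best_genomic_position_hit: Optional[Hit]


def summarize_hits(hits: List[Hit]) -> LookupResult:
    """Keep the lowest-E-value gene hit and the lowest-E-value genomic hit."""
    def best(candidates):
        return min(candidates, key=lambda h: h.e_value, default=None)

    return LookupResult(
        best_gene_hit=best([h for h in hits if h.is_gene]),
        best_genomic_position_hit=best([h for h in hits if not h.is_gene]),
    )


def is_significant(hit: Optional[Hit], threshold: float = 1e-10) -> bool:
    """A hit counts only if it exists and its E-value is below the threshold."""
    return hit is not None and hit.e_value < threshold


# Made-up hits, just to exercise the functions.
example = summarize_hits([
    Hit("gene_A", 1e-50, True),
    Hit("gene_B", 1e-5, True),
    Hit("genomic_A", 1e-60, False, chr="7", chrpos=1000),
])
print(example.best_gene_hit.accession, is_significant(example.best_gene_hit))
```

The threshold default of 1e-10 is arbitrary; whatever cutoff we settle on, the comparison has to be "below", since a small E-value is the good direction.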

2. if there is a gene hit (i.e. if sequence_lookup(str).best_gene_hit exists and has an E-value (expect value) below some threshold), then consider the given sequence to be part of a gene. then:

2a). translate the gene (call this function translate_in_frame(str); I have an algo that goes through all the frames and finds the most likely ORF; it works beautifully, if a little slowly, so we don't have to write this part), and locate mutations (locate_mutations(str, ref_str), returning a list (what in C would be an array; I might slip into C lingo every so often, so this is what I mean) containing the type of each mutation (point mutation, insertion, deletion) in both the a.a. and DNA sequence; again, I think we all have an algo for this)
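For anyone who wants something to play with before the real translate_in_frame gets posted, here is a rough stand-in. It is not the algo mentioned above: it simply assumes that "most likely ORF" can be approximated by the longest stretch that starts with Met and contains no stop codon, across all six frames. The helper aa_substitutions is also hypothetical, and it only handles point substitutions between two proteins of equal length that are already aligned (no insertions or deletions).

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Reverse-complement a DNA string (non-ACGT characters are left as-is)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; codons not in the table become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def translate_in_frame(dna):
    """Return (frame, protein) for the longest Met-initiated ORF in six frames.

    Frames +1..+3 are on the given strand, -1..-3 on the reverse complement.
    Returns (0, "") if no frame contains a Met.
    """
    best = (0, "")
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            for segment in translate(seq[offset:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[1]):
                    best = (strand * (offset + 1), segment[start:])
    return best


def aa_substitutions(protein, ref_protein):
    """List (ref_aa, 1-based position, new_aa) wherever the proteins differ.

    Assumes both strings have the same length and are already aligned.
    """
    return [(r, i, a)
            for i, (a, r) in enumerate(zip(protein, ref_protein), start=1)
            if a != r]


frame, ref_protein = translate_in_frame("ATGGCCAAGTAA")   # (1, "MAK")
_, protein = translate_in_frame("ATGGCCAGGTAA")           # AAG -> AGG
print(frame, aa_substitutions(protein, ref_protein))       # 1 [('K', 3, 'R')]
```

Note that a segment running off the end of the sequence without a stop codon still counts as a candidate here, which seems reasonable for short reads that only cover part of a gene.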

2b). look up these mutations for the gene on OMIM (call this function omim_gene_search(genbank_id, muts), where muts is the list of mutations from 2a to look for; for genes, OMIM lists mutations in the format {amino acid}{position}{amino acid} instead of {nucleotide}{position}{nucleotide}; see http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html for info on how to search all NCBI db's)
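Here is one possible way the query side of omim_gene_search could look. This sketch only builds an E-utilities esearch URL and never sends it. The function names, the gene_term parameter, and especially the way the search term is assembled (gene AND (mutation OR mutation ...)) are my assumptions, not something taken from the NCBI docs, so the term syntax would need checking against real OMIM entries.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; db=omim selects the OMIM database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def format_aa_mutation(ref_aa, position, new_aa):
    """Write a substitution as {amino acid}{position}{amino acid}, e.g. K3R."""
    return f"{ref_aa}{position}{new_aa}"


def omim_search_url(gene_term, muts):
    """Build (but do not send) an esearch URL for a gene and its mutations.

    gene_term: whatever identifies the gene in the query (hypothetical choice).
    muts: list of (ref_aa, position, new_aa) tuples.
    """
    labels = [format_aa_mutation(*m) for m in muts]
    term = gene_term
    if labels:
        term += " AND (" + " OR ".join(labels) + ")"
    return ESEARCH_URL + "?" + urlencode({"db": "omim", "term": term})


# GENE1 and K3R are made-up placeholders.
print(omim_search_url("GENE1", [("K", 3, "R")]))
```

If OMIM turns out to spell residues differently in some entries (for instance with three-letter codes), only format_aa_mutation would need to change.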

3. if there is no gene hit (like the example of 13 March, which was supposedly non-coding), take the best_genomic_position_hit

3a). again, locate mutations, but only in the nucleotide sequence (locate_noncoding_mutations(str, ref_str)) and also maybe do a tblastx (or just blastx) [hrm...is this too much?]
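A sketch of the nucleotide-only comparison in 3a. To keep it simple, it assumes the two inputs are the already-aligned query and reference strings of equal length, with '-' marking gaps (the kind of gapped pair a BLAST alignment shows), rather than raw unaligned sequences. That is a narrower input than locate_noncoding_mutations(str, ref_str) suggests, and the tuple layout of the result is my own choice.

```python
def locate_noncoding_mutations(aligned_seq, aligned_ref):
    """Compare two gapped, equal-length aligned DNA strings column by column.

    Returns a list of (kind, ref_offset, ref_base, new_base) tuples, where
    ref_offset is the 0-based count of reference bases before that column.
    Multi-base insertions or deletions show up as consecutive entries.
    """
    if len(aligned_seq) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")

    mutations = []
    ref_offset = 0  # reference bases consumed so far
    for s, r in zip(aligned_seq.upper(), aligned_ref.upper()):
        if r == "-":
            # Base present only in our sequence.
            mutations.append(("insertion", ref_offset, "", s))
        elif s == "-":
            # Reference base missing from our sequence.
            mutations.append(("deletion", ref_offset, r, ""))
        elif s != r:
            mutations.append(("point", ref_offset, r, s))
        if r != "-":
            ref_offset += 1
    return mutations


print(locate_noncoding_mutations("ACGTTCG-AC", "ACGTACGTAC"))
# [('point', 4, 'A', 'T'), ('deletion', 7, 'T', '')]
```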

3b). look up the chromosome position in dbSNP, and potentially also in the Database of Genomic Variants: http://projects.tcag.ca/variation/

3c). find the IDs of known SNPs and CNVs, compare to what we have about our own sequence, and then search OMIM with this info (call the function omim_noncoding_search, with parameters TBD)
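Steps 3b and 3c could be wired together roughly like this. The sketch assumes the known SNPs/CNVs have already been fetched into a plain dict keyed by (chromosome, position), that mutations come as (kind, ref_offset, ref_base, new_base) tuples with a 0-based offset into the aligned reference, that chrpos is the chromosome coordinate of the first aligned reference base, and that the alignment is on the forward strand. All names and the variant ID below are placeholders.

```python
def match_known_variants(chr_name, chrpos, mutations, known_variants):
    """Split our mutations into those at known variant positions and the rest.

    chr_name, chrpos: chromosome and coordinate of the first aligned base.
    mutations: list of (kind, ref_offset, ref_base, new_base) tuples.
    known_variants: dict mapping (chromosome, position) -> variant ID.
    Returns (known, novel); known entries carry the variant ID, novel
    entries carry the chromosome position instead.
    """
    known, novel = [], []
    for kind, ref_offset, ref_base, new_base in mutations:
        position = chrpos + ref_offset  # forward-strand assumption
        variant_id = known_variants.get((chr_name, position))
        if variant_id is not None:
            known.append((variant_id, kind, ref_base, new_base))
        else:
            novel.append((position, kind, ref_base, new_base))
    return known, novel


# Made-up data: one mutation sits on a known variant, the other does not.
known_variants = {("7", 1004): "variant_id_1"}
mutations = [("point", 4, "A", "T"), ("deletion", 7, "T", "")]
known, novel = match_known_variants("7", 1000, mutations, known_variants)
print(known)  # [('variant_id_1', 'point', 'A', 'T')]
print(novel)  # [(1007, 'deletion', 'T', '')]
```

The IDs in the known list are what would then be handed to omim_noncoding_search, whose parameters are still TBD. Matching on position alone ignores whether our base change agrees with the recorded alleles, so that check would still need adding.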