Harvard:Biophysics 101/2007/Notebook:Xiaodi Wu/2007-5-3

From OpenWetWare

Revision as of 16:16, 28 April 2007

A proposal

Having acquired a hard-to-find but now easily updatable local source of data on genes and their loci, we can now work around the problem of not being able to read images off of BLAST output. All we need from BLAST is a locus, which Katie's script retrieves elegantly. Then, with this data, we can:

  • Query our local database to ask what genes are in the locus (a simple MySQL query, and a very quick one, I hope)
  • Find the reference sequence (we also have a local copy of the genome, and know the exact locus--from which bp to which bp--from the BLAST query)
  • Compare the query sequence to the reference (we've all written scripts for that)
  • Translate into protein if it's a coding sequence; otherwise, come up with some other way of expressing the change
  • Output the mutations (we also get an alignment back from BLAST...this is wonderful) using OMIM's notation, like this: [BRIP1, MET299ILE] (basically, [{gene name}, {reference amino acid}{position}{new amino acid}]) and search in OMIM (we already have code for that, obviously)
  • Reap the benefits! (Also, compare it to dbSNP data.)
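
To make the middle steps concrete, here is a rough Python sketch of the locus lookup, the translation, and the OMIM-style output. It is only an illustration under stated assumptions, not our actual code: the `genes` table and its columns (`name`, `chrom`, `start_bp`, `end_bp`) are made up, `sqlite3` stands in for MySQL so the example runs on its own, and the function names are hypothetical. It also assumes the two coding sequences are already aligned, in frame, on the coding strand, and free of indels and ambiguous bases.

```python
import sqlite3

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Upper-case three-letter codes, as in the [BRIP1, MET299ILE] example.
# "TER" for a stop codon is an assumption about the notation.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
    "*": "TER",
}


def genes_at_locus(conn, chrom, start, end):
    """Return names of genes overlapping [start, end] on chrom.

    Assumes a hypothetical table genes(name, chrom, start_bp, end_bp).
    """
    rows = conn.execute(
        "SELECT name FROM genes "
        "WHERE chrom = ? AND start_bp <= ? AND end_bp >= ? "
        "ORDER BY name",
        (chrom, end, start),
    )
    return [name for (name,) in rows]


def translate(cds):
    """Translate an in-frame coding sequence; trailing partial codons are dropped."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def omim_mutations(gene, ref_cds, sample_cds):
    """List amino acid substitutions as '[GENE, REFposNEW]' strings.

    Only handles substitutions; positions are 1-based residue numbers.
    """
    ref_protein = translate(ref_cds.upper())
    sample_protein = translate(sample_cds.upper())
    hits = []
    for pos, (ref_aa, new_aa) in enumerate(zip(ref_protein, sample_protein), start=1):
        if ref_aa != new_aa:
            hits.append("[%s, %s%d%s]" % (
                gene, THREE_LETTER[ref_aa], pos, THREE_LETTER[new_aa]))
    return hits
```

With a toy reference whose codon 299 is ATG and a sample carrying ATA there, `omim_mutations("BRIP1", ref, sample)` gives `['[BRIP1, MET299ILE]']`, which can be handed straight to the OMIM search. Synonymous changes produce nothing here, and noncoding loci would need the "some other way of expressing this" that is still undecided.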

Does this sound like a clear and workable plan to people? Are there other considerations to be factored in?

A request

Related to what Katie has asked for in class: I've focused a lot on the first few stages of things, getting from sequence to OMIM. Regarding the 'reaping the benefits' part above, could people who've worked on the subsequent steps outline on their wiki page a sort of step-by-step accounting of what happens to this data after OMIM and what sort of results we get, just like we started doing in class, and then link to their page from the class tasks list page? There's a lot of code and work that's evidently been poured into this effort, but it's still somewhat unclear to me what exactly it all does...

A random thought

How is the idea of putting bioinformatics data into a database and then running a query on the database considered interesting enough for a paper in 2005? How?