Harvard:Biophysics 101/2007/Project: Difference between revisions

From OpenWetWare
m (oops, bad link)
 
(44 intermediate revisions by 8 users not shown)
Line 1: Line 1:
{{Template:Harvard_Biophysics_101:2007}}
{{Template:Harvard_Biophysics_101:2007}}
<div style="padding: 10px; width: 720px; border: 5px solid #DDDDFF;">
<div style="padding: 10px; width: 720px; border: 5px solid #DDDDFF;">
= [http://snp.med.harvard.edu/rosencrantz/ Completed HTML Interface] =
'''Screenshots'''
<gallery>
Image:070503_screenshot1.png|Main input page
Image:070503_screenshot2.png|Example output
</gallery>


= Overview =
*Project Goal: Development of tools to aid in analysis of personal DNA sequences.   
*Project Goal: To develop tools to aid in analysis of personal DNA sequences.   
We would like to develop software and documentation that will help people get from sequence to diagnosis.  At the moment, we are focusing on identifying and classifying SNPs, but we will broaden this identification to other things like large deletions or insertions or repeats when we have more expertise.  We are attempting to harness the power of other already existing tools, and we would also like to make this tool one that others can build upon.  Specifically, our program will eventually be able to determine location based on BLAST, determine any SNPs based on NCBI SNP, and give a prognosis based on OMIM and online medical databases.


=Project Sections=
= Project Summary =
* ATTENTION: Everyone needs to post their code to one place. Let's say everyone posts a working link to their code from here, and then I'll be able to combine it all. --[[User:Kfifer|Katie Fifer]]
 
* Could someone who typed this up today please add the other sections that are being worked on? '''--[[User:TChan|TChan]], 12:47 20 March 2007'''
This code takes a codon sequence from the user and searches it against the human database to look for SNPs (Single Nucleotide Polymorphisms). The sequence is looked up in BLAST SNP, and any SNPs found are reported as RS numbers<sup>1</sup>. The query is compared against the sequence in dbSNP to determine whether it really is a mutation; if this test passes, the RS number is then used to retrieve MeSH terms from PubMed, and the code determines which MeSH terms are the most relevant. The potential disease and its prevalence (derived from the California State Prevalence data) are extracted from the most pertinent MeSH terms. These MeSH terms are then used to get updated news regarding the disease.
 
<sup>1</sup>Aside: one portion of the code accesses BLAST SNP without using RS numbers. The information acquired through the BLAST SNP website is queried in OMIM, and the output is as follows: disease name, mutation, and name of the mutation. The disease name is then used to search different websites for drugs, procedures, and experts regarding the disease, and to provide a list of web pages returned for the disease name.
 
=Path 1: Using BlastSNP=
 
[[Harvard:Biophysics_101/2007/Project/Inputs/Outputs|Inputs & Outputs]]
 
==[1]. Sequence &rarr; RS Number (BLAST SNP)&rarr; Disease Name (OMIM)==
* Xiaodi: [[Harvard:Biophysics_101/2007/Notebook:Katie_Fifer/2007-4-15|code]] prints and outputs to file pickled version of the OMIM output for our sequence
* Chris: quality checking [[Harvard:Biophysics_101/2007/Notebook:Christopher_Nabel/2007-4-24|code]] returns sequences back after dbSNP
* Kay: [[Harvard:Biophysics_101/2007/Notebook:Kaull/2007-5-3|Work-around]] for SNPs not directly linked to OMIM
 
==[2]. RS Number &rarr; PubMed Articles &rarr; Mesh Terms ==
* Cynthia: [[Harvard:Biophysics_101/2007/Notebook:CChi/2007-5-1|code]] (1) takes an rs# as input, (2) outputs a list of MeSH terms, and (3) prints the top 5 PubMed articles and the list of MeSH terms
 
 
==[3]. RS Number (OMIM) &rarr; PubMed PMIDs &rarr; Mesh Terms==
* Resmi: [[Harvard:Biophysics_101/2007/Notebook:Resmi_Charalel/2007-4-26|code]] (1) parses OMIM XML to (2) get PubMed PMIDs to (3) get Mesh Terms
 
 
==[4]. Mesh Terms &rarr; Disease Name (in-house code)==
* Deniz
* Zach: [[Harvard:Biophysics_101/Notebook:ZS/2007-5-3 | code]] finds MeSH codes (Cynthia), determines the most important MeSH codes, finds prevalence, and finds the CDC ID and formal disease name.
 
==[5]. Disease Name &rarr; Prevalence (&rarr; regulates Useful Patient Info)==
* Zach: see above, complete and compiled


:* The following is from the information that Zach typed up in class (Located  [[Harvard:Biophysics_101/2007/Notebook:Xiaodi_Wu/2007-3-20 | here]]) '''--[[User:Hetmann|Hetmann]], 5:32 20 March 2007'''
==[6]. Disease Name &rarr; Useful Patient Info (Medical search engines)==
* Deniz: [[Harvard:Biophysics_101/2007/Notebook:Denizkural/2007-4-3|examples of code]] that return news updates in several formats
* Tiff: [[TChan/Notebook/2007-5-2|code]] (a) displays relevant URLs, (b) returns lists of drugs, clinical trials, experts, etc.
* Tiff: [[TChan/Notebook/2007-5-3#Allelic_Frequency|code]] outputs allelic frequency data parsed from dbSNP (what we actually want to show to the user must be selected out of the data)
* Resmi: [[Harvard:Biophysics_101/2007/Notebook:Resmi_Charalel/2007-4-5|code]] displays PubMed review citations in text form
* Cynthia: [[Harvard:Biophysics_101/2007/Notebook:CChi/2007-4-24|code]] displays PubMed article citations in text form


:* Edited to add some of my own notes and to reflect some semblance of order.. --[[User:Cchi|Cchi]] 10:00, 22 March 2007 (EDT)
==[7]. All displayable data &rarr; Web Interface==
* Xiaodi and Katie: [[Harvard:Biophysics_101/2007/Notebook:Xiaodi_Wu/2007-4-17|code and description]]; displays some of the preceding data in a web interface (some remaining scripts still need to be integrated)


==Integration==
=Path 2: Using GenBank and PolyPhen=
* Katie
* PM / encourage documentation


==Sequence to BLAST SNP to rs#==
==[1]. Sequence &rarr; Gene Name (BLAST)==
'''Update:''' a script from BLAST SNP to OMIM is now working in its entirety; see [[Harvard:Biophysics_101/2007/Notebook:Xiaodi_Wu/2007-4-5|here]] --[[User:Wuxiaodi|wuxiaodi]] 22:25, 5 April 2007 (EDT)
* Zach: [[Harvard:Biophysics_101/Notebook:ZS/2007-5-2|code]] parses the BLAST HTML to report whether the sequence is in a CDS; if so, it returns the CDS, asc#, and seq, and if not, the seq. info and surrounding genes
* Katie: [[Harvard:Biophysics_101/2007/Notebook:Katie_Fifer/2007-4-19|code]] returns BLAST XML data.


* Zach, Mike, and Tiffany
==[2]. Gene Name &rarr; Mutations (GenBank)==
'''BioPython Modification'''
* Xiaodi
* Parsing XML of Biopython BLAST - Deniz
* Chris: quality checking [[Harvard:Biophysics_101/2007/Notebook:Christopher_Nabel/2007-4-17|code]] detects mismatches between our sequence and returned SNPs
* Relevant file: [[Image:NCBIWWWforBLASTSNPmodif-TC32007.py|Python25/Lib/site-packages/Bio/Blast/NCBIWWW.py]]
* Discussion on BLAST SNP can proceed on the [[Talk:Harvard:Biophysics_101/2007/Project|discussion page]].




'''Accessing BLAST SNP using URLAPI'''
==[3]. Gene Name &rarr; ??? (PolyPhen)==
*To access snp blast database using BLAST URLAPI, you only need to provide the "DATABASE" parameter an appropriate value.
* Mike
*The path and name of SNP blast databases available to URLAPI and blastcl3 client are documented at http://www.ncbi.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html#8
*For more information on URLAPI, please see: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/


'''Info from BioPython (via Zach)'''
*It looks like if you know the name of the database (here "snp/human_9606/human_9606"), you can run, for example, <code>from Bio.Blast import NCBIWWW</code> followed by <code>result_handle = NCBIWWW.qblast("blastn", "snp/human_9606/human_9606", seq)</code>, and then parse the results as usual (see section 3.4 in the Biopython tutorial).
*Database names can be found here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html                                                                                                       
*The result_handle should give you the same information as the web page, and Biopython's parser should parse all information from result_handle correctly. If you find that some information seems to be missing, please let us know.


'''--[[User:ShawnDouglas|smd]] 18:05, 22 March 2007 (EDT)'''
==[4]. Mutations &rarr; Disease Name (OMIM)==
* Xiaodi


==OMIM XML Parse==
* Xiaodi - completed? Yup: [[Harvard:Biophysics_101/2007/Notebook:Xiaodi_Wu/2007-4-2|here]]
* rs -> OMIM XML parse -> phenotype text


==OMIM==
==[5]. Disease Name &rarr; Useful Patient Info (Medical search engines)==
* Resmi, Cynthia, and Hetmann
* Deniz:
* Handling the text from the parse: [[Harvard:Biophysics_101/2007/Notebook:Resmi_Charalel/2007-4-5|OMIM output to list of PubMed Review Articles]] <-The code on this page will print out review articles that might be relevant to the disease returned by OMIM.
**examples of code which returns [[Harvard:Biophysics_101/2007/Notebook:Denizkural/2007-4-3|Google news updates]] and [[Harvard:Biophysics_101/2007/Notebook:Denizkural/2007-4-24|MedStory news updates]] in several formats
**blogs through technorati
**clinical trials by disease name and by drug
**medical articles and resources
**research articles
* Tiff: [[TChan/Notebook/2007-5-2|code]] (a) displays relevant URLs, (b) returns lists of drugs, clinical trials, experts, etc.
* Tiff: [[TChan/Notebook/2007-5-3#Allelic_Frequency|code]] outputs allelic frequency data parsed from dbSNP (what we actually want to show to the user must be selected out of the data)
* Resmi: [[Harvard:Biophysics_101/2007/Notebook:Resmi_Charalel/2007-4-5|code]] displays PubMed review citations in text form
* Cynthia: [[Harvard:Biophysics_101/2007/Notebook:CChi/2007-4-24|code]] displays PubMed article citations in text form


'''Controlled Vocabulary for parsing OMIM records'''
==[6]. All displayable data &rarr; Web Interface==
*[http://www.biomedcentral.com/1471-2105/6/S4/S18 Masseroli et al.]:  "Our efforts to derive from the OMIM entries a controlled vocabulary of phenotype locations and descriptions enabled us to normalize and structure the valuable OMIM phenotypic data according to the obtained vocabulary and make them suitable for computational use. Although detailed phenotype descriptions could be further homogenized and standardized, their subdivision in hierarchical levels of detail that we performed allows to group specific phenotypes according to their common general traits, without loosing their specific characteristics. So, for example "Mental retardation, moderate" and "Mental retardation, nonspecific" can be both generally considered as "Mental retardation" and at the same time they can be treated as different types of mental defects. This provides the chance to modulate analysis granularity when searching for phenotypic traits shared among multiple diseases or genotypes. It also ensures more significant and clear results when categorical statistical analyses are performed at lower granularity levels of detail. Such interesting feature, proper of the hierarchical structure and hence belonging also to the defined phenotype location hierarchy, is exploited in the new GFINDer Genetic Disorders modules implemented for the study of genetic disorder related genes." 
* Xiaodi and Katie: [[Harvard:Biophysics_101/2007/Notebook:Xiaodi_Wu/2007-4-17|code and description]]; displays some of the preceding data in a web interface (some remaining scripts still need to be integrated)
*http://promoter.bioing.polimi.it/gfinder/Phenotypes.txt
* Deniz - write the output HTML for news, drugs, research articles, and all the other output.
*http://promoter.bioing.polimi.it/gfinder/Phenotype_Locations.txt
'''—[[User:ShawnDouglas|smd]] 13:19, 22 March 2007 (EDT)'''


==Beyond OMIM==
=Diagram=
* Tiffany, Resmi, Deniz, Xiaodi, Mike, Chris (''note: ask if API exists'')
[[Image:BioPhys101_Project_Diagram_4_30_07.jpg|650px|Project Diagram]]
:* Wikipedia (Mike) http://meta.wikimedia.org/wiki/API
:* Webmd (Tiff)
:* Emedicine (Resmi)
:* Google, Medstory. (Deniz)
:** Google, Part 1: latest news for our OMIM phrases, [[Harvard:Biophysics_101/2007/Notebook:Denizkural/2007-4-3|explanation]]
:* Linking out of XML (Xiaodi)
:* MedStory (Mike?)
:* Pubmed (Chris)
:* Downloading OMIM, extra functionalities, Eutils (Deniz)


==Multiple SNPs==
* Chris, Deniz
* figure out which of multiple SNPs are relevant
Kay


==Unassigned==
=Project-in-Progress=
#not in SNP db... then what? - I'd like to point out new efforts that aim to replace OMIM, called the "Human Variome Project" -- Deniz
Project-in-Progress notes have been moved to their [[Harvard:Biophysics_101/2007/Project_in_Progress|own page]].
#OMIM DOA
#systematically nonsyn. -> mutation not in OMIM or dbSNP?
#other dbs: genecard (spec. conservation, pop. freq)
#looking into linking gene expression w/ GEO?


=Project Ideas=
Project ideas have been moved to their [[Harvard:Biophysics_101/2007/ProjectIdeas|own page]].

Latest revision as of 06:10, 3 May 2007

Biophysics 101: Genomics, Computing, and Economics


Completed HTML Interface

Screenshots

Overview

  • Project Goal: To develop tools to aid in analysis of personal DNA sequences.

We would like to develop software and documentation that will help people get from sequence to diagnosis. At the moment, we are focusing on identifying and classifying SNPs, but we will broaden this identification to other things like large deletions or insertions or repeats when we have more expertise. We are attempting to harness the power of other already existing tools, and we would also like to make this tool one that others can build upon. Specifically, our program will eventually be able to determine location based on BLAST, determine any SNPs based on NCBI SNP, and give a prognosis based on OMIM and online medical databases.

Project Summary

This code takes a codon sequence from the user and searches it against the human database to look for SNPs (Single Nucleotide Polymorphisms). The sequence is looked up in BLAST SNP, and any SNPs found are reported as RS numbers¹. The query is compared against the sequence in dbSNP to determine whether it really is a mutation; if this test passes, the RS number is then used to retrieve MeSH terms from PubMed, and the code determines which MeSH terms are the most relevant. The potential disease and its prevalence (derived from the California State Prevalence data) are extracted from the most pertinent MeSH terms. These MeSH terms are then used to get updated news regarding the disease.

¹Aside: one portion of the code accesses BLAST SNP without using RS numbers. The information acquired through the BLAST SNP website is queried in OMIM, and the output is as follows: disease name, mutation, and name of the mutation. The disease name is then used to search different websites for drugs, procedures, and experts regarding the disease, and to provide a list of web pages returned for the disease name.

Path 1: Using BlastSNP

Inputs & Outputs

[1]. Sequence → RS Number (BLAST SNP)→ Disease Name (OMIM)

  • Xiaodi: code prints and outputs to file pickled version of the OMIM output for our sequence
  • Chris: quality checking code returns sequences back after dbSNP
  • Kay: Work-around for SNPs not directly linked to OMIM
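
The "prints and outputs to file pickled version" step might look roughly like the sketch below. It is only an illustration: the function names, the file name, and the fields of the example record are assumptions made here, not the structure of the class's actual OMIM output.

```python
import pickle
import tempfile
from pathlib import Path


def save_omim_result(result, path):
    """Print a parsed OMIM result and write a pickled copy to disk."""
    print(result)
    with open(path, "wb") as handle:
        pickle.dump(result, handle)


def load_omim_result(path):
    """Read back a result written by save_omim_result."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder record: the field names and values are invented for illustration.
example = {"rs": "rs0000000", "disease": "example disease", "mutation": "A>G"}
out_path = Path(tempfile.gettempdir()) / "omim_result.pkl"
save_omim_result(example, out_path)
restored = load_omim_result(out_path)
```

Pickling lets a later script in the pipeline reload the parsed result without querying OMIM again, at the cost of the file being readable only from Python.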

[2]. RS Number → PubMed Articles → Mesh Terms

  • Cynthia: code (1) takes an rs# as input, (2) outputs a list of MeSH terms, and (3) prints the top 5 PubMed articles and the list of MeSH terms
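
One plausible way to look up PubMed articles for an rs# is NCBI's E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact NCBI. Searching PubMed with the rs number as a free-text term is an assumption made for illustration, as are the function name and the example rs number; the linked notebook code may do this differently.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(rs_number, max_articles=5):
    """Build an ESearch URL asking PubMed for up to max_articles matching IDs."""
    params = {"db": "pubmed", "term": rs_number, "retmax": max_articles}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# "rs12345" is a placeholder identifier, not a SNP discussed on this page.
url = pubmed_search_url("rs12345")
print(url)
```

The response to such a request is XML containing PubMed IDs, which would then be passed to a fetch step to obtain the MeSH terms for each article.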


[3]. RS Number (OMIM) → PubMed PMIDs → Mesh Terms

  • Resmi: code (1) parses OMIM XML to (2) get PubMed PMIDs and then (3) get MeSH terms


[4]. Mesh Terms → Disease Name (in-house code)

  • Deniz
  • Zach: code finds MeSH codes (Cynthia), determines the most important MeSH codes, finds prevalence, and finds the CDC ID and formal disease name.
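
The page does not say how the "most important" MeSH codes are chosen. A simple stand-in, assumed here purely for illustration, is to rank terms by how many of the retrieved articles carry them. The function name and the hand-made term lists are hypothetical.

```python
from collections import Counter


def rank_mesh_terms(articles, top_n=3):
    """Return the top_n MeSH terms ordered by how many articles carry them."""
    counts = Counter()
    for terms in articles:
        counts.update(set(terms))  # count each term at most once per article
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hand-made term lists standing in for three PubMed records.
articles = [
    ["Humans", "Polymorphism, Single Nucleotide", "Diabetes Mellitus, Type 2"],
    ["Humans", "Diabetes Mellitus, Type 2"],
    ["Humans", "Genetic Predisposition to Disease"],
]
top_terms = rank_mesh_terms(articles)
print(top_terms)
```

In this toy input the generic term "Humans" ranks first, which suggests that a real ranking would probably need to filter out such non-specific terms before picking a disease name.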

[5]. Disease Name → Prevalence (→ regulates Useful Patient Info)

  • Zach: see above, complete and compiled

[6]. Disease Name → Useful Patient Info (Medical search engines)

  • Deniz: examples of code that return news updates in several formats
  • Tiff: code (a) displays relevant URLs, (b) returns lists of drugs, clinical trials, experts, etc.
  • Tiff: code outputs allelic frequency data parsed from dbSNP (what we actually want to show to the user must be selected out of the data)
  • Resmi: code displays PubMed review citations in text form
  • Cynthia: code displays PubMed article citations in text form

[7]. All displayable data → Web Interface

  • Xiaodi and Katie: code and description; displays some of the preceding data in a web interface (some remaining scripts still need to be integrated)

Path 2: Using GenBank and PolyPhen

[1]. Sequence → Gene Name (BLAST)

  • Zach: code parses the BLAST HTML to report whether the sequence is in a CDS; if so, it returns the CDS, asc#, and seq, and if not, the seq. info and surrounding genes
  • Katie: code returns BLAST XML data.
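
As a rough idea of what can be done with the BLAST XML data, the sketch below pulls the hit ID, description, E-value, and hit coordinates out of a BLAST XML document using only the standard library. The element names follow NCBI's BLAST XML output format, but the sample document is a hand-written placeholder and the function name is hypothetical; Biopython's own BLAST XML parser is an alternative.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder in the shape of NCBI BLAST XML output (not real results).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>example|0001</Hit_id>
          <Hit_def>placeholder hit description</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-20</Hsp_evalue>
              <Hsp_hit-from>101</Hsp_hit-from>
              <Hsp_hit-to>150</Hsp_hit-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def summarize_hits(xml_text):
    """Return one summary dict per <Hit>, using only its first HSP."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")  # first HSP of this hit
        hits.append({
            "id": hit.findtext("Hit_id"),
            "definition": hit.findtext("Hit_def"),
            "evalue": float(hsp.findtext("Hsp_evalue")),
            "start": int(hsp.findtext("Hsp_hit-from")),
            "end": int(hsp.findtext("Hsp_hit-to")),
        })
    return hits


hits = summarize_hits(SAMPLE_XML)
print(hits)
```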

[2]. Gene Name → Mutations (GenBank)

  • Xiaodi
  • Chris: quality checking code detects mismatches between our sequence and returned SNPs
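
A minimal sketch of such a mismatch check is shown below, assuming the query and the returned sequence have already been aligned to the same length. The function name and the example sequences are made up for illustration and are not taken from the linked code.

```python
def find_mismatches(query, reference):
    """Return (position, query_base, reference_base) for each differing position.

    Positions are 0-based. Both sequences must already be aligned and equal in length.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, q, r)
        for i, (q, r) in enumerate(zip(query.upper(), reference.upper()))
        if q != r
    ]


# Toy sequences that differ at a single position.
mismatches = find_mismatches("ACGTTGCA", "ACGTAGCA")
print(mismatches)
```

A single reported mismatch at the expected SNP position would be consistent with the query carrying the variant; additional mismatches might indicate a poor alignment or sequencing errors and would need a closer look.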


[3]. Gene Name → ??? (PolyPhen)

  • Mike


[4]. Mutations → Disease Name (OMIM)

  • Xiaodi


[5]. Disease Name → Useful Patient Info (Medical search engines)

  • Deniz:
    • examples of code which returns Google news updates and MedStory news updates in several formats
    • blogs through Technorati
    • clinical trials by disease name and by drug
    • medical articles and resources
    • research articles
  • Tiff: code (a) displays relevant URLs, (b) returns lists of drugs, clinical trials, experts, etc.
  • Tiff: code outputs allelic frequency data parsed from dbSNP (what we actually want to show to the user must be selected out of the data)
  • Resmi: code displays PubMed review citations in text form
  • Cynthia: code displays PubMed article citations in text form

[6]. All displayable data → Web Interface

  • Xiaodi and Katie: code and description; displays some of the preceding data in a web interface (some remaining scripts still need to be integrated)
  • Deniz - write the output HTML for news, drugs, research articles, and all the other output.

Diagram

Project Diagram


Project-in-Progress

Project-in-Progress notes have been moved to their own page.

Project Ideas

Project ideas have been moved to their own page.