Harvard:Biophysics 101/Notebook:ZS/2007-4-22
From OpenWetWare
Latest revision as of 23:23, 24 April 2007
==Tasks for this Tues/Thurs==
#Advance CDC prevalence parsing program - though I don't think that needs more work (edit: done below)
#I need a new direction for work - are there any tasks that need to be completed?
#*I was thinking of ways to begin tackling the not-in-OMIM case - I can do some code consolidation for sequences that are CDS but not documented, and develop the AA change/frameshift/etc. analysis we tackled earlier in class into something coherent.
==Update: Tues==
After discussion, I have decided to push back code consolidation and focus on a more meaningful task that Xiaodi and Katie have been having trouble with - finding out if a sequence is in a CDS or not. I have some ideas for parsing BLAST data (which probably won't work), but also for going through Entrez Gene, where I've been able to determine the gene range WITHOUT parsing the image - I think it may work after some experimentation.
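Once a gene's coding ranges are known, the "is this position in a CDS" question can be framed as a simple interval check. The sketch below is written in Python 3 and assumes the CDS has already been reduced to a list of inclusive, 1-based (start, end) ranges. The coordinates and the names `CDS_RANGES`, `in_cds` and `cds_offset` are made up for illustration and do not come from Entrez Gene.

```python
# Hypothetical CDS segments as inclusive, 1-based (start, end) pairs.
# These coordinates are invented for illustration only.
CDS_RANGES = [(120, 310), (450, 702), (915, 1130)]


def in_cds(position, cds_ranges):
    """Return True if position falls inside any (start, end) range."""
    return any(start <= position <= end for start, end in cds_ranges)


def cds_offset(position, cds_ranges):
    """Return the 0-based offset of position within the spliced CDS,
    or None if the position lies outside every range."""
    offset = 0
    for start, end in sorted(cds_ranges):
        if start <= position <= end:
            return offset + (position - start)
        # Position is not in this segment: skip over its full length.
        offset += end - start + 1
    return None


if __name__ == "__main__":
    for pos in (100, 120, 500, 800):
        print(pos, in_cds(pos, CDS_RANGES), cds_offset(pos, CDS_RANGES))
```

If the first range starts at the first base of the start codon, `cds_offset(pos, ranges) % 3` would give the position within the codon, which is the piece the AA change/frameshift analysis needs.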
==Slight modification to prev. program==
<pre>
#ICD9_prevalance.py
#Zachary Sun
#4.23.07, Biophysics 101
#This code does a couple of things:
#1) Enables lookup of ICD9 ID numbers when given a search term
#   *This is particularly useful as all data (WHO, CDC) is based
#   *on the ICD9 ID, which is a method of classifying all known
#   *diseases. This code queries a website, http://icd9cm.chrisendres.com
#   *which has the database of diseases - it returns the best hits
#   *in dirty HTML, which can then be parsed to obtain ICD9 #'s and description.
#2) Enables lookup of prevalence data in databases
#   *Currently, I only have it hooked up to the State of CA prevalence data from
#   *http://www.oshpd.ca.gov/hqad/PatientLevel/ICD9_Codes/index.htm, it does a
#   *lookup based on #1 and returns prevalence data. As soon as I find more databases
#   *I can extend the search; I haven't found much good though save the main CDC db
#   *at http://wonder.cdc.gov/.
#To do: clean up output

import os
import string
import urllib

disease_name = "asthma" #INSERT DISEASE NAME HERE FOR TESTING
ICD9code = []
found = 0

####
#querying http://icd9cm.chrisendres.com for code lookup
####
queue_name = 'http://icd9cm.chrisendres.com/index.php?action=search&srchtype=diseases&srchtext='
queue_name = queue_name + disease_name
code_lookup = urllib.urlopen(queue_name).read() #send query to site, returns dirty html

out = open("index.txt", "w") #write dirty html to file
out.write(code_lookup)
out.close()

readCode = open("index.txt", "r") #read dirty html
lookup_line = readCode.readline()
print "***ICD9 and hits, arranged by importance***\n"
while lookup_line:
    w = lookup_line.find("<div class=dlvl>") #the unique marker before the disease
    if w != -1: #if it is found
        tempCode = lookup_line[32:40] #code in this section
        tempCode = string.split(tempCode, ' ') #split off the number
        ICD9code.append(tempCode[0])
        print lookup_line[32:len(lookup_line)-7] #disable in final version - shows the hits found
    lookup_line = readCode.readline()
readCode.close()

####
#searching incidence data in CA data file
#note: returns class of hits
####
print "\n\n***Prevalance data per IDC9 above***"
print "Note: returns incidence data/yr, including subclasses\n"
fh = open(os.path.join(os.curdir, "dx05.txt")) #the CA data file
for code in ICD9code:
    fh.seek(0) #rewind the data file for each code
    line = fh.readline()
    totalIncidence = 0
    sumCount = 0
    while line: #for every line in the data
        line = line.rstrip("\n") #remove \n
        lineVector = string.split(line, ',') #split to vector
        if lineVector[0].find(code) != -1: #if disease hit
            totalIncidence += int(lineVector[1])
            sumCount += 1
        else:
            if sumCount > 0: #a run of hits just ended, report it
                print "ID: ", code, "incidence: ", totalIncidence
                totalIncidence = 0
                sumCount = 0
        line = fh.readline()
    if sumCount > 0: #report a run of hits that reaches the end of the file
        print "ID: ", code, "incidence: ", totalIncidence
fh.close()
</pre>
output:
<pre>
***ICD9 and hits, arranged by importance***
493.0 Extrinsic asthma
493.1 Intrinsic asthma
493.9 Asthma, unspecified
495.8 Other specified allergic alveolitis and pneumonitis
493.2 Chronic obstructive asthma
V17.5 Asthma</div>
493.8 Other forms of asthma</div>
493.82 Cough variant asthma</div>
507.8 Due to other solids and liquids
786.07 Wheezing
***Prevalance data per IDC9 above***
Note: returns incidence data/yr, including subclasses
ID: 493.0 incidence: 13694
ID: 493.1 incidence: 148
ID: 493.9 incidence: 150445
ID: 495.8 incidence: 38
ID: 493.2 incidence: 50978
ID: V17.5 incidence: 1766
ID: 493.8 incidence: 368
ID: 493.82 incidence: 36
ID: 507.8 incidence: 232
ID: 786.07 incidence: 932
</pre>
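The `</div>` left on a few hits above comes from slicing the HTML at fixed offsets. A more tolerant option might be to pull the code and description out with a regular expression, and to total incidence with a prefix match. The following is a minimal Python 3 sketch rather than the program above. The names `HIT_PATTERN`, `parse_hit` and `total_incidence` are hypothetical, the sample HTML line and the `code,count` rows are invented, and the real page layout and `dx05.txt` format may differ (in particular, this assumes the codes in the data file are written with the decimal point).

```python
import re

# Matches an ICD9-style code (e.g. 493.0, V17.5, 786.07) followed by a
# description, stopping before any trailing tag. The surrounding HTML is
# an assumption about the page layout, not a verified format.
HIT_PATTERN = re.compile(
    r"<div class=dlvl>.*?([VE]?\d{2,3}(?:\.\d{1,2})?)\s+([^<]+)"
)


def parse_hit(html_line):
    """Return (code, description) for a hit line, or None if no match."""
    match = HIT_PATTERN.search(html_line)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def total_incidence(rows, code):
    """Sum counts over 'code,count' rows whose code starts with `code`,
    so subclasses such as 493.82 are included under 493.8."""
    total = 0
    for row in rows:
        row_code, count = row.strip().split(",")
        if row_code.startswith(code):
            total += int(count)
    return total


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample_line = "<div class=dlvl>493.82 Cough variant asthma</div>\n"
    sample_rows = ["493.80,100", "493.81,200", "493.82,36", "493.90,5"]
    print(parse_hit(sample_line))
    print(total_incidence(sample_rows, "493.8"))
```

Unlike `find`, `startswith` only matches at the beginning of the code, and summing over the whole file avoids having to detect where a run of hits ends. Whether prefix matching is the right rule depends on how the codes are actually written in the data file.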