Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-3: Difference between revisions

From OpenWetWare
Revision as of 22:01, 2 May 2007

Tasks Due Today

Our goal for today was to establish a fully operational version of the group code. Since Mike wrote extensive code for mismatch analysis, I dedicated all of my time to a task I could reasonably complete by this date: eliminating 'false positives', that is, SNPs reported by the BLAST search that weren't actually found in the query sequence. I did this by placing additional constraints on the existing code that extracts SNP data. Here is the code I updated:

from xml.dom.minidom import parseString

# extracts snp data from the BLAST XML output
def extract_snp_data(blast_xml):
	dom = parseString(blast_xml)
	variants = dom.getElementsByTagName("Hit")
	if len(variants) == 0:
		return
	parsed = []
	for v in variants:
		# now populate the struct
		hit_def = get_text(v.getElementsByTagName("Hit_def")[0].childNodes)
		id_query = get_text(v.getElementsByTagName("Hsp_qseq")[0].childNodes)
		id_hit = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
		score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
		id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
		# extract position of the SNP from Hit Definition		
		lower_bound = hit_def.find("pos=")+4
		upper_bound = hit_def.find("len=")-1
		position = int(hit_def[lower_bound:upper_bound])
		# only consider it a genuine snp if the hit score is above 100,
		# the aligned hit sequence is at least as long as the position of the SNP
		# and the aligned query sequence is identical to the aligned hit sequence
		if int(score) > 100 and len(id_hit) >= position:
			if id_query == id_hit: parsed.append(id)
	return parsed
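
To make the filter easier to try in isolation, here is a minimal, self-contained sketch of the same idea. It is not the group code: get_text below is only a stand-in for the group's helper, the names first_text, snp_position, filter_snp_hits and make_hit are hypothetical, the sample hits are made up, and the "pos=...|len=..." layout of the hit definition is assumed from the slicing in the code above. It follows the constraints stated in the comments (score above 100, SNP position within the aligned sequence) and, like the code above, requires the aligned query and hit strings to be identical.

```python
import re
from xml.dom.minidom import parseString


def get_text(nodes):
    """Concatenate the text nodes in a DOM node list (stand-in helper)."""
    return "".join(n.data for n in nodes if n.nodeType == n.TEXT_NODE)


def first_text(element, tag):
    """Text of the first descendant of `element` named `tag`."""
    return get_text(element.getElementsByTagName(tag)[0].childNodes)


def snp_position(hit_def):
    """Return the integer after 'pos=' in a hit definition, or None."""
    match = re.search(r"pos=(\d+)", hit_def)
    return int(match.group(1)) if match else None


def filter_snp_hits(xml_text, min_score=100):
    """Return accessions of hits that pass all three constraints."""
    kept = []
    for hit in parseString(xml_text).getElementsByTagName("Hit"):
        position = snp_position(first_text(hit, "Hit_def"))
        if position is None:
            continue  # no usable position in the hit definition
        score = int(first_text(hit, "Hsp_score"))
        query_seq = first_text(hit, "Hsp_qseq")
        hit_seq = first_text(hit, "Hsp_hseq")
        if score > min_score and position <= len(hit_seq) and query_seq == hit_seq:
            kept.append(first_text(hit, "Hit_accession"))
    return kept


def make_hit(accession, position, score, query_seq, hit_seq):
    """Build one made-up <Hit> element in the shape of BLAST XML output."""
    return (
        f"<Hit><Hit_def>{accession} pos={position}|len=101</Hit_def>"
        f"<Hit_accession>{accession}</Hit_accession>"
        f"<Hit_hsps><Hsp><Hsp_score>{score}</Hsp_score>"
        f"<Hsp_qseq>{query_seq}</Hsp_qseq><Hsp_hseq>{hit_seq}</Hsp_hseq>"
        f"</Hsp></Hit_hsps></Hit>"
    )


SAMPLE_XML = "<BlastOutput>" + "".join([
    make_hit("rs1", 5, 250, "ACGTACGTAC", "ACGTACGTAC"),   # passes every check
    make_hit("rs2", 5, 250, "ACGTACGTAC", "ACGTTCGTAC"),   # sequences differ
    make_hit("rs3", 5, 50, "ACGTACGTAC", "ACGTACGTAC"),    # score too low
    make_hit("rs4", 50, 250, "ACGTACGTAC", "ACGTACGTAC"),  # SNP outside alignment
]) + "</BlastOutput>"

print(filter_snp_hits(SAMPLE_XML))  # ['rs1']
```

Using a regular expression for the position avoids depending on "len=" immediately following "pos=", which the find-based slicing assumes; whether that matters depends on the hit definitions the database actually returns.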

I used our old friend, Apoe (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=K00396), and ran it through the old and updated versions of the group code. Here's the output from the old code: