Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-3: Difference between revisions

From OpenWetWare

Revision as of 21:59, 2 May 2007

Tasks Due Today

Our goal for today was to establish a fully operational version of the group code. Since Mike wrote extensive code for mismatch analysis, I devoted all of my time to a task I could reasonably complete by this date: eliminating 'false positives', that is, SNPs reported by the BLAST search that are not actually present in the query sequence. I did this by adding constraints to the existing code that extracts SNP data. Here is the code I updated:

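# extracts snp data
# (get_text is defined elsewhere in the group code)
from xml.dom.minidom import parseString

def extract_snp_data(xml_str):
	dom = parseString(xml_str)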
	variants = dom.getElementsByTagName("Hit")
	if len(variants) == 0:
		return
	parsed = []
	for v in variants:
		# now populate the struct
		hit_def = get_text(v.getElementsByTagName("Hit_def")[0].childNodes)
		# Hsp_qseq is the aligned query sequence; Hsp_hseq is the aligned hit (subject) sequence
		id_query = get_text(v.getElementsByTagName("Hsp_qseq")[0].childNodes)
		id_hit = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
		score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
		id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
		# extract position of the SNP from Hit Definition		
		lower_bound = hit_def.find("pos=")+4
		upper_bound = hit_def.find("len=")-1
		position = int(hit_def[lower_bound:upper_bound])
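		# only consider it a genuine snp if the hit score is above 100,
		# the query/hit sequences are at least as long as the position of the SNP
		# (assuming pos= is 1-based and the alignment starts at the first base of the hit)
		# and the query sequence matches the hit sequence at the SNP position
		if int(score) > 100 and position <= len(id_query) and position <= len(id_hit):
			if id_query[position - 1] == id_hit[position - 1]: parsed.append(id_hit)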
	return parsed
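
To make the three constraints easier to check in isolation, here is a minimal, self-contained sketch of the same filtering idea. It is not the group code: the XML fragment is hand-written (the rs identifiers and sequences are made up), get_text is a stand-in for the group's helper whose behavior I am assuming, and the names first_text, snp_position, and genuine_snp_ids are hypothetical. It also uses a regular expression to read pos= instead of the slicing above, and it assumes pos= is 1-based and that the alignment starts at the first base of the hit.

```python
import re
from xml.dom.minidom import parseString

# Hand-written, hypothetical BLAST XML fragment (not real output).
SAMPLE_XML = """<BlastOutput>
<Hit>
  <Hit_def>rs0000001|pos=5|len=9|taxid=9606</Hit_def>
  <Hit_accession>rs0000001</Hit_accession>
  <Hsp><Hsp_score>120</Hsp_score>
  <Hsp_qseq>ACGTACGTA</Hsp_qseq><Hsp_hseq>ACGTACGTA</Hsp_hseq></Hsp>
</Hit>
<Hit>
  <Hit_def>rs0000002|pos=5|len=9|taxid=9606</Hit_def>
  <Hit_accession>rs0000002</Hit_accession>
  <Hsp><Hsp_score>120</Hsp_score>
  <Hsp_qseq>ACGTACGTA</Hsp_qseq><Hsp_hseq>ACGTGCGTA</Hsp_hseq></Hsp>
</Hit>
<Hit>
  <Hit_def>rs0000003|pos=5|len=9|taxid=9606</Hit_def>
  <Hit_accession>rs0000003</Hit_accession>
  <Hsp><Hsp_score>50</Hsp_score>
  <Hsp_qseq>ACGTACGTA</Hsp_qseq><Hsp_hseq>ACGTACGTA</Hsp_hseq></Hsp>
</Hit>
</BlastOutput>"""


def get_text(nodes):
    """Stand-in for the group code's get_text helper (assumed behavior)."""
    return "".join(n.data for n in nodes if n.nodeType == n.TEXT_NODE)


def first_text(element, tag):
    """Text of the first <tag> element found under element."""
    return get_text(element.getElementsByTagName(tag)[0].childNodes)


def snp_position(hit_def):
    """Read the pos= field from a hit definition; None if it is absent."""
    match = re.search(r"pos=(\d+)", hit_def)
    return int(match.group(1)) if match else None


def genuine_snp_ids(xml_text, min_score=100):
    """Return accessions of hits that pass the score, length and match checks."""
    kept = []
    for hit in parseString(xml_text).getElementsByTagName("Hit"):
        position = snp_position(first_text(hit, "Hit_def"))
        query = first_text(hit, "Hsp_qseq")
        subject = first_text(hit, "Hsp_hseq")
        score = int(first_text(hit, "Hsp_score"))
        if position is None or score <= min_score:
            continue  # no position to test, or alignment score too low
        if position > min(len(query), len(subject)):
            continue  # the alignment does not reach the SNP position
        if query[position - 1] == subject[position - 1]:
            kept.append(first_text(hit, "Hit_accession"))
    return kept


print(genuine_snp_ids(SAMPLE_XML))  # ['rs0000001']
```

In this toy input the second hit is dropped because the query and hit differ at position 5, and the third is dropped because its score is below the threshold. Returning accessions rather than sequences is a choice made only for this sketch.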

I used our old friend, Apoe,