Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-3: Difference between revisions
No edit summary |
No edit summary |
||
Line 1: | Line 1: | ||
==Task Due Today== | ==Task Due Today== | ||
Our goal for today was to establish a fully-operational version of the group code. Given that Mike wrote an expansive code for mis-match analysis, I dedicated all of my time to a task I would reasonably be able to complete by this date. The task I assumed was the elimination of 'false positives'--SNPs identified by the BLAST algorithm, which weren't actually found in the query sequence. I executed this task by placing additional constraints on current code to extract SNP data. Here is the code | Our goal for today was to establish a fully-operational version of the group code. Given that Mike wrote an expansive code for mis-match analysis, I dedicated all of my time to a task I would reasonably be able to complete by this date. The task I assumed was the elimination of 'false positives'--SNPs identified by the BLAST algorithm, which weren't actually found in the query sequence. I executed this task by placing additional constraints on current code to extract SNP data. Here is the updated code: | ||
<pre> | <pre> | ||
# extracts snp data | # extracts snp data | ||
Line 80: | Line 80: | ||
No information found for rs7412 | No information found for rs7412 | ||
</pre> | </pre> | ||
The removed SNP is one that was identified by BLAST, but the SNP itself actually lay after the termination of the query sequence. So, it's good that we remove erroneous SNPs. |
Latest revision as of 22:42, 2 May 2007
Task Due Today
Our goal for today was to establish a fully-operational version of the group code. Given that Mike wrote an expansive code for mis-match analysis, I dedicated all of my time to a task I would reasonably be able to complete by this date. The task I assumed was the elimination of 'false positives'--SNPs identified by the BLAST algorithm, which weren't actually found in the query sequence. I executed this task by placing additional constraints on current code to extract SNP data. Here is the updated code:
# extracts snp data def extract_snp_data(str): dom = parseString(str) variants = dom.getElementsByTagName("Hit") if len(variants) == 0: return parsed = [] for v in variants: # now populate the struct hit_def = get_text(v.getElementsByTagName("Hit_def")[0].childNodes) id_query = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes) id_hit = get_text(v.getElementsByTagName("Hsp_qseq")[0].childNodes) score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes) id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes) # extract position of the SNP from Hit Definition lower_bound = hit_def.find("pos=")+4 upper_bound = hit_def.find("len=")-1 position = int(hit_def[lower_bound:upper_bound]) # only consider it a genuine snp if the hit score is above 100, # the query/hit sequences are longer than the position of the SNP # and the query sequence matches the hit sequence at the SNP position if int(score) > 100 and position < len(id_hit) and id_query[position] == id_hit[position]: parsed.append(id) return parsed
I used our old friend, Apoe, and ran it through the old and updated versions of the group code. Here's the output from the old code:
Sequence received; please wait. 10 single nucleotide polymorphism(s) found: rs769455 rs28931578 rs28931577 rs11083750 rs28931576 rs429358 rs769452 rs28931579 rs7412 rs12982192 ---------------------------------------- No information found for rs769455 No information found for rs28931578 No information found for rs28931577 No information found for rs11083750 No information found for rs28931576 No information found for rs429358 No information found for rs769452 No information found for rs28931579 No information found for rs7412 No information found for rs12982192
Here's the output using the code I updated:
Sequence received; please wait. 9 single nucleotide polymorphism(s) found: rs769455 rs28931578 rs28931577 rs11083750 rs28931576 rs429358 rs769452 rs28931579 rs7412 ---------------------------------------- No information found for rs769455 No information found for rs28931578 No information found for rs28931577 No information found for rs11083750 No information found for rs28931576 No information found for rs429358 No information found for rs769452 No information found for rs28931579 No information found for rs7412
The removed SNP is one that was identified by BLAST, but the SNP itself actually lay after the termination of the query sequence. So, it's good that we remove erroneous SNPs.