# Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-3

Our goal for today was to establish a fully-operational version of the group code. Given that Mike wrote an expansive code for mis-match analysis, I dedicated all of my time to a task I would reasonably be able to complete by this date. The task I assumed was the elimination of 'false positives'--SNPs identified by the BLAST algorithm, which weren't actually found in the query sequence. I executed this task by placing additional constraints on current code to extract SNP data. Here is the updated code:

```# extracts snp data
def extract_snp_data(str):
dom = parseString(str)
variants = dom.getElementsByTagName("Hit")
if len(variants) == 0:
return
parsed = []
for v in variants:
# now populate the struct
hit_def = get_text(v.getElementsByTagName("Hit_def")[0].childNodes)
id_query = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
id_hit = get_text(v.getElementsByTagName("Hsp_qseq")[0].childNodes)
score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
# extract position of the SNP from Hit Definition
lower_bound = hit_def.find("pos=")+4
upper_bound = hit_def.find("len=")-1
position = int(hit_def[lower_bound:upper_bound])
# only consider it a genuine snp if the hit score is above 100,
# the query/hit sequences are longer than the position of the SNP
# and the query sequence matches the hit sequence at the SNP position
if int(score) > 100 and position < len(id_hit) and id_query[position] == id_hit[position]:
parsed.append(id)
return parsed
```

I used our old friend, Apoe, and ran it through the old and updated versions of the group code. Here's the output from the old code:

```Sequence received; please wait.
10 single nucleotide polymorphism(s) found:
rs769455
rs28931578
rs28931577
rs11083750
rs28931576
rs429358
rs769452
rs28931579
rs7412
rs12982192
----------------------------------------
No information found for rs769455
No information found for rs28931578
No information found for rs28931577
No information found for rs11083750
No information found for rs28931576
No information found for rs429358
No information found for rs769452
No information found for rs28931579
No information found for rs7412
No information found for rs12982192
```

Here's the output using the code I updated:

```Sequence received; please wait.
9 single nucleotide polymorphism(s) found:
rs769455
rs28931578
rs28931577
rs11083750
rs28931576
rs429358
rs769452
rs28931579
rs7412
----------------------------------------
No information found for rs769455
No information found for rs28931578
No information found for rs28931577
No information found for rs11083750
No information found for rs28931576
No information found for rs429358
No information found for rs769452
No information found for rs28931579
No information found for rs7412
```

The removed SNP is one that was identified by BLAST, but the SNP itself actually lay after the termination of the query sequence. So, it's good that we remove erroneous SNPs.