Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-1: Difference between revisions
From OpenWetWare
Jump to navigationJump to search
m (New page: ==Tasks Due Today== ===Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)=== * This code should replace...) |
mNo edit summary |
||
Line 60: | Line 60: | ||
#print '-' * 40 | #print '-' * 40 | ||
#print "\n" | #print "\n" | ||
</pre> | |||
===Complete code which removes falsely-identified SNPs=== | |||
<pre> | |||
# Mike and I are writing this code in tandem. The goal of this code | |||
# is to remove falsely identified SNPs (we will integrate our two halves | |||
# in class tomorrow). | |||
# Parts go as follows: | |||
# 1. take input (dict file with rsID and keyed seq) | |||
# 2. align the SNP with reference sequence, | |||
# 3. find ambiguous base-pair position. | |||
# 4. check to see if the sequence BP is in the space of the SNP BP. | |||
# 5. remove erroneous SNP entries | |||
# 6. return output (updated dict) | |||
# I am doing parts 4-6. | |||
# given pos from part 3 | |||
for i in dict_file.iterkeys(): | |||
a=seq[pos] | |||
snpseq=dict_file.get(i) | |||
j = 0 | |||
k = 0 | |||
while j = 0: | |||
if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters: | |||
b=snpseq[j] | |||
j = 1 | |||
else: k=k+1 | |||
b_values = IUPACData.ambiguous_dna_values[b] | |||
if a not in b_values: | |||
del dict_file[i] | |||
return dict_file | |||
</pre> | </pre> |
Revision as of 23:11, 30 April 2007
Tasks Due Today
Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)
- This code should replace Xiaodi's section, which bears the same name.
# extracts snp data def extract_snp_data(str): dom = parseString(str) variants = dom.getElementsByTagName("Hit") if len(variants) == 0: return parsed = {} for v in variants: # now populate the struct id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes) score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes) parsed_seq = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes) # only consider it a genuine snp if the hit score is above 100 if int(score) > 100: parsed[id] = parsed_seq return parsed
- Since this code changes the returned output from a list to a dict, we need to alter the rest of Xiaodi's code to handle this new input. I have altered the code to tolerate SNP ID's in a dict class. Here it is:
# BEGIN ACTUAL PROGRAM # read sequence from file sequence = get_sequence_from_fasta("example.fasta") print "Sequence received; please wait." # look up data b = blast_snp_lookup(sequence) # extract data e = extract_snp_data(b) outputlist.append(str(len(e)) + " single nucleotide polymorphism(s) found:\n") if len(e) > 0: for i in e.iterkeys(): outputlist.append(i + "\n") else: outputlist.append("[none]\n") sys.exit() # nothing more to be done if no snp found #print '-' * 40 #print "\n" print "got a snp... still thinking..." # do an omim search on each snp for i in e.iterkeys(): o = omim_snp_search(i) if len(o) == 0: outputlist.append("No information found for " + i + "\n") continue # nothing more to be done if no records can be found # otherwise, find the allelic variant data outputlist.append(i + " details:" + "\n") for j in o: v = extract_allelic_variant_data(j.read()) if v != None: for a in v: outputlist.append(a.name + "\n") outputlist.append(a.mutation + "\n") outputlist.append(a.description + "\n") #print '-' * 40 #print "\n"
Complete code which removes falsely-identified SNPs
# Mike and I are writing this code in tandem. The goal of this code # is to remove falsely identified SNPs (we will integrate our two halves # in class tomorrow). # Parts go as follows: # 1. take input (dict file with rsID and keyed seq) # 2. align the SNP with reference sequence, # 3. find ambiguous base-pair position. # 4. check to see if the sequence BP is in the space of the SNP BP. # 5. remove erroneous SNP entries # 6. return output (updated dict) # I am doing parts 4-6. # given pos from part 3 for i in dict_file.iterkeys(): a=seq[pos] snpseq=dict_file.get(i) j = 0 k = 0 while j = 0: if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters: b=snpseq[j] j = 1 else: k=k+1 b_values = IUPACData.ambiguous_dna_values[b] if a not in b_values: del dict_file[i] return dict_file