Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-1: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
m (New page: ==Tasks Due Today== ===Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)=== * This code should replace...)
 
mNo edit summary
Line 60: Line 60:
#print '-' * 40
#print '-' * 40
#print "\n"
#print "\n"
</pre>
===Complete code which removes falsely-identified SNPs===
<pre>
# Mike and I are writing this code in tandem.  The goal of this code
# is to remove falsely identified SNPs (we will integrate our two halves
# in class tomorrow).
# Parts go as follows:
# 1. take input (dict file with rsID and keyed seq)
# 2. align the SNP with reference sequence,
# 3. find ambiguous base-pair position.
# 4. check to see if the sequence BP is in the space of the SNP BP.
# 5. remove erroneous SNP entries
# 6. return output (updated dict)
# I am doing parts 4-6.
# given pos from part 3
for i in dict_file.iterkeys():
    a=seq[pos]
    snpseq=dict_file.get(i)
    j = 0
    k = 0
    while j = 0:
        if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters:
            b=snpseq[j]
            j = 1
        else: k=k+1
    b_values = IUPACData.ambiguous_dna_values[b]
    if a not in b_values:
        del dict_file[i]
return dict_file
</pre>
</pre>

Revision as of 23:11, 30 April 2007

Tasks Due Today

Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)

  • This code should replace Xiaodi's section, which bears the same name.
# extracts snp data
def extract_snp_data(str):
	dom = parseString(str)
	variants = dom.getElementsByTagName("Hit")
	if len(variants) == 0:
		return
	parsed = {}
	for v in variants:
		# now populate the struct
		id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
		score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
		parsed_seq = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
		# only consider it a genuine snp if the hit score is above 100
		if int(score) > 100:
			parsed[id] = parsed_seq
	return parsed
  • Since this code changes the returned output from a list to a dict, we need to alter the rest of Xiaodi's code to handle this new input. I have altered the code to tolerate SNP ID's in a dict class. Here it is:
# BEGIN ACTUAL PROGRAM
# read sequence from file
sequence = get_sequence_from_fasta("example.fasta")
print "Sequence received; please wait."
 
# look up data
b = blast_snp_lookup(sequence)
# extract data
e = extract_snp_data(b)
outputlist.append(str(len(e)) + " single nucleotide polymorphism(s) found:\n")
if len(e) > 0:
	for i in e.iterkeys():
		outputlist.append(i + "\n")
else:
	outputlist.append("[none]\n")
	sys.exit() # nothing more to be done if no snp found
#print '-' * 40
#print "\n"
 
print "got a snp... still thinking..."

# do an omim search on each snp
for i in e.iterkeys():
	o = omim_snp_search(i)
	if len(o) == 0:
		outputlist.append("No information found for " + i + "\n")
		continue # nothing more to be done if no records can be found
	# otherwise, find the allelic variant data
	outputlist.append(i + " details:" + "\n")
	for j in o:
		v = extract_allelic_variant_data(j.read())
		if v != None:
			for a in v:
				outputlist.append(a.name + "\n")
				outputlist.append(a.mutation + "\n")
				outputlist.append(a.description + "\n")
	#print '-' * 40
	#print "\n"

Complete code which removes falsely-identified SNPs

# Mike and I are writing this code in tandem.  The goal of this code
# is to remove falsely identified SNPs (we will integrate our two halves
# in class tomorrow).

# Parts go as follows:
# 1. take input (dict file with rsID and keyed seq)
# 2. align the SNP with reference sequence,
# 3. find ambiguous base-pair position.
# 4. check to see if the sequence BP is in the space of the SNP BP.
# 5. remove erroneous SNP entries
# 6. return output (updated dict)

# I am doing parts 4-6.

# given pos from part 3

for i in dict_file.iterkeys():
    a=seq[pos]
    snpseq=dict_file.get(i)
    j = 0
    k = 0
    while j = 0:
        if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters:
            b=snpseq[j]
            j = 1
        else: k=k+1
    b_values = IUPACData.ambiguous_dna_values[b]
    if a not in b_values:
        del dict_file[i]

return dict_file