Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-4-17

From OpenWetWare

Jump to: navigation, search

Tasks due today

Quality Checking Script

Here is a script for detecting mismatches between our sequence and returned SNPs. As I sat down to write this script, I realized that we are missing something important: a parser which can extract sequence from dbSNP. This is too much for me to tackle on my own; we will construct this in class tomorrow. This code will change when we interface it with the rest of our script. We can make this more object oriented at such a point.

#!/usr/bin/env python

# The purpose of this script is to detect mis-matches between our test sequence
# and the returned SNPs from dbSNP.  We will then inform the user if any SNPs
# had additional mutations that should question the validity of the SNP return.
# Once this initial construct is up and running, we will work to build in more
# detailed mis-match analysis so that, as linkage disequalibrium studies yield
# more information about when a SNP is significant, we can pass the information
# along to our users.

# Part 1: Condense test sequence and RS sequence to a single FASTA file

from Bio import GenBank, Seq
from Bio import formats

        # Assuming a list of RS id's and a test seqeunce as a FASTA.

snp_len = []
snp_parser = NEED SNP PARSER() # I have spoken to Xiaodi about how we can
                               # construct a parser to handle this task
snp_entry = GenBank.NCBIDictionary('nucelotide', 'genbank', parser = snp_parser)
for i in rsid:
    parsed_snp = snp_entry(rsid[i])
    formatter = FormatIO(parsed_snp, formats["genbank"], formats["fasta"])
    formatter.convert(parsed_snp, formatted_snp)
    updateseq = open('example.fasta', 'a')

# Part 2: Align the FASTA file using Clustal

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta'))
alignment = Clustalw.do_alignment(cline)
summary_align = AlignInfo.SummaryInfo(alignment)
con = summary_align.dumb_consensus()

# Part 3: Figure out where each SNP starts and stops relative to the test
#         sequence.  This allows us to establish a start and default stop for
#         comparing individual SNPs with the reference.

size_col = Set()
size_col = summary_align.get_column(1)
snp_count = len(size_col)-1
colset = Set()
for i in range(snp_count):
    start = 0
    while start == 0:
        for j in range(len(con)):
            if col[i+1] =! '-':
                start = j

# Part 4: Scan for and report mismatches in each SNP. Build in a function to tolerate
#         insertions/deletions.

    print "Scanning %n for mismatches with reference sequence", rsid[i]
    for j in range(start, start+snp_len[i]):
        if col[0] == '-': print "Deletion at basepair ", j
        elif col[i+1] == '-': print "Insertion at basepair ", j
        elif col[0] =! col[i+1]: print "Mismatch: ", col[i+1], j, col[0] 
        else: "No mismatches found in this SNP"

print "Done screening SNPs for mismatches.  Any mismatches may call into question"
print "the validity of a returned SNP.  Further studies in linkage disequilibrium"
print "will offer more information."

Potential Mutations to Consider

  • Insertions
  • Deletions
  • Truncations
  • Additional Point Mutations
    • Immediately Next to SNP
    • In large blocks but not close to SNP
  • Any Combination of these mutations
  • SNPs in the presence of other SNPs
Personal tools