Harvard:Biophysics 101/Notebook:ZS/2007-4-26: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(New page: ==Tackling the problem of finding information about ref seq== Xiaodi and Katie have been working on this, but had trouble parsing a blast result becuase (as I painfully found out) the XML ...)
 
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Tackling the problem of finding information about ref seq==
==Tackling the problem of finding information about ref seq==
Xiaodi and Katie have been working on this, but had trouble parsing a blast result becuase (as I painfully found out) the XML returns nothing useful, and before one could not get back the HTML which blast returns. I found the snippit to get the HTML version, which I then raw-parsed for the relevant information.
Xiaodi and Katie have been working on this, but had trouble parsing a blast result becuase (as I painfully found out) the XML returns nothing useful, and before one could not get back the HTML which blast returns. I found the snippit to get the HTML version, which I then raw-parsed for the relevant information. Then, this preliminarily determines surrounding genes of interest and if it hits any known CDS's. The code needs cleaned, but its a proof of principle :D. The data can then be fed to other entrez programs.


==Code==
==Code==
<pre>


<pre>
#Zachary Sun, Biophysics 101
#Zachary Sun, Biophysics 101
#4.26.07 code
#4.26.07 code
#Tackling the problem of finding surrounding CDS's and information about a reference sequence from blast,
#Tackling the problem of finding surrounding CDS's and information about a reference sequence from blast,
#this program parses the HTML output of blastn as of 4.26.07 for information which can then be fed
#this program parses the HTML output of blastn as of 4.26.07 for information which can then be fed
#into other NCBI programs to be programmed (to ultimately determine if a chunk of DNA is in a CDS).
#into other NCBI programs to be programmed,
#Saves a local file; to stop queuing the web, just comment the code with pound symbols at end.
#Saves a local file; to stop queuing the web, just comment the code with pound symbols at end.


Line 18: Line 18:
from xml.dom import minidom
from xml.dom import minidom


 
#class seq on first line, other CDS seq on second
seq = "CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACGCATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTACTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTCCGCGGACGCTGCCTTCGTCCRGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGTACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCGCAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCCGGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC"
#seq = "CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACGCATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTACTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTCCGCGGACGCTGCCTTCGTCCRGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGTACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCGCAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCCGGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC"
seq = "ATGCTGCGCCTATACCCAGGACCGATGGTAACTGAGGCGGAGGGGAAAGGAGGGCCTGAGATGGCAAGTCTGTCCTCCTCGGTGGTTCCTGTGTCCTTCATTTCCACTCTGCGAGAGTCTGTGCTGGACCCTGGAGTTGGTGGAGAAGGAGCCAGTGACAAGCAGAGGAGCAAACTGTCTTTATCACACTCCATGATCCCAGCTGCTAAAATCCACACTGAGCTCTGCTTACCAGCCTTCTTCTCTCCTGCTGGAACCCAGAGGAGGTTCCAGCAGCCTC"
# the above is a user-defined variable; put in your sequence to be Blast against!
# the above is a user-defined variable; put in your sequence to be Blast against!


Line 60: Line 61:
surroundEnd = []
surroundEnd = []
title = []
title = []
CDS = "false"


while(line):
while(line):
     findLine = line.find("spanning the HSP") #to find the sequence ID
     findLine = line.find("spanning the HSP") #to find the sequence I
     if findLine != -1:
     if findLine != -1:
         name = line[findLine+23:-1]
         name = line[findLine+23:-1]
 
    findLine = line.find("PREDICTED")
    if findLine != -1: #if part of a CDS
        end = line.find("Length")
        CDS = "true"
        break
     findLine = line.find("Features flanking this part of subject sequence:")
     findLine = line.find("Features flanking this part of subject sequence:")
     if findLine != -1: #found features flanking
     if findLine != -1: #found features flanking
        CDS = "false"
         line = load_file.readline()
         line = load_file.readline()
         while line.find("&from=") != -1: #while there is still a feature
         while line.find("&from=") != -1: #while there is still a feature
Line 102: Line 110:




record = RecInfo(surroundStart, surroundEnd, refStart, refEnd, hits, name, title)  
if CDS == "false":
record.printInfo()
    record = RecInfo(surroundStart, surroundEnd, refStart, refEnd, hits, name, title)  
    record.printInfo()
else:
    print "The snippit is a CDS:"
    print "Name: ", name
 
       
   
</pre>
 
==output==
input:
<pre>
CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACGCATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTACTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTCCGCGGACGCTGCCTTCGTCCRGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGTACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCGCAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCCGGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC
</pre>
 
output:
<pre>
INFORMATION:
-----------
Gene name:  Homo sapiens chromosome 10 genomic contig, reference assembly
Location:  42968870 -> 42969270
Hit qualification:  Identities = 400/401 (99%), Gaps = 0/401 (0%)
 
Surrounding gene information:
----------------------------
3895 bp at 5' side: hypothetical protein :  42962770 -> 42964975
425 bp at 3' side: HtrA serine peptidase 1 :  42969695 -> 43022401
</pre>
 


##TO DO: Create code for determine CDS, now that we know bounds
input is a CDS:
<pre>


ATGCTGCGCCTATACCCAGGACCGATGGTAACTGAGGCGGAGGGGAAAGGAGGGCCTGAGATGGCAAGTCTGTCCTCCTCGGTGGTTCCTGTGTCCTTCATTTCCACTCTGCGAGAGTCTGTGCTGGACCCTGGAGTTGGTGGAGAAGGAGCCAGTGACAAGCAGAGGAGCAAACTGTCTTTATCACACTCCATGATCCCAGCTGCTAAAATCCACACTGAGCTCTGCTTACCAGCCTTCTTCTCTCCTGCTGGAACCCAGAGGAGGTTCCAGCAGCCTC
</pre>
</pre>


seq = "CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACGCATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTACTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTCCGCGGACGCTGCCTTCGTCCRGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGTACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCGCAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCCGGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC"
output (needs cleaned/expanded):
<pre>
The snippit is a CDS:
Name:  PREDICTED: Homo sapiens hypothetical LOC387715 (LOC3877.. S=555 E=8.9e-156'" src="images/red.gif" width="500"></a></td>
</pre>

Latest revision as of 22:54, 25 April 2007

Tackling the problem of finding information about ref seq

Xiaodi and Katie have been working on this, but had trouble parsing a blast result becuase (as I painfully found out) the XML returns nothing useful, and before one could not get back the HTML which blast returns. I found the snippit to get the HTML version, which I then raw-parsed for the relevant information. Then, this preliminarily determines surrounding genes of interest and if it hits any known CDS's. The code needs cleaned, but its a proof of principle :D. The data can then be fed to other entrez programs.

Code


#Zachary Sun, Biophysics 101
#4.26.07 code
#Tackling the problem of finding surrounding CDS's and information about a reference sequence from blast,
#this program parses the HTML output of blastn as of 4.26.07 for information which can then be fed
#into other NCBI programs to be programmed,
#Saves a local file; to stop queuing the web, just comment the code with pound symbols at end.

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import xml.sax.handler
from xml.dom import minidom

#class seq on first line, other CDS seq on second
#seq = "CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACGCATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTACTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTCCGCGGACGCTGCCTTCGTCCRGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGTACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCGCAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCCGGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC"
seq = "ATGCTGCGCCTATACCCAGGACCGATGGTAACTGAGGCGGAGGGGAAAGGAGGGCCTGAGATGGCAAGTCTGTCCTCCTCGGTGGTTCCTGTGTCCTTCATTTCCACTCTGCGAGAGTCTGTGCTGGACCCTGGAGTTGGTGGAGAAGGAGCCAGTGACAAGCAGAGGAGCAAACTGTCTTTATCACACTCCATGATCCCAGCTGCTAAAATCCACACTGAGCTCTGCTTACCAGCCTTCTTCTCTCCTGCTGGAACCCAGAGGAGGTTCCAGCAGCCTC"
# the above is a user-defined variable; put in your sequence to be Blast against!

class RecInfo():
    def __init__(self, a, b, c, d, e, f, g):
        self.surroundStart = a
        self.surroundEnd = b
        self.refStart = c
        self.refEnd = d
        self.hits = e
        self.name = f
        self.title = g
        self.inCDS = "?"

    def printInfo(self):
        print "INFORMATION:"
        print "-----------"
        print "Gene name: ", self.name
        print "Location: " , self.refStart, "->", self.refEnd
        print "Hit qualification: ", self.hits
        print "Surrounding gene information:"
        print "----------------------------"
        for (sStart, sEnd, n) in zip(self.surroundStart, self.surroundEnd, self.title):
            print n, ": " , sStart, "->", sEnd
            

        

result_handle = NCBIWWW.qblast("blastn","Test/gpipe/9606/allcontig_and_rna", seq, format_type="HTML") #the db is from the blast homepage (other db's dont give surrounding info) #
blast_results = result_handle.read()#

save_file = open("my_blast2.html","w")#
save_file.write(blast_results)#
save_file.close()#

load_file = open("my_blast2.html","r")
line = load_file.readline()

surroundStart = []
surroundEnd = []
title = []

CDS = "false"

while(line):
    findLine = line.find("spanning the HSP") #to find the sequence I
    if findLine != -1:
        name = line[findLine+23:-1]
    findLine = line.find("PREDICTED")
    if findLine != -1: #if part of a CDS
        end = line.find("Length")
        CDS = "true"
        break
    findLine = line.find("Features flanking this part of subject sequence:")
    if findLine != -1: #found features flanking
        CDS = "false"
        line = load_file.readline()
        while line.find("&from=") != -1: #while there is still a feature
            w = line.find("&from=") #index of start site
            x = line.find("&to=") #index of mid
            y = line.find("&view=gbwithparts") #index of end
            surroundStart.append(int(line[w+6:x]))
            surroundEnd.append(int(line[x+4:y]))
            w = line.find("gbwithparts") #index of gene title
            x = line.find("</a>") #index of end of title
            title.append(line[w+13:x])
            line = load_file.readline()

        line = load_file.readline() #manual parse
        line = load_file.readline() #manual parse
        hits = line

        seqFirstFound = "false" #if first part found

        while line.find("input type") == -1:#while not done with sequence
            if line.find("Sbjct") != -1: #if begin
                if(seqFirstFound == "false"): #if first sequence
                    temp = line.split(" ") #split by space
                    refStart = temp[2]
                    seqFirstFound = "true"
                else: #find end sequence
                    temp = line[:-1]#delete /n
                    temp = temp.split(" ")
                    refEnd = temp[6]
            line = load_file.readline()

        break
    line = load_file.readline()


if CDS == "false":
    record = RecInfo(surroundStart, surroundEnd, refStart, refEnd, hits, name, title) 
    record.printInfo()
else:
    print "The snippit is a CDS:"
    print "Name: ", name

        
    

output

input:

CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACGCATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTACTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTCCGCGGACGCTGCCTTCGTCCRGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGTACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCGCAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCCGGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC

output:

INFORMATION:
-----------
Gene name:  Homo sapiens chromosome 10 genomic contig, reference assembly
Location:  42968870 -> 42969270
Hit qualification:   Identities = 400/401 (99%), Gaps = 0/401 (0%)

Surrounding gene information:
----------------------------
3895 bp at 5' side: hypothetical protein :  42962770 -> 42964975
425 bp at 3' side: HtrA serine peptidase 1 :  42969695 -> 43022401


input is a CDS:


ATGCTGCGCCTATACCCAGGACCGATGGTAACTGAGGCGGAGGGGAAAGGAGGGCCTGAGATGGCAAGTCTGTCCTCCTCGGTGGTTCCTGTGTCCTTCATTTCCACTCTGCGAGAGTCTGTGCTGGACCCTGGAGTTGGTGGAGAAGGAGCCAGTGACAAGCAGAGGAGCAAACTGTCTTTATCACACTCCATGATCCCAGCTGCTAAAATCCACACTGAGCTCTGCTTACCAGCCTTCTTCTCTCCTGCTGGAACCCAGAGGAGGTTCCAGCAGCCTC

output (needs cleaned/expanded):

The snippit is a CDS:
Name:  PREDICTED: Homo sapiens hypothetical LOC387715 (LOC3877.. S=555 E=8.9e-156'" src="images/red.gif" width="500"></a></td>