Harvard:Biophysics 101/2007/Notebook:CChi/2007-5-3

Progress

So that we're all on the same page, I'll post what I have whenever I make changes. Right now my script takes a PubMed search term (I used an rs number), prints the top 5 article hits, and builds two parsed lists: one of all MeSH terms and one of only the major-topic MeSH terms. Example output is included on this page so you can see what the lists look like. Let me know by email if there's something you want changed. Resmi and I will work on making the PubMed queries more efficient later, but for now the script spits out the MeSH term lists that we want.

Code

I/O formatting

Input: a PubMed search term (I have an rs number pasted in)
Output: 2 lists -- one with all MeSH terms, one with only the major-topic terms
- terms are parsed to remove the symbols (the * that marks a major topic and the / in front of a qualifier)
- in terms with qualifiers, I just replaced the slashes with spaces, since the terms are only relevant as a group anyway
- Also prints a list of the top 5 PubMed article hits (non-OMIM)
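To make the parsing rules above concrete, here is a minimal sketch in Python 3 (the scripts on this page are Python 2). The function name `clean_heading` is made up for illustration, and the raw headings are illustrative guesses at what sits behind a few of the cleaned terms in the example output. The only assumption about the data is that a `*` marks a major topic and a `/` separates a heading from its qualifiers.

```python
def clean_heading(heading):
    """Return (cleaned_term, is_major) for one raw MeSH heading string."""
    # A '*' anywhere in the heading marks it as a major topic.
    is_major = "*" in heading
    # Drop the marker and turn qualifier slashes into spaces.
    cleaned = heading.replace("*", "").replace("/", " ")
    return cleaned, is_major

# Illustrative raw headings (assumed, not copied from a real record).
raw_headings = [
    "Aged",
    "Macular Degeneration/*genetics",
    "*Polymorphism, Single Nucleotide",
]

all_terms = []
major_terms = []
for raw in raw_headings:
    term, is_major = clean_heading(raw)
    all_terms.append(term)          # every term goes in the full list
    if is_major:
        major_terms.append(term)    # major topics also go in the second list

print(all_terms)
print(major_terms)
```

Every term lands in the first list, and only the starred ones land in the second, which matches the two-list output described above.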

Script modified for efficiency in Pubmedding

Shawn pointed out that Resmi's code and mine were inefficient because we called PubMed once for each PMID, and that we should instead make a single call for all the PMIDs. After quite a painful ordeal for something like 5 lines of code, I can say with relief that on my little ThinkPad it's about 5 times faster (which seems counter-intuitive, since I'm only getting 3 PMIDs here...).
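For anyone who wants to see the one-call idea without the Bio.EUtils wrapper, here is a rough Python 3 sketch that builds a single NCBI E-utilities `efetch` URL for a whole batch of PMIDs. It is not what my script does internally, just the same idea spelled out. The helper name `build_batch_efetch_url` and the PMIDs are placeholders, and the sketch only builds the URL; it does not contact PubMed.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_batch_efetch_url(pmids, max_records=5):
    """Build ONE efetch URL covering up to max_records PMIDs."""
    batch = [str(p) for p in pmids[:max_records]]
    params = {
        "db": "pubmed",
        "id": ",".join(batch),   # comma-separated ids = one request
        "rettype": "medline",
        "retmode": "text",
    }
    # safe="," keeps the commas readable instead of percent-encoding them
    return EFETCH_URL + "?" + urlencode(params, safe=",")

# Placeholder PMIDs, not real search results.
placeholder_pmids = ["11111111", "22222222", "33333333"]
url = build_batch_efetch_url(placeholder_pmids)
print(url)
```

The per-PMID version would build and fetch one such URL per article; batching replaces those round trips with one.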

So, same input/output as before, but as it's 5am, I don't trust myself to put it into the compiled version. I'll post my version and then the integration-ready version for easy editing in class.

The code

<syntax type='python'>
from Bio import PubMed
from Bio import Medline
from Bio import EUtils
from Bio.EUtils import DBIdsClient
import string

# NOTE: parse_mesh() is defined in the compiled-version script further down this page

# DBIdsClient to search list of pmids for snp_id (default db is pubmed)
client = DBIdsClient.DBIdsClient()
pmids = client.search("rs11200638")
if len(pmids) == 0:
   print "SNP not in PubMed."
else:

   # Fetch medline record file from PubMed of all corresponding pmids
   records = pmids[0:5].efetch(retmode="text",rettype="medline")
   # Medline Record Parser to parse medline records into record objects
   rec_parser = Medline.RecordParser()
   medline = Medline.Iterator(records, parser = rec_parser)

   # different lists of mesh terms
   all_mesh = []
   all_mesh_terms = []
   major_mesh_terms = []
   
   # loop through the record objects
   for cur_record in medline:
       print '\n', cur_record.title, cur_record.authors, cur_record.source
       mesh_headings = cur_record.mesh_headings
       if len(mesh_headings):
           all_mesh = parse_mesh(mesh_headings)
           all_mesh_terms.extend(all_mesh[0])
           major_mesh_terms.extend(all_mesh[1])
   
   print '\n', all_mesh_terms, '\n', major_mesh_terms

</syntax>

The integrable code

<syntax type='python'>

       # The mesh terms stuff - THIS NEEDS TO BE ADDED TO OUTPUTLIST at some point
       # DBIdsClient to search list of pmids for snp_id (default db is pubmed)
       client = DBIdsClient.DBIdsClient()
       pmids = client.search(snp_id)
       if len(pmids) == 0:
           outputlist.append("No articles found for " + snp_id + "\n")
       else:
           # Fetch medline record file from PubMed of all corresponding pmids
           records = pmids[0:5].efetch(retmode="text",rettype="medline")
           # Medline Record Parser to parse medline records into record objects
           rec_parser = Medline.RecordParser()
           medline = Medline.Iterator(records, parser = rec_parser)	

           all_mesh = []
           all_mesh_terms = []
           major_mesh_terms = []
           # loop through the record objects
           for cur_record in medline:
               print '\n', cur_record.title, cur_record.authors, cur_record.source
               mesh_headings = cur_record.mesh_headings
               if len(mesh_headings):
                   all_mesh = parse_mesh(mesh_headings)
                   all_mesh_terms.extend(all_mesh[0])
                   major_mesh_terms.extend(all_mesh[1])

           print "ALL MESH TERMS", '\n', all_mesh_terms, '\n', major_mesh_terms

</syntax>
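The comment in the integrable code says the MeSH terms still need to be added to outputlist. One possible way to do that is sketched below in Python 3 (the class code is Python 2). The helper name `append_mesh_output`, the line labels, and the "; " separator are my own assumptions; the only thing taken from the code above is that outputlist is a list of strings that each end in a newline.

```python
def append_mesh_output(outputlist, all_mesh_terms, major_mesh_terms):
    """Append the two MeSH term lists to outputlist as two text lines."""
    # "; " is used as the separator because MeSH terms can contain commas.
    outputlist.append("All MeSH terms: " + "; ".join(all_mesh_terms) + "\n")
    outputlist.append("Major MeSH terms: " + "; ".join(major_mesh_terms) + "\n")
    return outputlist

# Small illustrative run with terms taken from the example output.
outputlist = []
append_mesh_output(
    outputlist,
    ["Aged", "Macular Degeneration genetics"],
    ["Macular Degeneration genetics"],
)
print("".join(outputlist))
```

This would replace the final print of the two lists once the block is dropped into the compiled version.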

Script seen in compiled version

<syntax type='python'>
from Bio import PubMed
from Bio import Medline
import string

# parses a mesh term to remove * and /
# is_major: True if the term carries the major-topic marker (*)
def parse_term(term, is_major):
    parsed_term = term
    if is_major:
        parsed_term = parsed_term.replace('*', '')
    # qualifiers are separated by slashes; replace them with spaces
    parsed_term = parsed_term.replace('/', ' ')
    return parsed_term

# parses list of mesh terms
# returns a nested list: [all terms, major terms only]
def parse_mesh(mesh_headings):
    all_mesh_terms = []
    major_mesh_terms = []
    for heading in mesh_headings:
        # a * anywhere in the heading marks it as a major topic
        is_major = heading.find('*') != -1
        mesh_term = parse_term(heading, is_major)
        all_mesh_terms.append(mesh_term)
        if is_major:
            major_mesh_terms.append(mesh_term)
    return [all_mesh_terms, major_mesh_terms]


article_ids = PubMed.search_for("rs11200638")
rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

all_mesh_terms = []
major_mesh_terms = []
# one PubMed lookup per article id (top 5 hits only)
for did in article_ids[0:5]:
    cur_record = medline_dict[did]
    print '\n', cur_record.title, cur_record.authors, cur_record.source
    mesh_headings = cur_record.mesh_headings
    if len(mesh_headings) != 0:
        all_mesh = parse_mesh(mesh_headings)
        all_mesh_terms.extend(all_mesh[0])
        major_mesh_terms.extend(all_mesh[1])

print '\n', all_mesh_terms, '\n', major_mesh_terms
</syntax>

Output

HTRA1 promoter polymorphism predisposes Japanese to age-related macular
degeneration. ['Yoshida T', 'DeWan A', 'Zhang H', 'Sakamoto R', 'Okamoto H', 'Minami M', 'Obazawa M', 'Mizota A', 'Tanaka M', 'Saito Y', 'Takagi I', 'Hoh J', 'Iwata T'] Mol Vis. 2007 Apr 4;13:545-8.

HTRA1 Variant Confers Similar Risks to Geographic Atrophy and Neovascular
Age-related Macular Degeneration. ['Cameron DJ', 'Yang Z', 'Gibbs D', 'Chen H', 'Kaminoh Y', 'Jorgensen A', 'Zeng J', 'Luo L', 'Brinton E', 'Brinton G', 'Brand JM', 'Bernstein PS', 'Zabriskie NA', 'Tang S', 'Constantine R', 'Tong Z', 'Zhang K'] Cell Cycle. 2007 May 16;6(9).

A variant of the HTRA1 gene increases susceptibility to age-related
macular degeneration. ['Yang Z', 'Camp NJ', 'Sun H', 'Tong Z', 'Gibbs D', 'Cameron DJ', 'Chen H', 'Zhao Y', 'Pearson E', 'Li X', 'Chien J', 'Dewan A', 'Harmon J', 'Bernstein PS', 'Shridhar V', 'Zabriskie NA', 'Hoh J', 'Howes K', 'Zhang K'] Science. 2006 Nov 10;314(5801):992-3. Epub 2006 Oct 19.

['Aged', 'Aging', 'Alleles', 'Case-Control Studies', 'Chromosomes, Human, Pair 10 genetics', 'Cohort Studies', 'European Continental Ancestry Group genetics', 'Female', 'Genetic Predisposition to Disease', 'Genotype', 'Homozygote', 'Humans', 'Lymphocytes enzymology', 'Macular Degeneration genetics', 'Male', 'Middle Aged', 'Pigment Epithelium of Eye enzymology', 'Polymorphism, Single Nucleotide', 'Promoter Regions (Genetics)', 'RNA, Messenger genetics metabolism', 'Retinal Drusen metabolism', 'Reverse Transcriptase Polymerase Chain Reaction', 'Serine Endopeptidases analysis genetics metabolism'] 
['Genetic Predisposition to Disease', 'Macular Degeneration genetics', 'Polymorphism, Single Nucleotide', 'Promoter Regions (Genetics)', 'Serine Endopeptidases analysis genetics metabolism']
