Harvard:Biophysics 101/2007/Notebook:CChi/2007-5-3
Progress
So we're all on the same page, I'll just post what I have whenever I make changes. Right now my script takes a PubMed search term (I used an rs#), prints the top 5 article hits, and produces two parsed lists: one of all MeSH terms and one of only the major-topic MeSH terms. There's example output so you can see what the lists look like. Let me know by email if there's something you want changed. Resmi and I will work on making the PubMed queries more efficient later, but for now it spits out the MeSH term lists that we want.
Code
I/O formatting
- Input: PubMed search term; I have an rs# pasted in
- Output: 2 lists, one with all MeSH terms and one with only the major terms
- Terms are parsed to remove the symbols (* and /)
- In terms with qualifiers, I just removed the slashes and put in spaces, since the terms would only be relevant as a group anyway
- Also prints a list of the top 5 PubMed article hits (non-OMIM)
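To make the parsing rules above concrete, here is a minimal standalone sketch. The raw headings are assumptions reconstructed from the example output (they are not copied from a real MEDLINE record), and `clean_heading` is a hypothetical helper name, not part of the actual script.

```python
def clean_heading(heading):
    """Return (cleaned_text, is_major) for one raw MeSH heading string."""
    # An asterisk anywhere in the heading marks it as a major topic.
    is_major = "*" in heading
    # Drop the asterisk and turn qualifier slashes into spaces.
    cleaned = heading.replace("*", "").replace("/", " ")
    return cleaned, is_major


# Assumed raw headings in MEDLINE style (illustrative only).
raw_headings = [
    "Aged",
    "Macular Degeneration/*genetics",
    "*Polymorphism, Single Nucleotide",
    "RNA, Messenger/genetics/metabolism",
]

all_terms = []
major_terms = []
for heading in raw_headings:
    cleaned, is_major = clean_heading(heading)
    all_terms.append(cleaned)        # every term goes in the full list
    if is_major:
        major_terms.append(cleaned)  # starred terms also go in the major list

print(all_terms)
print(major_terms)
```

Major terms appear in both lists, which matches the example output, where every entry of the second list also shows up in the first.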
Script modified for efficiency in Pubmedding
Shawn pointed out that Resmi's code and mine were inefficient because we called PubMed once for each PMID, and that we should instead make one call for all the PMIDs. After quite a painful ordeal for something like 5 lines of code, I can say with relief that on my little ThinkPad it's about 5 times faster (which seems counter-intuitive, since I'm only getting 3 PMIDs here...).
So, same input/output as before, but as it's 5am, I don't trust myself to put it into the compiled version. I'll post my version and then the integration-ready version for easy editing in class.
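A toy sketch of why one call for all the PMIDs can help even with only a few of them: each request probably carries a fixed cost (connection setup, round trip), so the number of requests matters more than the number of records. `CountingServer` is a hypothetical stand-in that only counts requests; it does not talk to PubMed, and the PMIDs and records are placeholders.

```python
class CountingServer(object):
    """Stand-in for a remote service that counts the requests it receives."""

    def __init__(self, records):
        self.records = records
        self.requests = 0

    def fetch(self, pmids):
        # One call is one request, however many IDs it carries.
        self.requests += 1
        return [self.records[p] for p in pmids]


# Placeholder IDs and records, not real PubMed data.
records = {"111": "record A", "222": "record B", "333": "record C"}
pmids = ["111", "222", "333"]

# Old approach: one request per PMID.
per_id_server = CountingServer(records)
one_by_one = []
for pmid in pmids:
    one_by_one.extend(per_id_server.fetch([pmid]))

# New approach: one request carrying all PMIDs.
batch_server = CountingServer(records)
batched = batch_server.fetch(pmids)

print("%d requests one by one, %d batched" % (per_id_server.requests, batch_server.requests))
```

Both approaches return the same records; only the request count differs.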
The code
<syntax type='python'>
from Bio import PubMed
from Bio import Medline
from Bio import EUtils
from Bio.EUtils import DBIdsClient
import string

# NOTE: parse_mesh() is defined in the compiled-version script further down

# DBIdsClient to search list of pmids for snp_id (default db is pubmed)
client = DBIdsClient.DBIdsClient()
pmids = client.search("rs11200638")
if len(pmids) == 0:
    print "SNP not in PubMed."
else:
    # Fetch medline record file from PubMed of all corresponding pmids
    records = pmids[0:5].efetch(retmode="text", rettype="medline")
    # Medline Record Parser to parse medline records into record objects
    rec_parser = Medline.RecordParser()
    medline = Medline.Iterator(records, parser = rec_parser)

    # different lists of mesh terms
    all_mesh = []
    all_mesh_terms = []
    major_mesh_terms = []
    # loop through the record objects
    for cur_record in medline:
        print '\n', cur_record.title, cur_record.authors, cur_record.source
        mesh_headings = cur_record.mesh_headings
        if len(mesh_headings):
            all_mesh = parse_mesh(mesh_headings)
            all_mesh_terms.extend(all_mesh[0])
            major_mesh_terms.extend(all_mesh[1])
    print '\n', all_mesh_terms, '\n', major_mesh_terms
</syntax>
The integrable code
<syntax type='python'>
# The mesh terms stuff - THIS NEEDS TO BE ADDED TO OUTPUTLIST at some point
# DBIdsClient to search list of pmids for snp_id (default db is pubmed)
client = DBIdsClient.DBIdsClient()
pmids = client.search(snp_id)
if len(pmids) == 0:
    outputlist.append("No articles found for " + snp_id + "\n")
else:
    # Fetch medline record file from PubMed of all corresponding pmids
    records = pmids[0:5].efetch(retmode="text", rettype="medline")
    # Medline Record Parser to parse medline records into record objects
    rec_parser = Medline.RecordParser()
    medline = Medline.Iterator(records, parser = rec_parser)
    all_mesh = []
    all_mesh_terms = []
    major_mesh_terms = []
    # loop through the record objects
    for cur_record in medline:
        print '\n', cur_record.title, cur_record.authors, cur_record.source
        mesh_headings = cur_record.mesh_headings
        if len(mesh_headings):
            all_mesh = parse_mesh(mesh_headings)
            all_mesh_terms.extend(all_mesh[0])
            major_mesh_terms.extend(all_mesh[1])
    print "ALL MESH TERMS", '\n', all_mesh_terms, '\n', major_mesh_terms
</syntax>
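The integrable code still prints the MeSH lists instead of adding them to outputlist. One possible way to close that gap is sketched below. It assumes outputlist is a plain list of strings (as the "No articles found" line suggests); `mesh_output_lines` is a hypothetical helper, and the labels and the "; " separator are arbitrary choices, not something the group has agreed on.

```python
def mesh_output_lines(all_mesh_terms, major_mesh_terms):
    """Format the two MeSH lists as newline-terminated strings."""
    # "; " is used as the separator because MeSH terms can contain commas.
    return [
        "All MeSH terms: " + "; ".join(all_mesh_terms) + "\n",
        "Major MeSH terms: " + "; ".join(major_mesh_terms) + "\n",
    ]


# Example with a couple of terms taken from the sample output.
outputlist = []
outputlist.extend(mesh_output_lines(
    ["Aged", "Macular Degeneration genetics"],
    ["Macular Degeneration genetics"],
))

for line in outputlist:
    print(line.rstrip("\n"))
```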
Script seen in compiled version
<syntax type='python'>
from Bio import PubMed
from Bio import Medline
import string

# parses a mesh term to remove * and /
def parse_term(term, is_major):
    parsed_term = term
    if is_major:
        parsed_term = parsed_term.replace('*', '')
    if term.find('/') != -1:
        parsed_term = parsed_term.replace('/', ' ')
    return parsed_term

# parses list of mesh terms
# returns embedded list, one with all terms and one major terms
def parse_mesh(headings):
    all_mesh_terms = []
    major_mesh_terms = []
    for heading in headings:
        # a '*' anywhere in the heading marks it as a major topic
        major = heading.find('*') != -1
        mesh_term = parse_term(heading, major)
        all_mesh_terms.append(mesh_term)
        if major:
            major_mesh_terms.append(mesh_term)
    return [all_mesh_terms, major_mesh_terms]

article_ids = PubMed.search_for("rs11200638")
rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

all_mesh = []
all_mesh_terms = []
major_mesh_terms = []
# one PubMed lookup per article id (the slower, original approach)
for did in article_ids[0:5]:
    cur_record = medline_dict[did]
    print '\n', cur_record.title, cur_record.authors, cur_record.source
    mesh_headings = cur_record.mesh_headings
    if len(mesh_headings) != 0:
        all_mesh = parse_mesh(mesh_headings)
        all_mesh_terms.extend(all_mesh[0])
        major_mesh_terms.extend(all_mesh[1])
print '\n', all_mesh_terms, '\n', major_mesh_terms
</syntax>
Output
HTRA1 promoter polymorphism predisposes Japanese to age-related macular degeneration. ['Yoshida T', 'DeWan A', 'Zhang H', 'Sakamoto R', 'Okamoto H', 'Minami M', 'Obazawa M', 'Mizota A', 'Tanaka M', 'Saito Y', 'Takagi I', 'Hoh J', 'Iwata T'] Mol Vis. 2007 Apr 4;13:545-8.

HTRA1 Variant Confers Similar Risks to Geographic Atrophy and Neovascular Age-related Macular Degeneration. ['Cameron DJ', 'Yang Z', 'Gibbs D', 'Chen H', 'Kaminoh Y', 'Jorgensen A', 'Zeng J', 'Luo L', 'Brinton E', 'Brinton G', 'Brand JM', 'Bernstein PS', 'Zabriskie NA', 'Tang S', 'Constantine R', 'Tong Z', 'Zhang K'] Cell Cycle. 2007 May 16;6(9).

A variant of the HTRA1 gene increases susceptibility to age-related macular degeneration. ['Yang Z', 'Camp NJ', 'Sun H', 'Tong Z', 'Gibbs D', 'Cameron DJ', 'Chen H', 'Zhao Y', 'Pearson E', 'Li X', 'Chien J', 'Dewan A', 'Harmon J', 'Bernstein PS', 'Shridhar V', 'Zabriskie NA', 'Hoh J', 'Howes K', 'Zhang K'] Science. 2006 Nov 10;314(5801):992-3. Epub 2006 Oct 19.

['Aged', 'Aging', 'Alleles', 'Case-Control Studies', 'Chromosomes, Human, Pair 10 genetics', 'Cohort Studies', 'European Continental Ancestry Group genetics', 'Female', 'Genetic Predisposition to Disease', 'Genotype', 'Homozygote', 'Humans', 'Lymphocytes enzymology', 'Macular Degeneration genetics', 'Male', 'Middle Aged', 'Pigment Epithelium of Eye enzymology', 'Polymorphism, Single Nucleotide', 'Promoter Regions (Genetics)', 'RNA, Messenger genetics metabolism', 'Retinal Drusen metabolism', 'Reverse Transcriptase Polymerase Chain Reaction', 'Serine Endopeptidases analysis genetics metabolism']

['Genetic Predisposition to Disease', 'Macular Degeneration genetics', 'Polymorphism, Single Nucleotide', 'Promoter Regions (Genetics)', 'Serine Endopeptidases analysis genetics metabolism']