DataONE:GEO reuse study/Phase 1: Difference between revisions

From OpenWetWare
m (rearranging sections)
(streamline content to point to protocol page)
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Research Plan==
==Aim==
===Overview===
* Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
* Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
* Enumerate the PMC papers that reused GEO data
* Estimate what percent of these papers depended on the GEO data for their scientific contribution


===Query details===
==Background==
* accession number formats:
** look at both GSE and GDS accession numbers
** use both the raw ID number like 200007572 and the stripped version without the 200... prefix.  For example, search for both 200007572 and 7572
** search for both accession number right beside the prefix, and with one space in between, so "GSE 7572" and "GSE7572"
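The accession number formats above can be sketched as a small helper that expands one raw ID into the strings to search for. This is only an illustration: the function names are hypothetical, and the rule that a raw ID is a "2" followed by the zero-padded number is an assumption inferred from the single example (200007572 and 7572).

```python
def strip_raw_id(raw_id: str) -> str:
    """Drop the leading '200...' from a raw ID, e.g. '200007572' -> '7572'.

    Assumption: a raw ID is the digit '2' followed by the zero-padded number.
    """
    if not raw_id.isdigit() or not raw_id.startswith("2"):
        raise ValueError(f"unexpected raw ID: {raw_id!r}")
    number = raw_id[1:].lstrip("0")
    if not number:
        raise ValueError(f"raw ID has no number part: {raw_id!r}")
    return number


def search_terms(raw_id: str, prefix: str = "GSE") -> list[str]:
    """List every string to search for in full text for one dataset."""
    number = strip_raw_id(raw_id)
    return [
        raw_id,                # raw ID, e.g. 200007572
        number,                # stripped number, e.g. 7572
        f"{prefix}{number}",   # prefix directly beside the number
        f"{prefix} {number}",  # prefix, one space, number
    ]


if __name__ == "__main__":
    print(search_terms("200007572"))
```

A bare number such as 7572 will match many unrelated strings in full text, so hits on the stripped form probably need the most spot-checking.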


===Exclude data creation studies===
==Methods==
* spot-check to make sure each accession number appears in the context of reuse. A few mentions appear to be in the context of deposit, in articles that are not tagged with pmc_gds[filter] (example: PMCID 2396644)
===Overview===
** do this for all the PMC article hits? A few appear to be missing the filter, which matters because counting them would erroneously inflate our reuse estimate
* Using the method outlined at [[DataONE:Protocols/Find_GEO_reuses]]:
** could use query from my BioLink paper: 
** Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
  (geo OR omnibus)
** Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
  AND microarray
** Enumerate the PMC papers that reused GEO data
  AND "gene expression"     
** Estimate what percent of these papers depended on the GEO data for their scientific contribution
  AND accession
  NOT (databases
        OR user OR users
        OR (public AND accessed)
        OR (downloaded AND published))
** or the simpler: 
  "gene expression omnibus" AND (submitted OR deposited)
** to do this transparently, query PMC results for each of these words:
*** submitted
*** deposited
*** user*
*** public
*** accessed
*** downloaded
*** published
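One way to run the per-word check is to build a separate query for each word on top of whatever base query produced the PMC hits, then compare the hit counts. The sketch below only builds the query strings. The base query shown is a placeholder, and the function name is hypothetical.

```python
# Words taken from the list above; "user*" covers both "user" and "users".
FILTER_WORDS = [
    "submitted",
    "deposited",
    "user*",
    "public",
    "accessed",
    "downloaded",
    "published",
]


def per_word_queries(base_query: str, words=FILTER_WORDS) -> dict[str, str]:
    """Return one query per filter word, each restricted to the base query."""
    return {word: f"({base_query}) AND {word}" for word in words}


if __name__ == "__main__":
    # Placeholder base query; substitute the accession-number query actually used.
    for word, query in per_word_queries('"gene expression omnibus"').items():
        print(word, "->", query)
```

Recording the hit count for each word separately shows which exclusion terms remove the most articles before they are combined into one NOT clause.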
 
===Estimate time lag for reuse===
To estimate time lag:
* extract year
 
===Estimate what percentage of reusers weren't the original authors===
* see if AND pubmed_gds and NOT pmc_gds have any author overlaps?  (note AND should be pubmed!)
* other idea:  institution comparison using medline info
* better than submitter, because submitter not the whole story
* better than institution, because institution not precise in submission
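A minimal sketch of the author-overlap idea, assuming the author lists of the data-depositing paper and the reusing paper have already been retrieved as "Last, First" strings. The name format, the normalization rule (last name plus first initial), and the function names are all assumptions for illustration. Matching on initials will produce some false overlaps for common surnames.

```python
def normalize(name: str) -> str:
    """Reduce 'Last, First Middle' to 'last f' so name variants compare equal."""
    last, _, first = name.partition(",")
    initial = first.strip()[:1].lower()
    return f"{last.strip().lower()} {initial}".strip()


def shared_authors(deposit_authors, reuse_authors) -> set[str]:
    """Return the normalized names that appear on both papers."""
    deposit = {normalize(name) for name in deposit_authors}
    reuse = {normalize(name) for name in reuse_authors}
    return deposit & reuse


if __name__ == "__main__":
    # Made-up names, for illustration only.
    overlap = shared_authors(["Smith, John A", "Lee, K"], ["Smith, J", "Wong, M"])
    print(overlap or "no shared authors: likely third-party reuse")
```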
 
===Estimate what percent of reuse created "new science"===
* classify if methods or informatics:
** journal name has informatics
** mesh term for methods?
* look at mesh overlap?
* look for metaanalysis mesh term?
 
===Estimate what percent of these papers depended on the GEO data for their scientific contribution===
** Any good ideas on how to do this efficiently?
*** find those which are/are not in informatics journals
*** that use "methods" MeSH terms
*** ??
 
===Estimate the fraction of all papers that are in PMC===
* use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
** restrict from 2007 to 2009
** result: 
  number of articles in PMC:  6311,
  number of articles in PubMed:  21569,
  so PMC contains 29.26% of related papers
* so we should multiply our number of scientific papers by about 3.4 (1/0.2926) to get an estimate for all of scientific publishing
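The arithmetic behind this estimate, as a short sketch. The two article counts come from the result above. The function names and the example reuse count of 100 are hypothetical.

```python
def pmc_coverage(pmc_articles: int, pubmed_articles: int) -> float:
    """Fraction of the relevant PubMed articles that are also in PMC."""
    return pmc_articles / pubmed_articles


def scale_to_pubmed(pmc_count: float, coverage: float) -> float:
    """Extrapolate a count observed in PMC to all of PubMed."""
    return pmc_count / coverage


coverage = pmc_coverage(6311, 21569)
print(f"PMC coverage: {coverage:.2%}")
print(f"scaling factor: {1 / coverage:.2f}")
# Hypothetical example: 100 reuse papers found in PMC.
print(f"extrapolated reuse papers: {scale_to_pubmed(100, coverage):.0f}")
```

The extrapolation assumes that papers in PMC reuse GEO data at the same rate as papers outside it, which may not hold.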
 
==Limitations==
===Important for argument===
This is a conservative estimate because:
*  our estimates do not consider reuses after our study timeframe
** many datasets we are considering will continue to be used in the future; these reuses are not counted in our estimate
** could estimate this impact if we examine data deposited 7 years ago?
* Many papers not in PubMed Central
** using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
* our methods do not find studies that both create and reuse data
** to narrow down our query results, we automatically eliminate studies that create data, even though these same studies may also reuse data
** we don't have an estimate of how many this is, would require manual inventory
* Many data citations not attributed using accession numbers
** don't have a good way to estimate this yet
** would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
** maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
 
===Less important for argument===
* Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
* Deposits into PMC not stable over time, distribution may change over time


==Additional uses for this data collection==
===Details===
* could use this data to see how many publications use any one dataset
* see [[DataONE:Protocols/Find_GEO_reuses]]
* could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
* can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year


==Open Questions==
==Results==
None right now


==Notes==
==Discussion==
* this approach of querying PMC full text is similar (per correspondence with the GEO team) to the one the GEO team used to compile the 3rd party reuse page:  http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
* would be nice to figure out how to write all of these columns to google docs directly from code

Latest revision as of 13:58, 14 July 2010

Aim

Background

Methods

Overview

  • Using the method outlined at DataONE:Protocols/Find_GEO_reuses:
    • Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
    • Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
    • Enumerate the PMC papers that reused GEO data
    • Estimate what percent of these papers depended on the GEO data for their scientific contribution

Details

Results

Discussion