DataONE:GEO reuse study/Phase 1

From OpenWetWare

< DataONE:GEO reuse study(Difference between revisions)
(remove brainstorming section, adding it to its own page)
Current revision (15:58, 14 July 2010) (view source)
(streamline content to point to protocol page)
 
(6 intermediate revisions not shown.)
Line 1: Line 1:
-
==Research Plan==
+
==Aim==
-
===Overview===
+
-
* Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
+
-
* Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
+
-
* Enumerate the PMC papers that reused GEO data
+
-
* Estimate what percent of these papers depended on the GEO data for their scientific contribution
+
-
===Query details===
+
==Background==
-
* accession number formats:
+
-
** look at both GSE and GDS accession numbers
+
-
** use both the raw ID number like 200007572 and the stripped version without the 200... prefix.  For example, search for both 200007572 and 7572
+
-
** search for the accession number both directly after the prefix and with one space in between, so "GSE7572" and "GSE 7572"
+
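A minimal sketch of how the search variants described above could be generated for one accession. The function name is hypothetical, and it assumes the stripped number is the raw ID minus its leading digit and zero padding, which matches the 200007572 / 7572 example but has not been checked against other IDs.

```python
def accession_variants(raw_id, prefix):
    """Return the search strings to try for one GEO accession.

    raw_id: full numeric ID, e.g. 200007572
    prefix: "GSE" or "GDS"
    """
    raw = str(raw_id)
    # Assumption: drop the leading digit, then the zero padding
    # (200007572 -> 7572).
    stripped = raw[1:].lstrip("0")
    return [
        raw,                     # raw ID
        stripped,                # stripped number on its own
        f"{prefix}{stripped}",   # prefix directly beside the number
        f"{prefix} {stripped}",  # prefix, one space, number
    ]


print(accession_variants(200007572, "GSE"))
```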
-
 
+
-
===Exclude data creation studies===
+
-
* spot-check to make sure the accession number is mentioned in the context of reuse... looks like there may be a few mentions in the context of deposit where the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
+
-
** do this for all the PMC article hits?  looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
+
-
** could use query from my BioLink paper: 
+
-
  (geo OR omnibus)
+
-
  AND microarray
+
-
  AND "gene expression"     
+
-
  AND accession
+
-
  NOT (databases
+
-
        OR user OR users
+
-
        OR (public AND accessed)
+
-
        OR (downloaded AND published))
+
-
** or the simpler: 
+
-
  "gene expression omnibus" AND (submitted OR deposited)
+
-
** to do this transparently, query PMC results for each of these words:
+
-
*** submitted
+
-
*** deposited
+
-
*** user*
+
-
*** public
+
-
*** accessed
+
-
*** downloaded
+
-
*** published
+
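One way to run the per-word check could look like the sketch below: it builds one query string per context word, so the hit count for each word can be recorded separately. The names `CONTEXT_WORDS` and `context_queries` are hypothetical, and the base query is whatever accession search is already in use.

```python
# Context words taken from the list above.
CONTEXT_WORDS = [
    "submitted", "deposited", "user*", "public",
    "accessed", "downloaded", "published",
]


def context_queries(base_query):
    """Map each context word to the base query ANDed with that word."""
    return {word: f"({base_query}) AND {word}" for word in CONTEXT_WORDS}


for word, query in context_queries('"GSE7572" OR "GSE 7572"').items():
    print(word, "->", query)
```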
-
 
+
-
===Estimate time lag for reuse===
+
-
To estimate time lag:
+
-
* extract year
+
-
 
+
-
===Estimate what percentage of reusers weren't the original authors===
+
-
* see if AND pubmed_gds and NOT pmc_gds have any author overlaps?  (note AND should be pubmed!)
+
-
* other idea:  institution comparison using medline info
+
-
* better than submitter, because submitter not the whole story
+
-
* better than institution, because institution not precise in submission
+
-
 
+
-
===Estimate what percent of reuse created "new science"===
+
-
* classify if methods or informatics:
+
-
** journal name has informatics
+
-
** mesh term for methods?
+
-
* look at mesh overlap?
+
-
* look for meta-analysis MeSH term?
+
-
 
+
-
===Estimate what percent of these papers depended on the GEO data for their scientific contribution===
+
-
** Any good ideas on how to do this efficiently?
+
-
*** find those which are/are not in informatics journals
+
-
*** that use "methods" MeSH terms
+
-
*** ??
+
-
 
+
-
===Estimate the fraction of all papers that are in PMC===
+
-
* use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
+
-
** restrict from 2007 to 2009
+
-
** result: 
+
-
  number of articles in PMC:  6311,
+
-
  number of articles in PubMed:  21569,
+
-
  so PMC contains 29.26% of related papers
+
-
* so we should multiply our number of scientific papers by about 3.4 (1 / 0.2926) to get an estimate for all of scientific publishing
+
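The arithmetic behind the PMC coverage estimate, as a quick check. The two counts are the ones reported above. The variable names are only illustrative.

```python
pmc_count = 6311      # "gene expression profiling"[mesh] hits in PMC, 2007-2009
pubmed_count = 21569  # same query in PubMed, 2007-2009

# Fraction of related PubMed papers that are also in PMC.
pmc_fraction = pmc_count / pubmed_count

# Factor for scaling a PMC-only reuse count up to all of PubMed.
scale_factor = 1 / pmc_fraction

print(f"PMC coverage: {pmc_fraction:.2%}")
print(f"scale factor: {scale_factor:.2f}")
```

This assumes reuse papers are spread between PMC and non-PMC journals in the same proportion as gene expression profiling papers in general.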
-
 
+
-
==Uses==
+
-
* could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider
+
-
* could use this data to see how many publications use any one dataset
+
-
* can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
+
-
 
+
-
==Notes==
+
-
* this approach of querying PMC full text is similar (as per correspondence with the GEO team) to the one the GEO team used to compile the 3rd party reuse page:  http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
+
-
* would be nice to figure out how to write all of these columns to google docs directly from code
+
 +
==Methods==
 +
===Overview===
 +
* Using the method outlined at [[DataONE:Protocols/Find_GEO_reuses]]:
 +
** Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
 +
** Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
 +
** Enumerate the PMC papers that reused GEO data
 +
** Estimate what percent of these papers depended on the GEO data for their scientific contribution
-
==Open Questions==
+
===Details===
 +
* see [[DataONE:Protocols/Find_GEO_reuses]]
-
==Limitations==
+
==Results==
-
===Important for argument===
+
-
This is a conservative estimate because:
+
-
*  our estimates do not consider reuses after our study timeframe
+
-
** many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
+
-
** could estimate this impact if we examine data deposited 7 years ago?
+
-
* Many papers not in PubMed Central
+
-
** using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
+
-
* our methods do not find studies that both create and reuse data
+
-
** to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
+
-
** we don't have an estimate of how many this is, would require manual inventory
+
-
* Many data citations not attributed using accession numbers
+
-
** don't have a good way to estimate this yet
+
-
** would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
+
-
** maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
+
-
===Less important for argument===
+
==Discussion==
-
* Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
+
-
* Deposits into PMC not stable over time, distribution may change over time
+

Current revision

Aim

Background

Methods

Overview

  • Using the method outlined at DataONE:Protocols/Find_GEO_reuses:
    • Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
    • Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
    • Enumerate the PMC papers that reused GEO data
    • Estimate what percent of these papers depended on the GEO data for their scientific contribution
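As a rough illustration of the PubMed Central query step, the sketch below builds an NCBI E-utilities `esearch` URL for a set of accession strings, restricted to the 1900 to 2009 publication range mentioned above. The function name and the `retmax` default are assumptions made for the example. The actual protocol lives on the Find_GEO_reuses page and may differ. Nothing is sent over the network here, only the URL is constructed.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pmc_search_url(terms, min_year=1900, max_year=2009, retmax=1000):
    """Build an esearch URL that looks for any of `terms` in PMC."""
    # Quote each term so "GSE 7572" is searched as a phrase.
    term = " OR ".join(f'"{t}"' for t in terms)
    params = {
        "db": "pmc",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": retmax,     # assumption: 1000 hits is enough per query
    }
    return f"{ESEARCH}?{urlencode(params)}"


print(pmc_search_url(["GSE7572", "GSE 7572"]))
```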

Details

Results

Discussion
