DataONE:GEO reuse study/Phase 1: Difference between revisions

Latest revision as of 13:58, 14 July 2010

Aim

Background

Methods

Overview

Using the method outlined at DataONE:Protocols/Find_GEO_reuses:
- Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
- Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
- Enumerate the PMC papers that reused GEO data
- Estimate what percent of these papers depended on the GEO data for their scientific contribution
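The accession-number search in the second step can be sketched roughly as follows. This is a minimal illustration, not the protocol itself (see DataONE:Protocols/Find_GEO_reuses for that). It assumes the accession formats noted in the earlier revision of this page: the raw ID (for example 200007572), the stripped number (7572), and the prefixed forms with and without a space ("GSE7572", "GSE 7572"). The function names `search_terms` and `pmc_query` are hypothetical, and the rule for stripping the raw ID is generalised from that single example.

```python
def search_terms(prefix, raw_id):
    """Return the search strings for one GEO accession.

    Assumption: the stripped number is the raw ID minus its leading
    digit and zero padding, as in 200007572 -> 7572.
    """
    stripped = raw_id[1:].lstrip("0")
    return [
        raw_id,                   # raw ID, e.g. 200007572
        stripped,                 # stripped number, e.g. 7572
        prefix + stripped,        # prefix right beside the number
        prefix + " " + stripped,  # prefix and number separated by one space
    ]


def pmc_query(terms):
    """Join the terms into one OR query of quoted phrases (illustrative only)."""
    return " OR ".join('"{}"'.format(term) for term in terms)


terms = search_terms("GSE", "200007572")
print(pmc_query(terms))
```

Bare numbers such as 7572 will match many unrelated strings in full text, so hits on them would need the same spot-checking as any other match.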

Details

see DataONE:Protocols/Find_GEO_reuses

@@ Line 1: / Line 1: @@
-==Research Plan==
+==Aim==
-===Overview===
-* Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
-* Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
-* Enumerate the PMC papers that reused GEO data
-* Estimate what percent of these papers depended on the GEO data for their scientific contribution
-===Query details===
+==Background==
-* accession number formats:
-** look at both GSE and GDS accession numbers
-** use both the raw ID number like 200007572 and the stripped version without the 200... prefix.  For example, search for both 200007572 and 7572
-** search for both accession number right beside the prefix, and with one space in between, so "GSE 7572" and "GSE7572"
-===Exclude data creation studies===
+==Methods==
-* spot-check to make sure accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
+===Overview===
-** do this for all the PMC article hits?  looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
+* Using the method outlined at [[DataONE:Protocols/Find_GEO_reuses]]:
-** could use query from my BioLink paper:
+** Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
-  (geo OR omnibus)
+** Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
-  AND microarray
+** Enumerate the PMC papers that reused GEO data
-  AND "gene expression"
+** Estimate what percent of these papers depended on the GEO data for their scientific contribution
-  AND accession
-  NOT (databases
-         OR user OR users
-         OR (public AND accessed)
-         OR (downloaded AND published))
-** or the more simple:
-  "gene expression omnibus" AND (submitted OR deposited)
-** to do this transparently, query PMC results for each of these words:
-*** submitted
-*** deposited
-*** user*
-*** public
-*** accessed
-*** downloaded
-*** published
-===Estimate time lag for reuse===
-To estimate time lag:
-* extract year
-===Estimate what percentage of reusers weren't the original authors===
-* see if AND pubmed_gds and NOT pmc_gds have any author overlaps?  (note AND should be pubmed!)
-* other idea:  institution comparison using medline info
-* better than submitter, because submitter not the whole story
-* better than institution, because institution not precise in submission
-===Estimate what percent of reuse created "new science"===
-* classify if methods or informatics:
-** journal name has informatics
-** mesh term for methods?
-* look at mesh overlap?
-* look for metaanalysis mesh term?
-===Estimate what percent of these papers depended on the GEO data for their scientific contribution===
-** Any good ideas on how to do this efficiently?
-*** find those which are/are not in informatics journals
-*** that use "methods" MeSH terms
-*** ??
-===Estimate the fraction of all papers that are in PMC===
-* use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
-** restrict from 2007 to 2009
-** result:
-  number of articles in PMC:  6311,
-  number of articles in PubMed:  21569,
-  so PMC contains 29.26% of related papers
-* so we should multiply our number of scientific papers by about 3 to get estimate for all of scientific publishing
-==Limitations==
-===Important for argument===
-This is a conservative estimate because:
-*  our estimates do not consider reuses after our study timeframe
-** many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
-** could estimate this impact if we examine data deposited 7 years ago?
-* Many papers not in PubMed Central
-** using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
-* our methods do not find studies that both create and reuse data
-** to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
-** we don't have an estimate of how many this is, would require manual inventory
-* Many data citations not attributed using accession numbers
-** don't have a good way to estimate this yet
-** would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
-** maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
-===Less important for argument===
-* Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
-* Deposits into PMC not stable over time, distribution may change over time
-==Additional uses for this data collection==
+===Details===
-* could use this data to see how many publications use any one dataset
+* see [[DataONE:Protocols/Find_GEO_reuses]]
-* could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
-* can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
-==Open Questions==
+==Results==
-None right now
-==Notes==
+==Discussion==
-* this query PMC full-text approach is similar (as per correspondence with GEO team) to that used by the GEO team to compile the 3rd party reuse page:  http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
-* would be nice to figure out how to write all of these columns to google docs directly from code
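
The PMC coverage estimate recorded in the earlier revision (6311 PMC articles against 21569 PubMed articles for the "gene expression profiling"[mesh] query, 2007 to 2009) can be reproduced with the short sketch below. The variable and function names are illustrative, and the scaling assumes that reuse papers are as likely to be in PMC as other papers matching that query, which the page does not establish.

```python
# Counts reported in the earlier revision of this page.
pmc_articles = 6311
pubmed_articles = 21569

# Fraction of related PubMed papers that are also in PMC.
pmc_fraction = pmc_articles / pubmed_articles

# Multiplier for extrapolating a PMC-based count to all of PubMed.
scale_factor = 1 / pmc_fraction


def scale_to_pubmed(pmc_count):
    """Extrapolate a count of PMC papers to an estimate for PubMed."""
    return pmc_count * scale_factor


print("PMC share: {:.2%}".format(pmc_fraction))
print("Scale factor: {:.2f}".format(scale_factor))
```

With these counts the multiplier comes out nearer 3.4 than 3, so rounding down to 3 keeps the extrapolated estimate conservative.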
