DataONE:GEO reuse study/Phase 1: Difference between revisions
From OpenWetWare
Revision as of 16:21, 20 June 2010
Research Plan
Overview
- Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
- Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
- Enumerate the PMC papers that reused GEO data
- Estimate what percent of these papers depended on the GEO data for their scientific contribution
Query details
- accession number formats:
- look at both GSE and GDS accession numbers
- use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
- search for both accession number right beside the prefix, and with one space in between, so "GSE 7572" and "GSE7572"
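A minimal sketch of how the search strings described above could be generated for one dataset. It assumes the raw ID differs from the stripped one by a fixed offset of 200000000 (inferred only from the example 200007572 and 7572), and that both number forms are combined with both prefixes. The names `accession_variants` and `or_query` are hypothetical helpers, not part of any GEO or PMC tool.

```python
# Assumption: the "200..." prefix is a fixed offset, inferred from 200007572 -> 7572.
GSE_UID_OFFSET = 200_000_000


def accession_variants(raw_id):
    """Return every search string for one raw GEO ID (hypothetical helper)."""
    numbers = [str(raw_id)]
    if raw_id >= GSE_UID_OFFSET:
        # Add the stripped form without the 200... prefix.
        numbers.append(str(raw_id - GSE_UID_OFFSET))
    variants = []
    for prefix in ("GSE", "GDS"):
        for number in numbers:
            # With the prefix right beside the number, and with one space.
            for sep in ("", " "):
                variants.append(f"{prefix}{sep}{number}")
    return variants


def or_query(variants):
    """Join the quoted variants into a single OR query string."""
    return " OR ".join(f'"{v}"' for v in variants)


variants = accession_variants(200007572)
print(or_query(variants))
```

Quoting each variant keeps "GSE 7572" together as a phrase rather than two separate search terms.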
Exclude data creation studies
- spot-check to make sure accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
- do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
- could use query from my BioLink paper:
(geo OR omnibus) AND microarray AND "gene expression" AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
- or the more simple:
"gene expression omnibus" AND (submitted OR deposited)
- to do this transparently, query PMC results for each of these words:
- submitted
- deposited
- user*
- public
- accessed
- downloaded
- published
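The per-word tally above could be prototyped locally before running the real PMC queries. This sketch is a stand-in, not a PMC query: it assumes the article text is already available as plain strings, and it treats `user*` as a prefix wildcard. `tally_filter_words` and the article IDs "A" and "B" are hypothetical.

```python
import re

FILTER_WORDS = ["submitted", "deposited", "user*", "public",
                "accessed", "downloaded", "published"]


def word_pattern(word):
    """Compile a case-insensitive whole-word pattern; a trailing * is a wildcard."""
    if word.endswith("*"):
        return re.compile(r"\b" + re.escape(word[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)


def tally_filter_words(texts):
    """Map each filter word to the sorted article IDs whose text contains it."""
    patterns = {word: word_pattern(word) for word in FILTER_WORDS}
    return {
        word: sorted(aid for aid, text in texts.items() if pattern.search(text))
        for word, pattern in patterns.items()
    }


# Hypothetical article snippets, for illustration only.
texts = {
    "A": "Data were deposited in GEO.",
    "B": "Users downloaded the public dataset.",
}
hits = tally_filter_words(texts)
for word in FILTER_WORDS:
    print(word, hits[word])
```

Reporting the hits word by word makes it visible which term excludes which article, which is the point of doing the filtering transparently.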
Estimate time lag for reuse
To estimate time lag:
- extract year
Estimate what percentage of reusers weren't the original authors
- check whether papers matching pubmed_gds and papers NOT matching pmc_gds have any author overlap (note: the AND filter should be the pubmed one!)
- other idea: institution comparison using medline info
- better than submitter, because submitter not the whole story
- better than institution, because institution not precise in submission
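One way the author-overlap check might look, as a rough sketch. It assumes author names arrive in a "Lastname Initials" form such as "Smith JA", and it matches on last name plus first initial only, which will produce some false matches for common names. `normalize_author` and `shared_authors` are hypothetical helpers, and the names in the example are invented.

```python
def normalize_author(name):
    """Reduce 'Smith JA' to 'smith j' (last name plus first initial)."""
    parts = name.strip().lower().split()
    if len(parts) < 2:
        return " ".join(parts)
    last_name = " ".join(parts[:-1])
    initials = parts[-1]
    return f"{last_name} {initials[0]}"


def shared_authors(deposit_authors, reuse_authors):
    """Return normalized names that appear on both author lists."""
    deposit = {normalize_author(a) for a in deposit_authors}
    reuse = {normalize_author(a) for a in reuse_authors}
    return deposit & reuse


# Invented names, for illustration only.
overlap = shared_authors(["Smith JA", "Lee K"], ["Smith J", "Chen L"])
print(overlap)
```

An empty result suggests the reuse was by a third party; a non-empty one flags the pair for manual review rather than settling the question.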
Estimate what percent of reuse created "new science"
- classify if methods or informatics:
- journal name has informatics
- mesh term for methods?
- look at mesh overlap?
- look for metaanalysis mesh term?
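The classification ideas above could be sketched as simple string rules. This assumes each paper comes with a journal name and a list of MeSH strings, and that a "methods" qualifier or a meta-analysis heading shows up as a substring of those strings; both are assumptions about the input format, and `classify_paper` is a hypothetical helper.

```python
def classify_paper(journal, mesh_terms):
    """Return the set of labels that apply to one paper."""
    labels = set()
    # Journal name has "informatics" (also matches "Bioinformatics").
    if "informatics" in journal.lower():
        labels.add("informatics")
    terms = [term.lower() for term in mesh_terms]
    # Assumption: the methods qualifier appears inside the term string.
    if any("methods" in term for term in terms):
        labels.add("methods")
    if any("meta-analysis" in term for term in terms):
        labels.add("meta-analysis")
    return labels


labels = classify_paper(
    "BMC Bioinformatics",
    ["Oligonucleotide Array Sequence Analysis/methods"],
)
print(sorted(labels))
```

A paper with no labels is a candidate for "new science", but the rules are coarse enough that a manual spot-check of each class would still be needed.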
Estimate what percent of these papers depended on the GEO data for their scientific contribution
- Any good ideas on how to do this efficiently?
- find those which are/are not in informatics journals
- that use "methods" MeSH terms
- ??
Estimate the fraction of all papers that are in PMC
- use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
- restrict from 2007 to 2009
- result:
number of articles in PMC: 6311, number of articles in PubMed: 21569, so PMC contains 29.26% of related papers
- so we should multiply our number of scientific papers found in PMC by about 3.4 (21569 / 6311) to get an estimate for all of scientific publishing
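The arithmetic behind this estimate can be checked in a few lines, using only the two counts reported above. The scaling factor is the reciprocal of the PMC fraction, so it comes out somewhat above 3. The variable `reuse_papers_in_pmc` is a hypothetical placeholder, not a study result.

```python
pmc_count = 6311      # "gene expression profiling"[mesh] hits in PMC, 2007-2009
pubmed_count = 21569  # same query in PubMed, 2007-2009

pmc_fraction = pmc_count / pubmed_count
scale_factor = pubmed_count / pmc_count

print(f"PMC share of related papers: {pmc_fraction:.2%}")
print(f"Scaling factor: {scale_factor:.2f}")

# Hypothetical placeholder count, only to show how the factor is applied.
reuse_papers_in_pmc = 100
estimated_total = reuse_papers_in_pmc * scale_factor
```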
Additional uses of this data collection
- could use this data to see how many publications use any one dataset
- could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
- can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
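Counting how many publications use any one dataset is a small grouping step once the query hits are in hand. The sketch below assumes the hits are available as (PMCID, accession) pairs; `publications_per_dataset` is a hypothetical helper and the IDs in the example are invented.

```python
from collections import defaultdict


def publications_per_dataset(mentions):
    """Count distinct papers per accession from (pmcid, accession) pairs."""
    papers_by_dataset = defaultdict(set)
    for pmcid, accession in mentions:
        # A set ignores repeat mentions of one dataset within the same paper.
        papers_by_dataset[accession].add(pmcid)
    return {acc: len(papers) for acc, papers in papers_by_dataset.items()}


# Invented IDs, for illustration only.
mentions = [
    ("PMC0000001", "GSE0001"),
    ("PMC0000001", "GSE0001"),
    ("PMC0000002", "GSE0001"),
    ("PMC0000002", "GSE0002"),
]
counts = publications_per_dataset(mentions)
print(counts)
```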
Notes
- this approach of querying PMC full text is similar (as per correspondence with GEO team) to that used by the GEO team to compile the 3rd party reuse page: http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
- would be nice to figure out how to write all of these columns to google docs directly from code
Open Questions
Limitations
Important for argument
This is a conservative estimate because:
- our estimates do not consider reuses after our study timeframe
- many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
- could estimate this impact if we examine data deposited 7 years ago?
- Many papers not in PubMed Central
- using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
- our methods do not find studies that both create and reuse data
- to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
- we don't have an estimate of how many this is, would require manual inventory
- Many data citations not attributed using accession numbers
- don't have a good way to estimate this yet
- would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
- maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
Less important for argument
- Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
- Deposits into PMC not stable over time, distribution may change over time