DataONE:GEO reuse study/Phase 1
- Query PubMed Central for GEO accession number patterns
- Only look at one year of PMC because deposit rate (and possibly spectrum) not constant over time
- Also look at Highwire Press, Google Scholar, other full text sources?
- More difficult because can't process queries automatically
- Look for accession number patterns for datasets and data series?
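A minimal sketch of what matching accession number patterns might look like, assuming GEO accessions take the form of a short prefix followed by digits: GSE (series), GDS (datasets), GSM (samples), GPL (platforms). The function name `find_geo_accessions` is made up for illustration, and a real pass would need to check for false positives in full text.

```python
import re

# Assumed accession prefixes: GSE (series), GDS (datasets),
# GSM (samples), GPL (platforms), each followed by one or more digits.
GEO_ACCESSION = re.compile(r"\b(?:GSE|GDS|GSM|GPL)\d+\b")


def find_geo_accessions(text):
    """Return the sorted, de-duplicated GEO-style accessions found in text."""
    return sorted({match.group(0) for match in GEO_ACCESSION.finditer(text)})


if __name__ == "__main__":
    sample = "Data are available in GEO (GSE1234, GDS567). See also GSE1234."
    print(find_geo_accessions(sample))
```

Restricting the pattern to `GSE` and `GDS` would limit the search to data series and datasets only.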
Important for argument
This is a conservative estimate because:
- Many papers not in PMC (source for percentages?)
- Many data citations not attributed using accession numbers (source for percentages?)
Less important for argument
- Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
- Deposits into PMC not stable over time, distribution may change over time
In the meantime, here's a quick overview of where I'm headed, with details to follow as I see how it pans out. Feedback welcome, now or later.
I'm thinking that analysis of GEO data reuse could enable some interesting, straightforward analysis relevant to "value of data that is being lost."
One of the themes of John Wilbanks's talk at the Science Commons symposium was that science, and science data, should be generative (à la Jonathan Zittrain: "Generativity is a system's capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences").
Illustrating that a significant proportion of deposits to GEO have resulted in generative science would imply that data *not* submitted to GEO is a real loss to the scientific community's ability to "produce unanticipated change" and thus a loss of value.
Specifically, here's what I have in mind:
- Search the full text of all papers in PubMed Central for mention of each GEO accession number. Exclude papers that are listed in the "primary citation" field for that accession number. (I recently confirmed that this is the mechanism used to generate GEO's reuse list.)
- For each of the PMC articles that reuse GEO data, collect PubMed metadata about authors, MeSH indexing terms, date of publication, etc.
- For each GEO dataset with at least one reusing article, collect metadata from GEO and PubMed
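The first step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the search URL uses the NCBI E-utilities `esearch` endpoint with `db=pmc`, no request is actually sent, and the helper names `pmc_search_url` and `reuse_articles` and the input dictionaries are hypothetical. It also assumes the mentioning articles and the primary citations are identified with the same kind of ID.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pmc_search_url(accession, retmax=100):
    """Build (but do not send) a PMC full-text search URL for one accession."""
    params = {"db": "pmc", "term": accession, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def reuse_articles(mentions, primary_citations):
    """Drop primary-citation articles from the articles mentioning each accession.

    mentions:          accession -> iterable of article IDs that mention it
    primary_citations: accession -> iterable of article IDs in its primary citation field
    Both are assumed to use the same ID type (hypothetical inputs).
    Accessions with no remaining articles are omitted from the result.
    """
    reuse = {}
    for accession, articles in mentions.items():
        others = set(articles) - set(primary_citations.get(accession, ()))
        if others:
            reuse[accession] = others
    return reuse


if __name__ == "__main__":
    print(pmc_search_url("GSE1234"))
    print(reuse_articles({"GSE1234": {"a", "b"}}, {"GSE1234": {"a"}}))
```

The resulting mapping gives, for each accession, the set of reusing articles for which PubMed metadata would then be collected.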
This would facilitate many possible analyses:
- as you proposed, investigate if there is indeed a "long tail" of data reuse. Are just a few datasets reused many times, or are many datasets reused?
- look at the rate of reuse over time, to see if it is accelerating relative to the rate of deposit
- how often are the reusers from the same institution as the data submitters, and how often is it someone from a different institution or country? (I could imagine a map here, with arrows from data creation to data use.) Is the distribution of distinct authors and institutions flatter in reuse than in creation, indicating the inclusion of a broad audience?
- to what extent are the topics of papers that create data the same as the topics of papers that reuse the data? For example, if we clustered the creation and reuse articles, would the creation papers cluster together, separately from the reuse articles, or would the (create, reuse) pairs usually cluster together?
- to what degree do the creation and reuse papers participate in the same scientific conversations? Are there often papers that cite both, or do their "citation webs" or co-authorship webs rarely overlap?
- investigate the topics and methods of the reuse papers, via metadata. Are they meta-analyses? bioinformatics tool-building papers? clinical validation? microarray-creating studies themselves? A diverse set of uses would be evidence of generativity.
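For the "long tail" and "flatness" questions above, one possible way to put a number on how concentrated reuse is would be a Gini coefficient plus a top-k share. This is a swapped-in illustration, not a measure chosen in these notes, and the function names `gini` and `top_share` are hypothetical. The input is a list of counts, for example reuses per dataset or papers per institution.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 is perfectly flat,
    values near 1 mean a few items account for almost everything."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(counts, k):
    """Fraction of the total accounted for by the k largest counts."""
    values = sorted(counts, reverse=True)
    total = sum(values)
    return sum(values[:k]) / total if total else 0.0


if __name__ == "__main__":
    reuses_per_dataset = [12, 3, 1, 1, 1, 1]  # made-up numbers
    print(gini(reuses_per_dataset), top_share(reuses_per_dataset, 1))
```

Comparing the coefficient for data creators against the one for data reusers would be one way to ask whether reuse is "flatter" than creation.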
There are limitations to this data, obviously, but I don't think they undermine the analyses, since they mostly imply our results are conservative estimates. We are not capturing all instances of reuse:
- only including reuse that is reflected in citations (doesn't include training, use in unpublished research)
- only capturing citations that refer to GEO deposits by their accession number (in this domain data attributions are often citations to the original papers instead, or mentions of a GEO search criterion. Some of these datasets are also mirrored into other databases, so the data could be cited using a different set of accession numbers)
- only capturing reuse by papers in PubMed Central (a narrow slice of science papers)
Comments from Todd:
- I'm imagining a simple plot of (a) the number of papers depositing in GEO (or the number of deposits, if that is different), against (b) the number of papers citing GEO (or the number of citations to GEO where there may be >1 per paper).
- I like the idea of testing the flatness of contributors relative to reusers. That speaks to the value of data sharing outside of small social groups.
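The numbers behind such a plot could be tabulated per year along these lines. This is a rough sketch with hypothetical inputs: each list holds one publication year per depositing paper or per citing paper, and the function name `yearly_ratio` is made up for illustration.

```python
from collections import Counter


def yearly_ratio(deposit_years, reuse_years):
    """Return year -> (papers citing GEO) / (papers depositing in GEO).

    Only years with at least one deposit are reported, so the
    denominator is never zero.
    """
    deposits = Counter(deposit_years)
    reuses = Counter(reuse_years)
    return {year: reuses[year] / deposits[year] for year in sorted(deposits)}


if __name__ == "__main__":
    # Made-up years, purely to show the shape of the output.
    print(yearly_ratio([2005, 2005, 2006], [2006, 2006, 2006]))
```

A ratio that rises from year to year would suggest reuse is accelerating relative to deposit.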