DataONE:GEO reuse study/Phase 1: Difference between revisions

(→‎Notes: add link)
(flush out methods)

Revision as of 07:11, 20 June 2010

Research Plan

Overview

  • Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
  • Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
  • Enumerate the PMC papers that reused GEO data
  • Estimate what percent of these papers depended on the GEO data for their scientific contribution

Query details

  • accession number formats:
    • look at both GSE and GDS accession numbers
    • use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
    • search for the accession number both directly after the prefix and with one space in between, so "GSE7572" and "GSE 7572"
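
The format rules above can be sketched as a small helper that expands one ID into the strings to search for. This is only an illustration: the function name `accession_variants` is hypothetical, and it assumes a raw ID is nine digits, "200" followed by the zero-padded accession number, as in the 200007572 example.

```python
def accession_variants(raw_id: str, prefix: str) -> list[str]:
    """Expand one GEO ID into the search strings described in the list above."""
    raw = raw_id.strip()
    # Assumption: a raw ID is "200" plus a zero-padded number (200007572 -> 7572).
    if len(raw) == 9 and raw.startswith("200"):
        stripped = raw[3:].lstrip("0")
    else:
        stripped = raw
    variants = [
        raw,                      # raw ID, e.g. 200007572
        stripped,                 # stripped number, e.g. 7572
        f"{prefix}{stripped}",    # prefix directly beside the number
        f"{prefix} {stripped}",   # prefix, one space, number
    ]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    print(accession_variants("200007572", "GSE"))
```

The caller chooses the prefix (GSE or GDS), so the same helper covers both accession types.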

Exclude data creation studies

  • spot-check to make sure the accession number is mentioned in the context of reuse... looks like there may be a few mentions in the context of deposit where the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
    • do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
    • could use query from my BioLink paper:
 (geo OR omnibus) 
 AND microarray 
 AND "gene expression"       
 AND accession
 NOT (databases 
        OR user OR users
        OR (public AND accessed) 
        OR (downloaded AND published)) 
    • or the simpler:
 "gene expression omnibus" AND (submitted OR deposited)
    • to do this transparently, query PMC results for each of these words:
      • submitted
      • deposited
      • user*
      • public
      • accessed
      • downloaded
      • published
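
One way to run the per-word check is to generate a separate query string for each word, so the hit count for every word can be recorded on its own. The sketch below only builds the strings. The `base_query` argument, the function name, and the exact boolean syntax are assumptions, and submitting the queries to PMC is left out.

```python
# Words taken from the list above; "user*" keeps its wildcard.
FILTER_WORDS = [
    "submitted", "deposited", "user*", "public",
    "accessed", "downloaded", "published",
]


def per_word_queries(base_query: str, words=FILTER_WORDS) -> dict[str, str]:
    """Return one query per filter word, each restricted to the base result set."""
    return {word: f"({base_query}) AND {word}" for word in words}


if __name__ == "__main__":
    # Hypothetical base query built from the accession number variants.
    for word, query in per_word_queries('"GSE7572" OR "GSE 7572"').items():
        print(f"{word:12s} {query}")
```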

Estimate time lag for reuse

To estimate time lag:

  • extract year
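
A minimal sketch of the year extraction, assuming the submission and publication dates arrive as free-text strings that contain a four-digit year. The function names and the date formats in the example are illustrative, not taken from GEO or PubMed output.

```python
import re
from typing import Optional

# Matches a standalone four-digit year from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(date_text: str) -> Optional[int]:
    """Return the first four-digit year found in the text, or None."""
    match = YEAR_PATTERN.search(date_text)
    return int(match.group(0)) if match else None


def reuse_lag_years(submission_date: str, publication_date: str) -> Optional[int]:
    """Years between dataset submission and the reusing publication."""
    submitted = extract_year(submission_date)
    published = extract_year(publication_date)
    if submitted is None or published is None:
        return None
    return published - submitted


if __name__ == "__main__":
    print(reuse_lag_years("2007/06/20", "2009 Mar"))
```

Working at year resolution keeps the extraction simple, at the cost of lags that are only accurate to within a year.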

Estimate what percentage of reusers weren't the original authors

  • see if AND pubmed_gds and NOT pmc_gds have any author overlaps? (note AND should be pubmed!)
  • other idea: institution comparison using medline info
  • better than submitter, because submitter not the whole story
  • better than institution, because institution not precise in submission
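
The author-overlap idea could be sketched as a set intersection over normalized names. This assumes author names come in a "Lastname Initials" form, and it matches on last name plus first initial only, which is a rough heuristic that can both miss and over-match people. All names below are made up.

```python
def normalize_author(name: str) -> str:
    """Reduce 'Lastname Initials' to lowercase 'lastname' plus the first initial."""
    parts = name.strip().lower().split()
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    last_name = " ".join(parts[:-1])
    first_initial = parts[-1][0]
    return f"{last_name} {first_initial}"


def shared_authors(creators: list[str], reusers: list[str]) -> set[str]:
    """Normalized names that appear on both the data-creating and reusing paper."""
    creator_set = {normalize_author(name) for name in creators}
    reuser_set = {normalize_author(name) for name in reusers}
    return (creator_set & reuser_set) - {""}


if __name__ == "__main__":
    # Hypothetical author lists for one (creation paper, reuse paper) pair.
    print(shared_authors(["Smith JA", "Lee K"], ["smith J", "Wong P"]))
```

An empty result for a pair suggests the reusers were not the original authors, subject to the matching caveats above.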

Estimate what percent of reuse created "new science"

  • classify if methods or informatics:
    • journal name has informatics
    • MeSH term for methods?
  • look at MeSH overlap?
  • look for meta-analysis MeSH term?
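
The first classification rule might look like the sketch below: flag a paper if its journal name contains "informatics" or if any of its MeSH terms mentions "methods". The function name and the plain substring matching are assumptions, and the journal titles and terms in the example are illustrative.

```python
def looks_like_methods_or_informatics(journal: str, mesh_terms: list[str]) -> bool:
    """Heuristic flag for methods or informatics papers."""
    # "informatics" also matches "bioinformatics".
    journal_hit = "informatics" in journal.lower()
    # Assumption: a methods focus shows up as the word "methods" in a MeSH term.
    mesh_hit = any("methods" in term.lower() for term in mesh_terms)
    return journal_hit or mesh_hit


if __name__ == "__main__":
    print(looks_like_methods_or_informatics("BMC Bioinformatics", []))
    print(looks_like_methods_or_informatics(
        "Cancer Research", ["Gene Expression Profiling/methods"]))
    print(looks_like_methods_or_informatics(
        "Cancer Research", ["Neoplasms/genetics"]))
```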

Estimate what percent of these papers depended on the GEO data for their scientific contribution

    • Any good ideas on how to do this efficiently?
      • find those which are/are not in informatics journals
      • that use "methods" MeSH terms
      • ??

Uses

  • could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider
  • could use this data to see how many publications use any one dataset
  • can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year

Notes

  • this query PMC full-text approach is similar (as per correspondence with GEO team) to that used by the GEO team to compile the 3rd party reuse page: http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
  • would be nice to figure out how to write all of these columns to google docs directly from code


Open Questions

Limitations

Important for argument

This is a conservative estimate because:

  • our estimates do not consider reuses after our study timeframe
    • many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
  • our methods do not find studies that both create and reuse data
    • to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
  • Many papers not in PubMed Central
    • use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate?
  • Many data citations not attributed using accession numbers (source for percentages?)
    • don't have a good way to estimate this yet
    • would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
    • maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
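
For the PMC coverage point above, the arithmetic could be sketched as follows: divide the PMC hit count for the "gene expression profiling"[mesh] query by the PubMed hit count for the same query and period, then scale the observed reuse count by that fraction. The counts in the example are placeholders, not real results, and the scaling assumes papers in PMC reuse GEO data at the same rate as papers outside it, which is untested.

```python
def pmc_coverage(pmc_hits: int, pubmed_hits: int) -> float:
    """Fraction of the relevant PubMed literature that is also in PMC."""
    if pubmed_hits <= 0:
        raise ValueError("pubmed_hits must be positive")
    return pmc_hits / pubmed_hits


def scale_reuse_estimate(observed_reuses: int, coverage: float) -> float:
    """Extrapolate PMC-only reuse counts to all of PubMed.

    Assumption: the reuse rate is the same inside and outside PMC.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return observed_reuses / coverage


if __name__ == "__main__":
    coverage = pmc_coverage(pmc_hits=250, pubmed_hits=1000)  # placeholder counts
    print(coverage, scale_reuse_estimate(50, coverage))
```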

Less important for argument

  • Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
  • Deposits into PMC not stable over time, distribution may change over time

Brainstorming

In the meantime, here's a quick overview of where I'm headed, with details to follow as I see how it pans out. Feedback welcome, now or later.

I'm thinking that analysis of GEO data reuse could enable some interesting, straightforward analysis relevant to "value of data that is being lost."

One of the themes of John Wilbanks's talk at the Science Commons symposium was that science, and science data, should be generative (à la Jonathan Zittrain: "Generativity is a system's capacity to produce unanticipated change through unfiltered contributions from broad and varied audiences").

Illustrating that a significant proportion of deposits to GEO have resulted in generative science would imply that data *not* submitted to GEO is a real loss to the scientific community's ability to "produce unanticipated change" and thus a loss of value.

Specifically, here's what I have in mind:

  • Search the full text of all papers in PubMed Central for mention of each GEO accession number. Exclude papers that are in the "primary citation field" for that accession number. (I recently confirmed that this is the mechanism used to generate GEO's reuse list.)
  • For each of the PMC articles that reuse GEO data, collect PubMed metadata about authors, MeSH indexing terms, date of publication, etc.
  • For each GEO dataset with at least one reusing article, collect metadata from GEO and PubMed

This would facilitate many possible analyses:

  • as you proposed, investigate if there is indeed a "long tail" of data reuse. Are just a few datasets reused many times, or are many datasets reused?
  • look at the rate of reuse over time, to see if it is accelerating relative to the rate of deposit
  • how often are the reusers from the same institution as the data submitters, and how often is it someone from a different institution or country? (I could imagine a map here, with arrows from data creation to data use.) Is the distribution of distinct authors and institutions flatter in reuse than in creation, indicating the inclusion of a broad audience?
  • to what extent are the topics of papers that create data the same as the topics of papers that reuse the data? For example, if we clustered the creation and reuse articles, would the creation papers cluster together, separately from the reuse articles, or would the (create, reuse) pairs usually cluster together?
  • to what degree do the creation and reuse papers participate in the same scientific conversations? Are there often papers that cite both, or do their "citation webs" or co-authorship webs rarely overlap?
  • investigate the topics and methods of the reuse papers, via metadata. Are they meta-analyses? bioinformatics tool-building papers? clinical validation? microarray-creating studies themselves? A diverse set of uses would be evidence of generativity.
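
As a rough illustration of the "long tail" question in the list above, reuse can be tallied per dataset and summarized as the share of all reuses that goes to the most-reused datasets. The input format, a list of (accession number, reusing article ID) pairs, and the IDs in the example are hypothetical.

```python
from collections import Counter


def reuse_counts(pairs: list[tuple[str, str]]) -> Counter:
    """Count distinct reusing articles per dataset."""
    # set() drops repeated (dataset, article) pairs so each article counts once.
    return Counter(accession for accession, _article in set(pairs))


def top_share(counts: Counter, top_n: int) -> float:
    """Share of all reuses that belongs to the top_n most-reused datasets."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_total = sum(count for _accession, count in counts.most_common(top_n))
    return top_total / total


if __name__ == "__main__":
    # Made-up pairs for illustration only.
    pairs = [("GSE1", "PMC1"), ("GSE1", "PMC2"), ("GSE1", "PMC2"), ("GSE2", "PMC3")]
    counts = reuse_counts(pairs)
    print(counts, top_share(counts, top_n=1))
```

A top share close to 1 for a small `top_n` would mean a few datasets account for most reuse. A lower share would point to reuse spread across many datasets.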


There are limitations to this data, obviously, but I don't think they undermine the analyses, since they mostly imply our results are conservative estimates. We are not capturing all instances of reuse:

  • only including reuse that is reflected in citations (doesn't include training, use in unpublished research)
  • only capturing citations that refer to GEO deposits by their accession number (in this domain data attributions are often citations to the original papers instead, or by mentioning a GEO search criterion. Some of these datasets are also mirrored into other databases, so the data could be cited using a different set of accession numbers)
  • only capturing reuse by papers in PubMed Central (a narrow slice of science papers)

Comments from Todd:

  • I'm imagining a simple plot of (a) the number of papers depositing in GEO (or the number of deposits, if that is different) against (b) the number of papers citing GEO (or the number of citations to GEO where there may be >1 per paper).
  • I like the idea of testing the flatness of contributors relative to reusers. That speaks to the value of data sharing outside of small social groups.