|
|
| (5 intermediate revisions not shown.) |
| Line 1: |
Line 1: |
| - | ==Research Plan== | + | ==Aim== |
| - | ===Overview===
| + | |
| - | * Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
| + | |
| - | * Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
| + | |
| - | * Enumerate the PMC papers that reused GEO data
| + | |
| - | * Estimate what percent of these papers depended on the GEO data for their scientific contribution
| + | |
| | | | |
| - | ===Query details=== | + | ==Background== |
| - | * accession number formats:
| + | |
| - | ** look at both GSE and GDS accession numbers
| + | |
| - | ** use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
| + | |
| - | ** search for the number both directly adjacent to the prefix and separated from it by one space, so "GSE7572" and "GSE 7572"
| + | |
| - | | + | |
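The accession number formats listed above can be sketched as a small helper that expands one ID into its search strings. This is a minimal illustration only: the function names are hypothetical, the rule for stripping the "200" prefix is inferred from the single 200007572 / 7572 example, and the notes do not say whether the raw ID is also combined with the prefix, so the sketch simply pairs every number form with both spacings.

```python
def strip_uid_prefix(raw_id: str) -> str:
    """Drop a leading '200' and zero padding from a 9-digit raw ID.

    Assumption: inferred from the one example in the notes
    (200007572 -> 7572); other ID shapes are returned unchanged.
    """
    if len(raw_id) == 9 and raw_id.startswith("200"):
        return raw_id[3:].lstrip("0")
    return raw_id


def accession_variants(prefix: str, raw_id: str) -> list[str]:
    """Return every prefix/number/spacing combination to search for."""
    idents = [raw_id]
    stripped = strip_uid_prefix(raw_id)
    if stripped != raw_id:
        idents.append(stripped)
    # For each number form, emit it right beside the prefix and with one space.
    return [f"{prefix}{sep}{ident}" for ident in idents for sep in ("", " ")]


variants = accession_variants("GSE", "200007572")
```

The resulting strings could then be joined with OR into a single full-text query, one query per dataset.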
| - | ===Exclude data creation studies===
| + | |
| - | * spot-check to make sure each accession number appears in the context of reuse... there seem to be a few mentions in the context of deposit where the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
| + | |
| - | ** do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
| + | |
| - | ** could use query from my BioLink paper:
| + | |
| - | (geo OR omnibus)
| + | |
| - | AND microarray
| + | |
| - | AND "gene expression"
| + | |
| - | AND accession
| + | |
| - | NOT (databases
| + | |
| - | OR user OR users
| + | |
| - | OR (public AND accessed)
| + | |
| - | OR (downloaded AND published))
| + | |
| - | ** or the simpler:
| + | |
| - | "gene expression omnibus" AND (submitted OR deposited)
| + | |
| - | ** to do this transparently, query PMC results for each of these words:
| + | |
| - | *** submitted
| + | |
| - | *** deposited
| + | |
| - | *** user*
| + | |
| - | *** public
| + | |
| - | *** accessed
| + | |
| - | *** downloaded
| + | |
| - | *** published
| + | |
| - | | + | |
| - | ===Estimate time lag for reuse===
| + | |
| - | To estimate time lag:
| + | |
| - | * extract year
| + | |
| - | | + | |
| - | ===Estimate what percentage of reusers weren't the original authors===
| + | |
| - | * see if AND pubmed_gds and NOT pmc_gds have any author overlaps? (note AND should be pubmed!)
| + | |
| - | * other idea: institution comparison using medline info
| + | |
| - | * better than submitter, because submitter not the whole story
| + | |
| - | * better than institution, because institution not precise in submission
| + | |
| - | | + | |
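The author-overlap idea above can be sketched as a set intersection over normalized names. This is only an illustration: the function names and the sample authors are hypothetical, and it assumes names arrive in a MEDLINE-style "Lastname Initials" form. Matching on last name plus first initial will produce false positives for common surnames and will mishandle multi-word surnames.

```python
def normalize_author(name: str) -> str:
    """Reduce 'Lastname Initials' to 'lastname i' (lowercase, first initial only)."""
    parts = name.strip().lower().split()
    if not parts:
        return ""
    last = parts[0]
    initial = parts[1][0] if len(parts) > 1 else ""
    return f"{last} {initial}".strip()


def shared_authors(depositing: list[str], reusing: list[str]) -> set[str]:
    """Return normalized names that appear on both author lists."""
    return {normalize_author(a) for a in depositing} & {
        normalize_author(a) for a in reusing
    }


# Hypothetical author lists for a data-creation paper and a reuse paper.
overlap = shared_authors(["Smith JA", "Lee K"], ["Smith J", "Garcia M"])
is_third_party_reuse = not overlap
```

An empty overlap suggests the reusers were not the original authors. A non-empty one flags the pair for manual checking.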
| - | ===Estimate what percent of reuse created "new science"===
| + | |
| - | * classify if methods or informatics:
| + | |
| - | ** journal name has informatics
| + | |
| - | ** mesh term for methods?
| + | |
| - | * look at mesh overlap?
| + | |
| - | * look for meta-analysis mesh term?
| + | |
| - | | + | |
| - | ===Estimate what percent of these papers depended on the GEO data for their scientific contribution===
| + | |
| - | ** Any good ideas on how to do this efficiently?
| + | |
| - | *** find those which are/are not in informatics journals
| + | |
| - | *** that use "methods" MeSH terms
| + | |
| - | *** ??
| + | |
| - | | + | |
| - | ===Estimate the fraction of all papers that are in PMC===
| + | |
| - | * use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
| + | |
| - | ** restrict from 2007 to 2009
| + | |
| - | ** result:
| + | |
| - | number of articles in PMC: 6311,
| + | |
| - | number of articles in PubMed: 21569,
| + | |
| - | so PMC contains 29.26% of related papers
| + | |
| - | * so we should multiply our number of scientific papers by about 3.4 (21569/6311) to get an estimate for all of scientific publishing
| + | |
| - | | + | |
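The arithmetic behind this estimate is shown below as a short check. The two counts are the ones reported above; the function name is hypothetical. It also assumes that reuse papers are in PMC at the same rate as "gene expression profiling" papers in general.

```python
def pmc_coverage(pmc_count: int, pubmed_count: int) -> tuple[float, float]:
    """Return (fraction of PubMed papers that are in PMC, scale-up factor)."""
    fraction = pmc_count / pubmed_count
    return fraction, 1 / fraction


# Counts from the "gene expression profiling"[mesh] query, 2007 to 2009.
fraction, scale = pmc_coverage(6311, 21569)
print(f"PMC coverage: {fraction:.2%}")
print(f"Scale-up factor: {scale:.2f}")
```

The unrounded factor is about 3.4, so rounding it down to 3 makes the extrapolated count somewhat more conservative.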
| - | ==Additional uses of this data collection==
| + | |
| - | * could use this data to see how many publications use any one dataset
| + | |
| - | * could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
| + | |
| - | * can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
| + | |
| - | | + | |
| - | ==Notes==
| + | |
| - | * this approach of querying PMC full text is similar (as per correspondence with the GEO team) to the one the GEO team uses to compile the 3rd party reuse page: http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
| + | |
| - | * would be nice to figure out how to write all of these columns to google docs directly from code
| + | |
| | | | |
| | + | ==Methods== |
| | + | ===Overview=== |
| | + | * Using the method outlined at [[DataONE:Protocols/Find_GEO_reuses]]: |
| | + | ** Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007 |
| | + | ** Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009 |
| | + | ** Enumerate the PMC papers that reused GEO data |
| | + | ** Estimate what percent of these papers depended on the GEO data for their scientific contribution |
| | | | |
| - | ==Open Questions== | + | ===Details=== |
| | + | * see [[DataONE:Protocols/Find_GEO_reuses]] |
| | | | |
| - | ==Limitations== | + | ==Results== |
| - | ===Important for argument===
| + | |
| - | This is a conservative estimate because:
| + | |
| - | * our estimates do not consider reuses after our study timeframe
| + | |
| - | ** many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
| + | |
| - | ** could estimate this impact if we examine data deposited 7 years ago?
| + | |
| - | * Many papers not in PubMed Central
| + | |
| - | ** using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
| + | |
| - | * our methods do not find studies that both create and reuse data
| + | |
| - | ** to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
| + | |
| - | ** we don't have an estimate of how many this is, would require manual inventory
| + | |
| - | * Many data citations not attributed using accession numbers
| + | |
| - | ** don't have a good way to estimate this yet
| + | |
| - | ** would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
| + | |
| - | ** maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
| + | |
| | | | |
| - | ===Less important for argument=== | + | ==Discussion== |
| - | * Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
| + | |
| - | * Deposits into PMC not stable over time, distribution may change over time
| + | |