DataONE:ArrayExpress metadata study

From OpenWetWare

(Difference between revisions)
(Aim: word fix)
Current revision (22:01, 20 July 2010)
(fixed link)
 
(10 intermediate revisions not shown.)
Line 3: Line 3:
'''THIS PROJECT IS MID-DEVELOPMENT.  RESULTS HERE ARE UNSTABLE,  INCOMPLETE, AND PERHAPS WILDLY WRONG.'''  That said, please enjoy your reading in the spirit of [http://en.wikipedia.org/wiki/Open_Notebook_Science Open Notebook Science] and I'd love to hear your thoughts and suggestions :)
-
=Aim=
+
*See document (and edit, using link at the bottom of the doc page) at google docs:
-
Is the quantity of metadata that documents a dataset associated with the number of times that dataset is reused? 
+
** https://docs.google.com/document/edit?id=1dKMv_YRq0-D_pqLHc9uvtpYA7jkKGWH2lZkJfaEn2Zs&hl=en
-
 
+
-
Is more complete metadata associated with an increased benefit for investigators in the form of increased citations?
+
-
 
+
-
=Title=
+
-
(tentative)
+
-
 
+
-
'''Is the quantity of metadata around scientific datasets associated with dataset reuse?  A first look.
+
-
'''
+
-
 
+
-
=Background=
+
-
 
+
-
Generating and curating metadata is time-consuming and thus expensive.  As we embrace scientific dataset archiving on a broad scale, the only cost-effective way to generate and curate metadata is to rely on author-supplied and automated metadata creation.  Asking for more metadata than necessary has obvious costs in attention, focus, and opportunity.  Authors may be deterred from depositing, and metadata development teams may spread themselves too thin in terms of design, validation, and maintaining currency.
+
-
 
+
-
How much metadata is the right amount?  Does more metadata result in more useful datasets?  There are several ways to estimate value:  surveys, observations, and download counts.  We suggest a supplementary analysis:  correlation of metadata fields with documented dataset reuse.
+
-
 
+
-
In this initial analysis, we looked only at the quantity of metadata associated with a scientific dataset, forgoing any assessment of its quality.  We looked at the number of fields populated and the length of free-text responses, comparing these variables with a rough estimate of how many times the accession number is mentioned in published biomedical literature.  In cases where the data deposit was associated with a published paper, we also studied the association of metadata quantity with the number of times the dataset-creating paper was cited.  Some of these citations may be in the context of dataset reuse, and they are a potentially powerful motivator for authors.
+
-
 
+
-
While this preliminary look has many limitations, we believe it represents a new type of evidence-based analysis that digital curators can use to inform their goals and efforts.
+
-
 
+
-
*cite relevant Dryad and hive pubs, others?  (esp Jane's presentation)
+
-
 
+
-
=Methods=
+
-
==Variables==
+
-
Dependent variables:
+
-
# reuses
+
-
# citations
+
-
 
+
-
Independent variables:
+
-
# ArrayExpress gives its microarray submissions a "MIAME score":  a number from 0 to 5 that quantifies whether the dataset has an associated array design, protocol, list of factors, processed data, and raw data.  Quantitative and fairly objective, if slightly superficial.
+
-
# We could attempt to account for confounders by including other independent variables for organism, size of the dataset, impact factor of publishing journal, disease of study, etc
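The 0-to-5 score described above can be pictured as a count of which of the five components are present. The following is only a minimal sketch under that reading; the class and field names are made up for illustration and are not ArrayExpress's actual schema or scoring code.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical field names, one per component listed in the text.
    has_array_design: bool
    has_protocol: bool
    has_factors: bool
    has_processed_data: bool
    has_raw_data: bool


def miame_score(sub: Submission) -> int:
    """Count how many of the five components are present (0 to 5)."""
    return sum([
        sub.has_array_design,
        sub.has_protocol,
        sub.has_factors,
        sub.has_processed_data,
        sub.has_raw_data,
    ])


# Illustrative submission with three of the five components present.
example = Submission(True, True, False, True, False)
example_score = miame_score(example)
```

Because each component contributes at most one point, the score says nothing about how well any component is documented, which matches the "slightly superficial" caveat.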
+
-
 
+
-
==Data collection==
+
-
Downloaded ArrayExpress metadata using custom Python code on July 22, 2009.  Open Source: <<link to git>>.  (Note the one-year gap before the reuse and citation data were collected.  This was due to an intervening thesis.  Also, updated metadata capture is not necessary because we would ideally be capturing the metadata that existed at the time reusers would have been searching it...)
+
-
 
+
-
Identified ArrayExpress reuse in PubMed Central using the ArrayExpress variant of the [[DataONE:Protocols/Find_GEO_reuses]] protocol.  Reuses captured on July 19, 2010.
+
-
 
+
-
Downloaded Scopus citation counts for the PMIDs listed in the ArrayExpress metadata.  Collected on July 19, 2010 using the [[DataONE:Protocols/Scopus_citation_counts_from_PMIDs]] protocol.
+
-
 
+
-
==Stats==
+
-
* log-linear regression?  Or ideally some more sophisticated stats that would account for the censored nature of the data, but I'm not handy with them yet.
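One possible reading of the log-linear regression idea is an ordinary least-squares fit of log(1 + count) against the MIAME score. The sketch below assumes that reading; it does not address the censoring concern noted above, the function name is hypothetical, and the toy values are constructed purely to check the arithmetic (they are not study data).

```python
import math


def fit_log_linear(scores, counts):
    """Least-squares fit of log(1 + count) = intercept + slope * score."""
    ys = [math.log1p(c) for c in counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    if sxx == 0:
        raise ValueError("scores must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy values, NOT study data: counts are constructed so that
# log(1 + count) equals 0.5 * score exactly.
toy_scores = [0, 1, 2, 3, 4, 5]
toy_counts = [math.exp(0.5 * s) - 1 for s in toy_scores]
toy_intercept, toy_slope = fit_log_linear(toy_scores, toy_counts)
```

Under this model, exp(slope) is the multiplicative change in (1 + count) per one-point increase in score. A count model such as Poisson regression would be another reasonable interpretation of "log-linear".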
+
-
 
+
-
=Results=
+
-
 
+
-
* <<link to ArrayExpress metadata dataset>>
+
-
* <<link to ArrayExpress reuse dataset>>
+
-
* <<link to Scopus data>>
+
-
 
+
-
=Discussion=
+
-
 
+
-
==Limitations==
+
-
There are many limitations of this preliminary analysis:
+
-
* nothing about the quality of the metadata
+
-
* direction of causation, or a third related factor... maybe higher-quality or more useful datasets attract more metadata
+
-
* demonstrated reuse is not the only dimension of value.  Metadata may not be correlated with increased usage, but it may decrease the amount of time that investigators spend finding the data they need and/or eliminating the data they don't need
+
-
* didn't eliminate same-author reuses of data (in the interests of time... this could be done....) which would presumably be unrelated to metadata content
+
-
* limitation in using these results to direct what metadata to collect:  of course the metadata people use today may not be the metadata that will be most useful to scientists 20 years from now
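The same-author limitation listed above could in principle be addressed by comparing author lists of the depositing paper and the reusing paper. This is only a rough sketch assuming author names are available as strings in "Surname Initials" form; the function names are hypothetical, and real name matching would need much more care (name variants, shared surnames).

```python
def normalize(name):
    """Lowercase a name and drop commas/periods: 'Smith, J.' -> 'smith j'."""
    cleaned = name.replace(",", " ").replace(".", " ").lower()
    return " ".join(cleaned.split())


def is_same_author_reuse(depositor_authors, reuser_authors):
    """True if any normalized author name appears on both papers."""
    depositors = {normalize(a) for a in depositor_authors}
    reusers = {normalize(a) for a in reuser_authors}
    return bool(depositors & reusers)
```

Reuses flagged this way could be dropped before counting, or kept and analysed separately.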
+
-
 
+
-
==Future work==
+
-
* Text analysis of metadata fields for content, à la Chris Taylor's work (http://www.nature.com/nbt/journal/v26/n8/abs/nbt0808-889.html) and Atul Butte's work (http://www.ncbi.nlm.nih.gov/pubmed/16404398)
+
-
* would be interesting to correlate with downloads
+

Current revision

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.
