DataONE:ArrayExpress metadata study: Difference between revisions

'''THIS PROJECT IS MID-DEVELOPMENT.  RESULTS HERE ARE UNSTABLE,  INCOMPLETE, AND PERHAPS WILDLY WRONG.'''  That said, please enjoy your reading in the spirit of [http://en.wikipedia.org/wiki/Open_Notebook_Science Open Notebook Science] and I'd love to hear your thoughts and suggestions :)


*See document (and edit, using link at the bottom of the doc page) at google docs:
** https://docs.google.com/document/edit?id=1dKMv_YRq0-D_pqLHc9uvtpYA7jkKGWH2lZkJfaEn2Zs&hl=en

=Aim=
Does the quantity of metadata associated with a dataset correlate with the number of times that dataset is reused?

Is more complete metadata associated with an increased benefit for investigators in the form of increased citations?
 
=Background=
=Methods=
==Variables==
Dependent variables:
# reuses
# citations
 
Independent variables:
# ArrayExpress gives its microarray submissions a "MIAME score": a number from 0 to 5 that quantifies whether the data set has an associated array design, protocol, list of factors, processed data, and raw data.  Quantitative and fairly objective, if slightly superficial.
# We could attempt to account for confounders by including other independent variables for organism, size of the dataset, impact factor of the publishing journal, disease of study, etc.
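
As a rough illustration of the MIAME score described above, here is a minimal Python sketch. It assumes the score is simply one point per component present, which is my reading of the description and not something confirmed here. The <code>Submission</code> class and its field names are hypothetical and are not taken from ArrayExpress.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical field names: one flag per component listed in the
    # MIAME score description above.
    has_array_design: bool
    has_protocol: bool
    has_factors: bool
    has_processed_data: bool
    has_raw_data: bool


def miame_score(submission: Submission) -> int:
    """Count how many of the five components are present (0 to 5).

    Assumption: one point per component; the real scoring may differ.
    """
    flags = [
        submission.has_array_design,
        submission.has_protocol,
        submission.has_factors,
        submission.has_processed_data,
        submission.has_raw_data,
    ]
    return sum(flags)
```

For example, a submission with an array design, a protocol, and processed data, but no factor list or raw data, would get a score of 3 under this assumption.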
 
==Data collection==
Downloaded ArrayExpress metadata using custom Python code on July 22, 2009.  Open Source: <<link to git>>.  (Note the one-year gap between this download and the reuse and citation captures below.  This was due to an intervening thesis.  Also, updated metadata capture is not necessary because we would ideally be capturing the metadata that existed at the time reusers would have been searching it.)
 
Identified ArrayExpress reuse in PubMed Central using the ArrayExpress variant of the [[DataONE:Protocols/Find_GEO_reuses]] protocol.  Reuses captured on July 19, 2010.
 
Downloaded Scopus citation counts for the PMIDs listed in the ArrayExpress metadata.  Collected on July 19, 2010 using the [[DataONE:Protocols/Scopus_citation_counts_from_PMIDs]] protocol.
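
The three collections above have to be joined before any analysis. The Python sketch below shows one way that join might look, assuming the metadata is keyed by ArrayExpress accession and lists PMIDs, reuse counts are keyed by accession, and citation counts are keyed by PMID. The function name, dictionary layouts, and example identifiers are all hypothetical; the actual files may be shaped differently.

```python
def build_rows(metadata, reuse_counts, citation_counts):
    """Join the three collections into one row per accession.

    Assumed (hypothetical) shapes:
      metadata:        {accession: {"miame_score": int, "pmids": [str, ...]}}
      reuse_counts:    {accession: int}
      citation_counts: {pmid: int}
    """
    rows = []
    for accession, record in metadata.items():
        # Only PMIDs that actually have a citation count contribute.
        known = [citation_counts[p] for p in record.get("pmids", [])
                 if p in citation_counts]
        rows.append({
            "accession": accession,
            "miame_score": record["miame_score"],
            # Assumption: no identified reuse means a count of zero.
            "reuses": reuse_counts.get(accession, 0),
            # None marks "no citation data", which is different from zero.
            "citations": sum(known) if known else None,
        })
    return rows


# Made-up identifiers and numbers, for illustration only.
example_rows = build_rows(
    {"E-TEST-1": {"miame_score": 4, "pmids": ["111", "222"]},
     "E-TEST-2": {"miame_score": 2, "pmids": []}},
    {"E-TEST-1": 3},
    {"111": 10, "222": 5},
)
```

Keeping "no citation data" as <code>None</code> instead of 0 avoids treating datasets without a linked PMID as uncited.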
 
==Stats==
* log-linear regression?  Or ideally some more sophisticated stats that would account for the censored nature of the data, but I'm not handy with them yet.
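
One simple reading of "log-linear regression" is an ordinary least-squares fit of log(1 + count) against the MIAME score. The Python sketch below shows that reading only. It is an assumption about what the analysis might look like, the function name is hypothetical, and it does not address the censoring issue mentioned above.

```python
import math


def fit_log_linear(scores, counts):
    """Least-squares fit of log(1 + count) = intercept + slope * score.

    Returns (intercept, slope). Uses log1p so that zero counts are allowed.
    """
    if len(scores) != len(counts) or len(scores) < 2:
        raise ValueError("need two or more paired observations")
    ys = [math.log1p(c) for c in counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    if sxx == 0:
        raise ValueError("scores must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

Under this model, a slope of b means each additional MIAME point multiplies (1 + count) by exp(b). A Poisson or negative binomial regression would be another reasonable interpretation of "log-linear" for count data.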
 
=Results=
 
* <<link to ArrayExpress metadata dataset>>
* <<link to ArrayExpress reuse dataset>>
* <<link to Scopus data>>
 
=Discussion=
 
==Future work==
* Text analysis of metadata fields for content, à la Chris Taylor's work (http://www.nature.com/nbt/journal/v26/n8/abs/nbt0808-889.html) and Atul Butte's work (http://www.ncbi.nlm.nih.gov/pubmed/16404398)

Latest revision as of 20:01, 20 July 2010

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.


