DataONE:Notebook/ArticleCitationPractices:Analysis

From OpenWetWare
Revision as of 17:16, 21 July 2010 by Sarah Judson (talk | contribs)

Preliminary Analysis

Outlined plan

  • emphasis on: is quality data reuse/sharing happening + HOW is data reuse/sharing happening
  • Criteria (reuse)
    • attribution
      • x percent cited original data authors in the biblio (attribution)
      • x percent of above were reuses of personal data (figure out a way to discern without credit to original authors (previous review) vs. previous study)
        • self reuses typically indicated in "Acquired" column, but should also be a combo of "How cited" = au and "Biblio" = 0
    • acquisition/resolvability
      • x percent mention the depository (acquisition)
      • x percent give accession (acquisition)
      • x percent only depository, x percent only accession
      • x percent both
    • if these correlate to journal or datatype (and maybe open access)
    • stats = ANOVA of percent (total?) Y/N vs. journal/datatype
    • how this has changed/improved by journal from 2000 to 2010 (and in more detail from 2005-2010 in AmNat/SysBio)
  • Scoring (reuse)
    • resolvable (depository + accession)
    • attribution (author year + accession + biblio, not self) - i.e. gives credit to the author and the data
    • ideal = resolvable + attribution
    • meets journal/depository recommendations
    • stats = ordinal regression/loglinear model (score vs. journal/datatype)
  • Criteria (sharing)
    • x percent mention depository
    • x percent give accession
    • x percent do both
    • x percent share all data
    • if these correlate to journal or datatype (and maybe open access)
    • stats = ANOVA of percent (total?) Y/N vs. journal/datatype
  • Scoring (sharing)
    • percent shared vs. produced
      • stats = ANOVA - the above metric is continuous, journal/datatype is categorical
    • resolvable (depository + accession)
  • discussion
    • what aspects of a citation (reuse or sharing) are most commonly missing? (pre-analysis, I would say accession, and especially author+accession... many either/or)

    • from observing current practices, we recommend "a, b, and c" for policies and authors
      • see Knoxville ppt (i.e. methods section: most do, so advise that all do = easier for copyeditors, authors, and reusers; preference for accession tables with author reference... give example)

      • report that no biblio citations had an hdl/doi/accession for the data --> suggest a data citation biblio format
      • report attribution and resolvability problems
        • appendix tables with citations = not linked to ISI, etc for tracking
        • dead urls (or difficult to navigate)
        • Migrated datasets (TreeBASE and SysBio problems in particular)
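The reuse scoring criteria above (resolvable = depository + accession; attribution = author year + accession + biblio, not self; ideal = both) could be expressed as a small scoring function. A minimal sketch — the field names (`depository`, `accession`, `author_year`, `in_biblio`, `self_reuse`) are illustrative assumptions, not the actual spreadsheet columns:

```python
# Sketch of the reuse-citation scoring scheme outlined above.
# Field names are assumed placeholders, not the real data columns.

def score_reuse(cite):
    """Return (resolvable, attribution, ideal) flags for one reuse citation."""
    resolvable = cite.get("depository", False) and cite.get("accession", False)
    attribution = (
        cite.get("author_year", False)
        and cite.get("accession", False)
        and cite.get("in_biblio", False)
        and not cite.get("self_reuse", False)  # self-reuse earns no attribution credit
    )
    ideal = resolvable and attribution
    return resolvable, attribution, ideal

# Example: depository + accession given, but no biblio entry
example = {"depository": True, "accession": True,
           "author_year": True, "in_biblio": False, "self_reuse": False}
print(score_reuse(example))  # → (True, False, False): resolvable but not attributed
```

The three boolean flags map naturally onto an ordinal score (0-3) for the planned ordinal regression against journal/datatype.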

For Knoxville

  • Snapshot and Time series of SysBio and AmNat
    • graphs/tables
      • side by side comparisons of sysbio and amnat
      • % reuse per year (per issue for snapshot)
      • which depositories used (and frequency)
      • % "proper" citation
      • type of data correlated with reuse
  • comments on Molecular Eco and Ecology snapshots
  • Anecdotal stuff
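The % reuse per year tallies for the SysBio/AmNat graphs could be computed from paper-level records like this. A minimal sketch, assuming each record is a (journal, year, reused) tuple — the layout and the sample values are made up for illustration, not real data:

```python
from collections import defaultdict

# Sketch: tally % reuse per journal per year from paper-level records.
# The (journal, year, reused) layout and values are assumed, not the
# actual spreadsheet format.
records = [
    ("SysBio", 2009, True), ("SysBio", 2009, False),
    ("SysBio", 2010, True), ("AmNat", 2009, False),
]

counts = defaultdict(lambda: [0, 0])  # (journal, year) -> [reuses, papers]
for journal, year, reused in records:
    counts[(journal, year)][0] += int(reused)
    counts[(journal, year)][1] += 1

pct_reuse = {key: 100.0 * r / n for key, (r, n) in counts.items()}
print(pct_reuse[("SysBio", 2009)])  # → 50.0
```

Grouping by issue instead of year (for the snapshot) only changes the key; the same tally works for % sharing and % "proper" citation.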

Final Analysis Ideas

  • supplementary vs. external vs. deposited vs. produced data (ratios of Y/N for paperdatasetcited, supplementary data, and data produced)
  • get opinions: maybe calculate a score of how well a dataset was cited (minus points for unspecified, plus double points for accession)
  • data produced vs. data shared/reuse may be a more fair metric than just reuse/share y/n
  • possibilities:
    • Percentages
      • % extinct urls from personal/other share list (illustrates one reason depositories should be employed)
      • % reuse per journal per year (or per discipline, funder, etc)
      • % sharing per journal per year
      • % sharing vs. % produced
      • % that could have been put in relevant depository but weren't (especially treebase)
    • Scoring
      • "quality" reuse citation (journal/repository specific?)
      • % sharing vs. % produced
    • Correlations
      • dataset type (or journal, discipline, nationality, funding type) to YN/quality reused/shared
      • open access to data reuse/sharing
      • something with multiple datasets
    • statistics: ordinal regression, anova, clustering, correlation
    • additional statistics to do: percent improvement from 2000 to 2010 (would be a better measure for journals like Ecology, which have few reuses); could do this for citation quality, incidences of reuse, and incidences of sharing (or the sharing/production %)
  • also (similar) % increase in utilization of depositories (esp. treebase)-->but then in discussion, state that treebase still under utilized judging by amount of pt and ga prodcuced that aren't posted. (? could depositories be more active and contact authors to deposit in them after they see a paper published? or accepted....this could capitalize on relationships with editoral boards of journals)