DataONE/Summer 2010/Research questions: Difference between revisions

From OpenWetWare

Revision as of 11:20, 10 June 2010

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.

Research Questions and Research Plans

Let's start brainstorming formal research questions; then you can flesh out the scope and add your research plans for a June 30th mini-deliverable.

Open Questions for mentors and the community

  • What is a good GIS/earth journal for analysis?
    • Use a journal that is well represented in Pangaea? And/or one affiliated with GSA? --Todd Vision 10:46, 10 June 2010 (EDT)
  • Recommendation for a specific paleontology journal?
    • I would recommend 'Paleobiology' as having broad interest, high impact papers --Todd Vision 10:46, 10 June 2010 (EDT)

Data citation practice inventory within journals (articles)

Owner: Sarah.

  1. What are various practices for data citation within academic papers? How prevalent is each variety?
  2. How do these practices vary across discipline, journal, data type, data source?
  3. How have these practices varied across time?

Good broad questions for now. I'm refining more specific questions and how they fit into the broader picture.

Scope and Plan

  • which journals? Starting with AmNat, SysBio, and MolecularEco. I will probably then move to some of the ESA-affiliated journals and a GIS/earth journal (need a suggestion). This will give broad coverage of subject types (in the order just mentioned: behavioral/model, systematics/phylogeny, genetics, ecology, earth/GIS). Then maybe Evolution, Nature, and Science because they are big names in biology, but these have broader coverage that includes the subject areas of the previously mentioned journals.
    • We have some survey results on scientist attitudes and behaviours that might sync up nicely with these results if we choose journals that reflect the scientists' fields. When asked "Which of the following best describes your primary field of concentration within evolutionary biology?" the top results were:
      • Behavior/Neurobiology 23%
      • Development/Morphology 21%
      • Ecology 17%
      • Genetics/Genomics 14%
      • Molecular evolution 8%
      • Paleontology 8%
      • Great! Thanks for this list. Is there more data from this survey? I'm interested.
    • I don't know which journals best sync up with these fields. See above. I may need a more specific paleontology journal.
  • which time periods? Starting with 2010 and moving back, probably annually through 2000 and then every five years before that. For now, I am taking the first issue(s) or 25-50 articles published in each year. I should move to random sampling to eliminate the effects of special-topic issues. The 2010 preliminary dataset, though not random, is important for investigating the extracted data and trends within a single journal issue.
  • what data will you extract? Still determining fields. Right now: keywords, article topic, dataset citation (Y/N), how the data are cited, whether the data are readily accessible, and whether the author reciprocally posts their dataset (Y/N, with the same nested questions as for dataset citation). I have an ever-expanding spreadsheet and am planning on a more refined database or Google Docs form soon.
  • how many datapoints do you expect? Many! Lots of articles. Planning on 100+ per journal, assuming we pick focal journals. This is especially feasible if I can dedicate more of my time to it, since Valerie has taken repositories, which seemed to be originally under my domain, and because Nic should be able to answer my journal-based questions with the data he is collecting.
  • what stats will you run? what is your statistical power? Still need to think on this. Baseline = % of articles in a journal that cite a dataset, % that do it properly, % that also post their data, etc. Beyond that, mostly correlations between data citation (or lack thereof) and journal, time, topic (field of concentration), open access, etc. These are relatively simple but may suffice. I'm interested in a more sophisticated method, but am not familiar with traditional statistics in the social sciences. Perhaps some multivariate clustering to establish which parameters determine whether data are cited. Open to suggestions, especially on common methods in the social sciences and specifically for data citation (if there are any)! Statistical power should be good because of the large sample size (many more articles than journals), though there are some issues with "unequal sampling" because some journals have fewer publications per year/issue.
  • what do you plan to have complete by June 30th? (1) Establish WHAT information is collected from each article. (2) Establish HOW information is collected (expedite manual searching, possibly text searching and database automation). (3) Get through the 2010 articles of SysBio, AmNat, and MolecularEco. (4) Evaluate continued article sampling (random, time-scale, by topic).
  • plans for integration with other intern work? I made brief comments below about collaboration, which I hope to update soon. I think a central database would standardize data collection (i.e. fields, character states). It would also make analysis easier, because an article (or journal or repository) could be evaluated against journal and repository trends/metadata as well (and vice versa for each of our focal areas).
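
A minimal Python sketch of how the extracted fields, the planned random sampling, and the baseline percentages above might fit together. The `ArticleRecord` class, its field names, and both helper functions are hypothetical illustrations, not the actual spreadsheet layout, and the field list is still being determined.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """One row of the extraction spreadsheet (field names are hypothetical)."""
    journal: str
    year: int
    cites_dataset: bool       # dataset citation (Y/N)
    data_accessible: bool     # is the cited data readily accessible?
    author_posted_data: bool  # did the author post their own dataset?


def sample_articles(records, per_group, seed=0):
    """Draw up to `per_group` random articles from each (journal, year) group.

    Sampling at random within a year, rather than taking the first issue,
    is meant to reduce the effect of special-topic issues.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.journal, rec.year)].append(rec)
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled


def citation_rates(records):
    """Return {journal: fraction of its articles that cite a dataset}."""
    totals = defaultdict(int)
    citing = defaultdict(int)
    for rec in records:
        totals[rec.journal] += 1
        citing[rec.journal] += int(rec.cites_dataset)
    return {journal: citing[journal] / totals[journal] for journal in totals}
```

The same pattern as `citation_rates` would apply to the other baseline percentages (for example, the share of articles whose authors also post their data) by swapping the field that is counted.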

Data sharing and citation policies for journals, funding sources and repositories

Owner: Nic

Description

In this project I will investigate data management policies for the existence (or absence) of requirements that researchers share and cite data. This will be accomplished in two phases. In phase I, I will collect data management policies from a number of journals, repositories and funding sources in order to quantitatively assess data sharing and citation requirements. In phase II, I will try to determine the impact of the policies based on correlations with Sarah's and Valerie's data.

Scope and Plan

Project will be carried out in two phases

Phase 1 : Collecting and "quantifying" various attributes of policies


I'll use the following sources (please add sources):

  • Journals: SysBio, AmNat
  • Repositories: TreeBASE, GenBank, Pangaea
  • Foundations / Funding Bodies: NSF, JISC, AU ANDS


I will collect the following elements from each source (linked to the Google Docs spreadsheets):


Metadata: Journals, Repositories + Funders

Policy Data

Phase 2 : Determining Impact of Policies

This will be done by correlating my quantified policy data with Valerie and Sarah's reuse data. (More to come)
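
One possible shape for this correlation step, as a rough Python sketch only. It assumes each journal ends up with a single numeric policy score (from Phase 1) and a single data-citation rate (from the article inventory); both the scoring scheme and the function names `paired_values` and `pearson` are assumptions for illustration, and the actual method is still to be decided.

```python
from math import sqrt


def paired_values(policy_scores, citation_rates):
    """Align two {journal: value} dicts on the journals present in both."""
    journals = sorted(set(policy_scores) & set(citation_rates))
    xs = [policy_scores[j] for j in journals]
    ys = [citation_rates[j] for j in journals]
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

With only a handful of journals, a coefficient like this would be weak evidence on its own, so it is probably best read alongside the per-journal percentages rather than as a result in itself.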

Data citation practice inventory for repositories

Owner: Valerie

  1. What are all the ways that data housed in given repositories are cited or attributed?
  2. How do these practices vary across discipline, journal, data type, data source?
  3. How have these practices varied across time?

(Very similar to Sarah's project, above.) I have some ideas on repository inventory that I haven't been able to explore yet, so we should talk about ideas and approaches. I'll post more later; email me if I haven't by around June 14, or if you want to collaborate sooner! - Sarah

Link to repository public spreadsheet on Google Docs

Scope and Plan

  • which repositories? TreeBASE, Pangaea, the ORNL DAAC archive?
  • how will you bound the problem? a subset of repository entries? a subset of journals for citation and attribution links?
  • what methods will be used to search for citation and attributions? using which search resources?
  • what is the estimated coverage of these methods? Could come from Sarah's project results.
  • how many datapoints do you expect?
  • what stats will you run? what is your statistical power?
  • what do you plan to have complete by June 30th?
  • plans for integration with other intern work?
  • plans for integration/parallel analysis with Heather's NCBI GEO work?
    • I'll flesh out some background info on this and provide links. Feel free to ask in the meantime.

More via June 9, 2010 email from Heather

A few more things:

  • do these databases or repositories have "accession numbers"? If so, what is the format of the accession numbers? For example, for NCBI's GEO database, the accession number format is GDSxxxxxx or GSExxxxxx, and sometimes people cite data just by mentioning the accession number, so we need to be able to search the article full text for GDS* or GSE*.
  • maybe it would be interesting to add columns for Dryad, GenBank, NCBI's Gene Expression Omnibus database, and the ArrayExpress database. I say this because it would help us draw comparisons to those sources, even though we probably won't be looking for citations to these databases in the literature.
  • it might be worth copying the full paragraphs of the databases' reuse policies into the spreadsheet, for reference. A link to the page where they discuss their policies would definitely be helpful.
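
The full-text search for GEO accession numbers could look something like the Python sketch below. It assumes the "x" placeholders in GDSxxxxxx / GSExxxxxx stand for digits and that the article text is already available as a plain string; the function name `find_geo_accessions` is made up for illustration. Other repositories would need their own patterns once their accession formats are known.

```python
import re

# Assumption: a GEO accession is "GDS" or "GSE" followed by one or more digits.
GEO_ACCESSION = re.compile(r"\b(?:GDS|GSE)\d+\b")


def find_geo_accessions(full_text):
    """Return distinct GEO accession numbers in order of first appearance."""
    found = []
    for accession in GEO_ACCESSION.findall(full_text):
        if accession not in found:
            found.append(accession)
    return found
```

An article with an empty result may still cite the data some other way (for example, by citing the original paper), so a pattern match like this only gives a lower bound on reuse.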