DataONE/Summer 2010/Research questions
Revision as of 12:14, 11 June 2010
This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.
Research Questions and Research Plans
Let's start brainstorming formal research questions; then you can flesh out the scope and add your research plans for a June 30th mini-deliverable.
Open Questions for mentors and the community
- The students don't have access to all the journals they need through their home institutions (e.g., Simmons College). Can we set up some guest access to other DataONE-affiliated resources?
- What is a good GIS/earth journal for analysis?
- Use a journal that is well represented in Pangaea? And/or one affiliated with GSA? --Todd Vision 10:46, 10 June 2010 (EDT)
- Recommendation for a specific paleontology journal?
- I would recommend 'Paleobiology' as having broad interest, high impact papers --Todd Vision 10:46, 10 June 2010 (EDT)
Data citation practice inventory within journals (articles)
Owner: Sarah.
- What are various practices for data citation within academic papers? How prevalent is each variety?
- How do these practices vary across discipline, journal, data type, data source?
- How have these practices varied across time?
Good broad questions for now; I'm refining more specific questions and how they fit into the broader picture.
Scope and Plan
- which journals? Starting with AmNat, SysBio, MolecularEco. Probably will then move to some of the ESA-affiliated journals and a GIS/earth journal (need suggestion). This will give broad coverage of subject types (in the order mentioned: behavioral/model, systematics/phylogeny, genetics, ecology, earth/GIS). Then maybe Evolution, Nature, and Science because they are big names in biology, though their coverage is broader and overlaps the subjects of the journals mentioned above.
- We have some survey results on scientist attitudes and behaviours that might sync up nicely with these results if we choose journals that reflect the scientists' fields. When asked "Which of the following best describes your primary field of concentration within evolutionary biology?" the top results were:
- Behavior/Neurobiology 23%
- Development/Morphology 21%
- Ecology 17%
- Genetics/Genomics 14%
- Molecular evolution 8%
- Paleontology 8%
- Great! Thanks for this list. Is there more data from this survey? I'm interested.
- I don't know which journals best sync up with these fields (see above). I may need a more specific paleontology journal.
- which time periods? Starting with 2010, moving back. Probably annually through 2000 and then every five years before. For now, the first issue(s) or 25-50 journals published that year. Should move to random sampling to eliminate the effects of special topic issues. The 2010 preliminary dataset, though not random, is important for investigating the extracted data and trends within a single journal issue.
- what data will you extract? Still determining fields. Right now: keywords, article topic, dataset citation (Y/N), how the data are cited, whether the data are readily accessible, and whether the author reciprocally posts their own dataset (Y/N, same nested questions as with dataset citation). I have an ever-expanding spreadsheet... planning on a more refined database or Google Docs form soon.
- how many datapoints do you expect? Many! Lots of articles. Planning on 100+ per journal, assuming we pick focal journals. Especially if I can dedicate more of my time to this, since Valerie has taken on repositories (which seemed to be originally under my domain) and Nic should be able to answer my journal-based questions with the data he is collecting.
- what stats will you run? what is your statistical power? Still need to think on this. Baseline = % of articles in a journal that cite a dataset, % that do it properly, % that also post their own data, etc. Beyond that, mostly correlations between data citation (or lack thereof) and journal, time, topic (field of concentration), open access, etc. These are relatively simple but may suffice. I'm interested in a more sophisticated method, but am not familiar with traditional statistics in the social sciences. Perhaps some multivariate clustering to establish which parameters determine whether data are cited. Open to suggestions, especially on common methods in the social sciences and specifically for data citation (if there are any)! Statistical power should be good because of the large sample size (many more articles than journals), though there are some issues with "unequal sampling" because some journals have fewer publications per year/issue.
- what do you plan to have complete by June 30th? 1. Establish WHAT information is collected from each article, 2. Establish HOW information is collected (expedite manual searching, possibly text searching and database automation), 3. Get through 2010 articles of SysBio, AmNat, MolecularEco, 4. Evaluate continued article sampling (random, time-scale, by topic)
- plans for integration with other intern work? I made brief comments below about collaboration, which I hope to update soon. I think a central database would standardize data collection (i.e., fields, character states). It would also ease analysis, because an article (or journal or repository) could be evaluated for journal and repository trends/metadata as well (and vice versa for each of our focal areas).
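The random sampling and baseline percentages described in the plan above could be sketched roughly as follows. This is only a minimal Python illustration, not the project's actual workflow: the function names and the record keys (`journal`, `cites_dataset`, `posts_own_data`) are hypothetical stand-ins for spreadsheet fields that are still being decided.

```python
import random
from collections import defaultdict


def sample_articles(article_ids, k, seed=0):
    """Draw a reproducible simple random sample of up to k article IDs.

    Sampling across a whole year (rather than taking the first issue)
    is one way to reduce the effect of special-topic issues.
    """
    rng = random.Random(seed)
    ids = list(article_ids)
    return rng.sample(ids, min(k, len(ids)))


def baseline_rates(records):
    """Return per-journal baseline percentages.

    Each record is a dict with hypothetical keys:
    'journal' (str), 'cites_dataset' (bool), 'posts_own_data' (bool).
    """
    totals = defaultdict(int)
    cites = defaultdict(int)
    posts = defaultdict(int)
    for rec in records:
        journal = rec["journal"]
        totals[journal] += 1
        cites[journal] += bool(rec["cites_dataset"])
        posts[journal] += bool(rec["posts_own_data"])
    return {
        journal: {
            "n": totals[journal],
            "pct_cites_dataset": 100.0 * cites[journal] / totals[journal],
            "pct_posts_own_data": 100.0 * posts[journal] / totals[journal],
        }
        for journal in totals
    }
```

Reporting `n` next to each percentage keeps the "unequal sampling" issue visible, since journals with few sampled articles will have noisier rates.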
Data sharing and citation policies for journals, funding sources and repositories
Owner: Nic
Description
In this project I will investigate data management policies for the existence (or absence) of requirements that researchers share and cite data. This will be accomplished in two phases. In phase I, I will collect data management policies from a number of journals, repositories and funding sources in order to quantitatively assess data sharing and citation requirements. In phase II, I will try to determine the impact of the policies based on correlations with Sarah and Valerie's data.
Scope and Plan
Project will be carried out in two phases
Phase 1: Collecting and "quantifying" various attributes of policies
I'll use the following sources (please add sources):
- Journals: SysBio, AmNat, MolecularEco
- Repositories: TreeBASE, GenBank, Pangaea
- Foundations / Funding Bodies: NSF, JISC, AU ANDS
I will collect the following elements from each source (linked to a Google Docs spreadsheet):
Metadata: Journals, Repositories + Funders
Phase 2: Determining Impact of Policies
This will be done by correlating my quantified policy data with Valerie and Sarah's reuse data. (More to come)
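One simple way to do this kind of correlation is a Pearson coefficient between a per-source policy score and a per-source reuse rate. The sketch below is only an assumption about how that might look; the function name and the idea of a single numeric "policy score" per source are hypothetical and not part of the plan above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    xs: hypothetical quantified policy scores, one per source.
    ys: hypothetical reuse or citation rates for the same sources,
        in the same order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

It would be called as `pearson_r(policy_scores, reuse_rates)` with both lists aligned by source. With only a handful of journals or repositories the coefficient is best read as descriptive rather than as a significance test.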
Deliverables
As of 6/30 (if scope seems narrow, please comment):
- policies retrieved and data / metadata extracted for sources
- comprehensive list of funders and metadata for resources Sarah and Valerie are working with
Data citation practice inventory for repositories
Owner: Valerie
- What are all the ways that data housed in given repositories are cited or attributed?
- How do these practices vary across discipline, journal, data type, data source?
- How have these practices varied across time?
(Very similar to Sarah's project, above.)
- I have some ideas on repository inventory that I haven't been able to explore yet; we should talk about ideas/approaches. I'll post more later. Email me if I don't by June 14ish or if you want to collaborate sooner! - Sarah
Link to repository public spreadsheet on Google Docs
Scope and Plan
- which repositories? TreeBASE, Pangaea, the ORNL DAAC archive?
- how will you bound the problem? a subset of repository entries? a subset of journals for citation and attribution links?
- what methods will be used to search for citation and attributions? using which search resources?
- what is the estimated coverage of these methods? Could come from Sarah's project results.
- how many datapoints do you expect?
- what stats will you run? what is your statistical power?
- what do you plan to have complete by June 30th?
- plans for integration with other intern work?
- plans for integration/parallel analysis with Heather's NCBI GEO work?
- I'll flesh out some background info on this and provide links... feel free to ask in the meantime
I will start looking for the following ways that TreeBASE is cited in articles:
- Mention of TreeBASE or TreeBANK
- DOI or URI
- Full citation as per TreeBASE recommendations.
- Mention of data author only
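The first two citation types in the list above (a name mention, a DOI or URI) lend themselves to a simple text search. The following is a rough Python sketch under stated assumptions: the regular expressions and the function name are hypothetical, the DOI flag fires on any DOI in the passage (not only TreeBASE ones), and the last two types (full recommended citation, data author only) would still need manual review.

```python
import re

# Assumed patterns; they should be adjusted after inspecting real articles.
# Matches "TreeBASE", "Treebase", "Tree BASE", "TreeBANK", etc.
NAME_RE = re.compile(r"\btree\s?ba(?:se|nk)\b", re.IGNORECASE)
# Matches the common "10.<registrant>/<suffix>" DOI shape.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
# Matches any http(s) URL that contains "treebase".
URI_RE = re.compile(r"https?://\S*treebase\S*", re.IGNORECASE)


def treebase_mentions(passage):
    """Flag which kinds of TreeBASE reference appear in a text passage.

    Intended for a sentence or paragraph, not a whole article, because
    reference lists contain many unrelated DOIs.
    """
    return {
        "name_mention": bool(NAME_RE.search(passage)),
        "doi": bool(DOI_RE.search(passage)),
        "treebase_uri": bool(URI_RE.search(passage)),
    }
```

A passage containing a TreeBASE URL will also set `name_mention`, since the name is part of the URL; the flags are meant as a first-pass filter before a spreadsheet entry is made by hand.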
I will look in the following databases and journals:
- ISI Web of Science
- Scirus
- Nature
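- This information will go into a spreadsheet housed here: TreeBASE Citations (http://spreadsheets.google.com/ccc?key=0AgM1E1R2tI_6dE1LYlYtWHRXblNXa3ladXNNY3BDbEE&hl=en)
- My observations will also be posted here: Reuse of repository data (DataONE:Notebook/Reuse_of_repository_data)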
Milestones
- 6/30/2010: completed spreadsheet and report summarizing findings