DataONE:Notebook/Summer 2010/2010/07/22

==Email:Scoring and Stats questions_Dataone==
17 messages
Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 3:49 PM
To: hpiwowar@gmail.com
Sorry, one question I forgot (but isn't urgent since I have a lot to chew on):  should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?
To reiterate (it's kind of a combination of resolvability and attribution):
Ideal (previously "Good") citation score
Ideal_CitationYN* This came out the same as my Knoxville calculation of author+depository+accession
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both
Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable
Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession
To adapt to an ordinal scale, it could either be:
Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")  or resolvable ("Yes" in "ResolvableYN")
2=attributed and resolvable
Ideal_CitationScoreGoodGradient
0=none
1=depository only or author only or accession only
2=(depository and author) or (depository and accession) or (author and accession)
3=depository, author, and accession
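For concreteness, a minimal R sketch of how the collapsed versions could be derived, assuming 0/1 columns with these invented names (not taken from the actual analysis files) in the data frame a:
a$Ideal_CitationScoreSimple <- a$AttributionYN + a$ResolvableYN  # 0=neither, 1=either, 2=both
a$Ideal_CitationScoreGoodGradient <- a$Depository + a$Author + a$Accession  # 0-3, count of components present
table(a$Ideal_CitationScoreSimple, a$Ideal_CitationScoreGoodGradient)  # one way to crosstab the two scores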
Thoughts?
This might be another "out of the scope of this project" or it might be redundant with resolvability and attribution or it might be essential...I dunno. As I rethink, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable).....alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?
Again, no rush...I've got plenty to work on.
Thanks!!!
Sincerely,
Sarah Walker Judson
[Quoted text hidden]
Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 4:09 PM
To: hpiwowar@gmail.com
Also,
you mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009 for sysbio and amnat only) to get a bigger sample size. the former was collected sequentially and the latter randomly. i'm a bit worried this affects assumptions about data collection, but don't know if this is as strict of an assumption in this arena as in biology. i was thinking of running both separately and then pooling, then choosing one as the focus (probably the 2000/2010) for reporting and stating whether or not the other set produced similar results.
thoughts?
thanks.
Sincerely,
Sarah Walker Judson
[Quoted text hidden]
Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 6:31 PM
To: hpiwowar@gmail.com
Heather -
I've got notes from my reattempts this afternoon, so expect a lengthy email following this (but possibly not until tomorrow morning).
Main success: getting results I understand and generating meaningful questions (I feel like I'm on the verge of having something to report)
Main problem: can't get all the factors to run at once.
So, that's what i'm hoping for help on....I'm planning on writing a more detailed email about successes and further questions, but for now I'll just barf my code into this email b/c maybe you've run into this problem before. The input file is also attached. My notes on the error are at the bottom as part of the code. My apologies for the mess...mostly, my husband's complaining that I'm still on the computer rather than eating dinner, so I'll send the long version later.
a=read.csv("ReuseDatasetsSnap.csv")
attach(a)
names(a)
str(a)
xtabs(~ Journal+ResolvableScoreRevised)
xtabs(~ YearCode+ResolvableScoreRevised)
xtabs(~ DepositoryAbbrv+ResolvableScoreRevised)
xtabs(~ DepositoryAbbrvOtherSpecified+ResolvableScoreRevised)
xtabs(~ TypeOfDataset+ResolvableScoreRevised)
xtabs(~ BroaderDatatypes+ResolvableScoreRevised)
library(Design)
ddist4<- datadist(Journal+YearCode+DepositoryAbbrv+BroaderDatatypes)    #can't get this to run with all the factors at once, or even two at a time
options(datadist='ddist4')
ologit4<- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass)
print(ologit4)
#modify to match output before running --> Y repeats
#sf gives the observed log-odds (cumulative logits) of Y >= each level, for
#eyeballing the proportional odds assumption
sf <- function(y)
      c('Y>=0'=qlogis(mean(y >= 0)),'Y>=1'=qlogis(mean(y >= 1)),
      'Y>=2'=qlogis(mean(y >= 2)))
s <- summary(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, fun=sf)
s
text(Stop!)#modify to match output before running -->which, xlim
plot(s, which=1:3, pch=1:3, xlab='logit', main=' ', xlim=c(-2.3,1.7))
#Error in datadist(Journal + YearCode + DepositoryAbbrv + BroaderDatatypes) :
#  fewer than 2 non-missing observations for Journal + YearCode + DepositoryAbbrv + BroaderDatatypes
#In addition: Warning messages:
#1: In Ops.factor(Journal, YearCode) : + not meaningful for factors
#2: In Ops.factor(Journal + YearCode, DepositoryAbbrv) :
#  + not meaningful for factors
#3: In Ops.factor(Journal + YearCode + DepositoryAbbrv, BroaderDatatypes) :
#  + not meaningful for factors
## i got this error before when I was running it non-factor, but it cleared up when I either ran fewer variables at once or coded to dummy variables (1,2,3,4, etc) instead of letters (ea, eco, bio, etc)
## internet searches primarily turn up code that i don't understand or a few discussion forums that don't make sense to me
## search terms used: "datadist" & "not meaningful for factors"; "datadist" & "fewer than 2 non-missing"
##don't get this problem when running each factor separately. i ran most separately to practice interpretation.  main problem is that ME (journal) is correlated with genbank (depository) and gene (datatype) = each comes out significantly "better" when run as a separate model (factor by factor)... this is where a multiple factor model (which isn't working) would come in handy...to (maybe) tease these apart (i.e. is publishing in ME, reusing from genbank, or using a gene what determines resolvability/attribution)
Sincerely,
Sarah Walker Judson
[Quoted text hidden]
ReuseDatasetsSnap.csv
259K
Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 8:08 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Hi Sarah,
Hmmmm interesting.  Possibly related to zeros in your crosstabs?  I may have some ideas.  No time now but will dig in tomorrow morning.
Heather
[Quoted text hidden]
Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:17 AM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Sarah,
Good question.  I'd say yes, go for it.  I agree, it helps to flesh out the story. I like the first option better...  a linear combination of the other two, then, isn't it?
Heather
[Quoted text hidden]
Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:22 AM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
So based on my gut and what Todd said, I think combine them.  Include a binary variable for snapshotYN... we hope that that one is not significant, but it will help catch things if it is.  And you have a variable for year already, right?
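A minimal sketch of that pooled setup, with library(Design) loaded as in the earlier code (the data frame names snapshot and timeseries are assumptions for illustration, not from the actual files):
pooled <- rbind(snapshot, timeseries)  # stack the two samples
pooled$snapshotYN <- c(rep(1, nrow(snapshot)), rep(0, nrow(timeseries)))  # flag the collection method
# keep YearCode in the model so year effects are not absorbed by snapshotYN
ologit_pooled <- lrm(ResolvableScoreRevised ~ Journal + YearCode + snapshotYN, data = pooled)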
I'm not worried about interpretability.
All of this said, once we get the stats working in general, it is probably worth an email to the data_citation list summarizing your approach and prelim interpretations, so they can give feedback if anything is out of whack methodologically.
Heather
[Quoted text hidden]
Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:23 AM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Sarah, I'm busy till 10am but will have time after that to dig in.
Heather
[Quoted text hidden]
Sarah Walker Judson <walker.sarah.3@gmail.com> Fri, Jul 23, 2010 at 10:33 AM
To: hpiwowar@gmail.com
Heather -
Thanks so much again for your help! Here's the promised "long version"....sorry it took me a while to get it to you. I apologize for the length and don't necessarily expect lengthy responses to each portion; writing this out is helping me think through it all and hopefully helping you become more acquainted with my data. I indicate the most important questions with a double asterisk.
First, attached (PrelimOutputs.txt) are some preliminary results with my interp written under each output. Even though I know we want to be running multiple factors at once (I think) I found this to be a useful exercise to familiarize myself with the statistic and running it in R. It started to reveal some support for trends I was expecting to see (i.e. gene sequences are more resolvable and attributed), so that's promising.
**Like I said before, the main problem I'm having is running all the factors together. I don't understand the error I'm getting and can't find much help on it (at least not that I can understand). This would be especially helpful for distinguishing if publishing in Molecular Ecology, using a gene sequence, or utilizing Genbank is most influential in having a resolvable/attributable data citation. But at the same time, these are all correlated so it might just be more of a mess because of multicollinearity.
I have some more specific questions in general and about the attribution/resolvability scoring. 
In general:
- You mentioned "subdiscipline" as a factor yesterday. Were you referring to what I call "data type" or the discipline of the journal? Concerning the latter, many of the journals are classified (according to ISI) by usually two of our three major disciplines. E.g., American Naturalist is classified as Ecology and EvoBio, GCB is classified as Environmental Science and Ecology. Few have just one. I coded this for now as binary for each discipline, but given the existing problems with multiple factors, this might be too much to add. Also, I tend to think most of the journals belong to one discipline more strongly than the other...i.e. I would say AmNat is Ecology, Sysbio is EvoBio, and GCB is Environ Sci, etc. This would also reduce the number of factors for this category. Thoughts?
**- By testing so many factors and character states, aren't we pretty prone to Type 1 error? How do we "prevent" this? Does running factors separately vs. combined help at all?
- For some papers I have multiple datasets per paper. During data collection, I had them all pooled and separated by commas to indicate nuances. Primarily, I only split an article into multiple datasets if they were different datatypes OR if one dataset was a self reuse and the other was acquired via another mechanism. There are about 5-10 instances where a dataset was split even though they were the same datatype because one was attributed/resolvable and the other was not (i.e. they were acquired in different ways). Will this lead to independence problems? (P.S. I have some preliminary sentences about this for the methods; if this doesn't make sense, let me know and I'll send them.)
**- For some of my factors, I have both a "broad" and a "specified" classification. I'm more inclined to the broad for stats, but always hate to toss resolution. Right now I'm most inclined to keep datatypes broad and depository specific. Here are the classifications for comparison.
Datatypes - Specified (*how data was collected)
Bio = organismic, living
Paleo = organismic, fossil
Eco = community (multi-species)
GS = gene sequence
GA = gene alignment
GO = other gene (blots, protein)
Ea = earth (soil, weather, etc)
GIS = layers
XY = coordinates
PT = phylogenetic tree
Datatypes - Broad (*What i am currently using)
G = gene (GS, GA, GO)
O = organismic (living and fossil = Bio and Paleo)
S= spatial (GIS, XY)
Eco = community (multi-species)
Ea = earth (soil, weather, etc)
PT = phylogenetic tree
Datatypes - Broader still (haven't attempted)
Ecology = organismic (Eco, Bio, Paleo)
Environ Sci= spatial & earth
EvoBio = gene (PT, GS, GA, GO)
Depository - Specified (*currently using)
G = Genbank
T = treebase
U = url or database (non-depository)
E = extracted literature
O = other (correspondence, not indicated)
Depository - Broad (*results similar to above)
G = Genbank
T = treebase
O = other (url, extracted, correspondence, not indicated)
Depository - Binary (haven't attempted)
D = Depository (i.e. people can both deposit and extract data = genbank, treebase)
O= other
Resolvability:
- I'm having a little problem that will probably require recoding: I only counted a depository reference if it was in the body of the text, but not in supplementary appendices or even a table caption. Partway through data collection, I started counting a depository reference if it was in the table caption, but still not if it was in the supplementary caption. I want to get your opinion on how this should be coded in the resolvability categories:
0="no information, can't find it" = none of the below
1="could find it with extra research"= depository or author or accession ONLY
2="could find it just with info provided in the paper" = depository and (author and/or accession)
**I think a table with Genbank mentioned in the table caption and accessions given therein should be a "2". However, I think Genbank mentioned in the header of an appendix followed by accession (i.e. same table as previous but in supplementary information) should be counted as a "1" because you would have to track down the supplementary information, which in the case of sysbio and other articles is difficult. Again, this is considering that Genbank was never mentioned in the body of the paper, but the authors said something like "additional information about sequences is provided in appendix a". This gives the reader no guarantee that when they dig up the appendix it will actually have accession numbers, as it may just describe which taxa each sequence was from or a museum voucher number for the specimen. So that's my bias, just want to see if it's justified in your mind. It will require some recoding no matter what.
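A rough sketch of that rule in R, with all of the 0/1 indicator columns below invented purely for illustration:
# 2 = depository named where the data details are (body or table caption) plus author and/or accession
# 1 = some component present, but extra work needed (e.g. depository named only in supplementary material)
# 0 = no component present
a$ResolvableScoreRevised <- with(a,
  ifelse((DepInBody | DepInTableCaption) & (Author | Accession), 2,
  ifelse(DepInBody | DepInTableCaption | DepInSupp | Author | Accession, 1, 0)))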
a little bit of a problem with f/ac 
 
Attribution:
Another quick question about scoring which like the above requires lengthy text to explain. Here are my final scoring categories and explanations:
0= "the data is not attributed" - no author or acession (no author also = a self citation (i.e. previous review paper) but other reuse...i.e. original data authors not attributed at all)
1 = "the data is indirectly attributed" - accession only or author only (author only also  = a self citation (i.e. previous review paper) but other reuse...i.e. original data authors not attributed at all) - this still includes self reuses of previous data. In the discussion, I would then talk about what % self reuse occurs as a caveat/modifier about this information
2 = "the data is directly attributed" - author and accession - regardless of self. I think if the author reused their own data and gave the accession number, that's great (it happened so rarely, so I appreciated it when it did....it seemed less like personal aggrandizement to rack up a citation and more of open data sharing...of "hey, you can use this data too" rather than, "please go read my other publications to see if you want my data and maybe can dig up how to get it in those other papers because i don't feel like explaining it here"
So, my explanations probably show my bias. I think it's ok to include self reuses partly because the sample size is small already and partly because some people legitimately reuse their own data. However, I don't think it should count when they cite themself but really used other data (what I call self citation/other reuse....meaning they refer to their previous collection of data and vaguely state that it was from external sources, but give no credit to the original data authors in the current paper and they might in the previous paper, but I don't think we can assume that). So, again, do my biases/categories seem justified? Should we just throw out self reuses altogether as you've been doing. Also, I should note that as I mentioned above (in "independence"), self reuses and other reuses from the same article were separated for analysis.
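The two treatments could be run side by side; a sketch, assuming a hypothetical 0/1 SelfReuseYN column:
prop.table(table(a$SelfReuseYN))  # share of self reuses, for the % caveat in the Discussion
without_self <- subset(a, SelfReuseYN == 0)  # the throw-them-out alternative (shrinks the sample)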
Well, hopefully you survived all that. Thanks again for your diligence and continued help!!
Sincerely,
Sarah Walker Judson
[Quoted text hidden]
PrelimOutputs.txt
21K
Sarah Walker Judson <walker.sarah.3@gmail.com> Fri, Jul 23, 2010 at 10:59 AM
To: hpiwowar@gmail.com
One crazy idea about the multiple factor problem. It worked when I ran everything together as dummy variables (not binary like the link you sent, but 0,1,2,3, etc)...that was how I ran it before our chat yesterday. I could numerically code/rank the journals, datatypes, etc according to their coefficients when run separately, then run all the factors together to maybe get at which is the most influential (journal vs. datatype vs. year). I dunno if that even works at all, but it's the only plausible workaround I can think of given I don't know a ton about this method. It's probably totally unconventional, but I thought I might as well mention it.
Sincerely,
Sarah Walker Judson
P.S. I'm on gchat most of the day (i.e until 6pm), but will be invisible as usual.
[Quoted text hidden]
Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 1:10 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Hi Sarah.  I'm going to go have lunch and then come back and chat.
Question:  have you tried datadist with commas rather than +s? 
so:
ddist4<- datadist(Journal,YearCode,DepositoryAbbrv,BroaderDatatypes) 
These lines seem to run successfully:
ddist4<- datadist(Journal,YearCode,DepositoryAbbrv,BroaderDatatypes) #can't get this to run with all the factors at once, or even two at a time
options(datadist='ddist4')
ologit4<- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass)
print(ologit4)
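The comma version works because datadist takes each variable as a separate argument. With +, R instead tries to add the factors arithmetically; addition is undefined for factors (hence the Ops.factor warnings), the sum comes back all NA, and datadist then reports "fewer than 2 non-missing observations". In short:
datadist(Journal, YearCode)  # separate arguments: summarizes each variable's distribution
# Journal + YearCode         # factor "addition": an NA vector, the source of the error above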
ok, more chatting later,
Heather
==Email: Advice on data collections in stats==
3 messages
Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 8:07 PM
Reply-To: hpiwowar@gmail.com
To: Todd Vision <tjv@bio.unc.edu>, Sarah Walker Judson <walker.sarah.3@gmail.com>
Todd,
Could do with some stats advice.
Sarah collected data in two different ways:  randomly and consecutively.  My guess is that she can concatenate these for her main analysis... maybe with a binary variable indicating the type of data collection to hopefully catch artifacts.
That said, I'm a bit unsure and I don't want to lead her down the wrong path. 
What do you think?
Heather
---------- Forwarded message ----------
From: Sarah Walker Judson <walker.sarah.3@gmail.com>
Date: Thu, Jul 22, 2010 at 4:09 PM
Subject: Re: Scoring and Stats questions_Dataone
To: hpiwowar@gmail.com
Also,
you mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009 for sysbio and amnat only) to get a bigger sample size. the former was collected sequentially and the latter randomly. i'm a bit worried this affects assumptions about data collection, but don't know if this is as strict of an assumption in this arena as in biology. i was thinking of running both separately and then pooling, then choosing one as the focus (probably the 2000/2010) for reporting and stating whether or not the other set produced similar results.
thoughts?
thanks.
Sincerely,
Sarah Walker Judson
On Thu, Jul 22, 2010 at 3:49 PM, Sarah Walker Judson <walker.sarah.3@gmail.com> wrote:
Sorry, one question I forgot (but isn't urgent since I have a lot to chew on):  should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?
To reiterate (it's kind of a combination of resolvability and attribution):
Ideal (previously "Good") citation score
Ideal_CitationYN* This came out the same as my Knoxville calculation of author+depository+accession
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both
Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable
Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession
To adapt to an ordinal scale, it could either be:
Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")  or resolvable ("Yes" in "ResolvableYN")
2=attributed and resolvable
Ideal_CitationScoreGoodGradient
0=none
1=depository only or author only or accession only
2=(depository and author) or (depository and accession) or (author and accession)
3=depository, author, and accession
Thoughts?
This might be another "out of the scope of this project" or it might be redundant with resolvability and attribution or it might be essential...I dunno. As I rethink, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable).....alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?
Again, no rush...I've got plenty to work on.
Thanks!!!
Sincerely,
Sarah Walker Judson
Todd Vision <tjv@bio.unc.edu> Fri, Jul 23, 2010 at 4:41 AM
To: "hpiwowar@gmail.com" <hpiwowar@gmail.com>
Cc: Sarah Walker Judson <walker.sarah.3@gmail.com>
I haven't been following the discussion closely enough to be sure, but a general approach would be to combine, test for heterogeneity and, in its absence, accept the stats on the combined sample.  But the population of study for the two sets sounds sufficiently different (i.e. the identity of journals, as opposed to the random/sequential distinction) that a combined analysis would be difficult to interpret.  That said, I'm not sure you wouldn't want to use them to ask the same questions.
Todd
[Quoted text hidden]
--
Todd Vision
Associate Professor
Department of Biology
University of North Carolina at Chapel Hill
Associate Director for Informatics
National Evolutionary Synthesis Center
http://www.nescent.org
Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:14 AM
Reply-To: hpiwowar@gmail.com
To: Todd Vision <tjv@bio.unc.edu>
Cc: Sarah Walker Judson <walker.sarah.3@gmail.com>
Thanks Todd, that's helpful.
Heather


==Email:Scoring and Stats questions_Dataone (messages 1-7)==
Sarah Walker Judson <walker.sarah.3@gmail.com> Wed, Jul 21, 2010 at 5:15 PM
To: Heather Piwowar <hpiwowar@gmail.com>
Heather -

I'm running into some hurdles with my data analysis. First, I want to confirm my scoring categories with you and second, as I got deeper into the statistics, I realized that most of my experience has been with continuous, not categorical data...so, I need your opinion on some things. And just to warn you, this is a rather lengthy email. I can send my Excel and R files if it's easier, but those are messy at best. On second thought, I'll post them to OWW (Excel here, R here)

So, first, the scoring categories: For each, I have a binomial (YN) and ordinal (scored) version. I'm still not sure which is best for each aspect. I'd like your opinion on binomial vs. ordinal and the scoring levels I've proposed. I'm more inclined to scoring because it gives a more detailed picture of what is happening in the data, but I also worry that I've made too many categories. At this point, these are just for reuse, I'm planning on coding Sharing tomorrow in a similar manner.

1. Resolvability (Could the dataset be retrieved from the information provided?)
ResolvableYN
1=Y=Depository and Accession
0=N=lacking one or the other or both
ResolvableScore
0=no Depository or Accession or Author (Justification: you know they used data but not exactly how it was obtained = probably couldn't find it again…i.e. "data was obtained from the published literature")
1=Author Only (Justification: you could track down the original paper which might contain an accession or extractable info)
2=Depository or Database Only (Justification: you might be able to look up the same species/taxon and find the information per the criteria in the methods)
3=Accession Only (Justification: accession number given but depository not specified = you would probably be able to infer which depository it came from based on format, just as I was usually able to tell that they were Genbank sequences by the format even though Genbank was never mentioned anywhere in the paper)
4=Depository and Author (Justification: although no accession given, many depositories also have a search option for the author/title of the original article which connects to the data)
5=Depository and Accession (Justification: "best" resolvability…unique id and known depository = should find exact data that was used)

2. Attribution (Is proper credit given to original data authors?)
AttributionYN
1=Y=Author and Accession (biblio assumed)
0=N=lacking one or the other or both
AttributionScore *I have two alternatives for this: one that doesn't worry about "self" citations and counts them the same as others (i.e. combines 6&7 and 4&5), and another that throws out all the "self" citations (and cuts my already small sample size by a lot!)
0=no author/biblio or accession
1=self citation, other reuse (Justification: author refers to a previous review paper of theirs, but not the original data authors…this assumes that the original data authors are attributed in the previous paper)
2=organization or URL only (Justification: data collectors/project acknowledged, but not specific individuals or relevant publications)
3=accession only (Justification: data acknowledged)
4=author/biblio, but self
5=author/biblio only, not self (Justification: original data author acknowledged and this is the currently accepted mode of attribution)
6=author + accession, but self
7=author + accession, not self (Justification: attribution to author and data…and this is the mode of attribution we hope for)

3. Ideal (previously "Good") citation score
Ideal_CitationYN* This came out the same as my Knoxville calculation of author+depository+accession
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both
Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable
Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession

And now, the stats. While in Knoxville, we talked about doing an ordinal regression. However, as I tried it out and read up on it, I think it's the wrong match for this data. I think it is primarily used for survey analysis where you are looking for an interaction of two variables both ranked on an ordinal scale (i.e. do people that "strongly agree" with a political issue also classify themselves as "strongly conservative"). Case in point, this R example. Maybe I'm reading the literature wrong, but I don't think this is the right fit...let me know what you think (and if you have a good resource!)

Instead, I think I should be using either chi-squared or a linear model for categorical data (i.e. binomial or Poisson distribution). I've run both and am a little stuck on interpretation, but can figure that out. I wanted to get your opinion on using chisq vs. a Poisson glm. The main pros for chi = easy and p-value output; the main con = low resolution (i.e. it says something is up but not what). The main pros for Poisson = higher resolution (i.e. specifies what factors are most significant/influential) and the ability to look at multiple factors at once (then arriving at the "best" model of combined explanatory factors based on AIC); the main con for Poisson = I'm not super up to speed on the interpretation.

Here is a summary of some of my R analysis (I can send you the R history or TinnR file if you want) on the Resolvability aspect to give you an idea of the outputs.

1. Tables
> b=table(ResolvableYN,DatasetType);b
            DatasetType
ResolvableYN Bio Ea Eco GA GIS GO GS PA PT XY
           0  22 35  10  5  10  5 33 19  8  5
           1   0  0   0  0   0  0 17  0  1  0
> bb=table(ResolvableYN,BroaderDatatypes);bb
            BroaderDatatypes
ResolvableYN EA Eco  G  O PT  S
           0 35  10 43 41  8 15
           1  0   0 17  0  1  0
> cc=table(ResolvableScore,DatasetType);cc
               DatasetType
ResolvableScore Bio Ea Eco GA GIS GO GS PA PT XY
              0   6 15   3  0   4  0  3  4  1  3
              1  13 10   6  2   2  4 12 10  5  0
              2   1  8   0  2   2  1  6  3  0  0
              3   0  0   0  0   0  0  4  0  0  0
              4   2  2   1  1   2  0  8  2  2  2
              5   0  0   0  0   0  0 17  0  1  0
> ccc=table(ResolvableScore,BroaderDatatypes);ccc
               BroaderDatatypes
ResolvableScore EA Eco  G  O PT  S
              0 15   3  3 10  1  7
              1 10   6 18 23  5  2
              2  8   0  9  4  0  2
              3  0   0  4  0  0  0
              4  2   1  9  4  2  4
              5  0   0 17  0  1  0
2. Chi-Squared
> chisq.test(table(ResolvableScore,DatasetType))
        Pearson's Chi-squared test
data:  table(ResolvableScore, DatasetType)
X-squared = 98.1825, df = 45, p-value = 7.922e-06
Warning message:
In chisq.test(table(ResolvableScore, DatasetType)) :
  Chi-squared approximation may be incorrect
3. Linear Model - Poisson (alternative = binomial or zero inflated for "ResolvableYN")
> poisson = glm(ResolvableScore~DatasetType,data=a,family=poisson)
> summary(poisson)
Call:
glm(formula = ResolvableScore ~ DatasetType, family = poisson, data = a)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.47386  -1.37229  -0.04478   0.60369   2.29458
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.04445    0.20851   0.213   0.8312
DatasetTypeEa  -0.07344    0.26998  -0.272   0.7856
DatasetTypeEco -0.04445    0.37878  -0.117   0.9066
DatasetTypeGA   0.64870    0.37879   1.713   0.0868 .
DatasetTypeGIS  0.29202    0.33898   0.861   0.3890
DatasetTypeGO   0.13787    0.45842   0.301   0.7636
DatasetTypeGS   1.07396    0.22364   4.802 1.57e-06 ***
DatasetTypePA   0.18916    0.29180   0.648   0.5168
DatasetTypePT   0.64870    0.31470   2.061   0.0393 *
DatasetTypeXY   0.42555    0.41042   1.037   0.2998
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 283.03 on 169 degrees of freedom
Residual deviance: 212.36 on 160 degrees of freedom
AIC: 566.94
Number of Fisher Scoring iterations: 5
4. Ordered Logit Model (library MASS, function polr)
polr(formula = as.ordered(ResolvableScore) ~ DatasetType, data = a)
Coefficients:
                    Value Std. Error    t value
DatasetTypeEa  -0.1696435  0.4978595 -0.3407457
DatasetTypeEco -0.1249959  0.6788473 -0.1841296
DatasetTypeGA   1.3932769  0.8160008  1.7074454
DatasetTypeGIS  0.2656731  0.7333286  0.3622839
DatasetTypeGO   0.6023778  0.8112429  0.7425369
DatasetTypeGS   2.3786340  0.4917395  4.8371833
DatasetTypePA   0.3831136  0.5557081  0.6894152
DatasetTypePT   1.0288666  0.7284316  1.4124410
DatasetTypeXY  -0.2736481  1.1134912 -0.2457569
Intercepts:
    Value   Std. Error t value
0|1 -0.6708 0.3879     -1.7294
1|2  1.2789 0.4015      3.1855
2|3  2.0552 0.4237      4.8511
3|4  2.2097 0.4285      5.1573
4|5  3.3495 0.4775      7.0144
Residual Deviance: 483.3118
AIC: 511.3118

Sincerely,
Sarah Walker Judson

P.S. I'm planning to post this to OWW, just thought email would be the best mode of communication for these questions at this point.
Sarah Walker Judson <walker.sarah.3@gmail.com> Wed, Jul 21, 2010 at 7:23 PM
To: Heather Piwowar <hpiwowar@gmail.com>
And here's a bunch of pivot tables that help show the trends....helps put the stats/scoring, etc. in perspective.

Sincerely,
Sarah Walker Judson
[Quoted text hidden]

DataSets_Pivot.xls 1296K
Heather Piwowar <hpiwowar@gmail.com> Wed, Jul 21, 2010 at 9:22 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>, Todd Vision <tjv@bio.unc.edu>
Sarah,

I wasn't quite sure where to respond on OWW, so feel free to copy my email comments over. Also, CCing Todd because he knows his stats and I want to make sure I send you down the right path.

Thanks for the full summary... it made it very easy to understand your question and think about the alternatives.

I hear you on the ordered having more information in it, so let's try to find some statistics to take advantage of that. Your levels make sense. You do have finely grained levels... this may be a problem if it makes the data too sparse for good estimates. Your crosstabs suggest it is pretty sparse across your covariates of interest. I'd go with it as is for now (I saw a paper that argues for maintaining lots of levels even in small datasets), but keep the fact that it is sparse in mind... an obvious fix is to collapse some levels if it seems necessary.

Stats. I'm not very familiar with Poisson in this context either. I could see how chi squared could be applied, but agree that it seems to leave a lot of the power on the table. And its output isn't very informative, as you said.

So let's go back and think about ordinal logistic regression again. I think that may still be quite appropriate. Here's the best document that I could find... this group does great R writeups. http://www.ats.ucla.edu/stat/r/dae/ologit.htm

What do you think? I think that your levels are definitely analogous to a Likert scale, or soft drink sizes.

This example helps with gut feel as well: http://www.uoregon.edu/~aarong/teaching/G4075_Outline/node27.html

The two approaches in R seem to be MASS::polr and Design::lrm

I'd probably go with the latter because that is what the cool tutorial above uses :)

What do you think? After reading these refs do you still feel like it isn't appropriate? If so, let's talk it through.... I'm avail on chat tomorrow most of the day (multitasking with a remote meeting) so feel free to initiate chat whenever.
Heather


On Wed, Jul 21, 2010 at 5:15 PM, Sarah Walker Judson <walker.sarah.3@gmail.com> wrote:
[Quoted text hidden]

Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 12:12 PM
To: hpiwowar@gmail.com
Cc: Todd Vision <tjv@bio.unc.edu>
Heather -

Thank you very much for the prompt help!

The UCLA link was very helpful and interesting....cool stuff. I ran my data following the tutorial. I didn't have any problems running it, but my data clearly violates a number of the assumptions:

1. Small cells/empty cells: because of the number of categories, I had many zeros or small values in my crosstabs. They warn against this, saying the model either won't run at all or will be unstable....I'm not clear what they mean by "unstable". Mine ran, but I don't know if we can trust the results. (See attached "OrdinalLogisticOutput" for results.)

2. Proportional odds assumption: My data did not hold up to either the parallel slopes or plot tests of this (see attached .txt and .jpg).

3. Sample size. Their example was with 400, plus it was mostly binary, so the samples weren't splayed out among many categories. Mine is 170 (for just the 2000/2010 comparison....I have about 100 more if I pool the 2000/2010 "snapshots" and the Time Series) and distributed among a lot of potential categories.

In general, I'm still concerned about the nature of my data in this analysis. They give two examples at the top that match my data, but then the one they use as an example is with more progressive/scalar categories (i.e. your parents had no education, the next "logical" step is that they did get an education). Mine on the other hand is A, B, and C which have no relation to each other....i.e. journal 1, journal 2, and journal 3 or datatype A, B, and C. I don't know if I'm articulating that well, but from their example, I think my type of data would work, but I'm unclear how I would interpret the results. Especially the coefficients....for example, the UCLA example says "So for pared [parent education level], we would say that for a one unit increase in pared (i.e., going from 0 to 1), we expect a 1.05 increase in the expected value of apply [likelihood of applying to grad school] on the log odds scale, given all of the other variables in the model are held constant." I don't know how you would interpret this from journal to journal or datatype to datatype.
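One piece of the puzzle here: for factor covariates, each fitted coefficient is a contrast against the omitted reference level rather than a literal one-unit increase, so a term like DatasetTypeGS reads as "GS relative to the baseline level, all else held constant." The baseline can be changed if another comparison is more natural; a sketch on the DatasetType column used above:
a$DatasetType <- relevel(factor(a$DatasetType), ref = "GS")  # make GS the reference level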

Also, I would get the books they recommend to figure out the best approach/interpretation, but I'm operating out of the world's smallest library (seriously, smaller than my apartment) and don't know other ways to obtain the books besides the limited previews on google books. Even more so, I'm almost positive it would take me over a week to get them short of a road trip to LA or thereabouts.

I'm on gchat (just invisible) if you want to hash things out now. I decided to email so I could articulate and ponder over my thoughts better. Thanks again for your help now and throughout this project!

Sincerely,
Sarah Walker Judson
[Quoted text hidden]

3 attachments
Test_ProporitionalOddsAssumption.jpeg 52K
ParallelSlopeTable.txt 5K
OrdinalLogisticOutput.txt 2K
Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 12:36 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Cc: Todd Vision <tjv@bio.unc.edu>
Hi Sarah,

Nice job on the fast analysis and thoughtful interpretation (or interpretation attempts, as the case may be).

I will read and think and respond in the next few hours.

fwiw I do have "Frank E. Harrell, Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis. Springer, New York, 2001. [335-337]" at home (it is a great book in many ways! recommended) and would be happy to zoom any relevant pages to you.... will see if that is helpful.

More soon, Heather

[Quoted text hidden]
Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 2:14 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Cc: Todd Vision <tjv@bio.unc.edu>
Sarah,

I'm going to write in email too, to help organize my thoughts.

These responses are going to sound like I'm arguing for the ordinal regression. I'm not, per se.... just trying to fully see if it can work before we go to something else.

1. Small cells/empty cells. Agreed, a potential problem. I think it would be a potential problem for most statistical techniques, because it is hard to estimate from little information. That said, there are some algorithms that are designed to deal well with this, like Fisher's exact test in place of chi squared.

I think by "unstable" they mean very sensitive to individual datapoints. One way to test this is to do a loop wherein you exclude a datapoint, recompute, see if it changed anything drastically. I don't think this makes sense to do first, but we can plan to do it at the end if we are worried about potential instability.
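That loop might look roughly like this, reusing the polr fit from the earlier email (a sketch only; some leave-one-out subsets may fail to converge):
library(MASS)
loo <- sapply(seq_len(nrow(a)), function(i) {
  refit <- polr(as.ordered(ResolvableScore) ~ DatasetType, data = a[-i, ])
  coef(refit)  # slope estimates with row i held out
})
apply(loo, 1, range)  # how far each coefficient swings across the refits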

2. Proportional odds assumption

I'm going to do a bit more reading here. There are related algorithms for non-proportional ordinal regressions, though no obvious best choices in R....

another idea might be to collapse the levels into 3 or 4 and see if they are more proportional then (since the chance of having 6 things happen to be proportional is lower than 3 things, more sensitive to outliers...)

3. Sample size. I'd add in the extra 100. Also, there is no suggestion that their sample size was a minimum.... That said, I agree, we are trying to estimate a lot of parameters based on not very much data. Rules of thumb are always tricky, and they depend on estimates of effect size, which of course we don't know yet. That said, a rule of thumb is to have 30 datapoints for every multivariate coefficient you are trying to estimate. 6 levels takes up 6, 6*30=180, and that is before estimating anything for your covariates.

So maybe another argument to collapse your levels down to 3 for now???
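A sketch of that collapse on the six-level resolvability score (the cut points below are illustrative only, not a recommendation):
a$ResolvableScore3 <- cut(a$ResolvableScore, breaks = c(-1, 0, 3, 5),
                          labels = c("none", "partial", "full"), ordered_result = TRUE)
table(a$ResolvableScore3)  # check that no collapsed level is left nearly empty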

4. Factor/categorical variables. Yup, your journals and subdisciplines are factors. I don't believe this will cause a problem. I would model them with dummy variables (one variable for each of your journals and subdisciplines, binary 0/1). Of course that is a lot of covariates, but I think that is the only way to have interpretable results.

A bit on dummy variables here: http://www.psychstat.missouristate.edu/multibook/mlt08m.html

I know that the Design library often does smart things with factor variables, too.... so before you create dummy variables you could try redefining your journal variable as a factor, feed that in, and see what it does.... "If you have constructed those variables as factors, the regression functions in R will interpret them correctly, i.e. as though the dummies were in there. " as per here.
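Both routes in miniature (journal_dummies is an invented name; Journal is the column used in the earlier code):
a$Journal <- factor(a$Journal)  # let the regression functions expand the factor into dummies
journal_dummies <- model.matrix(~ Journal - 1, data = a)  # or build explicit 0/1 columns yourself
head(journal_dummies)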

ok, I'm going to end my stream of consciousness there, do a bit more reading, then find you for an interactive chat.
Heather


[Quoted text hidden]
Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 2:24 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Cc: Todd Vision <tjv@bio.unc.edu>
Sarah,

This link may be of interest too. It is the parallel tutorial to the "ordered" regression one.... this is for regressing against categories/factors where there is no order. Not what you are doing and so it probably loses a lot of power, but it definitely doesn't have a proportional odds assumption!

could be informative to give it a try on your data for fun if easy, pretending that your levels were all distinct unrelated labels?

http://www.ats.ucla.edu/stat/R/dae/mlogit.htm
Heather
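If it is easy to try, one common route is nnet::multinom (an assumption on my part; the tutorial's own package may differ):
library(nnet)
mfit <- multinom(factor(ResolvableScoreRevised) ~ Journal + BroaderDatatypes, data = a)
summary(mfit)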


[Quoted text hidden] Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 3:49 PM To: hpiwowar@gmail.com Sorry, one question I forgot (but isn't urgent since I have a lot to chew on): should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?

To reiterate (it's kind of a combination of resolvability and attribution): Ideal (previously "Good") citation score Ideal_CitationYN* This came out the same as my Knoxville calculation of author+depository+accession 1=Y=Resolvable + Attribution (adding the two previous yes and no categories) 0=N=lacking one or the other or both

Ideal_CitationScoreSimple 0=not resolvable or attributed 1=attributed ("Yes" in "AttributionYN") 2=resolvable ("Yes" in "ResolvableYN") 3=attributed and resolvable

Ideal_CitationScoreGoodGradient 0=none 1=depository only 2=author only 3=accession only 4=depository and author 5=depository and accession 6=author and accession 7=depository, author, and accession

To adapt to an ordinal scale, it could either be: Ideal_CitationScoreSimple 0=not resolvable or attributed 1=attributed ("Yes" in "AttributionYN") or resolvable ("Yes" in "ResolvableYN") 2=attributed and resolvable

Ideal_CitationScoreGoodGradient 0=none 1=depository only or author only or accession only 2=(depository and author) or (depository and accession) or (author and accession) 3=depository, author, and accession

Thoughts?

This might be another "out of the scope of this project" or it might be redundant of resolvability and attribution or it might be essential...I dunno. As I rethink, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable).....alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?

Again, no rush...I've got plenty to work on.

Thanks!!!

Sincerely,

Sarah Walker Judson [Quoted text hidden] Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 4:09 PM To: hpiwowar@gmail.com Also,

you mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009 for sysbio and amnat only) to get a bigger sample size. the former was collected sequential and the later randomly. i'm a bit worried this affects assumptions about data collection, but don't now if this is as strict of an assumption in this arena as in biology. i was thinking of running both separately and then pooling and then choosing one as the focus (probably the 2000/2010) on for reporting and stating if/if not the other sets produced similar results.

thoughts?

thanks.

Sincerely, Sarah Walker Judson [Quoted text hidden] Sarah Walker Judson <walker.sarah.3@gmail.com> Thu, Jul 22, 2010 at 6:31 PM To: hpiwowar@gmail.com Heather -

I've got notes from my reattempts this afternoon, so expect a lengthy email following this (but possibly not until tomorrow morning).

Main success: getting results I understand and generating meaningful questions (I feel like I'm on the verge of having something to report) Main problem: can't get all the factors to run at once.

So, that's what i'm hoping for help on....I'm planning on writing a more detailed email about successes and further questions, but for now I'll just barf my code into this email b/c maybe you've run into this problem before. The input file is also attached. My notes on the error are at the bottom as part of the code. My apologies for the mess...mostly, my husband's complaining that I'm still on the computer rather than eating dinner, so I'll send the long version later.

a=read.csv("ReuseDatasetsSnap.csv") attach(a) names(a) str(a) xtabs(~ Journal+ResolvableScoreRevised) xtabs(~ YearCode+ResolvableScoreRevised) xtabs(~ DepositoryAbbrv+ResolvableScoreRevised) xtabs(~ DepositoryAbbrvOtherSpecified+ResolvableScoreRevised) xtabs(~ TypeOfDataset+ResolvableScoreRevised) xtabs(~ BroaderDatatypes+ResolvableScoreRevised) library(Design) ddist4<- datadist(Journal+YearCode+DepositoryAbbrv+BroaderDatatypes) #can't get this to run with all the factors at once, or even two at a time options(datadist='ddist4') ologit4<- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass) print(ologit4)

  1. modify to match output before running-->#Y repeats
sf <- function(y)
     c('Y>=0'=qlogis(mean(y >= 0)),'Y>=1'=qlogis(mean(y >= 1)),
     'Y>=2'=qlogis(mean(y >= 2)))  

s <- summary(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, fun=sf) s text(Stop!)#modify to match output before running -->which, xlim plot(s, which=1:3, pch=1:3, xlab='logit', main=' ', xlim=c(-2.3,1.7))

  1. Error in datadist(Journal + YearCode + DepositoryAbbrv + BroaderDatatypes) :
  2. fewer than 2 non-missing observations for Journal + YearCode + DepositoryAbbrv + BroaderDatatypes
  3. In addition: Warning messages:
  4. 1: In Ops.factor(Journal, YearCode) : + not meaningful for factors
  5. 2: In Ops.factor(Journal + YearCode, DepositoryAbbrv) :
  6. + not meaningful for factors
  7. 3: In Ops.factor(Journal + YearCode + DepositoryAbbrv, BroaderDatatypes) :
  8. + not meaningful for factors
    1. i got this error before when I was running in it non-factor, but it cleared up when I either ran less variables at once or coded to dummy variables (1,2,3,4, etc) instead of letters (ea, eco, bio, etc)
    2. internet searches primarily turning up code that i don't understand or a few dicussion forms that don't make sense to me
    3. search terms used: "datadist" & "not meaningful for factors"l; "datadist" and "fewer than 2 non-missing"
    4. don't get this problem when running each factor separately. i ran most separately to practice interpretation. main problem is that ME (journal) is correlated with genbank (depository) and gene (datatype) = each comes out significantly "better" when run as a separate model (factor by factor)... this is where a multiple factor model (which isn't working) would come in handy...to (maybe) tease these apart (i.e. is publishing in ME, reusing from genbank, or using a gene what determines resolvability/attribution)


Sincerely, Sarah Walker Judson [Quoted text hidden]

ReuseDatasetsSnap.csv 259K Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 8:08 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Hi Sarah,

Hmmmm interesting. Possibly related to zeros in your crosstabs? I may have some ideas. No time now but will dig in tomorrow morning.

Heather


Heather

[Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:17 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Sarah,

Good question. I'd say yes, go for it. I agree, it helps to flush out the story. I like the first option better... a linear combination of the other two, then, isn't it? Heather


[Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:22 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> So based on my gut and what Todd said, I think combine them. Include a binary variable for snapshotYN... we hope that that one is not significant, but it will help catch things if it is. And you have a variable for year already, right?

I'm not worried about interpretability.

All of this said, once we get the stats working in general, it is probably worth an email to the data_citation list summarizing your approach and prelim interpretations, so they can give feedback if anything is out of wack methodologically. Heather


[Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:23 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Sarah, I'm busy till 10am but will have time after that to dig in. Heather


[Quoted text hidden] Sarah Walker Judson <walker.sarah.3@gmail.com> Fri, Jul 23, 2010 at 10:33 AM To: hpiwowar@gmail.com Heather -

Thanks so much again for your help! Here's the promised "long version"....sorry it took me awhile to get it to you. I apologize for the length and don't necessarily expect lengthy responses to each portion, writing this out is helping me think through it all and hopefully helping you become more acquainted with my data. I indicate the most important questions with a double asterix.

First, attached (PrelimOutputs.txt) are some preliminary results with my interp written under each output. Even though I know we want to be running multiple factors at once (I think) I found this to be a useful exercise to familiarize myself with the statistic and running it in R. It started to reveal some support for trends I was expecting to see (i.e. gene sequences are more resolvable and attributed), so that's promising.

    • Like I said before, the main problem I'm having is running all the factors together. I don't understand the error I'm getting and can't find much help on it (at least not that I can understand). This would be especially helpful for distinguishing if publishing in Molecular Ecology, using a gene sequence, or utilizing Genbank is most influential in having a resolvable/attributable data citation. But at the same time, these are all correlated so it might just be more of a mess because of multicollinearity.


I have some more specific questions in general and about the attribution/resolvability scoring.

In general: - You mentioned "subdiscpline" as a factor yesterday. Were you referring to what I call "data type" or the discipline of the journal? Concerning the later, many of the journals are classified (according to ISI) by usually two of our three major disciplines. I.E. American Naturalist is classified as Ecology and EvoBio, GCB is classified as Environmental Science and Ecology. Few have just one. I coded this for now as binary for each discipline, but given the existing problems with multiple factors, this might be too much to add. Also, I tend to think most of the journals belong to one discipline more strongly than the other...i.e. I would say AmNat is Ecology, Sysbio is EvoBio, and GCB is Environ Sci, etc. This would also reduce the number of factors for this category. Thoughts?

    • - By testing so many factors and character states, aren't we pretty prone to Type 1 error? How do we "prevent" this? Does running factors separately vs. combined help at all?

- For some papers I have multiple datasets per paper. During data collection, I had them all pooled and separated by commas to indicate nuances. Primarily, I only split an article into multiple datasets if they were different datatypes OR if one dataset was a self reuse and the other was acquired via another mechanism. There are about 5-10 incidences where a dataset was split even though they were the same datatype because one was attributed/resolvable and the other was not (i.e. they were acquired in different ways). Will this lead to independence problems? (P.S. I have some preliminary sentences about this for the methods if this doesn't make sense, let me know if you need it).

    • - For some of my factors, I have both a "broad" and a "specified" classification. I'm more inclined to the broad for stats, but always hate to toss resolution. Right now I'm most inclined to keep datatypes broad and depository specific. Here are the classifications for comparison.

Datatypes - Specified (*how data was collected) Bio = organismic, living Paleo = organismic, fossil Eco = community (multi-species) GS = gene sequence GA = gene alignment GO = other gene (blots, protein) Ea = earth (soil, weather, etc) GIS = layers XY = coordinates PT = phylogenetic tree

Datatypes - Broad (*What i am currently using) G = gene (Gs, Ga, Go) O = organismic (living and fossil = Bio and Paleo) S= spatial (GIS, XY) Eco = community (multi-species) Ea = earth (soil, weather, etc) PT = phylogenetic tree

Dataypes - Broader still (haven't attempted) Ecology = organismic (Eco, Bio, Paleo) Environ Sci= spatial & earth EvoBio = gene (PT, GS, GA, GO)

Depository - Specified (*currently using) G = Genbank T = treebase U = url or database (non-depository) E = extracted literature O = other (correspondence, not indicated)

Depository - Broad (*results similar to above) G = Genbank T = treebase O = other (url, extracted, correspondence, not indicated)

Depository - Binary (haven't attempted) D = Depository (i.e. people can both deposit and extract data = genbank, treebase) O= other


Resolvability:

- I'm having a little problem that will probably require recoding where I only counted a depository reference if it was in the body of the text, but not in supplementary appendices or even a table caption. I think I started counting a depository reference later in data collection if it was in the table caption, but still not if it was in the supplementary caption. I want to get your opinion on how this should be coded in the resolvability categories: 0="no information, can't find it" = none of the below 1="could find it with extra research"= depository or author or accession ONLY 2="could find it just with info provided in the paper" = depository and (author and/or accession)


    • I think a table with Genbank mentioned in the table caption and accessions given therein should be a "2". However, I think Genbank mentioned in the header of an appendix followed by accession (i.e. same table as previous but in supplementary information) should be counted as a "1" because you would have to track down the supplementary information, which in the case of sysbio and other articles is difficult. Again, this is considering that genbank was never mentioned in the body of the paper, but the authors said something like "additional information about sequences is provided in appendix a". This gives the reader no guarantee that when they dig up the appendix that it will actually have accession numbers, as it may just describe which taxa each sequence was from or a museum voucher number for the specimen. So that's my bias, just want to see if it's justified in your mind. It will require some recoding no matter.


a little bit of a problem with f/ac  


Attribution: Another quick question about scoring which like the above requires lengthy text to explain. Here are my final scoring categories and explanations:

0= "the data is not attributed" - no author or acession (no author also = a self citation (i.e. previous review paper) but other reuse...i.e. original data authors not attributed at all)



1 = "the data is indirectly attributed" - accession only or author only (author only also = a self citation (i.e. previous review paper) but other reuse...i.e. original data authors not attributed at all) - this still includes self reuses of previous data. In the discussion, I would then talk about what % self reuse occurs as a caveat/modifier about this information

2 = "the data is directly attributed" - author and accession - regardless of self. I think if the author reused their own data and gave the accession number, that's great (it happened so rarely, so I appreciated it when it did....it seemed less like personal aggrandizement to rack up a citation and more of open data sharing...of "hey, you can use this data too" rather than, "please go read my other publications to see if you want my data and maybe can dig up how to get it in those other papers because i don't feel like explaining it here"

So, my explanations probably show my bias. I think it's ok to include self reuses, partly because the sample size is small already and partly because some people legitimately reuse their own data. However, I don't think it should count when they cite themselves but really used other data (what I call self citation/other reuse), meaning they refer to their previous collection of data and vaguely state that it was from external sources, but give no credit to the original data authors in the current paper (they might in the previous paper, but I don't think we can assume that). So, again, do my biases/categories seem justified? Should we just throw out self reuses altogether, as you've been doing? Also, I should note that, as mentioned above (under "independence"), self reuses and other reuses from the same article were separated for analysis.
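A sketch of this attribution recode, again assuming hypothetical logical columns in a data frame a, including a SelfCiteOtherReuse flag for the self citation/other reuse case (scored 0 here, per the reasoning above):

 a$AttributionScore <- with(a,
   ifelse(SelfCiteOtherReuse, 0,        # cites self but really used others' data
   ifelse(Author & Accession, 2,        # directly attributed
   ifelse(Author | Accession, 1, 0))))  # indirectly attributed, else none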


Well, hopefully you survived all that. Thanks again for your diligence and continued help!!

Sincerely,
Sarah Walker Judson
[Quoted text hidden]

PrelimOutputs.txt 21K

Sarah Walker Judson <walker.sarah.3@gmail.com> Fri, Jul 23, 2010 at 10:59 AM
To: hpiwowar@gmail.com

One crazy idea about the multiple-factor problem: it worked when I ran everything together as dummy variables (not binary like the link you sent, but 0, 1, 2, 3, etc.); that was how I ran it before our chat yesterday. I could numerically code/rank the journals, datatypes, etc. according to their coefficients when run separately, then run all the factors together to maybe get at which is the most influential (journal vs. datatype vs. year). I dunno if that even works at all, but it's the only plausible workaround I can think of given that I don't know a ton about this method. It's probably totally unconventional, but I thought I might as well mention it.

Sincerely,
Sarah Walker Judson

P.S. I'm on gchat most of the day (i.e. until 6pm), but will be invisible as usual. [Quoted text hidden]

Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 1:10 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Hi Sarah. I'm going to go have lunch and then come back and chat.

Question: have you tried datadist with commas rather than +s?

so: ddist4<- datadist(Journal,YearCode,DepositoryAbbrv,BroaderDatatypes)

These lines seem to run successfully:

 ddist4 <- datadist(Journal, YearCode, DepositoryAbbrv, BroaderDatatypes)  # can't get this to run with all the factors at once, or even two at a time
 options(datadist='ddist4')
 ologit4 <- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass)
 print(ologit4)

ok, more chatting later,

Heather
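For reference, the same workflow as one self-contained sketch. The Design package is now distributed as rms, the input file name here is hypothetical, and the variables are the ones quoted above:

 library(rms)                       # library(Design) on a 2010 install
 a <- read.csv("reuse_scores.csv")  # hypothetical input file
 ddist4 <- datadist(a)              # summarizes every variable in the data frame at once
 options(datadist = "ddist4")
 ologit4 <- lrm(ResolvableScoreRevised ~ Journal + YearCode +
                DepositoryAbbrv + BroaderDatatypes,
                data = a, na.action = na.pass)  # na.pass as in the email above
 print(ologit4)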

==Email: Advice on data collections in stats==

3 messages

Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 8:07 PM
Reply-To: hpiwowar@gmail.com
To: Todd Vision <tjv@bio.unc.edu>, Sarah Walker Judson <walker.sarah.3@gmail.com>

Todd,

Could do with some stats advice.

Sarah collected data in two different ways: randomly and consecutively. My guess is that she can concatenate these for her main analysis... maybe with a binary variable indicating the type of data collection to hopefully catch artifacts.

That said, I'm a bit unsure and I don't want to lead her down the wrong path.

What do you think? Heather



---------- Forwarded message ----------

From: Sarah Walker Judson <walker.sarah.3@gmail.com>
Date: Thu, Jul 22, 2010 at 4:09 PM
Subject: Re: Scoring and Stats questions_Dataone
To: hpiwowar@gmail.com


Also,

you mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009, for sysbio and amnat only) samples to get a bigger sample size. the former was collected sequentially and the latter randomly. i'm a bit worried this affects assumptions about data collection, but i don't know if this is as strict an assumption in this arena as in biology. i was thinking of running both separately, then pooling, then choosing one (probably the 2000/2010) as the focus for reporting and stating whether or not the other sets produced similar results.

thoughts?

thanks.

Sincerely,
Sarah Walker Judson


On Thu, Jul 22, 2010 at 3:49 PM, Sarah Walker Judson <walker.sarah.3@gmail.com> wrote:

Sorry, one question I forgot (but it isn't urgent since I have a lot to chew on): should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?

To reiterate (it's kind of a combination of resolvability and attribution):

Ideal (previously "Good") citation score

Ideal_CitationYN (*this came out the same as my Knoxville calculation of author+depository+accession)
 1 = Y = resolvable + attributed (adding the two previous yes and no categories)
 0 = N = lacking one or the other or both

Ideal_CitationScoreSimple
 0 = not resolvable or attributed
 1 = attributed ("Yes" in "AttributionYN")
 2 = resolvable ("Yes" in "ResolvableYN")
 3 = attributed and resolvable

Ideal_CitationScoreGoodGradient
 0 = none
 1 = depository only
 2 = author only
 3 = accession only
 4 = depository and author
 5 = depository and accession
 6 = author and accession
 7 = depository, author, and accession

To adapt to an ordinal scale, it could either be:

Ideal_CitationScoreSimple
 0 = not resolvable or attributed
 1 = attributed ("Yes" in "AttributionYN") or resolvable ("Yes" in "ResolvableYN")
 2 = attributed and resolvable

Ideal_CitationScoreGoodGradient
 0 = none
 1 = depository only or author only or accession only
 2 = (depository and author) or (depository and accession) or (author and accession)
 3 = depository, author, and accession

Thoughts?

This might be another "out of the scope of this project" item, or it might be redundant with resolvability and attribution, or it might be essential...I dunno. As I rethink it, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable?). Alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?

Again, no rush...I've got plenty to work on.

Thanks!!!

Sincerely,

Sarah Walker Judson






Todd Vision <tjv@bio.unc.edu> Fri, Jul 23, 2010 at 4:41 AM
To: "hpiwowar@gmail.com" <hpiwowar@gmail.com>
Cc: Sarah Walker Judson <walker.sarah.3@gmail.com>

I haven't been following the discussion closely enough to be sure, but a general approach would be to combine, test for heterogeneity and, in its absence, accept the stats on the combined sample. But the populations of study for the two sets sound sufficiently different (i.e. the identity of the journals, as opposed to the random/sequential distinction) that a combined analysis would be difficult to interpret. I'm not sure you wouldn't still want to use them to ask the same questions, though.

Todd
[Quoted text hidden]

--

Todd Vision

Associate Professor
Department of Biology
University of North Carolina at Chapel Hill

Associate Director for Informatics
National Evolutionary Synthesis Center
http://www.nescent.org



Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:14 AM
Reply-To: hpiwowar@gmail.com
To: Todd Vision <tjv@bio.unc.edu>
Cc: Sarah Walker Judson <walker.sarah.3@gmail.com>

Thanks Todd, that's helpful.
Heather
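A sketch of Todd's suggestion under the same hypothetical names as above: pool the two samples, add an indicator for how each record was collected, and check whether that indicator (a proxy for heterogeneity between the samples) adds anything:

 library(rms)
 a$Collection <- factor(a$Collection)  # hypothetical flag: "snapshot" vs "timeseries"
 base <- lrm(ResolvableScoreRevised ~ Journal + YearCode, data = a)
 plus <- lrm(ResolvableScoreRevised ~ Journal + YearCode + Collection, data = a)
 lrtest(base, plus)  # likelihood-ratio test; a small p-value suggests heterogeneity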

==Chat Transcript July 22==

2:39 PM Heather: Hi Sarah!

 You around?
Sarah: yep...trying to work on what you sent
Heather: cool.
 I don't have any definitive answers, just ideas.
Sarah: ok

2:40 PM Heather: what do you think? based on what you've seen, do you think continuing down this path a bit more makes sense?

 or turn to the chi-sq or poisson or something else?
Sarah: yeah, i think it will work, it's just a steeper learning curve for me than i expected

2:41 PM my main question before proceeding is: what exactly are we trying to test with this?

Heather: yeah, agreed.
 it feels like good stuff to know
Sarah: i.e. mainly trying to say which factor is most influential?
Heather: but always a bit hard to be learning it when you actually need to use it, yesterday.
Sarah: or just trying to get a measure of significance to stamp on the results?

2:42 PM to me, the percentage table breakdowns really show what's going on

Heather: right good question.
 so what the percentage table doesn't give us is a multivariate look that tries to hold potential confounders constant

2:43 PM Sarah: agreed

Heather: so we are trying to figure out which factors are important
 which ones aren't
Sarah: still agreed
Heather: yeah, I think that is mostly it :)
 maybe also

2:44 PM an estimate of prevalence and percentages in some factors, independently of confounders

 for example
 journal impact factors vary a lot by subdiscipline

2:45 PM and there is lots of prior evidence (mine, nic's, etc) that high impact factors

 correlate with stronger policies, and probably more sharing
Sarah: i wasn't considering throwing in high impact factors b/c the journals were already selected by that criteria and there are only 6 journals, so i didn't think that would be an informative variable

2:46 PM Heather: so it would be ideal if we could decouple impact factor from rates of sharing

 yeah, I hear you. true, you only have 6 journals to work from.
Sarah: i would use it more as explanatory in the discussion to maybe say why one journal was better or worse....or running impact factor as a variable if journal was a significant predictor
Heather: right.

2:47 PM so then let me change my example

Sarah: ok, sorry to get off track on details
Heather: to be figuring out the relative rates of sharing in any given journal
Sarah: so say, journal vs. datatype
Heather: right
 assuming the mix of datatypes was the same

2:48 PM (which obviously it isn't)

 so I'd say the goals are
Sarah: yeah, it's highly correlated with journal
Heather: right
 (which of course will make it hard for the stats to tease it out
 but that's life)
 so I'd say the goals are
 a) which factors are important

2:49 PM b) what are the relative levels of sharing, independent of other variables

 that sync with what you think?
Sarah: could you explain what you mean by b)?
 just percentages by journal/datatype, etc?
Heather: yeah. let's see, my head needs to get more into your data.

2:50 PM so, for example, we could want to say that

 when data is sequence data, the odds that it will be shared

2:51 PM Sarah: are better than if it's ecological

Heather: at a really high level of best-practice
 are 1.5 times more than if it were ecological.
 yes, exactly.
 so you would have to choose a "baseline"

2:52 PM (or I think when you define the variable as a factor, it chooses a baseline for you? I forget)

Sarah: hmmm...i'm not clear
Heather: .... independent of whatever journal it is published in

2:53 PM Sarah: (and on a technical note, I'm having trouble defining factors in Design)

Heather: ok.... oh, I think that is easy.
 can you just say as.factor(x) or factor(x), or does that not work?

2:54 PM Sarah: as.factor hasn't been working, in the documentation it says: "In addition to model.frame.default being replaced by a modified version, [. and [.factor are replaced by versions which carry along the label attribute of a variable."

Heather: hmmmm. apparently not easy.
 I can look in my code.
Sarah: basically, the only way i could get my data to behave in Design was to substitute numbers
 but that makes things act like ordinal scales

2:55 PM Heather: yeah, and that is probably making the results strange

Sarah: let me dig up the output differences real quick
 journal straight:
              Coef    S.E.   Wald Z  P
 y>=1         1.5626  0.4291  3.64   0.0003
 y>=2        -0.4607  0.4010 -1.15   0.2506
 y>=3        -1.2100  0.4127 -2.93   0.0034
 y>=4        -1.3597  0.4164 -3.27   0.0011
 y>=5        -2.4611  0.4617 -5.33   0.0000
 Journal=EC  -1.5524  0.6321 -2.46   0.0140
 Journal=GCB -1.3413  0.5323 -2.52   0.0117
 Journal=ME   1.7319  0.5661  3.06   0.0022
 Journal=PB  -0.6605  0.5230 -1.26   0.2066
 Journal=SB   0.7495  0.4725  1.59   0.1127

 journal coded as a dummy number (1,2,3,4,5 &6):
              Coef     S.E.    Wald Z  P
 y>=1         2.84154  0.42461  6.69   0.0000
 y>=2         0.90291  0.36971  2.44   0.0146
 y>=3         0.18469  0.36705  0.50   0.6148
 y>=4         0.04250  0.36783  0.12   0.9080
 y>=5        -1.01183  0.39524 -2.56   0.0105
 JournalCode -0.55736  0.09414 -5.92   0.0000

2:56 PM Heather: oh, leaving the journal straight, do you mean you have 5 different binary variables?

Sarah: no, i don't, but i think Design is interpreting it as such
 my columns look like:
 Journal
 ME
 GCB
 ME
 SB
 etc

2:57 PM or

 Journal
 sorry,
 JournalCode
Heather: right. so I'm guessing then maybe Design is interpreting it as a factor already?
Sarah: 1
 2
 1
 3
Heather: try str(yourdataframe)
 and see what datatype R thinks the Journal column is?

2:58 PM Sarah: yep

 it's coming through as a factor
Heather: ok, good.
 confusing for you, but good.
Sarah: but, that type of output doesn't make a lick of sense to me
Heather: ok.

2:59 PM Sarah: well, i guess it just will give a ton of covariates like you were saying

Heather: right
Sarah: which seems like you would have type I error problems
Heather: one for each journal
 in the list in the results that looks like this
 Journal=EC  -1.5524  0.6321 -2.46  0.0140
 Journal=GCB -1.3413  0.5323 -2.52  0.0117
 Journal=ME   1.7319  0.5661  3.06  0.0022
 Journal=PB  -0.6605  0.5230 -1.26  0.2066
 Journal=SB   0.7495  0.4725  1.59  0.1127

 does that include all of your journals, or is it missing one?

3:00 PM Sarah: ummm...it's missing Amnat

Heather: right
Sarah: I might have just not copied it
Heather: so I think that means it used Amnat as the base
 good question... go have a look....
Sarah: oh
Heather: I'm thinking it might not be there
 it is kind of like having a state column

3:01 PM well nevermind that analogy was just going to make things worse.

Sarah: it's in the table (there was a chance it didn't have reuse)
Heather: can you see, is it actually missing Amnat?
Sarah: it's in the crosstab
Heather: in the regression results?
 but it is in the input data?

3:02 PM Sarah: yep, it's in the raw table and the crosstab

Heather: yup.
Sarah: ResolvableScore

 Journal   0   1   2   3   4   5
 AN        3   9   4   0   4   0
 EC       10   4   2   0   2   0
 GCB      13  13   3   0   1   0
 ME        0   6   2   2   4   8
 PB        9  13   4   0   4   0
 SB        4  19   8   2   7  10

Heather: so that means that it is using it as the base case
 those results mean,

3:03 PM or rather this line

 Journal=PB -0.6605 0.5230 -1.26 0.2066
 means that

3:04 PM "whether the journal was amnat or PB made no difference in how we'd predict the citation-quality score"

 (or whatever it was that regression was regressing on)
 whereas
 Journal=EC -1.5524 0.6321 -2.46 0.0140
 means

3:05 PM "being in the journal EC made a difference to the citation-quality score, compared to being in the journal Amnat, p=0.01"

 and if you want to see how big a difference
Sarah: ok...that makes more sense
Heather: we have to look at the coefficients and decode them
 but they would tell us something like

3:06 PM Sarah: gut would say EC (ecology) is worse than Amnat and ME (molecular eco) is better

 so, yeah, EC is neg and ME positive
Heather: "being in the journal EC made a dataset 1.4 times more likely to have a quality score 1 level higher than an equivalent study published in amnat"
 or something like that
 oh, whichever, I didn't try to make my guess very realistic :)
Sarah: makes sense for the others based on what i know about the data
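(For the record, the coefficients are on the log-odds scale, so the actual multiplier comes from exponentiating; a sketch, assuming the ologit4 fit quoted in the email above:)

 exp(coef(ologit4)["Journal=ME"])  # exp(1.7319) is about 5.7: the odds of a
                                   # higher score in ME are ~5.7 times baseline AN
 summary(ologit4)                  # with datadist set, also reports effect estimates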

3:07 PM Heather: and I'd have to reread the ordered tutorial to make 100% sure I'm getting my summary blurb right

Sarah: so then, can i force which journal (or factor) is the base?
Heather: about "1 level higher" etc because I'm not very used to this ordinal stuff
 but that is the general idea
Sarah: i.e. my impression of which is "worst" or "best"
Heather: good question
 maybe
 I think so
Sarah: or, should I let the stats remove my bias?
Heather: if you do levels(dataframe$Journal) what does it say?
Sarah: or determine "worst" from the pivot tables
Heather: urm, I think mathematically it doesn't matter

3:08 PM Sarah: just for interp

Heather: so there is an advantage to picking one that is easy to interp
 exactly
 I wonder how it picks it now?
 might be the level with the most N
 which would probably be a good call regardless
Sarah: that code didn't work
 i'm getting an error "dataframe not found"...oh sorry, i need to insert my data object there
 whoops

3:09 PM just a sec

 yeah, so AN is the first, but they are arranged alphabetically, not in order of encounter in the raw table
Heather: interesting.
 so I'm guessing it might use levels()[1]
Sarah: i mean, it may be correlated with sample size, but i don't think so
Heather: as the base?

3:10 PM Sarah: i'm not familiar with levels, sorry

Heather: ok
Sarah: so i can't make an intelligent stab at that
Heather: no problem
 so a factor is a vector
Sarah: but, i can figure it out to spare you the time, or just interp the way it comes out
 it makes a lot more sense now
Heather: well, hrm no
 levels is like the "codebook" that it uses to code factors

3:11 PM try ?levels

 and for what it is worth it isn't the most intuitive part of R to me
 either
 so I'd maybe skip trying to force that for now
 let it pick what it wants to pick
 and later when/if we decide this is the way we want to go and you see an

3:12 PM opportunity to really improve the interpretation by forcing it, figure it out then...

 anyway, your call, but that's what I'd do.
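(For later, a sketch of the baseline mechanics just discussed; levels() and relevel() are base R, and the data frame name a is hypothetical:)

 levels(a$Journal)                            # the first level is the base case
 a$Journal <- relevel(a$Journal, ref = "SB")  # e.g. force SB to be the baseline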
 so right now to interp your results

3:13 PM you'll have to see what is left out of the results output, or check out levels() for each of your categorical variables

 or some combo
 does that make sense?
 or enough sense?

3:14 PM Sarah: yep... a lot more sense than before

Heather: cool
Sarah: i thought the output with all the journal types listed out was the wrong way
Heather: so for what it is worth
Sarah: because the example used all binary coding
 rather than A, B, C, etc
Heather: your y output variable could be coded as a factor as well
 if you wanted to

3:15 PM Sarah: well, i tried to order it in a way that was somewhat "bad to good"

Heather: and then you can do a multinomial regression on that unordered-factor-category y variable,
Heather: like in the last tutorial I sent.
 right! and mostly that is a great idea
 the only reason you could, maybe, treat it like a factor instead
 is to get around the "proportional odds" stuff
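(A sketch of that unordered alternative, using multinom from the nnet package, which drops the ordering assumption entirely; the names are the same hypothetical ones as before:)

 library(nnet)
 multi <- multinom(factor(ResolvableScoreRevised) ~ Journal + YearCode +
                   DepositoryAbbrv + BroaderDatatypes, data = a)
 summary(multi)  # one coefficient set per non-baseline outcome level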

3:16 PM Sarah: ok...i don't know how that will come out with this new way

Heather: by seeing how it behaves if you just remove all semblance of order.
Sarah: ok.
 i'll try this again and maybe give that a shot
Heather: right, I don't know either. And I'm not necessarily really recommending it.....
 except maybe.....
 kind of like how we always use two-sided p-values

3:17 PM we think we know which way the interaction will go

 and so we could, in theory, use a one-sided p-value
 but maybe we are wrong
 and we should use stat tests that reflect that
Sarah: hmmm...i'm not following

3:18 PM Heather: let me back up for a minute and ask a question to make sure

 I'm on the right page, because I forget
 for your "best practice" levels, does everything that meets the criteria to be in level 3 also meet the criteria to be in level2?

3:19 PM Sarah: no

 ResolvableScore

 0 = no depository or accession or author (Justification: you know they used data but not exactly how it was obtained = probably couldn't find it again, i.e. "data was obtained from the published literature")
 1 = author only (Justification: you could track down the original paper, which might contain an accession or extractable info)
 2 = depository or database only (Justification: you might be able to look up the same species/taxon and find the information per the criteria in the methods)
 3 = accession only (Justification: accession number given but depository not specified = you would probably be able to infer which depository it came from based on the format, just as I was usually able to tell that sequences were Genbank by the format…)
 4 = depository and author (Justification: although no accession is given, many depositories also have a search option for the author/title of the original article which connects to the data)
 5 = depository and accession (Justification: "best" resolvability: unique id and known depository = should find the exact data that was used)

3:20 PM oh, sorry, i copied and pasted before i realized that was the long version

Heather: right, pulled it up too
Sarah: but, could make it so it was
Heather: so, by treating that as an ordered variable,
 we are making some assumptions that may not be true

3:21 PM Sarah: yes, like that my ranking is reflective of true difficulty of finding a dataset

Heather: if we think about other ordered variables, people who think something is "very good" also think it is at least "good"
 right :)
Sarah: yeah....but, i'm also grappling with the problem here that we have a perception of a good practice, but most of the data doesn't meet that

3:22 PM Heather: I'd say, perhaps we could improve a few things at once

Sarah: i.e. we'd like to see depository and accession mentioned
Heather: by collapsing your variables
Sarah: but most just give authors
 author
 the ordered version i see is:
Heather: into interpreable levels
Sarah: author only
 depository and author
Heather: yeah, but I'd even try using other lingo for a minute
Sarah: depository and author and accession

3:23 PM Heather: so "no information, can't find it"

Sarah: but i have very few of the latter
Heather: "could find it with extra research"
 "could find it just with info provided in the paper"
 or something like that
Sarah: ok...
 but i'm talking about those same things, just by the criteria i'm defining them

3:24 PM Heather: then you have a codebook to know what criteria you use to apply those labels

 yeah
 but there aren't 6 that make sense to talk about
 when you stop talking about their criteria, do you know what I mean?
Sarah: "no information, can't find it" = none of the below
 "could find it with extra research"depository or author ONLY

3:25 PM Heather: in some ways, the people reading the paper don't care if a citation includes the author and one of depository or...

Sarah: "could find it just with info provided in the paper" depository and (author or accession)
Heather: they care... can I FIND it :)
 or am I attributed
 or whatever
 yup
Sarah: sorry, i'll put those in order
 "no information, can't find it" = none of the below
 "could find it with extra research"depository or author ONLY
Heather: and I think that will help with the ordered interpretation
Sarah: "could find it just with info provided in the paper" depository and (author or accession)
Heather: and reducing the number of levels

3:26 PM (which will help with N in cells and maybe proportionalness)

Sarah: and then use percentages to just state the paltry number of papers that give the accession number
Heather: yup
Sarah: rather than holding accession as the holy grail
Heather: and ditto on attribution
 make it what matters

3:27 PM so "the author is not attributed"

 "the author is indirectly attributed"
 "the author is directly attributed"
 (and maybe this means you need another endpoint for the depository attribution?)

3:28 PM anyway... I wouldn't spend oodles of time reworking things into this framework

Sarah: i don't think it will be bad
Heather: because maybe it won't be practical
 or Todd won't like the direction or whatever.....
 but that's what my gut tells me.
Sarah: i still like my original categories for display tables, but you're right about the meaning for stats
 maybe that will also keep todd happy

3:29 PM Heather: yes agreed! good point.

 :) And I don't want to put words in Todd's mouth, I don't know what he will think....
Sarah: no, i think we all think accession number (direct data attribution) is the holy grail
 but, that's just not a reality in this data
Heather: yeah. so then can you define a midpoint or two between that and nothing

3:30 PM Sarah: one quick question on the attribution scale,

Heather: yup?
Sarah: would accession number (without an author name) be direct or indirect?
 i say indirect
 but it hurts when we want to show accession as the epitome
Heather: yeah, I'd say that too.
Sarah: of a good data citation

3:31 PM Heather: yeah, but you know what? when you put it that way, accession number isn't actually the epitome

 of everything
Sarah: yeah
Heather: that is what genbank mostly does right?
Sarah: yep, exactly
Heather: and it comes under fire in terms of people not getting direct attribution
Sarah: but it's not standard in the literature by any means
Heather: so if your data reflects that, probably all the better
Sarah: a lot of people say " i searched genbank and used sequences by author a, b, and c"

3:32 PM Heather: do they really? I wouldn't have expected that.

Sarah: ok, so can i run the attribution categories by you real quick?
Heather: I've mostly seen "and used accession numbers A, B, C"
 yes
 I think I've got 10 more mins
Sarah: "the author is not attributed" - no author or acession

"the author is indirectly attributed" - accession only "the author is directly attributed" - author and accession 3:33 PM wait...that excludes author only

 "the author is not attributed" - no author or acession

"the author is indirectly attributed" - accession only or author only "the author is directly attributed" - author and accession

 hm....but "author only"
 is direct
Heather: do you need "accession" on directly attributed?
 right.
Sarah: "the author is not attributed" - no author or acession

"the author is indirectly attributed" - accession only "the author is directly attributed" - author and accession or author only

Heather: seems strange, but in terms of attribution per se, accession not needed
Sarah: but, then that's not ordered

3:34 PM directly does not necessarily include the indirect

Heather: good point
 thinking

3:35 PM Sarah: one addition: correspondence (i.e. the dataset was obtained from my buddy so-and-so)
"the author is not attributed" - no author or accession
"the author is indirectly attributed" - accession only or correspondence only
"the author is directly attributed" - author and accession or author only

Heather: well hrm I'm not quite sure what to think.
Sarah: we could change it to "data directly attributed"

3:36 PM "the data is not attributed" - no author or acession "the data is indirectly attributed" - accession only or author only "the data is directly attributed" - author and accession

 or call it "data authorship"
Heather: yeah, that works I think, doesn't it?

3:37 PM Sarah: that's more what we're interested in too....is the DATA being cited?

 still brings in the problem of author attribution as the current mode of tracking data
Heather: yes, exactly
 nice
 ok, I have to run.
Sarah: ok. thanks soooo much!
Heather: I'm guessing we aren't out of the woods yet
 but making progress
Sarah: i'm not used to categorical stats and that helped a bunch
Heather: great

3:38 PM Sarah: ok, sure, will send through email or whatever and usually when you get an email from me, i'm available on chat for the next little while

 thanks!
Heather: ok, good to know. I think I'll be AWOL tonight, but avail tomorrow.
 bye!