# DataONE:Notebook/Summer 2010/2010/07/26


This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.

## Chat, Heather and Nic

```
From: Nicholas Weber <nicholas.m.weber@gmail.com>
Date: 2010/7/26
Subject: Chat with Nicholas Weber
To: hpiwowar@gmail.com

9:53 AM
Heather: Hi Nic, how's it going?
19 minutes
10:12 AM
Nicholas: Hi
10:13 AM
It's ok, I tried what you suggested by making the publishers into categories, but none of them had a p-value < .05
7 minutes
10:21 AM
Heather: Hi Nic

No problem

at this point all p-values should be taken with a grain of salt anyway

because it is not clear that we have all the relevant variables in our equation

do you have some more time now or later?
10:22 AM
thinking we can go over how to compute predicted probabilities, and then

what variables should be in your equation :) What do you think?

do you have other questions first, or ?

(oh, probably lunch time for you, eh? tell me when will be a good time)
10:23 AM
Nicholas: hi sorry

no I was just in R trying to figure something out
10:24 AM
now is good, or whenever, I'll turn my volume on so I hear it if you ping me

Heather: ok now is good for me too

Nicholas: one quick question

Heather: yes
10:26 AM
Nicholas: when I was trying to create a variable out of the is.Ecology

I kept getting NAs instead of 0s

Heather: ok, what is the code you are trying

Nicholas: I was doing:
> is.Ecology=0
> is.Ecology[sub("*Ecology*", ISI.Category)] = 1

nope

sorry

grep instead of sub
10:27 AM
is.Ecology[grep("*Ecology*", ISI.Category)] = 1

I set is.Ecology to 0 initially
10:28 AM
Heather: ok, what is your exact code?

are you doing

Nicholas: is.Ecology=0

is.Ecology[grep("*Ecology*", ISI.Category)] = 1
10:29 AM
but then when I try is.Ecology I get NA instead of 0

Heather: yes, ok so I think this part is a bit tricky

the problem is that when you say

Nicholas: so when I try to do "mylogit = glm(requests~Impact.Factor + is.Ecology, family=binomial(link="logit"), na.action=na.pass)"

Heather: is.Ecology=0

Nicholas: I get an error

Heather: it doesn't know if you want that to just be a number, or a "vector" or what
10:30 AM
Nicholas: oh

Heather: when you say mydataframe$is.Ecology = 0

then it knows that you want every row in your data frame to have a 0 in it

but when you use the attach

then add a new variable

R doesn't know if you want that to be a "data frame" sort of variable, or just a single number 0, or what

so it defaults into a single number 0, I think
10:31 AM
there are a few fixes

Nicholas: shoot, I think I remember reading that in the intro too

ok, sorry

Heather: is.Ecology[!grep("*Ecology*", ISI.Category)] = 0

nope sorry

I meant to type that into R first to try it out, and it doesn't seem to work
10:32 AM
ummm hmmm, there are lots of fixes, I'm just trying to think of the nicest

how about is.Ecology = rep(0, length(ISI.Category))
10:33 AM
yeah, then your line

is.Ecology[grep("*Ecology*", ISI.Category)] = 1

does that seem to work?

Nicholas: yup that worked

Heather: rep means repeat
10:34 AM
so repeat the number 0 many times, the same number of times as there are items in ISI.Category

make sense?
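
# [Editor's note: a consolidated sketch of the dummy-variable step above,
#  assuming a vector ISI.Category as in the chat. Note that grep() takes a
#  regular expression, not a glob, so the plain substring "Ecology" is the
#  safer pattern than "*Ecology*".]
is.Ecology = rep(0, length(ISI.Category))      # one 0 per item, not a lone 0
is.Ecology[grep("Ecology", ISI.Category)] = 1  # flag the matching rows
# or equivalently in one line:
# is.Ecology = as.numeric(grepl("Ecology", ISI.Category))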

Nicholas: yeah

Heather: cool

ok, probabilities, then talking variables

so in the tutorial,
10:35 AM
we'll skip the section about deviance residuals for now

and log likelihood

and jump to the red part called "using predicted probabilities"

Nicholas: ok
10:36 AM
Heather: try these three lines:

newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))

newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")

newdata1

you get a table with some probability looking things

?

Nicholas: yes
10:37 AM
Heather: ok, so what that means is that I gave our "model", our regression equation, the thing stored in mylogit some example datapoints

I gave it datapoints where noOA is 1 and Impact.Factor is several possibilities

and it used the coefficients it estimated and plugged in those values
10:38 AM
and computed the probability that a datapoint like that would have a sharing policy

so for a journal that is not OA and has an IF of 1, the probability is just 8%
10:39 AM
but for a journal that is not OA and has an IF of 15, the probability is 60%

now try these three lines:

newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))

newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")

newdata1

this is the same, but noOA = 0... so in other words, it is a journal that has some OA
10:40 AM
for a journal with some OA and an IF of 1, it has a 14% prob of a data sharing policy, and with an IF of 15 it has a 72% probability

make sense?

it is just a different way to interpret what the slopes/coefficients are actually telling us :)
10:41 AM
Nicholas: ok

Heather: questions?
10:42 AM
Nicholas: I think I added a column so I'm seeing something different

one sec

Heather: oh I see, yeah that would be confusing
10:43 AM
here is the orig that I'm using:

mylogit = glm(requests~Impact.Factor + noOA, family=binomial(link="logit"), na.action=na.pass)

then
10:44 AM
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))

newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")

newdata1

then do you get the first line ending with

0.08497077

Nicholas: yup

ok got it now

Heather: cool
10:45 AM
ok, so for no OA, IF=1, prob of plan = 8%

no OA, IF 15, prob of plan = 60%

make more sense now?

Nicholas: yup
10:46 AM
as IF goes up, probability of a data sharing plan increases if no OA

Heather: right. now what about if some OA?

then you would try it again with the noOA variable set to 0 like this:

newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
10:47 AM
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")

newdata1

so as IF goes up, probability of a data sharing plan increases if some OA, also

and the overall level is higher, right?

Nicholas: yes, increase is more dramatic
10:48 AM
Heather: yes. looked at a different way, you could also be tempted to say, based on this data

at an IF of 15, prob of plan if noOA is 60% and prob of plan if some OA is 72%
10:49 AM
but, you want to keep in mind when you say that that the noOA variable is pretty unclear in our model

its p-value was higher than 0.05

its confidence intervals were wide, etc
10:50 AM
so while there is a coefficient calculated and the table of probabilities uses that coefficient, we want to be careful about saying things about trends in the noOA variable

whereas we feel quite confident about saying things about trends in IF because its p-value was significant

see how the p-value and confidence interval interpretations need to be at the back of our mind as we play with the numbers?

confusing, or ok?
10:51 AM
Nicholas: no I understand, but...this might be a naive question, why would we continue to play with the numbers if we know our p value and conf int aren't sig?

Heather: yeah, it is a good question
10:52 AM
some people would advocate for dropping noOA out of our model when we realize it isn't significant

and rerunning our statistics

and calculating our probabilities based on the streamlined model
10:53 AM
that said, the 5% threshold is in some ways arbitrary

and so others say we should keep the nonsignificant things in anyway, because we don't actually know enough to take them out

so ???. it depends.
10:54 AM
mostly, whether you keep them in or take them out, you need to be very hedgy whenever you say anything about them

does that help?

Nicholas: yes
10:55 AM
so when you present stats like that, if you choose to, how do you present them

by present I mean write about them

Heather: yup

that's one of the things I love about the tutorial....
10:56 AM
"These findings can also be interpreted using predicted probabilities. With all other variables held constant at their mean, the probability of admission for a gpa of 2.0 was .15, while a gpa of 3.0 resulted in a .26 probability of admission and a gpa of 4.0 was associated with a .40 probability of admission. Likewise, for gre scores of 400, 500, 600 and 700, the probabilities of admission were .22, .26, .31 and .37, respectively, while holding other predictors constant at their mean."

to put that in our language:
10:57 AM
For journals with no OA content, for IF values of 1, 5, 10, ... , the probabilities of having a data sharing plan were x, w, sd, respectively

Likewise, for journals with OA content, .....
10:58 AM
And in the first sentence I would say:
10:59 AM
A logit regression was used to predict whether a journal had a data sharing plan from the journal's impact factor and whether or not it publishes any articles open access. IF was a significant predictor of

having a data sharing plan, but publishing OA content was not statistically significantly related to having a data sharing plan.

does that help?
11:00 AM
Nicholas: yes, that's great.

Heather: cool. yeah, I get how having it very applied to your own case can help a lot :)

ok, so one more caveat I want to emphasize before we start planning analyses
11:01 AM
it can be tempting to really run with this stat analysis stuff

but important to remember its limitations

lots of limitations, but the one I'll emphasize right now is that

it is correlation not causation

so it isn't necessarily true that having a high IF CAUSES the journal to have a data sharing plan
11:02 AM
Nicholas: right

Heather: it could be that having a data sharing plan CAUSES it to have a high IF (ok, unlikely, but still)

or having a high IF and having a data sharing plan are both correlated to being published out of England, or something, where they are good publishers and they love data sharing or something
11:03 AM
anyway, just want to reiterate that so that you can practice writing things up accordingly

Nicholas: ok

Heather: it can be tempting to say "increases" but often more appropriate to say

"is associated with increased" or something
11:04 AM
disputes can be had about which is better scientific writing, so don't second-guess yourself much as you are writing this up

just want to highlight the issue

ok?

Nicholas: that's very helpful

Heather: ok
11:05 AM
now.

what variables do you think you want to look at?

we'll think about what variables, and how they will be coded

one detail is that you can't include oodles of variables

because you only have so much data

you have 307 datapoints, right?

Nicholas: right
11:06 AM
Heather: so opinions differ, but a rule of thumb is to have about 30 datapoints for every coefficient you are trying to estimate

so that would be about 10 coefficients

and depending on your variables, that is probably fewer than 10 variables.
11:07 AM
if, for example, you had a publisher column with publisher A, B, C, and other, that is actually 3 coefficients (4 minus one for the base case)

so you can see you can spend the 10 coefficients really fast
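
# [Editor's note: a quick way to see the "levels minus one" coefficient cost
#  described above. model.matrix() shows the dummy columns R would actually
#  fit; the data here is made up for illustration.]
pub = factor(c("A", "B", "C", "Other", "A"))
model.matrix(~ pub)  # an intercept (base case "A") plus 3 dummy columns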

Nicholas: yes

Heather: so don't get too hung up on the 10 right now

but it does mean we need to focus on what is important
11:08 AM
and probably leave most of the "nice to haves" out

should I give you a bit more background about why you can't have 42 variables in the equation, or is that enough info on that for now?
11:09 AM
Nicholas: um, I think we can probably move on-- I don't know the why, but it makes sense that you can't calculate that many variables for a limited amount of datapoints

Heather: yeah, good. It doesn't make sense to go into the why on everything today, that is for sure :)
11:10 AM
ok.

so what would be the journal variables at the top of your list? and/or what ones do you think are not important?

Nicholas: ok, so as far as significant variables, IF, Requested / Required are definitely important

Heather: where "not important" is "not important to test"

Nicholas: I think the categories, Ecology, Env Sci and EvoBio probably should be as well
11:11 AM
but I don't know that

we've already seen that the subscription model probably isn't sig.

Heather: Nope!

we haven't seen that yet :)

we just played around with it in some test stats

but those were done for learning

and not for interpreting yet
11:12 AM
so don't read anything into them please :)

Nicholas: the "has instructions how to cite data" has a really small number of observations

Heather: it could be that when you include subscription model in an equation that also has the ISI categories the subscription model is actually relevant, for example.....

Nicholas: Affiliation, would be interesting to have as well
11:13 AM
Heather: where Affiliation would be "does it have an affiliation?"

Nicholas: yes, a society affiliation
11:14 AM
Heather: ok, so is it affiliated to a society

btw do you get what I was saying about subscription model?
11:15 AM
given that we don't know anything about it yet, do you think it is a must have in the initial analysis?

Nicholas: yes, just because it wasn't sig for what we just ran, doesn't mean that it's not significant in all cases

Heather: right... but even more....

Nicholas: I do
11:16 AM
Heather: we were just running things unofficially, as learning

Nicholas: right we can't discard it just because we know it to not be sig right now

Heather: at the risk of belabouring the point, I'm going to say.....
11:17 AM
" because we know it to not be sig right now" nope... we DON'T know it to be not sig right now

we were playing with some data, but we weren't being careful yet, and so our stats inputs and methods probably weren't all proper yet, and so we don't actually know anything yet
11:18 AM
Nicholas: right... yes, I think subscription model is important and it should be included

Heather: ok, cool.

Nicholas: could we divide the publishing group into Elsevier, Wiley, Springer, Taylor vs everyone else -- or would that not be sound?

Heather: I think that would be ok

anything else?
11:20 AM
Nicholas: I dont think so
11:21 AM
Heather: yup, seems like a good list to me
11:22 AM
so the variable we are regressing on, our dependent variable, is "does the journal have a data sharing policy that requests or requires data sharing" aka requests (or some other name?)

and the indep variables are:

IF

subscription model

categories

publisher

society affiliation

did I get them all?
11:23 AM
Nicholas: it's probably not significant to consider "instructions on how to cite data"

because there are so few instances?

Heather: yes, and it isn't exactly clear how it relates to the purpose of this analysis

well, that is too strong

Nicholas: oh, yes that too :)
11:24 AM
Heather: it isn't quite related to the purpose of this analysis

I think ideally we would do another, very similar analysis

where instead of predicting whether the journal had a policy to request data sharing, we look at whether they have a policy about how to cite data

alas we may not have enough +ve cases to have much of a story
11:25 AM
another "ideally we would do" is unlump requests from requires in data sharing policy

I think let's put that one on the backburner for today

but keep it in mind, write it down as a todo for tomorrow or something

Nicholas: ok

and to do so

Heather: does that all make sense?
11:26 AM
yes, so mostly I don't think I've shown you the tools yet to do requests vs requires well

Nicholas: ok

yes that makes sense

Heather: it is a step more complicated than logistic regression

so let's hold off till we nail this one :)
11:27 AM
Nicholas: good plan

Heather: btw, you do realize that you are currently doing Stats 301 or something, right? I mean if you feel lost ever, it is with good reason.

you are holding on well, nic.

keep asking if/when/as it gets confusing
11:28 AM
Nicholas: oh thanks, I don't know about that but thanks...I've already lined up a stats tutor for next semester

Heather: and I'm guessing you'll feel all sorts of Ahhhh moments next term as you realize in retrospect all the things we are doing :)

Nicholas: fingers crossed

Heather: and no doubt some urggg moments as you realize some of the steps we are skipping, but oh well, c'est la vie :)

ok.
11:29 AM
Nicholas: quick question

Heather: yes?

Nicholas: have you taught stats before?

Heather: nope

I haven't taught anything before, actually

Nicholas: well, you're a patient instructor, you'd thrive
11:30 AM
Heather: that is kind of you. I do like it. I wish I'd done it before, though, clearly more practice about what order to talk about things, what examples to use, etc is useful!

Nicholas: you can cut your teeth on google chat, and then take your refined skills to the classroom
11:31 AM
Heather: yeah. though it feels like it would be a whole different world. hard to tell if the students are really with you, no?

Nicholas: yeah, definitely... ok thanks for indulging my curiosity
11:32 AM
Heather: ok. variables.

I think we can do a bit here and then you'll be equipped to go do a bit more R coding and running on your own

then we can sync up in a few hours again

does that work?

Nicholas: sounds great

Heather: I'm around till 3 my time

ok
11:33 AM
so let's think about our variables

and what kind of variables they are

impact factor is a real number, right? a float.

Nicholas: yes

Heather: now ideally when you have real numbers in a regression, they have the shape of a bell curve or a normal distribution (same thing)
11:34 AM
if you do

hist(Impact.Factor)

you can see in the graph that it doesn't look like a bell curve

Nicholas: no it doesnt
11:35 AM
Heather: now try this

hist(log(Impact.Factor))

much more like it, eh?

Nicholas: yes

Heather: do you know enough about logs to know what that is doing?

Nicholas: no

Heather: ok no problem
11:36 AM
ok, I'm not going to explain it in much detail

it is actually pretty cool, useful, prevalent

but right now we are just going to treat it as a black box

Nicholas: that's fair
11:37 AM
Heather: it is a "transform"
11:38 AM
that takes the number 1 and turns it into 0

log(1)

so an IF of 1 has a log(IF) of 0

it takes numbers between 0 and 1 and turns them into negative numbers... more negative the smaller they are

log(0.1)

log(0.01)
11:39 AM
and takes numbers bigger than 1 and turns them into positive numbers

but it shrinks the scale down such that the logs of really big numbers aren't that much larger than the logs of medium-big numbers
11:40 AM
log(10)

log(10000)

log(100000000)

make enough sense?

Nicholas: yes

Heather: so we are taking the log of all of the values in our impact factor

and putting those into our model instead of the impact factor
11:41 AM
and that will make the "fit a best fit line" math be more robust

becaues that math assumes that the data coming in has a normal distribution
11:42 AM
the only tricky part is that it can't handle impact factors of 0

log(0)

I'm guessing 0 is actually NA in this case?

Nicholas: right, I noticed that yesterday, there shouldn't be an impact factor of 0... it should be NA right?

Heather: yes
11:43 AM
for now, I'll work around it by just adding 0.1 to it as a temporary hack

so to add the log of the IF in the model, you do this:

mylogit = glm(requests~log(0.1 + Impact.Factor) + noOA, family=binomial(link="logit"), na.action=na.pass)

summary(mylogit)
11:44 AM
it does make the coefficient a bit harder to interpret

(though I think if you replace the 0 with an NA and take out the +0.1 hack it will get easier)
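
# [Editor's note: a sketch of that cleaner alternative, recoding impact
#  factors stored as 0 to NA so the +0.1 hack isn't needed. na.omit drops
#  rows with a missing IF; variable names are as used in the chat.]
Impact.Factor[Impact.Factor == 0] = NA
mylogit = glm(requests ~ log(Impact.Factor) + noOA,
              family = binomial(link = "logit"), na.action = na.omit)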
11:45 AM
but you can see that now the noOA variable is a bit more significant

p=0.078

so still not sig at p<0.05 .... but more so

modelling your data better can show things that are otherwise hidden
11:47 AM
ok

so let's leave impact factor at that for now

subscription model

it has three levels now, right?

Nicholas: yes
11:48 AM
Heather: I think maybe for the purposes of this analysis we might want to collapse it into two

to save "the number of variables"

and make it easier to interpret

I'll leave it up to you to decide what to collapse

this will make it a binary variable

something like

hassomeOA = 1 or 0
11:49 AM
depends on what you decide

ok?

Nicholas: ok
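
# [Editor's note: one way to collapse a three-level subscription model into
#  the binary variable suggested above; the column name Subscription.Model
#  and the level name "no OA" are hypothetical.]
hassomeOA = ifelse(Subscription.Model == "no OA", 0, 1)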

Heather: ok, categories

there are how many main categories?

and can journals belong to more than one?
11:50 AM
Nicholas: yes

I was just interested in the three categories we originally used to gather this list

Heather: yes, cool

Nicholas: Ecology, EvoBio and Env Sci
11:51 AM
Heather: then a binary variable for each of those?

since some journals are members of more than one?

Nicholas: ... I guess so

Heather: ok. it is true that some journals are members of more than one, right?

Nicholas: yes
11:52 AM
Heather: (btw I'm using the word binary. it just means two-valued. usually 0 and 1.

)

ok cool

then publisher

pick the big 3? the big 4?

you still there?
11:53 AM
for publisher, pick the big 3 or 4 and call the others "Other"

Hi Nic. I think we lost the connection. are you still there?
11:54 AM
Nicholas: yes

ok, publisher

Heather: pick the big 3 or 4 and call the others "Other"

Nicholas: ok

Heather: have them all in one column. so not an is.Wiley column, but instead a
11:55 AM
publisher_code column or something

that has one of four values in it

"Wiley", "Elsevier", "Other" or whatever

we'll need to work together to figure out how to interpret the results once you get this column
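
# [Editor's note: a sketch of the single publisher_code column described
#  above; the Publisher column name is an assumption.]
big = c("Wiley", "Elsevier", "Springer", "Taylor")
publisher_code = factor(ifelse(Publisher %in% big,
                               as.character(Publisher), "Other"))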
11:56 AM
the society affiliation column is a binary one, just 0 or 1?

Nicholas: I have both a yes/ no and a numeric 1 - 0

Heather: great!

Nicholas: for affiliation

Heather: ok, that's all of them?

so if you can get the data in that format,

write it up on your OWW page,
11:57 AM
try to put them all in the model, separated by + signs

Nicholas: ok
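
# [Editor's note: the full model sketched from the variable list discussed
#  above; every variable name here is an assumption about how the columns
#  end up coded.]
mylogit = glm(requests ~ log(0.1 + Impact.Factor) + hassomeOA +
              is.Ecology + is.EvoBio + is.EnvSci +
              publisher_code + society.affiliation,
              family = binomial(link = "logit"), na.action = na.pass)
summary(mylogit)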

Heather: then we'll see what we have :)

Nicholas: ok

Heather: remember to hold off interpretations still.... because doing data analysis takes several attempts to get right

esp when still a newbie (and I count as that too)

things like realizing, duh, my impact factor would really be better as a log transform
11:58 AM
can change your results and thus your interpretations

so treat it all as a work in progress

Nicholas: ok

Heather: until we are convinced that the data is a good representation, doesn't have mistakes, the model is a good fit, etc etc etc
11:59 AM
ok!

I'll probably be away from chat for a bit, but then will be back

Nicholas: ok. I'll be here

Heather: for sure let's talk again in an hour or two....

Nicholas: until about 5 my time, then I have a conf to go to

ok

Heather: ok.
12:00 PM
enjoy the conf, cool.

see you in a bit! (oh, btw, I'm assuming you are adding this to an .R file and putting it in git....

if not, woudl be a good time now....

have you figured out how to add new revisions to a git?

to a gist, I mean?
12:01 PM
that way you can keep the same ID number and just update the contents

Nicholas: I've got a text file that I'll put into a gist

Heather: yeah great

that would probably be an easy way for me to see/run your code

play with the revisions stuff till you figure it out or ask.

ok, later!
12:02 PM
Nicholas: bye thanks again

Heather: my pleasure!
```