DataONE:Notebook/Summer 2010/2010/07/26

From OpenWetWare


This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.


Chat, Heather and Nic

From: Nicholas Weber
Date: 2010/7/26
Subject: Chat with Nicholas Weber

9:53 AM 
Heather: Hi Nic, how's it going?
19 minutes
10:12 AM 
Nicholas: Hi
10:13 AM 
It's ok, I tried what you suggested by making the publishers into categories, but none of them had a p-value < .05
7 minutes
10:21 AM 
Heather: Hi Nic
No problem
at this point all p-values should be taken with a grain of salt anyway
because it is not clear that we have all the relevant variables in our equation
do you have some more time now or later?
10:22 AM 
thinking we can go over how to compute predicted probabilities, and then
what variables should be in your equation :) What do you think
do you have other questions first, or ?
(oh, probably lunch time for you, eh? tell me when will be a good time)
10:23 AM 
Nicholas: hi sorry
no I was just in R trying to figure something out
10:24 AM 
now is good, or whenever, I'll turn my volume on so I hear it if you ping me
Heather: ok now is good for me too
Nicholas: one quick question
Heather: yes
10:26 AM 
Nicholas: when I was trying to create a variable out of the is.Ecology
I kept getting NA's instead of 0's
Heather: ok, what is the code you are trying
Nicholas: I was doing : > is.Ecology=0
> is.Ecology[sub("*Ecology*", ISI.Category)] = 1
grep instead of sub
10:27 AM 
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
I set is.Ecology to 0 initially
10:28 AM 
Heather: ok, what is your exact code?
are you doing
Nicholas: is.Ecology=0
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
10:29 AM 
but then when I try is.Ecology I get NA instead of 0
Heather: yes, ok so I think this part is a bit tricky
the problem is that when you say
Nicholas: so when I try to do "mylogit = glm(requests~Impact.Factor + is.Ecology, family=binomial(link="logit"), na.action=na.pass)"
Heather: is.Ecology=0
Nicholas: I get an error
Heather: it doesn't know if you want that to just be a number, or a "vector" or what
10:30 AM 
Nicholas: oh
Heather: when you say mydataframe$is.Ecology = 0
then it knows that you want every row in your data frame to have a 0 in it
but when you use the attach
then add a new variable
R doesn't know if you want that to be a "data frame" sort of variable, or just a single number 0, or what
so it defaults into a single number 0, I think
10:31 AM 
there are a few fixes
Nicholas: shoot, I think I remember reading that in the intro too
ok, sorry
Heather: is.Ecology[!grep("*Ecology*", ISI.Category)] = 0
nope sorry
I meant to type that into R first to try it out, and it doesn't seem to work
10:32 AM 
ummm hmmm, there are lots of fixes, I'm just trying to think of the nicest
how about is.Ecology = rep(0, length(ISI.Category))
10:33 AM 
yeah, then your line
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
does that seem to work?
Nicholas: yup that worked
Heather: rep means repeat
10:34 AM 
so repeat the number 0 many times, the same number of times as there are items in ISI.Category
make sense?
Nicholas: yeah
Heather: cool
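A minimal sketch of the fix above, using a made-up ISI.Category vector (the values are hypothetical, not the project's data). It shows why starting from a single 0 leaves NAs behind and how rep() avoids that. The plain pattern "Ecology" is used here rather than "*Ecology*", because grep() expects a regular expression rather than a file-style wildcard. The grepl() line is an alternative, not something used in the chat.

```r
# Hypothetical category strings; the real ISI.Category column is not shown in the log.
ISI.Category <- c("Ecology",
                  "Evolutionary Biology",
                  "Ecology; Environmental Sciences",
                  "Environmental Sciences",
                  "Ecology")

# Starting from a single 0: assigning into positions 1, 3 and 5 stretches the
# vector to length 5, and R fills the untouched positions (2 and 4) with NA.
broken <- 0
broken[grep("Ecology", ISI.Category)] <- 1
print(broken)

# The fix from the chat: start with one 0 per row, then flip the matches to 1.
is.Ecology <- rep(0, length(ISI.Category))
is.Ecology[grep("Ecology", ISI.Category)] <- 1
print(is.Ecology)

# Alternative (an assumption, not from the chat): grepl() returns TRUE/FALSE
# for every row, so no starting vector is needed.
is.Ecology2 <- as.numeric(grepl("Ecology", ISI.Category))
print(is.Ecology2)
```

The same pattern would give one 0/1 column per category of interest, which also works for journals that belong to more than one category.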
ok, probabilities, then talking variables
so in the tutorial,
10:35 AM 
we'll skip the section about deviance residuals for now
and log likelihood
and jump to the red part called "using predicted probabilities"
Nicholas: ok
10:36 AM 
Heather: try these three lines:
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
newdata1
you get a table with some probability looking things
Nicholas: yes
10:37 AM 
Heather: ok, so what that means is that I gave our "model", our regression equation, the thing stored in mylogit some example datapoints
I gave it datapoints where noOA is 1 and Impact.Factor is several possibilities
and it used the coefficients it estimated and plugged in those values
10:38 AM 
and computed the probability that a datapoint like that would have a sharing policy
so for a journal that is not OA and has an IF of 1, the probability is just 8%
10:39 AM 
but for a journal that is not OA and has an IF of 15, the probability is 60%
now try these three lines:
newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
newdata1
this is the same, but noOA = 0... so in other words, it is a journal that has some OA
10:40 AM 
for a journal with some OA and an IF of 1, it has a 14% prob of data sharing policy, and with an IF of 15 it has a 72% probability
make sense?
it is just a different way to interpret what the slopes/coefficients are actually telling us :)
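To make "plugged in those values" concrete, here is a small sketch on simulated data. The data frame journals, its size, and the coefficients used to simulate it are all invented for illustration, so the printed probabilities will not match the 8% and 60% quoted in the chat. It checks that predict(..., type="response") gives the same numbers as applying the inverse logit, 1 / (1 + exp(-x)), to intercept + slope * IF + slope * noOA by hand.

```r
set.seed(1)

# Simulated stand-in for the journal data (hypothetical values only).
n <- 300
journals <- data.frame(
  Impact.Factor = round(rexp(n, rate = 1 / 4), 1),
  noOA          = rbinom(n, 1, 0.6)
)
journals$requests <- rbinom(
  n, 1, plogis(-2 + 0.2 * journals$Impact.Factor - 0.4 * journals$noOA)
)

mylogit <- glm(requests ~ Impact.Factor + noOA, data = journals,
               family = binomial(link = "logit"))

# Example datapoints: no OA, several impact factors (same shape as in the chat).
newdata1 <- data.frame(noOA = 1, Impact.Factor = c(1, 5, 10, 15, 20, 25))
newdata1$requestsP <- predict(mylogit, newdata = newdata1, type = "response")

# The same thing by hand: linear predictor from the coefficients, then the
# inverse logit (plogis) to turn it into a probability.
b <- coef(mylogit)
by_hand <- plogis(b["(Intercept)"] +
                  b["Impact.Factor"] * newdata1$Impact.Factor +
                  b["noOA"] * newdata1$noOA)

print(newdata1)
print(round(by_hand, 3))
```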
10:41 AM 
Nicholas: ok
Heather: questions?
10:42 AM 
Nicholas: I think I added a column so I'm seeing something different
one sec
Heather: oh I see, yeah that would be confusing
10:43 AM 
here is the orig that I'm using:
mylogit = glm(requests~Impact.Factor + noOA, family=binomial(link="logit"), na.action=na.pass)
10:44 AM 
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
then do you get the first line ending with
Nicholas: yup
ok got it now
Heather: cool
10:45 AM 
ok, so for no OA, IF=1, prob of plan = 8%
no OA, IF 15, prob of plan = 60%
make more sense now?
Nicholas: yup
10:46 AM 
as IF goes up, probability of a data sharing plan increases if no OA
Heather: right. now what about if some OA?
then would try it again with the noOA variable set to 0 like this:
newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
10:47 AM 
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
so as IF goes up, probability of a data sharing plan increases if some OA, also
and the overall level is higher, right?
Nicholas: yes, increase is more dramatic
10:48 AM 
Heather: yes. looked at a different way, you could also be tempted to say, based on this data
at an IF of 15, prob of plan if noOA is 60% and prob of plan if some OA is 72%
10:49 AM 
but, you want to keep in mind when you say that that the noOA variable is pretty unclear in our model
its p-value was higher than 0.05
its confidence intervals were wide, etc
10:50 AM 
so while there is a coefficient calculated and the table of probabilities uses that coefficient, we want to be careful about saying things about trends in the noOA variable
whereas we feel quite confident about saying things about trends in IF because its p-value was significant
see how the p-value and confidence interval interpretations need to be at the back of our mind as we play with the numbers?
confusing, or ok?
10:51 AM 
Nicholas: no I understand, but...this might be a naive question, why would we continue to play with the numbers if we know our p value and conf int aren't sig?
Heather: yeah, it is a good question
10:52 AM 
some people would advocate for dropping noOA out of our model when we realize it isn't significant
and rerunning our statistics
and calculating our probabilities based on the streamlined model
10:53 AM 
that said, the 5% threshold is in some ways arbitrary
and so others say we should keep the nonsignificant things in anyway, because we don't actually know enough to take them out
so ???. it depends.
10:54 AM 
mostly, whether you keep them in or take them out, you need to be very hedgy whenever you say anything about them
does that help?
Nicholas: yes
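A sketch of where the p-values and confidence intervals discussed above come from, and of the "streamlined model" option. The data are simulated (hypothetical), so which terms come out significant here says nothing about the real journals. confint.default() gives Wald intervals (estimate plus or minus about 1.96 standard errors); that choice is an assumption made for simplicity, since the chat does not say which kind of interval was used.

```r
set.seed(2)

# Simulated stand-in data (hypothetical); noOA has no true effect here.
n <- 300
journals <- data.frame(
  Impact.Factor = round(rexp(n, rate = 1 / 4), 1),
  noOA          = rbinom(n, 1, 0.6)
)
journals$requests <- rbinom(n, 1, plogis(-2 + 0.2 * journals$Impact.Factor))

full <- glm(requests ~ Impact.Factor + noOA, data = journals,
            family = binomial(link = "logit"))

# Coefficient table: estimate, standard error, z value, p-value.
coef_table <- summary(full)$coefficients

# Wald 95% confidence intervals for each coefficient.
wald_ci <- confint.default(full)

# Side-by-side view: a wide interval that spans 0 goes with a large p-value.
report <- cbind(coef_table[, c("Estimate", "Pr(>|z|)")], wald_ci)
print(round(report, 3))

# The "streamlined" option: drop noOA and refit.
reduced <- glm(requests ~ Impact.Factor, data = journals,
               family = binomial(link = "logit"))
print(round(coef(reduced), 3))
```

Either way, the table is a reminder of how hedged any statement about a given variable needs to be.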
10:55 AM 
so when you present stats like that, if you choose to, how do you present them
by present I mean write about them
Heather: yup
that's one of the things I love about the tutorial....
10:56 AM 
"These findings can also be interpreted using predicted probabilities. With all other variables held constant at their mean, the probability of admission for a gpa of 2.0 was .15, while a gpa of 3.0 resulted in a .26 probability of admission and a gpa of 4.0 was associated with a .40 probability of admission. Likewise, for gre scores of 400, 500, 600 and 700, the probabilities of admission were .22, .26, .31 and .37, respectively, while holding other predictors constant at their mean."
to put that in our language:
10:57 AM 
For journals with no OA content, for IF values of 1, 5, 10, ... , the probabilities of having a data sharing plan were x, y, z, ... , respectively
Likewise, for journals with OA content, .....
10:58 AM 
And in the first sentence I would say:
10:59 AM 
A logit regression was used to predict whether a journal had a data sharing plan from the journal's impact factor and whether or not it publishes any articles open access. IF was a significant predictor of having a data sharing plan, but publishing OA content was not statistically significantly related to having a data sharing plan.
does that help?
11:00 AM 
Nicholas: yes, thats great.
Heather: cool. yeah, I get how having it very applied to your own case can help a lot :)
ok, so one more caveat I want to emphasize before we start planning analyses
11:01 AM 
it can be tempting to really run with this stat analysis stuff
but important to remember its limitations
lots of limitations, but the one I'll emphasize right now is that
it is correlation not causation
so it isn't necessarily true that having a high IF CAUSES the journal to have a data sharing plan
11:02 AM 
Nicholas: right
Heather: it could be that having a data sharing plan CAUSES it to have a high IF (ok, unlikely, but still)
or having a high IF and having a data sharing plan are both correlated to being published out of England, or something, where they are good publishers and they love data sharing or something
11:03 AM 
anyway, just want to reiterate that so that you can practice writing things up accordingly
Nicholas: ok
Heather: it can be tempting to say "increases" but often more appropriate to say
"is associated with increased" or something
11:04 AM 
disputes can be had about which is better scientific writing, and don't second guess yourself much as you are writing this up
just want to highlight the issue
Nicholas: thats very helpful
Heather: ok
11:05 AM 
what variables do you think you want to look at?
we'll think about what variables, and how they will be coded
one detail is that you can't include oodles of variables
because you only have so much data
you have 307 datapoints, right?
Nicholas: right
11:06 AM 
Heather: so opinions differ, but a rule of thumb is to have about 30 datapoints for every coefficient you are trying to estimate
so that would be about 10 coefficients
and depending on your variables, that is probably fewer than 10 variables.
11:07 AM 
if, for example, you had a publisher column with publisher A, B, C, and other, that is actually 3 coefficients (4 minus one for the base case)
so you can see you can spend the 10 coefficients really fast
Nicholas: yes
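The coefficient budget above can be checked directly in R. This sketch uses the 307 datapoints and the 30-per-coefficient rule of thumb from the chat. The publisher vector itself is a tiny made-up example, and model.matrix() is used only to count the columns R would create for it.

```r
# Rule-of-thumb budget from the chat: about 30 datapoints per coefficient.
n_datapoints    <- 307
per_coefficient <- 30
budget <- floor(n_datapoints / per_coefficient)
print(budget)

# A publisher column with four values (hypothetical rows), "Other" as base case.
publisher <- factor(c("A", "B", "C", "Other", "A", "Other"),
                    levels = c("Other", "A", "B", "C"))

# model.matrix() shows the columns the regression would actually estimate:
# an intercept plus one 0/1 column per non-base level.
X <- model.matrix(~ publisher)
print(colnames(X))

# Coefficients spent on publisher: number of levels minus one for the base case.
publisher_coefs <- ncol(X) - 1
print(publisher_coefs)
```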
Heather: so don't get too hung up on the 10 right now
but it does mean we need to focus on what is important
11:08 AM 
and probably leave most of the "nice to haves" out
should I give you a bit more background about why you can't have 42 variables in the equation, or is that enough info on that for now?
11:09 AM 
Nicholas: um, I think we can probably move on-- I don't know the why, but it makes sense that you can't calculate that many variables for a limited amount of datapoints
Heather: yeah, good. It doesn't make sense to go into the why on everything today, that is for sure :)
11:10 AM 
so what would be the journal variables at the top of your list? and/or what ones do you think are not important?
Nicholas: ok, so as far as significant variables, IF, Requested / Required are definitely important
Heather: where "not important" is "not important to test"
Nicholas: I think the categories, Ecology, Env Sci and EvoBio probably should be as well
11:11 AM 
but I don't know that
we've already seen that the subscription model probably isn't sig.
Heather: Nope!
we haven't seen that yet :)
we just played around with it in some test stats
but those were done for learning
and not for interpreting yet
11:12 AM 
so don't read anything into them please :)
Nicholas: the "has instructions how to cite data" has a really small number of observations
Heather: it could be that when you include subscription model in an equation that also has the ISI categories the subscription model is actually relevant, for example.....
Nicholas: Affiliation, would be interesting to have as well
11:13 AM 
Heather: where Affiliation would be "does it have an affiliation?"
Nicholas: yes, a society affiliation
11:14 AM 
Heather: ok, so is it affiliated to a society
btw do you get what I was saying about subscription model?
11:15 AM 
given that we don't know anything about it yet, do you think it is a must have in the initial analysis?
Nicholas: yes, just because it wasn't sig for what we just ran, doesn't mean that it's not significant in all cases
Heather: right... but even more....
Nicholas: I do
11:16 AM 
Heather: we were just running things unofficially, as learning
Nicholas: right we can't discard it just because we know it to not be sig right now
Heather: at the risk of belabouring the point, I'm going to say.....
11:17 AM 
" because we know it to not be sig right now" nope... we DON'T know it to be not sig right now
we were playing with some data, but we weren't being careful yet, and so our stats inputs and methods probably weren't all proper yet, and so we don't actually know anything yet
11:18 AM 
Nicholas: right... yes, I think subscription model is important and it should be included
Heather: ok, cool.
Nicholas: could we divide the publishing group into Elsevier, Wiley, Springer, Taylor vs everyone else -- or would that not be sound?
11:19 AM 
Heather: I think that would be ok
anything else?
11:20 AM 
Nicholas: I dont think so
11:21 AM 
Heather: yup, seems like a good list to me
11:22 AM 
so the variable we are regressing on, our dependent variable, is "does the journal have a data sharing policy that requests or requires data sharing" aka requests (or some other name?)
and the indep variables are:
subscription model
society affiliation
did I get them all?
11:23 AM 
Nicholas: its probably not significant to consider "instructions on how to cite data"
because there are so few instances?
Heather: yes, and it isn't exactly clear how it relates to the purpose of this analysis
well, that is too strong
Nicholas: oh, yes that too :)
11:24 AM 
Heather: it isn't quite related to the purpose of this analysis
I think ideally we would do another, very similar analysis
where instead of predicting whether the journal had a policy to request data sharing, we look at whether they have a policy about how to cite data
alas we may not have enough positive cases to have much of a story
11:25 AM 
another "ideally we would do" is to unlump requests from requires in data sharing policy
I think let's put that one on the backburner for today
but keep it in mind, write it down as a todo for tomorrow or something
Nicholas: ok
and to do so
Heather: does that all make sense?
11:26 AM 
yes, so mostly I don't think I've shown you the tools yet to do requests vs requires well
Nicholas: ok
yes that makes sense
Heather: it is a step more complicated than logistic regression
so let's hold off till we nail this one :)
11:27 AM 
Nicholas: good plan
Heather: btw, you do realize that you are currently doing Stats 301 or something, right? I mean if you feel lost ever, it is with good reason.
you are holding on well, Nic.
keep asking if/when/as it gets confusing
11:28 AM 
Nicholas: oh thanks, I don't know about that but thanks...I've already lined up a stats tutor for next semester
Heather: and I'm guessing you'll feel all sorts of Ahhhh moments next term as you realize in retrospect all the things we are doing :)
Nicholas: fingers crossed
Heather: and no doubt some urggg moments as you realize some of the steps we are skipping, but oh well, c'est la vie :)
11:29 AM 
Nicholas: quick question
Heather: yes?
Nicholas: have you taught stats before?
Heather: nope
I haven't taught anything before, actually
Nicholas: well, you're a patient instructor, you'd thrive
11:30 AM 
Heather: that is kind of you. I do like it. I wish I'd done it before, though, clearly more practice about what order to talk about things, what examples to use, etc is useful!
Nicholas: you can cut your teeth on Google chat, and then take your refined skills to the classroom
11:31 AM 
Heather: yeah. though it feels like it would be a whole different world. hard to tell if the students are really with you, no?
Nicholas: yeah, definitely... ok thanks for indulging my curiosity
11:32 AM 
Heather: ok. variables.
I think we can do a bit here and then you'll be equipped to go do a bit more R coding and running on your own
then we can sync up in a few hours again
does that work?
Nicholas: sounds great
Heather: I'm around till 3 my time
11:33 AM 
so let's think about our variables
and what kind of variables they are
impact factor is a real number, right? a float.
Nicholas: yes
Heather: now ideally when you have real numbers in a regression, they have the shape of a bell curve or a normal distribution (same thing)
11:34 AM 
if you do
you can see in the graph that it doesn't look like a bell curve
Nicholas: no it doesnt
11:35 AM 
Heather: now try this
much more like it, eh?
Nicholas: yes
Heather: do you know enough about logs to know what that is doing?
Nicholas: no
Heather: ok no problem
11:36 AM 
ok, I'm not going to explain it in much detail
it is actually pretty cool, useful, prevalent
but right now we are just going to treat it as a black box
Nicholas: thats fair
11:37 AM 
Heather: it is a "transform"
11:38 AM 
that takes the number 1 and turns it into 0
so an IF of 1 has a log(IF) of 0
it takes numbers between 0 and 1 and turns them into negative numbers... more negative the smaller they are
11:39 AM 
and takes numbers bigger than 1 and turns them into positive numbers
but it shrinks the scale down such that the logs of really big numbers aren't that much larger than the logs of medium-big numbers
11:40 AM 
make enough sense?
Nicholas: yes
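The black box can be poked at with a few numbers. The sketch below checks the properties just described (1 maps to 0, values between 0 and 1 go negative, large values get compressed). The two plotting commands referred to earlier in the chat are not preserved in this log, so the hist() calls are only a guess at what was run, and Impact.Factor here is simulated skewed data rather than the real column. As a side note, logistic regression does not strictly require predictors to be normally distributed; the transform mainly keeps a few very large values from dominating the fit.

```r
# R's log() is the natural log by default.
x <- c(0.1, 0.5, 1, 10, 100, 1000)
logs <- log(x)
print(round(logs, 2))

# Compression: going from 100 to 1000 adds 900 on the raw scale,
# but only log(10) on the log scale.
raw_gap <- 1000 - 100
log_gap <- log(1000) - log(100)
print(c(raw_gap = raw_gap, log_gap = log_gap))

# Hypothetical right-skewed impact factors (simulated, not the real data).
set.seed(3)
Impact.Factor <- rlnorm(300, meanlog = 1, sdlog = 0.8)

# Plots only when run interactively: a long right tail before the transform,
# something closer to a bell shape after it.
if (interactive()) {
  par(mfrow = c(1, 2))
  hist(Impact.Factor)
  hist(log(Impact.Factor))
}
```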
Heather: so we are taking the log of all of the values in our impact factor
and putting those into our model instead of the impact factor
11:41 AM 
and that will make the "fit a best fit line" math be more robust
because that math assumes that the data coming in has a normal distribution
11:42 AM 
the only tricky part is that it can't handle impact factors of 0
I'm guessing 0 is actually NA in this case?
Nicholas: right, I noticed that yesterday, there shouldn't be an impact factor of 0... it should be NA right?
Heather: yes
11:43 AM 
for now, I'll work around it by just adding 0.1 to it as a temporary hack
so to add the log of the IF in the model, you do this:
mylogit = glm(requests~log(0.1 + Impact.Factor) + noOA, family=binomial(link="logit"), na.action=na.pass)
11:44 AM 
it does make the coefficient a bit harder to interpret
(though I think if you replace the 0 with an NA and take out the +0.1 hack it will get easier)
11:45 AM 
but you can see that now the noOA variable is a bit more significant
so still not sig at p<0.05 .... but more so
modelling your data better can show things that are otherwise hidden
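A sketch of the cleaner alternative to the +0.1 hack mentioned above: recode the impact factors of 0 as NA, then take the log directly. The data frame is simulated (hypothetical), with ten zeros planted to stand in for missing values. The sketch uses na.action = na.omit, R's usual default, so rows with a missing impact factor are dropped from the fit; this differs from the na.action=na.pass used in the chat, which passes the NAs through to the fitting code.

```r
set.seed(4)

# Simulated stand-in data (hypothetical values only).
n <- 300
true_IF <- pmax(round(rlnorm(n, meanlog = 1, sdlog = 0.8), 1), 0.1)
journals <- data.frame(
  Impact.Factor = true_IF,
  noOA          = rbinom(n, 1, 0.6)
)
journals$requests <- rbinom(n, 1, plogis(-2 + 1.0 * log(true_IF)))

# Plant ten zeros that really mean "no impact factor available".
journals$Impact.Factor[sample(n, 10)] <- 0

# Recode 0 as NA so that log() never sees a zero.
n_missing <- sum(journals$Impact.Factor == 0)
journals$Impact.Factor[journals$Impact.Factor == 0] <- NA

# Fit on the log scale; na.omit drops the rows with a missing impact factor.
mylogit <- glm(requests ~ log(Impact.Factor) + noOA, data = journals,
               family = binomial(link = "logit"), na.action = na.omit)

print(n_missing)
print(nobs(mylogit))   # number of rows actually used in the fit
print(round(coef(mylogit), 3))
```

With the hack gone, the coefficient is on a plain log(IF) scale, which is the easier interpretation hinted at in the chat.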
11:47 AM 
so let's leave impact factor at that for now
subscription model
it has three levels now, right?
Nicholas: yes
11:48 AM 
Heather: I think maybe for the purposes of this analysis we might want to collapse it into two
to save "the number of variables"
and make it easier to interpret
I'll leave it up to you to decide what to collapse
this will make it a binary variable
something like
hassomeOA = 1 or 0
11:49 AM 
depends on what you decide
Nicholas: ok
Heather: ok, categories
there are how many main categories?
and can journals belong to more than one?
11:50 AM 
Nicholas: yes
I was just interested in the three categories we originally used to gather this list
Heather: yes, cool
Nicholas: Ecology, EvoBio and Env Sci
11:51 AM 
Heather: then a binary variable for each of those?
since some journals are members of more than one?
Nicholas: ... I guess so
Heather: ok. it is true that some journals are members of more than one, right?
Nicholas: yes
11:52 AM 
Heather: (btw I'm using the word binary. it just means two-valued. usually 0 and 1.)
ok cool
then publisher
pick the big 3? the big 4?
you still there?
11:53 AM 
for publisher, pick the big 3 or 4 and call the others "Other"
Hi Nic. I think we lost the connection. are you still there?
11:54 AM 
Nicholas: yes
ok, publisher
Heather: pick the big 3 or 4 and call the others "Other"
Nicholas: ok
Heather: have them all in one column. so not an is.Wiley column, but instead a
11:55 AM 
publisher_code column or something
that has one of four values in it
"Wiley", "Elsevier", "Other" or whatever
we'll need to work together to figure out how to interpret the results once you get this column
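One way the single publisher_code column could be built, sketched on a made-up publisher vector. The publisher names in the example rows and the choice of a "big four" are illustrative assumptions; the chat leaves open whether to keep three or four. Putting "Other" first in the factor levels makes it the base case that the other publishers are compared against.

```r
# Hypothetical raw publisher names (not the real data).
publisher <- c("Wiley", "Elsevier", "Springer", "Taylor & Francis",
               "Oxford", "Elsevier", "CSIRO", "Wiley")

# Assumed list of the large publishers to keep as their own level.
big <- c("Elsevier", "Wiley", "Springer", "Taylor & Francis")

# Everything not in the list is lumped into "Other".
publisher_code <- ifelse(publisher %in% big, publisher, "Other")

# Store as a factor with "Other" first, so it becomes the base case.
publisher_code <- factor(publisher_code, levels = c("Other", big))
print(table(publisher_code))

# Coefficients this column will cost in the model: levels minus the base case.
publisher_coefs <- nlevels(publisher_code) - 1
print(publisher_coefs)
```

In the model output each publisher coefficient then reads as "compared with Other", which is the part that needs careful interpretation.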
11:56 AM 
the society affiliation column is a binary one, just 0 or 1?
Nicholas: I have both a yes/ no and a numeric 1 - 0
Heather: great!
Nicholas: for affiliation
Heather: ok, that's all of them?
so if you can get the data in that format,
write it up on your OWW page,
11:57 AM 
try to put them all in the model, separated by + signs
Nicholas: ok
Heather: then we'll see what we have :)
Nicholas: ok
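A sketch of what "all in the model, separated by + signs" might look like once the columns are coded as discussed. Everything here is simulated: the column names is.EnvSci, is.EvoBio, affiliation and the three subscription labels are hypothetical placeholders, and the outcome is random noise, so the fitted numbers carry no meaning. The point is only the shape of the formula and the count of coefficients against the budget of roughly 10.

```r
set.seed(5)
n <- 307

# Simulated stand-in data with the column types discussed in the chat.
journals <- data.frame(
  requests      = rbinom(n, 1, 0.3),
  Impact.Factor = pmax(round(rlnorm(n, meanlog = 1, sdlog = 0.8), 1), 0.1),
  subscription  = sample(c("Subscription", "Hybrid", "OA"), n, replace = TRUE),
  is.Ecology    = rbinom(n, 1, 0.5),
  is.EnvSci     = rbinom(n, 1, 0.4),
  is.EvoBio     = rbinom(n, 1, 0.3),
  affiliation   = rbinom(n, 1, 0.4)
)
journals$publisher_code <- factor(
  sample(c("Other", "Elsevier", "Wiley", "Springer", "Taylor"), n, replace = TRUE),
  levels = c("Other", "Elsevier", "Wiley", "Springer", "Taylor")
)

# Collapse the three-level subscription model into one binary variable.
# Which levels to merge is a judgment call; this is just one option.
journals$hassomeOA <- as.numeric(journals$subscription != "Subscription")

fullmodel <- glm(
  requests ~ log(Impact.Factor) + hassomeOA +
    is.Ecology + is.EnvSci + is.EvoBio +
    publisher_code + affiliation,
  data = journals, family = binomial(link = "logit")
)

# Coefficients spent, not counting the intercept.
n_coefs <- length(coef(fullmodel)) - 1
print(n_coefs)
```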
Heather: remember to hold off interpretations still.... because doing data analysis takes several attempts to get right
esp when still a newbie (and I count as that too)
things like realizing, duh my impact factor would really be better as a log transform
11:58 AM 
can change your results and thus your interpretations
so treat it all as a work in progress
Nicholas: ok
Heather: until we are convinced that the data is a good representation, doesn't have mistakes, the model is a good fit, etc etc etc
11:59 AM 
I'll probably be away from chat for a bit, but then will be back
Nicholas: ok. I'll be here
Heather: for sure let's talk again in an hour or two....
Nicholas: until about 5 my time, then I have a conf to go to
Heather: ok.
12:00 PM 
enjoy the conf, cool.
see you in a bit! (oh, btw, I'm assuming you are adding this to an .R file and putting it in git....
if not, would be a good time now....)
have you figured out how to add new revisions to a gist?
12:01 PM 
that way you can keep the same ID number and just update the contents
Nicholas: I've got a text file that I'll put into a gist
Heather: yeah great
that would probably be an easy way for me to see/run your code
play with the revisions stuff till you figure it out or ask.
ok, later!
12:02 PM 
Nicholas: bye thanks again
Heather: my pleasure!
