DataONE:Notebook/Summer 2010/2010/07/26


Chat, Heather and Nic


From: Nicholas Weber <nicholas.m.weber@gmail.com>
Date: 2010/7/26
Subject: Chat with Nicholas Weber
To: hpiwowar@gmail.com


9:53 AM 
Heather: Hi Nic, how's it going?
19 minutes
10:12 AM 
Nicholas: Hi
10:13 AM 
It's ok. I tried what you suggested by making the publishers into categories, but none of them had a p-value < .05
7 minutes
10:21 AM 
Heather: Hi Nic
  
No problem
  
at this point all p-values should be taken with a grain of salt anyway
  
because it is not clear that we have all the relevant variables in our equation
  
do you have some more time now or later?
10:22 AM 
thinking we can go over how to compute predicted probabilities, and then
  
what variables should be in your equation :) What do you think
  
?
  
do you have other questions first, or ?
  
(oh, probably lunch time for you, eh? tell me when will be a good time)
10:23 AM 
Nicholas: hi sorry
  
no I was just in R trying to figure something out
10:24 AM 
now is good, or whenever, I'll turn my volume on so I hear it if you ping me
 
Heather: ok now is good for me too
 
Nicholas: one quick question
 
Heather: yes
10:26 AM 
Nicholas: when I was trying to create the is.Ecology variable
  
I kept getting NA's instead of 0's
 
Heather: ok, what is the code you are trying
 
Nicholas: I was doing : > is.Ecology=0
> 
> is.Ecology[sub("*Ecology*", ISI.Category)] = 1
  
nope
  
sorry
  
grep instead of sub
10:27 AM 
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
  
I set is.Ecology to 0 initially
10:28 AM 
Heather: ok, what is your exact code?
  
are you doing
 
Nicholas: is.Ecology=0
  
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
10:29 AM 
but then when I try is.Ecology I get NA instead of 0
 
Heather: yes, ok so I think this part is a bit tricky
  
the problem is that when you say
 
Nicholas: so when I try to do "mylogit = glm(requests~Impact.Factor + is.Ecology, family=binomial(link="logit"), na.action=na.pass)"
 
Heather: is.Ecology=0
 
Nicholas: I get an error
 
Heather: it doesn't know if you want that to just be a number, or a "vector" or what
10:30 AM 
Nicholas: oh
 
Heather: when you say mydataframe$is.Ecology = 0
  
then it knows that you want every row in your data frame to have a 0 in it
  
but when you use the attach
  
then add a new variable
  
R doesn't know if you want that to be a "data frame" sort of variable, or just a single number 0, or what
  
so it defaults into a single number 0, I think
10:31 AM 
there are a few fixes
 
Nicholas: shoot, I think I remember reading that in the intro too
  
ok, sorry
 
Heather: is.Ecology[!grep("*Ecology*", ISI.Category)] = 0
  
nope sorry
  
I meant to type that into R first to try it out, and it doesn't seem to work
10:32 AM 
ummm hmmm, there are lots of fixes, I'm just trying to think of the nicest
  
how about is.Ecology = rep(0, length(ISI.Category))
10:33 AM 
yeah, then your line
  
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
  
does that seem to work?
 
Nicholas: yup that worked
 
Heather: rep means repeat
10:34 AM 
so repeat the number 0 many times, the same number of times as there are items in ISI.Category
  
make sense?
 
Nicholas: yeah
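  
Pulling the recipe above together, a minimal sketch of the indicator variable, assuming the attached data frame has an ISI.Category column (in grep's regular-expression syntax the * wildcards aren't needed; a plain substring works):

    # start with one 0 per row, matching the length of ISI.Category
    is.Ecology = rep(0, length(ISI.Category))
    # grep returns the row indices whose category mentions Ecology; flip those to 1
    is.Ecology[grep("Ecology", ISI.Category)] = 1
    table(is.Ecology)  # sanity check: counts of 0s and 1s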
 
Heather: cool
  
ok, probabilities then talking variables
  
so in the tutorial,
10:35 AM 
we'll skip the section about deviance residuals for now
  
and log likelihood
  
and jump to the red part called "using predicted probabilities"
 
Nicholas: ok
10:36 AM 
Heather: try these three lines:
  
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))
  
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
  
newdata1
  
you get a table with some probability looking things
  
?
 
Nicholas: yes
10:37 AM 
Heather: ok, so what that means is that I gave our "model", our regression equation, the thing stored in mylogit, some example datapoints
  
I gave it datapoints where noOA is 1 and Impact.Factor is several possibilities
  
and it used the coefficients it estimated and plugged in those values
10:38 AM 
and computed the probability that a datapoint like that would have a sharing policy
  
so for a journal that is not OA and has an IF of 1, the probability is just 8%
10:39 AM 
but for a journal that is not OA and has an IF of 15, the probability is 60%
  
now try these three lines:
  
newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
  
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
  
newdata1
  
this is the same, but noOA = 0... so in other words, it is a journal that has some OA
10:40 AM 
for a journal with some OA and an IF of 1, it has a 14% prob of data sharing policy, and with an IF of 15 it has a 72% probability
  
make sense?
  
it is just a different way to interpret what the slopes/coefficients are actually telling us :)
10:41 AM 
Nicholas: ok
 
Heather: questions?
10:42 AM 
Nicholas: I think I added a column so I'm seeing something different
  
one sec
 
Heather: oh I see, yeah that would be confusing
10:43 AM 
here is the orig that I'm using:
  
mylogit = glm(requests~Impact.Factor + noOA, family=binomial(link="logit"), na.action=na.pass)
  
then
10:44 AM 
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))
  
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
  
newdata1
  
then do you get the first line ending with
  
0.08497077
 
Nicholas: yup
  
ok got it now
 
Heather: cool
10:45 AM 
ok, so for no OA, IF=1, prob of plan = 8%
  
no OA, IF 15, prob of plan = 60%
  
make more sense now?
 
Nicholas: yup
10:46 AM 
as IF goes up, probability of a data sharing plan increases if no OA
 
Heather: right. now what about if some OA?
  
then would try it again with the noOA variable set to 0 like this:
  
newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
10:47 AM 
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
  
newdata1
  
so as IF goes up, probability of a data sharing plan increases if some OA, also
  
and the overall level is higher, right?
 
Nicholas: yes, increase is more dramatic
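  
The whole predicted-probability recipe in one place, using the same names as in the chat (mylogit, requests, Impact.Factor, noOA):

    # fit the logistic regression (requests is 1 if the journal has a sharing policy)
    mylogit = glm(requests ~ Impact.Factor + noOA,
                  family = binomial(link = "logit"), na.action = na.pass)

    # example datapoints: journals with no OA, at a sweep of impact factors
    newdata1 = data.frame(noOA = c(1), Impact.Factor = c(1, 5, 10, 15, 20, 25))
    # type = "response" converts the linear predictor into a probability in [0, 1]
    newdata1$requestsP = predict(mylogit, newdata = newdata1, type = "response")
    newdata1

    # same sweep for journals with some OA
    newdata2 = data.frame(noOA = c(0), Impact.Factor = c(1, 5, 10, 15, 20, 25))
    newdata2$requestsP = predict(mylogit, newdata = newdata2, type = "response")
    newdata2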
10:48 AM 
Heather: yes. looked at a different way, you could also be tempted to say, based on this data
  
at an IF of 15, prob of plan if noOA is 60% and prob of plan if some OA is 72%
10:49 AM 
but, you want to keep in mind when you say that that the noOA variable is pretty unclear in our model
  
its p-value was higher than 0.05
  
its confidence intervals were wide, etc
10:50 AM 
so while there is a coefficient calculated and the table of probabilities uses that coefficient, we want to be careful about saying things about trends in the noOA variable
  
whereas we feel quite confident about saying things about trends in IF because its p-value was significant
  
see how the p-value and confidence interval interpretations need to be at the back of our mind as we play with the numbers?
  
confusing, or ok?
10:51 AM 
Nicholas: no I understand, but...this might be a naive question, why would we continue to play with the numbers if we know our p value and conf int aren't sig?
 
Heather: yeah, it is a good question
10:52 AM 
some people would advocate for dropping noOA out of our model when we realize it isn't significant
  
and rerunning our statistics
  
and calculating our probabilities based on the streamlined model
10:53 AM 
that said, the 5% threshold is in some ways arbitrary
  
and so others say we should keep the nonsignificant things in anyway, because we don't actually know enough to take them out
  
so ???. it depends.
10:54 AM 
mostly, whether you keep them in or take them out, you need to be very hedgy whenever you say anything about them
  
does that help?
 
Nicholas: yes
10:55 AM 
so when you present stats like that, if you choose to, how do you present them
  
by present I mean write about them
 
Heather: yup
  
that's one of the things I love about the tutorial....
10:56 AM 
"These findings can also be interpreted using predicted probabilities. With all other variables held constant at their mean, the probability of admission for a gpa of 2.0 was .15, while a gpa of 3.0 resulted in a .26 probability of admission and a gpa of 4.0 was associated with a .40 probability of admission. Likewise, for gre scores of 400, 500, 600 and 700, the probabilities of admission were .22, .26, .31 and .37, respectively, while holding other predictors constant at their mean."
  
to put that in our language:
10:57 AM 
For journals with no OA content, for IF values of 1, 5, 10, ... , the probabilities of having a data sharing plan were x, w, sd, respectively
  
Likewise, for journals with OA content, .....
10:58 AM 
And in the first sentence I would say:
10:59 AM 
A logit regression was used to predict whether a journal had a data sharing plan from the journal's impact factor and whether or not it publishes any articles open access. IF was a significant predictor of
  
having a data sharing plan, but publishing OA content was not statistically significantly related to having a data sharing plan.
  
does that help?
11:00 AM 
Nicholas: yes, thats great.
 
Heather: cool. yeah, I get how having it very applied to your own case can help a lot :)
  
ok, so one more caveat I want to emphasize before we start planning analyses
11:01 AM 
it can be tempting to really run with this stat analysis stuff
  
but important to remember its limitations
  
lots of limitations, but the one I'll emphasize right now is that
  
it is correlation not causation
  
so it isn't necessarily true that having a high IF CAUSES the journal to have a data sharing plan
11:02 AM 
Nicholas: right
 
Heather: it could be that having a data sharing plan CAUSES it to have a high IF (ok, unlikely, but still)
  
or having a high IF and having a data sharing plan are both correlated to being published out of England, or something, where they are good publishers and they love data sharing or something
11:03 AM 
anyway, just want to reiterate that so that you can practice writing things up accordingly
 
Nicholas: ok
 
Heather: it can be tempting to say "increases" but often more appropriate to say
  
"is associated with increased" or something
11:04 AM 
disputes can be had about which is better scientific writing, so don't second-guess yourself much as you are writing this up
  
just want to highlight the issue
  
ok?
 
Nicholas: thats very helpful
 
Heather: ok
11:05 AM 
now.
  
what variables do you think you want to look at?
  
we'll think about what variables, and how they will be coded
  
one detail is that you can't include oodles of variables
  
because you only have so much data
  
you have 307 datapoints, right?
 
Nicholas: right
11:06 AM 
Heather: so opinions differ, but a rule of thumb is to have about 30 datapoints for every coefficient you are trying to estimate
  
so that would be about 10 coefficients
  
and depending on your variables, that is probably fewer than 10 variables.
11:07 AM 
if, for example, you had a publisher column with publisher A, B, C, and other, that is actually 3 coefficients (4 minus one for the base case)
  
so you can see you can spend the 10 coefficients really fast
 
Nicholas: yes
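  
A quick way to see the "4 levels = 3 coefficients" point in R; the publisher names are placeholders:

    # a 4-level factor: one level is absorbed into the base case (intercept),
    # and each of the other 3 gets its own dummy column, i.e. its own coefficient
    publisher_code = factor(c("Elsevier", "Wiley", "Springer", "Other"))
    model.matrix(~ publisher_code)  # intercept plus 3 dummy columns, not 4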
 
Heather: so don't get tooo hooked up on the 10 right now
  
but it does mean we need to focus on what is important
11:08 AM 
and probably leave most of the "nice to haves" out
  
should I give you a bit more background about why you can't have 42 variables in the equation, or is that enough info on that for now?
11:09 AM 
Nicholas: um, I think we can probably move on -- I don't know the why, but it makes sense that you can't calculate that many variables for a limited amount of datapoints
 
Heather: yeah, good. It doesn't make sense to go into the why on everything today, that is for sure :)
11:10 AM 
ok.
  
so what would be the journal variables at the top of your list? and/or what ones do you think are not important?
 
Nicholas: ok, so as far as significant variables, IF, Requested / Required are definitely important
 
Heather: where "not important" is "not important to test"
 
Nicholas: I think the categories, Ecology, Env Sci and EvoBio probably should be as well
11:11 AM 
but I don't know that
  
we've already seen that the subscription model probably isn't sig.
 
Heather: Nope!
  
we haven't seen that yet :)
  
we just played around with it in some test stats
  
but those were done for learning
  
and not for interpreting yet
11:12 AM 
so don't read anything into them please :)
 
Nicholas: the "has instructions how to cite data" has a really small number of observations
 
Heather: it could be that when you include subscription model in an equation that also has the ISI categories the subscription model is actually relevant, for example.....
 
Nicholas: Affiliation, would be interesting to have as well
11:13 AM 
Heather: where Affiliation would be "does it have an affiliation?"
 
Nicholas: yes, a society affiliation
11:14 AM 
Heather: ok, so is it affiliated to a society
  
btw do you get what I was saying about subscription model?
11:15 AM 
given that we don't know anything about it yet, do you think it is a must have in the initial analysis?
 
Nicholas: yes, just because it wasn't sig for what we just ran, doesn't mean that it's not significant in all cases
 
Heather: right... but even more....
 
Nicholas: I do
11:16 AM 
Heather: we were just running things unofficially, as learning
 
Nicholas: right we can't discard it just because we know it to not be sig right now
 
Heather: at the risk of belabouring the point, I'm going to say.....
11:17 AM 
" because we know it to not be sig right now" nope... we DON'T know it to be not sig right now
  
we were playing with some data, but we weren't being careful yet, and so our stats inputs and methods probably weren't all proper yet, and so we don't actually know anything yet
11:18 AM 
Nicholas: right... yes, I think subscription model is important and it should be included
 
Heather: ok, cool.
 
Nicholas: could we divide the publishing group into Elsevier, Wiley, Springer, Taylor vs. everyone else -- or would that not be sound?
11:19 AM 
 
Heather: I think that would be ok
  
anything else?
11:20 AM 
Nicholas: I don't think so
11:21 AM 
Heather: yup, seems like a good list to me
11:22 AM 
so the variable we are regressing on, our dependent variable, is "does the journal have a data sharing policy that requests or requires data sharing" aka requests (or some other name?)
  
and the indep variables are:
  
IF
  
subscription model
  
categories
  
publisher
  
society affiliation
  
did I get them all?
11:23 AM 
Nicholas: it's probably not significant to consider "instructions on how to cite data"
  
because there are so few instances?
 
Heather: yes, and it isn't exactly clear how it relates to the purpose of this analysis
  
well, that is too strong
 
Nicholas: oh, yes that too :)
11:24 AM 
Heather: it isn't quite related to the purpose of this analysis
  
I think ideally we would do another, very similar analysis
  
where instead of predicting whether the journal had a policy to request data sharing, we look at whether they have a policy about how to cite data
  
alas we may not have enough +ve cases to have much of a story
11:25 AM 
another "ideally we would do" is unlump requests from requires in data sharing policy
  
I think let's put that one on the backburner for today
  
but keep it in mind, write it down as a todo for tomorrow or something
 
Nicholas: ok
  
and to do so
 
Heather: does that all make sense?
11:26 AM 
yes, so mostly I don't think I've shown you the tools yet to do requests vs requires well
 
Nicholas: ok
  
yes that makes sense
 
Heather: it is a step more complicated than logistic regression
  
so let's hold off till we nail this one :)
11:27 AM 
Nicholas: good plan
 
Heather: btw, you do realize that you are currently doing Stats 301 or something, right? I mean if you feel lost ever, it is with good reason.
  
you are holding on well, nic.
  
keep asking if/when/as it gets confusing
11:28 AM 
Nicholas: oh thanks, I don't know about that but thanks...I've already lined up a stats tutor for next semester
 
Heather: and I'm guessing you'll feel all sorts of Ahhhh moments next term as you realize in retrospect all the things we are doing :)
 
Nicholas: fingers crossed
 
Heather: and no doubt some urggg moments as you realize some of the steps we are skipping, but oh well, c'est la vie :)
  
ok.
11:29 AM 
Nicholas: quick question
 
Heather: yes?
 
Nicholas: have you taught stats before?
 
Heather: nope
  
I haven't taught anything before, actually
 
Nicholas: well, you're a patient instructor, you'd thrive
11:30 AM 
Heather: that is kind of you. I do like it. I wish I'd done it before, though, clearly more practice about what order to talk about things, what examples to use, etc is useful!
 
Nicholas: you can cut your teeth on google chat, and then take your refined skills to the classroom
11:31 AM 
Heather: yeah. though it feels like it would be a whole different world. hard to tell if the students are really with you, no?
 
Nicholas: yeah, definitely... ok thanks for indulging my curiosity
11:32 AM 
Heather: ok. variables.
  
I think we can do a bit here and then you'll be equipped to go do a bit more R coding and running on your own
  
then we can sync up in a few hours again
  
does that work?
 
Nicholas: sounds great
 
Heather: I'm around till 3 my time
  
ok
11:33 AM 
so let's think about our variables
  
and what kind of variables they are
  
impact factor is a real number, right? a float.
 
Nicholas: yes
 
Heather: now ideally when you have real numbers in a regression, they have the shape of a bell curve or a normal distribution (same thing)
11:34 AM 
if you do
  
hist(Impact.Factor)
  
you can see in the graph that it doesn't look like a bell curve
 
Nicholas: no it doesnt
11:35 AM 
Heather: now try this
  
hist(log(Impact.Factor))
  
much more like it, eh?
 
Nicholas: yes
 
Heather: do you know enough about logs to know what that is doing?
 
Nicholas: no
 
Heather: ok no problem
11:36 AM 
ok, I'm not going to explain it in much detail
  
it is actually pretty cool, useful, prevalent
  
but right now we are just going to treat it as a black box
 
Nicholas: thats fair
11:37 AM 
Heather: it is a "transform"
11:38 AM 
that takes the number 1 and turns it into 0
  
log(1)
  
so an IF of 1 has a log(IF) of 0
  
it takes numbers between 0 and 1 and turns them into negative numbers... more negative the smaller they are
  
log(0.1)
  
log(0.01)
11:39 AM 
and takes numbers bigger than 1 and turns them into positive numbers
  
but it shrinks the scale down such that the logs of really big numbers aren't that much larger than the logs of medium-big numbers
11:40 AM 
log(10)
  
log(10000)
  
log(100000000)
  
make enough sense?
 
Nicholas: yes
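  
For reference, the values behind those examples (R's log() is the natural log, base e):

    log(1)      #  0
    log(0.1)    # -2.30  (inputs below 1 go negative)
    log(0.01)   # -4.61
    log(10)     #  2.30
    log(10000)  #  9.21
    log(1e8)    # 18.42  (huge inputs, but their logs stay close together)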
 
Heather: so we are taking the log of all of the values in our impact factor
  
and putting those into our model instead of the impact factor
11:41 AM 
and that will make the "fit a best fit line" math be more robust
  
because that math assumes that the data coming in has a normal distribution
11:42 AM 
the only tricky part is that it can't handle impact factors of 0
  
log(0)
  
I'm guessing 0 is actually NA in this case?
 
Nicholas: right, I noticed that yesterday, there shouldn't be an impact factor of 0... it should be NA right?
 
Heather: yes
11:43 AM 
for now, I'll work around it by just adding 0.1 to it as a temporary hack
  
so to add the log of the IF in the model, you do this:
  
mylogit = glm(requests~log(0.1 + Impact.Factor) + noOA, family=binomial(link="logit"), na.action=na.pass)
  
summary(mylogit)
11:44 AM 
it does make the coefficient a bit harder to interpret
  
(though I think if you replace the 0 with an NA and take out the +0.1 hack it will get easier)
11:45 AM 
but you can see that now the noOA variable is a bit more significant
  
p=0.078
  
so still not sig at p<0.05 .... but more so
  
modelling your data better can show things that are otherwise hidden
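  
A sketch of the cleaner alternative mentioned above: recode the bogus 0s as NA so the +0.1 hack can be dropped (mydataframe is an assumed name; glm's default na.action drops rows with NA):

    # impact factors recorded as 0 are really missing values
    mydataframe$Impact.Factor[mydataframe$Impact.Factor == 0] = NA
    # log() now never sees a 0, so no offset is needed
    mylogit = glm(requests ~ log(Impact.Factor) + noOA,
                  family = binomial(link = "logit"), data = mydataframe)
    summary(mylogit)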
11:47 AM 
ok
  
so let's leave impact factor at that for now
  
subscription model
  
it has three levels now, right?
 
Nicholas: yes
11:48 AM 
Heather: I think maybe for the purposes of this analysis we might want to collapse it into two
  
to save "the number of variables"
  
and make it easier to interpret
  
I'll leave it up to you to decide what to collapse
  
this will make it a binary variable
  
something like
  
hassomeOA = 1 or 0
11:49 AM 
depends on what you decide
  
ok?
 
Nicholas: ok
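  
One possible collapse into a binary variable; the level names and the Subscription.Model column are guesses, so substitute whatever the column actually contains:

    # count full-OA and hybrid journals as "has some OA", everything else as 0
    hassomeOA = ifelse(Subscription.Model %in% c("full OA", "hybrid OA"), 1, 0)
    table(hassomeOA, Subscription.Model)  # check the collapse did what you expect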
 
Heather: ok, categories
  
there are how many main categories?
  
and can journals belong to more than one?
11:50 AM 
Nicholas: yes
  
I was just interested in the three categories we originally used to gather this list
 
Heather: yes, cool
 
Nicholas: Ecology, EvoBio and Env Sci
11:51 AM 
Heather: then a binary variable for each of those?
  
since some journals are members of more than one?
 
Nicholas: ... I guess so
 
Heather: ok. it is true that some journals are members of more than one, right?
 
Nicholas: yes
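  
The same grep recipe as is.Ecology, repeated per category; the exact category strings are guesses, so match them to what ISI.Category really holds. A journal in several categories simply gets a 1 in more than one column:

    is.Ecology = rep(0, length(ISI.Category))
    is.Ecology[grep("Ecology", ISI.Category)] = 1

    is.EvoBio = rep(0, length(ISI.Category))
    is.EvoBio[grep("Evolutionary Biology", ISI.Category)] = 1

    is.EnvSci = rep(0, length(ISI.Category))
    is.EnvSci[grep("Environmental Sciences", ISI.Category)] = 1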
11:52 AM 
Heather: (btw I'm using the word binary. it just means two-valued. usually 0 and 1.
  
)
  
ok cool
  
then publisher
  
pick the big 3? the big 4?
  
you still there?
11:53 AM 
for publisher, pick the big 3 or 4 and call the others "Other"
  
Hi Nic. I think we lost the connection. are you still there?
11:54 AM 
Nicholas: yes
  
ok, publisher
 
Heather: pick the big 3 or 4 and call the others "Other"
 
Nicholas: ok
 
Heather: have them all in one column. so not an is.Wiley column, but instead a
11:55 AM 
publisher_code column or something
  
that has one of four values in it
  
"Wiley", "Elsevier", "Other" or whatever
  
we'll need to work together to figure out how to interpret the results once you get this column
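  
A sketch of building that single column; Publisher is an assumed column name, and the "big" list is whatever you settle on:

    big = c("Elsevier", "Wiley", "Springer", "Taylor & Francis")
    # as.character guards against ifelse returning factor integer codes
    publisher_code = ifelse(as.character(Publisher) %in% big,
                            as.character(Publisher), "Other")
    publisher_code = factor(publisher_code)  # so glm treats it as categorical
    table(publisher_code)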
11:56 AM 
the society affiliation column is a binary one, just 0 or 1?
 
Nicholas: I have both a yes/ no and a numeric 1 - 0
 
Heather: great!
 
Nicholas: for affiliaiton
 
Heather: ok, that's all of them?
  
so if you can get the data in that format,
  
write it up on your OWW page,
11:57 AM 
try to put them all in the model, separated by + signs
 
Nicholas: ok
 
Heather: then we'll see what we have :)
 
Nicholas: ok
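  
Put together, the full model would look something like this; every variable name is a stand-in for whatever you actually build:

    mylogit = glm(requests ~ log(Impact.Factor) + hassomeOA +
                    is.Ecology + is.EvoBio + is.EnvSci +
                    publisher_code + society.affiliation,
                  family = binomial(link = "logit"))
    summary(mylogit)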
 
Heather: remember to hold off interpretations still.... because doing data analysis takes several attempts to get right
  
esp when still a newbie (and I count as that too)
  
things like realizing, duh, my impact factor would really be better as a log transform
11:58 AM 
can change your results and thus your interpretations
  
so treat it all as a work in progress
 
Nicholas: ok
 
Heather: until we are convinced that the data is a good representation, doesn't have mistakes, the model is a good fit, etc etc etc
11:59 AM 
ok!
  
I'll probably be away from chat for a bit, but then will be back
 
Nicholas: ok. I'll be here
 
Heather: for sure let's talk again in an hour or two....
 
Nicholas: until about 5 my time, then I have a conf to go to
  
ok
 
Heather: ok.
12:00 PM 
enjoy the conf, cool.
  
see you in a bit! (oh, btw, I'm assuming you are adding this to an .R file and putting it in git....
  
if not, would be a good time now....
  
have you figured out how to add new revisions to a git?
  
to a gist, I mean?
12:01 PM 
that way you can keep the same ID number and just update the contents
 
Nicholas: I've got a text file that I'll put into a gist
 
Heather: yeah great
  
that would probably be an easy way for me to see/run your code
  
play with the revisions stuff till you figure it out or ask.
  
ok, later!
12:02 PM 
Nicholas: bye thanks again
 
Heather: my pleasure!

