DataONE:Notebook/Summer 2010/2010/07/27

From OpenWetWare

Revision as of 17:46, 27 July 2010 by Nic Weber (Talk | contribs)
Heather: Hi Nic!
  
Have you got a few minutes to chat stats, or when would be good?
10:52 AM 
me: Hi Heather,
  
now is good
 
Heather: cool
  
I love watching your evolving code, good stuff
  
A few ideas
10:53 AM 
one idea is not to worry right now about what is statistically significant or not
  
no need to call it out in your results
 
me: ok
 
Heather: you are still clearly getting the data fleshed out and cleaned, your columns rejiggered.... all of this makes the results change
  
and really the results aren't to be trusted until all of that is done
10:54 AM 
me: yeah, I've spent a lot of time cleaning up columns this morning
 
Heather: and in some ways when you comment on it early it suggests that there is something worth looking at, when really there isn't yet :)
  
what I mean is, there isn't anything worth looking at yet because it is all still transient
 
me: ok I understand
 
Heather: cool
10:55 AM 
ok, different topic: columns
  
and whether they are separate or together
  
I haven't explained "factors" to you yet, is that right?
 
me: that's right, we haven't gone over that
 
Heather: right. so let's do that a bit right now :)
 
me: ok
 
Heather: there are multiple datatypes for a column
10:56 AM 
binary
  
real/float
  
integer
  
and "category"
  
where category could be favourite colour
  
and if you could only have one favourite colour
  
then it would make sense to have a column called fav color
10:57 AM 
and within it it would say "Red" "Blue" "undecided" etc
  
these are also called nominal variables
  
and also called, in R, "factors"
  
where the levels of the factors are the distinct values it can take
  
make sense so far?
10:58 AM 
two more things about factors
  
one is that they can be ordered
 
me: yes
 
Heather: so for example a journal policy can be "weak" "medium" or "strong"
  
this isn't really an integer 0, 1, 2
10:59 AM 
because it doesn't make sense to do math on it. strong isn't medium*2
  
but it is more than just a category, because it is ordered
  
so in R this is called an ordered factor
  
and when you have an ordered factor it can help to tell R that because then the stats can use that information
11:00 AM 
ok?
 
me: ok
 
Heather: to do that, you can see the command
  
?ordered
  
or we can just talk about it when it is relevant :)
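
Editor's note: a minimal sketch of a plain factor and an ordered factor in R, in case it helps to see the idea concretely. The vectors fav.colour and policy are made-up illustrations, not data from the project.

```r
# Hypothetical data: one favourite colour per person (a nominal variable).
fav.colour <- factor(c("Red", "Blue", "undecided", "Blue"))
levels(fav.colour)    # the distinct values the factor can take

# Hypothetical journal policy strengths, declared as an ordered factor.
# The levels argument fixes the order: weak < medium < strong.
policy <- factor(c("weak", "strong", "medium", "weak"),
                 levels = c("weak", "medium", "strong"),
                 ordered = TRUE)

is.ordered(policy)    # confirms R knows about the ordering
policy < "strong"     # comparisons respect the declared order
```

Arithmetic such as policy * 2 is not meaningful for a factor, which matches the point that "strong" is not "medium" times two.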
  
a different conversation about factors is when to put them in the same column, and when to make a bunch of different binary columns
  
if you allow people to have one fav colour, then you should just have one column
11:01 AM 
but if you let people have several fav colours, all of a sudden one column doesn't work very well
  
and it works better to have multiple columns, is.red.a.fav is.blue.a.fav that are all binary
  
so..... since a journal can have multiple ISI categories, each of the categories should have its own column
11:02 AM 
but since it can only have one publisher, it makes most sense for the publisher to stay in a single column that has multiple factor "levels"
  
that helps to interpret the stats
  
I'll show you how to do that.
  
make sense as a concept?
11:03 AM 
me: yes I think so
 
Heather: any questions about it? you seem a bit unsure?
11:04 AM 
me: no I think I get it
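
Editor's note: a rough sketch of the two layouts discussed above, assuming a small made-up table. The journal names, publishers, ISI categories, and the semicolon-separated isi.categories column are all hypothetical; the real spreadsheet may be organised differently.

```r
# Hypothetical journals; every value here is invented for illustration.
journals <- data.frame(
  journal        = c("J1", "J2", "J3"),
  publisher      = factor(c("elsevier", "wiley", "other")),
  isi.categories = c("Ecology;Genetics", "Ecology", "Genetics;Oncology"),
  stringsAsFactors = FALSE
)

# One publisher per journal: a single factor column with several levels.
levels(journals$publisher)

# Several ISI categories per journal: one binary column per category.
category.list  <- strsplit(journals$isi.categories, ";", fixed = TRUE)
all.categories <- sort(unique(unlist(category.list)))
for (category in all.categories) {
  column.name <- paste("is", tolower(category), sep = ".")
  journals[[column.name]] <- sapply(category.list,
                                    function(x) as.integer(category %in% x))
}

journals   # now has is.ecology, is.genetics, is.oncology as 0/1 columns
```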
 
Heather: ok.
  
so I think using your PubCode variable in the analysis directly would probably work, would it?
  
how many different values can it take?
11:05 AM 
me: four
  
other, elsevier, wiley, springer
 
Heather: and taylor? or not?
 
me: well I was finding that I had too many variables, so I collapsed taylor
  
into Other
11:06 AM 
Heather: gotcha
  
ok, so I think if you could rerun a glm including PubCode and post its results, I think we could go through them and I could show you how to interpret them.
11:07 AM 
one command that I've never used but I think would be helpful is relevel
  
it tells R which level to use as the basis, the reference level
  
I think your results would be most interpretable if that was "Other"
11:08 AM 
so I think (but am not sure) that the following code will work:
  
relevel(PubCode, ref="Other")
  
you'd put it right before the table(PubCode) command, before the glm call
11:09 AM 
let me know?
 
me: ok just one sec
11:13 AM 
ok, I just posted it http://www.openwetware.org/wiki/DataONE:Notebook/Data_Citation_and_Sharing_Policy/2010/07/27#Cleaner_Analysis
11:17 AM 
Heather: ok, so it does still have a taylor in it, is that right?
11:18 AM 
me: shoot I'm sorry
  
I called the wrong file in
11:19 AM 
Heather: also, it looks like this line has an error, an extra ] at the end?
  
> Afil = ifelse(Affiliation.Code > 0, 1, 0)] # Society Affiliation 
Error: unexpected ']' in "Afil = ifelse(Affiliation.Code > 0, 1, 0)]"
11:21 AM 
Nic, I think I made a mistake... I think you actually have to make it
  
PubCode = relevel(PubCode, ref="Other")
  
just
11:22 AM 
relevel(PubCode, ref="Other")
  
isn't enough....
  
it has to be assigned back to the PubCode variable
  
I'm learning too, clearly :)
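
Editor's note: a small sketch of why the assignment matters. relevel returns a new factor and leaves its argument untouched. The PubCode values below are made up, and the levels are listed explicitly so the starting order does not depend on the locale.

```r
# Hypothetical publisher codes, using the level names from the chat.
PubCode <- factor(c("elsevier", "Other", "springer", "wiley", "Other"),
                  levels = c("elsevier", "Other", "springer", "wiley"))
before <- levels(PubCode)[1]             # reference level is the first level

relevel(PubCode, ref = "Other")          # result is printed, then thrown away
unchanged <- levels(PubCode)[1]          # PubCode itself has not changed

PubCode <- relevel(PubCode, ref = "Other")   # assign the result back
after <- levels(PubCode)[1]              # "Other" is now the reference level

c(before = before, unchanged = unchanged, after = after)
```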
 
me: ok
  
let me fix that and the Afil
11:23 AM 
Heather: sorry about not seeing it before. your results up on your OWW page helped me figure it out :)
  
ok
7 minutes
11:30 AM 
me: It might take me a few more minutes, I don't know why but it keeps showing Taylor in PubCode
 
Heather: ok, no prob
12 minutes
11:42 AM 
me: Ok, I posted what I ran in OWW-- I think there is a problem somewhere though
11:43 AM 
http://openwetware.org/wiki/DataONE:Notebook/Data_Citation_and_Sharing_Policy/2010/07/27#Cleaner_Analysis
11:45 AM 
Heather: what makes you think that?
11:46 AM 
the fact that there is no PubCodeother in the results is actually a good thing, in case that was it....
11:47 AM 
the reason that is true, is that "other" is used as the base case or the reference
  
so... to interpret these other factors,
  
using the "exp(confint(mylogit))" results
  
PubCodeelsevier 1.69733323 11.6908318
11:48 AM 
means that, compared to "other" publishers (= ones coded as other), journals published by elsevier have 1.7 to 11.7 times the odds of having a data sharing policy
  
whereas
  
PubCodespringer 0.15399592 2.6087098
11:49 AM 
means that, for a journal published by springer, the odds of having a data sharing policy are between 0.15 and 2.6 times the odds for "other" publishers.
  
(since this goes from less than 1 to more than 1, it doesn't actually tell us anything interesting.... not coincidentally... the pvalue for PubCodespringer is large!)
  
make any sense?
11:50 AM 
me: yes
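
Editor's note: a sketch of the whole workflow on simulated data, to show where the numbers being interpreted come from. Everything here is an assumption for illustration: the outcome name has.policy, the sample size, and the probabilities are invented, and only PubCode, mylogit, and the exp(confint(mylogit)) call are taken from the chat.

```r
set.seed(1)
n <- 200

# Simulated publisher codes, with "Other" set as the reference level.
PubCode <- factor(sample(c("Other", "elsevier", "springer", "wiley"),
                         n, replace = TRUE))
PubCode <- relevel(PubCode, ref = "Other")

# Simulated 0/1 outcome; the probabilities per publisher are made up.
p <- c(Other = 0.3, elsevier = 0.6, springer = 0.3, wiley = 0.4)
has.policy <- rbinom(n, 1, p[as.character(PubCode)])

# Logistic regression of the outcome on publisher.
mylogit <- glm(has.policy ~ PubCode, family = binomial)

# Exponentiated coefficients are odds ratios relative to "Other";
# exp(confint(...)) gives the matching confidence intervals.
odds.ratios <- exp(cbind(OR = coef(mylogit), confint(mylogit)))
odds.ratios
```

There is no PubCodeOther row in the output because "Other" is the baseline that every other publisher is compared against. An interval that spans 1 is consistent with no difference from that baseline.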
 
Heather: cool
  
ok, any quick questions before we zoom over to the group chat?
11:51 AM 
I'm going to ask you and Sarah both (Valerie will be joining a bit later, hopefully) to give everyone a brief rundown
  
on what you've been doing and what your plans are.
  
that sound ok?
 
me: sure
  
no other questions right now
 
Heather: great!
11:52 AM 
you relatively comfortable with understanding the statistics you are running right now? on a scale of 0 to 10?
 
me: 6
 
Heather: nice. good.
11:53 AM 
me: I wouldn't say I totally get it, but it makes more sense when I re read our conversations
 
Heather: keep asking if there are things that you'd like to talk through some more.
  
yup, makes sense.
  
great!
  
ok, off to hopefully try to join everyone in. wish us luck and strong connections!
 
me: ok

