DataONE:Notebook/Summer 2010/2010/07/19

From OpenWetWare
Heather: Hi Nic!
 
me: Hi Heather
 
Heather: Now a good time for talking R?
 
me: sure
 
Heather: kudos btw on getting the submission off on Friday
9:59 AM 
me: do you want to use the desktop sharing application I have or just do it via chat?
 
Heather: maybe chat, and then if we do desktop sharing maybe I'll try Adobe Connect
 
me: oh, no, thank you and Todd for your input
  
ok
 
Heather: a few other projects I am on use that, so probably would behoove me to get more familiar with it
10:00 AM 
you have R on the computer where you are?
 
me: we have this suite of GoTo for my program.. it's pretty awesome
  
very easy to use, but I think it's pricey
 
Heather: do you have a dataset you'd like to work with first?
 
me: yes
 
Heather: good to know. yeah, mostly pricey isn't my thing these days :)
 
me: I got the dataset into R using the command you gave me for dat.raw
 
Heather: online? want to point me to it?
  
great
10:01 AM 
let me pull that up
  
can you run all the example commands successfully, does it look like?
 
me: I couldn't get the graphs to appear
 
Heather: hrm ok
  
are you on a mac?
 
me: when I tried to plot something nothing showed up
  
yes
10:02 AM 
should I send you the .csv?
 
Heather: running the normal R program (as opposed to some variant like JGR or something)?
  
nah, I got it.
 
me: R version 2.11.1
 
Heather: does it complain about not plotting, or it just silently does nothing?
10:03 AM 
me: I'm in the console
  
oh wow
 
Heather: I'm version 2.10.1
  
shouldn't matter much though
 
me: actually it just magically worked
 
Heather: oh wow?
  
oh good
  
the magic of someone else watching, eh?
 
me: ha
10:04 AM 
Heather: great. ok, do the plots and just make sure they all seem to work....
10:06 AM 
me: the plots seem to work
 
Heather: great.
 
me: the abline did not
 
Heather: and they make a bit of sense?
  
or do you want to ask what they are about?
  
hrm, don't know why not
 
me: I think these make a bit of sense
 
Heather: oh well, abline isn't very relevant for now
 
me: abline is ?
10:07 AM 
Heather: yeah, it plots a best fit line
  
on top of the scatter plot
  
mostly it was just showing off some of what could be done
 
me: I've been reading a beginner's guide to R this morning, but not up to speed obviously
 
Heather: and giving a gut feel for regression
 
me: ok
 
Heather: so let's keep going then
  
if there is a problem with abline it will probably show up in something else too
  
so we'll just deal with it then.
10:08 AM 
okey dokey. do you have thoughts about what stats you think you want to do?
  
or, put another way, do you have ideas about hypotheses you want to test or explore with numbers?
10:09 AM 
me: well, to be honest, I don't... I would like to know, mostly for the journals, what kind of correlations we can make between requested or required sharing and impact metrics
 
Heather: ok.
  
in that case, I think let's start with the numbers you pulled together for your poster
  
since those were obviously of interest :)
 
me: ok
 
Heather: and we'll move on from there
  
(this morning, or otherwise soon)
  
so... you calculated some percentages
10:10 AM 
percentages are useful, but they lack information by themselves
  
they don't portray how confident we should be that the percentages would generalize, right?
  
so we supplement them with an n= number
 
me: right
 
Heather: to sort of show how big the sample was that we used to calc the percentages
10:11 AM 
obviously if n=100000 we think the percentages might be more valid/precise/reproducible/etc than if n=10
10:12 AM 
a different way to think about that is that if n=100000, it is really unlikely that the percentage was actually radically different than what we measured
  
and we just got what we measured by chance
  
whereas if n=10, it is totally possible that the mean is really different and we just got what we got by chance
10:13 AM 
one way to use the n and the percentage to encapsulate this idea
  
is to calculate confidence intervals around the percentages
  
have you heard of them? do you understand them: a lot, a little, not really at all?
10:14 AM 
me: heard of them, do not know what they are though
 
Heather: ok, no prob
  
hrm, do you have one of your percentages from your poster handy?
 
me: sure one sec
 
Heather: let me pull up your poster draft, and if you could do that too it would be great
10:15 AM 
me: FINDINGS
Funding Agencies (53 total evaluated): 44% (23 of 53) require data sharing in some way; 8% (4) specify the duration for which data must be preserved by the primary investigator; 25% (13) give directions on the type of repository to which data should be deposited (while only 2 name particular repositories); 8% (4) provide supplemental funds available for deposition of data; only one gives direction on how such data is to be cited.
 
Heather: right. great. or for journals, the dataset we have in R right now, "13% (40 of 307) request or require data to be archived;"
  
so 13% is useful, but we want more info.
  
we want to know, might it be 12%? 14%?
  
what about 5%? 25%?
10:16 AM 
how confident are we that the actual percentage, if we were able to measure the entire sample that this came from, would be 13%?
  
so one way we do this is to calculate "confidence intervals"
  
the most common kind are 95% confidence intervals.
10:17 AM 
if the confidence intervals were, say 8% to 17%, then that would mean
  
that given the data we have,
  
our best guess is that if we were to repeat this experiment 100 times
10:18 AM 
only 5% of the time would we estimate the percentage to be lower than 8% or higher than 17%
  
just by chance/bad luck
  
this idea takes a while to sink in, so maybe I'll try to say it a few ways.
10:19 AM 
we are 95% sure, given the sample size we used and the data we saw
  
that the real percentage value in the whole universe that we sampled from is between 8 and 17%
  
it is possible it is lower than that or higher than that
  
in fact it could be anything
  
but if it were actually lower than that or higher than that
10:20 AM 
it would make the sample that we got really surprising
  
and extreme surprises don't happen very often
 
me: ok this makes sense
 
Heather: so our best guess is that the real percentage is close to what we got
  
great!
  
given the same percentage,
10:21 AM 
you'll find that estimates made with small sample sizes end up with wider confidence intervals
 
me: so just to make sure, is 95% arbitrary or is that a standard?
 
Heather: i.e. 3 to 25% or something
  
whereas large sample sizes end up with tight confidence intervals: 10 to 15% in this example
  
95% is standard
  
it ties directly to the p<0.05 threshold that is also standard,
10:22 AM 
that you've likely heard about
 
me: ok
 
Heather: they are mostly two ways of measuring the same thing
  
the amount of surprise/bad luck etc. you'd have to have to get a sample so strange/disproportionate that it gave you a weird estimate, if the real actual value was something different
10:23 AM 
the tightness of the confidence intervals also depends on what the percentage is.
  
but no need to go into that now
  
for now let me just show you how to calculate them so you can explore :)
  
if you do ?binconf in R, what happens?
 
me: ok
10:24 AM 
No documentation for 'biconf' in specified packages and libraries:
you could try '??biconf'
 
Heather: ok.
  
if you do "library(Hmisc)" what happens?
10:25 AM 
me: no package called hmisc
 
Heather: ok. do you have a "packages & data" menu at the top?
  
go to package installer in that menu
 
me: ok
10:26 AM 
Heather: then with CRAN (binaries) selected
  
search for Hmisc
  
(caps matter)
  
then "Get List"
 
me: 3.8-2?
 
Heather: do you see an Hmisc package?
  
yup
 
me: I mean does the version matter?
 
Heather: highlight it, check "install dependencies", and click "install selected"
  
no, the version doesn't matter much now
10:27 AM 
sometimes it can, but I don't think it will for what we are doing
  
does it look like it is installing?
 
me: it's done
 
Heather: great
  
then type library(Hmisc)
  
then ?binconf
10:28 AM 
me: ok
  
it's giving me a table
  
with topic and package and description
 
Heather: really? hrm, at the prompt?
10:29 AM 
well, try instead at the prompt typing
  
binconf(40, 307)
 
me: can't find function biconf
10:30 AM 
Heather: ok. when at the prompt you type
  
library(Hmisc)
  
what happens?
 
me: I get another prompt
 
Heather: good
  
then when you type
  
?binconf
  
what happens?
10:31 AM 
me: I get a pop up window that gives me descriptions of conf intervals for binomial probabilities
 
Heather: great
  
now at the prompt type
  
binconf(40, 307)
 
me: PointEst Lower Upper
0.1302932 0.0971623 0.1725619
 
Heather: yay
  
so that there are your confidence intervals
10:32 AM 
me: sorry, I am a bit slow
 
Heather: no, that's fine
 
me: must have mistyped something before
 
Heather: easy to make typos etc in this way of doing it
  
no prob
 
me: so lower upper look to be what you said right?
 
Heather: so if you look at the popup from ?binconf
 
me: 9 and 17
 
Heather: you can see that alpha=0.05
  
the confidence is 1-alpha, so these are 95% confidence intervals
10:33 AM 
if you wanted to compute 99% confidence intervals
  
(which would be wider, right, because to be really sure of something
10:34 AM 
you need to include at least what you thought before, plus give more wiggle room for even stranger subsamples)
  
you would enter
  
binconf(40, 307, 0.01)
 
me: PointEst Lower Upper
0.1302932 0.0885323 0.1876962
 
Heather: but really you only care about 95% confidence intervals, so you can forget that part :)
  
yup
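[Editor's note: Heather's point that the same percentage measured on a bigger sample gives a tighter interval can be checked directly with binconf. A small sketch, using hypothetical counts chosen only so both samples show 13%:]

```r
# Same observed proportion (13%) at two sample sizes; counts are
# made up purely to illustrate how interval width depends on n.
library(Hmisc)

small <- binconf(13, 100)    # n = 100
large <- binconf(130, 1000)  # n = 1000

# Width of each 95% confidence interval -- the larger sample's
# interval is narrower
small[, "Upper"] - small[, "Lower"]
large[, "Upper"] - large[, "Lower"]
```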
10:35 AM 
so you could write a text file that computes the confidence intervals for all of the percentages in your poster, for example
  
that would convey information that is useful for others reading it
 
me: ok
 
Heather: they could know, ok, it is pretty unlikely that the percentage is actually as low as 5%, given this evidence.....
  
etc
10:36 AM 
so that is the first bit of useful stats you can do
  
if you were to write all of those binconf lines, for example
  
you could save them in a text file
  
and give it the extension .R
  
and then it would be called an R script
  
and you could run it either by copying and pasting the contents into the prompt
  
or by typing
10:37 AM 
source("yourfilename.R")
  
at the command line
  
(or some other ways from the menus, I think)
  
it is useful to make and save these .R files
  
(especially with loading the dataset at the top)
10:38 AM 
because they document the stats that you ran
  
and make it very easy and transparent for other people to see what you did
  
make sense?
 
me: um, so when you say write out all the binconf, you mean write them in a text editor like Emacs, right, and then load it into R
10:39 AM 
Heather: yes
 
me: if I save it as an R script
  
ok
 
Heather: yes
  
basically you'd have a lot of lines of binconf, one for each percentage you calculate
 
me: so does R allow you to use # for comments?
 
Heather: maybe a variant of this to make it easier to read
  
round(100*binconf(40, 307), 1)
  
yes
10:40 AM 
me: that way I could name them
  
ok good
 
Heather: I'm not necessarily suggesting that you do this
  
there are some better ways to do it more easily, perhaps,
  
but it might be worth doing it for a few to get the hang of it
  
and a gut feel for how "wide" your confidence intervals are for your different data sets,
10:41 AM 
given the questions that you asked and the differing sizes of your samples
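[Editor's note: the kind of .R script being described might look like the sketch below, using counts quoted from the poster findings earlier in this chat; the file name is just an example. Saved as poster_ci.R, it could be run with source("poster_ci.R").]

```r
# poster_ci.R -- 95% confidence intervals for the poster percentages,
# rounded and shown as percentages; requires the Hmisc package
library(Hmisc)

# Journals: 40 of 307 request or require data to be archived
round(100 * binconf(40, 307), 1)

# Funding agencies: 23 of 53 require data sharing in some way
round(100 * binconf(23, 53), 1)

# Funding agencies: 13 of 53 give directions on the type of repository
round(100 * binconf(13, 53), 1)
```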
 
me: ok so I should just explore a bit with these
 
Heather: yup
 
me: ok
10:42 AM 
Heather: let me give you a bit more to chew on before I run to my next meeting
  
try this
  
Hmisc::describe(dat.raw)
  
you get a long list of stuff?
 
me: yup
10:43 AM 
Heather: ok, that can be a good way to have a look at your data
 
me: whoa this is awesome
 
Heather: one of the things you will realize as you do that
  
is that some of your fields aren't coded for easy stat analysis yet
  
for example,
10:44 AM 
dat.raw$Policy.Requested.or.Required.Authors.to.share.data..for.publication
  
has some random comments in it
  
which is useful, but you need a cleaner column than that if you want to look at a percentage, right?
10:45 AM 
one way you can get a cleaner column is to make one automatically.
 
me: right
 
Heather: for example, you could make a new column called "requested"
  
that is either TRUE or FALSE
 
me: so I tried to code these doing 0,1,2
 
Heather: ok.
  
let me show you one way to do it using text matches, in case that is helpful for this or something else....
 
me: creating a new column that said 0 no request 1 request 2 required ( for example)
10:46 AM 
Heather: two steps
  
first make a new column that is all false
  
dat.raw$requested = FALSE
  
then, change just some of them to true, the ones that contain the word "Requested"
  
dat.raw$requested[grep("+Requested+", dat.raw$Policy.Requested.or.Required.Authors.to.share.data..for.publication)] = TRUE
10:47 AM 
then when you type dat.raw$requested
  
you can see the cleaner column
  
this may not be ideal for this case, I agree a manual coding of 0,1,2 might be better
  
but maybe for pulling out various publishers, ISI categories, whatever
  
one more thing
  
once you have a binary column
10:48 AM 
= one that is just TRUE/FALSE or 1/0
  
you can do a nice plot to look at the odds of something
  
so the percentage of the time that it happens, in a subset of your data
10:49 AM 
can you run the two dat.raw$requested lines above, first?
  
so
  
dat.raw$requested = FALSE
  
then
  
dat.raw$requested[grep("+Requested+", dat.raw$Policy.Requested.or.Required.Authors.to.share.data..for.publication)] = TRUE
 
me: I got this response-- Error in `$<-.data.frame`(`*tmp*`, "requested", value = c(NA, NA, TRUE, : 
replacement has 34 rows, data has 39
10:50 AM 
I think I need to clean up the .csv file ?
 
Heather: Hmmmm I wonder if I cleaned up my data a bit more than you did
  
yeah, I'm guessing so
 
me: ok
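[Editor's note: on a small made-up example (short column name, invented policy strings), the two-step recoding above reduces to one grepl() call. grepl() returns one TRUE/FALSE per row, so there is no index bookkeeping; a length mismatch like the error above usually means the data frame itself has stray rows, e.g. blank lines in the .csv.]

```r
# Toy stand-in for dat.raw; the real policy column has a much longer name
dat <- data.frame(
  policy = c("Requested on submission", "no policy", NA,
             "Required for publication", "data Requested by editor"),
  stringsAsFactors = FALSE
)

# grepl() gives a logical vector exactly as long as the column
# (NA rows come back FALSE), so it can be assigned directly
dat$requested <- grepl("Requested", dat$policy)

dat$requested
# note "Required for publication" is FALSE: this matches the literal
# word "Requested" only
```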
10:52 AM 
Heather: hrm, I was trying to show you something else but out of time to get it to work
  
I have a 3 hr online meeting now
 
me: ick
 
Heather: then could be avail from 2-3 my time
  
or after 7 my time
  
or tomorrow.....
 
me: whatever works best for you
  
I am going to make sure these are more normalized
 
Heather: ok, let's play it by ear a bit
10:53 AM 
me: and then play around with what you gave me
 
Heather: yup, normalizing would help
 
me: I should be online most of the day though
 
Heather: great
 
me: I'll leave my chat open
 
Heather: send me email if/when you run into things that are confusing?
 
me: ok
 
Heather: ok.
  
looking forward to getting into more good stuff today and tomorrow :)
  
bye!
10:54 AM 
me: thanks for the help!
  
bye