DataONE:Notebook/Summer 2010/2010/07/19

Heather: Hi Nic!
me: Hi Heather
Heather: Now a good time for talking R?
me: sure
Heather: kudos btw on getting the submission off on Friday
9:59 AM 
me: do you want to use the desktop sharing application I have or just do it via chat?
Heather: maybe chat, and then if we do desktop sharing maybe I'll try Adobe Connect
me: oh no, thank you and Todd for your input
Heather: a few other projects I am on use that, so probably would behoove me to get more familiar with it
10:00 AM 
you have R on the computer where you are?
me: we have this suite of GoTo for my program.. it's pretty awesome
very easy to use, but I think it's pricey
Heather: do you have a dataset you'd like to work with first?
me: yes
Heather: good to know. yeah, mostly pricey isn't my thing these days :)
me: I got the dataset into R using the command you gave me for dat.raw
Heather: online? want to point me to it?
10:01 AM 
let me pull that up
can you run all the example commands successfully, does it look like?
me: I couldn't get the graphs to appear
Heather: hrm ok
are you on a mac?
me: when I tried to plot something nothing showed up
10:02 AM 
should I send you the .csv?
Heather: running the normal R program (as opposed to some variant like JGR or something)?
nah, I got it.
me: R version 2.11.1
Heather: does it complain about not plotting, or it just silently does nothing?
10:03 AM 
me: I'm in the console
oh wow
Heather: I'm version 2.10.1
shouldn't matter much though
me: actually it just magically worked
Heather: oh wow?
oh good
the magic of someone else watching, eh?
me: ha
10:04 AM 
Heather: great. ok, do the plots and just make sure they all seem to work....
10:06 AM 
me: the plots seem to work
Heather: great.
me: the abline did not
Heather: and they make a bit of sense?
or do you want to ask what they are about?
hrm, don't know why not
me: I think these make a bit of sense
Heather: oh well, that isn't very relevant for now, abline
me: abline is ?
10:07 AM 
Heather: yeah, it plots a best fit line
on top of the scatter plot
mostly it was just showing off some of what could be done
me: I've been reading a beginner's guide to R this morning, but not up to speed obviously
Heather: and giving a gut feel for regression
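(A tiny sketch of what abline does, with made-up column names x and y standing in for two numeric columns; not the actual commands from the examples file:)
  plot(dat.raw$x, dat.raw$y)           # scatter plot of two numeric columns
  abline(lm(dat.raw$y ~ dat.raw$x))    # draw the least-squares best fit line on top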
me: ok
Heather: so let's keep going then
if there is a problem with abline it will probably show up in something else too
so we'll just deal with it then.
10:08 AM 
ok dokee. do you have thoughts about what stats you think you want to do?
or, put another way, do you have ideas about hypotheses you want to test or explore with numbers?
10:09 AM 
me: well, to be honest, I don't... what I'd most like to know about the journals is what kind of correlations we can make between requested or required sharing and impact metrics
Heather: ok.
in that case, I think let's start with the numbers you pulled together for your poster
since those were obviously of interest :)
me: ok
Heather: and we'll move on from there
(this morning, or otherwise soon)
so... you calculated some percentages
10:10 AM 
percentages are useful, but they lack information by themselves
they don't convey how confident we should be that the percentages would generalize, right?
so we supplement them with an n= number
me: right
Heather: to sort of show how big the sample was that we used to calc the percentages
10:11 AM 
obviously if n=100000 we think the percentages might be more valid/precise/reproducible/etc than if n=10
10:12 AM 
a different way to think about that is that if n=100000, it is really unlikely that the percentage was actually radically different than what we measured
and we just got what we measured by chance
whereas if n=10, it is totally possible that the mean is really different and we just got what we got by chance
10:13 AM 
one way to use the n and the percentage to encapsulate this idea
is to calculate confidence intervals around the percentages
have you heard of them? do you understand them a lot, a little, not really at all?
10:14 AM 
me: heard of them, do not know what they are though
Heather: ok, no prob
hrm, do you have one of your percentages from your poster handy?
me: sure one sec
Heather: let me pull up your poster draft, and if you could do that too it woudl be great
10:15 AM 
Funding Agencies (53 total evaluated): 44% (23 of 53) require data sharing in some way; 8% (4) specify the duration for which data must be preserved by the primary investigator; 25% (13) give directions on the type of repository to which data should be deposited (while only 2 name particular repositories); 8% (4) provide supplemental funds available for deposition of data; only one gives direction on how such data is to be cited.
Heather: right. great. or for journals, the dataset we have in R right now, "13% (40 of 307) request or require data to be archived;"
so 13% is useful, but we want more info.
we want to know, might it be 12%? 14%?
what about 5%? 25%
10:16 AM 
how confident are we that the actual percentage, if we were able to measure the entire sample that this came from, would be 13%?
so one way we do this is to calculate "confidence intervals"
the most common kind are 95% confidence intervals.
10:17 AM 
if the confidence intervals were, say 8% to 17%, then that would mean
that given the data we have,
our best guess is that if we were to repeat this experiment 100 times
10:18 AM 
only 5% of the time would we estimate the percentage to be lower than 8% or higher than 17%
just by chance/bad luck
this idea takes a while to sink in, so maybe I'll try to say it a few ways.
10:19 AM 
we are 95% sure, given the sample size we used and the data we saw
that the real percentage value in the whole universe that we sampled from is between 8 and 17%
it is possible it is lower than that or higher than that
in fact it could be anything
but if it were actually lower than that or higher than that
10:20 AM 
it would make the sample that we got really surprising
and extreme surprises don't happen very often
me: ok this makes sense
Heather: so our best guess is that the real percentage is close to what we got
given the same percentage,
10:21 AM 
you'll find that estimates made with small sample sizes end up with wider confidence intervals
me: so just to make sure, is 95% arbitrary or is that a standard?
Heather: i.e. 3 to 25% or something
whereas large sample sizes end up with tight confidence intervals: 10 to 15% in this example
95% is standard
it ties directly to the p<0.05 threshold that is also standard,
10:22 AM 
that you've likely heard about
me: ok
Heather: they are mostly two ways of measuring the same thing
the amount of surprise/bad luck etc you'd have to have to get a sample so strange/disproportionate that it gave you a weird estimate, if the real actual value was something different
10:23 AM 
the tightness of the confidence intervals also depends on what the percentage is.
but no need to go into that now
for now let me just show you how to calculate them so you can explore :)
if you do ?binconf in R, what happens?
me: ok
10:24 AM 
No documentation for 'biconf' in specified packages and libraries:
you could try '??biconf'
Heather: ok.
if you do "library(Hmisc)" what happens?
10:25 AM 
me: no package called hmisc
Heather: ok. do you have a "packages & data" menu at the top?
go to package installer in that menu
me: ok
10:26 AM 
Heather: then with CRAN (binaries) selected
search for Hmisc
(caps matter)
then "Get List)
me: 3.8-2?
Heather: do you see an Hmisc package?
me: I mean does the version matter?
Heather: highlight it, check "install dependencies", and click "install selected"
no, version doesn't matter much now
10:27 AM 
sometimes it can, but I don't think it will for what we are doing
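(If the menus are fiddly, the same install can be done straight from the prompt; this is standard R rather than anything specific to this session:)
  install.packages("Hmisc", dependencies = TRUE)   # fetch Hmisc and the packages it needs from CRAN
  library(Hmisc)                                   # then load it for this session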
does it look like it is installing?
me: it's done
Heather: great
then type library(Hmisc)
then ?binconf
10:28 AM 
me: ok
it's giving me a table
with topic and package and description
Heather: really? hrm, at the prompt?
10:29 AM 
well, try instead at the prompt typing
binconf(40, 307)
me: can't find function biconf
10:30 AM 
Heather: ok. when at the prompt you type library(Hmisc)
what happens?
me: I get another prompt
Heather: good
then when you type ?binconf
what happens?
10:31 AM 
me: I get a pop up window that gives me descriptions of conf intervals for binomial probabilities
Heather: great
now at the prompt type
binconf(40, 307)
me: PointEst Lower Upper
0.1302932 0.0971623 0.1725619
Heather: yay
so that there are your confidence intervals
10:32 AM 
me: sorry, I am a bit slow
Heather: no, that's fine
me: must have mistyped something before
Heather: easy to make typos etc in this way of doing it
no prob
me: so lower/upper look to be what you said, right?
Heather: so if you look at the popup from ?binconf
me: 9 and 17
Heather: you can see that alpha=0.05
the confidence is 1-alpha, so these are 95% confidence intervals
10:33 AM 
if you wanted to compute 99% confidence intervals
(which would be wider, right, because to be really sure of something
10:34 AM 
you need to include at least what you thought before, plus give more wiggle room for even stranger subsamples)
you would enter
binconf(40, 307, 0.01)
me: PointEst Lower Upper
0.1302932 0.0885323 0.1876962
Heather: but really you only care about 95% confidence intervals, so you can forget that part :)
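(To see the sample-size effect mentioned earlier, compare roughly the same 13% measured with different n; these particular counts are just an illustration:)
  binconf(4, 31)       # ~13% from a small sample: wide interval
  binconf(40, 307)     # the journal data: narrower
  binconf(400, 3070)   # ten times as much data: tighter still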
10:35 AM 
so you could write a text file that computes the confidence intervals for all of the percentages in your poster, for example
that would convey information that is useful for others reading it
me: ok
Heather: they could know, ok, it is pretty unlikely that the percentage is actually as low as 5%, given this evidence.....
10:36 AM 
so that is the first bit of useful stats you can do
if you were to write all of those binconf lines, for example
you could save them in a text file
and give it the extension .R
and then it would be called an R script
and you could run it either by copying and pasting the contents into the prompt
or by running the file at the command line
10:37 AM 
(or some other ways from the menus, I think)
it is useful to make and save these .R files
(especially with loading the dataset at the top)
10:38 AM 
because they document the stats that you ran
and make it very easy and transparent for other people to see what you did
make sense?
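(A rough sketch of what such a script might look like; the file name and the comments are placeholders, only the counts come from the poster:)
  # poster_stats.R -- confidence intervals for the poster percentages
  library(Hmisc)
  dat.raw = read.csv("journal_policies.csv", header = TRUE)   # load the dataset at the top
  binconf(40, 307)   # journals: request or require data archiving
  binconf(23, 53)    # funders: require data sharing in some way
  binconf(13, 53)    # funders: give directions on repository type
  # run the whole file from the prompt with: source("poster_stats.R")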
me: um, so when you say write out all the binconf lines, you mean write them in a text editor like emacs, right, and then load it into R
10:39 AM 
Heather: yes
me: if I save it as an R script
Heather: yes
basically you'd have a lot of lines of binconf, one for each percentage you calculate
me: so does R allow you to use # for comments
Heather: maybe a variant of this to make it easier to read
round(100*binconf(40, 307), 1)
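(With the values printed above, that variant would show the same interval as rounded percentages, something like:)
  round(100*binconf(40, 307), 1)
  #  PointEst Lower Upper
  #      13.0   9.7  17.3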
10:40 AM 
me: that way I could name them
ok good
Heather: I'm not necessarily suggesting that you do this
there are some better ways to do it more easily, perhaps,
but it might be worth doing it for a few to get the hang of it
and a gut feel for how "wide" your confidence intervals are for your different data sets,
10:41 AM 
given the questions that you asked and the differing sizes of your samples
me: ok so I should just explore a bit with these
Heather: yup
me: ok
10:42 AM 
Heather: let me give you a bit more to chew on before I run to my next meeting
try this
you get a long list of stuff?
me: yup
10:43 AM 
Heather: ok, that can be a good way to have a look at your data
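(The exact command didn't survive in the transcript; summary() or str() gives this kind of long per-column listing, assuming the dat.raw object from earlier:)
  summary(dat.raw)   # per-column summary: counts for text fields, quartiles for numbers
  str(dat.raw)       # compact view of each column's type and first few values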
me: whoa this is awesome
Heather: one of the things you will realize as you do that
is that some of your fields aren't coded for easy stat analysis yet
for example,
10:44 AM 
one of the columns has some random comments in it
which is useful, but you need a cleaner column than that if you want to look at a percentage, right?
10:45 AM 
one way you can get a cleaner column is to make one automatically.
me: right
Heather: for example, you could make a new column called "requested"
that is either TRUE or FALSE
me: so I tried to code these doing 0,1,2
Heather: ok.
let me show you one way to do it using text matches, in case that is helpful for this or something else....
me: creating a new column that said 0 no request 1 request 2 required ( for example)
10:46 AM 
Heather: two steps
first make a new column that is all false
dat.raw$requested = FALSE
then, change just some of them to true, the ones that contain the word "Requested"
dat.raw$requested[grep("+Requested+", dat.raw$] = TRUE
10:47 AM 
then when you type dat.raw$requested
you can see the cleaner column
this may not be ideal for this case, I agree a manual coding of 0,1,2 might be better
but maybe for pulling out various publishers, ISI categories, whatever
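(The column name got cut off in the lines above; a runnable version of the same two steps, with policy.comments as a made-up stand-in for the free-text column:)
  # step 1: start with every journal marked FALSE
  dat.raw$requested = FALSE
  # step 2: flip to TRUE wherever the text mentions "Requested"
  dat.raw$requested[grepl("Requested", dat.raw$policy.comments, fixed = TRUE)] = TRUE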
one more thing
once you have a binary column
10:48 AM 
= one that is just TRUE/FALSE or 1/0
you can do a nice plot to look at the odds of something
so the percentage time that it happens, in a subset of your data
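(One way to get at that, though not necessarily the plot Heather had in mind; publisher is a made-up grouping column:)
  mean(dat.raw$requested)                                       # overall proportion TRUE, i.e. the 13%
  by.pub = tapply(dat.raw$requested, dat.raw$publisher, mean)   # proportion within each publisher
  barplot(100*by.pub, ylab = "% requesting data archiving")     # compare subsets at a glance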
10:49 AM 
can you run the two dat.raw$requested lines above, first?
dat.raw$requested = FALSE
dat.raw$requested[grep("+Requested+", dat.raw$] = TRUE
me: I got this response-- Error in `$<-`(`*tmp*`, "requested", value = c(NA, NA, TRUE, : 
replacement has 34 rows, data has 39
10:50 AM 
I think I need to clean up the .csv file ?
Heather: Hmmmm I wonder if I cleaned up my data a bit more than you did
yeah, I'm guessing so
me: ok
10:52 AM 
Heather: hrm, I was trying to show you something else but out of time to get it to work
I have a 3 hr online meeting now
me: ick
Heather: then could be avail from 2-3 my time
or after 7 my time
or tomorrow.....
me: whatever works best for you
I am going to make sure these are more normalized
Heather: ok, let's play it by ear a bit
10:53 AM 
me: and then play around with what you gave me
Heather: yup, normalizing would help
me: I should be online most of the day though
Heather: great
me: I'll leave my chat open
Heather: send me email if/when you run into things that are confusing?
me: ok
Heather: ok.
looking forward to getting into more good stuff today and tomorrow :)
10:54 AM 
me: thanks for the help!