DataONE:Notebook/Summer 2010/2010/07/19
From OpenWetWare
Heather: Hi Nic!
me: Hi Heather
Heather: Now a good time for talking R?
me: sure
Heather: kudos btw on getting the submission off on Friday
9:59 AM
me: do you want to use the desktop sharing application I have or just do it via chat?
Heather: maybe chat, and then if we do desktop sharing maybe I'll try Adobe Connect
me: oh no, thank you and Todd for your input. ok
Heather: a few other projects I am on use that, so it would probably behoove me to get more familiar with it
10:00 AM
Heather: you have R on the computer where you are?
me: we have this suite of GoTo for my program.. it's pretty awesome, very easy to use, but I think it's pricey
Heather: do you have a dataset you'd like to work with first?
me: yes
Heather: good to know. yeah, mostly pricey isn't my thing these days :)
me: I got the dataset into R using the command you gave me for dat.raw
Heather: online? want to point me to it? great
10:01 AM
Heather: let me pull that up. can you run all the example commands successfully, does it look like?
me: I couldn't get the graphs to appear
Heather: hrm, ok. are you on a mac?
me: when I tried to plot something nothing showed up. yes
10:02 AM
me: should I send you the .csv?
Heather: running the normal R program (as opposed to some variant like JGR or something)? nah, I got it.
me: R version 2.11.1
Heather: does it complain about not plotting, or does it just silently do nothing?
10:03 AM
me: I'm in the console. oh wow
Heather: I'm version 2.10.1. shouldn't matter much though
me: actually it just magically worked
Heather: oh wow? oh good. the magic of someone else watching, eh?
me: ha
10:04 AM
Heather: great. ok, do the plots and just make sure they all seem to work....
10:06 AM
me: the plots seem to work
Heather: great.
me: the abline did not
Heather: and they make a bit of sense? or do you want to ask what they are about? hrm, don't know why not
me: I think these make a bit of sense
Heather: oh well, that isn't very relevant for now, abline
me: abline is ?
10:07 AM
Heather: yeah, it plots a best fit line on top of the scatter plot. mostly it was just showing off some of what could be done
me: I've been reading a beginner's guide to R this morning, but not up to speed obviously
Heather: and giving a gut feel for regression
me: ok
Heather: so let's keep going then. if there is a problem with abline it will probably show up in something else too, so we'll just deal with it then.
10:08 AM
Heather: ok dokee. do you have thoughts about what stats you think you want to do? or, put another way, do you have ideas about hypotheses you want to test or explore with numbers?
10:09 AM
me: well, to be honest, I don't... I would like to know most about the journals: what kind of correlations we can make between requested or required sharing and impact metrics
Heather: ok. in that case, I think let's start with the numbers you pulled together for your poster, since those were obviously of interest :)
me: ok
Heather: and we'll move on from there (this morning, or otherwise soon). so... you calculated some percentages
10:10 AM
Heather: percentages are useful, but they lack information by themselves. they don't portray how confident we should be that the percentages would generalize, right? so we supplement them with an n= number
me: right
Heather: to sort of show how big the sample was that we used to calc the percentages
10:11 AM
Heather: obviously if n=100000 we think the percentages might be more valid/precise/reproducible/etc than if n=10
10:12 AM
Heather: a different way to think about that is that if n=100000, it is really unlikely that the percentage was actually radically different than what we measured and we just got what we measured by chance. whereas if n=10, it is totally possible that the mean is really different and we just got what we got by chance
10:13 AM
Heather: one way to use the n and the percentage to encapsulate this idea is to calculate confidence intervals around the percentages. have you heard of them?
Heather: do you understand them a lot, a little, not really at all?
10:14 AM
me: heard of them, do not know what they are though
Heather: ok, no prob. hrm, do you have one of your percentages from your poster handy?
me: sure, one sec
Heather: let me pull up your poster draft, and if you could do that too it would be great
10:15 AM
me: FINDINGS Funding Agencies (53 total evaluated): 44% (23 of 53) require data sharing in some way; 8% (4) specify the duration for which data must be preserved by the primary investigator; 25% (13) give directions on the type of repository to which data should be deposited (while only 2 name particular repositories); 8% (4) provide supplemental funds available for deposition of data; only one gives direction on how such data is to be cited.
Heather: right. great. or for journals, the dataset we have in R right now, "13% (40 of 307) request or require data to be archived;" so 13% is useful, but we want more info. we want to know, might it be 12%? 14%? what about 5%? 25%?
10:16 AM
Heather: how confident are we that the actual percentage, if we were able to measure the entire population that this sample came from, would be 13%? so one way we do this is to calculate "confidence intervals". the most common kind are 95% confidence intervals.
10:17 AM
Heather: if the confidence intervals were, say, 8% to 17%, then that would mean that given the data we have, our best guess is that if we were to repeat this experiment 100 times
10:18 AM
Heather: only 5% of the time would we estimate the percentage to be lower than 8% or higher than 17% just by chance/bad luck. this idea takes a while to sink in, so maybe I'll try to say it a few ways.
10:19 AM
Heather: we are 95% sure, given the sample size we used and the data we saw, that the real percentage value in the whole universe that we sampled from is between 8 and 17%. it is possible it is lower than that or higher than that. in fact it could be anything. but if it were actually lower than that or higher than that
10:20 AM
Heather: it would make the sample that we got really surprising, and extreme surprises don't happen very often
me: ok this makes sense
Heather: so our best guess is that the real percentage is close to what we got. great! given the same percentage,
10:21 AM
Heather: you'll find that estimates made with small sample sizes end up with wider confidence intervals
me: so just to make sure, is 95% arbitrary or is that a standard?
Heather: ie 3 to 25% or something. whereas large sample sizes end up with tight confidence intervals: 10 to 15% in this example. 95% is standard. it ties directly to the p<0.05 threshold that is also standard,
10:22 AM
Heather: that you've likely heard about
me: ok
Heather: they are mostly two ways of measuring the same thing: the amount of surprise/bad luck etc you'd have to have to get a sample that was so strange/disproportionate as to have given you a weird estimate, if the real actual value was something different
10:23 AM
Heather: the tightness of the confidence intervals also depends on what the percentage is. but no need to go into that now. for now let me just show you how to calculate them so you can explore :) if you do ?binconf in R, what happens?
me: ok
10:24 AM
me: No documentation for 'biconf' in specified packages and libraries: you could try '??biconf'
Heather: ok. if you do "library(Hmisc)" what happens?
10:25 AM
me: no package called hmisc
Heather: ok. do you have a "packages & data" menu at the top? go to package installer in that menu
me: ok
10:26 AM
Heather: then with CRAN (binaries) selected, search for Hmisc (caps matter), then "Get List"
me: 3.8-2?
Heather: do you see an Hmisc package? yup
me: I mean does the version matter?
Heather: highlight it, check "install dependencies", and click "install selected". no, version doesn't matter much now
10:27 AM
Heather: sometimes it can, but I don't think it will for what we are doing. does it look like it is installing?
me: it's done
Heather: great. then type library(Hmisc), then ?binconf
10:28 AM
me: ok, it's giving me a table with topic and package and description
Heather: really? hrm, at the prompt?
10:29 AM
Heather: well, try instead at the prompt typing binconf(40, 307)
me: can't find function biconf
10:30 AM
Heather: ok. when at the prompt you type library(Hmisc) what happens?
me: I get another prompt
Heather: good. then when you type ?binconf what happens?
10:31 AM
me: I get a pop up window that gives me descriptions of conf intervals for binomial probabilities
Heather: great. now at the prompt type binconf(40, 307)
me:
 PointEst     Lower     Upper
0.1302932 0.0971623 0.1725619
Heather: yay. so that there are your confidence intervals
10:32 AM
me: sorry I am a bit slow
Heather: no, that's fine
me: must have mistyped something before
Heather: easy to make typos etc in this way of doing it. no prob
me: so lower upper look to be what you said right?
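To get a gut feel for the earlier point that the same percentage gives a wider interval when the sample is small, here is a minimal sketch in base R. It uses prop.test() with correct = FALSE as a stand-in for binconf (an assumption of this sketch, not something used in the chat; for 40 of 307 it gives the same Wilson interval as the binconf output above). The three sample sizes are made up for illustration.

```r
# Hypothetical sample sizes; each one has exactly 10% "successes".
n <- c(30, 300, 3000)
x <- n / 10

# 95% Wilson score interval for each (x, n) pair, no continuity correction.
ci <- t(mapply(function(xi, ni) prop.test(xi, ni, correct = FALSE)$conf.int, x, n))

# Express the limits as percentages, one row per sample size.
result <- data.frame(
  n         = n,
  lower.pct = round(100 * ci[, 1], 1),
  upper.pct = round(100 * ci[, 2], 1)
)
result$width.pct <- result$upper.pct - result$lower.pct
print(result)
```

The width column should shrink as n grows, even though the point estimate is 10% in every row.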
Heather: so if you look at the popup from ?binconf
me: 9 and 17
Heather: you can see that alpha=0.05. the confidence is 1-alpha, so these are 95% confidence intervals
10:33 AM
Heather: if you wanted to compute 99% confidence intervals (which would be wider, right, because to be really sure of something
10:34 AM
Heather: you need to include at least what you thought before, plus give more wiggle room for even stranger subsamples) you would enter binconf(40, 307, 0.01)
me:
 PointEst     Lower     Upper
0.1302932 0.0885323 0.1876962
Heather: but really you only care about 95% confidence intervals, so you can forget that part :) yup
10:35 AM
Heather: so you could write a text file that computes the confidence intervals for all of the percentages in your poster, for example. that would convey information that is useful for others reading it
me: ok
Heather: they could know, ok, it is pretty unlikely that the percentage is actually as low as 5%, given this evidence..... etc
10:36 AM
Heather: so that is the first bit of useful stats you can do. if you were to write all of those binconf lines, for example, you could save them in a text file and give it the extension .R, and then it would be called an R script, and you could run it either by copying and pasting the contents into the prompt or by typing
10:37 AM
Heather: source("yourfilename.R") at the command line (or some other ways from the menus, I think). it is useful to make and save these .R files (especially with loading the dataset at the top)
10:38 AM
Heather: because they document the stats that you ran and make it very easy and transparent for other people to see what you did. make sense?
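A minimal sketch of what such a script might look like, using the counts quoted from the poster draft earlier in this chat. So that it runs without installing anything, it defines a small helper, wilson_ci (a hypothetical name, not part of Hmisc), which computes the Wilson score interval. That is the method binconf uses by default, and for 40 of 307 the helper gives the same numbers as the binconf output above. With Hmisc loaded, binconf(x, n) could be used in its place.

```r
# Wilson score interval for x successes out of n trials.
# Stand-in for Hmisc::binconf(x, n, alpha) so the script needs no packages.
wilson_ci <- function(x, n, alpha = 0.05) {
  p <- x / n
  z <- qnorm(1 - alpha / 2)
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(PointEst = p, Lower = centre - half, Upper = centre + half)
}

# Counts quoted from the poster draft (x out of n).
poster <- data.frame(
  finding = c("funders: require sharing",
              "funders: specify duration",
              "funders: repository type",
              "journals: request or require archiving"),
  x = c(23, 4, 13, 40),
  n = c(53, 53, 53, 307)
)

# One row per finding: point estimate and 95% limits, as percentages.
ci <- t(mapply(wilson_ci, poster$x, poster$n))
result <- cbind(poster, round(100 * ci, 1))
print(result)

# A 99% interval for the journals figure, for comparison.
print(round(100 * wilson_ci(40, 307, alpha = 0.01), 1))
```

Saved under a name such as poster_ci.R (hypothetical), it could be run with source("poster_ci.R"); note that the file name has to be quoted. As an aside, 23 of 53 works out to 43.4%, so the table will not show exactly the 44% quoted from the poster.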
me: um, so when you say write out all the binconf, you mean write them in a text editor like emacs, right, and then load it into R
10:39 AM
Heather: yes
me: if I save it as an R script. ok
Heather: yes. basically you'd have a lot of lines of binconf, one for each percentage you calculate
me: so does R allow you to use # for comments
Heather: maybe a variant of this to make it easier to read: round(100*binconf(40, 307), 1). yes
10:40 AM
me: that way I could name them. ok good
Heather: I'm not necessarily suggesting that you do this. there are some better ways to do it more easily, perhaps, but it might be worth doing it for a few to get the hang of it and a gut feel for how "wide" your confidence intervals are for your different data sets,
10:41 AM
Heather: given the questions that you asked and the differing sizes of your samples
me: ok so I should just explore a bit with these
Heather: yup
me: ok
10:42 AM
Heather: let me give you a bit more to chew on before I run to my next meeting. try this: Hmisc::describe(dat.raw). you get a long list of stuff?
me: yup
10:43 AM
Heather: ok, that can be a good way to have a look at your data
me: whoa this is awesome
Heather: one of the things you will realize as you do that is that some of your fields aren't coded for easy stat analysis yet. for example,
10:44 AM
Heather: dat.raw$Policy.Requested.or.Required.Authors.to.share.data..for.publication has some random comments in it, which is useful, but you need a cleaner column than that if you want to look at a percentage, right?
10:45 AM
Heather: one way you can get a cleaner column is to make one automatically.
me: right
Heather: for example, you could make a new column called "requested" that is either TRUE or FALSE
me: so I tried to code these doing 0,1,2
Heather: ok. let me show you one way to do it using text matches, in case that is helpful for this or something else....
me: creating a new column that said 0 no request, 1 request, 2 required (for example)
10:46 AM
Heather: two steps. first make a new column that is all false:
dat.raw$requested = FALSE
then, change just some of them to true, the ones that contain the word "Requested":
dat.raw$requested[grep("+Requested+", dat.raw$Policy.Requested.or.Required.Authors.to.share.data..for.publication)] = TRUE
10:47 AM
Heather: then when you type dat.raw$requested you can see the cleaner column. this may not be ideal for this case, I agree. a manual coding of 0,1,2 might be better. but maybe for pulling out various publishers, ISI categories, whatever. one more thing: once you have a binary column
10:48 AM
Heather: = one that is just TRUE/FALSE or 1/0, you can do a nice plot to look at the odds of something, so the percentage of the time that it happens, in a subset of your data
10:49 AM
Heather: can you run the two dat.raw$requested lines above, first? so
dat.raw$requested = FALSE
then
dat.raw$requested[grep("+Requested+", dat.raw$Policy.Requested.or.Required.Authors.to.share.data..for.publication)] = TRUE
me: I got this response-- Error in `$<-.data.frame`(`*tmp*`, "requested", value = c(NA, NA, TRUE, : replacement has 34 rows, data has 39
10:50 AM
me: I think I need to clean up the .csv file ?
Heather: Hmmmm, I wonder if I cleaned up my data a bit more than you did. yeah, I'm guessing so
me: ok
10:52 AM
Heather: hrm, I was trying to show you something else but out of time to get it to work. I have a 3 hr online meeting now
me: ick
Heather: then could be avail from 2-3 my time or after 7 my time or tomorrow.....
me: whatever works best for you. I am going to make sure these are more normalized
Heather: ok, let's play it by ear a bit
10:53 AM
me: and then play around with what you gave me
Heather: yup, normalizing would help
me: I should be online most of the day though
Heather: great
me: I'll leave my chat open
Heather: send me email if/when you run into things that are confusing?
me: ok
Heather: ok.
Heather: looking forward to getting into more good stuff today and tomorrow :) bye!
10:54 AM
me: thanks for the help! bye
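A sketch of the text-match idea from this chat, written so the new column is always as long as the data frame. The data frame dat and its policy column are toy stand-ins (assumptions of this sketch) for dat.raw and its long policy column. One plausible reading of the error above, which the chat does not confirm, is that the requested column did not yet exist when the indexed assignment ran, so R built a vector only as long as the last matching row. grepl() sidesteps that by returning one TRUE/FALSE per row. The + characters in the original pattern are regex quantifiers rather than wildcards, so this sketch drops them and matches the plain substring.

```r
# Toy stand-in for dat.raw; the real column is
# Policy.Requested.or.Required.Authors.to.share.data..for.publication
dat <- data.frame(
  policy = c("Requested",
             "Required (see author guidelines)",
             "None",
             "Requested; data to be archived",
             NA),
  stringsAsFactors = FALSE
)

# grepl() gives one logical value per row (FALSE for NA entries),
# so the assignment cannot come up short.
dat$requested <- grepl("Requested", dat$policy, fixed = TRUE)
dat$required  <- grepl("Required",  dat$policy, fixed = TRUE)

# The 0/1/2 coding from the chat: 0 = no request, 1 = requested, 2 = required.
dat$policy.code <- ifelse(dat$required, 2, ifelse(dat$requested, 1, 0))

print(dat)

# Proportion of rows with any request or requirement.
print(mean(dat$policy.code > 0))
```

The count behind that last proportion (sum(dat$policy.code > 0) out of nrow(dat)) is the pair of numbers that would go into binconf.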