DataONE:Notebook/Summer 2010/2010/07/23

From OpenWetWare

 
12:07 PM
me: Hi Heather, whenever is convenient to chat let me know
 
Heather: Hi Nic!
  
I'm doing some stat analysis prepping for a talk with Sarah right now... but I have a few minutes
12:08 PM 
what is low-hanging fruit that we can cover so that you can keep going?
  
I think how to do subsets?
  
also, maybe I'll introduce you to boxplots
 
me: yes
 
Heather: ideally we would do analysis-planning but that might take longer :)
12:09 PM 
ok
  
have you done any database SQL stuff before, by chance?
 
me: yeah, I need to focus in on that
  
no I haven't
 
Heather: ok, no problem
  
if you had, there is an sql-type way to do it
  
so what are you trying and what isn't working?
12:10 PM 
me: what I am really confused about is how to represent required and request as different variables
  
not variables, but setting >1 to required
  
which corresponds with my code
 
Heather: are you confused about why we might want to do that? or how we do that? or both?
12:11 PM 
me: no I understand why, I just don't understand how
 
Heather: ok
  
so you have a variable called Policy.request...require.code right?
  
and it has 0 1 and 2s?
 
me: yes
12:12 PM 
Heather: and if you do
  
table(journdat$Policy.request...require.code)
  
you can see the relative numbers of 0s and 1s and 2s
 
me: yes
 
Heather: and if you do
  
dim(journdat)
  
you can see how big your whole "dataframe" is
  
307 rows, 54 columns
12:13 PM 
me: right
 
Heather: now you don't want all 307 rows, you just want the ones with a code of 1 or 2
  
so that would be 29+10 = 39, right
 
me: yes, but how to get rid of the 0's and not lose the 1's?
12:14 PM 
Heather: another way to look at journdat is
  
journdat[,]
  
the [] notation is "indexing into the matrix" that is journdat
  
journdat has two dimensions, rows and columns
12:15 PM 
and these are separated by a , inside the []s
  
if you just do journdat[,] you get everything
  
ie
  
dim(journdat[,])
  
but now let's say you want to be pickier
  
you just want the ones where code is >0
  
but you still want all the columns
12:16 PM 
so you make a "logical" that picks out the rows you want
  
like this
  
journdat$Policy.request...require.code >0
  
and then you put that inside the rows indexing
  
like this
  
dim(journdat[journdat$Policy.request...require.code >0,])
12:17 PM 
or instead of the dim you could do str like this
  
str(journdat[journdat$Policy.request...require.code >0,])
  
or
  
summary(journdat[journdat$Policy.request...require.code >0,])
  
or whatever
  
you can also assign this new subset thing to a new variable, like this
12:18 PM 
journdat.request.code.gt.0 = journdat[journdat$Policy.request...require.code >0,]
  
and then use "journdat.request.code.gt.0"
  
wherever you used to use journdat
  
in the summary etc
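
[Note: a minimal sketch of the row-subsetting steps above. The data frame `toy` and its values are made up to stand in for journdat, which is not reproduced here. One addition that is not in the chat: if the code column contains missing values, a bare logical index returns extra all-NA rows, so the sketch wraps the logical in which() to drop them.]

```r
# Made-up stand-in for journdat; the values are illustrative only.
toy <- data.frame(
  journal = c("A", "B", "C", "D", "E", "F"),
  Policy.request...require.code = c(0, 1, 2, 0, 1, NA)
)

# Logical vector: TRUE where the code is above 0, NA where the code is missing.
keep <- toy$Policy.request...require.code > 0

# Rows go before the comma, columns after; a blank column slot keeps all columns.
# which() turns the logical into row numbers and silently drops the NA entry.
toy.gt.0 <- toy[which(keep), ]

dim(toy.gt.0)
table(toy.gt.0$Policy.request...require.code)

# For comparison, the bare logical index keeps one extra row filled with NAs.
nrow(toy[keep, ])
```

With no missing codes, `toy[keep, ]` and `toy[which(keep), ]` give the same rows, so the form used in the chat is fine in that case.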
 
me: right but that gives us both required and requested right?
 
Heather: right
 
me: because it's everything over 0
 
Heather: right
12:19 PM 
and I was imagining that you'd want the dataset that includes both request and required... but
 
me: yes
  
that's what I don't understand
 
Heather: when looking at it in summary, you'd look at things that distinguished request from required
  
ok.
  
with your new dataframe you can do
12:20 PM 
table(journdat.request.code.gt.0$Policy.request...require.code)
  
and it just comes back with 1s and 2s as we expect
  
so now just make a request vs required column that codes it in terms of 0s and 1s
12:21 PM 
(the reason I think you want it in terms of 0s and 1s rather than 1s and 2s is that it makes the
  
summary plot easier to interpret, I think. I don't know this for sure. you could try the summary plot
12:22 PM 
using "Policy.request...require.code" before the ~ and the new dataframe at the end....
  
but here's what I'd do...)
12:23 PM 
journdat.request.code.gt.0$is.required = 0
  
then
  
journdat.request.code.gt.0$is.required[journdat.request.code.gt.0$Policy.request...require.code > 1] = 1
  
does that make sense?
  
then use "is.required" before the ~
  
where we first experimented with "simple.var"
12:24 PM 
(for what it is worth, you can also easily get the same thing by doing
  
journdat.request.code.gt.0$is.required = journdat.request.code.gt.0$Policy.request...require.code - 1
  
)
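
[Note: a minimal sketch of the recoding described above, run on a made-up subset. The data frame `toy.gt.0` is hypothetical and stands in for journdat.request.code.gt.0; it assumes the subset contains only codes 1 and 2. The as.integer() variant is an extra alternative, not something from the chat.]

```r
# Made-up stand-in for journdat.request.code.gt.0 (codes 1 and 2 only).
toy.gt.0 <- data.frame(
  journal = c("B", "C", "E", "G"),
  Policy.request...require.code = c(1, 2, 1, 2)
)

# Two-step version from the chat: start every row at 0,
# then set the rows with a code above 1 (required) to 1.
toy.gt.0$is.required <- 0
toy.gt.0$is.required[toy.gt.0$Policy.request...require.code > 1] <- 1

# Alternative: convert the logical comparison straight to 0/1.
is.required.alt <- as.integer(toy.gt.0$Policy.request...require.code == 2)

# Cross-check: code 1 should line up with 0, code 2 with 1.
table(code = toy.gt.0$Policy.request...require.code,
      is.required = toy.gt.0$is.required)
```

The subtraction shortcut (code minus 1) gives the same column only because this subset has no 0 codes; on the full data it would produce -1 for the "no policy" rows.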
12:25 PM 
have I totally confused you?
 
me: ok, I'm going to try this, I thought this was what I was doing, but I've gotten a little turned around
 
Heather: if so, that is fine.
  
is your code for this up? I didn't obviously see it
 
me: no no, I understand making new dataframes
 
Heather: maybe I didn't click through the right thing
12:26 PM 
me: When you enter: journdat.request.code.gt.0$is.required = 0
then
journdat.request.code.gt.0$is.required[journdat.request.code.gt.0$Policy.request...require.code > 1] = 1
12:27 PM 
doesn't that put requested and no request both back to 0?
12:28 PM 
Heather: hmmm, what do you mean by
  
"requested and no request"
12:29 PM 
me: all values below 2
 
Heather: yes
  
isn't that what you want?
  
then the is.required variable is 1 only for the datasets where sharing is required
  
and 0 when it is either no policy or merely requested
12:30 PM 
me: .... yes?
 
Heather: you're not sure if that is what you were wanting?
  
that's ok!
12:31 PM 
me: for some reason, I was trying to represent only requested with the 0's and only required with the 1's
 
Heather: oh I see.
  
and you were confused about the fact that it would in theory make the "no policies" be 0 also?
12:32 PM 
me: yes, although I didn't really stop and think, I just kept plugging...( I think I've already done the plots for only required)
12:33 PM 
Heather: gotcha. yeah, well I think that isn't really a problem in this case
  
because there are no "no policies" in this subset
  
because we got rid of them all :)
12:34 PM 
are you clear, or more fuzzy or ?
 
me: I think I'm clear I want to go back and make sure I understand... I'll post the code and plot on OWW
 
Heather: ok.
12:35 PM 
for what it is worth, I'm not 100% sure you actually want to do a lot with the subset
  
it depends on research questions
  
so if you get stuck on it, don't dwell, just keep plugging on other things
 
me: ok,,, so it might be better to spend my time looking at what I'm trying to get out of the stats
12:36 PM 
Heather: right and I want to show you another thing
 
me: ok
 
Heather: try this
  
boxplot(Impact.Factor ~ Policy.request...require.code, journdat)
12:37 PM 
do you get a plot?
 
me: yes
12:38 PM 
Heather: ok, so what you are seeing on the x axis is the three levels of your code variable
  
no policy, requests, required
 
me: right
  
and impact factor on the y
 
Heather: and on the y axis it is plotting the impact factor
  
right
  
have you seen boxplots before?
12:39 PM 
the dark line is the median
 
me: I have not
 
Heather: the "whiskers" show the range of most of the applicable datapoints
  
with a few outliers showing up as "o"s
  
and the box itself shows where the middle half of the data is
12:40 PM 
so looking at this
  
it says to me that the median (the typical) impact factor is higher as policies get stricter
  
though there isn't a whole lot of difference between levels 1 and 2
12:41 PM 
since their boxes mostly cover the same range of impact factors
  
this is useful beyond what we were looking at in the summary dot tables
  
because there we had collapsed policy request and required to be the same....
  
and here we can look at them individually
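
[Note: a minimal sketch of reading the boxplot numerically, on made-up data. The data frame `toy` and its values are hypothetical. In R's boxplot() the dark line in each box is the group median, so the sketch also computes the group means to show that the two can differ.]

```r
# Made-up stand-in for journdat; the values are illustrative only.
toy <- data.frame(
  Policy.request...require.code = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2),
  Impact.Factor = c(1, 2, 3, 10, 3, 4, 5, 4, 5, 9)
)

# Draw the plot; boxplot() also returns the numbers it plotted.
bp <- boxplot(Impact.Factor ~ Policy.request...require.code, data = toy)

# Row 3 of bp$stats holds the dark line (the median) for each group.
bp$stats[3, ]

# The same medians computed directly, next to the group means.
medians <- tapply(toy$Impact.Factor, toy$Policy.request...require.code, median)
means <- tapply(toy$Impact.Factor, toy$Policy.request...require.code, mean)
medians
means
```

In this toy data the "0" group has a median of 2.5 but a mean of 4, because the single large value pulls the mean up and leaves the median almost unchanged.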
12:42 PM 
does that make sense?
 
me: yes
 
Heather: cool
  
one more thing
12:43 PM 
that sort of plot (or others like it)
  
is useful when you want to see a continuous variable, like impact factor, across more than two categories (like code)
  
but you also have a lot of binary variables like is.Wiley
12:44 PM 
there are ways to show that info too
  
in a table:
  
table(journdat$is.Wiley, journdat$Policy.request...require.code)
  
or in a funky plot
  
plot(table(journdat$is.Wiley, journdat$Policy.request...require.code))
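
[Note: a minimal sketch of the two-way table on made-up data. The data frame `toy` is hypothetical. The prop.table() step is an addition, not from the chat; it turns the counts into row proportions, which can be easier to compare when the groups differ in size.]

```r
# Made-up stand-in for journdat; the values are illustrative only.
toy <- data.frame(
  is.Wiley = c(0, 0, 1, 1, 0, 1, 0, 0),
  Policy.request...require.code = c(0, 1, 0, 2, 2, 0, 0, 1)
)

# Counts of journals for each combination of publisher flag and policy code.
counts <- table(is.Wiley = toy$is.Wiley,
                code = toy$Policy.request...require.code)
counts

# Share of each policy code within each is.Wiley group (each row sums to 1).
shares <- prop.table(counts, margin = 1)
shares

# Plotting a two-way table draws a mosaic plot.
plot(counts)
```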
12:45 PM 
anyway, mostly here just wanted to expose you to the fact that there are ways we can analyze and plot
  
your code variables while keeping their three levels
  
I know we've collapsed them so far
  
but there are advantages to looking at all three distinct levels at the same time
  
so we'll try to do that too
  
make sense?
12:47 PM 
me: yes
 
Heather: cool
  
btw, did you get a chance to install Mondrian?
  
it is kind of picky about getting data in
 
me: I can't remember, if not I'll do so
 
Heather: but after that has some nice data viz opportunities
12:48 PM 
me: oh i did
 
Heather: yeah. play, and if you have lots of trouble loading in your data
  
then just get a few key columns, open them up in Excel, save as tab delimited and try to import
  
it is really picky about not having any blank cells, fyi
  
ok, if you have stuff to keep going on, maybe I'll leave it at that for a few hours while I go look at Sarah's stuff?
12:49 PM 
me: ok, yeah I still have the second half of that tutorial you sent me to look at
 
Heather: cool. don't get too hung up on the tutorial, it is without a doubt a hard read, beyond where you are at
12:50 PM 
me: yes, yes it is
 
Heather: if you can get mondrian going, though, that would be cool
 
me: so I load dataframes into mondrian?
 
Heather: it would be fiddly with getting data in. you can in theory save your dataset in R as a ".Rdata" file and then load directly into mondrian
  
but fiddly.
12:51 PM 
ok, bye for now
 
me: ok thanks
 
Heather: (mondrian has some good docs. mostly, try starting small, with just two clean variables maybe?
12:52 PM 
you can use
  
?select
  
nope I mean
  
?subset
  
to make a dataframe with just a few of your variables
  
(or this is another way to pick just a subset of the rows....)
  
ok, off now.......
12:53 PM 
oh, wait
  
do you know about c()
  
c is how you make a list of things (a vector, in R terms)
  
so if you wanted just two non-adjacent columns from your data you would say
12:54 PM 
small.dataframe = subset(journdat, select=c(is.Wiley, Policy.request...require.code))
  
dim(small.dataframe)
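
[Note: a minimal sketch of picking columns with subset() and c(), on made-up data. The data frame `toy` is hypothetical. The bracket form and the combined row-and-column call are shown as equivalents for comparison; they are additions, not from the chat.]

```r
# Made-up stand-in for journdat; the values are illustrative only.
toy <- data.frame(
  journal = c("A", "B", "C"),
  Impact.Factor = c(1.2, 3.4, 5.6),
  is.Wiley = c(0, 1, 0),
  Policy.request...require.code = c(0, 2, 1)
)

# Keep two non-adjacent columns; inside subset() the names are unquoted.
small.dataframe <- subset(toy, select = c(is.Wiley, Policy.request...require.code))
dim(small.dataframe)

# The same selection with [ , ] indexing; here the names must be quoted.
small.by.name <- toy[, c("is.Wiley", "Policy.request...require.code")]

# subset() can filter rows and pick columns in one call.
small.rows <- subset(toy, Policy.request...require.code > 0,
                     select = c(is.Wiley, Policy.request...require.code))
```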
  
ok?
12:55 PM 
me: ok
  
I'll try it
 
Heather: ok, cool. bye!
 