DataONE:Notebook/Data Citation and Sharing Policy/2010/07/27


Cleaner Analysis

  • Nic Weber 14:13, 27 July 2010 (EDT): glm with relevel, to discuss with Heather:
> filename = "/Users/nicholasweber/Desktop/JournalData1.csv"
> mydata = read.csv(filename)
> ImFa = Impact.Factor
> ImFa[ImFa==0] = NA
> hist(ImFa)
> summary(ImFa)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.064   1.000   1.578   2.132   2.762  16.690   6.000 
> SomeOA = ifelse(Subscription.Model == "Sub", 0, 1)
> table(SomeOA)
SomeOA
  0   1 
236  71 
> Afil = ifelse(Affiliation.Code > 0, 1, 0) 
> table(Afil)
Afil
  0   1 
148 158 
> table(PubCode)
PubCode
   other elsevier springer    wiley 
     149       52       58       48 
> PubCode = relevel(PubCode, ref="other")
> is.EnvSci = rep(0, length(ISI.Category))
> is.EnvSci[grep("*Environmental Sciences*", ISI.Category)] = 1 
> table(is.EnvSci)
is.EnvSci
  0   1 
143 164 
> is.Eco = rep(0, length(ISI.Category))
> is.Eco[grep("*Ecology*", ISI.Category)] = 1
> table(is.Eco)
is.Eco
  0   1 
181 126 
> is.EvoBio = rep(0, length(ISI.Category))
> is.EvoBio[grep("*Evolutionary Biology*", ISI.Category)] =1
> table(is.EvoBio)
is.EvoBio
  0   1 
267  40 
> 
> 
> 
> mylogit = glm(requests~log(ImFa)+ Afil+ PubCode+ is.Eco+ is.EnvSci+ is.EvoBio, family=binomial(link="logit"), na.action=na.omit) ## log-transform reduces the right skew of IF
> summary(mylogit)

Call:
glm(formula = requests ~ log(ImFa) + Afil + PubCode + is.Eco + 
    is.EnvSci + is.EvoBio, family = binomial(link = "logit"), 
    na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0326  -0.5140  -0.3973  -0.3177   2.5392  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -2.8726     0.7481  -3.840 0.000123 ***
log(ImFa)         0.2252     0.2738   0.822 0.410910    
Afil              0.4374     0.4100   1.067 0.285966    
PubCodeelsevier   1.4796     0.4888   3.027 0.002472 ** 
PubCodespringer  -0.3144     0.6995  -0.450 0.653050    
PubCodewiley      0.9403     0.5256   1.789 0.073634 .  
is.Eco           -0.2317     0.5807  -0.399 0.689973    
is.EnvSci         0.2959     0.6399   0.462 0.643746    
is.EvoBio        -0.1708     0.7475  -0.228 0.819299    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 224.11  on 299  degrees of freedom
Residual deviance: 205.36  on 291  degrees of freedom
  (7 observations deleted due to missingness)
AIC: 223.36

Number of Fisher Scoring iterations: 5

> confint(mylogit)
Waiting for profiling to be done...
                     2.5 %     97.5 %
(Intercept)     -4.3924248 -1.4488793
log(ImFa)       -0.3014913  0.7736730
Afil            -0.3572635  1.2583180
PubCodeelsevier  0.5290583  2.4588049
PubCodespringer -1.8708292  0.9588558
PubCodewiley    -0.1174621  1.9669483
is.Eco          -1.4100069  0.8765251
is.EnvSci       -0.9617093  1.5543530
is.EvoBio       -1.7375732  1.2369173
> exp(mylogit$coefficients)
    (Intercept)       log(ImFa)            Afil PubCodeelsevier PubCodespringer    PubCodewiley          is.Eco       is.EnvSci 
     0.05654958      1.25255157      1.54872058      4.39102729      0.73020193      2.56077481      0.79322294      1.34440052 
      is.EvoBio 
     0.84301868 
> exp(confint(mylogit))
Waiting for profiling to be done...
                     2.5 %     97.5 %
(Intercept)     0.01237070  0.2348333
log(ImFa)       0.73971425  2.1677138
Afil            0.69958816  3.5194967
PubCodeelsevier 1.69733323 11.6908318
PubCodespringer 0.15399592  2.6087098
PubCodewiley    0.88917420  7.1488273
is.Eco          0.24414160  2.4025365
is.EnvSci       0.38223896  4.7320241
is.EvoBio       0.17594686  3.4449772
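
Note: the exp(coefficients) and exp(confint) outputs above can be read side by side more easily as a single odds-ratio table. The following is a minimal sketch, not the analysis itself: it uses synthetic stand-in data (the `toy` data frame and its effect sizes are hypothetical), and it uses Wald intervals from `confint.default()` rather than the profile-likelihood intervals that `confint()` reports above, so the bounds would differ slightly on the real data.

```r
# Synthetic stand-in data (hypothetical); the real JournalData1.csv is not used here.
set.seed(1)
n <- 300
toy <- data.frame(
  ImFa = exp(rnorm(n, mean = 0.5, sd = 0.7)),  # right-skewed, like an impact factor
  Afil = rbinom(n, 1, 0.5)                     # 0/1 society affiliation flag
)
toy$requests <- rbinom(n, 1, plogis(-2.5 + 0.8 * log(toy$ImFa) + 0.6 * toy$Afil))

fit <- glm(requests ~ log(ImFa) + Afil, data = toy,
           family = binomial(link = "logit"))

# One table on the odds scale: odds ratio plus Wald 95% bounds per term.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 3))
```

A term whose interval on this scale excludes 1 corresponds to a log-odds interval that excludes 0.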

  • Nic Weber 13:37, 27 July 2010 (EDT): I am going to attempt to clean up some of the last post. The code that I've just uploaded is embedded below. The changes include a summary for the Impact Factor category, and I broke out the categories from the PubCode column in my dataset into individual tables. I neglected to properly include all of the publisher categories in the last code, so the stats have also changed. I will include those below as well:

  • This code changed my stats, and the Other category is no longer statistically significant. I will include the Other category (coded as OthPub) just for comparison.
    • The coefficients, including the p-values for Impact Factor and Society Affiliation:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.55278    1.23451  -3.688 0.000226 ***
log(ImFa)    1.04003    0.29007   3.585 0.000337 ***
Afil         1.06761    0.46302   2.306 0.021125 *  
OthPub       1.60747    1.09173   1.472 0.140913    
    • Confidence intervals for the coefficients:
                 2.5 %     97.5 %
(Intercept) -7.6716071 -2.4547973
log(ImFa)    0.4998151  1.6443777
Afil         0.1929881  2.0227830
OthPub      -0.1593362  4.5613276
    • Exponentiated coefficients:
(Intercept)   log(ImFa)        Afil      OthPub 
 0.01053787  2.82930675  2.90840677  4.99014645 
    • Exponentiated confidence intervals:
                   2.5 %     97.5 %
log(ImFa)   1.6484164883  5.1777868
Afil        1.2128683481  7.5593332
OthPub      0.8527096723 95.7104652
    • Full Stats:
> filename = "/Users/nicholasweber/Desktop/JournalData1.csv"
> mydata = read.csv(filename) 
> ImFa = Impact.Factor
> ImFa[ImFa==0] = NA
> hist(ImFa)
> summary(ImFa)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.064   1.000   1.578   2.132   2.762  16.690   6.000 
> SomeOA = ifelse(Subscription.Model == "Sub", 0, 1)
> table(SomeOA)
SomeOA
  0   1 
223  84 
> Afil = ifelse(Affiliation.Code > 0, 1, 0) # Society Affiliation
> table(Afil)
Afil
  0   1 
148 158 
> is.EnvSci = rep(0, length(ISI.Category))
> is.EnvSci[grep("*Environmental Sciences*", ISI.Category)] = 1 
> table(is.EnvSci)
is.EnvSci
  0   1 
143 164 
> is.Eco = rep(0, length(ISI.Category))
> is.Eco[grep("*Ecology*", ISI.Category)] = 1
> table(is.Eco)
is.Eco
  0   1 
181 126 
> is.EvoBio = rep(0, length(ISI.Category))
> is.EvoBio[grep("*Evolutionary Biology*", ISI.Category)] =1
> table(is.EvoBio)
is.EvoBio
  0   1 
267  40 
> 
> Springer = rep(0, length(PubCode))
> Springer [grep("*springer*", PubCode)] =1
> table(Springer)
Springer
  0   1 
249  58 
> Elsevier = rep(0, length(PubCode))
> Elsevier [grep("*elsevier*", PubCode)] =1
> table(Elsevier) # output was not captured in this session
> Wiley = rep(0, length(PubCode))
> Wiley [grep("*wiley*", PubCode)] =1
> table(Wiley)
Wiley
  0   1 
259  48 
> OthPub = rep(0, length(PubCode))
> OthPub [grep("*other*", PubCode)] =1
> table(OthPub) #Includes all other publishers from dataset
OthPub
  0   1 
182 125 
> 
> mylogit = glm(requests~log(ImFa) + SomeOA+ Afil+ Elsevier+ Springer+ Wiley+ OthPub+ is.Eco+ is.EnvSci + is.EvoBio, family=binomial(link="logit"), na.action=na.omit) ## log-transform reduces the right skew of IF
> summary(mylogit)

Call:
glm(formula = requests ~ log(ImFa) + SomeOA + Afil + Elsevier + 
    Springer + Wiley + OthPub + is.Eco + is.EnvSci + is.EvoBio, 
    family = binomial(link = "logit"), na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5987  -0.5199  -0.3057  -0.1653   2.9759  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.55278    1.23451  -3.688 0.000226 ***
log(ImFa)    1.04003    0.29007   3.585 0.000337 ***
SomeOA      -0.02429    0.43966  -0.055 0.955949    
Afil         1.06761    0.46302   2.306 0.021125 *  
Elsevier     0.07862    1.20986   0.065 0.948188    
Springer    -0.54932    1.45679  -0.377 0.706117    
Wiley        1.19822    1.14467   1.047 0.295199    
OthPub       1.60747    1.09173   1.472 0.140913    
is.Eco      -0.33555    0.57484  -0.584 0.559403    
is.EnvSci    0.46216    0.65644   0.704 0.481416    
is.EvoBio    0.75327    0.67710   1.112 0.265925    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 224.11  on 299  degrees of freedom
Residual deviance: 179.30  on 289  degrees of freedom
  (7 observations deleted due to missingness)
AIC: 201.3

Number of Fisher Scoring iterations: 6

> confint(mylogit)
Waiting for profiling to be done...
                 2.5 %     97.5 %
(Intercept) -7.6716071 -2.4547973
log(ImFa)    0.4998151  1.6443777
SomeOA      -0.9051808  0.8301939
Afil         0.1929881  2.0227830
Elsevier    -2.1005325  3.1452976
Springer    -3.8359538  2.7374242
Wiley       -0.7306740  4.2048215
OthPub      -0.1593362  4.5613276
is.Eco      -1.4946319  0.7684235
is.EnvSci   -0.8165620  1.7687779
is.EvoBio   -0.5939997  2.0820360
> exp(mylogit$coefficients)
(Intercept)   log(ImFa)      SomeOA        Afil    Elsevier    Springer       Wiley      OthPub      is.Eco   is.EnvSci   is.EvoBio 
 0.01053787  2.82930675  0.97600675  2.90840677  1.08179389  0.57734165  3.31422494  4.99014645  0.71494624  1.58749158  2.12394266 
> exp(confint(mylogit)) # conf int for exp
Waiting for profiling to be done...
                   2.5 %     97.5 %
(Intercept) 0.0004658685  0.0858806
log(ImFa)   1.6484164883  5.1777868
SomeOA      0.4044687314  2.2937636
Afil        1.2128683481  7.5593332
Elsevier    0.1223912330 23.2265858
Springer    0.0215807450 15.4471456
Wiley       0.4815843005 67.0086336
OthPub      0.8527096723 95.7104652
is.Eco      0.2243311619  2.1563641
is.EnvSci   0.4419484746  5.8636828
is.EvoBio   0.5521145557  8.0207830
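
Note: the publisher coefficients in this entry and in the 12:16 entry look different, but the numbers suggest the same fit under a different baseline. Both report a residual deviance of 179.30 on 289 degrees of freedom, and OthPub here (1.60747) is approximately PubCodeother minus PubCodetaylor there (1.52884 - (-0.07862)). That pattern is what a change of reference category produces. The sketch below illustrates this on synthetic data; the variable names `pub`, `pub_tay`, `true_eff` and the effect sizes are hypothetical, not taken from the dataset.

```r
# Synthetic publisher factor and 0/1 outcome (hypothetical stand-ins).
set.seed(2)
n <- 400
levels5 <- c("elsevier", "other", "springer", "taylor", "wiley")
pub <- factor(sample(levels5, n, replace = TRUE))
true_eff <- c(elsevier = 0, other = 1, springer = -0.5, taylor = 0.2, wiley = 0.8)
y <- rbinom(n, 1, plogis(-1.5 + true_eff[as.character(pub)]))

# Same factor, two baselines: alphabetical default (elsevier) vs. taylor.
pub_tay <- relevel(pub, ref = "taylor")
fit_els <- glm(y ~ pub,     family = binomial(link = "logit"))
fit_tay <- glm(y ~ pub_tay, family = binomial(link = "logit"))

# The overall fit is unchanged by the choice of baseline...
same_fit <- isTRUE(all.equal(deviance(fit_els), deviance(fit_tay)))
print(same_fit)

# ...but "other" is now measured against taylor instead of elsevier.
other_vs_taylor <- unname(coef(fit_els)["pubother"] - coef(fit_els)["pubtaylor"])
print(c(direct  = unname(coef(fit_tay)["pub_tayother"]),
        derived = other_vs_taylor))
```

Under this reading, whether "other" is significant depends on which publisher it is compared against, so the baseline should be stated alongside any p-value for a publisher term.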
  • Nic Weber 12:16, 27 July 2010 (EDT): The entry below contains mistakes. As of 11:15 am CST I am cleaning up the code for publishers; another post will follow.

(Posted at approx. 10:30 am CST.) Today I cleaned up some of my code from yesterday and updated my public dataset to reflect the changes in the Publisher Code column, along with some cleaning in the Subscription Model column.

  • To begin, I ran the following code:

  • This gave me the following significant results (full results from R below):
    • Coefficients and p-values for Impact Factor, Society Affiliation, and PubCodeother (all publishers other than Wiley, Springer, Elsevier and Taylor Francis Ltd.):
Coefficients:
                Estimate Std. Error z value Pr(>|z|) 
log(ImFa)        1.04003    0.29007   3.585 0.000337 ***
Afil             1.06761    0.46302   2.306 0.021125 *  
PubCodeother     1.52884    0.71422   2.141 0.032309 * 
    • With confidence intervals of :
                     2.5 %    97.5 %
log(ImFa)        0.4998151 1.6443777
Afil             0.1929881 2.0227830
PubCodeother     0.2325047 3.1117025
    • And exp of:
   log(ImFa)         Afil PubCodeother 
  2.82930675   2.90840677   4.61284400 
    • With exp confidence intervals of:
                     2.5 %     97.5 %
log(ImFa)       1.648416488  5.1777868
Afil            1.212868348  7.5593332
PubCodeother    1.261756334 22.4592496
    • Below are the full Results for context
> summary(mylogit)

Call:
glm(formula = requests ~ log(ImFa) + SomeOA + Afil + PubCode + 
    is.Eco + is.EnvSci + is.EvoBio, family = binomial(link = "logit"), 
    na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5987  -0.5199  -0.3057  -0.1653   2.9759  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -4.47416    0.94554  -4.732 2.22e-06 ***
log(ImFa)        1.04003    0.29007   3.585 0.000337 ***
SomeOA          -0.02429    0.43966  -0.055 0.955949    
Afil             1.06761    0.46302   2.306 0.021125 *  
PubCodeother     1.52884    0.71422   2.141 0.032309 *  
PubCodespringer -0.62794    1.19405  -0.526 0.598963    
PubCodetaylor   -0.07862    1.20986  -0.065 0.948188    
PubCodewiley     1.11960    0.76456   1.464 0.143093    
is.Eco          -0.33555    0.57484  -0.584 0.559403    
is.EnvSci        0.46216    0.65644   0.704 0.481416    
is.EvoBio        0.75327    0.67710   1.112 0.265925    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 224.11  on 299  degrees of freedom
Residual deviance: 179.30  on 289  degrees of freedom
  (7 observations deleted due to missingness)
AIC: 201.3

Number of Fisher Scoring iterations: 6

> confint(mylogit)
Waiting for profiling to be done...
                     2.5 %     97.5 %
(Intercept)     -6.4901304 -2.7394727
log(ImFa)        0.4998151  1.6443777
SomeOA          -0.9051808  0.8301939
Afil             0.1929881  2.0227830
PubCodeother     0.2325047  3.1117025
PubCodespringer -3.6768695  1.5175314
PubCodetaylor   -3.1452976  2.1005325
PubCodewiley    -0.3119753  2.7709763
is.Eco          -1.4946319  0.7684235
is.EnvSci       -0.8165620  1.7687779
is.EvoBio       -0.5939997  2.0820360
> exp(mylogit$coefficients)
    (Intercept)       log(ImFa)          SomeOA            Afil    PubCodeother PubCodespringer   PubCodetaylor    PubCodewiley 
     0.01139980      2.82930675      0.97600675      2.90840677      4.61284400      0.53368914      0.92439051      3.06363807 
         is.Eco       is.EnvSci       is.EvoBio 
     0.71494624      1.58749158      2.12394266 
> exp(confint(mylogit)) # conf int for exp
Waiting for profiling to be done...
                      2.5 %     97.5 %
(Intercept)     0.001518351  0.0646044
log(ImFa)       1.648416488  5.1777868
SomeOA          0.404468731  2.2937636
Afil            1.212868348  7.5593332
PubCodeother    1.261756334 22.4592496
PubCodespringer 0.025302058  4.5609519
PubCodetaylor   0.043054111  8.1705199
PubCodewiley    0.731999586 15.9742218
is.Eco          0.224331162  2.1563641
is.EnvSci       0.441948475  5.8636828
is.EvoBio       0.552114556  8.0207830 
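
Note: because Impact Factor enters the model as log(ImFa), exp of its coefficient is the odds ratio per e-fold (about 2.72x) increase in Impact Factor, which is awkward to describe. A per-doubling odds ratio is easier to read. The arithmetic below is a sketch using the coefficient and interval exactly as reported in this entry; the variable names are hypothetical, and the inputs inherit whatever mistakes this entry contains.

```r
# Values copied from the summary and confint output in this entry.
b_logIF  <- 1.04003                   # coefficient on log(ImFa)
ci_logIF <- c(0.4998151, 1.6443777)   # 95% interval for that coefficient

# Odds ratio per e-fold increase in Impact Factor (what exp() reports).
or_per_e <- exp(b_logIF)

# Odds ratio per doubling: exp(b * log(2)), which equals 2^b.
or_double    <- 2^b_logIF
ci_or_double <- 2^ci_logIF

print(c(per_e_fold = or_per_e, per_doubling = or_double))
print(ci_or_double)
```

With these inputs, doubling a journal's Impact Factor roughly doubles the estimated odds of the outcome.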

