Wilke:Statistics errors: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
Line 32: Line 32:
</pre>
</pre>


The correlation is highly significant (''P''<2.2e-16) but weak (''r''=0.10). The variable ''x'' explains only 1% (''r''^2) of the variation in ''y''. In terms of the birds example, this could mean that while berry consumption is indeed related to mating success, but the relationship is so weak as to be virtually meaningless. (Knowing how many berries a given male bird ate tells me pretty much nothing about his specific mating success.)
The correlation is highly significant (''P''<2.2e-16) but weak (''r''=0.10). The variable ''x'' explains only 1% (that is the square of the correlation coefficient, ''r''^2) of the variation in ''y''. In terms of the birds example, this could mean that while berry consumption is indeed related to mating success, the relationship is so weak as to be virtually meaningless. (Knowing how many berries a given male bird ate tells me pretty much nothing about his specific mating success.)


Now let's aggregate the data into quantiles of ''x'' and plot the mean +- standard error of ''y'' within each quantile of ''x'':
Now let's aggregate the data into quantiles of ''x'' and plot the mean +- standard error of ''y'' within each quantile of ''x'':

Revision as of 18:54, 17 September 2012

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.
THE WILKE LAB

Home        Contact        People        Research        Publications        Materials

Common errors in statistical analyses

Aggregation by quantiles erroneously amplifies trend

In many situations, one would like to know how one quantitative variable relates to another. For example, we might be studying a certain bird and wondering whether the amount of a certain berry that male birds eat has an effect on the mating success of those males. The canonical (and correct) way to study such questions is via correlation analyses.

However, it is surprisingly common to see analyses where instead one of the variables is aggregated into quantiles (groups of equal size) and the second variable is presented as an average of the quantiles of the first variable. In the above example, we might classify birds into four groups (quartiles) by their berry consumption (lowest 25%, second lowest 25%, and so on) and then plot the mean mating success within each group as a function of the quartile of berry consumption. Such an analysis is misleading, because it erroneously amplifies any relationship that may exist between the two variables.

Let's illustrate this issue with some simulated data, using R.

First we generate two variables x and y, weakly correlated:

x = rnorm(n)
y = 0.1*x + 0.9*rnorm(n)
cor.test( x, y )

This is the output from the cor.test function:

        Pearson's product-moment correlation

data:  x and y 
t = 10.485, df = 9998, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.084862 0.123636 
sample estimates:
      cor 
0.1042886 

The correlation is highly significant (P<2.2e-16) but weak (r=0.10). The variable x explains only 1% (that is the square of the correlation coefficient, r^2) of the variation in y. In terms of the birds example, this could mean that while berry consumption is indeed related to mating success, the relationship is so weak as to be virtually meaningless. (Knowing how many berries a given male bird ate tells me pretty much nothing about his specific mating success.)

Now let's aggregate the data into quantiles of x and plot the mean +- standard error of y within each quantile of x:

# calculate to which quantile each x belongs
qn = 10 # number of quantiles
q = quantile(x, probs = seq(0, 1, 1/qn))
q[length(q)] = q[length(q)] + 1 # make sure the last quantile is larger than max x
quant.x = tapply(x, 1:n, (function(x) sum(x>=q)) )

# calculate means and SEs of y per quantile
require( Hmisc ) # for errbar plot
mean.quant = tapply( y, quant.x, mean )
SE.quant = tapply( y, quant.x, (function(x) sd(x)/sqrt(length(x))) )
errbar( 1:qn, mean.quant, mean.quant+SE.quant, mean.quant-SE.quant, xlab='quantiles(x)', ylab='mean(y) for quantile' )