User:Timothee Flutre/Notebook/Postdoc/2011/06/28
From OpenWetWare
Project name | Main project page Next entry |
Linear regression by ordinary least squares
In vector-matrix notation: with and B^{T} = [μβ] The parameters of the model are: Θ = {μ,β,σ}
The well-known ordinary-least-square (OLS) estimator of B is:
Let's now define 4 summary statistics, very easy to compute:
This allows to obtain the estimate of the effect size only by having the summary statistics available:
By the way, in this case (i.e. simple linear regression, a single predictor), it's easy to see that:
The same works for the estimate of the standard deviation of the errors:
We can also benefit from this for the standard error of the parameters:
which corresponds to:
V(y) = V(μ + βg + ε) = V(μ) + V(βg) + V(ε) = β^{2}V(g) + σ^{2} The most intuitive way to simulate data is therefore to fix the proportion of variance in y explained by the genotype, for instance PVE = 60%, as well as the standard deviation of the errors, typically σ = 1. From this, we can calculate the corresponding effect size β of the genotype:
Therefore: Note that g is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency f (eg. f = 0.3) and we will assume Hardy-Weinberg equilibrium. Then g is distributed according to a binomial distribution with 2 trials for which the probability of success is f. As a consequence, its variance is V(g) = 2f(1 − f). Here is some R code implementing all this: set.seed(1859) N <- 100 # sample size mu <- 4 pve <- 0.6 sigma <- 1 maf <- 0.3 # minor allele frequency beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.89 g <- rbinom(n=N, size=2, prob=maf) # assuming Hardy-Weinberg equilibrium y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ g) summary(ols) # muhat=4.1+-0.13, betahat=1.6+-0.16, R2=0.49 sqrt((1/(N-2) * sum(ols$residuals^2))) # sigmahat=0.99 plot(x=0, type="n", xlim=range(g), ylim=range(y), xlab="genotypes (allele counts)", ylab="phenotypes", main="Simple linear regression") for(i in unique(g)) points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
As above, we want , and . To efficiently get them, we start with the singular value decomposition of X: X = UDV^{T} This allows us to get the Moore-Penrose pseudoinverse matrix of X: X^{ + } = (X^{T}X)^{ − 1}X^{T} = VD^{ − 1}U^{T} From this, we get the OLS estimate of the effect sizes:
Then it's straightforward to get the residuals:
With them we can calculate the estimate of the error variance:
And finally the standard errors of the estimates of the effect sizes:
We can check this with some R code: ## simulate the data set.seed(1859) N <- 100 mu <- 4 pve.g <- 0.4 # genotype pve.c <- 0.2 # other covariate, eg. gender sigma <- 1 maf <- 0.3 sex.ratio <- 0.5 beta.g <- sigma * sqrt((1 / (2 * maf * (1 - maf))) * (pve.g / (1 - pve.g - pve.c))) # 1.543 beta.c <- beta.g * sqrt((pve.c / pve.g) * (2 * maf * (1 - maf) / sex.ratio * (1 - sex.ratio))) # 0.707 x.g <- rbinom(n=N, size=2, prob=maf) x.c <- rbinom(n=N, size=1, prob=sex.ratio) y <- mu + beta.g * x.g + beta.c * x.c + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ x.g + x.c) summary(ols) # muhat=3.9+-0.17, beta.g.hat=1.6+-0.17, beta.c.hat=0.58+-0.21, R2=0.51 sqrt((1/(N-3)) * sum(ols$residuals^2)) # sigma.hat = 1.058 ## perform the OLS analysis with the SVD of X X <- cbind(rep(1,N), x.g, x.c) Xp <- svd(x=X) B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y E.hat <- y - X %*% B.hat sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 1.058 var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v) sqrt(diag(var.theta.hat)) # 0.168 0.175 0.212 Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (here). |