User:Timothee Flutre/Notebook/Postdoc/2011/06/28
From OpenWetWare
(→Simple linear regression: add regression with multiple predictors) |
(→Linear regression by ordinary least squares: improv simul with PVE + expand variance geno) |
||
Line 76: | Line 76: | ||
- | * '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case | + | * '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case, the model is <math>y = \mu + \beta g + \epsilon</math>). Therefore, the variance of <math>y</math> can be decomposed like this: |
- | <math> | + | <math>V(y) = V(\mu + \beta g + \epsilon) = V(\mu) + V(\beta g) + V(\epsilon) = \beta^2 V(g) + \sigma^2</math> |
- | + | The most intuitive way to simulate data is therefore to fix the proportion of variance in <math>y</math> explained by the genotype, for instance <math>PVE=60%</math>, as well as the standard deviation of the errors, typically <math>\sigma=1</math>. From this, we can calculate the corresponding effect size <math>\beta</math> of the genotype: | |
- | <math> | + | <math>PVE = \frac{V(\beta g)}{V(y)}</math> |
- | Here is some R code implementing this: | + | Therefore: |
+ | <math>\beta = \pm \sigma \sqrt{\frac{PVE}{(1 - PVE) * V(g)}}</math> | ||
+ | |||
+ | Note that <math>g</math> is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency <math>f</math> (eg. <math>f=0.3</math>) and we will assume Hardy-Weinberg equilibrium. Then <math>g</math> is distributed according to a binomial distribution with 2 trials for which the probability of success is <math>f</math>. As a consequence, its variance is <math>V(g)=2f(1-f)</math>. | ||
+ | |||
+ | Here is some R code implementing all this: | ||
<nowiki> | <nowiki> | ||
set.seed(1859) | set.seed(1859) | ||
- | N <- 100 | + | N <- 100 # sample size |
- | mu <- | + | mu <- 4 |
- | g <- sample(x=0:2, size=N, replace=TRUE, prob=c( | + | pve <- 0.6 |
- | + | sigma <- 1 | |
- | + | maf <- 0.3 | |
- | + | beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.88 | |
- | + | g <- sample(x=0:2, size=N, replace=TRUE, prob=c(maf^2, 2*maf*(1-maf), (1-maf)^2)) | |
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) | y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) | ||
+ | ols <- lm(y ~ g) | ||
+ | summary(ols) # muhat=3.5, betahat=2.1, R2=0.64 | ||
+ | sqrt(mean(ols$residuals^2)) # sigmahat = 0.98 | ||
plot(x=0, type="n", xlim=range(g), ylim=range(y), | plot(x=0, type="n", xlim=range(g), ylim=range(y), | ||
- | xlab="genotypes | + | xlab="genotypes", ylab="phenotypes", |
main="Simple linear regression") | main="Simple linear regression") | ||
for(i in unique(g)) | for(i in unique(g)) | ||
points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) | points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) | ||
- | |||
- | |||
abline(a=coefficients(ols)[1], b=coefficients(ols)[2]) | abline(a=coefficients(ols)[1], b=coefficients(ols)[2]) | ||
</nowiki> | </nowiki> |
Revision as of 21:57, 14 August 2012
Project name | Main project page Next entry |
Linear regression by ordinary least squares
In matrix notation: y = Xθ + ε with ε˜N_{N}(0,σ^{2}I_{N}) and θ^{T} = (μ,β)
Here is the ordinary-least-square (OLS) estimator of θ:
Let's now define 4 summary statistics, very easy to compute:
This allows to obtain the estimate of the effect size only by having the summary statistics available:
The same works for the estimate of the standard deviation of the errors:
We can also benefit from this for the standard error of the parameters:
V(y) = V(μ + βg + ε) = V(μ) + V(βg) + V(ε) = β^{2}V(g) + σ^{2} The most intuitive way to simulate data is therefore to fix the proportion of variance in y explained by the genotype, for instance PVE = 60%, as well as the standard deviation of the errors, typically σ = 1. From this, we can calculate the corresponding effect size β of the genotype:
Therefore: Note that g is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency f (eg. f = 0.3) and we will assume Hardy-Weinberg equilibrium. Then g is distributed according to a binomial distribution with 2 trials for which the probability of success is f. As a consequence, its variance is V(g) = 2f(1 − f). Here is some R code implementing all this: set.seed(1859) N <- 100 # sample size mu <- 4 pve <- 0.6 sigma <- 1 maf <- 0.3 beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.88 g <- sample(x=0:2, size=N, replace=TRUE, prob=c(maf^2, 2*maf*(1-maf), (1-maf)^2)) y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ g) summary(ols) # muhat=3.5, betahat=2.1, R2=0.64 sqrt(mean(ols$residuals^2)) # sigmahat = 0.98 plot(x=0, type="n", xlim=range(g), ylim=range(y), xlab="genotypes", ylab="phenotypes", main="Simple linear regression") for(i in unique(g)) points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
As above, we want , and . To efficiently get them, we start with the singular value decomposition of X: X = UDV^{T} This allows us to get the Moore-Penrose pseudoinverse matrix of X: X^{ + } = (X^{T}X)^{ − 1}X^{T} X^{ + } = VD^{ − 1}U^{T} From this, we get the OLS estimate of the effect sizes:
Then it's straightforward to get the residuals:
With them we can calculate the estimate of the error variance:
And finally the standard errors of the estimates of the effect sizes:
We can check this with some R code: ## simulate the data set.seed(1859) N <- 100 mu <- 5 Xg <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes beta.g <- 0.5 Xc <- sample(x=0:1, size=N, replace=TRUE, prob=c(0.7, 0.3)) # gender beta.c <- 0.3 pve <- 0.8 betas.gc.bar <- mean(beta.g * Xg + beta.c * Xc) # 0.405 sigma <- sqrt((1/N) * sum((beta.g * Xg + beta.c * Xc - betas.gc.bar)^2) * (1-pve) / pve) # 0.2 y <- mu + beta.g * Xg + beta.c * Xc + rnorm(n=N, mean=0, sd=sigma) ## perform the OLS analysis with the SVD of X X <- cbind(rep(1,N), Xg, Xc) Xp <- svd(x=X) B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y E.hat <- y - X %*% B.hat sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 0.211 var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v) sqrt(diag(var.theta.hat)) # 0.0304 0.0290 0.0463 ## check all this ols <- lm(y ~ Xg + Xc) summary(ols) # muhat=4.99+-0.03, beta.g.hat=0.52+--.-29, beta.c.hat=0.24+-0.046, R2=0.789 Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (here). |