User:Timothee Flutre/Notebook/Postdoc/2011/06/28
From OpenWetWare
==Linear regression by ordinary least squares==
* '''Data''': let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.
* '''Model''': to start with, we use a simple linear regression (univariate phenotype, single predictor).
<math>\forall n \in \{1,\ldots,N\}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)</math>
In matrix notation: <math>y = X\theta + \epsilon</math> with <math>\epsilon \sim N_N(0, \sigma^2 I_N)</math> and <math>\theta^T = (\mu, \beta)</math>

Here is the ordinary least squares (OLS) estimator of <math>\theta</math>:

<math>\hat{\theta} = (X^TX)^{-1} X^Ty</math>

Let's now define four summary statistics, very easy to compute:

<math>\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n, \quad \bar{g} = \frac{1}{N} \sum_{n=1}^N g_n, \quad g^Ty = \sum_{n=1}^N g_n y_n, \quad g^Tg = \sum_{n=1}^N g_n^2</math>

This allows us to obtain the estimate of the effect size only by having the summary statistics available:

<math>\hat{\beta} = \frac{g^Ty - N \bar{g} \bar{y}}{g^Tg - N \bar{g}^2}</math>

The same works for the estimate of the standard deviation of the errors (with <math>\hat{\mu} = \bar{y} - \hat{\beta} \bar{g}</math>):

<math>\hat{\sigma} = \sqrt{\frac{1}{N-2} \sum_{n=1}^N (y_n - \hat{\mu} - \hat{\beta} g_n)^2}</math>

We can also benefit from this for the standard errors of the parameters:

<math>V(\hat{\mu}) = \hat{\sigma}^2 \left( \frac{1}{N} + \frac{\bar{g}^2}{g^Tg - N\bar{g}^2} \right)</math>

<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>

* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: <math>y = \mu + \beta g + \epsilon</math>), it is common to fix the proportion of variance in <math>y</math> explained by <math>\beta g</math>:

<math>PVE = \frac{V(\beta g)}{V(\beta g) + V(\epsilon)}</math>

with <math>V(\beta g) = \frac{1}{N} \sum_{n=1}^N (\beta g_n - \overline{\beta g})^2</math> and <math>V(\epsilon) = \sigma^2</math>. This way, by also fixing <math>\beta</math>, it is easy to calculate the corresponding <math>\sigma</math>:

<math>\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^N (\beta g_n - \overline{\beta g})^2 \times \frac{1 - PVE}{PVE}}</math>

Here is some R code implementing this:

<nowiki>
set.seed(1859)
N <- 100
mu <- 5
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # MAF=0.2
beta <- 0.5
pve <- 0.8
beta.g.bar <- mean(beta * g)
sigma <- sqrt((1/N) * sum((beta * g - beta.g.bar)^2) * (1-pve) / pve) # 0.18
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
plot(x=0, type="n", xlim=range(g), ylim=range(y),
     xlab="genotypes (allele dose)", ylab="phenotypes",
     main="Simple linear regression")
for(i in unique(g))
  points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
ols <- lm(y ~ g)
summary(ols) # muhat=5.01, betahat=0.46, R2=0.779
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
</nowiki>
* '''Several predictors''': let's now imagine that we also know the gender of the <math>N</math> sampled individuals. We hence want to account for it in our estimate of the genotypic effect. In matrix notation, we still have the same model, <math>Y = XB + E</math>, with <math>Y</math> an <math>N \times 1</math> vector, <math>X</math> an <math>N \times 3</math> matrix with 1's in the first column, the genotypes in the second and the genders in the third, <math>B</math> a <math>3 \times 1</math> vector, and <math>E</math> an <math>N \times 1</math> vector following a multivariate Normal distribution centered on 0 and with covariance matrix <math>\sigma^2 I_N</math>.

As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To efficiently get them, we start with the [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] of <math>X</math>:

<math>X = U D V^T</math>
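For concreteness (standard facts about the SVD, added here as a reminder): <math>U</math> is <math>N \times 3</math> with orthonormal columns, <math>D</math> is <math>3 \times 3</math> diagonal holding the singular values, and <math>V</math> is <math>3 \times 3</math> orthogonal, so that:

<math>U^TU = I_3 \text{ and } V^TV = VV^T = I_3</math>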
+ | |||
+ | This allows us to get the [http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse Moore-Penrose pseudoinverse] matrix of X: | ||
+ | |||
+ | <math>X^+ = (X^TX)^{-1}X^T</math> | ||
+ | |||
+ | <math>X^+ = V D^{-1} U^T</math> | ||
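To see why the two expressions for <math>X^+</math> agree (assuming <math>X</math> has full column rank, so that <math>X^TX</math> is invertible), plug the SVD into the first one:

<math>(X^TX)^{-1}X^T = (V D U^T U D V^T)^{-1} V D U^T = (V D^2 V^T)^{-1} V D U^T = V D^{-2} V^T V D U^T = V D^{-1} U^T</math>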
+ | |||
+ | From this, we get the OLS estimate of the effect sizes: | ||
+ | |||
+ | <math>\hat{B} = X^+ Y</math> | ||
+ | |||
+ | Then it's straightforward to get the residuals: | ||
+ | |||
+ | <math>\hat{E} = Y - X \hat{B}</math> | ||
+ | |||
+ | With them we can calculate the estimate of the error variance: | ||
+ | |||
+ | <math>\hat{\sigma} = \sqrt{\frac{1}{N-3} \hat{E}^T \hat{E}}</math> | ||
+ | |||
+ | And finally the standard errors of the estimates of the effect sizes: | ||
+ | |||
+ | <math>V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T</math> | ||
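The last formula follows from the usual OLS sampling variance <math>V(\hat{B}) = \hat{\sigma}^2 (X^TX)^{-1}</math> together with the SVD:

<math>(X^TX)^{-1} = (V D^2 V^T)^{-1} = V D^{-2} V^T</math>

The standard errors of the individual effect sizes are the square roots of the diagonal elements of <math>V(\hat{B})</math>.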
+ | |||
+ | We can check this with some R code: | ||
+ | |||
+ | <nowiki> | ||
+ | ## simulate the data | ||
+ | set.seed(1859) | ||
+ | N <- 100 | ||
+ | mu <- 5 | ||
+ | Xg <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes | ||
+ | beta.g <- 0.5 | ||
+ | Xc <- sample(x=0:1, size=N, replace=TRUE, prob=c(0.7, 0.3)) # gender | ||
+ | beta.c <- 0.3 | ||
+ | pve <- 0.8 | ||
+ | betas.gc.bar <- mean(beta.g * Xg + beta.c * Xc) # 0.405 | ||
+ | sigma <- sqrt((1/N) * sum((beta.g * Xg + beta.c * Xc - betas.gc.bar)^2) * | ||
+ | (1-pve) / pve) # 0.2 | ||
+ | y <- mu + beta.g * Xg + beta.c * Xc + rnorm(n=N, mean=0, sd=sigma) | ||
+ | |||
+ | ## perform the OLS analysis with the SVD of X | ||
+ | X <- cbind(rep(1,N), Xg, Xc) | ||
+ | Xp <- svd(x=X) | ||
+ | B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y | ||
+ | E.hat <- y - X %*% B.hat | ||
+ | sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 0.211 | ||
+ | var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v) | ||
+ | sqrt(diag(var.theta.hat)) # 0.0304 0.0290 0.0463 | ||
+ | |||
+ | ## check all this | ||
+ | ols <- lm(y ~ Xg + Xc) | ||
+ | summary(ols) # muhat=4.99+-0.03, beta.g.hat=0.52+--.-29, beta.c.hat=0.24+-0.046, R2=0.789 | ||
+ | </nowiki> | ||
+ | |||
+ | Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL ([http://www.gnu.org/software/gsl/manual/html_node/Multi_002dparameter-fitting.html here]). | ||
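For a rough idea of what such a standalone program involves, here is a small Python sketch of the same three-parameter fit. Unlike the text above, it solves the normal equations <math>X^TX B = X^TY</math> by Gauss-Jordan elimination rather than via the SVD or the GSL, and the simulated data and parameter values are made up for illustration (standard library only):

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (a is a small square matrix)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * z for x, z in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# simulate data under y = mu + beta_g * g + beta_c * c + eps (toy values)
random.seed(1859)
N = 1000
mu, beta_g, beta_c, sigma = 5.0, 0.5, 0.3, 0.2
g = [random.choice([0]*5 + [1]*3 + [2]*2) for _ in range(N)]  # genotypes
c = [random.choice([0]*7 + [1]*3) for _ in range(N)]          # gender
y = [mu + beta_g * gi + beta_c * ci + random.gauss(0.0, sigma)
     for gi, ci in zip(g, c)]

# design matrix X = [1, g, c]; form X^T X and X^T Y, then solve for B_hat
X = [[1.0, gi, ci] for gi, ci in zip(g, c)]
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
B_hat = solve(xtx, xty)

# residuals and error standard deviation (3 fitted parameters removed)
E_hat = [yi - sum(xij * bj for xij, bj in zip(r, B_hat))
         for r, yi in zip(X, y)]
sigma_hat = (sum(e * e for e in E_hat) / (N - 3)) ** 0.5
```

Solving the normal equations directly is fine for a well-conditioned <math>3 \times 3</math> system like this one; the SVD route used above is the numerically safer choice when <math>X</math> is large or nearly collinear.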
Revision as of 15:35, 30 July 2012