User:Timothee Flutre/Notebook/Postdoc/2011/06/28
From OpenWetWare
(→Linear regression by ordinary least squares: use rbinom with MAF when simulating genotypes + better PVE for multiple regression) |
(→Linear regression by ordinary least squares: improve notation) |
||
Line 8: | Line 8: | ||
==Linear regression by ordinary least squares== | ==Linear regression by ordinary least squares== | ||
- | * '''Data''': let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes ( | + | * '''Data''': let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship. |
* '''Model''': to start with, we use a simple linear regression (univariate phenotype, single predictor). | * '''Model''': to start with, we use a simple linear regression (univariate phenotype, single predictor). | ||
- | <math>\forall | + | <math>\forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta g_i + \epsilon_i \text{ with } \epsilon_i \sim \mathcal{N}(0,\sigma^2)</math> |
- | In matrix notation: | + | In vector-matrix notation: |
- | <math>y = X | + | <math>\vec{y} = X B + \vec{e}</math> with <math>\vec{e} \sim \mathcal{N}_N(0,\sigma^2 I_N)</math> and <math>B^T = [\mu \beta]</math> |
- | + | The parameters of the model are: <math>\Theta = \{\mu, \beta, \sigma\}</math> | |
- | + | * '''Use only summary statistics''': most importantly, we want the following estimates: <math>\hat{\beta}</math>, <math>\hat{\sigma}</math> and <math>se(\hat{\beta})</math> (its standard error). In the case where we don't have access to the original data (e.g. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates. | |
- | <math>\hat{ | + | The well-known ordinary-least-square ([http://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation OLS]) estimator of <math>B</math> is: |
+ | |||
+ | <math>\hat{B} = (X^T X)^{-1} X^T \vec{y}</math> | ||
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | <math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | ||
Line 32: | Line 34: | ||
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | <math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | ||
- | \begin{bmatrix} N & \ | + | \begin{bmatrix} N & \sum_i g_i \\ \sum_i g_i & \sum_i g_i^2 \end{bmatrix}^{-1} |
- | \begin{bmatrix} \ | + | \begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix} |
</math> | </math> | ||
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | <math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | ||
- | \frac{1}{N \ | + | \frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2} |
- | \begin{bmatrix} \ | + | \begin{bmatrix} \sum_i g_i^2 & - \sum_i g_i \\ - \sum_i g_i & N \end{bmatrix} |
- | \begin{bmatrix} \ | + | \begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix} |
</math> | </math> | ||
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | <math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = | ||
- | \frac{1}{N \ | + | \frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2} |
- | \begin{bmatrix} \ | + | \begin{bmatrix} \sum_i g_i^2 \sum_i y_i - \sum_i g_i \sum_i g_i y_i \\ - \sum_i g_i \sum_i y_i + N \sum_i g_i y_i \end{bmatrix} |
</math> | </math> | ||
- | Let's now define 4 summary statistics, very easy to compute: | + | Let's now define 4 '''summary statistics''', very easy to compute: |
- | <math>\bar{y} = \frac{1}{N} \sum_{ | + | <math>\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i</math> |
- | <math>\bar{g} = \frac{1}{N} \sum_{ | + | <math>\bar{g} = \frac{1}{N} \sum_{i=1}^N g_i</math> |
- | <math>g^T g = \sum_{ | + | <math>\vec{g}^T \vec{g} = \sum_{i=1}^N g_i^2</math> |
- | <math>g^T y = \sum_{ | + | <math>\vec{g}^T \vec{y} = \sum_{i=1}^N g_i y_i</math> |
This allows to obtain the estimate of the effect size only by having the summary statistics available: | This allows to obtain the estimate of the effect size only by having the summary statistics available: | ||
- | <math>\hat{\beta} = \frac{g^T y - N \bar{g} \bar{y}}{g^T g - N \bar{g}^2}</math> | + | <math>\hat{\beta} = \frac{\vec{g}^T \vec{y} - N \bar{g} \bar{y}}{\vec{g}^T \vec{g} - N \bar{g}^2}</math> |
+ | |||
+ | By the way, in this case (i.e. simple linear regression, a single predictor), it's easy to see that: | ||
+ | |||
+ | <math>\hat{\beta} = \frac{Cov[\vec{y},\vec{g}]}{Var[\vec{g}]}</math> | ||
The same works for the estimate of the standard deviation of the errors: | The same works for the estimate of the standard deviation of the errors: | ||
- | <math>\hat{\sigma}^2 = \frac{1}{N-r}(y - X\hat{ | + | <math>\hat{\sigma}^2 = \frac{1}{N-r}(\vec{y} - X\hat{B})^T(\vec{y} - X\hat{B})</math> |
We can also benefit from this for the standard error of the parameters: | We can also benefit from this for the standard error of the parameters: | ||
- | <math>V(\hat{ | + | <math>V(\hat{B}) = \hat{\sigma}^2 (X^T X)^{-1}</math> |
- | <math>V(\hat{ | + | <math>V(\hat{B}) = \hat{\sigma}^2 \frac{1}{N \vec{g}^T \vec{g} - N^2 \bar{g}^2} |
- | \begin{bmatrix} g^ | + | \begin{bmatrix} \vec{g}^T\vec{g} & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix} |
</math> | </math> | ||
- | <math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^ | + | <math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{\vec{g}^T\vec{g} - N\bar{g}^2}</math> |
+ | |||
+ | which corresponds to: | ||
+ | |||
+ | <math>V(\hat{\beta}) = \frac{1}{N} \frac{Var[\vec{e}]}{Var[\vec{g}]}</math> | ||
Line 113: | Line 123: | ||
- | * '''Several predictors''': let's now imagine that we also know the gender of the N sampled individuals. We hence want to account for that in our estimate of the | + | * '''Several predictors''': let's now imagine that we also know the gender of the N sampled individuals. We hence want to account for that in our estimate of the effect of the genotype. In matrix notation, we still have the same model, <math>\vec{y} = XB + \vec{e}</math> with <math>\vec{y}</math> an Nx1 vector, X an Nx3 matrix with 1's in the first column, the genotypes in the second and the genders in the third, B a 3x1 vector and <math>\vec{e}</math> an Nx1 vector following a multivariate Normal distribution centered on 0 and with covariance matrix <math>\sigma^2 I_N</math>. |
As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To efficiently get them, we start with the [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] of X: | As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To efficiently get them, we start with the [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] of X: | ||
Line 121: | Line 131: | ||
This allows us to get the [http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse Moore-Penrose pseudoinverse] matrix of X: | This allows us to get the [http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse Moore-Penrose pseudoinverse] matrix of X: | ||
- | <math>X^+ = (X^TX)^{-1}X^T | + | <math>X^+ = (X^TX)^{-1}X^T = V D^{-1} U^T</math> |
- | + | ||
- | + | ||
From this, we get the OLS estimate of the effect sizes: | From this, we get the OLS estimate of the effect sizes: | ||
- | <math>\hat{B} = X^+ | + | <math>\hat{B} = X^+ \vec{y}</math> |
Then it's straightforward to get the residuals: | Then it's straightforward to get the residuals: | ||
- | <math>\hat{ | + | <math>\hat{\vec{e}} = \vec{y} - X \hat{B}</math> |
With them we can calculate the estimate of the error variance: | With them we can calculate the estimate of the error variance: | ||
- | <math>\hat{\sigma} = | + | <math>\hat{\sigma}^2 = \frac{1}{N-3} \hat{\vec{e}}^T \hat{\vec{e}}</math> |
And finally the standard errors of the estimates of the effect sizes: | And finally the standard errors of the estimates of the effect sizes: |
Revision as of 11:03, 21 November 2012
Project name | Main project page Next entry |
Linear regression by ordinary least squares
In vector-matrix notation: with and B^{T} = [μβ] The parameters of the model are: Θ = {μ,β,σ}
The well-known ordinary-least-square (OLS) estimator of B is:
Let's now define 4 summary statistics, very easy to compute:
This allows to obtain the estimate of the effect size only by having the summary statistics available:
By the way, in this case (i.e. simple linear regression, a single predictor), it's easy to see that:
The same works for the estimate of the standard deviation of the errors:
We can also benefit from this for the standard error of the parameters:
which corresponds to:
V(y) = V(μ + βg + ε) = V(μ) + V(βg) + V(ε) = β^{2}V(g) + σ^{2} The most intuitive way to simulate data is therefore to fix the proportion of variance in y explained by the genotype, for instance PVE = 60%, as well as the standard deviation of the errors, typically σ = 1. From this, we can calculate the corresponding effect size β of the genotype:
Therefore: Note that g is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency f (eg. f = 0.3) and we will assume Hardy-Weinberg equilibrium. Then g is distributed according to a binomial distribution with 2 trials for which the probability of success is f. As a consequence, its variance is V(g) = 2f(1 − f). Here is some R code implementing all this: set.seed(1859) N <- 100 # sample size mu <- 4 pve <- 0.6 sigma <- 1 maf <- 0.3 # minor allele frequency beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.89 g <- rbinom(n=N, size=2, prob=maf) # assuming Hardy-Weinberg equilibrium y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ g) summary(ols) # muhat=4.1+-0.13, betahat=1.6+-0.16, R2=0.49 sqrt((1/(N-2) * sum(ols$residuals^2))) # sigmahat=0.99 plot(x=0, type="n", xlim=range(g), ylim=range(y), xlab="genotypes (allele counts)", ylab="phenotypes", main="Simple linear regression") for(i in unique(g)) points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
As above, we want , and . To efficiently get them, we start with the singular value decomposition of X: X = UDV^{T} This allows us to get the Moore-Penrose pseudoinverse matrix of X: X^{ + } = (X^{T}X)^{ − 1}X^{T} = VD^{ − 1}U^{T} From this, we get the OLS estimate of the effect sizes:
Then it's straightforward to get the residuals:
With them we can calculate the estimate of the error variance:
And finally the standard errors of the estimates of the effect sizes:
We can check this with some R code: ## simulate the data set.seed(1859) N <- 100 mu <- 4 pve.g <- 0.4 # genotype pve.c <- 0.2 # other covariate, eg. gender sigma <- 1 maf <- 0.3 sex.ratio <- 0.5 beta.g <- sigma * sqrt((1 / (2 * maf * (1 - maf))) * (pve.g / (1 - pve.g - pve.c))) # 1.543 beta.c <- beta.g * sqrt((pve.c / pve.g) * (2 * maf * (1 - maf) / sex.ratio * (1 - sex.ratio))) # 0.707 x.g <- rbinom(n=N, size=2, prob=maf) x.c <- rbinom(n=N, size=1, prob=sex.ratio) y <- mu + beta.g * x.g + beta.c * x.c + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ x.g + x.c) summary(ols) # muhat=3.9+-0.17, beta.g.hat=1.6+-0.17, beta.c.hat=0.58+-0.21, R2=0.51 sqrt((1/(N-3)) * sum(ols$residuals^2)) # sigma.hat = 1.058 ## perform the OLS analysis with the SVD of X X <- cbind(rep(1,N), x.g, x.c) Xp <- svd(x=X) B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y E.hat <- y - X %*% B.hat sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 1.058 var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v) sqrt(diag(var.theta.hat)) # 0.168 0.175 0.212 Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (here). |