User:Timothee Flutre/Notebook/Postdoc/2012/03/13: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
Line 10: Line 10:
''(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)''
''(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)''


* '''[http://www.citeulike.org/user/timflutre/article/5952306 article] "A Genealogical Interpretation of Principal Components Analysis" from McVean (PLoS Genetics 2009)'''


n individuals genotyped at L loci
* from the [http://www.citeulike.org/user/timflutre/article/1154147 book] '''"Principal component analysis" by I. Jolliffe''': "The central idea of principal component analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.".
 
 
* '''Theoretical explanation:'''
 
X is a N x P matrix of data with measurements for N samples (in rows) at P variables (in columns).
 
As any rectangular matrix, we can factorize X using the singular value decomposition ([http://en.wikipedia.org/wiki/Singular_value_decomposition SVD]): <math>X = U D V^T</math>
 
where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is aPxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X.
 
The transformed data are therefore: <math>Y = X V = U D</math>
 
to continue ...
 
 
* '''Example in R:'''
 
First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples):
 
set.seed(1859)
N <- 100
P <- 2
exp.gene1 <- rnorm(N, mean=3)
exp.gene2 <- exp.gene1 + rnorm(N)
dat <- cbind(exp.gene1, exp.gene2)
plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data")


<math>Z_{si}</math>: allelic state of individual i at SNP s, <math>\in {0,1}</math>
Apply PCA using the the built-in "prcomp" function:


Z: L x n binary matrix
res.pca <- prcomp(dat, center=TRUE, scale.=FALSE)
summary(res.pca)
Importance of components:
                          PC1    PC2
Standard deviation    1.5848 0.6121
Proportion of Variance 0.8702 0.1298
Cumulative Proportion  0.8702 1.0000


** '''Preprocessing''': usally we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)
Visualize the results:


<math>X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj}</math>
plot(res.pca, main="Variances of each PC")
axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""),
                            paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")),
      tick=FALSE)
plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings")
mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4)


It is also common to standardize the rows so that they have unit variance.
to continue ...


** '''Apply PCA''': "the goal is to find a new set of orthogonal axes (the principal components, PCs), each of which is made up from a linear combination of the original axes, such that the projection of the original data onto these new axes leads to an efficient summary of the structure of the data".


We want to approximate the data in X by a matrix Y such that <math>Y = P X</math>. The rows of P are the principal components (PCs). The entries of each PC are called "loadings".
* from the '''[http://www.citeulike.org/user/timflutre/article/5952306 article] "A Genealogical Interpretation of Principal Components Analysis" by G. McVean'''


** '''PCA via eigendecomposition''': the PCs (ie. the loadings) can be obtained by doing an [http://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix eigendecomposition] of the matrix <math>C = \frac{1}{n-1} X X^T</math>. This matrix is the covariance matrix if the rows of Z are not standardized, otherwise it is the correlation matrix. The <math>i^{th}</math>
n individuals genotyped at L loci


** '''Example in R''': no specific model is assumed here, just random data (to be improved...)
<math>Z_{si}</math>: allelic state of individual i at SNP s, <math>\in {0,1}</math>


L <- 1000
Z: L x n binary matrix
n <- 100
Z <- matrix(rnorm(L*n), nrow=L, ncol=n)


# compare PC loadings from built-in and custom methods with centering and standardization
Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)
res.pca <- prcomp(Z, center=TRUE, scale.=TRUE)
X <- Z - rowMeans(Z)
res.eig <- eigen(cor(X))
range(res.pca$sdev - sqrt(res.eig$values)) # should be close to 0


# with centering but no standardization
<math>X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj}</math>
res.pca <- prcomp(Z, center=TRUE, scale.=FALSE)
 
X <- Z - rowMeans(Z)
It is also common to standardize the rows so that they have unit variance.
res.eig <- eigen(cov(X))
range(res.pca$sdev - sqrt(res.eig$values))  # should be close to 0


# with neither centering nor standardization
We want to approximate the data in X by a matrix Y such that <math>Y = P X</math>.
res.pca <- prcomp(Z, center=FALSE, scale.=FALSE)
res.eig <- eigen(cov(Z))
range(res.pca$sdev - sqrt(res.eig$values))  # should be close to 0


<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->

Revision as of 14:44, 15 March 2012

Project name <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page
<html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html>

About low-rank matrix decomposition

(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)


  • from the book "Principal component analysis" by I. Jolliffe: "The central idea of principal component analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.".


  • Theoretical explanation:

X is a N x P matrix of data with measurements for N samples (in rows) at P variables (in columns).

As any rectangular matrix, we can factorize X using the singular value decomposition (SVD): [math]\displaystyle{ X = U D V^T }[/math]

where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is aPxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X.

The transformed data are therefore: [math]\displaystyle{ Y = X V = U D }[/math]

to continue ...


  • Example in R:

First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples):

set.seed(1859)
N <- 100
P <- 2
exp.gene1 <- rnorm(N, mean=3)
exp.gene2 <- exp.gene1 + rnorm(N)
dat <- cbind(exp.gene1, exp.gene2)
plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data")

Apply PCA using the the built-in "prcomp" function:

res.pca <- prcomp(dat, center=TRUE, scale.=FALSE)
summary(res.pca)
Importance of components:
                          PC1    PC2
Standard deviation     1.5848 0.6121
Proportion of Variance 0.8702 0.1298
Cumulative Proportion  0.8702 1.0000

Visualize the results:

plot(res.pca, main="Variances of each PC")
axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""),
                            paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")),
     tick=FALSE)
plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings")
mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4)

to continue ...


  • from the article "A Genealogical Interpretation of Principal Components Analysis" by G. McVean

n individuals genotyped at L loci

[math]\displaystyle{ Z_{si} }[/math]: allelic state of individual i at SNP s, [math]\displaystyle{ \in {0,1} }[/math]

Z: L x n binary matrix

Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)

[math]\displaystyle{ X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj} }[/math]

It is also common to standardize the rows so that they have unit variance.

We want to approximate the data in X by a matrix Y such that [math]\displaystyle{ Y = P X }[/math].