User:Timothee Flutre/Notebook/Postdoc/2012/03/13: Difference between revisions
m (→About low-rank matrix decomposition: temp) |
(→About low-rank matrix decomposition: add details SVD) |
||
Line 10: | Line 10: | ||
''(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)'' | ''(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)'' | ||
* from the [http://www.citeulike.org/user/timflutre/article/1154147 book] '''"Principal component analysis" by I. Jolliffe''': "The central idea of principal component analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.". | |||
* '''Theoretical explanation:''' | |||
X is a N x P matrix of data with measurements for N samples (in rows) at P variables (in columns). | |||
As any rectangular matrix, we can factorize X using the singular value decomposition ([http://en.wikipedia.org/wiki/Singular_value_decomposition SVD]): <math>X = U D V^T</math> | |||
where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is aPxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X. | |||
The transformed data are therefore: <math>Y = X V = U D</math> | |||
to continue ... | |||
* '''Example in R:''' | |||
First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples): | |||
set.seed(1859) | |||
N <- 100 | |||
P <- 2 | |||
exp.gene1 <- rnorm(N, mean=3) | |||
exp.gene2 <- exp.gene1 + rnorm(N) | |||
dat <- cbind(exp.gene1, exp.gene2) | |||
plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data") | |||
Apply PCA using the the built-in "prcomp" function: | |||
res.pca <- prcomp(dat, center=TRUE, scale.=FALSE) | |||
summary(res.pca) | |||
Importance of components: | |||
PC1 PC2 | |||
Standard deviation 1.5848 0.6121 | |||
Proportion of Variance 0.8702 0.1298 | |||
Cumulative Proportion 0.8702 1.0000 | |||
Visualize the results: | |||
plot(res.pca, main="Variances of each PC") | |||
axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""), | |||
paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")), | |||
tick=FALSE) | |||
plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings") | |||
mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4) | |||
to continue ... | |||
* from the '''[http://www.citeulike.org/user/timflutre/article/5952306 article] "A Genealogical Interpretation of Principal Components Analysis" by G. McVean''' | |||
n individuals genotyped at L loci | |||
<math>Z_{si}</math>: allelic state of individual i at SNP s, <math>\in {0,1}</math> | |||
Z: L x n binary matrix | |||
Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row) | |||
<math>X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj}</math> | |||
It is also common to standardize the rows so that they have unit variance. | |||
We want to approximate the data in X by a matrix Y such that <math>Y = P X</math>. | |||
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Revision as of 14:44, 15 March 2012
Project name | <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page <html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html> </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> |
About low-rank matrix decomposition(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)
X is a N x P matrix of data with measurements for N samples (in rows) at P variables (in columns). As any rectangular matrix, we can factorize X using the singular value decomposition (SVD): [math]\displaystyle{ X = U D V^T }[/math] where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is aPxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X. The transformed data are therefore: [math]\displaystyle{ Y = X V = U D }[/math] to continue ...
First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples): set.seed(1859) N <- 100 P <- 2 exp.gene1 <- rnorm(N, mean=3) exp.gene2 <- exp.gene1 + rnorm(N) dat <- cbind(exp.gene1, exp.gene2) plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data") Apply PCA using the the built-in "prcomp" function: res.pca <- prcomp(dat, center=TRUE, scale.=FALSE) summary(res.pca) Importance of components: PC1 PC2 Standard deviation 1.5848 0.6121 Proportion of Variance 0.8702 0.1298 Cumulative Proportion 0.8702 1.0000 Visualize the results: plot(res.pca, main="Variances of each PC") axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""), paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")), tick=FALSE) plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings") mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4) to continue ...
n individuals genotyped at L loci [math]\displaystyle{ Z_{si} }[/math]: allelic state of individual i at SNP s, [math]\displaystyle{ \in {0,1} }[/math] Z: L x n binary matrix Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row) [math]\displaystyle{ X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj} }[/math] It is also common to standardize the rows so that they have unit variance. We want to approximate the data in X by a matrix Y such that [math]\displaystyle{ Y = P X }[/math]. |