User:Timothee Flutre/Notebook/Postdoc/2012/03/13
From OpenWetWare
Main project page Previous entry Next entry
| |
About low-rank matrix decomposition(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)
Let be X is the data matrix. It is a N x P matrix recording the measurements made on N samples (in rows) at P variables (in columns). This means that Xnp corresponds to the measurement of the pth variable on the nth sample, with Concretely, what the description of I. Jolliffe above means is the following. By doing a PCA of X, we build P linear combinations of the original P variables. These linear combinations are called "principal components" (PCs). Most importantly, we hope that the first K PCs account for most of the total sample variance, and that In matrix notation, doing the PCA of X boils down to applying a linear transformation on X. More precisely, this transformation is a rotation in the directions of maximum variance in the vector space of the P variables. As a result of the PCA, a new matrix Y is created such that Y = XA, with Y a N x P matrix with the rotated data, and A a P x P matrix whose columns are the principal components.
First, we need to find PC1, ie. the first column of the matrix A. According to what is written above (Y=XA), we can write down the N elements of the first column of Y:
Let's define the (column) vector y1 as the first column of Y and the (column) vector a1 as the first PC (first column of A). In matrix notation, we can re-write the system of equations above as y1 = Xa1. We therefore need to find a1. For this, remember that we want PC1 to account for most of the total variance of X. This translates mathematically into maximizing V[y1], which in turns boils down to maximizing each element of V[y1], ie. V[Y11], ..., V[Yn1], ..., V[YN1]. Let's do it for V[Yn1]. If we define the (row) vector xn to be the nth row of X, then in matrix notation we have V[Yn1] = V[xna1]. However, it's more common to manipulate only column vectors, so let's define the (column) vector xn to be the nth row of X. This means that Without anything else, it's a bit odd to look for a1 that maximizes such a formula. Indeed we need to add a constraint, for instance: For this, we differentiate the above formula with respect to a1 and set it to 0: However, which eigenvector to choose? Well, the one that corresponds to the largest eigenvalue because the later happens to be exactly the variance of each element of PC1. Indeed: Now that we have the first PC, we want the second one. We do the same as above, but with the additional constraint that PC2 has to be uncorrelated with PC1, ie. Cov[y2,y1] = 0. Here again, let's focus on the nth element of the PCs and let's develop the covariance: By doing this again, we can find all the principal components.
to do ... As any rectangular matrix, we can factorize X using the singular value decomposition (SVD): X = UDVT where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is a PxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X. The transformed data are therefore Y = XV.
First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples): set.seed(1859) N <- 100 P <- 2 exp.gene1 <- rnorm(N, mean=3) exp.gene2 <- exp.gene1 + rnorm(N) dat <- cbind(exp.gene1, exp.gene2) plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data") Apply PCA using the the built-in "prcomp" function: res.pca <- prcomp(dat, center=TRUE, scale.=FALSE)
summary(res.pca)
Importance of components:
PC1 PC2
Standard deviation 1.5848 0.6121
Proportion of Variance 0.8702 0.1298
Cumulative Proportion 0.8702 1.0000
Visualize the results: plot(res.pca, main="Variances of each PC")
axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""),
paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")),
tick=FALSE)
plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings")
mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4)
to continue ...
n individuals genotyped at L loci Zsi: allelic state of individual i at SNP s, Z: L x n binary matrix Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)
It is also common to standardize the rows so that they have unit variance. We want to approximate the data in X by a matrix Y such that Y = PX. | |

and
.
). If this is the case, we can confidently approximate X by the first K columns of Y, thereby discarding the measurements at the remaining K-P variables.
. Now let be
into
.
(squared Euclidean norm, inner product). Such a constraint can be enforced using a Lagrange multiplier
.
. This shows us that
.
. Therefore, using two Lagrange multipliers, we now have to find
. Again, we differentiate w.r.t. to
we get:
. We can thus write


