|Project name|| Main project page|
Previous entry Next entry
About low-rank matrix decomposition
(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)
Let be X is the data matrix. It is a N x P matrix recording the measurements made on N samples (in rows) at P variables (in columns). This means that Xnp corresponds to the measurement of the pth variable on the nth sample, with and .
Concretely, what the description of I. Jolliffe above means is the following. By doing a PCA of X, we build P linear combinations of the original P variables. These linear combinations are called "principal components" (PCs). Most importantly, we hope that the first K PCs account for most of the total sample variance, and that ). If this is the case, we can confidently approximate X by the first K columns of Y, thereby discarding the measurements at the remaining K-P variables.
In matrix notation, doing the PCA of X boils down to applying a linear transformation on X. More precisely, this transformation is a rotation in the directions of maximum variance in the vector space of the P variables. As a result of the PCA, a new matrix Y is created such that Y = XA, with Y a N x P matrix with the rotated data, and A a P x P matrix whose columns are the principal components.
First, we need to find PC1, ie. the first column of the matrix A. According to what is written above (Y=XA), we can write down the N elements of the first column of Y:
Let's define the (column) vector y1 as the first column of Y and the (column) vector a1 as the first PC (first column of A). In matrix notation, we can re-write the system of equations above as y1 = Xa1.
We therefore need to find a1. For this, remember that we want PC1 to account for most of the total variance of X. This translates mathematically into maximizing V[y1], which in turns boils down to maximizing each element of V[y1], ie. V[Y11], ..., V[Yn1], ..., V[YN1].
Let's do it for V[Yn1]. If we define the (row) vector xn to be the nth row of X, then in matrix notation we have V[Yn1] = V[xna1]. However, it's more common to manipulate only column vectors, so let's define the (column) vector xn to be the nth row of X. This means that . Now let be Σn the covariance matrix of xn. This allows us to develop into .
Without anything else, it's a bit odd to look for a1 that maximizes such a formula. Indeed we need to add a constraint, for instance: (squared Euclidean norm, inner product). Such a constraint can be enforced using a Lagrange multiplier λ. At the end, what we want is to find a1 that maximizes .
For this, we differentiate the above formula with respect to a1 and set it to 0: . This shows us that λ is an eigenvalue of Σn and that a1 is the corresponding eigenvector.
However, which eigenvector to choose? Well, the one that corresponds to the largest eigenvalue because the later happens to be exactly the variance of each element of PC1. Indeed: .
Now that we have the first PC, we want the second one. We do the same as above, but with the additional constraint that PC2 has to be uncorrelated with PC1, ie. Cov[y2,y1] = 0. Here again, let's focus on the nth element of the PCs and let's develop the covariance: . Therefore, using two Lagrange multipliers, we now have to find a2 that maximizes . Again, we differentiate w.r.t. to a2 and set to 0: Σna2 − λ2a2 − φa1 = 0. When we multiply on the left by we get: . We can thus write Σna2 = λ2a2 and find a2 as the eigenvector corresponding to the second largest eigenvalue of Σn.
By doing this again, we can find all the principal components.
to do ...
As any rectangular matrix, we can factorize X using the singular value decomposition (SVD): X = UDVT
where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is a PxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X.
The transformed data are therefore Y = XV.
First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples):
set.seed(1859) N <- 100 P <- 2 exp.gene1 <- rnorm(N, mean=3) exp.gene2 <- exp.gene1 + rnorm(N) dat <- cbind(exp.gene1, exp.gene2) plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data")
Apply PCA using the the built-in "prcomp" function:
res.pca <- prcomp(dat, center=TRUE, scale.=FALSE) summary(res.pca) Importance of components: PC1 PC2 Standard deviation 1.5848 0.6121 Proportion of Variance 0.8702 0.1298 Cumulative Proportion 0.8702 1.0000
Visualize the results:
plot(res.pca, main="Variances of each PC") axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""), paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")), tick=FALSE) plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings") mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4)
to continue ...
n individuals genotyped at L loci
Zsi: allelic state of individual i at SNP s,
Z: L x n binary matrix
Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)
It is also common to standardize the rows so that they have unit variance.
We want to approximate the data in X by a matrix Y such that Y = PX.