User:Timothee Flutre/Notebook/Postdoc/2012/03/13
From OpenWetWare
Project name | Main project page Previous entry Next entry |
About low-rank matrix decomposition(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)
Let be X is the data matrix. It is a N x P matrix recording the measurements made on N samples (in rows) at P variables (in columns). This means that X_{np} corresponds to the measurement of the p^{th} variable on the n^{th} sample, with and . Concretely, what the description of I. Jolliffe above means is the following. By doing a PCA of X, we build P linear combinations of the original P variables. These linear combinations are called "principal components" (PCs). Most importantly, we hope that the first K PCs account for most of the total sample variance, and that ). If this is the case, we can confidently approximate X by the first K columns of Y, thereby discarding the measurements at the remaining K-P variables. In matrix notation, doing the PCA of X boils down to applying a linear transformation on X. More precisely, this transformation is a rotation in the directions of maximum variance in the vector space of the P variables. As a result of the PCA, a new matrix Y is created such that Y = XA, with Y a N x P matrix with the rotated data, and A a P x P matrix whose columns are the principal components.
First, we need to find PC1, ie. the first column of the matrix A. According to what is written above (Y=XA), we can write down the N elements of the first column of Y:
Let's define the (column) vector y_{1} as the first column of Y and the (column) vector a_{1} as the first PC (first column of A). In matrix notation, we can re-write the system of equations above as y_{1} = Xa_{1}. We therefore need to find a_{1}. For this, remember that we want PC1 to account for most of the total variance of X. This translates mathematically into maximizing V[y_{1}], which in turns boils down to maximizing each element of V[y_{1}], ie. V[Y_{11}], ..., V[Y_{n1}], ..., V[Y_{N1}]. Let's do it for V[Y_{n1}]. If we define the (row) vector x_{n} to be the n^{th} row of X, then in matrix notation we have V[Y_{n1}] = V[x_{n}a_{1}]. However, it's more common to manipulate only column vectors, so let's define the (column) vector x_{n} to be the n^{th} row of X. This means that . Now let be Σ_{n} the covariance matrix of x_{n}. This allows us to develop into . Without anything else, it's a bit odd to look for a_{1} that maximizes such a formula. Indeed we need to add a constraint, for instance: (squared Euclidean norm, inner product). Such a constraint can be enforced using a Lagrange multiplier λ. At the end, what we want is to find a_{1} that maximizes . For this, we differentiate the above formula with respect to a_{1} and set it to 0: . This shows us that λ is an eigenvalue of Σ_{n} and that a_{1} is the corresponding eigenvector. However, which eigenvector to choose? Well, the one that corresponds to the largest eigenvalue because the later happens to be exactly the variance of each element of PC1. Indeed: . Now that we have the first PC, we want the second one. We do the same as above, but with the additional constraint that PC2 has to be uncorrelated with PC1, ie. Cov[y_{2},y_{1}] = 0. Here again, let's focus on the n^{th} element of the PCs and let's develop the covariance: . Therefore, using two Lagrange multipliers, we now have to find a_{2} that maximizes . Again, we differentiate w.r.t. to a_{2} and set to 0: Σ_{n}a_{2} − λ_{2}a_{2} − φa_{1} = 0. When we multiply on the left by we get: . We can thus write Σ_{n}a_{2} = λ_{2}a_{2} and find a_{2} as the eigenvector corresponding to the second largest eigenvalue of Σ_{n}. By doing this again, we can find all the principal components.
to do ... As any rectangular matrix, we can factorize X using the singular value decomposition (SVD): X = UDV^{T} where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is a PxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X. The transformed data are therefore Y = XV.
First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples): set.seed(1859) N <- 100 P <- 2 exp.gene1 <- rnorm(N, mean=3) exp.gene2 <- exp.gene1 + rnorm(N) dat <- cbind(exp.gene1, exp.gene2) plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data") Apply PCA using the the built-in "prcomp" function: res.pca <- prcomp(dat, center=TRUE, scale.=FALSE) summary(res.pca) Importance of components: PC1 PC2 Standard deviation 1.5848 0.6121 Proportion of Variance 0.8702 0.1298 Cumulative Proportion 0.8702 1.0000 Visualize the results: plot(res.pca, main="Variances of each PC") axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""), paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")), tick=FALSE) plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings") mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4) to continue ...
n individuals genotyped at L loci Z_{si}: allelic state of individual i at SNP s, Z: L x n binary matrix Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)
It is also common to standardize the rows so that they have unit variance. We want to approximate the data in X by a matrix Y such that Y = PX. |