# User:Timothee Flutre/Notebook/Postdoc/2012/03/13

Project name Main project page
Previous entry      Next entry

(keywords: PCA, SVD, factor analysis, linear algebra, numerical analysis, statistical interpretation)

• What is PCA? "The central idea of principal component analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming to a new set of variables, the principal components, which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables." from the book "Principal component analysis" by I. Jolliffe.

• Theoretical intuition:

Let be X is the data matrix. It is a N x P matrix recording the measurements made on N samples (in rows) at P variables (in columns). This means that Xnp corresponds to the measurement of the pth variable on the nth sample, with $n \in \{1,...,N\}$ and $p \in \{1,...,P\}$.

Concretely, what the description of I. Jolliffe above means is the following. By doing a PCA of X, we build P linear combinations of the original P variables. These linear combinations are called "principal components" (PCs). Most importantly, we hope that the first K PCs account for most of the total sample variance, and that $K \ll P$). If this is the case, we can confidently approximate X by the first K columns of Y, thereby discarding the measurements at the remaining K-P variables.

In matrix notation, doing the PCA of X boils down to applying a linear transformation on X. More precisely, this transformation is a rotation in the directions of maximum variance in the vector space of the P variables. As a result of the PCA, a new matrix Y is created such that Y = XA, with Y a N x P matrix with the rotated data, and A a P x P matrix whose columns are the principal components.

• Mathematical derivation:

First, we need to find PC1, ie. the first column of the matrix A. According to what is written above (Y=XA), we can write down the N elements of the first column of Y:

$\begin{cases} Y_{11} = X_{11} A_{11} + X_{12} A_{21} + ... + X_{1P} A_{P1} \\ ... \\ Y_{n1} = X_{n1} A_{11} + X_{n2} A_{21} + ... + X_{nP} A_{P1} \\ ... \\ Y_{N1} = X_{N1} A_{11} + X_{N2} A_{21} + ... + X_{NP} A_{P1} \end{cases}$

Let's define the (column) vector y1 as the first column of Y and the (column) vector a1 as the first PC (first column of A). In matrix notation, we can re-write the system of equations above as y1 = Xa1.

We therefore need to find a1. For this, remember that we want PC1 to account for most of the total variance of X. This translates mathematically into maximizing V[y1], which in turns boils down to maximizing each element of V[y1], ie. V[Y11], ..., V[Yn1], ..., V[YN1].

Let's do it for V[Yn1]. If we define the (row) vector xn to be the nth row of X, then in matrix notation we have V[Yn1] = V[xna1]. However, it's more common to manipulate only column vectors, so let's define the (column) vector xn to be the nth row of X. This means that $V[Y_{n1}] = V[a_1^T x_n]$. Now let be Σn the covariance matrix of xn. This allows us to develop $V[a_1^T x_n]$ into $a_1^T \Sigma_n a_1$.

Without anything else, it's a bit odd to look for a1 that maximizes such a formula. Indeed we need to add a constraint, for instance: $\lVert a_1 \rVert ^2 = a_1^T a_1 = 1$ (squared Euclidean norm, inner product). Such a constraint can be enforced using a Lagrange multiplier λ. At the end, what we want is to find a1 that maximizes $a_1^T \Sigma_n a_1 - \lambda (a_1^T a_1 - 1)$.

For this, we differentiate the above formula with respect to a1 and set it to 0: $\Sigma_n a_1 - \lambda a_1 = 0 \Leftrightarrow \Sigma_n a_1 = \lambda a_1$. This shows us that λ is an eigenvalue of Σn and that a1 is the corresponding eigenvector.

However, which eigenvector to choose? Well, the one that corresponds to the largest eigenvalue because the later happens to be exactly the variance of each element of PC1. Indeed: $V[Y_{n1}] = V[a_1^T x_n] = a_1^T \Sigma_n a_1 = a_1^T \lambda a_1 = \lambda a_1^T a_1 = \lambda$.

Now that we have the first PC, we want the second one. We do the same as above, but with the additional constraint that PC2 has to be uncorrelated with PC1, ie. Cov[y2,y1] = 0. Here again, let's focus on the nth element of the PCs and let's develop the covariance: $Cov[Y_{n2}, Y_{n1}] = Cov[a_2^T x_n, a_1^T x_n] = a_2^T \Sigma_n a_1 = \lambda_1 a_2^T a_1$. Therefore, using two Lagrange multipliers, we now have to find a2 that maximizes $a_2^T \Sigma_n a_2 - \lambda_2(a_2^T a_2 - 1) - \phi a_2^T a_1$. Again, we differentiate w.r.t. to a2 and set to 0: Σna2 − λ2a2 − φa1 = 0. When we multiply on the left by $a_1^T$ we get: $a_1^T \Sigma_n a_2 - \lambda_2 a_1^T a_2 - \phi a_1^T a_1 = 0 \Leftrightarrow \phi = 0$. We can thus write Σna2 = λ2a2 and find a2 as the eigenvector corresponding to the second largest eigenvalue of Σn.

By doing this again, we can find all the principal components.

• EVD or SVD?

to do ...

As any rectangular matrix, we can factorize X using the singular value decomposition (SVD): X = UDVT

where U is a NxN unitary matrix, D is a NxP diagonal matrix and V is a PxP unitary matrix. The columns of U are the left singular vectors of X, the columns of V are the right singular vectors of X and D contains the singular values of X.

The transformed data are therefore Y = XV.

• Example in R:

First, let's simulate data such as expression levels at 2 genes (the variables) in 100 individuals (the samples):

set.seed(1859)
N <- 100
P <- 2
exp.gene1 <- rnorm(N, mean=3)
exp.gene2 <- exp.gene1 + rnorm(N)
dat <- cbind(exp.gene1, exp.gene2)
plot(dat[,1], dat[,2], xlab="gene 1", ylab="gene 2", main="Original data")


Apply PCA using the the built-in "prcomp" function:

res.pca <- prcomp(dat, center=TRUE, scale.=FALSE)
summary(res.pca)
Importance of components:
PC1    PC2
Standard deviation     1.5848 0.6121
Proportion of Variance 0.8702 0.1298
Cumulative Proportion  0.8702 1.0000


Visualize the results:

plot(res.pca, main="Variances of each PC")
axis(side=1, at=c(0.8,2), labels=c(paste("PC1=", format(100*summary(res.pca)$importance[2,"PC1"], digits=2), "%", sep=""), paste("PC2=", format(100*summary(res.pca)$importance[2,"PC2"], digits=2), "%", sep="")),
tick=FALSE)
plot(res.pca$x[,1], res.pca$x[,2], xlab="PC1", ylab="PC2", main="PC loadings")
mtext("each point is a bivariate observation plotted with the PCs as axes", line=0.4)


to continue ...

• from the article "A Genealogical Interpretation of Principal Components Analysis" by G. McVean

n individuals genotyped at L loci

Zsi: allelic state of individual i at SNP s, $\in {0,1}$

Z: L x n binary matrix

Usually we center the data (Z), here by removing the mean of the allelic states at each SNP (mean of each row)

$X_{si} = Z_{si} - \frac{1}{n} \sum_{j=1}^n Z_{sj}$

It is also common to standardize the rows so that they have unit variance.

We want to approximate the data in X by a matrix Y such that Y = PX.