User:Timothee Flutre/Notebook/Postdoc/2012/01/02


Learn about the multivariate Normal and matrix calculus

(Caution: this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)

  • Motivation: when we measure things, we often have to measure several properties for each item. For instance, we extract a sample of cells from each person, and we measure the expression level of all genes in the sample. We hence have, for each person, a vector of measurements, which leads us to the world of multivariate statistics.
  • Data: we have N observations, denoted [math]\displaystyle{ X = (x_1, x_2, ..., x_N) }[/math], each being of dimension [math]\displaystyle{ P }[/math]. This means that each [math]\displaystyle{ x_i }[/math] is a vector belonging to [math]\displaystyle{ \mathbb{R}^P }[/math].
  • Model: we suppose that the [math]\displaystyle{ x_i }[/math] are independent and identically distributed according to a multivariate Normal distribution [math]\displaystyle{ N_P(\mu, \Sigma) }[/math]. [math]\displaystyle{ \mu }[/math] is the P-dimensional mean vector, and [math]\displaystyle{ \Sigma }[/math] the PxP covariance matrix. If [math]\displaystyle{ \Sigma }[/math] is positive definite (which we will assume), the density function for a given x is: [math]\displaystyle{ f(x/\mu,\Sigma) = (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)) }[/math], with [math]\displaystyle{ |M| }[/math] denoting the determinant of a matrix and [math]\displaystyle{ M^T }[/math] its transpose.
  • Likelihood: as usual, we will start by writing down the likelihood of the data, the parameters being [math]\displaystyle{ \theta=(\mu,\Sigma) }[/math]:

[math]\displaystyle{ L(\theta) = \mathbb{P}(X/\theta) }[/math]

As the observations are independent:

[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i / \theta) }[/math]

It is easier to work with the log-likelihood:

[math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i / \theta) ) }[/math]

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
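As a quick sanity check, here is a minimal Python sketch of this log-likelihood (the function name loglik, the dimensions and the simulated data are my own choices, and NumPy/SciPy are assumed to be available); it should agree with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def loglik(X, mu, Sigma):
    """Log-likelihood of the N rows of X under N_P(mu, Sigma)."""
    N, P = X.shape
    diffs = X - mu                                   # rows are (x_i - mu)^T
    Sigma_inv = np.linalg.inv(Sigma)
    sign, logdet = np.linalg.slogdet(Sigma)          # ln(|Sigma|), numerically stable
    quad = np.einsum('ij,jk,ik->', diffs, Sigma_inv, diffs)  # sum_i (x_i-mu)^T Sigma^{-1} (x_i-mu)
    return -0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad

# hypothetical example: simulate N=100 observations in P=3 dimensions
rng = np.random.default_rng(2012)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=100)

print(loglik(X, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum())  # same value
```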

  • ML estimation: as usual, to find the maximum-likelihood estimates of the parameters, we differentiate the log-likelihood with respect to each parameter and look for the parameter values at which these derivatives are zero (the first-order conditions). However, in the case of multivariate distributions, this requires knowing a bit of matrix calculus, which is not always straightforward...
  • Matrix calculus: a few differential identities we will need (a numerical check is sketched after the formulas below):

[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], e.g. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]

[math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math]

[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]

... <TO DO> ...
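The determinant and inverse differentials above can be checked numerically with finite differences; below is a small sketch (NumPy assumed; the matrix size and the perturbation scale are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 4
A = rng.standard_normal((P, P))
U = A @ A.T + P * np.eye(P)                 # a positive-definite matrix
dU = 1e-6 * rng.standard_normal((P, P))     # a small perturbation

# d(|U|) = |U| tr(U^{-1} dU), up to terms of higher order in dU
print(np.linalg.det(U + dU) - np.linalg.det(U))
print(np.linalg.det(U) * np.trace(np.linalg.inv(U) @ dU))

# d(U^{-1}) = -U^{-1} (dU) U^{-1}, again up to higher-order terms
diff_inv = np.linalg.inv(U + dU) - np.linalg.inv(U)
approx = -np.linalg.inv(U) @ dU @ np.linalg.inv(U)
print(np.max(np.abs(diff_inv - approx)))    # tiny
```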

  • MLE method 1: from Magnus and Neudecker (third edition, Part Six, Chapter 15, Section 3, p.353). First, they rewrite the log-likelihood, noting that [math]\displaystyle{ (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math] is a scalar, i.e. a 1x1 matrix, and is therefore equal to its trace:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]

As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

And finally:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math]

As a result:

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]
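This trace rewrite is easy to verify numerically; here is a minimal sketch (NumPy assumed, with simulated data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 3, 50
mu = rng.standard_normal(P)
A = rng.standard_normal((P, P))
Sigma = A @ A.T + np.eye(P)                 # positive definite
X = rng.multivariate_normal(mu, Sigma, size=N)

Sigma_inv = np.linalg.inv(Sigma)
diffs = X - mu                              # rows are (x_i - mu)^T
Z = diffs.T @ diffs                         # Z = sum_i (x_i - mu)(x_i - mu)^T

print(sum(d @ Sigma_inv @ d for d in diffs))   # sum of quadratic forms
print(np.trace(Sigma_inv @ Z))                 # trace form, same value
```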

We can now write the first differential of the log-likelihood:

[math]\displaystyle{ d_{\theta}l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math]

Using the matrix-calculus formulas above, and noting that [math]\displaystyle{ Z }[/math] also depends on [math]\displaystyle{ \mu }[/math] (so that [math]\displaystyle{ d(tr(\Sigma^{-1} Z)) }[/math] contains both [math]\displaystyle{ d\Sigma }[/math] and [math]\displaystyle{ d\mu }[/math] terms), this differential expands, with [math]\displaystyle{ \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math], into:

[math]\displaystyle{ d_{\theta}l(\theta) = \frac{1}{2} tr( \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} d\Sigma ) + N (\bar{x}-\mu)^T \Sigma^{-1} d\mu }[/math]

The first-order conditions are:

[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]

From which follow:

[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]

and:

[math]\displaystyle{ \hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]
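In other words, the MLEs are simply the sample mean and the sample covariance matrix with the 1/N (rather than 1/(N-1)) normalization, so [math]\displaystyle{ \hat{\Sigma} }[/math] is a biased estimator. Here is a minimal sketch checking this on simulated data, and checking that the log-likelihood at the MLE is no smaller than at the true parameters (NumPy/SciPy assumed; the example values are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
P, N = 3, 200
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((P, P))
Sigma_true = A @ A.T + np.eye(P)
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)

mu_hat = X.mean(axis=0)                         # \hat{\mu} = \bar{x}
diffs = X - mu_hat
Sigma_hat = diffs.T @ diffs / N                 # \hat{\Sigma}, note the 1/N

# same as the biased (1/N) sample covariance
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))       # True

# the log-likelihood at the MLE is at least as high as at the true parameters
ll = lambda m, S: multivariate_normal(mean=m, cov=S).logpdf(X).sum()
print(ll(mu_hat, Sigma_hat) >= ll(mu_true, Sigma_true))                 # True
```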

  • References:
    • Magnus and Neudecker, Matrix differential calculus with applications in statistics and econometrics (2007)
    • Wand, Vector differential calculus in statistics (The American Statistician, 2002)