User:Timothee Flutre/Notebook/Postdoc/2012/01/02

Learn about the multivariate Normal and matrix calculus

(Caution: this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)

  • Motivation: when we measure items, we often have to measure several properties for each item. For instance, we extract a sample of cells from each individual in our study, and we measure the expression level of all genes in the sample. We hence have, for each individual, a vector of measurements (one per gene), which leads us to the world of multivariate statistics.
  • Data: we have N observations, noted [math]\displaystyle{ X = (x_1, x_2, ..., x_N) }[/math], each being of dimension [math]\displaystyle{ P }[/math]. This means that each [math]\displaystyle{ x_i }[/math] is a vector belonging to [math]\displaystyle{ \mathbb{R}^P }[/math].
  • Model: we suppose that the [math]\displaystyle{ x_i }[/math] are independent and identically distributed according to a multivariate Normal distribution [math]\displaystyle{ N_P(\mu, \Sigma) }[/math]. [math]\displaystyle{ \mu }[/math] is the P-dimensional mean vector, and [math]\displaystyle{ \Sigma }[/math] the PxP covariance matrix. If [math]\displaystyle{ \Sigma }[/math] is positive definite (which we will assume), the density function for a given x is: [math]\displaystyle{ f(x/\mu,\Sigma) = (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)) }[/math], with [math]\displaystyle{ |M| }[/math] denoting the determinant of a matrix and [math]\displaystyle{ M^T }[/math] its transpose.
  • Likelihood: as usual, we will start by writing down the likelihood of the data, the parameters being [math]\displaystyle{ \theta=(\mu,\Sigma) }[/math]:

[math]\displaystyle{ L(\theta) = \mathbb{P}(X/\theta) }[/math]

As the observations are independent:

[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i / \theta) }[/math]

It is easier to work with the log-likelihood:

[math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i / \theta) ) }[/math]

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
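
To make this formula concrete, here is a small Python sketch of my own (the toy data, the parameter values and the helper function loglik are made up purely for illustration) that evaluates the log-likelihood as written above and checks it against scipy's implementation:

 import numpy as np
 from scipy.stats import multivariate_normal
 
 # toy data: N observations of dimension P (arbitrary values, for illustration only)
 rng = np.random.default_rng(1)
 N, P = 100, 3
 mu_true = np.array([1.0, -2.0, 0.5])
 Sigma_true = np.array([[2.0, 0.3, 0.0],
                        [0.3, 1.0, 0.2],
                        [0.0, 0.2, 0.5]])
 X = rng.multivariate_normal(mu_true, Sigma_true, size=N)  # N x P matrix
 
 def loglik(X, mu, Sigma):
     """Log-likelihood of i.i.d. N_P(mu, Sigma) observations, as in the formula above."""
     N, P = X.shape
     Sigma_inv = np.linalg.inv(Sigma)
     sign, logdet = np.linalg.slogdet(Sigma)
     quad = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
     return -0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad
 
 # the two values below should match
 print(loglik(X, mu_true, Sigma_true))
 print(multivariate_normal.logpdf(X, mean=mu_true, cov=Sigma_true).sum())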

  • ML estimation: as usual, to find the maximum-likelihood estimates of the parameters, we need to differentiate the log-likelihood with respect to each parameter, and then find the values of the parameters at which these derivatives are zero. However, in the case of multivariate distributions, this requires knowing a bit of matrix calculus, which is not always straightforward...
  • Matrix calculus:

[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], e.g. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]

[math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math]

[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]
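
These identities can be checked numerically with small perturbations; below is a quick sketch of my own (the matrix U and the perturbation dU are arbitrary, chosen only for illustration):

 import numpy as np
 
 rng = np.random.default_rng(0)
 P = 3
 U = rng.normal(size=(P, P)) + P * np.eye(P)  # arbitrary well-conditioned matrix
 dU = 1e-6 * rng.normal(size=(P, P))          # small perturbation
 
 # check d|U| = |U| tr(U^{-1} dU): the two quantities should agree up to second-order terms
 print(np.linalg.det(U + dU) - np.linalg.det(U))
 print(np.linalg.det(U) * np.trace(np.linalg.inv(U) @ dU))
 
 # check d(U^{-1}) = - U^{-1} (dU) U^{-1}: the difference should be tiny (second order in dU)
 lhs = np.linalg.inv(U + dU) - np.linalg.inv(U)
 rhs = - np.linalg.inv(U) @ dU @ np.linalg.inv(U)
 print(np.max(np.abs(lhs - rhs)))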

  • Technical details: from Magnus and Neudecker (third edition, Part Six, Chapter 15, Section 3, p.353). First, they rewrite the log-likelihood, noting that [math]\displaystyle{ (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math] is a scalar, i.e. a 1x1 matrix, and is therefore equal to its trace:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]

As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

And finally, as [math]\displaystyle{ \Sigma^{-1} }[/math] does not depend on [math]\displaystyle{ i }[/math], it can be taken out of the sum:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math]

As a result:

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]
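
As a quick numerical sanity check of this rewriting (again a sketch of my own, with arbitrary values for X, mu and Sigma):

 import numpy as np
 
 rng = np.random.default_rng(2)
 N, P = 50, 3
 X = rng.normal(size=(N, P))
 mu = rng.normal(size=P)
 A = rng.normal(size=(P, P))
 Sigma = A @ A.T + P * np.eye(P)  # arbitrary positive definite matrix
 Sigma_inv = np.linalg.inv(Sigma)
 
 # sum of quadratic forms versus trace form: identical up to floating-point error
 quad = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
 Z = sum(np.outer(x - mu, x - mu) for x in X)
 print(quad, np.trace(Sigma_inv @ Z))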

We can now write the first differential of the log-likelihood:

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T)) }[/math]

At this step in the book, I don't understand how we go from the line above to the line below:

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} ) + (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math]

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} ) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu) }[/math]
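
One way to fill this gap (my own derivation, not taken from the book) is to use the cyclic invariance of the trace and the symmetry of [math]\displaystyle{ \Sigma^{-1} }[/math]. For the terms in [math]\displaystyle{ d\Sigma }[/math]:

[math]\displaystyle{ - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) = - \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (N\Sigma) \Sigma^{-1} ) + \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} Z \Sigma^{-1} ) = \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} ) }[/math]

For the terms in [math]\displaystyle{ d\mu }[/math], each summand is a 1x1 matrix and hence equal to its own transpose, so, [math]\displaystyle{ \Sigma^{-1} }[/math] being symmetric:

[math]\displaystyle{ tr(\Sigma^{-1} (x_i - \mu) (d\mu)^T) = (d\mu)^T \Sigma^{-1} (x_i - \mu) = tr(\Sigma^{-1} (d\mu) (x_i - \mu)^T) }[/math]

so the two sums add up to [math]\displaystyle{ (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math], and the last line follows from [math]\displaystyle{ \sum_{i=1}^N (x_i - \mu) = N (\bar{x} - \mu) }[/math].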

The first-order conditions are:

[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]

From which follow:

[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]

and:

[math]\displaystyle{ \hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]
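
To close the loop, here is a small numerical illustration of my own (simulated data with arbitrary true parameters) showing that these estimators are just the sample mean and the biased (1/N) sample covariance:

 import numpy as np
 
 rng = np.random.default_rng(3)
 N, P = 1000, 3
 mu_true = np.array([1.0, -2.0, 0.5])
 A = rng.normal(size=(P, P))
 Sigma_true = A @ A.T + np.eye(P)  # arbitrary positive definite matrix
 X = rng.multivariate_normal(mu_true, Sigma_true, size=N)
 
 mu_hat = X.mean(axis=0)                  # MLE of mu: the sample mean
 centered = X - mu_hat
 Sigma_hat = (centered.T @ centered) / N  # MLE of Sigma: note the 1/N, not 1/(N-1)
 
 print(mu_hat)
 print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # np.cov with bias=True also divides by N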

  • References:
    • Magnus and Neudecker, Matrix differential calculus with applications in statistics and econometrics (2007)
    • Wand, Vector differential calculus in statistics (The American Statistician, 2002)