User:Timothee Flutre/Notebook/Postdoc/2012/01/02

Learn about the multivariate Normal and matrix calculus

(Caution: this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)

  • Motivation: when we measure items, we often have to measure several properties for each item. For instance, we extract a sample of cells from each individual in our study, and we measure the expression level of all genes in the sample. We hence have, for each individual, a vector of measurements (one per gene), which leads us to the world of multivariate statistics.
  • Data: we have N observations, noted [math]\displaystyle{ X = (x_1, x_2, ..., x_N) }[/math], each being of dimension [math]\displaystyle{ P }[/math]. This means that each [math]\displaystyle{ x_i }[/math] is a vector belonging to [math]\displaystyle{ \mathbb{R}^P }[/math].
  • Model: we suppose that the [math]\displaystyle{ x_i }[/math] are independent and identically distributed according to a multivariate Normal distribution [math]\displaystyle{ N_P(\mu, \Sigma) }[/math]. [math]\displaystyle{ \mu }[/math] is the P-dimensional mean vector, and [math]\displaystyle{ \Sigma }[/math] the PxP covariance matrix. If [math]\displaystyle{ \Sigma }[/math] is positive definite (which we will assume), the density function for a given x is: [math]\displaystyle{ f(x/\mu,\Sigma) = (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)) }[/math], with [math]\displaystyle{ |M| }[/math] denoting the determinant of a matrix and [math]\displaystyle{ M^T }[/math] its transpose.
  • Likelihood: as usual, we will start by writing down the likelihood of the data, the parameters being [math]\displaystyle{ \theta=(\mu,\Sigma) }[/math]:

[math]\displaystyle{ L(\theta) = \mathbb{P}(X/\theta) }[/math]

As the observations are independent:

[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i / \theta) }[/math]

It is easier to work with the log-likelihood:

[math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i / \theta) ) }[/math]

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
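
To make this formula concrete, here is a small Python sketch of my own (the toy data, the parameter values and the helper function loglik are made up purely for illustration) that evaluates the log-likelihood as written above and checks it against scipy's implementation:

 import numpy as np
 from scipy.stats import multivariate_normal
 
 # toy data: N observations of dimension P (arbitrary values, for illustration only)
 rng = np.random.default_rng(1)
 N, P = 100, 3
 mu_true = np.array([1.0, -2.0, 0.5])
 Sigma_true = np.array([[2.0, 0.3, 0.0],
                        [0.3, 1.0, 0.2],
                        [0.0, 0.2, 0.5]])
 X = rng.multivariate_normal(mu_true, Sigma_true, size=N)  # N x P matrix
 
 def loglik(X, mu, Sigma):
     """Log-likelihood of i.i.d. N_P(mu, Sigma) observations, as in the formula above."""
     N, P = X.shape
     Sigma_inv = np.linalg.inv(Sigma)
     sign, logdet = np.linalg.slogdet(Sigma)
     quad = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
     return -0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad
 
 # the two values below should match
 print(loglik(X, mu_true, Sigma_true))
 print(multivariate_normal.logpdf(X, mean=mu_true, cov=Sigma_true).sum())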

  • ML estimation: as usual, to find the maximum-likelihood estimates of the parameters, we need to differentiate the log-likelihood with respect to each parameter, and then find the values of the parameters at which these derivatives are zero. However, in the case of multivariate distributions, this requires knowing a bit of matrix calculus, which is not always straightforward...
  • Matrix calculus:

[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], e.g. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]

[math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math]

[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]
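
These identities can be checked numerically with small perturbations; below is a quick sketch of my own (the matrix U and the perturbation dU are arbitrary, chosen only for illustration):

 import numpy as np
 
 rng = np.random.default_rng(0)
 P = 3
 U = rng.normal(size=(P, P)) + P * np.eye(P)  # arbitrary well-conditioned matrix
 dU = 1e-6 * rng.normal(size=(P, P))          # small perturbation
 
 # check d|U| = |U| tr(U^{-1} dU): the two quantities should agree up to second-order terms
 print(np.linalg.det(U + dU) - np.linalg.det(U))
 print(np.linalg.det(U) * np.trace(np.linalg.inv(U) @ dU))
 
 # check d(U^{-1}) = - U^{-1} (dU) U^{-1}: the difference should be tiny (second order in dU)
 lhs = np.linalg.inv(U + dU) - np.linalg.inv(U)
 rhs = - np.linalg.inv(U) @ dU @ np.linalg.inv(U)
 print(np.max(np.abs(lhs - rhs)))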

  • Technical details: from Magnus and Neudecker (third edition, Part Six, Chapter 15, Section 3, p.353). First, they rewrite the log-likelihood, noting that [math]\displaystyle{ (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math] is a scalar, i.e. a 1x1 matrix, and is therefore equal to its trace:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]

As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

And finally, as [math]\displaystyle{ \Sigma^{-1} }[/math] does not depend on [math]\displaystyle{ i }[/math], it can be taken out of the sum:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math]

As a result:

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]
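
As a quick numerical sanity check of this rewriting (again a sketch of my own, with arbitrary values for X, mu and Sigma):

 import numpy as np
 
 rng = np.random.default_rng(2)
 N, P = 50, 3
 X = rng.normal(size=(N, P))
 mu = rng.normal(size=P)
 A = rng.normal(size=(P, P))
 Sigma = A @ A.T + P * np.eye(P)  # arbitrary positive definite matrix
 Sigma_inv = np.linalg.inv(Sigma)
 
 # sum of quadratic forms versus trace form: identical up to floating-point error
 quad = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
 Z = sum(np.outer(x - mu, x - mu) for x in X)
 print(quad, np.trace(Sigma_inv @ Z))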

We can now write the first differential of the log-likelihood:

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T)) }[/math]

At this step in the book, I don't understand how we go from the line above to the line below:

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} ) + (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math]

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} ) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu) }[/math]
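
One way to fill this gap (my own derivation, not taken from the book) is to use the cyclic invariance of the trace and the symmetry of [math]\displaystyle{ \Sigma^{-1} }[/math]. For the terms in [math]\displaystyle{ d\Sigma }[/math]:

[math]\displaystyle{ - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) = - \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (N\Sigma) \Sigma^{-1} ) + \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} Z \Sigma^{-1} ) = \frac{1}{2} tr( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} ) }[/math]

For the terms in [math]\displaystyle{ d\mu }[/math], each summand is a 1x1 matrix and hence equal to its own transpose, so, [math]\displaystyle{ \Sigma^{-1} }[/math] being symmetric:

[math]\displaystyle{ tr(\Sigma^{-1} (x_i - \mu) (d\mu)^T) = (d\mu)^T \Sigma^{-1} (x_i - \mu) = tr(\Sigma^{-1} (d\mu) (x_i - \mu)^T) }[/math]

so the two sums add up to [math]\displaystyle{ (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math], and the last line follows from [math]\displaystyle{ \sum_{i=1}^N (x_i - \mu) = N (\bar{x} - \mu) }[/math].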

The first-order conditions are:

[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]

From which follow:

[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]

and:

[math]\displaystyle{ \hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]
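
To close the loop, here is a small numerical illustration of my own (simulated data with arbitrary true parameters) showing that these estimators are just the sample mean and the biased (1/N) sample covariance:

 import numpy as np
 
 rng = np.random.default_rng(3)
 N, P = 1000, 3
 mu_true = np.array([1.0, -2.0, 0.5])
 A = rng.normal(size=(P, P))
 Sigma_true = A @ A.T + np.eye(P)  # arbitrary positive definite matrix
 X = rng.multivariate_normal(mu_true, Sigma_true, size=N)
 
 mu_hat = X.mean(axis=0)                  # MLE of mu: the sample mean
 centered = X - mu_hat
 Sigma_hat = (centered.T @ centered) / N  # MLE of Sigma: note the 1/N, not 1/(N-1)
 
 print(mu_hat)
 print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # np.cov with bias=True also divides by N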

  • References:
    • Magnus and Neudecker, Matrix differential calculus with applications in statistics and econometrics (2007)
    • Wand, Vector differential calculus in statistics (The American Statistician, 2002)