User:Timothee Flutre/Notebook/Postdoc/2012/01/02
==Learn about the multivariate Normal and matrix calculus==
''(Caution, this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)''
* '''Motivation''': when we measure items, we often have to measure several properties for each item. For instance, we extract cells from each individual in our study, and we measure the expression level of all genes. We hence have, for each individual, a vector of measurements (one per gene), which leads us to the world of multivariate statistics.
* '''Data''': we have N observations, noted <math>X_1, X_2, ..., X_N</math>, each being of dimension <math>P</math>. This means that each <math>X_n</math> is a vector belonging to <math>\mathbb{R}^P</math>. Traditionally, we record all the data into an <math>N \times P</math> matrix named <math>X</math>, with samples (observations) in rows and variables (dimensions) in columns. Also, it is usual to assume that vectors are column vectors by default. Therefore, stacking the transposed observations as rows, we can write <math>X = (X_1, ..., X_N)^T</math>.
* '''Descriptive statistics''': the sample mean is the P-dimensional vector <math>\bar{X} = (\bar{X}_1, ..., \bar{X}_P)</math> with <math>\bar{X}_p = X_{\bullet p} = \frac{1}{N} \sum_{n=1}^N X_{np}</math>. The sample covariance is the <math>P \times P</math> matrix noted <math>S^2</math> with <math>S^2_{pq} = \frac{1}{N} \sum_{n=1}^N (X_{np} - \bar{X}_p)(X_{nq} - \bar{X}_q)</math>.
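These two descriptive statistics are easy to check numerically; here is a minimal sketch in Python with NumPy, on hypothetical simulated data (all variable names are illustrative, not from the original notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 3  # hypothetical sample size and dimension
X = rng.normal(size=(N, P))  # N x P data matrix, samples in rows

# Sample mean: P-dimensional vector of column averages.
x_bar = X.mean(axis=0)

# Sample covariance with the 1/N convention, matching the formula above.
centered = X - x_bar
S2 = centered.T @ centered / N

# np.cov expects rowvar=False for variables in columns; bias=True gives 1/N.
assert np.allclose(S2, np.cov(X, rowvar=False, bias=True))
```

Note that `bias=True` selects the <math>1/N</math> normalization; NumPy's default is the unbiased <math>1/(N-1)</math> version discussed further below.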
* '''Model''': we suppose that the <math>X_n</math> are independent and identically distributed according to a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate Normal distribution] <math>N_P(\mu, \Sigma)</math>. <math>\mu</math> is the P-dimensional mean vector, and <math>\Sigma</math> the <math>P \times P</math> covariance matrix. If <math>\Sigma</math> is [http://en.wikipedia.org/wiki/Positive-definite_matrix positive definite] (which we will assume), the density function for a given observation is: <math>f(X_n|\mu,\Sigma) = (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp(-\frac{1}{2} (X_n-\mu)^T \Sigma^{-1} (X_n-\mu))</math>, with <math>|M|</math> denoting the determinant of a matrix M and <math>M^T</math> its transpose.
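The density formula can be coded exactly as written; a small sketch (the function name is mine, and the sanity check uses the known value of the standard bivariate Normal density at the origin):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of N_P(mu, Sigma) at x, transcribing the formula above."""
    P = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return (2 * np.pi) ** (-P / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

# Sanity check: standard bivariate Normal at the origin has density 1/(2*pi).
mu = np.zeros(2)
Sigma = np.eye(2)
val = mvn_density(np.zeros(2), mu, Sigma)
assert np.isclose(val, 1.0 / (2 * np.pi))
```

For large <math>P</math> one would rather work with a Cholesky factor of <math>\Sigma</math> than with `inv` and `det`, but the direct transcription is clearer here.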
* '''Likelihood''': as usual, we will start by writing down the likelihood of the data, the parameters being <math>\theta=(\mu,\Sigma)</math>:
<math>L(\theta) = f(X|\theta)</math>

As the observations are independent:

<math>L(\theta) = \prod_{i=1}^N f(x_i | \theta)</math>

It is easier to work with the log-likelihood:

<math>l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | \theta) )</math>

<math>l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu)</math>

* '''Matrix calculus''': a few results on matrix differentials will be useful below (<math>u</math> denotes a scalar and <math>U</math> a matrix):

<math>d(f(u)) = f'(u) du</math>, eg. useful here: <math>d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|)</math>

<math>d(|U|) = |U| tr(U^{-1} dU)</math>

<math>d(U^{-1}) = - U^{-1} (dU) U^{-1}</math>

As each quadratic form <math>(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)</math> is a scalar, it is equal to its own trace:

<math>\sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) )</math>
As the trace is invariant under cyclic permutations (<math>tr(ABC) = tr(BCA) = tr(CAB)</math>):

<math>\sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T )</math>

The trace is also a linear map (<math>tr(A+B) = tr(A) + tr(B)</math>):

<math>\sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T )</math>

And finally:

<math>tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T )</math>
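This chain of trace identities can be verified numerically; a short sketch with hypothetical dimensions and random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 4
X = rng.normal(size=(N, P))
mu = rng.normal(size=P)
A = rng.normal(size=(P, P))
Sigma = A @ A.T + P * np.eye(P)  # positive definite by construction
Sigma_inv = np.linalg.inv(Sigma)

# Left-hand side: sum of the scalar quadratic forms.
lhs = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)

# Right-hand side: trace of Sigma^{-1} Z, with Z the sum of outer products.
Z = sum(np.outer(x - mu, x - mu) for x in X)
rhs = np.trace(Sigma_inv @ Z)

assert np.isclose(lhs, rhs)
```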
As a result:
<math>l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z)</math> with <math>Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T</math>

We can now write the first differential of the log-likelihood:

<math>d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z))</math>

<math>d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z))</math>

<math>d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ)</math>
<math>d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T))</math>

<math>d l(\theta) = - \frac{N}{2} tr((d\Sigma) \Sigma^{-1}) \; + \; \frac{1}{2} tr((d\Sigma) \Sigma^{-1} Z \Sigma^{-1}) \; + \; \frac{1}{2} tr(\Sigma^{-1} \sum_{i=1}^N (x_i - \mu) (d\mu)^T) \; + \; \frac{1}{2} tr(\Sigma^{-1} \sum_{i=1}^N (d\mu) (x_i - \mu)^T)</math>

<math>d l(\theta) = \frac{1}{2} tr((d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + tr(\Sigma^{-1} \sum_{i=1}^N (x_i - \mu) (d\mu)^T)</math>

<math>d l(\theta) = \frac{1}{2} tr((d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + N tr(\Sigma^{-1} (\bar{x} - \mu) (d\mu)^T)</math>
The first-order conditions (ie. when <math>d l(\theta) = 0</math> for all <math>d\mu</math> and <math>d\Sigma</math>) are:

<math>\hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0</math> and <math>\hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0</math>

From which follow (evaluating <math>Z</math> at <math>\mu = \hat{\mu}</math>):

<math>\hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i</math>

and:

<math>\hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T</math>
Note that <math>\bar{S}_N</math> is a biased estimate of <math>\Sigma</math>. It is usually better to use the unbiased estimate <math>\bar{S}_{N-1}</math>.
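The two estimates differ only by the rescaling factor <math>N/(N-1)</math>, which a quick numerical sketch makes concrete (the "true" parameter values below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 500, 3
mu_true = np.array([1.0, -2.0, 0.5])       # arbitrary illustrative values
Sigma_true = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)

mu_hat = X.mean(axis=0)                       # MLE of mu: the sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N         # biased MLE (1/N)
S_unbiased = centered.T @ centered / (N - 1)  # unbiased estimate (1/(N-1))

# The unbiased estimate is the biased one rescaled by N/(N-1).
assert np.allclose(S_unbiased, Sigma_hat * N / (N - 1))
```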
* '''Sufficient statistics''':
From <math>Z</math> defined above and <math>\bar{S}_{N-1}</math> defined like <math>\bar{S}_N</math> but with <math>N-1</math> in the denominator, we can write the following:

<math>Z = \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T</math>

<math>Z = \sum_{i=1}^N (x_i - \bar{x} + \bar{x} - \mu)(x_i - \bar{x} + \bar{x} - \mu)^T</math>

<math>Z = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T + \sum_{i=1}^N (\bar{x} - \mu)(\bar{x} - \mu)^T + \sum_{i=1}^N (x_i - \bar{x})(\bar{x} - \mu)^T + \sum_{i=1}^N (\bar{x} - \mu)(x_i - \bar{x})^T</math>

The two cross-product sums vanish because <math>\sum_{i=1}^N (x_i - \bar{x}) = 0</math>, so:

<math>Z = (N-1) \bar{S}_{N-1} + N (\bar{x} - \mu)(\bar{x} - \mu)^T</math>
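This decomposition of <math>Z</math> is an exact algebraic identity for any reference point <math>\mu</math>, which can be confirmed numerically (data and the reference point below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 30, 2
X = rng.normal(size=(N, P))
mu = np.array([0.5, -0.5])  # arbitrary reference point, not the sample mean

x_bar = X.mean(axis=0)
centered = X - x_bar
S_unbiased = centered.T @ centered / (N - 1)  # \bar{S}_{N-1}

# Z and its decomposition (N-1)*S_{N-1} + N*(x_bar - mu)(x_bar - mu)^T.
Z = sum(np.outer(x - mu, x - mu) for x in X)
decomposition = (N - 1) * S_unbiased + N * np.outer(x_bar - mu, x_bar - mu)

assert np.allclose(Z, decomposition)
```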
Thus, by employing the same trick with the trace as above, we now have:

<math>L(\mu, \Sigma) = (2 \pi)^{-NP/2} |\Sigma|^{-N/2} exp \left( -\frac{N}{2}(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) -\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right)</math>

The likelihood depends on the samples only through the pair <math>(\bar{x}, \bar{S}_{N-1})</math>. Thanks to the [http://en.wikipedia.org/wiki/Sufficient_statistic#Fisher.E2.80.93Neyman_factorization_theorem Factorization theorem], we can say that this pair is a [http://en.wikipedia.org/wiki/Sufficient_statistic sufficient statistic] for <math>(\mu, \Sigma)</math>.

We can also rewrite the likelihood a bit more in order to recognize the distribution of this sufficient statistic:

<math>L(\mu, \Sigma) = (2 \pi)^{-(N-1)P/2} (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp \left( -\frac{1}{2}(\bar{x} - \mu)^T(\frac{1}{N}\Sigma)^{-1}(\bar{x} - \mu) \right) \times |\Sigma|^{-(N-1)/2} exp \left( -\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right)</math>

<math>L(\mu, \Sigma) \propto N_P(\bar{x}; \mu, \Sigma/N) \times W_P(\bar{S}_{N-1}; \Sigma, N-1)</math>

There is only proportionality because the first constant appears in neither of the two densities, and a few constants are missing (eg. the Gamma function appearing in the density of the [http://en.wikipedia.org/wiki/Wishart_distribution Wishart] distribution). This doesn't matter, as we usually want to maximize the likelihood or compute a likelihood ratio.
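On the log scale, the original log-likelihood and its compact form through the sufficient statistic agree exactly, which is easy to confirm numerically (dimensions and parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 40, 3
X = rng.normal(size=(N, P))
mu = rng.normal(size=P)
A = rng.normal(size=(P, P))
Sigma = A @ A.T + P * np.eye(P)  # positive definite by construction
Sigma_inv = np.linalg.inv(Sigma)
sign, logdet = np.linalg.slogdet(Sigma)  # stable log-determinant

# Full log-likelihood: a sum over the N observations.
quad = sum((x - mu) @ Sigma_inv @ (x - mu) for x in X)
ll_full = -N * P / 2 * np.log(2 * np.pi) - N / 2 * logdet - quad / 2

# Compact form through the sufficient statistic (x_bar, S_{N-1}).
x_bar = X.mean(axis=0)
centered = X - x_bar
S_unb = centered.T @ centered / (N - 1)
d = x_bar - mu
ll_suff = (-N * P / 2 * np.log(2 * np.pi) - N / 2 * logdet
           - N / 2 * d @ Sigma_inv @ d
           - (N - 1) / 2 * np.trace(Sigma_inv @ S_unb))

assert np.isclose(ll_full, ll_suff)
```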
* '''References''':
Revision as of 11:14, 25 November 2014