User:Timothee Flutre/Notebook/Postdoc/2012/01/02
Learn about the multivariate Normal and matrix calculus

(Caution: this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)
Let [math]\displaystyle{ x_1, \ldots, x_N }[/math] be [math]\displaystyle{ N }[/math] independent observations from a [math]\displaystyle{ P }[/math]-dimensional Normal distribution, [math]\displaystyle{ x_i \sim N_P(\mu, \Sigma) }[/math], and let [math]\displaystyle{ \theta = (\mu, \Sigma) }[/math] denote the parameters. The likelihood is:

[math]\displaystyle{ L(\theta) = f(X|\theta) }[/math]

As the observations are independent:

[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i | \theta) }[/math]

It is easier to work with the log-likelihood:

[math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | \theta) ) }[/math]

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
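As a sanity check, this formula can be evaluated numerically. Below is a minimal sketch (not part of the original derivation; it assumes numpy and scipy are available, and all variable names are mine) comparing a direct implementation with scipy.stats.multivariate_normal:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """Log-likelihood of N i.i.d. P-dimensional Normal observations (rows of X)."""
    N, P = X.shape
    diffs = X - mu  # (N, P)
    Sigma_inv = np.linalg.inv(Sigma)
    # sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu)
    quad = np.einsum('ij,jk,ik->', diffs, Sigma_inv, diffs)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=100)

print(log_likelihood(X, mu, Sigma))
print(multivariate_normal(mu, Sigma).logpdf(X).sum())  # should match
</syntaxhighlight>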
We will need a few identities from matrix calculus (see the references for details):

[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], eg. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]

[math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math]

[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]
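These identities are easy to convince oneself of by finite differences. A small sketch, again assuming numpy (the perturbation size is arbitrary):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
U = A @ A.T + 3 * np.eye(3)               # a well-conditioned positive definite matrix
dU = 1e-6 * rng.standard_normal((3, 3))   # a small perturbation

# d(|U|) = |U| tr(U^{-1} dU): finite difference vs the differential
print(np.linalg.det(U + dU) - np.linalg.det(U))
print(np.linalg.det(U) * np.trace(np.linalg.inv(U) @ dU))  # agree to first order

# d(U^{-1}) = -U^{-1} (dU) U^{-1}: the discrepancy is O(||dU||^2)
lhs = np.linalg.inv(U + dU) - np.linalg.inv(U)
rhs = -np.linalg.inv(U) @ dU @ np.linalg.inv(U)
print(np.max(np.abs(lhs - rhs)))
</syntaxhighlight>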
To find the maximum-likelihood estimates, we first rewrite the quadratic form using the trace (a scalar equals its own trace):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]

As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

And finally:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math]

As a result:

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]

We can now write the first differential of the log-likelihood:

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr \left( \Sigma^{-1} \left( \sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T \right) \right) }[/math]

The book does not detail the next step, but it follows from the same tricks applied once more: by cyclic permutation, [math]\displaystyle{ tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) = tr((d\Sigma) \Sigma^{-1} Z \Sigma^{-1}) }[/math] and [math]\displaystyle{ tr(\Sigma^{-1} d\Sigma) = tr((d\Sigma) \Sigma^{-1} \Sigma \Sigma^{-1}) }[/math], so the two [math]\displaystyle{ d\Sigma }[/math] terms combine; and since [math]\displaystyle{ tr(\Sigma^{-1} (x_i - \mu) (d\mu)^T) = (d\mu)^T \Sigma^{-1} (x_i - \mu) }[/math], with the second sum giving the same scalar because [math]\displaystyle{ \Sigma^{-1} }[/math] is symmetric, the two [math]\displaystyle{ d\mu }[/math] terms combine as well:

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr \left( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) + (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math]

[math]\displaystyle{ d l(\theta) = \frac{1}{2} tr \left( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu) }[/math]

The first-order conditions (ie. when [math]\displaystyle{ d l(\theta) = 0 }[/math] for all [math]\displaystyle{ d\mu }[/math] and [math]\displaystyle{ d\Sigma }[/math]) are:

[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]

From which follow:

[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]

and:

[math]\displaystyle{ \hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]
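These closed forms can also be checked numerically. A minimal sketch, assuming numpy and scipy (np.cov with bias=True computes exactly [math]\displaystyle{ \bar{S}_N }[/math]); perturbing the estimates should only decrease the log-likelihood:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
mu_true = np.array([0.5, -1.0, 2.0])
A = rng.standard_normal((3, 3))
Sigma_true = A @ A.T + np.eye(3)
X = rng.multivariate_normal(mu_true, Sigma_true, size=500)
N = len(X)

# MLE: sample mean and the biased (1/N) sample covariance
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))  # True

# the MLE should achieve a higher log-likelihood than nearby parameters
ll = lambda m, S: multivariate_normal(m, S).logpdf(X).sum()
print(ll(mu_hat, Sigma_hat) > ll(mu_hat + 0.05, Sigma_hat))  # True
print(ll(mu_hat, Sigma_hat) > ll(mu_hat, 1.1 * Sigma_hat))   # True
</syntaxhighlight>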
Sufficient statistics:

From [math]\displaystyle{ Z }[/math] defined above and [math]\displaystyle{ \bar{S}_{N-1} }[/math] defined in the same way as [math]\displaystyle{ \bar{S}_N }[/math] but with [math]\displaystyle{ N-1 }[/math] in the denominator, we can write the following:

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T }[/math]

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x} + \bar{x} - \mu)(x_i - \bar{x} + \bar{x} - \mu)^T }[/math]

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T + \sum_{i=1}^N (\bar{x} - \mu)(\bar{x} - \mu)^T + \sum_{i=1}^N (x_i - \bar{x})(\bar{x} - \mu)^T + \sum_{i=1}^N (\bar{x} - \mu)(x_i - \bar{x})^T }[/math]

The two cross-product sums vanish because [math]\displaystyle{ \sum_{i=1}^N (x_i - \bar{x}) = 0 }[/math], so:

[math]\displaystyle{ Z = (N-1) \bar{S}_{N-1} + N (\bar{x} - \mu)(\bar{x} - \mu)^T }[/math]

Thus, by employing the same trick with the trace as above, we now have:

[math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-NP/2} |\Sigma|^{-N/2} exp \left( -\frac{N}{2}(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) -\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math]

The likelihood depends on the samples only through the pair [math]\displaystyle{ (\bar{x}, \bar{S}_{N-1}) }[/math]. Thanks to the Factorization theorem, this pair of values is a sufficient statistic for [math]\displaystyle{ (\mu, \Sigma) }[/math].

We can also rearrange the likelihood a little more in order to exhibit the distribution of the sufficient statistic:

[math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-(N-1)P/2} (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp \left( -\frac{1}{2}(\bar{x} - \mu)^T \left( \frac{1}{N}\Sigma \right)^{-1}(\bar{x} - \mu) \right) \times |\Sigma|^{-(N-1)/2} exp \left( -\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math]

[math]\displaystyle{ L(\mu, \Sigma) \propto N_P(\bar{x}; \mu, \Sigma/N) \times W_P(\bar{S}_{N-1}; \Sigma, N-1) }[/math]

The relation is only a proportionality because the first constant belongs to neither of the two densities and a few other constants are missing (eg. the multivariate Gamma function appearing in the density of the Wishart distribution).
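The decomposition of [math]\displaystyle{ Z }[/math] above, on which the factorization rests, can be verified numerically as well; a short sketch, assuming numpy:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -0.5])
X = rng.standard_normal((50, 2)) + mu  # 50 observations, P = 2
N = len(X)

xbar = X.mean(axis=0)
S_N1 = (X - xbar).T @ (X - xbar) / (N - 1)  # \bar{S}_{N-1}

# Z computed directly from its definition...
Z = (X - mu).T @ (X - mu)
# ...and via Z = (N-1) \bar{S}_{N-1} + N (xbar - mu)(xbar - mu)^T
Z_decomp = (N - 1) * S_N1 + N * np.outer(xbar - mu, xbar - mu)
print(np.allclose(Z, Z_decomp))  # True
</syntaxhighlight>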