User:Timothee Flutre/Notebook/Postdoc/2012/01/02: Difference between revisions
m (→Learn about the multivariate Normal and matrix calculus: improve explanation with trace) |
(→Learn about the multivariate Normal and matrix calculus: finish explanations) |
||
(2 intermediate revisions by the same user not shown) | |||
Line 72: | Line 72: | ||
<math>d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T))</math> | <math>d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T))</math> | ||
< | <math>d l(\theta) = - \frac{N}{2} tr((d\Sigma) \Sigma^{-1}) \; + \; \frac{1}{2} tr((d\Sigma) \Sigma^{-1} Z \Sigma^{-1}) \; + \; \frac{1}{2} tr(\Sigma^{-1} \sum_{i=1}^N (x_i - \mu) (d\mu)^T) \; + \;\frac{1}{2} tr(\Sigma^{-1} \sum_{i=1}^N (d\mu) (x_i - \mu)^T)</math> | ||
<math>d l(\theta) = \frac{1}{2} tr(d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} + ( | <math>d l(\theta) = \frac{1}{2} tr((d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + tr(\Sigma^{-1} \sum_{i=1}^N (x_i - \mu) (d\mu)^T)</math> | ||
<math>d l(\theta) = \frac{1}{2} tr(d\Sigma)\Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} + N ( | <math>d l(\theta) = \frac{1}{2} tr((d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + N tr(\Sigma^{-1} (\bar{x} - \mu) (d\mu)^T )</math> | ||
The first-order conditions (ie. when <math>d l(\theta) = 0</math>) are: | The first-order conditions (ie. when <math>d l(\theta) = 0</math>) are: |
Revision as of 11:14, 25 November 2014
Project name | <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page <html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html> </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> |
Learn about the multivariate Normal and matrix calculus(Caution, this is my own quick-and-dirty tutorial, see the references at the end for presentations by professional statisticians.)
[math]\displaystyle{ L(\theta) = f(X|\theta) }[/math] As the observations are independent: [math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i | \theta) }[/math] It is easier to work with the log-likelihood: [math]\displaystyle{ l(\theta) = ln(L(\theta)) = \sum_{i=1}^N ln( f(x_i | \theta) ) }[/math] [math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]
[math]\displaystyle{ d(f(u)) = f'(u) du }[/math], eg. useful here: [math]\displaystyle{ d(ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math] [math]\displaystyle{ d(|U|) = |U| tr(U^{-1} dU) }[/math] [math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]
[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math] As the trace is invariant under cyclic permutations ([math]\displaystyle{ tr(ABC) = tr(BCA) = tr(CAB) }[/math]): [math]\displaystyle{ \sum_{i=1}^N tr( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) = \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math] The trace is also a linear map ([math]\displaystyle{ tr(A+B) = tr(A) + tr(B) }[/math]): [math]\displaystyle{ \sum_{i=1}^N tr( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) = tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math] And finally: [math]\displaystyle{ tr( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) = tr( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T ) }[/math] As a result: [math]\displaystyle{ l(\theta) = -\frac{NP}{2} ln(2\pi) - \frac{N}{2}ln(|\Sigma|) - \frac{1}{2} tr(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math] We can now write the first differential of the log-likelihood: [math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(ln(|\Sigma|)) - \frac{1}{2} d(tr(\Sigma^{-1} Z)) }[/math] [math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} tr(d(\Sigma^{-1}Z)) }[/math] [math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) - \frac{1}{2} tr(d(\Sigma^{-1})Z) - \frac{1}{2} tr(\Sigma^{-1} dZ) }[/math] [math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr(\Sigma^{-1} d\Sigma) + \frac{1}{2} tr(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} tr(\Sigma^{-1} (\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T)) }[/math] [math]\displaystyle{ d l(\theta) = - \frac{N}{2} tr((d\Sigma) \Sigma^{-1}) \; + \; \frac{1}{2} tr((d\Sigma) \Sigma^{-1} Z \Sigma^{-1}) \; + \; \frac{1}{2} tr(\Sigma^{-1} \sum_{i=1}^N (x_i - \mu) (d\mu)^T) \; + \;\frac{1}{2} tr(\Sigma^{-1} \sum_{i=1}^N (d\mu) (x_i - \mu)^T) }[/math] [math]\displaystyle{ d l(\theta) = \frac{1}{2} tr((d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + tr(\Sigma^{-1} \sum_{i=1}^N (x_i - \mu) (d\mu)^T) }[/math] [math]\displaystyle{ d l(\theta) = \frac{1}{2} tr((d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1}) + N tr(\Sigma^{-1} (\bar{x} - \mu) (d\mu)^T ) }[/math] The first-order conditions (ie. when [math]\displaystyle{ d l(\theta) = 0 }[/math]) are: [math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math] From which follow: [math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math] and: [math]\displaystyle{ \hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math] Note that [math]\displaystyle{ \bar{S}_N }[/math] is a biased estimate of [math]\displaystyle{ \Sigma }[/math]. It is usually better to use the unbiased estimate [math]\displaystyle{ \bar{S}_{N-1} }[/math].
From [math]\displaystyle{ Z }[/math] defined above and [math]\displaystyle{ \bar{S}_{N-1} }[/math] defined similarly as [math]\displaystyle{ \bar{S}_N }[/math] but with [math]\displaystyle{ N-1 }[/math] in the denominator, we can write the following: [math]\displaystyle{ Z = \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T }[/math] [math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x} + \bar{x} - \mu)(x_i - \bar{x} + \bar{x} - \mu)^T }[/math] [math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T + \sum_{i=1}^N (\bar{x} - \mu)(\bar{x} - \mu)^T + \sum_{i=1}^N (x_i - \bar{x})(\bar{x} - \mu)^T + \sum_{i=1}^N (\bar{x} - \mu)(x_i - \bar{x})^T }[/math] [math]\displaystyle{ Z = (N-1) \bar{S}_{N-1} + N (\bar{x} - \mu)(\bar{x} - \mu)^T }[/math] Thus, by employing the same trick with the trace as above, we now have: [math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-NP/2} |\Sigma|^{-N/2} exp \left( -\frac{N}{2}(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) -\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math] The likelihood depends on the samples only through the pair [math]\displaystyle{ (\bar{x}, \bar{S}_{N-1}) }[/math]. Thanks to the Factorization theorem, we can say that this pair of values is a sufficient statistic for [math]\displaystyle{ (\mu, \Sigma) }[/math]. We can also transform a bit more the formula of the likelihood in order to find the distribution of this sufficient statistic: [math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-(N-1)P/2} (2 \pi)^{-P/2} |\Sigma|^{-1/2} exp \left( -\frac{1}{2}(\bar{x} - \mu)^T(\frac{1}{N}\Sigma)^{-1}(\bar{x} - \mu) \right) \times |\Sigma|^{-(N-1)/2} exp \left(-\frac{N-1}{2}tr(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math] [math]\displaystyle{ L(\mu, \Sigma) \propto N_P(\bar{x}; \mu, \Sigma/N) \times W_P(\bar{S}_{N-1}; \Sigma, N-1) }[/math] The likelihood is only proportional because the first constant is not used in any of the two distributions and a few constants are missing (eg. the Gamma function appearing in the density of the Wishart distribution). This doesn't matter as we usually want to maximize the likelihood or compute a likelihood ratio.
|