User:Timothee Flutre/Notebook/Postdoc/2012/01/02
Learn about the multivariate Normal and matrix calculus

(Caution: this is my own quick-and-dirty tutorial; see the references at the end for presentations by professional statisticians.)
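Let [math]\displaystyle{ X = (x_1, \ldots, x_N) }[/math] be [math]\displaystyle{ N }[/math] independent observations, each a vector of dimension [math]\displaystyle{ P }[/math] drawn from a multivariate Normal with mean [math]\displaystyle{ \mu }[/math] and covariance matrix [math]\displaystyle{ \Sigma }[/math], so that [math]\displaystyle{ \theta = (\mu, \Sigma) }[/math]. The likelihood is the density of the data seen as a function of the parameters:

[math]\displaystyle{ L(\theta) = f(X|\theta) }[/math]

As the observations are independent:

[math]\displaystyle{ L(\theta) = \prod_{i=1}^N f(x_i | \theta) }[/math]

It is easier to work with the log-likelihood:

[math]\displaystyle{ l(\theta) = \ln(L(\theta)) = \sum_{i=1}^N \ln( f(x_i | \theta) ) }[/math]

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} \ln(2\pi) - \frac{N}{2} \ln(|\Sigma|) - \frac{1}{2} \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) }[/math]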
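The log-likelihood formula above can be evaluated numerically. The following is only a minimal sketch, assuming NumPy is available; the function name `mvn_loglik`, the simulated data and the parameter values are my own illustrative choices, not part of the original notes.

```python
import numpy as np

def mvn_loglik(X, mu, Sigma):
    """Log-likelihood of N independent P-variate Normal observations (rows of X)."""
    N, P = X.shape
    centered = X - mu                       # shape (N, P), row i is x_i - mu
    _, logdet = np.linalg.slogdet(Sigma)    # numerically stable ln|Sigma|
    # sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu), without forming the inverse explicitly
    quad = np.sum(centered * np.linalg.solve(Sigma, centered.T).T)
    return -0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad

# Hypothetical example data: 200 draws from a bivariate Normal
rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=200)

print(mvn_loglik(X, mu_true, Sigma_true))
```

Using `slogdet` and `solve` rather than `det` and `inv` is a design choice for numerical stability; the value computed is the same as in the formula.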
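A few rules of matrix differential calculus are needed. The chain rule for a scalar function:

[math]\displaystyle{ d(f(u)) = f'(u) du }[/math]

e.g. useful here: [math]\displaystyle{ d(\ln(|\Sigma|)) = |\Sigma|^{-1} d(|\Sigma|) }[/math]

The differential of a determinant:

[math]\displaystyle{ d(|U|) = |U| \operatorname{tr}(U^{-1} dU) }[/math]

The differential of an inverse:

[math]\displaystyle{ d(U^{-1}) = - U^{-1} (dU) U^{-1} }[/math]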
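These differential rules can be checked numerically by comparing the actual change caused by a small perturbation with the first-order prediction. This is a rough sanity check rather than a proof; it assumes NumPy, and the matrix `U`, the perturbation `dU` and its scale of 1e-6 are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
P = 3
A = rng.normal(size=(P, P))
U = A @ A.T + P * np.eye(P)           # symmetric positive definite, hence invertible
dU = 1e-6 * rng.normal(size=(P, P))   # small arbitrary perturbation

U_inv = np.linalg.inv(U)

# Rule: d|U| = |U| tr(U^{-1} dU)
det_change = np.linalg.det(U + dU) - np.linalg.det(U)
det_rule = np.linalg.det(U) * np.trace(U_inv @ dU)

# Rule: d(U^{-1}) = -U^{-1} (dU) U^{-1}
inv_change = np.linalg.inv(U + dU) - U_inv
inv_rule = -U_inv @ dU @ U_inv

# Chain rule combined with the determinant rule: d(ln|U|) = tr(U^{-1} dU)
logdet_change = np.linalg.slogdet(U + dU)[1] - np.linalg.slogdet(U)[1]
logdet_rule = np.trace(U_inv @ dU)

print(det_change, det_rule)
print(np.max(np.abs(inv_change - inv_rule)))
print(logdet_change, logdet_rule)
```

The remaining discrepancies are of second order in the size of `dU`, as expected from a first differential.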
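The quadratic form in the log-likelihood is a scalar, so it equals its own trace:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N \operatorname{tr}( (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) ) }[/math]

As the trace is invariant under cyclic permutations ([math]\displaystyle{ \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \sum_{i=1}^N \operatorname{tr}( \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T ) }[/math]

The trace is also a linear map ([math]\displaystyle{ \operatorname{tr}(A+B) = \operatorname{tr}(A) + \operatorname{tr}(B) }[/math]):

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \operatorname{tr}\left( \sum_{i=1}^N \Sigma^{-1} (x_i-\mu) (x_i-\mu)^T \right) }[/math]

And finally, as [math]\displaystyle{ \Sigma^{-1} }[/math] does not depend on [math]\displaystyle{ i }[/math]:

[math]\displaystyle{ \sum_{i=1}^N (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) = \operatorname{tr}\left( \Sigma^{-1} \sum_{i=1}^N (x_i-\mu) (x_i-\mu)^T \right) }[/math]

As a result:

[math]\displaystyle{ l(\theta) = -\frac{NP}{2} \ln(2\pi) - \frac{N}{2} \ln(|\Sigma|) - \frac{1}{2} \operatorname{tr}(\Sigma^{-1} Z) }[/math] with [math]\displaystyle{ Z=\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T }[/math]

We can now write the first differential of the log-likelihood:

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} d(\ln(|\Sigma|)) - \frac{1}{2} d(\operatorname{tr}(\Sigma^{-1} Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} |\Sigma|^{-1} d(|\Sigma|) - \frac{1}{2} \operatorname{tr}(d(\Sigma^{-1}Z)) }[/math]

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} \operatorname{tr}(\Sigma^{-1} d\Sigma) - \frac{1}{2} \operatorname{tr}(d(\Sigma^{-1})Z) - \frac{1}{2} \operatorname{tr}(\Sigma^{-1} dZ) }[/math]

Using [math]\displaystyle{ d(\Sigma^{-1}) = - \Sigma^{-1} (d\Sigma) \Sigma^{-1} }[/math] and [math]\displaystyle{ dZ = - \sum_{i=1}^N (x_i - \mu) (d\mu)^T - \sum_{i=1}^N (d\mu) (x_i - \mu)^T }[/math]:

[math]\displaystyle{ d l(\theta) = - \frac{N}{2} \operatorname{tr}(\Sigma^{-1} d\Sigma) + \frac{1}{2} \operatorname{tr}(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) + \frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} \left(\sum_{i=1}^N (x_i - \mu) (d\mu)^T + \sum_{i=1}^N (d\mu) (x_i - \mu)^T\right)\right) }[/math]

The step from the line above to the line below, which the book leaves implicit, relies on two facts. First, by cyclic permutation, [math]\displaystyle{ \operatorname{tr}(\Sigma^{-1} d\Sigma) = \operatorname{tr}((d\Sigma) \Sigma^{-1} \Sigma \Sigma^{-1}) }[/math] and [math]\displaystyle{ \operatorname{tr}(\Sigma^{-1} (d\Sigma) \Sigma^{-1} Z) = \operatorname{tr}((d\Sigma) \Sigma^{-1} Z \Sigma^{-1}) }[/math], so the first two terms can be grouped. Second, [math]\displaystyle{ \operatorname{tr}(\Sigma^{-1} (x_i - \mu) (d\mu)^T) = (d\mu)^T \Sigma^{-1} (x_i - \mu) }[/math], and, because [math]\displaystyle{ \Sigma^{-1} }[/math] is symmetric, [math]\displaystyle{ \operatorname{tr}(\Sigma^{-1} (d\mu) (x_i - \mu)^T) = (x_i - \mu)^T \Sigma^{-1} d\mu = (d\mu)^T \Sigma^{-1} (x_i - \mu) }[/math]; the two sums in the last term are therefore equal, which cancels the factor [math]\displaystyle{ \frac{1}{2} }[/math]:

[math]\displaystyle{ d l(\theta) = \frac{1}{2} \operatorname{tr}\left( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) + (d\mu)^T \Sigma^{-1} \sum_{i=1}^N (x_i - \mu) }[/math]

[math]\displaystyle{ d l(\theta) = \frac{1}{2} \operatorname{tr}\left( (d\Sigma) \Sigma^{-1} (Z - N\Sigma) \Sigma^{-1} \right) + N (d\mu)^T \Sigma^{-1} (\bar{x} - \mu) }[/math]

The first-order conditions (i.e. when [math]\displaystyle{ d l(\theta) = 0 }[/math] for all [math]\displaystyle{ d\mu }[/math] and [math]\displaystyle{ d\Sigma }[/math]) are, with [math]\displaystyle{ Z }[/math] evaluated at [math]\displaystyle{ \hat{\mu} }[/math]:

[math]\displaystyle{ \hat{\Sigma}^{-1} (Z - N\hat{\Sigma}) \hat{\Sigma}^{-1} = 0 }[/math] and [math]\displaystyle{ \hat{\Sigma}^{-1} (\bar{x} - \hat{\mu}) = 0 }[/math]

From which follow:

[math]\displaystyle{ \hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i }[/math]

and:

[math]\displaystyle{ \hat{\Sigma} = \bar{S}_N = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T }[/math]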
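The maximum-likelihood estimates and the first-order conditions can be illustrated numerically. This is a hedged sketch assuming NumPy; the helper `loglik`, the simulated data and the perturbations used for comparison are hypothetical choices made only for this example.

```python
import numpy as np

def loglik(X, mu, Sigma):
    """Log-likelihood in its trace form: -NP/2 ln(2 pi) - N/2 ln|Sigma| - tr(Sigma^{-1} Z)/2."""
    N, P = X.shape
    Z = (X - mu).T @ (X - mu)               # sum_i (x_i - mu)(x_i - mu)^T
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet
            - 0.5 * np.trace(np.linalg.solve(Sigma, Z)))

# Hypothetical data: 500 draws from a trivariate Normal
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.5, -1.0, 2.0], np.diag([1.0, 2.0, 0.5]), size=500)
N = X.shape[0]

# Closed-form maximum-likelihood estimates
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / N   # denominator N, not N-1

# Left-hand sides of the two first-order conditions, evaluated at the estimates
Z_hat = (X - mu_hat).T @ (X - mu_hat)
Sigma_hat_inv = np.linalg.inv(Sigma_hat)
cond_Sigma = Sigma_hat_inv @ (Z_hat - N * Sigma_hat) @ Sigma_hat_inv
cond_mu = Sigma_hat_inv @ (X.mean(axis=0) - mu_hat)

print(np.max(np.abs(cond_Sigma)), np.max(np.abs(cond_mu)))
print(loglik(X, mu_hat, Sigma_hat), loglik(X, mu_hat + 0.1, Sigma_hat))
```

Both conditions evaluate to zero up to rounding error, and moving the parameters away from the estimates lowers the log-likelihood.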
From [math]\displaystyle{ Z }[/math] defined above and [math]\displaystyle{ \bar{S}_{N-1} }[/math], defined like [math]\displaystyle{ \bar{S}_N }[/math] but with [math]\displaystyle{ N-1 }[/math] in the denominator, we can write the following:

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i-\mu)(x_i-\mu)^T }[/math]

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x} + \bar{x} - \mu)(x_i - \bar{x} + \bar{x} - \mu)^T }[/math]

[math]\displaystyle{ Z = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T + \sum_{i=1}^N (\bar{x} - \mu)(\bar{x} - \mu)^T + \sum_{i=1}^N (x_i - \bar{x})(\bar{x} - \mu)^T + \sum_{i=1}^N (\bar{x} - \mu)(x_i - \bar{x})^T }[/math]

The two cross terms vanish because [math]\displaystyle{ \sum_{i=1}^N (x_i - \bar{x}) = 0 }[/math], which leaves:

[math]\displaystyle{ Z = (N-1) \bar{S}_{N-1} + N (\bar{x} - \mu)(\bar{x} - \mu)^T }[/math]

Thus, by employing the same trick with the trace as above, we now have:

[math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-NP/2} |\Sigma|^{-N/2} \exp \left( -\frac{N}{2}(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) -\frac{N-1}{2} \operatorname{tr}(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math]

The likelihood depends on the samples only through the pair [math]\displaystyle{ (\bar{x}, \bar{S}_{N-1}) }[/math]. By the factorization theorem, this pair is a sufficient statistic for [math]\displaystyle{ (\mu, \Sigma) }[/math].

We can also rearrange the likelihood a bit further in order to find the distribution of the sufficient statistic:

[math]\displaystyle{ L(\mu, \Sigma) = (2 \pi)^{-(N-1)P/2} (2 \pi)^{-P/2} |\Sigma|^{-1/2} \exp \left( -\frac{1}{2}(\bar{x} - \mu)^T \left(\frac{1}{N}\Sigma\right)^{-1}(\bar{x} - \mu) \right) \times |\Sigma|^{-(N-1)/2} \exp \left(-\frac{N-1}{2} \operatorname{tr}(\Sigma^{-1}\bar{S}_{N-1}) \right) }[/math]

[math]\displaystyle{ L(\mu, \Sigma) \propto N_P(\bar{x}; \mu, \Sigma/N) \times W_P((N-1)\bar{S}_{N-1}; \Sigma, N-1) }[/math]

Note that it is the scatter matrix [math]\displaystyle{ (N-1)\bar{S}_{N-1} }[/math], not [math]\displaystyle{ \bar{S}_{N-1} }[/math] itself, that follows the Wishart distribution with [math]\displaystyle{ N-1 }[/math] degrees of freedom. The likelihood is only proportional to this product because the first constant is not used in either of the two distributions and a few factors that do not depend on [math]\displaystyle{ (\mu, \Sigma) }[/math] are missing (e.g. the Gamma function appearing in the density of the Wishart distribution). This doesn't matter, as we usually want to maximize the likelihood or compute a likelihood ratio.
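The decomposition of the scatter matrix and the sufficiency claim can be checked numerically: the log-likelihood computed from the raw data should match the one computed from the sample mean and sample covariance alone. This is a minimal sketch assuming NumPy; the function names, the simulated data and the parameter values are illustrative assumptions.

```python
import numpy as np

def loglik_direct(X, mu, Sigma):
    """Log-likelihood computed from all N observations."""
    N, P = X.shape
    Z = (X - mu).T @ (X - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet
            - 0.5 * np.trace(np.linalg.solve(Sigma, Z)))

def loglik_from_stats(N, x_bar, S, mu, Sigma):
    """Same log-likelihood, using only N, the sample mean and the (N-1)-denominator covariance."""
    P = x_bar.shape[0]
    diff = x_bar - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * N * P * np.log(2 * np.pi) - 0.5 * N * logdet
            - 0.5 * N * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * (N - 1) * np.trace(np.linalg.solve(Sigma, S)))

# Hypothetical data and arbitrary parameter values at which to evaluate the likelihood
rng = np.random.default_rng(3)
N, P = 50, 2
X = rng.normal(size=(N, P)) @ np.array([[1.0, 0.3], [0.0, 0.8]]) + np.array([1.0, -1.0])
mu = np.array([0.7, -0.5])
Sigma = np.array([[1.5, 0.2], [0.2, 0.9]])

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)   # np.cov uses the N-1 denominator by default

# Decomposition Z = (N-1) S_{N-1} + N (x_bar - mu)(x_bar - mu)^T
Z = (X - mu).T @ (X - mu)
Z_from_stats = (N - 1) * S + N * np.outer(x_bar - mu, x_bar - mu)

print(np.max(np.abs(Z - Z_from_stats)))
print(loglik_direct(X, mu, Sigma), loglik_from_stats(N, x_bar, S, mu, Sigma))
```

Once `x_bar` and `S` are stored, the raw observations are no longer needed to evaluate the likelihood at any parameter value, which is what sufficiency means in practice.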