User:Timothee Flutre/Notebook/Postdoc/2012/08/16
Project name | <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page <html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html> </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> |
Variational Bayes approach for the mixture of Normals
[math]\displaystyle{ \Theta = \{w_1,\ldots,w_K,\mu_1,\ldots,\mu_K,\tau_1,\ldots,\tau_K\} }[/math]
[math]\displaystyle{ p(\mathbf{y} | \Theta, K) = \prod_{n=1}^N p(y_n|\Theta,K) = \prod_{n=1}^N \sum_{k=1}^K w_k Normal(y_n;\mu_k,\tau_k^{-1}) }[/math]
[math]\displaystyle{ p(\mathbf{y},\mathbf{z}|\Theta,K) = \prod_{n=1}^N p(y_n,z_n|\Theta,K) = \prod_{n=1}^N p(z_n|\Theta,K) p(y_n|z_n,\Theta,K) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}} Normal(y_n;\mu_k,\tau_k^{-1})^{z_{nk}} }[/math]
[math]\displaystyle{ \mathrm{ln} \, p(\mathbf{y} | \Theta, K) = \sum_{n=1}^N \mathrm{ln} \, \int_{z_n} \mathrm{d}{z_n} \; p(y_n, z_n | \Theta, K) }[/math] The latent variables induce dependencies between all the parameters of the model. This makes it difficult to find the parameters that maximize the likelihood. An elegant solution is to introduce a variational distribution of parameters and latent variables, which leads to a re-formulation of the classical EM algorithm. But let's show it directly in the Bayesian paradigm.
[math]\displaystyle{ \mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; p(\mathbf{y}, \mathbf{z}, \Theta | K) }[/math] [math]\displaystyle{ \mathrm{ln} \, p(\mathbf{y} | K) = \mathrm{ln} \, \left( \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} \right) + C_{\mathbf{z}, \Theta} }[/math] The constant [math]\displaystyle{ C }[/math] is here to remind us that [math]\displaystyle{ q }[/math] has the constraint of being a distribution, ie. of summing to 1, which can be enforced by a Lagrange multiplier. We can then use the concavity of the logarithm (Jensen's inequality) to derive a lower bound of the marginal log-likelihood: [math]\displaystyle{ \mathrm{ln} \, p(\mathbf{y} | K) \ge \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta} = \mathcal{F}(q) }[/math] Let's call this lower bound [math]\displaystyle{ \mathcal{F}(q) }[/math] as it is a functional, ie. a function of functions. To gain some intuition about the impact of introducing [math]\displaystyle{ q }[/math], let's expand [math]\displaystyle{ \mathcal{F} }[/math]: [math]\displaystyle{ \mathcal{F}(q) = \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) + \int_\mathbf{z} \int_\Theta \, \mathrm{d}\mathbf{z} \, \mathrm{d}\Theta \; q(\mathbf{z}, \Theta) \; \mathrm{ln} \, \frac{p(\mathbf{z}, \Theta | K)}{q(\mathbf{z}, \Theta)} + C_{\mathbf{z}, \Theta} }[/math] [math]\displaystyle{ \mathcal{F}(q) = \mathrm{ln} \, p(\mathbf{y} | \mathbf{z}, \Theta, K) - D_{KL}(q || p) }[/math] From this, it is clear that [math]\displaystyle{ \mathcal{F} }[/math] (ie. a lower-bound of the marginal log-likelihood) is the conditional log-likelihood minus the Kullback-Leibler divergence between the variational distribution [math]\displaystyle{ q }[/math] and the joint posterior of latent variables and parameters. In practice, we have to make the following crucial assumption of independence on [math]\displaystyle{ q }[/math] in order for the calculations to be analytically tractable: [math]\displaystyle{ q(\mathbf{z}, \Theta) = q_{\mathbf{z}}(\mathbf{z}) q_\Theta(\Theta) }[/math] This means that [math]\displaystyle{ q_\mathbf{z} q_\Theta }[/math] approximates the joint posterior, and therefore the lower-bound will be tight if and only if this approximation is exact and the KL divergence is zero. As we ultimately aim at inferring the parameters and latent variables that maximize the marginal log-likelihood, we will use the calculus of variations to find the functions [math]\displaystyle{ q_\mathbf{z} }[/math] and [math]\displaystyle{ q_\Theta }[/math] that maximize the functional [math]\displaystyle{ \mathcal{F} }[/math]. [math]\displaystyle{ \mathcal{F}(q_\mathbf{z}, q_\Theta) = \int_\Theta \, \mathrm{d}\Theta \; \left( \int_\mathbf{z} \, \mathrm{d}\mathbf{z} \; q(\mathbf{z}) \; \mathrm{ln} \, \frac{p(\mathbf{y}, \mathbf{z} | \Theta, K)}{q(\mathbf{z})} + \mathrm{ln} \, \frac{p(\Theta | K)}{q(\Theta)} \right) + C_{\mathbf{z}} + C_{\Theta} }[/math] This naturally leads to a procedure very similar to the EM algorithm where, at the E step, we calculate the expectations of the parameters with respect to the variational distributions [math]\displaystyle{ q_\mathbf{z} }[/math] and [math]\displaystyle{ q_\Theta }[/math], and, at the M step, we recompute the variational distributions over the parameters.
TODO
TODO |