Variational Bayes approach for the mixture of Normals
 Motivation: I have described on another page the basics of mixture models and the EM algorithm in a frequentist context. It is worth reading before continuing. Here I am interested in the Bayesian approach as well as in a specific variational method (nicknamed "Variational Bayes").
 Data: we have N univariate observations, $y_1, \ldots, y_N$, gathered into the vector $\mathbf{y}$.
 Assumptions: we assume the observations to be exchangeable and distributed according to a mixture of K Normal distributions. The parameters of this model are the mixture weights ($w_k$), the means ($\mu_k$) and the precisions ($\tau_k$) of each mixture component, all gathered into $\Theta = \{w_1, \ldots, w_K, \mu_1, \ldots, \mu_K, \tau_1, \ldots, \tau_K\}$. There are two constraints: $\forall k, \; 0 \le w_k \le 1$ and $\sum_{k=1}^K w_k = 1$.
 Observed likelihood:
$$p(\mathbf{y} | \Theta) = \prod_{n=1}^N \sum_{k=1}^K w_k \, \mathcal{N}(y_n; \mu_k, \tau_k^{-1}) = \prod_{n=1}^N \sum_{k=1}^K w_k \sqrt{\frac{\tau_k}{2\pi}} \exp\left( -\frac{\tau_k}{2} (y_n - \mu_k)^2 \right)$$
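As a quick sanity check of this formula, here is a minimal sketch in Python (NumPy/SciPy; the parameter values are illustrative) that evaluates the observed log-likelihood of a mixture of Normals parameterized by weights, means and precisions:

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, w, mu, tau):
    """Observed log-likelihood of a K-component Normal mixture.

    y: (N,) observations; w: (K,) weights; mu: (K,) means;
    tau: (K,) precisions (so the variance of component k is 1/tau_k).
    """
    # densities[n, k] = w_k * N(y_n; mu_k, 1/tau_k)
    densities = w * norm.pdf(y[:, None], loc=mu, scale=1.0 / np.sqrt(tau))
    # sum over components, take the log, then sum over observations
    return np.log(densities.sum(axis=1)).sum()

rng = np.random.default_rng(0)
y = rng.normal(size=50)
print(mixture_loglik(y, w=np.array([0.5, 0.5]),
                     mu=np.array([-1.0, 1.0]),
                     tau=np.array([1.0, 1.0])))
```

With K = 1 this reduces to the usual Normal log-likelihood, which is an easy way to test it.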
 Maximizing the observed log-likelihood: as shown here, maximizing the likelihood of a mixture model amounts to doing a weighted likelihood maximization. However, these weights depend on the parameters we want to estimate! That's why we now switch to the missing-data formulation of the mixture model.
 Latent variables: let's introduce N latent variables, $z_1, \ldots, z_N$, gathered into the vector $\mathbf{z}$. Each z_{n} is a vector of length K with a single 1 indicating the component to which the n^{th} observation belongs, and K-1 zeroes.
 Augmented likelihood:
$$p(\mathbf{y}, \mathbf{z} | \Theta) = \prod_{n=1}^N \prod_{k=1}^K \left[ w_k \, \mathcal{N}(y_n; \mu_k, \tau_k^{-1}) \right]^{z_{nk}}$$
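The augmented (complete-data) log-likelihood is straightforward to evaluate once each z_n is stored as a one-hot vector, since the exponent z_{nk} simply selects one term per observation; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def augmented_loglik(y, z, w, mu, tau):
    """Complete-data log-likelihood of the Normal mixture.

    y: (N,) observations; z: (N, K) one-hot component indicators;
    w: (K,) weights; mu: (K,) means; tau: (K,) precisions.
    """
    # logdens[n, k] = log w_k + log N(y_n; mu_k, 1/tau_k)
    logdens = np.log(w) + norm.logpdf(y[:, None], loc=mu,
                                      scale=1.0 / np.sqrt(tau))
    # z_nk selects the term of the component y_n belongs to
    return (z * logdens).sum()
```

Unlike the observed likelihood, there is no sum inside the logarithm, which is what makes this formulation tractable.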
 Priors: in the Bayesian paradigm, parameters and latent variables are random variables for which we want to infer the posterior distribution. To make the calculations possible, we choose for them prior distributions that are conjugate with the form of the likelihood.
 for the parameters: $\mathbf{w} \sim \mathrm{Dirichlet}(\gamma, \ldots, \gamma)$ and $\forall k, \; \tau_k \sim \mathrm{Gamma}(\alpha, \beta), \; \mu_k | \tau_k \sim \mathcal{N}(\mu_0, (\kappa_0 \tau_k)^{-1})$
 for the latent variables: $\forall n, \; z_n | \mathbf{w} \sim \mathrm{Mult}(1, \mathbf{w})$ and $p(\mathbf{z} | \mathbf{w}) = \prod_{n=1}^N \prod_{k=1}^K w_k^{z_{nk}}$
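For concreteness, here is a sketch of ancestral sampling from these priors down to the data; the hyperparameter values, and the names alpha, beta, mu0, kappa0, gamma, are illustrative assumptions, not fixed by the model:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 3, 10
# hyperparameters (illustrative values)
alpha, beta = 2.0, 2.0   # Gamma prior on each precision tau_k (shape, rate)
mu0, kappa0 = 0.0, 1.0   # Normal prior on mu_k given tau_k
gamma = np.ones(K)       # symmetric Dirichlet prior on the weights

tau = rng.gamma(alpha, 1.0 / beta, size=K)         # tau_k ~ Gamma(alpha, beta)
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa0 * tau))  # mu_k | tau_k ~ N(mu0, 1/(kappa0 tau_k))
w = rng.dirichlet(gamma)                           # w ~ Dirichlet(gamma)
z = rng.multinomial(1, w, size=N)                  # z_n | w ~ Mult(1, w), one-hot rows
k_of_n = z.argmax(axis=1)                          # component index of each observation
y = rng.normal(mu[k_of_n], 1.0 / np.sqrt(tau[k_of_n]))  # y_n | z_n, theta
```

Note that NumPy's Gamma sampler takes a scale parameter, hence the `1.0 / beta` when beta is a rate.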
 Variational Bayes: our primary goal here is to calculate the marginal log-likelihood of our data set:
$$\log p(\mathbf{y}) = \log \left( \int_\Theta p(\Theta) \sum_{\mathbf{z}} p(\mathbf{y}, \mathbf{z} | \Theta) \, \mathrm{d}\Theta \right)$$
However, the latent variables induce dependencies between all the parameters of the model, which makes it difficult to find the parameters that maximize the marginal likelihood.
An elegant solution is to introduce a "variational distribution" q of the parameters and the latent variables, $q(\Theta, \mathbf{z})$.
Being a distribution, q is constrained to sum to 1, i.e. $\sum_{\mathbf{z}} \int_\Theta q(\Theta, \mathbf{z}) \, \mathrm{d}\Theta = 1$, a constraint that can be enforced by a Lagrange multiplier.
The crucial assumption is to assume the independence of the parameters and the latent variables:
$$q(\Theta, \mathbf{z}) = q_\Theta(\Theta) \, q_{\mathbf{z}}(\mathbf{z})$$
We can then use the concavity of the logarithm and Jensen's inequality to optimize a lower bound of the marginal log-likelihood:
$$\log p(\mathbf{y}) = \log \sum_{\mathbf{z}} \int_\Theta q(\Theta, \mathbf{z}) \frac{p(\mathbf{y}, \mathbf{z}, \Theta)}{q(\Theta, \mathbf{z})} \, \mathrm{d}\Theta \; \ge \; \sum_{\mathbf{z}} \int_\Theta q(\Theta, \mathbf{z}) \log \frac{p(\mathbf{y}, \mathbf{z}, \Theta)}{q(\Theta, \mathbf{z})} \, \mathrm{d}\Theta$$
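To see Jensen's inequality at work numerically, here is a simplified check in which Θ is held fixed and q is a distribution over z only, so that the exact log-likelihood (marginal over z) is computable; any choice of q gives a lower bound, and the bound is tight when q equals the exact posterior of z (parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(size=20)
w, mu, tau = np.array([0.4, 0.6]), np.array([-1.0, 1.0]), np.array([1.0, 2.0])

# dens[n, k] = w_k * N(y_n; mu_k, 1/tau_k)
dens = w * norm.pdf(y[:, None], mu, 1.0 / np.sqrt(tau))
# exact observed log-likelihood (sum over z done analytically)
loglik = np.log(dens.sum(axis=1)).sum()

# lower bound E_q[log p(y, z) - log q(z)] for an arbitrary q over z
q = rng.dirichlet(np.ones(2), size=len(y))   # any responsibilities work
elbo = (q * (np.log(dens) - np.log(q))).sum()

# the bound is tight when q is the exact posterior p(z | y, theta)
q_opt = dens / dens.sum(axis=1, keepdims=True)
elbo_opt = (q_opt * (np.log(dens) - np.log(q_opt))).sum()
```

Here `elbo <= loglik` holds for the random q, while `elbo_opt` recovers `loglik` exactly, which is the familiar E-step of EM seen through variational glasses.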
Now we have to optimize the right-hand side of the inequality. Let's name it $\mathcal{F}(q)$, as it is a functional, i.e. a function of functions. Using the calculus of variations, we'll find the function q that maximizes it.
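For reference, carrying out this variational optimization under the factorization $q(\Theta, \mathbf{z}) = q_\Theta(\Theta) \, q_{\mathbf{z}}(\mathbf{z})$ leads to the standard mean-field stationarity conditions, sketched here:

```latex
\log q^*_{\Theta}(\Theta) = \mathrm{E}_{q_{\mathbf{z}}}\left[ \log p(\mathbf{y}, \mathbf{z}, \Theta) \right] + \text{constant}, \qquad
\log q^*_{\mathbf{z}}(\mathbf{z}) = \mathrm{E}_{q_{\Theta}}\left[ \log p(\mathbf{y}, \mathbf{z}, \Theta) \right] + \text{constant}
```

Each expectation is taken over the other factor, which suggests a coordinate-ascent scheme alternating between the two updates.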
