User:Timothee Flutre/Notebook/Postdoc/2011/12/14
Project name | <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page <html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html> </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> |
Learn about mixture models and the EM algorithm(Caution, this is my own quick-and-dirty tutorial, see the references at the end for presentations by professional statisticians.)
[math]\displaystyle{ f(x_i|\theta) = \sum_{k=1}^{K} w_k \phi(x_i|\mu_k,\sigma_k) = \sum_{k=1}^{K} w_k \frac{1}{\sqrt{2\pi} \sigma_k} \exp \left(-\frac{1}{2}(\frac{x_i - \mu_k}{\sigma_k})^2 \right) }[/math] The constraints are: [math]\displaystyle{ \forall k, w_k \gt 0 }[/math] and [math]\displaystyle{ \sum_{k=1}^K w_k = 1 }[/math]
[math]\displaystyle{ L(\theta) = P(X|\theta) = \prod_{i=1}^N f(x_i|\theta) }[/math] As usual, it's easier to deal with the log-likelihood: [math]\displaystyle{ l(\theta) = \sum_{i=1}^N ln \left( f(x_i|\theta) \right) = \sum_{i=1}^N ln \left( \sum_{k=1}^K w_k \phi(x_i; \theta_k) \right) }[/math] Let's take the derivative with respect to one parameter, eg. [math]\displaystyle{ \theta_l }[/math]: [math]\displaystyle{ \frac{\partial l}{\partial \theta_l} = \sum_{i=1}^N \frac{1}{\sum_{k=1}^K w_k \phi(x_i; \theta_k)} w_l \frac{\partial \phi(x_i; \theta_l)}{\partial \theta_l} }[/math] [math]\displaystyle{ \frac{\partial l}{\partial \theta_l} = \sum_{i=1}^N \frac{w_l \phi(x_i; \theta_l)}{\sum_{k=1}^K w_k \phi(x_i; \theta_k)} \frac{1}{\phi(x_i; \theta_l)} \frac{\partial \phi(x_i; \theta_l)}{\partial \theta_l} }[/math] [math]\displaystyle{ \frac{\partial l}{\partial \theta_l} = \sum_{i=1}^N \frac{w_l \phi(x_i; \theta_l)}{\sum_{k=1}^K w_k \phi(x_i; \theta_k)} \frac{\partial ln ( \phi(x_i; \theta_l) )}{\partial \theta_l} }[/math] This shows that maximizing the likelihood of a mixture model is like doing a weighted likelihood maximization. However, these weights depend on the parameters we want to estimate! That's why we now switch to the missing-data formulation of the mixture model.
[math]\displaystyle{ p(k|i) = P(z_{ik}=1|x_i,\theta) = \frac{w_k \phi(x_i|\mu_k,\sigma_k)}{\sum_{l=1}^K w_l \phi(x_i|\mu_l,\sigma_l)} }[/math] The observed-data likelihood (also called sometimes "incomplete" or "marginal", even though these appellations are misnomers) is still written the same way: [math]\displaystyle{ L_{obs}(\theta) = P(X|\theta) = \prod_{i=1}^N f(x_i|\theta) }[/math] But now we can also write the augmented-data likelihood, assuming all observations are independent conditionally on their membership: [math]\displaystyle{ L_{aug}(\theta) = P(X,Z|\theta) = \prod_{i=1}^N P(x_i|z_i,\theta) P(z_i|\theta) = \prod_{i=1}^N \left( \prod_{k=1}^K \phi(x_i|\mu_k,\sigma_k)^{z_{ik}} w_k^{z_{ik}} \right) }[/math].
The EM algorithm is proven to always increase the observed-data log-likelihood at each iteration. Yes, the observed-data log-likelihood, even though it works on the augmented-data log-likelihood...
Here is the observed-data log-likelihood: [math]\displaystyle{ l_{obs} = \sum_{i=1}^N ln \left( f(x_i|\theta) \right) }[/math] First we introduce the hidden variables by integrating them out: [math]\displaystyle{ l_{obs} = \sum_{i=1}^N ln \left( \int p(x_i,z_i|\theta) dz_i \right) }[/math] Then, we use an arbitrary distribution on these hidden variables: [math]\displaystyle{ l_{obs} = \sum_{i=1}^N ln \left( \int q(z_i) \frac{p(x_i,z_i|\theta)}{q(z_i)} dz_i \right) }[/math] And here is the great trick, as explained by Beal: "any probability distribution [math]\displaystyle{ q(z_i) }[/math] on the hidden variables gives rise to a lower bound on [math]\displaystyle{ l_{obs} }[/math]": [math]\displaystyle{ l_{obs} \ge \sum_{i=1}^N \int q(z_i) ln \left( \frac{p(x_i,z_i|\theta)}{q(z_i)} \right) dz_i }[/math] This is due to to the Jensen inequality (the logarithm is concave). [math]\displaystyle{ l_{obs} \ge \sum_{i=1}^N \int q(z_i) ln \left( p(x_i,z_i|\theta) \right) dz_i - \int q(z_i) ln \left( q(z_i) \right) dz_i = Q(q(z), \theta) }[/math] At each iteration, the E step maximizes the lower bound ([math]\displaystyle{ Q }[/math], right part of the equation above) with respect to the [math]\displaystyle{ q(z_i) }[/math]. This amounts to inferring the posterior distribution of the hidden variables given the current parameter [math]\displaystyle{ \theta^{(t)} }[/math]: [math]\displaystyle{ q^{(t+1)}(z_i) = p(z_i | x_i, \theta^{(t)}) }[/math] The [math]\displaystyle{ q^{(t+1)}(z_i) }[/math] makes the bound tight (the inequality becomes an equality). Indeed: [math]\displaystyle{ Q(q^{(t+1)}(z), \theta^{(t)}) = \sum_{i=1}^N \int q^{(t+1)}(z_i) ln \left( \frac{p(x_i,z_i|\theta^{(t)})}{q^{(t+1)}(z_i)} \right) dz_i }[/math] [math]\displaystyle{ Q(q^{(t+1)}(z), \theta^{(t)}) = \sum_{i=1}^N \int p(z_i | x_i, \theta^{(t)}) ln \left( \frac{p(x_i,z_i|\theta^{(t)})}{p(z_i | x_i, \theta^{(t)})} \right) dz_i }[/math] [math]\displaystyle{ Q(q^{(t+1)}(z), \theta^{(t)}) = \sum_{i=1}^N \int p(z_i | x_i, \theta^{(t)}) ln \left( \frac{p(x_i|\theta^{(t)}) p(z_i|x_i,\theta^{(t)})}{p(z_i | x_i, \theta^{(t)})} \right) dz_i }[/math] [math]\displaystyle{ Q(q^{(t+1)}(z), \theta^{(t)}) = \sum_{i=1}^N \int p(z_i | x_i, \theta^{(t)}) ln \left( p(x_i|\theta^{(t)}) \right) dz_i }[/math] [math]\displaystyle{ Q(q^{(t+1)}(z), \theta^{(t)}) = \sum_{i=1}^N ln \left( p(x_i|\theta^{(t)}) \right) }[/math] [math]\displaystyle{ Q(q^{(t+1)}(z), \theta^{(t)}) = l_{obs}(\theta^{(t)}) }[/math] Then, at the M step, we use these statistics to maximize Q with respect to [math]\displaystyle{ \theta }[/math], and therefore find [math]\displaystyle{ \theta^{(t+1)} }[/math].
[math]\displaystyle{ \frac{\partial l(\theta)}{\partial \mu_k} = \sum_{i=1}^N \frac{1}{f(x_i/\theta)} \frac{\partial f(x_i/\theta)}{\partial \mu_k} }[/math] As we derive with respect to [math]\displaystyle{ \mu_k }[/math], all the others means [math]\displaystyle{ \mu_l }[/math] with [math]\displaystyle{ l \ne k }[/math] are constant, and thus disappear: [math]\displaystyle{ \frac{\partial f(x_i/\theta)}{\partial \mu_k} = w_k \frac{\partial g(x_i/\mu_k,\sigma_k)}{\partial \mu_k} }[/math] And finally: [math]\displaystyle{ \frac{\partial g(x_i/\mu_k,\sigma_k)}{\partial \mu_k} = \frac{\mu_k - x_i}{\sigma_k^2} g(x_i/\mu_k,\sigma_k) }[/math] Once we put all together, we end up with: [math]\displaystyle{ \frac{\partial l(\theta)}{\partial \mu_k} = \sum_{i=1}^N \frac{1}{\sigma^2} \frac{w_k g(x_i/\mu_k,\sigma_k)}{\sum_{l=1}^K w_l g(x_i/\mu_l,\sigma_l)} (\mu_k - x_i) = \sum_{i=1}^N \frac{1}{\sigma^2} p(k/i) (\mu_k - x_i) }[/math] By convention, we note [math]\displaystyle{ \hat{\mu_k} }[/math] the maximum-likelihood estimate of [math]\displaystyle{ \mu_k }[/math]: [math]\displaystyle{ \frac{\partial l(\theta)}{\partial \mu_k}_{\mu_k=\hat{\mu_k}} = 0 }[/math] Therefore, we finally obtain: [math]\displaystyle{ \hat{\mu_k} = \frac{\sum_{i=1}^N p(k/i) x_i}{\sum_{i=1}^N p(k/i)} }[/math] By doing the same kind of algebra, we derive the log-likelihood w.r.t. [math]\displaystyle{ \sigma_k }[/math]: [math]\displaystyle{ \frac{\partial l(\theta)}{\partial \sigma_k} = \sum_{i=1}^N p(k/i) (\frac{-1}{\sigma_k} + \frac{(x_i - \mu_k)^2}{\sigma_k^3}) }[/math] And then we obtain the ML estimates for the standard deviation of each cluster: [math]\displaystyle{ \hat{\sigma_k} = \sqrt{\frac{\sum_{i=1}^N p(k/i) (x_i - \mu_k)^2}{\sum_{i=1}^N p(k/i)}} }[/math] The partial derivative of [math]\displaystyle{ l(\theta) }[/math] w.r.t. [math]\displaystyle{ w_k }[/math] is tricky because of the constraints on the [math]\displaystyle{ w_k }[/math]. But we can handle it by writing them in terms of unconstrained variables [math]\displaystyle{ \gamma_k }[/math] (softmax function): [math]\displaystyle{ w_k = \frac{e^{\gamma_k}}{\sum_{k=1}^K e^{\gamma_k}} }[/math] [math]\displaystyle{ \frac{\partial w_k}{\partial \gamma_j} = \begin{cases} w_k - w_k^2 & \mbox{if }j = k \\ -w_kw_j & \mbox{otherwise} \end{cases} }[/math] [math]\displaystyle{ \frac{\partial l(\theta)}{\partial w_k} = \sum_{i=1}^N (p(k/i) - w_k) }[/math] Finally, here are the ML estimates for the mixture weights: [math]\displaystyle{ \hat{w}_k = \frac{1}{N} \sum_{i=1}^N p(k/i) }[/math]
#' Generate univariate observations from a mixture of Normals #' #' @param K number of components #' @param N number of observations #' @param gap difference between all component means GetUnivariateSimulatedData <- function(K=2, N=100, gap=6){ mus <- seq(0, gap*(K-1), gap) sigmas <- runif(n=K, min=0.5, max=1.5) tmp <- floor(rnorm(n=K-1, mean=floor(N/K), sd=5)) ns <- c(tmp, N - sum(tmp)) clusters <- as.factor(matrix(unlist(lapply(1:K, function(k){rep(k, ns[k])})), ncol=1)) obs <- matrix(unlist(lapply(1:K, function(k){ rnorm(n=ns[k], mean=mus[k], sd=sigmas[k]) }))) new.order <- sample(1:N, N) obs <- obs[new.order] rownames(obs) <- NULL clusters <- clusters[new.order] return(list(obs=obs, clusters=clusters, mus=mus, sigmas=sigmas, mix.weights=ns/N)) }
#' Return probas of latent variables given data and parameters from previous iteration #' #' @param data Nx1 vector of observations #' @param params list which components are mus, sigmas and mix.weights Estep <- function(data, params){ GetMembershipProbas(data, params$mus, params$sigmas, params$mix.weights) } #' Return the membership probabilities P(zi=k/xi,theta) #' #' @param data Nx1 vector of observations #' @param mus Kx1 vector of means #' @param sigmas Kx1 vector of std deviations #' @param mix.weights Kx1 vector of mixture weights w_k=P(zi=k/theta) #' @return NxK matrix of membership probas GetMembershipProbas <- function(data, mus, sigmas, mix.weights){ N <- length(data) K <- length(mus) tmp <- matrix(unlist(lapply(1:N, function(i){ x <- data[i] norm.const <- sum(unlist(Map(function(mu, sigma, mix.weight){ mix.weight * GetUnivariateNormalDensity(x, mu, sigma)}, mus, sigmas, mix.weights))) unlist(Map(function(mu, sigma, mix.weight){ mix.weight * GetUnivariateNormalDensity(x, mu, sigma) / norm.const }, mus[-K], sigmas[-K], mix.weights[-K])) })), ncol=K-1, byrow=TRUE) membership.probas <- cbind(tmp, apply(tmp, 1, function(x){1 - sum(x)})) names(membership.probas) <- NULL return(membership.probas) } #' Univariate Normal density GetUnivariateNormalDensity <- function(x, mu, sigma){ return( 1/(sigma * sqrt(2*pi)) * exp(-1/(2*sigma^2)*(x-mu)^2) ) }
#' Return ML estimates of parameters #' #' @param data Nx1 vector of observations #' @param params list which components are mus, sigmas and mix.weights #' @param membership.probas NxK matrix with entry i,k being P(zi=k/xi,theta) Mstep <- function(data, params, membership.probas){ params.new <- list() sum.membership.probas <- apply(membership.probas, 2, sum) params.new$mus <- GetMlEstimMeans(data, membership.probas, sum.membership.probas) params.new$sigmas <- GetMlEstimStdDevs(data, params.new$mus, membership.probas, sum.membership.probas) params.new$mix.weights <- GetMlEstimMixWeights(data, membership.probas, sum.membership.probas) return(params.new) } #' Return ML estimates of the means (1 per cluster) #' #' @param data Nx1 vector of observations #' @param membership.probas NxK matrix with entry i,k being P(zi=k/xi,theta) #' @param sum.membership.probas Kx1 vector of sum per column of matrix above #' @return Kx1 vector of means GetMlEstimMeans <- function(data, membership.probas, sum.membership.probas){ K <- ncol(membership.probas) sapply(1:K, function(k){ sum(unlist(Map("*", membership.probas[,k], data))) / sum.membership.probas[k] }) } #' Return ML estimates of the std deviations (1 per cluster) #' #' @param data Nx1 vector of observations #' @param membership.probas NxK matrix with entry i,k being P(zi=k/xi,theta) #' @param sum.membership.probas Kx1 vector of sum per column of matrix above #' @return Kx1 vector of std deviations GetMlEstimStdDevs <- function(data, means, membership.probas, sum.membership.probas){ K <- ncol(membership.probas) sapply(1:K, function(k){ sqrt(sum(unlist(Map(function(p_ki, x_i){ p_ki * (x_i - means[k])^2 }, membership.probas[,k], data))) / sum.membership.probas[k]) }) } #' Return ML estimates of the mixture weights #' #' @param data Nx1 vector of observations #' @param membership.probas NxK matrix with entry i,k being P(zi=k/xi,theta) #' @param sum.membership.probas Kx1 vector of sum per column of matrix above #' @return Kx1 vector of mixture weights GetMlEstimMixWeights <- function(data, membership.probas, sum.membership.probas){ K <- ncol(membership.probas) sapply(1:K, function(k){ 1/length(data) * sum.membership.probas[k] }) }
GetLogLikelihood <- function(data, mus, sigmas, mix.weights){ loglik <- sum(sapply(data, function(x){ log(sum(unlist(Map(function(mu, sigma, mix.weight){ mix.weight * GetUnivariateNormalDensity(x, mu, sigma) }, mus, sigmas, mix.weights)))) })) return(loglik) }
EMalgo <- function(data, params, threshold.convergence=10^(-2), nb.iter=10, verbose=1){ logliks <- vector() i <- 1 if(verbose > 0) cat(paste("iter ", i, "\n", sep="")) membership.probas <- Estep(data, params) params <- Mstep(data, params, membership.probas) loglik <- GetLogLikelihood(data, params$mus, params$sigmas, params$mix.weights) logliks <- append(logliks, loglik) while(i < nb.iter){ i <- i + 1 if(verbose > 0) cat(paste("iter ", i, "\n", sep="")) membership.probas <- Estep(data, params) params <- Mstep(data, params, membership.probas) loglik <- GetLogLikelihood(data, params$mus, params$sigmas, params$mix.weights) if(loglik < logliks[length(logliks)]){ msg <- paste("the log-likelihood is decreasing:", loglik, "<", logliks[length(logliks)]) stop(msg, call.=FALSE) } logliks <- append(logliks, loglik) if(abs(logliks[i] - logliks[i-1]) <= threshold.convergence) break } return(list(params=params, membership.probas=membership.probas, logliks=logliks, nb.iters=i)) }
## simulate data K <- 3 N <- 300 simul <- GetUnivariateSimulatedData(K, N) data <- simul$obs ## run the EM algorithm params0 <- list(mus=runif(n=K, min=min(data), max=max(data)), sigmas=rep(1, K), mix.weights=rep(1/K, K)) res <- EMalgo(data, params0, 10^(-3), 1000, 1) ## check its convergence plot(res$logliks, xlab="iterations", ylab="log-likelihood", main="Convergence of the EM algorithm", type="b") ## plot the data along with the inferred densities png("mixture_univar_em.png") hist(data, breaks=30, freq=FALSE, col="grey", border="white", ylim=c(0,0.15), main="Histogram of data overlaid with densities inferred by EM") rx <- seq(from=min(data), to=max(data), by=0.1) ds <- lapply(1:K, function(k){dnorm(x=rx, mean=res$params$mus[k], sd=res$params$sigmas[k])}) f <- sapply(1:length(rx), function(i){ res$params$mix.weights[1] * ds[[1]][i] + res$params$mix.weights[2] * ds[[2]][i] + res$params$mix.weights[3] * ds[[3]][i] }) lines(rx, f, col="red", lwd=2) dev.off() It seems to work well, which was expected as the clusters are well separated from each other... The classification of each observation can be obtained via the following command: ## get the classification of the observations memberships <- apply(res$membership.probas, 1, function(x){which(x > 0.5)}) table(memberships)
|