EM Algorithm: Intuition and Explanation through Gaussian Mixture Models

Pranik Chainani
4 min read · Jul 14, 2021

--

In continuation of our goal of mixture modelling and approximating optimal parameters, we will utilize the Expectation-Maximization (EM) algorithm. This post offers some insight and intuition into this magnificent approach to parameter estimation.

To recap, we used maximum likelihood estimation (MLE) to model multi-modal distributions by finding the set of parameters that maximizes the joint probability of the observed data under those parameters.

Last time, we were able to deploy a numerical method for finding the MLE by reparameterizing our constraints. However, this reparameterization step may be difficult to compute, often due to the influence of unobservable, hidden variables. As such, EM offers a great, statistically flavored alternative that is fast and easy to implement.

EM Algorithm

Suppose we have a training dataset with latent variables (such as cluster assignments) that interact with the data but aren’t directly observed. Direct maximization of the likelihood function then becomes largely intractable. In response to this, the EM algorithm, at its crux, follows an iterative method that computes the MLE in the presence of these latent variables.

The EM algorithm does this by following an alternating two-step mechanism: it first estimates the latent variables themselves and then optimizes our model parameters accordingly, repeating until convergence is observed.

These two steps are clearly defined as follows:

E-Step: estimate the latent variables (cluster values)
-> the E-step softly assigns each data point to the cluster distributions; that is, for every point we compute the probability that it came from one sub-distribution rather than the other.

M-Step: maximize the likelihood over the model parameters, given the estimated latent variables (cluster values)
-> the M-step then updates the parameters of each sub-distribution (mean, variance, mixture probability), weighting every data point by the cluster responsibilities found in the E-step, as written out below.
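Concretely, for a mixture of K one-dimensional Gaussians with mixing weights πk, means μk and variances σk², the two steps amount to the following standard updates, where γnk denotes the responsibility of cluster k for data point xn:

```latex
\text{E-step:}\qquad
\gamma_{nk} = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \sigma_k^2)}
                   {\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_n \mid \mu_j, \sigma_j^2)}

\text{M-step:}\qquad
N_k = \sum_{n=1}^{N} \gamma_{nk},\qquad
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_{nk}\, x_n,\qquad
\sigma_k^2 = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_{nk}\,(x_n - \mu_k)^2,\qquad
\pi_k = \frac{N_k}{N}
```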

Example: EM for Gaussian Mixture Modelling

Suppose our overall density is composed of two normal sub-distributions mixed together, with π denoting the mixture probability.
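Written out, with π weighting the first component, this density is:

```latex
p(x) = \pi\,\mathcal{N}(x \mid \mu_1, \sigma_1^2) + (1 - \pi)\,\mathcal{N}(x \mid \mu_2, \sigma_2^2)
```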

Recall how we calculated our log-likelihood from last time:
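For observations x_1, …, x_N drawn from this mixture, it reads:

```latex
\ell(\pi, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)
  = \sum_{n=1}^{N} \log\!\left[\pi\,\mathcal{N}(x_n \mid \mu_1, \sigma_1^2)
    + (1 - \pi)\,\mathcal{N}(x_n \mid \mu_2, \sigma_2^2)\right]
```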

Direct maximization of this log-likelihood looks quite disturbing :( because the sum over the two components sits inside the logarithm, setting the derivatives to zero yields no closed-form solution.

Thankfully, the EM algorithm comes to our rescue. Let us define our latent variable z (an integer) to store the cluster group that each data point belongs to.

Our iterative updates are then as follows (where K is the total number of clusters; the algorithm below is written with respect to cluster 1, and the remaining cluster k is weighted accordingly through p(z_n = k | x_n)).
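A minimal NumPy sketch of these updates, assuming a one-dimensional, two-component mixture (the function name em_gmm and its interface are just illustrative choices):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, n_iter=100, seed=0):
    """EM for a two-component 1-D Gaussian mixture.

    Returns the fitted parameters and the log-likelihood trace.
    """
    rng = np.random.default_rng(seed)
    # Initialize: mixture weight, means, standard deviations.
    pi = 0.5
    mu = rng.choice(x, size=2, replace=False)
    sigma = np.array([x.std(), x.std()])
    log_lik = []

    for _ in range(n_iter):
        # E-step: responsibility of cluster 1 for each point,
        # i.e. p(z_n = 1 | x_n) under the current parameters.
        p1 = pi * norm.pdf(x, mu[0], sigma[0])
        p2 = (1 - pi) * norm.pdf(x, mu[1], sigma[1])
        resp = p1 / (p1 + p2)          # cluster 1; cluster 2 gets (1 - resp)

        # M-step: responsibility-weighted updates of each component.
        n1, n2 = resp.sum(), (1 - resp).sum()
        mu = np.array([(resp * x).sum() / n1,
                       ((1 - resp) * x).sum() / n2])
        sigma = np.sqrt(np.array([(resp * (x - mu[0]) ** 2).sum() / n1,
                                  ((1 - resp) * (x - mu[1]) ** 2).sum() / n2]))
        pi = n1 / len(x)

        # Track the observed-data log-likelihood for this iteration.
        log_lik.append(np.log(pi * norm.pdf(x, mu[0], sigma[0])
                              + (1 - pi) * norm.pdf(x, mu[1], sigma[1])).sum())

    return pi, mu, sigma, np.array(log_lik)
```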

These weights for each cluster group are simply formulated as the expectation that a data point belongs to a given cluster, conditioned on our parameters and data; for any particular assignment of z, the indicators of the remaining clusters are 0.

The great thing about Gaussian Mixture Models and the EM algorithm is that they offer a more robust clustering model than the commonly used K-means, since the K-means algorithm merely partitions the data into K clusters based on a distance metric, as opposed to quantifying the level of uncertainty of each assignment through latent variables.

EM Algorithm in Action!

EM clustering on Old Faithful eruption data (source: Wikipedia)

Gaussian Mixture Models offer great statistical modelling power for naturally occurring data, and are best used for estimating the mean locations and variance/covariance structure of normally distributed data. And since GMMs are powered by the EM algorithm, it is clear how essential this approximation algorithm is to the statistical modelling world!

What particularly highlights the power of GMMs is that they are entirely unsupervised, so there is no need for extensively labeled data for our model to learn a distribution.

And the advantage of incorporating the EM algorithm is that we are ALWAYS guaranteed that the likelihood does not decrease from one iteration to the next, a guarantee that many generic numerical solvers cannot make (though EM, like them, can still settle into a suboptimal local maximum). Furthermore, the E-step and the M-step are much easier to calculate than going through extensive reparameterization measures, which are often unintuitive and may not always guarantee the most optimal solution.
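Using the sketch above, this monotone improvement is easy to check empirically on synthetic data (again, purely an illustrative usage):

```python
import numpy as np

# Synthetic data drawn from two Gaussians, purely for illustration.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

pi, mu, sigma, log_lik = em_gmm(x)
print("weights:", pi, 1 - pi, "means:", mu, "sigmas:", sigma)

# The log-likelihood trace should never decrease (up to numerical noise).
assert np.all(np.diff(log_lik) >= -1e-8)
```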
