Expectation Maximization Algorithm

Rong Jin

A Mixture Model Problem • Apparently, the dataset consists of two modes • How can we automatically identify the two modes?

Gaussian Mixture Model (GMM) • Assume that the dataset is generated by a mixture of two Gaussian distributions • Gaussian model 1: • Gaussian model 2: • If we know the membership of each bin, estimating the two Gaussian models is easy • How can we estimate the two Gaussian models without knowing the memberships of the bins?
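
The density formulas on this slide did not survive extraction. As a sketch in assumed notation (the mixing weights $p_k$, means $\mu_k$, variances $\sigma_k^2$ and membership indicators $z_i$ are labels chosen here, not necessarily the slide's), a one-dimensional two-component model and the closed-form estimates available when memberships are known could be written as:

```latex
\begin{align*}
p(x \mid \theta_k) &= \frac{1}{\sqrt{2\pi\sigma_k^2}}
    \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right), \qquad k = 1, 2 \\
% With known memberships z_i, each model is fit from its own points only:
n_k &= \sum_{i=1}^{n} [z_i = k], \qquad p_k = \frac{n_k}{n} \\
\mu_k &= \frac{1}{n_k} \sum_{i:\, z_i = k} x_i, \qquad
\sigma_k^2 = \frac{1}{n_k} \sum_{i:\, z_i = k} (x_i - \mu_k)^2
\end{align*}
```

The difficulty raised in the last bullet is that the $z_i$ are not observed, so these sums cannot be formed directly.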

EM Algorithm for GMM • Let the memberships be hidden variables • EM algorithm for the Gaussian mixture model • Unknown memberships: • Unknown Gaussian models: • Learn these two sets of parameters iteratively

Start with A Random Guess • Randomly assign a membership to each bin • Estimate the mean and variance of each Gaussian model
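
The random start can be sketched as follows. This is a minimal illustration, not the slides' own code: it treats every bin as a single data point, and the function name `random_init`, the seed, and the small variance floor are assumptions made here.

```python
import random

def random_init(data, seed=0):
    """Randomly assign each point to component 0 or 1, then fit each Gaussian."""
    rng = random.Random(seed)
    while True:
        labels = [rng.randint(0, 1) for _ in data]
        if 0 < sum(labels) < len(data):  # re-draw until both components are non-empty
            break
    models = []
    for k in (0, 1):
        members = [x for x, z in zip(data, labels) if z == k]
        mean = sum(members) / len(members)
        var = sum((x - mean) ** 2 for x in members) / len(members)
        # Assumed safeguard: floor the variance so a one-point component stays valid.
        models.append((len(members) / len(data), mean, max(var, 1e-6)))
    return labels, models

data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels, models = random_init(data, seed=0)  # models[k] = (weight, mean, variance)
```

Because the labels are random, the two initial Gaussians usually overlap heavily; the later iterations are what pull them apart.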

E-step • Fix the two Gaussian models • Estimate the posterior for each data point
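
A minimal sketch of this step, assuming one-dimensional data and models given as parallel lists of mixing weights, means and variances (the names `gaussian_pdf` and `e_step` are chosen here for illustration):

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(data, weights, means, variances):
    """Return posteriors[i][k] = Pr(component k | x_i), with the models held fixed."""
    posteriors = []
    for x in data:
        # Joint probability of the point and each component (Bayes' rule numerator).
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # can underflow to 0 for extreme outliers; not handled here
        posteriors.append([j / total for j in joint])
    return posteriors

post = e_step([0.0, 5.0, 2.5], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

In this toy call the point at 2.5 lies exactly halfway between the two means, so its membership is split evenly, while the points at 0.0 and 5.0 are assigned almost entirely to one component each.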

EM Algorithm for GMM • Re-estimate the memberships for each bin

M-step • Fix the memberships • Re-estimate the two Gaussian models, with each data point weighted by its posterior
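
The posterior-weighted re-estimation can be sketched like this. The function name `m_step` and the list-of-lists layout for the posteriors are assumptions made for the example.

```python
def m_step(data, posteriors):
    """Re-estimate weight, mean and variance of each component from soft memberships."""
    n = len(data)
    weights, means, variances = [], [], []
    for k in range(len(posteriors[0])):
        resp = [p[k] for p in posteriors]   # posterior weight of each point for component k
        total = sum(resp)                   # effective number of points in component k
        mean = sum(r * x for r, x in zip(resp, data)) / total
        var = sum(r * (x - mean) ** 2 for r, x in zip(resp, data)) / total
        weights.append(total / n)
        means.append(mean)
        variances.append(var)
    return weights, means, variances

# With hard (0/1) posteriors the update reduces to ordinary per-group estimates.
data = [1.0, 2.0, 3.0, 10.0, 12.0]
hard = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
weights, means, variances = m_step(data, hard)
```

The hard-assignment call is only a sanity check: it recovers the per-group means 2.0 and 11.0, which is the "easy" case where memberships are known.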

EM Algorithm for GMM • Re-estimate the memberships for each bin • Re-estimate the models
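
Putting the two steps together gives the full loop. The sketch below is an illustration under stated assumptions: synthetic data drawn from two Gaussians centred at 0 and 6, a deliberately poor starting guess, and a fixed iteration count instead of a convergence test. None of these values come from the slides.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm(data, weights, means, variances, iterations=100):
    """Alternate E- and M-steps; return final parameters and the log-likelihood trace."""
    trace = []
    for _ in range(iterations):
        # E-step: posterior membership of every point under the current models.
        posts, loglik = [], 0.0
        for x in data:
            joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            loglik += math.log(total)
            posts.append([j / total for j in joint])
        trace.append(loglik)  # log-likelihood of the parameters entering this iteration
        # M-step: posterior-weighted re-estimation of every model.
        new_w, new_m, new_v = [], [], []
        for k in range(len(means)):
            resp = [p[k] for p in posts]
            total = sum(resp)
            mean = sum(r * x for r, x in zip(resp, data)) / total
            var = sum(r * (x - mean) ** 2 for r, x in zip(resp, data)) / total
            new_w.append(total / len(data))
            new_m.append(mean)
            new_v.append(var)
        weights, means, variances = new_w, new_m, new_v
    return weights, means, variances, trace

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]
weights, means, variances, trace = em_gmm(data, [0.5, 0.5], [1.0, 2.0], [1.0, 1.0])
```

The `trace` list records the log-likelihood at each iteration, which makes it easy to check that the value never decreases from one iteration to the next.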

At the 5-th Iteration • Red Gaussian component slowly shifts toward the left end of the x axis

At the 10-th Iteration • Red Gaussian component still slowly shifts toward the left end of the x axis

At the 20-th Iteration • Red Gaussian component makes a more noticeable shift toward the left end of the x axis

At the 50-th Iteration • Red Gaussian component is close to the desirable location

At the 100-th Iteration • The results are almost identical to the ones for the 50-th iteration

EM as A Bound Optimization • EM algorithm in fact maximizes the log-likelihood function of training data • Likelihood for a data point x • Log-likelihood of training data
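
The two formulas named on this slide were lost in extraction. For the two-component model, and assuming mixing weights $p_k$ and component densities $p(x \mid \theta_k)$ (notation chosen here for illustration), they would typically read:

```latex
\begin{align*}
p(x; \theta) &= \sum_{k=1}^{2} p_k \, p(x \mid \theta_k) \\
\ell(\theta) &= \sum_{i=1}^{n} \log p(x_i; \theta)
             = \sum_{i=1}^{n} \log \sum_{k=1}^{2} p_k \, p(x_i \mid \theta_k)
\end{align*}
```

The sum inside the logarithm is what rules out a closed-form maximizer and motivates optimizing a bound instead.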

Logarithm Bound Algorithm • Start with an initial guess • Come up with a lower bound that touches the objective at the current guess (the touch point) • Search for the optimal solution that maximizes the lower bound • Repeat the procedure • Converge to a local optimum (the optimal point)
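
Why each round cannot decrease the objective can be sketched in three lines. The symbols are introduced here for illustration: $\ell(\theta)$ is the objective, $\theta_t$ the current guess, and $Q(\theta; \theta_t)$ the lower bound built at $\theta_t$.

```latex
\begin{align*}
& Q(\theta; \theta_t) \le \ell(\theta) \ \text{ for all } \theta,
  \qquad Q(\theta_t; \theta_t) = \ell(\theta_t) \\
& \theta_{t+1} = \arg\max_{\theta} \, Q(\theta; \theta_t) \\
& \ell(\theta_{t+1}) \;\ge\; Q(\theta_{t+1}; \theta_t) \;\ge\; Q(\theta_t; \theta_t) \;=\; \ell(\theta_t)
\end{align*}
```

The chain only guarantees monotone improvement, which is why the procedure may stop at a local rather than a global optimum.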

EM as A Bound Optimization • Parameter for previous iteration: • Parameter for current iteration: • Compute

Concave property of logarithm function

Definition of posterior
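
The derivation these annotations refer to was lost with the slide images. A standard reconstruction, offered as a sketch under assumed notation ($\theta'$ for the previous iteration's parameters, $\theta$ for the current ones, $k$ indexing the hidden membership), uses exactly the two ingredients named above:

```latex
\begin{align*}
\ell(\theta) - \ell(\theta')
  &= \sum_{i} \log \frac{\sum_{k} p(x_i, k; \theta)}{p(x_i; \theta')} \\
  &= \sum_{i} \log \sum_{k} \Pr(k \mid x_i; \theta') \,
       \frac{p(x_i, k; \theta)}{p(x_i, k; \theta')}
     && \text{definition of posterior: }
        \Pr(k \mid x_i; \theta') = \frac{p(x_i, k; \theta')}{p(x_i; \theta')} \\
  &\ge \sum_{i} \sum_{k} \Pr(k \mid x_i; \theta') \,
       \log \frac{p(x_i, k; \theta)}{p(x_i, k; \theta')}
     && \text{concavity of the logarithm}
\end{align*}
```

The right-hand side equals zero at $\theta = \theta'$, so it is a lower bound that touches the objective at the previous parameters. Maximizing it over $\theta$ is the M-step, and computing the posteriors that define it is the E-step.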

Log-Likelihood of EM Algorithm • Saddle points

Maximize GMM Model • What is the globally optimal solution to GMM? • Maximizing the objective function of GMM is an ill-posed problem
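
One common reading of "ill-posed" here is that the log-likelihood is unbounded: if one component sits exactly on a data point and its variance shrinks toward zero, the likelihood grows without limit. The toy numbers below are assumptions chosen only to show that effect.

```python
import math

def log_likelihood(data, weights, means, variances):
    """Log-likelihood of 1-D data under a Gaussian mixture."""
    total = 0.0
    for x in data:
        mix = sum(
            w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(mix)
    return total

data = [0.0, 1.0, 2.0, 3.0]
values = []
for var in (1.0, 1e-2, 1e-4, 1e-6):
    # Component 1 is centred on the data point 0.0 and made ever narrower;
    # component 2 stays broad so every other point keeps non-zero density.
    values.append(log_likelihood(data, [0.5, 0.5], [0.0, 1.5], [var, 1.25]))
```

Each entry of `values` is larger than the one before, so no finite parameter setting attains the supremum. In practice this is usually handled with a variance floor or a prior.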

Identify Hidden Variables • For certain learning problems, identifying hidden variables is not an easy task • Consider a simple translation model • For a pair of English and Chinese sentences: • A simple translation model is • The log-likelihood of the training corpus

Identify Hidden Variables • Consider a simple case • Alignment variable a(i) • Rewrite
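
The formulas for this model are missing from the extracted text. One form consistent with the bullets, given here as an assumption rather than as the slide's exact content, scores an English sentence $e = (e_1, \dots, e_m)$ against a Chinese sentence $c = (c_1, \dots, c_l)$ by letting every English word be generated by some Chinese word:

```latex
\begin{align*}
\Pr(e \mid c) &\propto \prod_{i=1}^{m} \sum_{k=1}^{l} \Pr(e_i \mid c_k) \\
\ell &= \sum_{(e, c)} \sum_{i=1}^{m} \log \sum_{k=1}^{l} \Pr(e_i \mid c_k) + \text{const} \\
% Rewriting with an alignment variable a(i) \in \{1, \dots, l\} for each English word:
\prod_{i=1}^{m} \sum_{k=1}^{l} \Pr(e_i \mid c_k)
  &= \sum_{a(1)=1}^{l} \cdots \sum_{a(m)=1}^{l} \; \prod_{i=1}^{m} \Pr(e_i \mid c_{a(i)})
\end{align*}
```

Under this reading the alignment $a(i)$ plays the same role as the membership in the mixture model: it is the hidden variable whose posterior the E-step computes.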

EM Algorithm for A Translation Model • Introduce an alignment variable for each translation pair • EM algorithm for the translation model • E-step: compute the posterior for each alignment variable • M-step: estimate the translation probability Pr(e|c) • We are lucky here. In general, this step can be extremely difficult and usually requires approximate approaches
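
The two steps can be sketched for a word-level model in which each English word is generated by one Chinese word of the paired sentence. The three-sentence corpus (with pinyin tokens standing in for Chinese words), the function name, and the uniform initialization are all assumptions made for this illustration.

```python
from collections import defaultdict

def train_translation_model(corpus, iterations=30):
    """EM for word translation probabilities t[(e, c)] = Pr(e | c)."""
    english_vocab = {e for es, _ in corpus for e in es}
    # Uniform start: every English word is equally likely given any Chinese word.
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times c generates e
        total = defaultdict(float)  # expected number of times c generates anything
        for es, cs in corpus:
            for e in es:
                norm = sum(t[(e, c)] for c in cs)
                for c in cs:
                    # E-step: posterior that e is aligned to c within this pair.
                    post = t[(e, c)] / norm
                    count[(e, c)] += post
                    total[c] += post
        # M-step: normalize expected counts into translation probabilities.
        t = defaultdict(float, {(e, c): count[(e, c)] / total[c] for (e, c) in count})
    return t

corpus = [
    (["red", "book"], ["hong", "shu"]),
    (["book"], ["shu"]),
    (["red", "pen"], ["hong", "bi"]),
]
t = train_translation_model(corpus, iterations=30)
```

The posterior factorizes over English words, so the E-step is a simple normalization per word. That is the sense in which this model is a lucky case: models whose alignments interact do not factorize this way.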

Compute Pr(e|c) • First compute

Bound Optimization for A Translation Model

Iterative Scaling • Maximum entropy model • Iterative scaling • All features • The sum of the features is constant

Iterative Scaling • Compute the empirical mean for each feature of every class, i.e., for every j and every class y • Start with w1, w2, …, wc = 0 • Repeat • Compute p(y|x) for each training data point (xi, yi) using w from the previous iteration • Compute the mean of each feature of every class using the estimated probabilities, i.e., for every j and every y • Compute for every j and every y • Update w as
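
The update formulas in these steps were lost in extraction. The sketch below fills them in with the generalized iterative scaling rule, w[y][j] += log(empirical / expected) / C, which is an assumption about what the slide showed. It also assumes non-negative features whose sum equals the same constant C for every input, and a toy dataset invented for the example.

```python
import math

def predict(weights, x):
    """p(y | x) under the maximum entropy (softmax) model."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def iterative_scaling(data, num_classes, iterations=500):
    dim = len(data[0][0])
    C = sum(data[0][0])  # assumed to be the same for every x in the data
    # Empirical feature sums per class (means differ only by a common factor 1/n).
    empirical = [[sum(x[j] for x, y in data if y == c) for j in range(dim)]
                 for c in range(num_classes)]
    weights = [[0.0] * dim for _ in range(num_classes)]
    trace = []
    for _ in range(iterations):
        probs = [predict(weights, x) for x, _ in data]
        trace.append(sum(math.log(p[y]) for p, (_, y) in zip(probs, data)))
        # Feature sums per class under the current model's probabilities.
        expected = [[sum(p[c] * x[j] for p, (x, _) in zip(probs, data)) for j in range(dim)]
                    for c in range(num_classes)]
        for c in range(num_classes):
            for j in range(dim):
                weights[c][j] += math.log(empirical[c][j] / expected[c][j]) / C
    return weights, trace, empirical

# Toy data: two non-negative features that sum to 1, labels not perfectly separable.
data = [((0.9, 0.1), 0), ((0.8, 0.2), 0), ((0.4, 0.6), 0),
        ((0.2, 0.8), 1), ((0.1, 0.9), 1), ((0.6, 0.4), 1)]
weights, trace, empirical = iterative_scaling(data, num_classes=2)
```

At a fixed point the log ratio is zero, meaning the model's expected features match the empirical ones. This sketch would fail on a class whose empirical feature sum is zero, since the logarithm is then undefined.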

Iterative Scaling • Can we use the concave property of the logarithm function? • No, we can't, because we need a lower bound

Iterative Scaling • The weights are still coupled with each other • Further decomposition is still needed
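
The equations behind these slides are missing. A standard derivation that matches the bullets is sketched here under assumed notation: $\delta$ is the change in weights, $p(y \mid x_i; w)$ the current model, and features satisfy $x_{ij} \ge 0$ with $\sum_j x_{ij} = C$. The logarithm enters with a minus sign, so concavity would bound it from the wrong side. The inequality $-\log z \ge 1 - z$ is used instead, followed by convexity of the exponential to decouple the weights:

```latex
\begin{align*}
\ell(w + \delta) - \ell(w)
  &= \sum_i \Big[ \delta_{y_i} \cdot x_i
       - \log \sum_y p(y \mid x_i; w) \, e^{\delta_y \cdot x_i} \Big] \\
  &\ge \sum_i \Big[ \delta_{y_i} \cdot x_i + 1
       - \sum_y p(y \mid x_i; w) \, e^{\delta_y \cdot x_i} \Big]
     && (-\log z \ge 1 - z) \\
  &\ge \sum_i \Big[ \sum_j \delta_{y_i j} \, x_{ij} + 1
       - \sum_y p(y \mid x_i; w) \sum_j \frac{x_{ij}}{C} \, e^{C \delta_{yj}} \Big]
     && \text{(convexity of } \exp\text{)} \\
\delta_{yj} &= \frac{1}{C} \log
     \frac{\sum_i [y_i = y] \, x_{ij}}{\sum_i p(y \mid x_i; w) \, x_{ij}}
     && \text{(setting the derivative to zero)}
\end{align*}
```

After the second inequality each $\delta_{yj}$ appears in its own term, so the bound can be maximized one weight at a time.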

Iterative Scaling • Wait a minute, this cannot be right! What happened?

Logarithm Bound Algorithm • Start with an initial guess • Come up with a lower bound • Search for the optimal solution that maximizes the lower bound
