The EM Method
Arthur Pece (aecp@diku.dk)
• Basic concepts
• EM clustering algorithm
• EM method and relationship to ML estimation
What is EM?
• Expectation-Maximization
• A fairly general optimization method
• Useful when the model includes 3 kinds of variables:
  • visible variables x
  • intermediate variables h *
  • parameters/state variables s
and we want to optimize only w.r.t. the parameters.
* Here we assume that the intermediate variables are discrete.
EM Method
• A method to obtain ML parameter estimates, i.e. to maximize the log-likelihood w.r.t. the parameters.
Assuming that the x_i are statistically independent, the log-likelihood for the data set is the sum of the log-likelihoods for the data points:

L = \sum_i \log p(x_i | s) = \sum_i \log \sum_k p(x_i | h_k, s) \, p(h_k | s)

(replace the second sum with an integral if the intermediate variables are continuous rather than discrete)
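As a concrete illustration (not part of the slides), a minimal sketch of this log-likelihood assuming Gaussian component densities; the function name and the parameter layout (weights, means, covs) are choices made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, weights, means, covs):
    """L = sum_i log sum_k p(x_i | h_k, s) p(h_k | s), Gaussian components."""
    # per_component[i, k] = p(x_i | h_k, s) * p(h_k | s)
    per_component = np.stack(
        [w * multivariate_normal.pdf(X, mean=mu, cov=C)
         for w, mu, C in zip(weights, means, covs)], axis=1)
    return np.sum(np.log(per_component.sum(axis=1)))
```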
EM functional
Given a pdf q(h) for the intermediate variables, we define the EM functional:

Q_q = \sum_i \sum_k q(h_k) \log \bigl[ p(x_i | h_k, s) \, p(h_k | s) \bigr]

This is usually much simpler than the log-likelihood:

L = \sum_i \log \sum_k p(x_i | h_k, s) \, p(h_k | s)

because there is no logarithm of a sum in Q_q.
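A matching sketch of the EM functional (illustrative, not from the slides), again assuming Gaussian component densities; since in the clustering example below q is a per-point distribution q_ij, it is represented here as an (m, n) array Q.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_functional(X, Q, weights, means, covs):
    """Q_q = sum_i sum_k q_ik log[ p(x_i | h_k, s) p(h_k | s) ]."""
    # log_joint[i, k] = log p(x_i | h_k, s) + log p(h_k | s)
    log_joint = np.stack(
        [multivariate_normal.logpdf(X, mean=mu, cov=C) + np.log(w)
         for w, mu, C in zip(weights, means, covs)], axis=1)
    return np.sum(Q * log_joint)
```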
EM iteration
Two steps: E and M
• E step: q(h) is set equal to the pdf of h conditional on x_i and the current parameter estimate s^{(t-1)}:

q^{(t)}(h_k) = p(h_k | x_i, s^{(t-1)})

• M step: the EM functional is maximized w.r.t. s to obtain the new estimate s^{(t)}.
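The iteration can be written as a generic loop. The sketch below is illustrative: the function names, the convergence test and the tolerance are not part of the slides, and e_step / m_step are supplied by the application (e.g. the clustering steps later on).

```python
def em(X, s0, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM sketch: alternate E and M steps until L stops improving."""
    s, prev_L = s0, -float("inf")
    for _ in range(max_iter):
        q = e_step(X, s)          # q^(t)(h_k) = p(h_k | x_i, s^(t-1))
        s = m_step(X, q)          # maximize the EM functional w.r.t. s
        L = log_likelihood(X, s)
        if L - prev_L < tol:      # L is non-decreasing (proved below)
            break
        prev_L = L
    return s
```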
Example: EM clustering
• m data points x_i are generated by n generative processes, each process j generating a fraction w_j of the data points with pdf f_j(x_i), parameterized by the parameter set s_j (which includes w_j)
• We want to estimate the parameters s_j for all processes
Example: EM clustering
• Visible variables: m data points x_i
• Intermediate variables: m × n binary labels h_{ij}, with \sum_j h_{ij} = 1
• State variables: n parameter sets s_j
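To make the example concrete, an illustrative sketch (not from the slides) that samples m data points from this generative model with Gaussian processes; the particular weights, means and covariances are arbitrary values chosen just to have something to cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
weights = np.array([0.5, 0.3, 0.2])                      # fractions w_j
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
covs = np.array([np.eye(2) * v for v in (1.0, 0.5, 2.0)])

labels = rng.choice(len(weights), size=m, p=weights)     # hidden labels h_ij
X = np.array([rng.multivariate_normal(means[j], covs[j]) for j in labels])
```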
EM clustering for Gaussian pdf’s
• The parameters of cluster j are its weight w_j, centroid c_j, and covariance A_j
• If we knew which data point belongs to which cluster, we could compute the fraction, mean and covariance for each cluster:

w_j = \frac{1}{m} \sum_i h_{ij}

c_j = \frac{\sum_i h_{ij} \, x_i}{\sum_i h_{ij}}

A_j = \frac{\sum_i h_{ij} \, (x_i - c_j)(x_i - c_j)^T}{\sum_i h_{ij}}

(note that \sum_i h_{ij} = m \, w_j)
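A sketch of these statistics for known labels (illustrative, not from the slides); H is an m × n 0/1 label matrix with one 1 per row.

```python
import numpy as np

def hard_label_stats(X, H):
    """Cluster fractions, centroids and covariances from known labels."""
    m, n = H.shape
    counts = H.sum(axis=0)                    # sum_i h_ij = m * w_j
    w = counts / m                            # fractions
    c = (H.T @ X) / counts[:, None]           # centroids
    A = np.empty((n, X.shape[1], X.shape[1]))
    for j in range(n):
        d = X - c[j]
        A[j] = (H[:, j, None] * d).T @ d / counts[j]   # covariances
    return w, c, A
```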
EM clustering (continued)
• Since we do not know which cluster a data point belongs to, we assign each point to all clusters, with different probabilities q_{ij}, \sum_j q_{ij} = 1:

w_j = \frac{1}{m} \sum_i q_{ij}

c_j = \frac{\sum_i q_{ij} \, x_i}{\sum_i q_{ij}}

A_j = \frac{\sum_i q_{ij} \, (x_i - c_j)(x_i - c_j)^T}{\sum_i q_{ij}}

(again \sum_i q_{ij} = m \, w_j)
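The same computation with soft assignments is the M step of EM clustering; a sketch (names illustrative), where Q is the (m, n) matrix of probabilities q_ij:

```python
import numpy as np

def m_step(X, Q):
    """Weighted fractions, centroids and covariances from soft assignments."""
    m, n = Q.shape
    totals = Q.sum(axis=0)                    # sum_i q_ij = m * w_j
    w = totals / m
    c = (Q.T @ X) / totals[:, None]
    A = np.empty((n, X.shape[1], X.shape[1]))
    for j in range(n):
        d = X - c[j]
        A[j] = (Q[:, j, None] * d).T @ d / totals[j]
    return w, c, A
```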
EM clustering (continued)
• The probabilities q_{ij} can be computed from the cluster parameters
• Chicken & egg problem: the cluster parameters are needed to compute the probabilities, and the probabilities are needed to compute the cluster parameters
EM clustering (continued)
The solution: iterate to convergence:
• E step: for each data point and each cluster, compute the probability q_{ij} that the point belongs to the cluster (from the cluster parameters)
• M step: re-compute the cluster parameters for all clusters by weighted averages over all points (use the equations given two slides ago).
How to compute the probability that a given data point originates from a given process?
• Use Bayes’ theorem:

q_{ij} = \frac{w_j \, f_j(x_i)}{\sum_k w_k \, f_k(x_i)}

This is how the cluster parameters are used to compute the q_{ij}.
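Putting the E step (Bayes’ theorem) and the M step (weighted averages) together gives the full EM clustering iteration for Gaussian pdf’s. The sketch below is illustrative: the initialization, the fixed iteration count and the function name are choices not specified in the slides. With the synthetic X generated a few slides back, it can be called as em_cluster(X, 3).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(X, n, n_iter=100, seed=0):
    """Minimal EM clustering sketch for a mixture of n Gaussian pdf's."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.full(n, 1.0 / n)                         # initial fractions
    c = X[rng.choice(m, size=n, replace=False)]     # initial centroids
    A = np.array([np.cov(X.T) for _ in range(n)])   # initial covariances
    for _ in range(n_iter):
        # E step: q_ij = w_j f_j(x_i) / sum_k w_k f_k(x_i)
        f = np.stack([w[j] * multivariate_normal.pdf(X, mean=c[j], cov=A[j])
                      for j in range(n)], axis=1)
        Q = f / f.sum(axis=1, keepdims=True)
        # M step: weighted averages over all points
        totals = Q.sum(axis=0)
        w = totals / m
        c = (Q.T @ X) / totals[:, None]
        A = np.array([((Q[:, j, None] * (X - c[j])).T @ (X - c[j])) / totals[j]
                      for j in range(n)])
    return w, c, A
```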
Non-decreasing log-likelihood in the EM method
Let’s return to the general EM method: we want to prove that the log-likelihood does not decrease from one iteration to the next. To do so we introduce two more functionals.
Entropy and Kullback-Leibler divergence
Define the entropy

S(q) = -\sum_i \sum_k q(h_k) \log q(h_k)

and the Kullback-Leibler divergence

D_{KL}[q ; p(h | x, s)] = \sum_i \sum_k q(h_k) \log \frac{q(h_k)}{p(h_k | x_i, s)}
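Both quantities are easy to compute for the per-point responsibility matrices used in clustering; a small illustrative sketch (Q and P are (m, n) arrays whose rows are distributions over the h_k):

```python
import numpy as np

def entropy(Q):
    """S(q) = -sum_i sum_k q(h_k) log q(h_k), with 0 log 0 taken as 0."""
    mask = Q > 0
    return -np.sum(Q[mask] * np.log(Q[mask]))

def kl_divergence(Q, P):
    """D_KL[q ; p] = sum_i sum_k q(h_k) log( q(h_k) / p(h_k | x_i, s) )."""
    mask = Q > 0
    return np.sum(Q[mask] * np.log(Q[mask] / P[mask]))
```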
Non-decreasing log-likelihood I
It can be proven that

L = Q_q + S(q) + D_{KL}[q ; p(h | x, s)]

After the E step, q^{(t)}(h) = p(h | x, s^{(t-1)}) and therefore D_{KL} is zero:

L^{(t-1)} = Q_q^{(t-1)} + S(q^{(t)})

(here Q_q^{(t-1)} denotes the EM functional evaluated with the new q^{(t)} and the old parameters s^{(t-1)})
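The identity follows from Bayes’ theorem. A short derivation (not in the slides) for a single data point x_i; summing over i gives the three terms Q_q, S(q) and D_{KL} above:

```latex
\begin{aligned}
\log p(x_i | s)
  &= \sum_k q(h_k)\,\log p(x_i | s)
   = \sum_k q(h_k)\,\log \frac{p(x_i | h_k, s)\, p(h_k | s)}{p(h_k | x_i, s)} \\
  &= \underbrace{\sum_k q(h_k)\,\log\bigl[\,p(x_i | h_k, s)\, p(h_k | s)\bigr]}_{\to\; Q_q}
   \;\underbrace{-\,\sum_k q(h_k)\,\log q(h_k)}_{\to\; S(q)}
   \;+\;\underbrace{\sum_k q(h_k)\,\log \frac{q(h_k)}{p(h_k | x_i, s)}}_{\to\; D_{KL}}
\end{aligned}
```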
Non-decreasing log-likelihood II
After the M step, Q_q is maximized in standard EM [Q_q is only increased, not maximized, in GEM (generalized EM), but the result is the same] and therefore:

Q_q^{(t)} \ge Q_q^{(t-1)}

In addition we have that:

L^{(t)} \ge Q_q^{(t)} + S(q^{(t)})

[This is because, for any two pdf’s q and p: D_{KL}[q ; p] \ge 0.]
Non-decreasing log-likelihood III
Putting the above results together:

L^{(t)} \ge Q_q^{(t)} + S(q^{(t)}) \ge Q_q^{(t-1)} + S(q^{(t)}) = L^{(t-1)}

which proves that L is non-decreasing.