The EM Method

Arthur Pece (aecp@diku.dk). Topics: basic concepts, the EM clustering algorithm, and the relationship of the EM method to ML estimation.


  1. The EM Method. Arthur Pece, aecp@diku.dk
     • Basic concepts
     • EM clustering algorithm
     • EM method and relationship to ML estimation

  2. What is EM?
     • Expectation-Maximization
     • A fairly general optimization method
     • Useful when the model includes 3 kinds of variables and we want to optimize only w.r.t. the parameters:
       • visible variables $x$
       • intermediate variables $h$ (assumed discrete here)
       • parameters/state variables $s$

  3. EM Method
     • A method to obtain ML parameter estimates: maximize the log-likelihood w.r.t. the parameters.
     • Assuming that the $x_i$ are statistically independent, the log-likelihood for the data set is the sum of the log-likelihoods for the data points:
       $$L = \sum_i \log p(x_i \mid s) = \sum_i \log \sum_k p(x_i \mid h_k, s)\, p(h_k \mid s)$$
       (replace the second sum with an integral if the intermediate variables are continuous rather than discrete).
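The log-likelihood above can be evaluated directly for a small model. The following is a minimal sketch, assuming a 1-D Gaussian mixture in which $p(h_k \mid s)$ is the mixing weight and $p(x_i \mid h_k, s)$ is a Gaussian density; the function names and the sample data are illustrative and do not come from the slides.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, weights, means, variances):
    """L = sum_i log sum_k p(x_i | h_k, s) p(h_k | s) for a 1-D Gaussian mixture."""
    total = 0.0
    for x in xs:
        # Inner sum over the discrete intermediate variable h_k.
        mix = sum(w * gaussian_pdf(x, c, v)
                  for w, c, v in zip(weights, means, variances))
        total += math.log(mix)
    return total

# Hypothetical data: two loose groups around -2 and +2.
xs = [-2.1, -1.9, 2.0, 2.2]
print(log_likelihood(xs, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]))
```

The logarithm sits outside the inner sum, which is what makes direct maximization of $L$ awkward.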

  4. EM functional. Given a distribution $q(h)$ for the intermediate variables, we define the EM functional:
     $$Q_q = \sum_i \sum_k q(h_k) \log \left[ p(x_i \mid h_k, s)\, p(h_k \mid s) \right]$$
     This is usually much simpler than the log-likelihood
     $$L = \sum_i \log \sum_k p(x_i \mid h_k, s)\, p(h_k \mid s)$$
     because there is no logarithm of a sum in $Q_q$.

  5. EM iteration. Two steps: E and M
     • E step: $q(h)$ is set equal to the distribution of $h$ conditional on $x_i$ and the current estimate $s^{(t-1)}$ of $s$ (one distribution per data point $x_i$):
       $$q^{(t)}(h_k) = p(h_k \mid x_i, s^{(t-1)})$$
     • M step: the EM functional is maximized w.r.t. $s$ to obtain the new estimate $s^{(t)}$.

  6. Example: EM clustering
     • $m$ data points $x_i$ are generated by $n$ generative processes, each process $j$ generating a fraction $w_j$ of the data points with pdf $f_j(x_i)$, parameterized by the parameter set $s_j$ (which includes $w_j$).
     • We want to estimate the parameters $s_j$ for all processes.

  7. Example: EM clustering
     • Visible variables: $m$ data points $x_i$
     • Intermediate variables: $m \times n$ binary labels $h_{ij}$, with $\sum_j h_{ij} = 1$
     • State variables: $n$ parameter sets $s_j$

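  8. EM clustering for Gaussian pdf's
     • The parameters are weight $w_j$, centroid $c_j$, covariance $A_j$.
     • If we knew which data point belongs to which cluster, we could compute fraction, mean and covariance for each cluster:
       $$w_j = \frac{1}{m} \sum_i h_{ij}, \qquad c_j = \frac{\sum_i h_{ij}\, x_i}{m\, w_j}, \qquad A_j = \frac{\sum_i h_{ij}\, (x_i - c_j)(x_i - c_j)^T}{m\, w_j}$$
       Here $m\, w_j = \sum_i h_{ij}$ is the number of points in cluster $j$.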

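  9. EM clustering (continued)
     • Since we do not know which cluster a data point belongs to, we assign each point to all clusters, with different probabilities $q_{ij}$, $\sum_j q_{ij} = 1$:
       $$w_j = \frac{1}{m} \sum_i q_{ij}, \qquad c_j = \frac{\sum_i q_{ij}\, x_i}{m\, w_j}, \qquad A_j = \frac{\sum_i q_{ij}\, (x_i - c_j)(x_i - c_j)^T}{m\, w_j}$$
       Here $m\, w_j = \sum_i q_{ij}$ is the total probability mass assigned to cluster $j$.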
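The weighted updates can be sketched in a few lines. This is an illustrative version for 1-D data only (so the covariance reduces to a scalar variance), with each weighted sum normalized by the cluster's total mass $\sum_i q_{ij}$. The name `m_step` and the sample data are assumptions, and the sketch presumes every cluster receives non-zero mass.

```python
def m_step(xs, q):
    """Weighted parameter updates for 1-D data.

    q[i][j] is the probability that point i belongs to cluster j.
    Returns (weights, means, variances), one entry per cluster.
    """
    m = len(xs)
    n = len(q[0])
    weights, means, variances = [], [], []
    for j in range(n):
        mass = sum(q[i][j] for i in range(m))  # sum_i q_ij, assumed > 0
        mean = sum(q[i][j] * xs[i] for i in range(m)) / mass
        var = sum(q[i][j] * (xs[i] - mean) ** 2 for i in range(m)) / mass
        weights.append(mass / m)
        means.append(mean)
        variances.append(var)
    return weights, means, variances

# With hard (0/1) assignments this reduces to per-cluster fraction, mean and variance.
xs = [0.0, 2.0, 10.0, 12.0]
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(m_step(xs, q))
```

Passing 0/1 labels recovers the known-assignment case, which is a quick sanity check that the soft version generalizes it.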

  10. EM clustering (continued) • The probabilities qij can be computed from the cluster parameters • Chicken & egg problem: the cluster parameters are needed to compute the probabilities, and the probabilities are needed to compute the cluster parameters

  11. EM clustering (continued) The solution: iterate to convergence: • E step: for each data point and each cluster, compute the probability qij that the point belongs to the cluster (from the cluster parameters) • M step: re-compute the cluster parameters for all clusters by weighted averages over all points (use the equations given 2 slides ago).

  12. How to compute the probability that a given data point originates from a given process?
      • Use Bayes' theorem:
        $$q_{ij} = \frac{w_j\, f_j(x_i)}{\sum_k w_k\, f_k(x_i)}$$
        This is how the cluster parameters are used to compute the $q_{ij}$.
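The Bayes' theorem formula translates almost literally into code. The sketch below assumes 1-D Gaussian pdfs $f_j$; `e_step` and `gaussian_pdf` are illustrative names. It does not guard against the denominator underflowing to zero for points very far from every cluster.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, weights, means, variances):
    """Return q[i][j] = w_j f_j(x_i) / sum_k w_k f_k(x_i)."""
    q = []
    for x in xs:
        joint = [w * gaussian_pdf(x, c, v)
                 for w, c, v in zip(weights, means, variances)]
        total = sum(joint)  # denominator: sum_k w_k f_k(x_i)
        q.append([p / total for p in joint])
    return q

# A point midway between two identical, equally weighted clusters is split evenly;
# a point nearer one centroid is assigned mostly to that cluster.
q = e_step([0.0, 0.9], [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(q)
```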

  13. Non-decreasing log-likelihood in the EM method. Let's return to the general EM method: we want to prove that the log-likelihood does not decrease from one iteration to the next. To do so, we introduce 2 more functionals.

  14. Entropy and Kullback-Leibler divergence. Define the entropy
      $$S(q) = -\sum_i \sum_k q(h_k) \log q(h_k)$$
      and the Kullback-Leibler divergence
      $$D_{KL}[q \,;\, p(h \mid x, s)] = \sum_i \sum_k q(h_k) \log \frac{q(h_k)}{p(h_k \mid x_i, s)}$$

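  15. Non-decreasing log-likelihood I. It can be proven that
      $$L = Q_q + S(q) + D_{KL}[q \,;\, p(h \mid x, s)]$$
      After the E step, $q^{(t)}(h) = p(h \mid x, s^{(t-1)})$ and therefore $D_{KL}$ is zero:
      $$L^{(t-1)} = Q_q^{(t-1)} + S(q^{(t)})$$
      where $Q_q^{(t-1)}$ denotes $Q_q$ evaluated at $s^{(t-1)}$ with $q = q^{(t)}$.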
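The decomposition $L = Q_q + S(q) + D_{KL}$ can be checked numerically for an arbitrary choice of $q$. The sketch below assumes a 1-D Gaussian mixture and one distribution `q[i]` per data point with strictly positive entries; all names and numbers are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def decomposition(xs, q, weights, means, variances):
    """Return (L, Q, S, D_KL) for per-point distributions q[i][k] > 0."""
    L = Q = S = D = 0.0
    for x, qi in zip(xs, q):
        # joint[k] = p(x_i | h_k, s) p(h_k | s)
        joint = [w * gaussian_pdf(x, c, v)
                 for w, c, v in zip(weights, means, variances)]
        evidence = sum(joint)  # p(x_i | s)
        L += math.log(evidence)
        for qk, jk in zip(qi, joint):
            posterior = jk / evidence  # p(h_k | x_i, s)
            Q += qk * math.log(jk)
            S -= qk * math.log(qk)
            D += qk * math.log(qk / posterior)
    return L, Q, S, D

xs = [-1.0, 0.5, 3.0]
q = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]  # arbitrary, not the posterior
L, Q, S, D = decomposition(xs, q, [0.3, 0.7], [0.0, 2.0], [1.0, 1.5])
print(L, Q + S + D, D)
```

The first two printed values should agree up to rounding, and the third should be non-negative; it becomes zero only when `q` equals the posterior, which is what the E step arranges.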

  16. Non-decreasing log-likelihood II. After the M step, $Q_q$ is maximized in standard EM [$Q_q$ is increased but not maximized in GEM (generalized EM), but the result is the same] and therefore:
      $$Q_q^{(t)} \ge Q_q^{(t-1)}$$
      In addition we have that:
      $$L^{(t)} \ge Q_q^{(t)} + S(q^{(t)})$$
      [This is because, for any two pdf's $q$ and $p$: $D_{KL}[q \,;\, p] \ge 0$.]

  17. Non-decreasing log-likelihood III. Putting the above results together:
      $$L^{(t)} \ge Q_q^{(t)} + S(q^{(t)}) \ge Q_q^{(t-1)} + S(q^{(t)}) = L^{(t-1)}$$
      which proves that $L$ is non-decreasing.
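The monotonicity result can be observed empirically by running EM clustering and recording $L$ at every iteration. The following is a minimal sketch for 1-D Gaussian clusters; the data, the starting parameters and the name `em_fit` are assumptions made for illustration, and a small tolerance absorbs floating-point rounding once the iteration has converged.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_fit(xs, weights, means, variances, iterations=20):
    """Run EM on 1-D data; return final parameters and L recorded before each update."""
    m = len(xs)
    history = []
    for _ in range(iterations):
        # E step: q_ij = w_j f_j(x_i) / sum_k w_k f_k(x_i); accumulate L on the way.
        q, L = [], 0.0
        for x in xs:
            joint = [w * gaussian_pdf(x, c, v)
                     for w, c, v in zip(weights, means, variances)]
            total = sum(joint)
            L += math.log(total)
            q.append([p / total for p in joint])
        history.append(L)
        # M step: weighted fraction, mean and variance for every cluster.
        n = len(q[0])
        weights, means, variances = [], [], []
        for j in range(n):
            mass = sum(qi[j] for qi in q)
            mean = sum(qi[j] * x for qi, x in zip(q, xs)) / mass
            var = sum(qi[j] * (x - mean) ** 2 for qi, x in zip(q, xs)) / mass
            weights.append(mass / m)
            means.append(mean)
            variances.append(var)
    return (weights, means, variances), history

# Hypothetical data: two well-separated groups; deliberately poor starting centroids.
xs = [-2.3, -1.9, -2.1, -1.7, 1.8, 2.2, 2.0, 2.4]
params, history = em_fit(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
non_decreasing = all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
print(params[1], non_decreasing)
```

Recording $L$ during the E step costs nothing extra, because the denominator of Bayes' theorem is exactly $p(x_i \mid s)$.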
