
EE-148 Expectation Maximization


Presentation Transcript


  1. EE-148 Expectation Maximization. Markus Weber, 5/11/99

  2. Overview • Expectation Maximization is a technique used to estimate probability densities under missing (unobserved) data. • Density Estimation • Observed vs. Missing Data • EM

  3. Probability Density Estimation: Why is it important? • It is the essence of • Pattern Recognition: estimate p(class|observations) from training data • Learning Theory (includes pattern recognition) • Many other methods use density estimation • HMMs • Kalman Filters

  4. Probability Density Estimation: How does it work? • Given: samples {xi} • Two major philosophies: Parametric • Provide a parametrized class of density functions, e.g. • Gaussians: p(x) = f(x, mean, covariance) • Mixture of Gaussians: p(x) = f(x, means, covariances, mixing weights) • Estimation means finding the parameters which best model the data. Measure? Maximum Likelihood! • Choice of class reflects prior knowledge. Non-Parametric • Density is modeled explicitly through the samples, e.g. • Parzen windows (Rosenblatt, '56; Parzen, '62): make a histogram and convolve with a kernel (could be Gaussian) • K-nearest-neighbor • Prior knowledge less prominent
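To make the two philosophies concrete, here is a minimal Python sketch added for this transcript (not part of the original slides): a single 1-D Gaussian fit by maximum likelihood versus a Parzen-window estimate with a Gaussian kernel. The function names and the bandwidth h are arbitrary choices of the sketch.

```python
import numpy as np

def gaussian_ml_density(samples):
    """Parametric: fit a single 1-D Gaussian by maximum likelihood."""
    mu = samples.mean()
    var = samples.var()  # ML estimate (divides by N)
    return lambda x: np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def parzen_density(samples, h=0.3):
    """Non-parametric: Parzen window with a Gaussian kernel of width h."""
    def p(x):
        x = np.atleast_1d(x)[:, None]                        # shape (M, 1)
        k = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        return k.mean(axis=1)                                # average kernel over samples
    return p

xs = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=500)
print(gaussian_ml_density(xs)(1.0), parzen_density(xs)(1.0))
```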

  5. Maximum Likelihood • The standard method (besides Bayesian inference) for parametric density estimation • Def. Likelihood of a parameter θ (independent samples): L(θ) = p(X|θ) = ∏i p(xi|θ) • Oftentimes one uses the negative log-likelihood: E(θ) = -log L(θ) = -∑i log p(xi|θ) • We can use this as an error function while estimating θ. • For a Gaussian, the sample mean and covariance are the ML estimates!
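As an illustration (not from the slides), a small Python sketch of the negative log-likelihood of i.i.d. samples under a multivariate Gaussian, together with its closed-form minimizers, the sample mean and covariance; the function and variable names are assumptions of this sketch.

```python
import numpy as np

def neg_log_likelihood(X, mu, Sigma):
    """-log prod_i p(x_i | mu, Sigma) for a d-dimensional Gaussian."""
    n, d = X.shape
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)   # per-sample Mahalanobis terms
    logdet = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (quad.sum() + n * (logdet + d * np.log(2 * np.pi)))

def ml_estimates(X):
    """The minimizers of the negative log-likelihood."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / X.shape[0]         # note: divides by N
    return mu, Sigma

X = np.random.default_rng(1).multivariate_normal([0, 0], np.eye(2), size=200)
mu_hat, Sigma_hat = ml_estimates(X)
print(neg_log_likelihood(X, mu_hat, Sigma_hat))
```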

  6. Missing Data Problems • Occur whenever part of the data is unknown • Intrinsically inaccessible, e.g.: • Constellation models: p(X,N,d,h,O), but d,h are not known → how do we get p(O|X,N)? • Gaussian mixture models: which cluster does a data point belong to? • Data is lost/erroneous, e.g.: • Some faulty/noisy process has generated the data. • You erased the wrong file and part of your data is gone. • If the missing data is correlated in any way with the observed data, we can hope to extract information about the missing data from the observed. • If the missing data is independent of the observed, everything is lost.

  7. Example Problem [Figure: joint density p(y,z|θ) with its conditionals p(z|y,θ) (z missing) and p(y|z,θ) (y missing)] • Samples xi ∈ RN from a joint Gaussian • In some xi, some dimensions are lost/unobserved. We know where data is missing. • How can we estimate the parameters of the Gaussian? • How can we replace the missing data? • Essential EM ideas: • If we had an estimate of the joint density, the conditional densities would tell us how the missing data is distributed. • If we had an estimate of the missing data distribution, we could use it to estimate the joint density. • There is a way to iterate the above two steps which will steadily improve the overall likelihood.
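The first of the two EM ideas above rests on the fact that conditioning a joint Gaussian on its observed components yields another Gaussian in closed form. Here is a minimal Python sketch of that conditioning step; the helper name, argument conventions, and index handling are choices of this sketch, not anything from the slides.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, obs_idx, mis_idx, x_obs):
    """Mean and covariance of x_mis | x_obs when (x_obs, x_mis) ~ N(mu, Sigma)."""
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    gain = S_mo @ np.linalg.inv(S_oo)              # regression of missing on observed
    cond_mean = mu_m + gain @ (x_obs - mu_o)
    cond_cov = S_mm - gain @ S_mo.T                # Schur complement
    return cond_mean, cond_cov

# Example: condition the second coordinate on an observed first coordinate.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 1.2], [1.2, 1.5]])
print(conditional_gaussian(mu, Sigma, obs_idx=[0], mis_idx=[1], x_obs=np.array([0.5])))
```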

  8. Expectation Maximization • Task: Estimate p(y,z|μ,Σ) (Gaussian) from the available data. • We want to do maximum likelihood density estimation, although we do not know all the data. If we knew the missing data, we would minimize the negative log-likelihood (o = observed, m = missing): E(θ) = -log p(Xo, Xm|θ) • Since we do not know the missing data, Xm, we instead want to minimize the negative log-likelihood of the observed data, Xo: -log p(Xo|θ) = -log ∫ p(Xo, Xm|θ) dXm • We propose an iterative solution, producing estimates θ1, θ2, ... • But for now we are still stuck with a log of an integral.

  9. Some Rewriting • We write pn(.) for p(.|θn). • We can rewrite the expression for the likelihood: -log p(Xo|θ) = -log ∫ pn(Xm|Xo) [p(Xo,Xm|θ) / pn(Xm|Xo)] dXm • and similarly: -log p(Xo|θn) = -log ∫ pn(Xm|Xo) [p(Xo,Xm|θn) / pn(Xm|Xo)] dXm = -∫ pn(Xm|Xo) log [p(Xo,Xm|θn) / pn(Xm|Xo)] dXm • Now we use Jensen's inequality: -log ∫ q(x) f(x) dx ≤ -∫ q(x) log f(x) dx for any density q • and obtain: -log p(Xo|θ) + log p(Xo|θn) ≤ Q(θn,θ) - Q(θn,θn), where Q(θn,θ) = -∫ pn(Xm|Xo) log p(Xo,Xm|θ) dXm

  10. The Gist • This shows that, if we minimize Q(θn, θ) with respect to its second argument, the likelihood can only increase. • We still need to show that after convergence (at a minimum of Q) we have reached a minimum of the negative log-likelihood. We will skip this part here. • To summarize: At every iteration, we need to minimize the following expression with respect to θn+1: Q(θn, θn+1) = -∫ p(Xm|Xo,θn) log p(Xo,Xm|θn+1) dXm • This corresponds to our initial intuition: We minimize the expected negative log-likelihood of the joint data, where the expectation is computed using the conditional of the previously estimated density.
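For reference, the two steps this amounts to can be written out explicitly (same symbols as above; this restatement is added here and is not taken from a slide):

```latex
% The EM iteration implied above, in the negative log-likelihood convention
% used in these slides (theta_n is the current parameter estimate).
\begin{align*}
\textbf{E-step:}\quad
  Q(\theta_n, \theta)
    &= -\int p(X_m \mid X_o, \theta_n)\,\log p(X_o, X_m \mid \theta)\, dX_m \\
\textbf{M-step:}\quad
  \theta_{n+1} &= \arg\min_{\theta}\; Q(\theta_n, \theta)
\end{align*}
```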

  11. Specific Example • This is what we need to minimize: Q(θn, θn+1) from the previous slide, with θ = (μ, Σ) • Concrete: the Gaussian example, with each sample partitioned as zT = (xoT, xmT): • Update rule for μ: μn+1 = (1/N) ∑i E[zi | xio, μn, Σn], i.e. each sample keeps its observed components and has its missing components replaced by their conditional means. • Update rule for Σ is equally simple to derive.

  12. Solution to Example Problem (for proof see separate handout) • Compute at each iteration: for every sample, the conditional mean and covariance of its missing components given its observed ones (under the current μn, Σn), and from these the updated μn+1 and Σn+1.
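As a concrete illustration of what gets computed at each iteration, here is a minimal Python sketch of EM for a single Gaussian with entries missing at random. It is written for this transcript rather than taken from the handout; the function name, the NaN encoding of missing values, and the fixed iteration count are all assumptions of the sketch.

```python
import numpy as np

def em_gaussian_missing(X, n_iter=50):
    """EM for a joint Gaussian when some entries of X are NaN (missing).

    Returns the estimated mean, covariance, and X with missing entries
    replaced by their conditional means under the final estimate.
    """
    X = np.array(X, dtype=float)
    N, d = X.shape
    miss = np.isnan(X)

    # Crude initialization: per-column means from the observed entries.
    mu = np.nanmean(X, axis=0)
    X_filled = np.where(miss, mu, X)
    Sigma = np.cov(X_filled, rowvar=False) + 1e-6 * np.eye(d)

    for _ in range(n_iter):
        # E-step: fill each sample's missing entries with their conditional mean
        # and accumulate the conditional covariance (the "correction" term).
        X_hat = X_filled.copy()
        C = np.zeros((d, d))
        for i in range(N):
            m = np.where(miss[i])[0]
            o = np.where(~miss[i])[0]
            if m.size == 0:
                continue
            S_oo = Sigma[np.ix_(o, o)]
            S_mo = Sigma[np.ix_(m, o)]
            S_mm = Sigma[np.ix_(m, m)]
            gain = S_mo @ np.linalg.inv(S_oo)
            X_hat[i, m] = mu[m] + gain @ (X[i, o] - mu[o])
            C[np.ix_(m, m)] += S_mm - gain @ S_mo.T
        # M-step: re-estimate mean and covariance from the completed data.
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        Sigma = (diff.T @ diff + C) / N
        X_filled = X_hat

    return mu, Sigma, X_filled
```

One way to exercise this on the demo slide's problem would be to draw samples from a known Gaussian, blank a random subset of entries (setting them to NaN), and compare the recovered mean and covariance with the true ones.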

  13. Demo

  14. Other Applications of EM • Estimating mixture densities • Learning constellation models from unlabeled data • Many problems can be formulated in an EM framework: • HMMs • PCA • Latent variable models • “condensation” algorithm (learning complex motions) • many computer vision problems
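As an example of the first application listed above, estimating mixture densities: a compact Python sketch of EM for a Gaussian mixture, where the unknown component labels from slide 6 ("which cluster does a data point belong to?") play the role of the missing data. The function names, initialization, and small covariance jitter are choices of this sketch, not of the lecture.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of each row of X under N(mu, Sigma)."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture; component labels are the missing data."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mus = X[rng.choice(N, K, replace=False)]           # initialize means at random samples
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    weights = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(component k | x_i, current parameters).
        r = np.column_stack([w * gaussian_pdf(X, m, S)
                             for w, m, S in zip(weights, mus, Sigmas)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates for each component.
        Nk = r.sum(axis=0)
        weights = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return weights, mus, Sigmas
```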
