1 / 75

Andrew Blake, Microsoft Research and Bill Freeman, MIT, ICCV 2003 and

Introduction to Expectation Maximization Assembled and extended by Longin Jan Latecki Temple University, latecki@temple.edu based on slides by. Andrew Blake, Microsoft Research and Bill Freeman, MIT, ICCV 2003 and Andrew W. Moore, Carnegie Mellon University.

enoch
Download Presentation

Andrew Blake, Microsoft Research and Bill Freeman, MIT, ICCV 2003 and

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Expectation MaximizationAssembled and extended by Longin Jan LateckiTemple University, latecki@temple.edubased on slides by Andrew Blake, Microsoft Research and Bill Freeman, MIT, ICCV 2003 and Andrew W. Moore, Carnegie Mellon University

  2. Learning and vision: Generative Methods • Machine learning is an important direction in computer vision. • Our goal for this class: • Give overviews of useful techniques. • Show how these methods can be used in vision. • Provide references and pointers.

  3. What is the goal of vision? If you are asking, “Are there any faces in this image?”, then you would probably want to use discriminative methods.

  4. What is the goal of vision? If you are asking, “Are there any faces in this image?”, then you would probably want to use discriminative methods. If you are asking, “Find a 3-d model that describes the runner”, then you would use generative methods.

  5. Modeling So we want to look at high-dimensional visual data, and fit models to it; forming summaries of it that let us understand what we see.

  6. The simplest data to model:a set of 1–d samples

  7. Fit this distribution with a Gaussian

  8. Posterior probability Likelihood function Prior probability mean data points std. dev. Evidence By Bayes rule How find the parameters of the best-fitting Gaussian?

  9. Posterior probability Likelihood function Prior probability mean data points std. dev. Evidence Maximum likelihood parameter estimation: How find the parameters of the best-fitting Gaussian?

  10. Derivation of MLE for Gaussians Observation density Log likelihood Maximisation

  11. Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution Mean Variance Covariance Matrix

  12. Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution Mean Variance For vector-valued data, we have the Covariance Matrix

  13. Model fitting example 2: Fit a line to observed data y x

  14. Maximum likelihood estimation for the slope of a single line Data likelihood for point n: Maximum likelihood estimate: where gives regression formula

  15. Model fitting example 3:Fitting two lines to observed data y x

  16. MLE for fitting a line pair (a form of mixture dist. for )

  17. Line 1 Line 2 Fitting two lines: on the one hand… If we knew which points went with which lines, we’d be back at the single line-fitting problem, twice. y x

  18. Fitting two lines, on the other hand… We could figure out the probability that any point came from either line if we just knew the two equations for the two lines. y x

  19. Expectation Maximization (EM): a solution to chicken-and-egg problems

  20. EM example:

  21. EM example:

  22. EM example:

  23. EM example:

  24. EM example: Converged!

  25. MLE with hidden/latent variables:Expectation Maximisation General problem: data parameters hidden variables For MLE, want to maximise the log likelihood The sum over z inside the log gives a complicated expression for the ML solution.

  26. The EM algorithm We don’t know the values of the labels, zi , but let’s use its expected value under its posterior with the current parameter values, old. That gives us the “expectation step”: “E-step” Now let’s maximize this Q function, an expected log-likelihood, over the parameter values, giving the “maximization step”: “M-step” Each iteration increases the total log-likelihood log p(y|)

  27. Expectation Maximisation applied to fitting the two lines Need: /2 and then: and maximising that gives associate data point with line Hidden variables and probabilities of association are : ’ ’

  28. EM fitting to two lines with /2 “E-step” and repeat Regression becomes: “M-step”

  29. Experiments: EM fitting to two lines (from a tutorial by Yair Weiss, http://www.cs.huji.ac.il/~yweiss/tutorials.html) Line weights line 1 line 2 Iteration 1 2 3

  30. Applications of EM in computer vision • Image segmentation • Motion estimation combined with perceptual grouping • Polygonal approximation of edges

  31. Next… back to Density EstimationWhat if we want to do density estimation with multimodal or clumpy data?

  32. m1 The GMM assumption • There are k components. The i’th component is called wi • Component wi has an associated mean vector mi m2 m3

  33. m1 The GMM assumption • There are k components. The i’th component is called wi • Component wi has an associated mean vector mi • Each component generates data from a Gaussian with mean mi and covariance matrix s2I • Assume that each datapoint is generated according to the following recipe: m2 m3

  34. The GMM assumption • There are k components. The i’th component is called wi • Component wi has an associated mean vector mi • Each component generates data from a Gaussian with mean mi and covariance matrix s2I • Assume that each datapoint is generated according to the following recipe: • Pick a component at random. Choose component i with probability P(wi). m2

  35. The GMM assumption • There are k components. The i’th component is called wi • Component wi has an associated mean vector mi • Each component generates data from a Gaussian with mean mi and covariance matrix s2I • Assume that each datapoint is generated according to the following recipe: • Pick a component at random. Choose component i with probability P(wi). • Datapoint ~ N(mi, s2I ) m2 x

  36. m1 The General GMM assumption • There are k components. The i’th component is called wi • Component wi has an associated mean vector mi • Each component generates data from a Gaussian with mean mi and covariance matrix Si • Assume that each datapoint is generated according to the following recipe: • Pick a component at random. Choose component i with probability P(wi). • Datapoint ~ N(mi, Si ) m2 m3

  37. Unsupervised Learning:not as hard as it looks IN CASE YOU’RE WONDERING WHAT THESE DIAGRAMS ARE, THEY SHOW 2-d UNLABELED DATA (X VECTORS) DISTRIBUTED IN 2-d SPACE. THE TOP ONE HAS THREE VERY CLEAR GAUSSIAN CENTERS

  38. Computing likelihoods in unsupervised case We have x1, x2 , … xN We know P(w1) P(w2) .. P(wk) We knowσ P(x|wi, μi, … μk) = Prob that an observation from class wi would have value x given class means μ1… μk Can we write an expression for that?

  39. likelihoods in unsupervised case We have x1 x2 … xn We have P(w1) .. P(wk). We have σ. We can define, for any x , P(x|wi , μ1, μ2..μk) Can we define P(x | μ1, μ2..μk) ? Can we define P(x1, x2, .. xn| μ1, μ2..μk) ? [YES, IF WE ASSUME THE X1’S WERE DRAWN INDEPENDENTLY]

  40. Unsupervised Learning:Mediumly Good News We now have a procedure s.t. if you give me a guess at μ1, μ2..μk, I can tell you the prob of the unlabeled data given those μ‘s. Suppose x‘s are 1-dimensional. There are two classes; w1 and w2 P(w1) = 1/3 P(w2) = 2/3 σ = 1 . There are 25 unlabeled datapoints (From Duda and Hart) x1 = 0.608 x2 = -1.590 x3 = 0.235 x4 = 3.949 : x25 = -0.712

  41. Duda & Hart’s Example Graph of log P(x1, x2 .. x25 | μ1, μ2 ) against μ1 () and μ2 () Max likelihood = (μ1=-2.13, μ2 =1.668) Local minimum, but very close to global at (μ1=2.085, μ2 =-1.257)* * corresponds to switching w1 with w2.

  42. Duda & Hart’s Example We can graph the prob. dist. function of data given our μ1 and μ2 estimates. We can also graph the true function from which the data was randomly generated. • They are close. Good. • The 2nd solution tries to put the “2/3” hump where the “1/3” hump should go, and vice versa. • In this example unsupervised is almost as good as supervised. If the x1 .. x25 are given the class which was used to learn them, then the results are (μ1=-2.176, μ2=1.684). Unsupervised got (μ1=-2.13, μ2=1.668).

  43. Finding the max likelihood μ1,μ2..μk We can compute P( data | μ1,μ2..μk) How do we find the μi‘s which give max. likelihood? • The normal max likelihood trick: Set  log Prob (….) = 0 μi and solve for μi‘s. # Here you get non-linear non-analytically- solvable equations • Use gradient descent Slow but doable • Use a much faster, cuter, and recently very popular method…

  44. Expectation Maximalization

  45. The E.M. Algorithm DETOUR • We’ll get back to unsupervised learning soon. • But now we’ll look at an even simpler case with hidden information. • The EM algorithm • Can do trivial things, such as the contents of the next few slides. • An excellent way of doing our unsupervised learning problem, as we’ll see. • Many, many other uses, including inference of Hidden Markov Models.

  46. Silly Example Let events be “grades in a class” w1 = Gets an A P(A) = ½ w2 = Gets a B P(B) = μ w3 = Gets a C P(C) = 2μ w4 = Gets a D P(D) = ½-3μ (Note 0 ≤μ≤1/6) Assume we want to estimate μ from data. In a given class there were a A’s b B’s c C’s d D’s What’s the maximum likelihood estimate of μ given a,b,c,d ?

  47. Silly Example Let events be “grades in a class” w1 = Gets an A P(A) = ½ w2 = Gets a B P(B) = μ w3 = Gets a C P(C) = 2μ w4 = Gets a D P(D) = ½-3μ (Note 0 ≤μ≤1/6) Assume we want to estimate μ from data. In a given class there were a A’s b B’s c C’s d D’s What’s the maximum likelihood estimate of μ given a,b,c,d ?

  48. Trivial Statistics P(A) = ½ P(B) = μ P(C) = 2μ P(D) = ½-3μ P( a,b,c,d | μ) = K(½)a(μ)b(2μ)c(½-3μ)d log P( a,b,c,d | μ) = log K + alog ½ + blog μ + clog 2μ + dlog (½-3μ) Boring, but true!

  49. Same Problem with Hidden Information REMEMBER P(A) = ½ P(B) = μ P(C) = 2μ P(D) = ½-3μ Someone tells us that Number of High grades (A’s + B’s) = h Number of C’s = c Number of D’s = d What is the max. like estimate of μ now?

  50. Same Problem with Hidden Information REMEMBER P(A) = ½ P(B) = μ P(C) = 2μ P(D) = ½-3μ Someone tells us that Number of High grades (A’s + B’s) = h Number of C’s = c Number of D’s = d What is the max. like estimate of μ now? We can answer this question circularly: EXPECTATION If we know the value of μ we could compute the expected value of a and b Since the ratio a:b should be the same as the ratio ½ : m MAXIMIZATION If we know the expected values of a and b we could compute the maximum likelihood value of μ

More Related