
CS b553: Algorithms for Optimization and Learning


Presentation Transcript


  1. CS b553: Algorithms for Optimization and Learning Parameter Learning with Hidden Variables & Expectation Maximization

  2. Agenda • Learning probability distributions from data in the setting of known structure, missing data • Expectation-maximization (EM) algorithm

  3. Basic Problem • Given a dataset D = {x[1], …, x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z • Fit the distribution P(X, Z) to the data • Interpretation: each example x[m] is an incomplete view of the “underlying” sample (x[m], z[m]) (Figure: network Z → X)

  4. Applications • Clustering in data mining • Dimensionality reduction • Latent psychological traits (e.g., intelligence, personality) • Document classification • Human activity recognition

  5. Hidden Variables can Yield More Parsimonious Models • Hidden variables => conditional independences (Figure: naive-Bayes structure, Z → X1, X2, X3, X4) • Without Z, the observables become fully dependent (Figure: fully connected X1, X2, X3, X4)

  6. Hidden Variables can Yield More Parsimonious Models • Hidden variables => conditional independences (Figure: naive-Bayes structure, Z → X1, X2, X3, X4; 1 + 4*2 = 9 parameters) • Without Z, the observables become fully dependent (Figure: fully connected X1, X2, X3, X4; 1 + 2 + 4 + 8 = 15 parameters); see the parameter-count breakdown below
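A worked check of the two counts on this slide, assuming (as the figures suggest) that Z and all four X_i are binary:

```latex
% With the hidden variable Z (naive-Bayes structure Z -> X_1,...,X_4):
% one free parameter for P(Z), plus two per observable (one per value of Z).
1 + 4 \cdot 2 = 9
% Without Z the observables are fully dependent; factoring the joint as a chain
% P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2)\,P(X_4 \mid X_1, X_2, X_3):
1 + 2 + 4 + 8 = 15
```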

  7. Generating Model (Figure, unrolled network: a parameter node θ_Z is the parent of z[1], …, z[M], and a parameter node θ_X|Z is the parent of x[1], …, x[M]; the CPTs across examples are identical and given)

  8. Example: discrete variables • Categorical distributions given by parameters θ_Z: P(Z[i] | θ_Z) = Categorical(θ_Z) • P(X[i] | z[i], θ_X|Z) = Categorical(θ_X|z[i]) (in other words, z[i] multiplexes between Categorical distributions); a sampling sketch follows below
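A minimal sampling sketch of this generating model, assuming binary Z and a single binary X per example for concreteness; the variable names (theta_z, theta_x_given_z) are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_z = np.array([0.75, 0.25])             # Categorical(theta_Z) over Val(Z) = {0, 1}
theta_x_given_z = np.array([[0.9, 0.1],      # P(X | Z = 0)
                            [0.1, 0.9]])     # P(X | Z = 1)

M = 1000
z = rng.choice(2, size=M, p=theta_z)                                # z[m] ~ Categorical(theta_Z)
x = np.array([rng.choice(2, p=theta_x_given_z[zm]) for zm in z])    # x[m] ~ Categorical(theta_X|z[m])
# Only x is observed; each z[m] is the hidden "type" of its example.
```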

  9. Maximum Likelihood estimation • Approach: find values of θ = (θ_Z, θ_X|Z) and D_Z = (z[1], …, z[M]) that maximize the likelihood of the data • L(θ, D_Z; D) = P(D | θ, D_Z) • Find arg max L(θ, D_Z; D) over θ, D_Z

  10. Marginal Likelihood estimation • Approach: find values of θ = (θ_Z, θ_X|Z) that maximize the likelihood of the data without assuming values of D_Z = (z[1], …, z[M]) • L(θ; D) = Σ_{D_Z} P(D, D_Z | θ) • Find arg max L(θ; D) over θ • (A partially Bayesian approach; a sketch of the per-example sum is below)
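A sketch of evaluating L(θ; D) for the model sampled above: the sum over D_Z factors into a per-example sum over Val(Z). The function and variable names are assumptions carried over from the previous sketch:

```python
import numpy as np

def marginal_log_likelihood(x, theta_z, theta_x_given_z):
    # log L(theta; D) = sum_m log sum_z P(z | theta_Z) * P(x[m] | z, theta_X|Z)
    per_example = np.array([
        sum(theta_z[z] * theta_x_given_z[z, xm] for z in range(len(theta_z)))
        for xm in x
    ])
    return np.log(per_example).sum()
```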

  11. Computational challenges • P(D | θ, D_Z) and P(D, D_Z | θ) are easy to evaluate, but… • Maximum likelihood, arg max L(θ, D_Z; D): optimizing over M assignments to Z (|Val(Z)|^M possible joint assignments) as well as continuous parameters • Maximum marginal likelihood, arg max L(θ; D): optimizing locally over continuous parameters, but the objective requires summing over the assignments to the M hidden variables

  12. Expectation Maximization for ML • Idea: use a coordinate ascent approach • argmax_{θ, D_Z} L(θ, D_Z; D) = argmax_θ max_{D_Z} L(θ, D_Z; D) • Step 1: Finding D_Z* = argmax_{D_Z} L(θ, D_Z; D) is easy given a fixed θ (fully observed, ML parameter estimation) • Step 2: Set Q(θ) = L(θ, D_Z*; D). Finding θ* = argmax_θ Q(θ) is easy given that D_Z is fixed (fully observed, ML parameter estimation) • Repeat steps 1 and 2 until convergence (see the hard-EM sketch below)
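A minimal hard-EM sketch of this coordinate ascent for the model used in the following slides: a hidden binary type Z and several binary observables X_i that are conditionally independent given Z, with X passed as an (M, n_x) 0/1 array. The helper names and the small pseudo-counts are assumptions, not the lecture's own code:

```python
import numpy as np

def hard_em(X, n_z=2, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    M, n_x = X.shape
    theta_z = rng.dirichlet(np.ones(n_z))               # P(Z)
    theta_x = rng.uniform(0.1, 0.9, size=(n_z, n_x))    # theta_x[k, i] = P(X_i = 1 | Z = k)

    for _ in range(n_iters):
        # Step 1: D_Z* = argmax_{D_Z} L(theta, D_Z; D), i.e. a per-example argmax over z.
        log_px = X @ np.log(theta_x).T + (1 - X) @ np.log(1 - theta_x).T    # (M, n_z)
        z_hat = (np.log(theta_z)[None, :] + log_px).argmax(axis=1)

        # Step 2: theta* = argmax_theta L(theta, D_Z*; D): fully observed ML counts.
        for k in range(n_z):
            members = X[z_hat == k]
            theta_z[k] = (len(members) + 1e-6) / (M + n_z * 1e-6)
            theta_x[k] = (members.sum(axis=0) + 1e-6) / (len(members) + 2e-6)
    return theta_z, theta_x, z_hat
```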

  13. Example: Correlated variables (Figure: the same model drawn as an unrolled network, θ_Z → z[1], …, z[M], θ_X1|Z → x1[1], …, x1[M], θ_X2|Z → x2[1], …, x2[M], and in plate notation, with a single plate of size M over z, x1, x2)

  14. Example: Correlated variables (plate notation) • Suppose 2 types: type 1 has X1 != X2, at random; type 2 has X1, X2 = 1, 1 with 90% chance and 0, 0 otherwise; type 1 is drawn 75% of the time • X Dataset (counts): (1,1): 222, (1,0): 382, (0,1): 364, (0,0): 32

  15. Example: Correlated variables • Same 2-type setup and X dataset as on slide 14 • Initial parameter estimates: θ_Z = 0.5; θ_X1|Z=1 = 0.4, θ_X1|Z=2 = 0.3; θ_X2|Z=1 = 0.7, θ_X2|Z=2 = 0.6

  16. Example: Correlated variables • Same setup, dataset, and parameter estimates as on slide 15 • Estimated Z’s (hard assignments under those parameters): (1,1): type 1, (1,0): type 1, (0,1): type 2, (0,0): type 2

  17. Example: Correlated variables • Same setup and dataset • Parameters re-estimated from those Z assignments: θ_Z = 0.604; θ_X1|Z=1 = 1, θ_X1|Z=2 = 0; θ_X2|Z=1 = 0.368, θ_X2|Z=2 = 0.919 • Estimated Z’s unchanged: (1,1): type 1, (1,0): type 1, (0,1): type 2, (0,0): type 2

  18. Example: Correlated variables • Same setup, dataset, parameter estimates, and Z assignments as on slide 17 • Converged (true ML estimate); the sketch below reruns hard EM on these counts
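As a usage example, the hard_em sketch above can be run on this slide's counts after expanding them back into individual examples. Hard EM is initialization-dependent, so the values it reaches may or may not match the slide's converged estimate:

```python
import numpy as np

counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
X = np.array([xy for xy, c in counts.items() for _ in range(c)])    # shape (1000, 2)

theta_z, theta_x, z_hat = hard_em(X, n_z=2, n_iters=100, seed=0)
print("theta_Z:", theta_z)
print("P(X_i = 1 | Z = k):", theta_x)   # compare against theta_X1|Z, theta_X2|Z above
```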

  19. Example: Correlated variables (4 observables, plate notation) • Random initial guess: θ_Z = 0.44; θ_X1|Z=1 = 0.97, θ_X2|Z=1 = 0.21, θ_X3|Z=1 = 0.87, θ_X4|Z=1 = 0.57; θ_X1|Z=2 = 0.07, θ_X2|Z=2 = 0.97, θ_X3|Z=2 = 0.71, θ_X4|Z=2 = 0.03 • Log likelihood -5176

  20. Example: E step • Parameters unchanged from slide 19 (θ_Z = 0.44, …); the E step fills in a Z assignment for every example in the X dataset • Log likelihood -4401

  21. Example: M step • With those Z assignments fixed, current estimates: θ_Z = 0.43; θ_X1|Z=1 = 0.67, θ_X2|Z=1 = 0.27, θ_X3|Z=1 = 0.37, θ_X4|Z=1 = 0.83; θ_X1|Z=2 = 0.31, θ_X2|Z=2 = 0.68, θ_X3|Z=2 = 0.31, θ_X4|Z=2 = 0.21 • Log likelihood -3033

  22. Example: E step • Parameters unchanged from slide 21; Z assignments recomputed • Log likelihood -2965

  23. Example: E step • Current estimates: θ_Z = 0.40; θ_X1|Z=1 = 0.56, θ_X2|Z=1 = 0.31, θ_X3|Z=1 = 0.40, θ_X4|Z=1 = 0.92; θ_X1|Z=2 = 0.45, θ_X2|Z=2 = 0.66, θ_X3|Z=2 = 0.26, θ_X4|Z=2 = 0.04 • Log likelihood -2859

  24. Example: Last E-M step • Current estimates: θ_Z = 0.43; θ_X1|Z=1 = 0.51, θ_X2|Z=1 = 0.36, θ_X3|Z=1 = 0.35, θ_X4|Z=1 = 1; θ_X1|Z=2 = 0.53, θ_X2|Z=2 = 0.57, θ_X3|Z=2 = 0.33, θ_X4|Z=2 = 0 • Log likelihood -2683

  25. Problem: Many Local Minima • Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape! • Solution: EM using the marginal likelihood formulation • “Soft” EM • (This is the typical form of the EM algorithm)

  26. Expectation Maximization for MML • argmax_θ L(θ; D) = argmax_θ E_{D_Z|D,θ} [L(θ; D_Z, D)] • Do argmax_θ E_{D_Z|D,θ} [log L(θ; D_Z, D)] instead (justified later) • Step 1: Given the current fixed θ^t, find P(D_Z | θ^t, D), i.e., compute a distribution over each Z[m] • Step 2: Use these probabilities in the expectation E_{D_Z|D,θ^t} [log L(θ, D_Z; D)] = Q(θ). Now find max_θ Q(θ) (fully observed, weighted ML parameter estimation) • Repeat steps 1 (expectation) and 2 (maximization) until convergence (see the soft-EM sketch below)
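A minimal soft-EM loop for the same binary model (X an (M, n_x) 0/1 array). The e_step and m_step helpers are sketched after the next two slides; all names here are assumptions, not the lecture's code:

```python
import numpy as np

def soft_em(X, n_z=2, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    M, n_x = X.shape
    theta_z = rng.dirichlet(np.ones(n_z))
    theta_x = rng.uniform(0.1, 0.9, size=(n_z, n_x))

    for _ in range(n_iters):
        W = e_step(X, theta_z, theta_x)    # Step 1: W[m, z] = P(Z[m] = z | x[m], theta^t)
        theta_z, theta_x = m_step(X, W)    # Step 2: weighted, fully observed ML estimates
    return theta_z, theta_x, W
```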

  27. E step in detail • Ultimately, want to maximize Q(θ | θ^t) = E_{D_Z|D,θ^t} [log L(θ; D_Z, D)] over θ • Q(θ | θ^t) = Σ_m Σ_{z[m]} P(z[m] | x[m], θ^t) log P(x[m], z[m] | θ) • The E step computes the terms w_{m,z}(θ^t) = P(Z[m] = z | D, θ^t) over all examples m and all z ∈ Val(Z) (a sketch follows below)
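A sketch of the E step for this model (Bernoulli observables, conditionally independent given Z): w_{m,z} is obtained by normalizing the joint P(x[m], z | θ^t) over z. Names and numerical stabilization are assumptions:

```python
import numpy as np

def e_step(X, theta_z, theta_x):
    # log P(x[m] | z, theta) for every example and every value of z  -> shape (M, n_z)
    log_px = X @ np.log(theta_x).T + (1 - X) @ np.log(1 - theta_x).T
    log_joint = np.log(theta_z)[None, :] + log_px        # log P(x[m], z | theta)
    # Normalize over z to get w_{m,z} = P(Z[m] = z | x[m], theta) (shifted for stability).
    log_joint -= log_joint.max(axis=1, keepdims=True)
    W = np.exp(log_joint)
    return W / W.sum(axis=1, keepdims=True)
```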

  28. M step in detail • argmax_θ Q(θ | θ^t) = argmax_θ Σ_m Σ_z w_{m,z}(θ^t) log P(x[m] | θ, z[m]=z) = argmax_θ Π_m Π_z P(x[m] | θ, z[m]=z)^(w_{m,z}(θ^t)) • This is weighted ML: each z[m] is interpreted as being observed w_{m,z}(θ^t) times • Most closed-form ML expressions (Bernoulli, categorical, Gaussian) adapt easily to the weighted case (see the sketch below)
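A sketch of the corresponding weighted-ML M step; the tiny pseudo-count is an added assumption to keep the probabilities away from exactly 0 or 1:

```python
import numpy as np

def m_step(X, W, eps=1e-9):
    # "Expected counts": M_theta[z] = sum_m w_{m,z}.
    expected_counts_z = W.sum(axis=0)
    theta_z = expected_counts_z / expected_counts_z.sum()
    # M_theta[x_i = 1, z = k] / M_theta[z = k], for every observable i and value k.
    theta_x = (W.T @ X + eps) / (expected_counts_z[:, None] + 2 * eps)
    return theta_z, theta_x
```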

  29. Example: Bernoulli Parameter for Z • θ_Z* = argmax_{θ_Z} Σ_m Σ_z w_{m,z} log P(x[m], z[m]=z | θ_Z) = argmax_{θ_Z} Σ_m Σ_z w_{m,z} log(I[z=1] θ_Z + I[z=0] (1 - θ_Z)) = argmax_{θ_Z} [log(θ_Z) Σ_m w_{m,z=1} + log(1 - θ_Z) Σ_m w_{m,z=0}] • => θ_Z* = (Σ_m w_{m,z=1}) / Σ_m (w_{m,z=1} + w_{m,z=0}) • “Expected counts” M_{θ^t}[z] = Σ_m w_{m,z}(θ^t); express θ_Z* = M_{θ^t}[z=1] / M_{θ^t}[·] (a quick numeric check below)
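A quick numeric check of this update with made-up E-step weights (the numbers are illustrative only):

```python
import numpy as np

W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])                # toy w_{m,z} for M = 3 examples, binary Z
expected_counts = W.sum(axis=0)           # M_theta[z] = [1.6, 1.4]
theta_z = expected_counts / expected_counts.sum()
print(theta_z)                            # M_theta[z] / M_theta[.] = [0.533..., 0.466...]
```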

  30. Example: Bernoulli Parameters for Xi | Z • θ_Xi|z=k* = argmax_{θ_Xi|z=k} Σ_m w_{m,z=k} log P(x[m], z[m]=k | θ_Xi|z=k) • = argmax_{θ_Xi|z=k} Σ_m Σ_z w_{m,z} log(I[xi[m]=1, z=k] θ_Xi|z=k + I[xi[m]=0, z=k] (1 - θ_Xi|z=k)) = … (similar derivation) • => θ_Xi|z=k* = M_{θ^t}[xi=1, z=k] / M_{θ^t}[z=k] (toy check below)
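The same toy check extended to the per-observable parameters (toy observations; numbers illustrative only):

```python
import numpy as np

W = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])    # toy w_{m,z}
X = np.array([[1, 0], [0, 1], [1, 1]])                # toy 0/1 observations, two X_i
M_xi1_zk = W.T @ X                                    # M_theta[x_i = 1, z = k], shape (n_z, n_x)
M_zk = W.sum(axis=0)[:, None]                         # M_theta[z = k]
print(M_xi1_zk / M_zk)                                # theta_Xi|z=k estimates
```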

  31. EM on Prior Example (100 iterations) • Final estimates: θ_Z = 0.49; θ_X1|Z=1 = 0.64, θ_X2|Z=1 = 0.88, θ_X3|Z=1 = 0.41, θ_X4|Z=1 = 0.46; θ_X1|Z=2 = 0.38, θ_X2|Z=2 = 0.00, θ_X3|Z=2 = 0.27, θ_X4|Z=2 = 0.68 • Log likelihood -2833 (Figure: per-example soft assignments P(Z=2))

  32. Convergence • In general, no way to tell a priori how fast EM will converge • Soft EM is usually slower than hard EM • Still runs into local minima, but has more opportunities to coordinate parameter adjustments

  33. Why does it work? • Why are we optimizing over Q(θ | θ^t) = Σ_m Σ_{z[m]} P(z[m] | x[m], θ^t) log P(x[m], z[m] | θ) • rather than the true marginalized likelihood L(θ | D) = Π_m Σ_{z[m]} P(z[m] | x[m], θ^t) P(x[m], z[m] | θ)?

  34. Why does it work? • Same question as on slide 33 • Can prove that: the log likelihood is increased at every step, and a stationary point of argmax_θ E_{D_Z|D,θ} [L(θ; D_Z, D)] is a stationary point of log L(θ | D) • See Koller & Friedman, pp. 882–884

  35. Gaussian Clustering using EM • One of the first uses of EM; still a widely used approach • Finding good starting points: the k-means algorithm (hard assignment) • Handling degeneracies: regularization (see the sketch below)
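A usage sketch assuming scikit-learn is available: its GaussianMixture fits Gaussian clusters with EM, uses k-means for the starting point, and regularizes covariances against degeneracies, which matches the bullets above. The synthetic dataset is illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(300, 2)),    # synthetic 2-D data,
               rng.normal([4, 4], 0.5, size=(100, 2))])   # two Gaussian clusters

gmm = GaussianMixture(n_components=2,
                      init_params="kmeans",   # k-means gives the starting point (hard assignment)
                      reg_covar=1e-6,         # regularization against degenerate (zero-variance) components
                      n_init=5)               # several restarts to dodge poor local optima
gmm.fit(X)
print(gmm.weights_, gmm.means_)
```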

  36. Recap • Learning with hidden variables • Typically categorical
