Machine Learning, Saarland University, SS 2007
Lecture 9, Friday June 15th, 2007 (EM algorithm + convergence)
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
Overview of this Lecture • Quick recap of last lecture • maximum likelihood principle / our 3 examples • The EM algorithm • writing down the formula (very easy) • understanding the formula (very hard) • Example: mixture of two normal distributions • Convergence • to local maximum (under mild assumptions) • Exercise Sheet • explain / discuss / make a start
Maximum Likelihood: Example 1 • Sequence of coin flips HHTTTTTTHTTTTTHTTHHT • say 5 times H and 15 times T • which Prob(H) and Prob(T) are most likely? • Formalization • Data X = (x1, … , xn), xi in {H, T} • Parameters Θ = (pH, pT), pH + pT = 1 • Likelihood L(X,Θ) = pH^h · pT^t, h = #{i : xi = H}, t = #{i : xi = T} • Log Likelihood Q(X,Θ) = log L(X,Θ) = h · log pH + t · log pT • find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ) • Solution (simple calculus [blackboard]) • here pH = h / (h + t) and pT = t / (h + t), which looks like Prob(H) = ¼, Prob(T) = ¾
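A minimal sketch in Python of this closed-form solution (the counts follow the slide's "say 5 times H and 15 times T"; the variable names are mine):

```python
import math

# Counts as stated on the slide: "say 5 times H and 15 times T".
h, t = 5, 15

# Closed-form ML estimates: pH = h / (h + t), pT = t / (h + t).
p_H = h / (h + t)          # 0.25
p_T = t / (h + t)          # 0.75

# Log-likelihood Q(X, Theta) = h * log pH + t * log pT from the slide.
Q = h * math.log(p_H) + t * math.log(p_T)
print(p_H, p_T, Q)
```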
Maximum Likelihood: Example 2 • Sequence of reals drawn from N(μ, σ), the normal distribution with mean μ and standard deviation σ • which μ and σ are most likely? • Formalization • Data X = (x1, … , xn), xi real number • Parameters Θ = (μ, σ) • Likelihood L(X,Θ) = Πi 1/(sqrt(2π)σ) · exp( - (xi - μ)^2 / 2σ^2 ) • Log Likelihood Q(X,Θ) = - n/2 · log(2π) - n · log σ - Σi (xi - μ)^2 / 2σ^2 • find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ) • Solution (simple calculus [blackboard]) • here μ = 1/n · Σi xi and σ^2 = 1/n · Σi (xi - μ)^2
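A small Python check of these closed-form estimates (the sample values are hypothetical):

```python
import math

x = [1.2, 0.7, 2.3, 1.9, 0.4, 1.6]   # hypothetical sample
n = len(x)

# ML estimates from the slide: mu = (1/n) Σ xi, sigma^2 = (1/n) Σ (xi - mu)^2.
mu = sum(x) / n
sigma2 = sum((xi - mu) ** 2 for xi in x) / n
sigma = math.sqrt(sigma2)

# Log-likelihood Q(X, Theta) = -n/2 log(2π) - n log σ - Σ (xi - μ)^2 / 2σ^2.
Q = -n / 2 * math.log(2 * math.pi) - n * math.log(sigma) \
    - sum((xi - mu) ** 2 for xi in x) / (2 * sigma2)
print(mu, sigma, Q)
```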
Maximum Likelihood: Example 3 • Sequence of real numbers • each drawn from either N1(μ1, σ1) or N2(μ2, σ2) • from N1 with prob p1, and from N2 with prob p2 • which μ1, σ1, μ2, σ2, p1, p2 are most likely? • Formalization • Data X = (x1, … , xn), xi real number • Hidden data Z = (z1, … , zn), zi = j iff xi drawn from Nj • Parameters Θ = (μ1, σ1, μ2, σ2, p1, p2), p1 + p2 = 1 • Likelihood L(X,Θ) = [blackboard] • Log Likelihood Q(X,Θ) = [blackboard] • find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ) • standard calculus fails (derivative of a sum of logs of sums)
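The blackboard derivation is not reproduced here; for orientation only, the standard form of the observed-data likelihood of this mixture is sketched below (notation may differ from the blackboard version). It makes the "sum of logs of sums" obstacle visible: each summand of Q is a log of a sum over the two components, so the partial derivatives no longer decouple as in Examples 1 and 2.

```latex
% Standard two-component Gaussian mixture likelihood (sketch, notation mine).
L(X,\Theta) = \prod_{i=1}^{n} \left(
    p_1 \, \tfrac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(x_i-\mu_1)^2 / 2\sigma_1^2}
  + p_2 \, \tfrac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(x_i-\mu_2)^2 / 2\sigma_2^2} \right)

Q(X,\Theta) = \sum_{i=1}^{n} \log \left(
    p_1 \, \tfrac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(x_i-\mu_1)^2 / 2\sigma_1^2}
  + p_2 \, \tfrac{1}{\sqrt{2\pi}\,\sigma_2} e^{-(x_i-\mu_2)^2 / 2\sigma_2^2} \right)
```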
The EM algorithm — Formula • Given • Data X = (x1, … , xn) • Hidden data Z = (z1, … , zn) • Parameters Θ + an initial guess θ1 • Expectation-Step: • Pr(Z|X;θt) = Pr(X|Z;θt) ∙ Pr(Z|θt) / [ ΣZ’ Pr(X|Z’;θt) ∙ Pr(Z’|θt) ] • Maximization-Step: • θt+1 = argmaxΘ EZ[ log Pr(X,Z|Θ) | X;θt ] • What the hell does this mean? • crucial to understand each of these probabilities / expected values: What is fixed? What is random and how? What do the conditionals mean?
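As a sketch of how the two steps interlock (the function names e_step and m_step are placeholders of mine, filled in for the two-Gaussian mixture on the slides below):

```python
def em(x, theta, e_step, m_step, iterations=100):
    """Generic EM loop: alternate the E-step and M-step of the formula above.

    e_step(x, theta) should return the posterior Pr(Z | X; theta_t),
    m_step(x, posterior) should return argmax_Theta EZ[log Pr(X, Z | Theta) | X; theta_t].
    """
    for _ in range(iterations):
        posterior = e_step(x, theta)      # Expectation-Step
        theta = m_step(x, posterior)      # Maximization-Step
    return theta
```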
Three attempts to maximize the likelihood (consider the mixture of two Gaussians as an example) • The direct way … • given x1, … , xn • find parameters μ1, σ1, μ2, σ2, p1, p2 • such that log L(x1, … , xn) is maximized • optimization too hard (sum of logs of sums) • If only we knew … • given data x1, … , xn and hidden data z1, … , zn • find parameters μ1, σ1, μ2, σ2, p1, p2 • such that log L(x1, … , xn, z1, … , zn) is maximized • would be feasible [show on blackboard], but we don’t know the z1, … , zn • The EM way … • given x1, … , xn and random variables Z1, … , Zn • find parameters μ1, σ1, μ2, σ2, p1, p2 • such that E log L(x1, … , xn, Z1, … , Zn) is maximized • the E-Step provides the Z1, … , Zn; maximizing is the M-Step of the EM algorithm
E-Step — Formula (consider the mixture of two Gaussians as an example) • We have (at the beginning of each iteration) • the data x1, … , xn • the fully specified distributions N1(μ1,σ1) and N2(μ2,σ2) • the probability of choosing between N1 and N2 = random variable Z with p1 = Pr(Z=1) and p2 = Pr(Z=2) • We want • for each data point xi a probability of choosing N1 or N2 = random variables Z1, … , Zn • Solution (the actual E-Step) • take Zi as the conditional Z | xi • Pr(Zi=1) = Pr(Z=1 | xi) = Pr(xi | Z=1) ∙ Pr(Z=1) / Pr(xi), with Pr(xi) = Σz Pr(xi | Z=z) ∙ Pr(Z=z) [Bayes’ law]
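A sketch of this E-step for the two-Gaussian mixture (function and variable names are mine):

```python
import math

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma) at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def e_step(x, theta):
    mu1, sigma1, mu2, sigma2, p1, p2 = theta
    posteriors = []
    for xi in x:
        a = normal_pdf(xi, mu1, sigma1) * p1    # Pr(xi | Z=1) * Pr(Z=1)
        b = normal_pdf(xi, mu2, sigma2) * p2    # Pr(xi | Z=2) * Pr(Z=2)
        posteriors.append(a / (a + b))          # Pr(Zi = 1 | xi) by Bayes' law
    return posteriors
```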
E-Step — analogy to a simple example • Draw a ball from one of two urns: Urn 1 is picked with prob 1/3, Urn 2 with prob 2/3 • Pr(Urn 1) = 1/3, Pr(Urn 2) = 2/3 • Pr(Blue | Urn 1) = 1/2, Pr(Blue | Urn 2) = 1/4 • Pr(Blue) = Pr(Blue | Urn 1) ∙ Pr(Urn 1) + Pr(Blue | Urn 2) ∙ Pr(Urn 2) = 1/2 ∙ 1/3 + 1/4 ∙ 2/3 = 1/3 • Pr(Urn 1 | Blue) = Pr(Blue | Urn 1) ∙ Pr(Urn 1) / Pr(Blue) = (1/2 ∙ 1/3) / (1/3) = 1/2
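The same arithmetic, spelled out as a quick check:

```python
from fractions import Fraction as F

p_urn1, p_urn2 = F(1, 3), F(2, 3)
p_blue_given_urn1, p_blue_given_urn2 = F(1, 2), F(1, 4)

# Total probability: Pr(Blue) = 1/2 * 1/3 + 1/4 * 2/3 = 1/3
p_blue = p_blue_given_urn1 * p_urn1 + p_blue_given_urn2 * p_urn2

# Bayes' law: Pr(Urn 1 | Blue) = (1/2 * 1/3) / (1/3) = 1/2
p_urn1_given_blue = p_blue_given_urn1 * p_urn1 / p_blue

print(p_blue, p_urn1_given_blue)   # 1/3 and 1/2
```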
M-Step — Formula • [Blackboard]
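The M-step itself was derived on the blackboard and is not reproduced here; as a hedged sketch, the standard closed-form updates for the two-Gaussian mixture look as follows, with the E-step posteriors acting as soft counts (names are mine):

```python
import math

def m_step(x, posteriors):
    # posteriors[i] = Pr(Zi = 1 | xi) from the E-step; component 2 gets 1 - posteriors[i]
    n = len(x)
    g1 = posteriors
    g2 = [1 - g for g in posteriors]

    n1, n2 = sum(g1), sum(g2)           # "soft" counts of the two components
    p1, p2 = n1 / n, n2 / n             # new mixture weights

    # weighted versions of the closed-form estimates from Example 2
    mu1 = sum(g * xi for g, xi in zip(g1, x)) / n1
    mu2 = sum(g * xi for g, xi in zip(g2, x)) / n2
    sigma1 = math.sqrt(sum(g * (xi - mu1) ** 2 for g, xi in zip(g1, x)) / n1)
    sigma2 = math.sqrt(sum(g * (xi - mu2) ** 2 for g, xi in zip(g2, x)) / n2)
    return (mu1, sigma1, mu2, sigma2, p1, p2)
```

Together with the e_step sketch above, this plugs into the generic em loop from the formula slide.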
Convergence of EM Algorithm • Two (log) likelihoods • true: log L(x1,…,xn) • EM: E log L(x1,…,xn, Z1,…,Zn) • Lemma 1 (lower bound) • E log L(x1,…,xn, Z1,…,Zn) ≤ log L(x1,…,xn) [blackboard] • Lemma 2 (touch) • E log L(x1,…,xn, Z1,…,Zn)(θt) = log L(x1,…,xn)(θt) [blackboard] • Convergence • if the expected likelihood function is well-behaved, e.g., if the first derivative at a local maximum exists and the second derivative is < 0 • then Lemmas 1 and 2 imply convergence
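A small runnable illustration (synthetic data, my own variable names) of the convergence statement: the true log-likelihood log L(x1,…,xn) never decreases from one EM iteration to the next.

```python
import math, random

def pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def log_likelihood(x, mu1, s1, mu2, s2, p1, p2):
    # the "true" log-likelihood log L(x1,...,xn) of the mixture
    return sum(math.log(p1 * pdf(xi, mu1, s1) + p2 * pdf(xi, mu2, s2)) for xi in x)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(4, 1) for _ in range(100)]

# initial guess theta_1
mu1, s1, mu2, s2, p1, p2 = -1.0, 1.0, 1.0, 1.0, 0.5, 0.5

for t in range(20):
    # E-step: posteriors g_i = Pr(Zi = 1 | xi)
    g = [p1 * pdf(xi, mu1, s1) / (p1 * pdf(xi, mu1, s1) + p2 * pdf(xi, mu2, s2)) for xi in x]
    # M-step: weighted ML estimates
    n1 = sum(g)
    n2 = len(x) - n1
    mu1 = sum(gi * xi for gi, xi in zip(g, x)) / n1
    mu2 = sum((1 - gi) * xi for gi, xi in zip(g, x)) / n2
    s1 = math.sqrt(sum(gi * (xi - mu1) ** 2 for gi, xi in zip(g, x)) / n1)
    s2 = math.sqrt(sum((1 - gi) * (xi - mu2) ** 2 for gi, xi in zip(g, x)) / n2)
    p1, p2 = n1 / len(x), n2 / len(x)
    # printed values are non-decreasing in t
    print(t, log_likelihood(x, mu1, s1, mu2, s2, p1, p2))
```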
Attempt Two: Calculations • If only we knew … • given data x1, … , xn and hidden data z1, … , zn • find parameters μ1, σ1, μ2, σ2, p1, p2 • such that log L(x1, … , xn, z1, … , zn) is maximized • let I1 = {i : zi = 1} and I2 = {i : zi = 2} • L(x1,…,xn, z1,…,zn) = Πi in I1 1/(sqrt(2π)σ1) · exp( - (xi - μ1)^2 / 2σ1^2 ) · Πi in I2 1/(sqrt(2π)σ2) · exp( - (xi - μ2)^2 / 2σ2^2 ) • The two products can be maximized separately • here μ1 = Σi in I1 xi / |I1| and σ1^2 = Σi in I1 (xi – μ1)^2 / |I1| • here μ2 = Σi in I2 xi / |I2| and σ2^2 = Σi in I2 (xi – μ2)^2 / |I2|
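A sketch of this complete-data case in Python (the data and the z's are hypothetical, chosen only to make the example runnable):

```python
import math

def fit_component(values):
    # closed-form ML estimates from Example 2, applied to one component's points
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return mu, sigma

x = [0.1, -0.3, 0.2, 4.1, 3.8, 4.3]   # hypothetical data
z = [1, 1, 1, 2, 2, 2]                # hypothetical hidden data

mu1, sigma1 = fit_component([xi for xi, zi in zip(x, z) if zi == 1])   # points in I1
mu2, sigma2 = fit_component([xi for xi, zi in zip(x, z) if zi == 2])   # points in I2
print(mu1, sigma1, mu2, sigma2)
```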