
Mixture Language Models and EM Algorithm



Presentation Transcript


  1. Mixture Language Models and EM Algorithm (Lecture for CS397-CXZ Intro Text Info Systems), Sept. 17, 2003. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

  2. Rest of this Lecture • Unigram mixture models • Slightly more sophisticated unigram LMs • Related to smoothing • EM algorithm • VERY useful for estimating parameters of a mixture model or when latent/hidden variables are involved • Will occur again and again in the course

  3. Modeling a Multi-topic Document A document with 2 types of vocabulary … text mining passage food nutrition passage text mining passage text mining passage food nutrition passage … How do we model such a document? How do we “generate” such a document? How do we estimate our model? Solution: A mixture model + EM

  4. Simple Unigram Mixture Model. Model/topic 1, p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …, with mixing weight λ = 0.7. Model/topic 2, p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …, with mixing weight 1 - λ = 0.3. Each word is generated from the mixture: p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2)
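
A minimal sketch of this two-component mixture in Python; the topic word distributions and the value of λ below just echo the toy numbers on the slide, and words not listed are treated as having probability 0 under that topic:

```python
# Two-component unigram mixture: p(w | theta1, theta2) = lam * p(w|theta1) + (1 - lam) * p(w|theta2)

theta1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
theta2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
lam = 0.7  # mixing weight for model/topic 1

def p_mixture(w, theta1, theta2, lam):
    """Probability of word w under the mixture of the two unigram models."""
    return lam * theta1.get(w, 0.0) + (1.0 - lam) * theta2.get(w, 0.0)

print(p_mixture("mining", theta1, theta2, lam))  # 0.7 * 0.1 + 0.3 * 0.0
print(p_mixture("food", theta1, theta2, lam))    # 0.7 * 0.00001 + 0.3 * 0.25
```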

  5. Parameter Estimation. Likelihood: log p(d|θ1, θ2, λ) = Σ_w c(w,d) log [λ p(w|θ1) + (1 - λ) p(w|θ2)]. Estimation scenarios: • p(w|θ1) & p(w|θ2) are known; estimate λ ("the doc is about text mining and food nutrition; how much of it is about text mining?") • p(w|θ1) & λ are known; estimate p(w|θ2) ("30% of the doc is about text mining; what's the rest about?") • p(w|θ1) is known; estimate λ & p(w|θ2) ("the doc is about text mining; is it also about some other topic, and if so to what extent?") • λ is known; estimate p(w|θ1) & p(w|θ2) ("30% of the doc is about one topic and 70% is about another; what are these two topics?") • Estimate λ, p(w|θ1), and p(w|θ2) = clustering ("the doc is about two subtopics; find out what these two subtopics are and to what extent the doc covers each")
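
A small Python sketch of this likelihood as a function of λ; the word counts c(w, d) below are hypothetical, chosen only to make the example runnable:

```python
import math

theta1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
theta2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
counts = {"text": 4, "mining": 3, "food": 2, "nutrition": 1}  # hypothetical c(w, d)

def log_likelihood(lam, counts, theta1, theta2):
    """log p(d | theta1, theta2, lam) = sum_w c(w,d) * log(lam*p(w|theta1) + (1-lam)*p(w|theta2))"""
    total = 0.0
    for w, c in counts.items():
        pw = lam * theta1.get(w, 0.0) + (1.0 - lam) * theta2.get(w, 0.0)
        total += c * math.log(pw)
    return total

# A crude grid search over lambda already shows where the maximum lies.
best_ll, best_lam = max((log_likelihood(l / 100, counts, theta1, theta2), l / 100) for l in range(1, 100))
print(best_lam, best_ll)
```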

  6. Parameter Estimation. Example: given p(w|θ1) and p(w|θ2), estimate λ. Maximum likelihood: there is no closed-form solution, but many numerical algorithms exist. The Expectation-Maximization (EM) algorithm is a commonly used method. Basic idea: start from some random guess of the parameter value, then iteratively improve the estimate ("hill climbing"). E-step: compute the lower bound. M-step: find a new λ that maximizes the lower bound. (Figure: the log-likelihood curve log L(λ), the lower bound touching it at the current guess λ^(n), and the next guess λ^(n+1).)

  7. From 2 EM Algorithm: Intuition … text 0.2 mining 0.1 assocation 0.01 clustering 0.02 … food 0.00001 … Observed Doc d =? p(w|1) … food 0.25 nutrition 0.1 healthy 0.05 diet 0.02 … 1-=? p(w|2) Suppose we know the identity of each word… p(w|1 2) = p(w|1)+(1- )p(w|2)

  8. Can We "Guess" the Identity? Identity ("hidden") variable: z_w ∈ {1 (w from θ1), 0 (w from θ2)}. Example: "the paper presents a text mining algorithm the paper ..." with guessed identities z_w = 1 1 1 1 0 0 0 1 0 ... What's a reasonable guess? It depends on λ (why?) and on p(w|θ1) and p(w|θ2) (how?). E-step: p(z_w = 1 | w) = λ^(n) p(w|θ1) / [λ^(n) p(w|θ1) + (1 - λ^(n)) p(w|θ2)]. M-step: λ^(n+1) = Σ_w c(w,d) p(z_w = 1 | w) / Σ_w c(w,d). Initially, set λ to some random value, then iterate…
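
A sketch of this EM loop in Python, assuming (as on the slide) that p(w|θ1) and p(w|θ2) are known and only λ is re-estimated; the distributions and counts are the same toy values as above:

```python
theta1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
theta2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
counts = {"text": 4, "mining": 3, "food": 2, "nutrition": 1}  # hypothetical c(w, d)

def em_lambda(counts, theta1, theta2, lam=0.5, iters=20):
    """Estimate the mixing weight lambda by EM, keeping both topic models fixed."""
    for _ in range(iters):
        # E-step: posterior probability that each word occurrence came from topic 1
        post = {}
        for w in counts:
            p1 = lam * theta1.get(w, 0.0)
            p2 = (1.0 - lam) * theta2.get(w, 0.0)
            post[w] = p1 / (p1 + p2)
        # M-step: lambda = expected fraction of word occurrences generated by topic 1
        lam = sum(counts[w] * post[w] for w in counts) / sum(counts.values())
    return lam

print(em_lambda(counts, theta1, theta2))
```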

  9. An Example of EM Computation
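
As an illustration of one round of this computation, take a two-word vocabulary with hypothetical values c(text, d) = 4, c(food, d) = 6, p(text|θ1) = 0.9, p(food|θ1) = 0.1, p(text|θ2) = 0.2, p(food|θ2) = 0.8, and an initial guess λ^(0) = 0.5:

```latex
% One EM iteration with the hypothetical numbers above.
% E-step: posterior probability that each word came from topic 1.
\begin{align*}
p(z_{\text{text}} = 1 \mid \text{text})
  &= \frac{0.5 \cdot 0.9}{0.5 \cdot 0.9 + 0.5 \cdot 0.2} = \frac{0.45}{0.55} \approx 0.818 \\
p(z_{\text{food}} = 1 \mid \text{food})
  &= \frac{0.5 \cdot 0.1}{0.5 \cdot 0.1 + 0.5 \cdot 0.8} = \frac{0.05}{0.45} \approx 0.111 \\
% M-step: expected fraction of word occurrences from topic 1.
\lambda^{(1)}
  &= \frac{4 \cdot 0.818 + 6 \cdot 0.111}{10} \approx 0.394
\end{align*}
```

Repeating the E-step with λ^(1) and re-estimating in the M-step continues until λ stops changing.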

  10. Any Theoretical Guarantee? • EM is guaranteed to reach a LOCAL maximum • When the local maximum is also the global maximum, EM finds the global maximum • But when there are multiple local maxima, "special techniques" are needed (e.g., trying different initial values) • In our case, there is a unique local maximum (why?)

  11. A General Introduction to EM. Data: X (observed) + H (hidden). Parameter: θ. "Incomplete" likelihood: L(θ) = log p(X|θ). "Complete" likelihood: L_c(θ) = log p(X, H|θ). EM tries to iteratively maximize the complete likelihood: starting with an initial guess θ^(0), 1. E-step: compute the expectation of the complete likelihood (the Q-function); 2. M-step: compute θ^(n) by maximizing the Q-function.
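
Written out, the Q-function referred to here is (in its standard form) the expected complete-data log-likelihood under the current posterior over H, and the two steps are:

```latex
\begin{align*}
Q(\theta; \theta^{(n-1)}) &= E_{p(H \mid X, \theta^{(n-1)})}\!\left[\log p(X, H \mid \theta)\right] \\
\text{E-step:}\quad & \text{compute } Q(\theta; \theta^{(n-1)}) \\
\text{M-step:}\quad & \theta^{(n)} = \arg\max_{\theta} Q(\theta; \theta^{(n-1)})
\end{align*}
```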

  12. Convergence Guarantee. Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) - L(θ^(n-1)) ≥ 0.
Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = L_c(θ) - log p(H|X, θ), and therefore
L(θ^(n)) - L(θ^(n-1)) = L_c(θ^(n)) - L_c(θ^(n-1)) + log [p(H|X, θ^(n-1)) / p(H|X, θ^(n))].
Taking the expectation w.r.t. p(H|X, θ^(n-1)):
L(θ^(n)) - L(θ^(n-1)) = Q(θ^(n); θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D(p(H|X, θ^(n-1)) || p(H|X, θ^(n))).
EM chooses θ^(n) to maximize Q, and the KL-divergence term is always non-negative. Therefore, L(θ^(n)) ≥ L(θ^(n-1))!

  13. Another Way of Looking at EM. The likelihood p(X|θ) decomposes as L(θ) = L(θ^(n-1)) + Q(θ; θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D(p(H|X, θ^(n-1)) || p(H|X, θ)). Dropping the non-negative KL term gives a lower bound (essentially the Q-function) that touches L(θ) at the current guess θ^(n-1); maximizing it yields the next guess. E-step = computing the lower bound. M-step = maximizing the lower bound. (Figure: the likelihood curve with the lower bound, the current guess, and the next guess.)

  14. What You Should Know • Why is the unigram language model so important? • What is a unigram mixture language model? • How to estimate parameters of simple unigram mixture models using EM • Know the general idea of EM (EM will be covered again later in the course)
