Mixture Language Models and EM Algorithm (Lecture for CS397-CXZ Intro Text Info Systems) Sept. 17, 2003 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign
Rest of this Lecture
• Unigram mixture models
  • Slightly more sophisticated unigram LMs
  • Related to smoothing
• EM algorithm
  • VERY useful for estimating parameters of a mixture model, or whenever latent/hidden variables are involved
  • Will occur again and again in the course
Modeling a Multi-topic Document
• A document with 2 types of vocabulary: … text mining passage, food nutrition passage, text mining passage, text mining passage, food nutrition passage, …
• How do we model such a document?
• How do we "generate" such a document?
• How do we estimate our model?
• Solution: a mixture model + EM
Simple Unigram Mixture Model
• Model/topic 1, p(w|θ1), with weight λ = 0.7: text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
• Model/topic 2, p(w|θ2), with weight 1-λ = 0.3: food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
• Mixture: p(w|θ1⊕θ2) = λ p(w|θ1) + (1-λ) p(w|θ2)
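Concretely, each word's mixture probability is a λ-weighted interpolation of its probabilities under the two topic models. Below is a minimal Python sketch (not from the lecture): the word probabilities and λ = 0.7 are the toy numbers from the slide, and FLOOR is an assumed small probability for words missing from a table so the example stays self-contained.

```python
# Minimal sketch of the two-component unigram mixture model.
topic1 = {"text": 0.2, "mining": 0.1, "association": 0.01,
          "clustering": 0.02, "food": 0.00001}   # p(w|theta1)
topic2 = {"food": 0.25, "nutrition": 0.1,
          "healthy": 0.05, "diet": 0.02}         # p(w|theta2)
FLOOR = 1e-6   # assumed probability for words not listed in a table
lam = 0.7      # mixing weight lambda

def p_mix(w):
    """p(w|theta1 (+) theta2) = lambda*p(w|theta1) + (1-lambda)*p(w|theta2)"""
    return lam * topic1.get(w, FLOOR) + (1 - lam) * topic2.get(w, FLOOR)

for w in ["text", "food", "nutrition"]:
    print(w, p_mix(w))
```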
Parameter Estimation
• Likelihood of a document d: p(d|λ, θ1, θ2) = ∏_w [λ p(w|θ1) + (1-λ) p(w|θ2)]^c(w,d)
• Estimation scenarios:
  • p(w|θ1) & p(w|θ2) are known; estimate λ ("The doc is about text mining and food nutrition; what percentage is about text mining?")
  • p(w|θ1) & λ are known; estimate p(w|θ2) ("30% of the doc is about text mining; what is the rest about?")
  • p(w|θ1) is known; estimate λ & p(w|θ2) ("The doc is about text mining; is it also about some other topic, and if so, to what extent?")
  • λ is known; estimate p(w|θ1) & p(w|θ2) ("30% of the doc is about one topic and 70% is about another; what are these two topics?")
  • Estimate λ, p(w|θ1), and p(w|θ2) = clustering ("The doc is about two subtopics; find out what they are and to what extent the doc covers each.")
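For the first scenario (both topic models known, only λ unknown), the log-likelihood is a function of λ alone. A small sketch under the same toy assumptions as above, with a hypothetical word-count vector c(w, d); it simply evaluates log L(λ) on a grid of λ values:

```python
import math

# Sketch: document log-likelihood as a function of lambda, with both topic
# models held fixed. All numbers are toy/illustrative; FLOOR is an assumed
# smoothing value for words missing from a topic's table.
topic1 = {"text": 0.2, "mining": 0.1, "clustering": 0.02, "food": 0.00001}  # p(w|theta1)
topic2 = {"food": 0.25, "nutrition": 0.1, "diet": 0.02}                     # p(w|theta2)
FLOOR = 1e-6
doc_counts = {"text": 4, "mining": 3, "food": 2, "nutrition": 2}            # hypothetical c(w,d)

def log_likelihood(lam):
    # log L(lambda) = sum_w c(w,d) * log(lam*p(w|theta1) + (1-lam)*p(w|theta2))
    return sum(c * math.log(lam * topic1.get(w, FLOOR) +
                            (1 - lam) * topic2.get(w, FLOOR))
               for w, c in doc_counts.items())

for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"lambda={lam:.1f}  logL={log_likelihood(lam):.3f}")
```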
Parameter Estimation (cont.)
• Example: given p(w|θ1) and p(w|θ2), estimate λ
• Maximum likelihood: no closed-form solution, but many numerical algorithms exist
• The Expectation-Maximization (EM) algorithm is a commonly used method
• Basic idea: start from some random guess of the parameter values, then iteratively improve the estimates ("hill climbing")
  • E-step: compute the lower bound
  • M-step: find a new λ that maximizes the lower bound
• [Figure: likelihood curve log L(λ), showing the current guess λ^(n), the next guess λ^(n+1), and the lower bound]
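Here is a minimal EM sketch for exactly that example (both topic word distributions known, only λ estimated). It is not code from the lecture; it reuses the hypothetical toy inputs from the previous sketches.

```python
# Sketch of EM for the simplest scenario: p(w|theta1) and p(w|theta2) known,
# estimate only the mixing weight lambda. Toy/hypothetical inputs.
topic1 = {"text": 0.2, "mining": 0.1, "clustering": 0.02, "food": 0.00001}  # p(w|theta1)
topic2 = {"food": 0.25, "nutrition": 0.1, "diet": 0.02}                     # p(w|theta2)
FLOOR = 1e-6
doc_counts = {"text": 4, "mining": 3, "food": 2, "nutrition": 2}            # hypothetical c(w,d)

lam = 0.5  # initial guess for lambda
for it in range(20):
    # E-step: for each word, posterior probability that it was generated by topic 1,
    #   p(z_w=1|w) = lam*p(w|theta1) / (lam*p(w|theta1) + (1-lam)*p(w|theta2))
    post = {}
    for w in doc_counts:
        p1 = lam * topic1.get(w, FLOOR)
        p2 = (1 - lam) * topic2.get(w, FLOOR)
        post[w] = p1 / (p1 + p2)
    # M-step: re-estimate lambda as the expected fraction of word tokens from topic 1
    total = sum(doc_counts.values())
    lam = sum(doc_counts[w] * post[w] for w in doc_counts) / total
    print(f"iteration {it + 1}: lambda = {lam:.4f}")
```

Each iteration first guesses, for every word, how likely it is to have come from topic 1 (the E-step), and then resets λ to the expected fraction of word tokens attributed to topic 1 (the M-step); the likelihood never decreases across iterations.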
EM Algorithm: Intuition
• Same two topic models as before, but now the mixing weight is unknown: p(w|θ1⊕θ2) = λ p(w|θ1) + (1-λ) p(w|θ2), with λ = ?
• Model/topic 1, p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
• Model/topic 2, p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
• Observed: a document d whose words are drawn from θ1 (with probability λ) or from θ2 (with probability 1-λ)
• Suppose we knew the identity of each word…
Can We "Guess" the Identity?
• Identity ("hidden") variable: z_w ∈ {1 (w from θ1), 0 (w from θ2)}
• Example: the paper presents a text mining algorithm the paper …, with hidden values z_w: 1 1 1 1 0 0 0 1 0 …
• What's a reasonable guess of z_w?
  - depends on λ (why?)
  - depends on p(w|θ1) and p(w|θ2) (how?)
• E-step: p(z_w = 1 | w) = λ p(w|θ1) / [λ p(w|θ1) + (1-λ) p(w|θ2)]
• M-step: λ^(new) = Σ_w c(w,d) p(z_w = 1 | w) / Σ_w c(w,d)
• Initially, set λ to some random value, then iterate …
Any Theoretical Guarantee?
• EM is guaranteed to reach a LOCAL maximum
• When the local maximum is also the global maximum, EM finds the global maximum
• But when there are multiple local maxima, "special techniques" are needed (e.g., trying different initial values)
• In our case, there is a unique local maximum (why?)
A General Introduction to EM
• Data: X (observed) + H (hidden); parameter: θ
• "Incomplete" likelihood: L(θ) = log p(X|θ)
• "Complete" likelihood: Lc(θ) = log p(X, H|θ)
• EM tries to iteratively maximize the complete likelihood: starting with an initial guess θ^(0),
  1. E-step: compute the expectation of the complete likelihood
  2. M-step: compute θ^(n) by maximizing the Q-function
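Written out explicitly (a standard formulation stated here for reference, not a formula copied from the slide), the Q-function is the expectation of the complete-data log-likelihood under the posterior of the hidden variables given the previous parameter estimate, and the M-step maximizes it over θ:

```latex
Q(\theta;\,\theta^{(n-1)})
  = E_{p(H \mid X,\, \theta^{(n-1)})}\!\left[\log p(X, H \mid \theta)\right]
  = \sum_{H} p(H \mid X, \theta^{(n-1)}) \,\log p(X, H \mid \theta),
\qquad
\theta^{(n)} = \arg\max_{\theta}\; Q(\theta;\, \theta^{(n-1)})
```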
Convergence Guarantee
• Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) - L(θ^(n-1)) ≥ 0
• Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ),
  L(θ) = Lc(θ) - log p(H|X, θ)
  L(θ^(n)) - L(θ^(n-1)) = Lc(θ^(n)) - Lc(θ^(n-1)) + log [p(H|X, θ^(n-1)) / p(H|X, θ^(n))]
• Taking the expectation w.r.t. p(H|X, θ^(n-1)):
  L(θ^(n)) - L(θ^(n-1)) = Q(θ^(n); θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D(p(H|X, θ^(n-1)) || p(H|X, θ^(n)))
• EM chooses θ^(n) to maximize Q, and the KL-divergence term is always non-negative
• Therefore, L(θ^(n)) ≥ L(θ^(n-1))!
Another Way of Looking at EM
• L(θ) = L(θ^(n-1)) + Q(θ; θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D(p(H|X, θ^(n-1)) || p(H|X, θ))
• Dropping the non-negative KL-divergence term gives a lower bound on the likelihood (the Q function, up to terms constant in θ)
• [Figure: likelihood p(X|θ), with the current guess θ^(n-1), the next guess θ^(n), and the lower bound (Q function)]
• E-step = computing the lower bound
• M-step = maximizing the lower bound
What You Should Know
• Why is the unigram language model so important?
• What is a unigram mixture language model?
• How to estimate the parameters of simple unigram mixture models using EM
• The general idea of EM (EM will be covered again later in the course)