1 / 14

Mixture Language Models and EM Algorithm

Mixture Language Models and EM Algorithm. (Lecture for CS397-CXZ Intro Text Info Systems) Sept. 17, 2003 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Rest of this Lecture. Unigram mixture models Slightly more sophisticated unigram LMs

Download Presentation

Mixture Language Models and EM Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mixture Language Models and EM Algorithm (Lecture for CS397-CXZ Intro Text Info Systems) Sept. 17, 2003 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

  2. Rest of this Lecture • Unigram mixture models • Slightly more sophisticated unigram LMs • Related to smoothing • EM algorithm • VERY useful for estimating parameters of a mixture model or when latent/hidden variables are involved • Will occur again and again in the course

  3. Modeling a Multi-topic Document A document with 2 types of vocabulary … text mining passage food nutrition passage text mining passage text mining passage food nutrition passage … How do we model such a document? How do we “generate” such a document? How do we estimate our model? Solution: A mixture model + EM

  4. Simple Unigram Mixture Model … text 0.2 mining 0.1 assocation 0.01 clustering 0.02 … food 0.00001 … =0.7 Model/topic 1 p(w|1) … food 0.25 nutrition 0.1 healthy 0.05 diet 0.02 … 1-=0.3 Model/topic 2 p(w|2) p(w|1 2) = p(w|1)+(1- )p(w|2)

  5. The doc is about text mining and food nutrition, how much percent is about text mining? 30% of the doc is about text mining, what’s the rest about? The doc is about text mining, is it also about some other topic, and if so to what extent? 30% of the doc is about one topic and 70% is about another, what are these two topics? The doc is about two subtopics, find out what these Two subtopics are and to what extent the doc covers Each. Parameter Estimation Likelihood: • Estimation scenarios: • p(w|1) & p(w|2) are known; estimate  • p(w|1) &  are known; estimate p(w|2) • p(w|1) is known; estimate  & p(w|2) •  is known; estimate p(w|1)& p(w|2) • Estimate , p(w|1), p(w|2) =clustering

  6. No closed form solution, But many numerical algorithms Likelihood logL() next guess current guess Lower bound  (n) (n+1) Parameter Estimation Example:Given p(w|1) and p(w|2), estimate Maximum Likelihood: Expectation-Maximization (EM) Algorithm is a commonly used method Basic idea: Start from some random guess of parameter values, and then Iteratively improve our estimates (“hill climbing”) E-step: compute the lower bound M-step: find a new  that maximizes the lower bound

  7. From 2 EM Algorithm: Intuition … text 0.2 mining 0.1 assocation 0.01 clustering 0.02 … food 0.00001 … Observed Doc d =? p(w|1) … food 0.25 nutrition 0.1 healthy 0.05 diet 0.02 … 1-=? p(w|2) Suppose we know the identity of each word… p(w|1 2) = p(w|1)+(1- )p(w|2)

  8. E-step M-step Can We “Guess” the Identity? Identity (“hidden”) variable: zw{1 (w from 1), 0(w from 2)} zw 1 1 1 1 0 0 0 1 0 ... What’s a reasonable guess? - depends on  (why?) - depends on p(w| 1)) and p(w|2) (how?) the paper presents a text mining algorithm the paper ... Initially, set  to some random value, then iterate …

  9. An Example of EM Computation

  10. Any Theoretical Guarantee? • EM is guaranteed to reach a LOCAL maximum • When “local maxima” = “global maxima”, EM can find the global maximum • But, when there are multiple local maximas, “special techniques” are needed (e.g., try different initial values) • In our case, we have one unique local maxima (why?)

  11. A General Introduction to EM Data: X (observed) + H(hidden) Parameter:  “Incomplete” likelihood: L( )= log p(X| ) “Complete” likelihood: Lc( )= log p(X,H| ) EM tries to iteratively maximize the complete likelihood: Starting with an initial guess (0), 1. E-step: compute the expectation of the complete likelihood 2. M-step: compute (n) by maximizing the Q-function

  12. Convergence Guarantee Goal: maximizing “Incomplete” likelihood: L( )= log p(X| ) I.e., choosing (n), so thatL((n))-L((n-1))0 Note that, sincep(X,H| ) =p(H|X, ) P(X| ) , L() =Lc() -log p(H|X, ) L((n))-L((n-1)) = Lc((n))-Lc( (n-1))+log [p(H|X,  (n-1) )/p(H|X, (n))] Taking expectation w.r.t.p(H|X, (n-1)), L((n))-L((n-1)) = Q((n);  (n-1))-Q( (n-1);  (n-1)) + D(p(H|X,  (n-1))||p(H|X,  (n))) EM chooses (n) to maximize Q KL-divergence, always non-negative Therefore, L((n))  L((n-1))!

  13. Another way of looking at EM Likelihood p(X| ) L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) )+ D(p(H|X,  (n-1) )||p(H|X,  )) next guess current guess Lower bound (Q function)  E-step = computing the lower bound M-step = maximizing the lower bound

  14. What You Should Know • What is unigram language so important? • What is a unigram mixture language model? • How to estimate parameters of simple unigram mixture models using EM • Know the general idea of EM (EM will be covered again later in the course)

More Related