Mixture Language Models ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign
Central Questions to Ask about a LM: "ADMI"
• Application: Why do you need a LM? For what purpose? (determines the evaluation metric for a LM; here: topic mining and analysis)
• Data: What kind of data do you want to model? (the data set for estimation & evaluation; here: text documents & context)
• Model: How do you define the model? What assumptions are to be made? (here: mixture of unigram LMs)
• Inference: How do you infer/estimate the parameters? (the inference/estimation algorithm; here: the EM algorithm, Bayesian inference)
Outline • Motivation • Mining one topic • Two-component mixture model • EM algorithm • Probabilistic Latent Semantic Analysis (PLSA) • Extensions of PLSA
Topic Mining and Analysis: Motivation
• Topic ≈ the main idea discussed in text data
  • Theme/subject of a discussion or conversation
  • Different granularities (e.g., topic of a sentence, an article, etc.)
• Many applications require discovery of topics in text
  • What are Twitter users talking about today?
  • What are the current research topics in data mining? How are they different from those 5 years ago?
  • What do people like about the iPhone 6? What do they dislike?
  • What were the major topics debated in the 2012 presidential election?
Topics As Knowledge About the World
[Diagram: the real world produces both text data and non-text data (context: time, location, …); topics 1…k mined from the text data, combined with the context, constitute knowledge about the world.]
Tasks of Topic Mining and Analysis
[Diagram: Task 1: discover k topics (Topic 1, …, Topic k) from the text data; Task 2: figure out which documents (Doc 1, …, Doc N) cover which topics.]
Formal Definition of Topic Mining and Analysis
• Input
  • A collection of N text documents C = {d1, …, dN}
  • Number of topics: k
• Output
  • k topics: {θ1, …, θk}
  • Coverage of topics in each di: {πi1, …, πik}
  • πij = prob. of di covering topic θj
How to define θj?
Initial Idea: Topic = Term
[Diagram: each topic θj is a single term, e.g., θ1 = "Sports", θ2 = "Travel", …, θk = "Science"; each document di gets coverage values πi1, …, πik (e.g., 30%, 12%, 8%), with πij = 0 when di does not cover term j (e.g., π21 = πN1 = 0).]
Mining k Topical Terms from Collection C
• Parse the text in C to obtain candidate terms (e.g., term = word).
• Design a scoring function to measure how good each term is as a topic.
  • Favor representative terms (high frequency is favored).
  • Avoid words that are too frequent (e.g., "the", "a").
  • TF-IDF weighting from retrieval can be very useful.
  • Domain-specific heuristics are possible (e.g., favor title words, hashtags in tweets).
• Pick the k terms with the highest scores, but try to minimize redundancy: if multiple terms are very similar or closely related, pick only one of them and ignore the others.
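A minimal sketch of this term-scoring step, assuming simple regex tokenization and a log-scaled TF-IDF score; the function name, the stopword list, and the omission of the redundancy check are illustrative choices, not the slide's prescription:

```python
import math
import re
from collections import Counter

def mine_topical_terms(docs, k, stopwords=frozenset({"the", "a", "of", "and", "to", "is"})):
    """Pick k high-scoring terms as candidate 'topics' (the term-as-topic idea)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    tf = Counter()                      # total term frequency in the collection
    df = Counter()                      # document frequency of each term
    for toks in tokenized:
        tf.update(toks)
        df.update(set(toks))
    n_docs = len(docs)
    scores = {}
    for term, freq in tf.items():
        if term in stopwords:
            continue                    # crude stopword removal; IDF also demotes such words
        idf = math.log((n_docs + 1) / (df[term] + 1))
        scores[term] = (1 + math.log(freq)) * idf   # TF-IDF-style score
    # Pick the k best terms (redundancy removal is omitted in this sketch)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```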
Computing Topic Coverage: πij
Doc di: count("sports", di) = 4, count("travel", di) = 2, …, count("science", di) = 1
Normalize the counts of the topical terms in di to get the coverage: πij = count(term j, di) / Σ_{j'=1..k} count(term j', di).
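A minimal sketch of this count-and-normalize coverage computation (the helper name and token-level counting are assumptions):

```python
from collections import Counter

def topic_coverage(doc_tokens, topic_terms):
    """pi_ij = count(term j, d_i) / sum of counts of all topical terms in d_i."""
    counts = Counter(tok for tok in doc_tokens if tok in topic_terms)
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in topic_terms}   # d_i mentions no topical term at all
    return {t: counts[t] / total for t in topic_terms}

# The slide's example: 4 x "sports", 2 x "travel", 1 x "science"
doc = ["sports"] * 4 + ["travel"] * 2 + ["science"]
print(topic_coverage(doc, ["sports", "travel", "science"]))
# -> {'sports': 0.571..., 'travel': 0.285..., 'science': 0.142...}
```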
How Well Does This Approach Work?
Doc di: "Cavaliers vs. Golden State Warriors: NBA playoff finals … basketball game … travel to Cleveland … star …"
1. Need to count related words too (e.g., "NBA" and "basketball" should count toward "Sports").
2. "star" can be ambiguous (e.g., a star in the sky).
3. How do we mine complicated topics?
Problems with "Term as Topic"
• Lack of expressive power: can only represent simple/general topics, not complicated topics (fix: topic = {multiple words})
• Incompleteness in vocabulary coverage: can't capture variations of vocabulary, e.g., related words (fix: put weights on words)
• Word sense ambiguity: a topical term or a related term can be ambiguous, e.g., basketball star vs. star in the sky (fix: split an ambiguous word)
A probabilistic topic model can do all these!
Improved Idea: Topic = Word Distribution
Vocabulary set: V = {w1, w2, …}
θ1 "Sports", p(w|θ1): sports 0.02, game 0.01, basketball 0.005, football 0.004, play 0.003, star 0.003, …, nba 0.001, …, travel 0.0005, …
θ2 "Travel", p(w|θ2): travel 0.05, attraction 0.03, trip 0.01, flight 0.004, hotel 0.003, island 0.003, …, culture 0.001, …, play 0.0002, …
θk "Science", p(w|θk): science 0.04, scientist 0.03, spaceship 0.006, telescope 0.004, genomics 0.004, star 0.002, …, genetics 0.001, …, travel 0.00001, …
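To make the representation concrete, here is one way the slide's three distributions could be held in code and queried for a shared word like "star"; the values are the slide's illustrative numbers, and a full model would assign some probability to every word in V:

```python
topics = {
    "Sports":  {"sports": 0.02, "game": 0.01, "basketball": 0.005, "football": 0.004,
                "play": 0.003, "star": 0.003, "nba": 0.001, "travel": 0.0005},
    "Travel":  {"travel": 0.05, "attraction": 0.03, "trip": 0.01, "flight": 0.004,
                "hotel": 0.003, "island": 0.003, "culture": 0.001, "play": 0.0002},
    "Science": {"science": 0.04, "scientist": 0.03, "spaceship": 0.006, "telescope": 0.004,
                "genomics": 0.004, "star": 0.002, "genetics": 0.001, "travel": 0.00001},
}

# An ambiguous or related word like "star" simply carries weight in several topics,
# which is how the distribution view handles vocabulary variation and ambiguity.
for name, dist in topics.items():
    print(name, dist.get("star", 0.0))
```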
Probabilistic Topic Mining and Analysis
• Input
  • A collection of N text documents C = {d1, …, dN}
  • Vocabulary set: V = {w1, …, wM}
  • Number of topics: k
• Output
  • k topics, each a word distribution: {θ1, …, θk}
  • Coverage of topics in each di: {πi1, …, πik}
  • πij = prob. of di covering topic θj
The Computation Task
INPUT: C, k, V → OUTPUT: {θ1, …, θk}, {πi1, …, πik}
[Diagram: each topic θj is a word distribution, e.g., θ1: sports 0.02, game 0.01, basketball 0.005, football 0.004, …; θ2: travel 0.05, attraction 0.03, trip 0.01, …; θk: science 0.04, scientist 0.03, spaceship 0.006, …; each document gets coverage values, e.g., π11 = 30%, π12 = 12%, π1k = 8%, π21 = πN1 = 0%.]
Generative Model for Text Mining
• Modeling of data generation: p(Data | Model, Λ)
• Λ = ({θ1, …, θk}, {π11, …, π1k}, …, {πN1, …, πNk})   How many parameters in total?
• Parameter estimation/inference: Λ* = argmax_Λ p(Data | Model, Λ)
General Ideas of Generative Models for Text Mining
• Model data generation: p(Data | Model, Λ)
• Infer the most likely parameter values Λ* given a particular data set: Λ* = argmax_Λ p(Data | Model, Λ)
• Take Λ* as the "knowledge" to be mined for the text mining problem
• Adjust the design of the model to discover different knowledge
Simplest Case of Topic Model: Mining One Topic
INPUT: C = {d}, V → OUTPUT: {θ}
[Diagram: a single document d covers the single topic θ 100%; we must estimate p(w|θ) for every word: text ?, mining ?, association ?, database ?, …, query ?, …]
Language Model Setup
• Data: document d = x1 x2 … x|d|, where each xi ∈ V = {w1, …, wM} is a word
• Model: unigram LM θ (= topic): {θi = p(wi|θ)}, i = 1, …, M; θ1 + … + θM = 1
• Likelihood function: p(d|θ) = Π_{i=1..|d|} p(xi|θ) = Π_{i=1..M} p(wi|θ)^c(wi,d) = Π_{i=1..M} θi^c(wi,d), where c(wi,d) is the count of wi in d
• ML estimate: θ* = argmax_θ p(d|θ)
Computation of the Maximum Likelihood Estimate
Maximizing p(d|θ) is equivalent to maximizing the log-likelihood: log p(d|θ) = Σ_{i=1..M} c(wi,d) log θi, subject to the constraint θ1 + … + θM = 1.
Using the Lagrange multiplier approach, the solution is the normalized counts: θi = p(wi|θ) = c(wi,d) / |d| = c(wi,d) / Σ_{i'=1..M} c(wi',d).
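A minimal sketch of this normalized-count ML estimate (whitespace tokenization is an assumption):

```python
from collections import Counter

def ml_unigram_lm(doc_tokens):
    """ML estimate of a unigram LM: theta_i = c(w_i, d) / |d| (normalized counts)."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

d = "text mining is text clustering and text association".split()
theta = ml_unigram_lm(d)
print(theta["text"])   # 3 occurrences out of 8 tokens = 0.375
```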
What Does the Topic Look Like? p(w|θd)
Estimated from a text mining paper d: the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, …
Can we get rid of these common words?
Factoring out Background Words p(w|θd)
Estimated from a text mining paper d: the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, …
How can we get rid of these common words?
Generate d Using Two Word Distributions
Topic θd, p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001
Background topic θB, p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Topic choice: p(θd) + p(θB) = 1, e.g., p(θd) = 0.5, p(θB) = 0.5
To generate each word of the text mining paper d, first choose a topic (θd or θB), then sample a word from the chosen distribution.
What's the probability of observing a word w?
p("the") = p(θd) p("the"|θd) + p(θB) p("the"|θB) = 0.5*0.000001 + 0.5*0.03
p("text") = p(θd) p("text"|θd) + p(θB) p("text"|θB) = 0.5*0.04 + 0.5*0.000006
(using p(θd) = p(θB) = 0.5; p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001; p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006)
The Idea of a Mixture Model
A mixture model: topic choice p(θd) = p(θB) = 0.5 (with p(θd) + p(θB) = 1), and component distributions p(w|θd) (text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001) and p(w|θB) (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006); every word w is drawn from this mixture.
As a Generative Model…
Formally, the mixture defines the following generative model: p(w) = p(θd) p(w|θd) + p(θB) p(w|θB)
Estimating the model "discovers" two topics plus the topic coverage.
What if p(θd) = 1 or p(θB) = 1?
Mixture of Two Unigram Language Models
• Data: document d
• Mixture model: parameters Λ = ({p(w|θd)}, {p(w|θB)}, p(θB), p(θd))
  • Two unigram LMs: θd (the topic of d); θB (the background topic)
  • Mixing weight (topic choice): p(θd) + p(θB) = 1
• Likelihood function: p(d|Λ) = Π_{i=1..|d|} [p(θd) p(xi|θd) + p(θB) p(xi|θB)] = Π_{w∈V} [p(θd) p(w|θd) + p(θB) p(w|θB)]^c(w,d)
• ML estimate: Λ* = argmax_Λ p(d|Λ), subject to Σ_{w∈V} p(w|θd) = Σ_{w∈V} p(w|θB) = 1 and p(θd) + p(θB) = 1
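A small sketch of this two-component likelihood in log space (the floor value for out-of-dictionary words is an assumption of the sketch, not part of the model):

```python
import math
from collections import Counter

def mixture_log_likelihood(doc_tokens, p_wd, p_wb, p_d=0.5, p_b=0.5, floor=1e-12):
    """log p(d|Lambda) = sum_w c(w,d) * log[ p(theta_d)p(w|theta_d) + p(theta_B)p(w|theta_B) ]."""
    counts = Counter(doc_tokens)
    ll = 0.0
    for w, c in counts.items():
        p_w = p_d * p_wd.get(w, floor) + p_b * p_wb.get(w, floor)
        ll += c * math.log(p_w)
    return ll

# Toy check with the slide's numbers for d = "text the"
p_wd = {"text": 0.04, "the": 0.000001}
p_wb = {"the": 0.03, "text": 0.000006}
print(mixture_log_likelihood(["text", "the"], p_wd, p_wb))
```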
Back to Factoring out Background Words
Generating the text mining paper d ("… text mining … is … clustering … we … text … the …") with the mixture: topic choice p(θd) = p(θB) = 0.5, p(θd) + p(θB) = 1; p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001; p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006
Estimation of One Topic: p(w|θd)
Adjust θd to maximize p(d|Λ), with all other parameters known: p(θd) = p(θB) = 0.5 and p(w|θB) given (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006); the unknowns are p(w|θd): text ?, mining ?, association ?, clustering ?, …, the ?
Would the ML estimate demote background words in θd?
Behavior of a Mixture Model
Suppose d = "text the", p(θd) = p(θB) = 0.5, and the background is known: p("the"|θB) = 0.9, p("text"|θB) = 0.1; the unknowns are p("text"|θd) and p("the"|θd).
p("text") = p(θd) p("text"|θd) + p(θB) p("text"|θB) = 0.5*p("text"|θd) + 0.5*0.1
p("the") = 0.5*p("the"|θd) + 0.5*0.9
Likelihood: p(d|Λ) = p("text"|Λ) p("the"|Λ) = [0.5*p("text"|θd) + 0.5*0.1] × [0.5*p("the"|θd) + 0.5*0.9]
How can we set p("text"|θd) and p("the"|θd) to maximize it? Note that p("text"|θd) + p("the"|θd) = 1.
"Collaboration" and "Competition" of θd and θB
p(d|Λ) = p("text"|Λ) p("the"|Λ) = [0.5*p("text"|θd) + 0.5*0.1] × [0.5*p("the"|θd) + 0.5*0.9], with p("text"|θd) + p("the"|θd) = 1.
Since the two factors have a constant sum, the product reaches its maximum when the two factors are equal: 0.5*p("text"|θd) + 0.5*0.1 = 0.5*p("the"|θd) + 0.5*0.9, which gives p("text"|θd) = 0.9 >> p("the"|θd) = 0.1!
Behavior 1: if p(w1|θB) > p(w2|θB), then p(w1|θd) < p(w2|θd).
What if p(θB) ≠ 0.5?
p(d|Λ) = p("text"|Λ) p("the"|Λ) = [(1-p(θB))*p("text"|θd) + p(θB)*0.1] × [(1-p(θB))*p("the"|θd) + p(θB)*0.9]
Setting the two factors equal and using p("the"|θd) = 1 - p("text"|θd):
(1-p(θB))*p("text"|θd) + p(θB)*0.1 = (1-p(θB))*[1 - p("text"|θd)] + p(θB)*0.9
2(1-p(θB))*p("text"|θd) = 1 - 0.2*p(θB)
p("text"|θd) = [1 - 0.2*p(θB)] / [2(1 - p(θB))]
⇒ the larger p(θB) is, the larger p("text"|θd) and the lower p("the"|θd).
Response to Data Frequency
For d = "text the": p(d|Λ) = [0.5*p("text"|θd) + 0.5*0.1] × [0.5*p("the"|θd) + 0.5*0.9], maximized by p("text"|θd) = 0.9 >> p("the"|θd) = 0.1.
For d' = "text the the the … the": p(d'|Λ) = [0.5*p("text"|θd) + 0.5*0.1] × [0.5*p("the"|θd) + 0.5*0.9]^c("the",d'), i.e., the "the" factor is repeated once per occurrence.
What's the optimal solution now: p("the"|θd) > 0.1 or p("the"|θd) < 0.1? What if we increase p(θB)?
Behavior 2: high-frequency words get higher p(w|θd).
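A small numeric check of the two behaviors under the slide's toy setting, using a grid search over p("text"|θd); the helper function is illustrative and not part of the lecture:

```python
def best_split(n_the, p_b=0.5, grid=10000):
    """Grid-search p('text'|theta_d) maximizing
    [(1-p_b)*x + p_b*0.1] * [(1-p_b)*(1-x) + p_b*0.9]**n_the
    for a toy document with one 'text' and n_the copies of 'the' (background fixed)."""
    best_x, best_val = 0.0, -1.0
    for i in range(grid + 1):
        x = i / grid                          # candidate p("text"|theta_d)
        val = ((1 - p_b) * x + p_b * 0.1) * ((1 - p_b) * (1 - x) + p_b * 0.9) ** n_the
        if val > best_val:
            best_x, best_val = x, val
    return best_x

print(best_split(n_the=1))   # ~0.9: Behavior 1, "the" is demoted because theta_B likes it
print(best_split(n_the=4))   # < 0.9: Behavior 2, frequent "the" pulls p("the"|theta_d) back up
```

Increasing p_b in this sketch mimics the "What if we increase p(θB)?" question from the slide.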
Estimation of One Topic: p(w|θd)
How do we set θd to maximize p(d|Λ)? All other parameters are known: p(θd) = p(θB) = 0.5 and p(w|θB) given (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006); the unknowns are p(w|θd): text ?, mining ?, association ?, clustering ?, …, the ?
If we know which word is from which distribution…
[Diagram: the words of d ("… text mining … is … clustering … we … text … the …") are split between θd (text ?, mining ?, association ?, clustering ?, …, the ?) and θB (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006), with p(θd) = p(θB) = 0.5.]
…then the words assigned to θd form a pseudo-document d', and estimating p(w|θd) reduces to the normalized counts of the single-topic case.
Given all the parameters, infer the distribution a word is from…
Is "text" more likely from θd or θB?
From θd (z=0)? proportional to p(θd) p("text"|θd). From θB (z=1)? proportional to p(θB) p("text"|θB).
(p(θd) = p(θB) = 0.5; p(w|θd): text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001; p(w|θB): the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006)
The Expectation-Maximization (EM) Algorithm
Hidden variable: z ∈ {0, 1}, one per word occurrence (z=0: the word is from θd; z=1: from θB), e.g., for "… the paper presents a text mining algorithm for clustering …": z = 1 1 1 1 0 0 0 1 0.
Initialize p(w|θd) with random values, then iteratively improve it using the E-step and M-step; stop when the likelihood doesn't change.
E-step (how likely w is from θd): p^(n)(z=0|w) = p(θd) p^(n)(w|θd) / [p(θd) p^(n)(w|θd) + p(θB) p(w|θB)]
M-step: p^(n+1)(w|θd) = c(w,d) p^(n)(z=0|w) / Σ_{w'∈V} c(w',d) p^(n)(z=0|w')
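A compact sketch of this EM loop for the two-component mixture, assuming a fixed p(θd) = p(θB) = 0.5 and a known background LM; the toy data and variable names are illustrative:

```python
import random
from collections import Counter

def em_one_topic(doc_tokens, p_wb, p_d=0.5, n_iter=50, seed=0):
    """EM for the two-component mixture: estimate p(w|theta_d), with theta_B and p(theta_d) fixed."""
    rng = random.Random(seed)
    counts = Counter(doc_tokens)
    vocab = list(counts)
    # Random initialization of p(w|theta_d), normalized to sum to 1
    p_wd = {w: rng.random() for w in vocab}
    z = sum(p_wd.values())
    p_wd = {w: v / z for w, v in p_wd.items()}

    p_b = 1.0 - p_d
    for _ in range(n_iter):
        # E-step: probability that each word occurrence came from theta_d (z = 0)
        post = {w: (p_d * p_wd[w]) / (p_d * p_wd[w] + p_b * p_wb.get(w, 1e-12))
                for w in vocab}
        # M-step: re-estimate p(w|theta_d) from the fractional counts allocated to theta_d
        alloc = {w: counts[w] * post[w] for w in vocab}
        total = sum(alloc.values())
        p_wd = {w: alloc[w] / total for w in vocab}
    return p_wd

# Toy usage: the background strongly favors "the", so EM pushes it out of theta_d
background = {"the": 0.3, "a": 0.2, "is": 0.15, "we": 0.1, "text": 0.05,
              "mining": 0.05, "clustering": 0.05, "food": 0.1}
doc = "the text mining paper is about text clustering the the".split()
print(em_one_topic(doc, background))
```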
EM Computation in Action
[Table: E-step and M-step iterations on a toy document, assuming p(θd) = p(θB) = 0.5 and p(w|θB) known; the likelihood increases with each iteration.]
"By-products": the E-step also gives p(z=0|w) for every word. Are they also useful?
EM As Hill-Climbing: Converge to a Local Maximum
[Figure: the likelihood p(d|Λ) as a function of Λ; at the current guess, the E-step computes a lower bound of the original likelihood, and the M-step maximizes that lower bound to obtain the next guess.]
E-step = computing the lower bound; M-step = maximizing the lower bound.
A General Introduction to EM
Data: X (observed) + H (hidden); parameter: θ
"Incomplete" likelihood: L(θ) = log p(X|θ)
"Complete" likelihood: Lc(θ) = log p(X, H|θ)
EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess θ^(0):
1. E-step: compute the expectation of the complete likelihood, Q(θ; θ^(n-1)) = E_{p(H|X, θ^(n-1))}[Lc(θ)] = Σ_H p(H|X, θ^(n-1)) log p(X, H|θ)
2. M-step: compute θ^(n) by maximizing the Q-function, θ^(n) = argmax_θ Q(θ; θ^(n-1))
Convergence Guarantee
Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) - L(θ^(n-1)) ≥ 0.
Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = Lc(θ) - log p(H|X, θ), so
L(θ^(n)) - L(θ^(n-1)) = Lc(θ^(n)) - Lc(θ^(n-1)) + log [p(H|X, θ^(n-1)) / p(H|X, θ^(n))].
Taking the expectation w.r.t. p(H|X, θ^(n-1)) (the left-hand side doesn't contain H),
L(θ^(n)) - L(θ^(n-1)) = Q(θ^(n); θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D(p(H|X, θ^(n-1)) || p(H|X, θ^(n))).
EM chooses θ^(n) to maximize Q, and the KL-divergence is always non-negative. Therefore, L(θ^(n)) ≥ L(θ^(n-1))!
EM as Hill-Climbing: Converging to a Local Maximum
L(θ) = L(θ^(n-1)) + Q(θ; θ^(n-1)) - Q(θ^(n-1); θ^(n-1)) + D(p(H|X, θ^(n-1)) || p(H|X, θ))
     ≥ L(θ^(n-1)) + Q(θ; θ^(n-1)) - Q(θ^(n-1); θ^(n-1))   (the lower bound, i.e., the Q-function up to a constant)
[Figure: the likelihood p(X|θ) and its lower bound around the current guess θ^(n-1); the next guess maximizes the lower bound.]
E-step = computing the lower bound; M-step = maximizing the lower bound.
Document as a Sample of Mixed Topics
Blog article about "Hurricane Katrina": [Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance] …
Topic 1: government 0.3, response 0.2, …
Topic 2: city 0.2, new 0.1, orleans 0.05, …
Topic k: donate 0.1, relief 0.05, help 0.02, …
Background θB: the 0.04, a 0.03, …
Many applications are possible if we can "decode" the topics in text…
Mining Multiple Topics from Text
INPUT: C, k, V → OUTPUT: {θ1, …, θk}, {πi1, …, πik}
[Diagram: the same computation task as before: k word distributions (θ1: sports 0.02, game 0.01, basketball 0.005, football 0.004, …; θ2: travel 0.05, attraction 0.03, trip 0.01, …; θk: science 0.04, scientist 0.03, spaceship 0.006, …) and per-document coverage (e.g., π11 = 30%, π12 = 12%, π1k = 8%, π21 = πN1 = 0%).]
Generating Text with Multiple Topics: p(w) = ?
To generate a word w in document d, first decide whether to use the background (with probability λB) or a content topic (with probability 1 - λB); if a content topic, pick topic θj with probability p(θj) = πd,j and then draw w from p(w|θj):
p(w) = λB p(w|θB) + (1 - λB) Σ_{j=1..k} πd,j p(w|θj)
Topic 1: government 0.3, response 0.2, …; Topic 2: city 0.2, new 0.1, orleans 0.05, …; Topic k: donate 0.1, relief 0.05, help 0.02, …; Background θB: the 0.04, a 0.03, …
Probabilistic Latent Semantic Analysis (PLSA)
The log-likelihood of a document d under PLSA (with a fixed background model):
log p(d|Λ) = Σ_{w∈V} c(w,d) log [ λB p(w|θB) + (1 - λB) Σ_{j=1..k} πd,j p(w|θj) ]
where λB = percentage of background words (known), p(w|θB) = background LM (known), πd,j = coverage of topic θj in doc d, and p(w|θj) = prob. of word w in topic θj.
Unknown parameters: Λ = ({πd,j}, {θj}), j = 1, …, k. How many unknown parameters are there in total?
ML Parameter Estimation
Constrained optimization:
Λ* = argmax_Λ Σ_{d∈C} Σ_{w∈V} c(w,d) log [ λB p(w|θB) + (1 - λB) Σ_{j=1..k} πd,j p(w|θj) ]
subject to Σ_{w∈V} p(w|θj) = 1 for every topic θj, and Σ_{j=1..k} πd,j = 1 for every document d.
EM Algorithm for PLSA: E-Step
Hidden variable (= topic indicator): zd,w ∈ {B, 1, 2, …, k}
Probability that word w in doc d is generated from topic θj (use of Bayes' rule):
p(zd,w = j) = π^(n)_{d,j} p^(n)(w|θj) / Σ_{j'=1..k} π^(n)_{d,j'} p^(n)(w|θj')
Probability that word w in doc d is generated from the background θB:
p(zd,w = B) = λB p(w|θB) / [ λB p(w|θB) + (1 - λB) Σ_{j=1..k} π^(n)_{d,j} p^(n)(w|θj) ]
EM Algorithm for PLSA: M-Step
Hidden variable (= topic indicator): zd,w ∈ {B, 1, 2, …, k}
Re-estimated probability of doc d covering topic θj:
π^(n+1)_{d,j} = Σ_{w∈V} c(w,d) (1 - p(zd,w = B)) p(zd,w = j) / Σ_{j'=1..k} Σ_{w∈V} c(w,d) (1 - p(zd,w = B)) p(zd,w = j')
Re-estimated probability of word w for topic θj (ML estimate based on the word counts "allocated" to topic θj):
p^(n+1)(w|θj) = Σ_{d∈C} c(w,d) (1 - p(zd,w = B)) p(zd,w = j) / Σ_{w'∈V} Σ_{d∈C} c(w',d) (1 - p(zd,w' = B)) p(zd,w' = j)
Computation of the EM Algorithm
• Initialize all unknown parameters randomly
• Repeat until the likelihood converges
  • E-step
  • M-step (what's the normalizer for this one?)
In general: accumulate counts, then normalize.
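A condensed sketch of this PLSA EM loop with a fixed background model; the data layout, plain-dict parameters, and fixed iteration count are assumptions of the sketch (a real implementation would use matrices and monitor the likelihood for convergence):

```python
import random
from collections import Counter

def plsa_em(docs, k, p_wb, lam_b=0.5, n_iter=50, seed=0):
    """PLSA with a fixed background LM p_wb and background weight lam_b.
    docs: list of token lists. Returns (topics, coverage) with
      topics[j][w]   = p(w | theta_j)
      coverage[d][j] = pi_{d,j}
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})

    def normalize(dist):
        s = sum(dist.values()) or 1.0
        return {x: v / s for x, v in dist.items()}

    # Random initialization of the unknown parameters
    topics = [normalize({w: rng.random() for w in vocab}) for _ in range(k)]
    coverage = [normalize({j: rng.random() for j in range(k)}) for _ in docs]

    for _ in range(n_iter):
        new_topics = [{w: 0.0 for w in vocab} for _ in range(k)]
        new_cov = []
        for di, cnt in enumerate(counts):
            cov_acc = {j: 0.0 for j in range(k)}
            for w, c in cnt.items():
                # E-step: posterior of the topic indicator z_{d,w}
                mix = [coverage[di][j] * topics[j][w] for j in range(k)]
                content = sum(mix)
                p_zb = lam_b * p_wb.get(w, 1e-12) / (
                    lam_b * p_wb.get(w, 1e-12) + (1 - lam_b) * content + 1e-300)
                post = [m / (content + 1e-300) for m in mix]   # p(z_{d,w} = j)
                # M-step accumulation: allocate c(w,d) to the topics, then normalize
                for j in range(k):
                    alloc = c * (1 - p_zb) * post[j]
                    new_topics[j][w] += alloc
                    cov_acc[j] += alloc
            new_cov.append(normalize(cov_acc))
        topics = [normalize(t) for t in new_topics]
        coverage = new_cov
    return topics, coverage
```

Setting λB fairly high makes the background absorb the common words, so the k content topics stay more discriminative, which is exactly the motivation given earlier for factoring out background words.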