This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

CS 679: Text Mining
Lecture #6: Expectation Maximization for Mixture of Multinomials
Slides by Eric Ringger
Announcements
• Reading Report #4: Russell & Norvig 14.5 (due today)
• Potential Publications (Assignment #3): rank and select 10, with a focus on the best 2
• Reading Report #5: MacKay on MCMC (due next Wednesday)
Objectives
• Explore EM for the MM model
• Discuss marginal likelihood
• Consider three possible initialization methods for EM
Mixture of Multinomials Model in One Slide
• Mixture model: [plate diagram: class variable c_i and observed word counts x_{i,j} for each document, with plates of size N (documents) and V (vocabulary)]
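For reference, a minimal write-out of the likelihood that the plate diagram encodes; the mixture-weight symbol λ_k and word-probability symbol θ_{k,j} are assumed notation, not taken from the original slide:

```latex
% Mixture of multinomials for document x_i with word counts x_{i,j}
% (assumed notation: \lambda_k = mixture weights, \theta_{k,j} = word probabilities)
P(x_i) \;=\; \sum_{k=1}^{K} P(c_i = k)\, P(x_i \mid c_i = k)
       \;=\; \sum_{k=1}^{K} \lambda_k \prod_{j=1}^{V} \theta_{k,j}^{\,x_{i,j}}
```

The multinomial coefficient is omitted since it does not depend on the cluster and cancels in the posteriors used below.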
Clustering Methods
• Algorithms compared by Meila & Heckerman:
  • Probabilistic HAC (like Ward’s method)
  • EM = Expectation Maximization
  • CEM = Classification EM = “Hard EM”: winner-take-all E-step; analogous to probabilistic k-means (using a mixture of multinomials instead of a mixture of Gaussians)
M&H’s Probabilistic Approach to Hierarchical Agglomerative Clustering
• Likelihood of the data D given an assignment of the data to clusters (see the reconstruction below)
• Where:
  • θ_k are the ML parameters of the (multinomial) distribution P(x_i | c_k)
  • L_k(θ_k) depends only on the cases in cluster c_k
• When merging clusters k and l:
  • Assign all cases to the newly formed cluster c_⟨k,l⟩
  • The log-likelihood L decreases by the quantity d(k, l) sketched below
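A hedged reconstruction of the formulas referenced above, following Meila & Heckerman; the exact notation on the original slide is not recoverable:

```latex
% Log-likelihood of the data D under a hard assignment to clusters (assumed form)
L(D) \;=\; \sum_{k} L_k(\theta_k),
\qquad
L_k(\theta_k) \;=\; \sum_{x_i \in c_k} \log P(x_i \mid \theta_k)

% Decrease in log-likelihood when clusters k and l are merged into c_{\langle k,l \rangle}
d(k,l) \;=\; L_k(\theta_k) + L_l(\theta_l) \;-\; L_{\langle k,l \rangle}\!\left(\theta_{\langle k,l \rangle}\right)
```

At each step, the agglomeration greedily merges the pair of clusters whose merge causes the smallest decrease d(k, l).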
High Level: Expectation-Maximization (EM)
• Iterative procedure (sketched below):
  1. Choose some initial parameters for the model
  2. Use the model to estimate partial label counts for all docs (E-step)
  3. Use the new complete data to learn a better model (M-step)
  4. Repeat steps 2 and 3 until convergence
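A minimal Python sketch of this loop; e_step and m_step are hypothetical helper functions standing in for the steps described on the following slides:

```python
def run_em(docs, init_params, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM driver: alternate the E and M steps until the data
    log-likelihood stops improving. e_step/m_step are hypothetical helpers."""
    params = init_params
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E-step: posterior cluster distributions and fractional counts
        posteriors, counts, log_likelihood = e_step(docs, params)
        # M-step: re-estimate model parameters from the fractional counts
        params = m_step(counts)
        if log_likelihood - prev_ll < tol:   # converged: likelihood flattened out
            break
        prev_ll = log_likelihood
    return params
```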
EM
• D: posterior distribution over clusters p(c|d) for all documents
• C: fractional total counts based on D
• θ: model parameter estimates, computed from C
• When to stop?
EM for Naïve Bayes: E Step
Toy documents: “Dog dog dog cat”, “Canine dog woof”, “Cat cat cat cat”, “Dog dog dog dog”, “Feline cat meow cat”, “Dog dog cat cat”, “Dog dog dog dog”, “Feline cat”, “Cat dog cat”, “Dog collar”, “Dog dog cat”, “Raining cats and dogs”, “Dog dog”, “Puppy dog litter”, “Cat kitten cub”, “Kennel dog pound”, “Kitten cat litter”, “Dog cat cat cat cat”, “Year of the dog”, “Dog cat mouse kitten”
EM for Naïve Bayes: E Step
Document j: “Dog cat cat cat cat”
1. Suppose: p(c_j = k1 | x_j) = 0.2 and p(c_j = k2 | x_j) = 0.8
2. Then:
  Count(k1) += 0.2, Count(k2) += 0.8
  Count(k1, dog) += 1 × 0.2, Count(k2, dog) += 1 × 0.8
  Count(k1, cat) += 4 × 0.2, Count(k2, cat) += 4 × 0.8
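A small Python sketch of this fractional-count bookkeeping; the document and posteriors mirror the example above, while the dictionary layout and names are illustrative assumptions:

```python
from collections import defaultdict

# Toy document and its (given) posterior over the two clusters
doc = ["dog", "cat", "cat", "cat", "cat"]
posterior = {"k1": 0.2, "k2": 0.8}        # p(c_j = k | x_j) from the E-step

class_counts = defaultdict(float)          # Count(k)
word_counts = defaultdict(float)           # Count(k, w)

for k, p in posterior.items():
    class_counts[k] += p                   # Count(k)   += p(c_j = k | x_j)
    for w in doc:
        word_counts[(k, w)] += p           # Count(k,w) += (occurrences of w) * posterior

print(dict(class_counts))                  # {'k1': 0.2, 'k2': 0.8}
print(word_counts[("k2", "cat")])          # 4 * 0.8 = 3.2
```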
EM for Naïve Bayes: M Step • Once you have your partial counts, re-estimate parameters like you would with regular counts
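Continuing that example, a sketch of the re-estimation from fractional counts as plain maximum-likelihood normalization (counts copied from the single-document example above; variable names are illustrative):

```python
# Fractional counts from the E-step (values from the example above)
class_counts = {"k1": 0.2, "k2": 0.8}
word_counts = {("k1", "dog"): 0.2, ("k1", "cat"): 0.8,
               ("k2", "dog"): 0.8, ("k2", "cat"): 3.2}

vocab = ["dog", "cat"]
clusters = ["k1", "k2"]
n_docs = sum(class_counts.values())        # total (fractional) document count

# Mixture weights: Count(k) / N
mixture_weights = {k: class_counts[k] / n_docs for k in clusters}

# Word probabilities: Count(k, w) / sum over w' of Count(k, w')
word_probs = {}
for k in clusters:
    total = sum(word_counts[(k, w)] for w in vocab)
    word_probs[k] = {w: word_counts[(k, w)] / total for w in vocab}
```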
Etc.
• In the next E step, the partial counts will be different, because the parameters have changed (e.g., the posteriors p(c_j | x_j) will shift toward whichever cluster now explains each document better)
• And the likelihood will increase!
EM for Multinomial Mixture Model
• Initialize model parameters
• E-step:
  1. Calculate posterior distributions over clusters for each document x_i (see the sketch below)
  2. Use the posteriors as fractional labels; recount your counts!
• CEM: winner-take-all instead, i.e., assign each document entirely to its most probable cluster rather than using fractional labels
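A minimal sketch of the posterior computation, done in log space for numerical stability; the array names lam and theta and the toy numbers are assumptions, not code from the slides:

```python
import numpy as np

def posterior_over_clusters(x_counts, lam, theta):
    """E-step posterior p(c_i = k | x_i) for one document.

    x_counts : (V,) vector of word counts for document x_i
    lam      : (K,) mixture weights
    theta    : (K, V) per-cluster word probabilities (rows sum to 1)
    """
    # log p(c=k) + sum_j x_{i,j} * log theta_{k,j}
    log_joint = np.log(lam) + x_counts @ np.log(theta).T   # shape (K,)
    log_joint -= log_joint.max()                           # avoid underflow
    post = np.exp(log_joint)
    return post / post.sum()

# Example with K=2 clusters and the toy vocabulary ["dog", "cat"]
lam = np.array([0.5, 0.5])
theta = np.array([[0.8, 0.2],      # cluster 1 favors "dog"
                  [0.3, 0.7]])     # cluster 2 favors "cat"
x = np.array([1, 4])               # the document "Dog cat cat cat cat"
print(posterior_over_clusters(x, lam, theta))
```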
M-Step
• Re-estimate the mixture weights and the per-cluster multinomial parameters from the fractionally labeled data
• What about smoothing? (One option is sketched below.)
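One common answer to the smoothing question is a symmetric add-α (Dirichlet) prior on each estimate; a hedged sketch, since the slide does not show its own smoothing choice (λ_k and θ_{k,w} are assumed notation):

```latex
% M-step with add-\alpha (Dirichlet) smoothing; \alpha = 1 gives add-one smoothing
\hat{\lambda}_k \;=\; \frac{\operatorname{Count}(k) + \alpha}{N + \alpha K},
\qquad
\hat{\theta}_{k,w} \;=\; \frac{\operatorname{Count}(k,w) + \alpha}{\sum_{w'} \operatorname{Count}(k,w') + \alpha V}
```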
Convergence & EM Properties
• Each step of EM is guaranteed to increase the data likelihood: a hill-climbing procedure
• Not guaranteed to find the global maximum of the data likelihood
• The data likelihood typically has many local maxima for a general model class and rich feature set
• Many “patterns” in the data that we can fit our model to…
[Figure: data likelihood plotted over model parameterizations, with multiple local maxima]
EM in General
• EM is a technique for learning any time we have incomplete data (x, y)
  • Induction scenario (ours): we actually want to know y (e.g., clustering)
  • Convenience scenario: we want the marginal P(x), and including y just makes the model simpler (e.g., mixing weights)
• General approach: learn y and θ
  • E-step: make a guess at the posteriors P(y | x, θ)
    • This means scoring all completions with the current parameters
    • This is actually the easy part: treat the completions as (fractional) complete data, and count
  • M-step: fit θ to maximize the log-likelihood log P(x, y | θ)
    • Then compute (smoothed) ML estimates of the model parameters
• EM is only locally optimal (why?)
Marginal Likelihood of the Data
• Make the model choice (MM with K mixture components) explicit as part of the equation
• The posterior:
Marginal Likelihood of the Data
• Marginalize the (factored) joint probability of the data and the model parameters (see the sketch below)
• The first term involves a distribution over model parameters!
• Assumes a model class and a model size (K)
• Can now vary K and choose the best: “model selection”
• Must approximate this integral
  • Meila & Heckerman use the Cheeseman-Stutz approximation
  • Others exist
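A hedged reconstruction of the marginalization these bullets refer to, in its standard form (the exact factorization shown on the slide is not recoverable):

```latex
% Marginal likelihood of the data under model class/size M_K
P(D \mid M_K) \;=\; \int P(D, \theta \mid M_K)\, d\theta
             \;=\; \int P(\theta \mid M_K)\, P(D \mid \theta, M_K)\, d\theta
```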
Log Marginal Likelihood of the Data & Model Selection
• Compare models using the log marginal likelihood, NOT the ordinary likelihood function
• [Graph from Meila & Heckerman]
Marginal Likelihood
• Great deck of slides by Dan Walker
• Will be available as optional reading, linked from the schedule entry for today
• We may use them in class later (after we learn about Gibbs sampling)
Initialization #1: Random
• Initialize the parameters of the model independently of the data
• Mixture weights (parameters of the distribution of the hidden class variable): initialize to the uniform distribution
• θ_k (1 ≤ k ≤ K), the parameters of the distribution of the data vectors given the k-th hidden class variable: sample from an uninformative distribution, namely a uniform Dirichlet distribution with α_j = 1 (1 ≤ j ≤ V); see the sketch below
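A minimal NumPy sketch of this random initialization; K, V, the seed, and the names lam/theta are illustrative assumptions:

```python
import numpy as np

def random_init(K, V, seed=0):
    """Initialization #1: uniform mixture weights, and each cluster's word
    distribution sampled from a uniform Dirichlet (all alphas equal to 1)."""
    rng = np.random.default_rng(seed)
    lam = np.full(K, 1.0 / K)                    # uniform mixture weights
    theta = rng.dirichlet(np.ones(V), size=K)    # (K, V), each row sums to 1
    return lam, theta
```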
Initialization #2: Noisy-Marginal (a.k.a. “Marginal”)
• Mixture weights: same as in Initialization #1
• θ_k, the parameters of the distribution of the data vectors given the k-th hidden class variable: initialize the model in a data-dependent fashion
• Estimate the parameters of a single Dirichlet distribution as follows (sketched below):
  1. Compute the count of each feature w_j (1 ≤ j ≤ V) over the whole data set (this is the ML estimate), or the MAP estimate when a Dirichlet prior is employed (e.g., [1, 1, 1, …, 1] == add-one smoothing)
  2. Normalize by the total count of observed features (i.e., the sum of x_j over all j)
  3. Multiply by a user-specified sample size (“equivalent sample size”; they use 2), which raises the height of the density peak around the mean
• For each component k, sample parameters θ_k from this Dirichlet(α)
• (Thiesson, Meek, Chickering, and Heckerman)
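A hedged Python sketch of the steps just listed, assuming X is an (N, V) matrix of word counts, an equivalent sample size of 2 as mentioned above, and an add-one prior for the MAP variant; function and variable names are illustrative:

```python
import numpy as np

def noisy_marginal_init(X, K, equivalent_sample_size=2.0, prior=1.0, seed=0):
    """Initialization #2 (Noisy-Marginal): fit one Dirichlet around the data
    marginal, then sample each cluster's word distribution from it."""
    rng = np.random.default_rng(seed)
    V = X.shape[1]
    # MAP estimate of the marginal word distribution (add-`prior` smoothing)
    marginal = (X.sum(axis=0) + prior) / (X.sum() + prior * V)
    # Dirichlet concentration: marginal mean scaled by the equivalent sample size
    alpha = equivalent_sample_size * marginal
    lam = np.full(K, 1.0 / K)                    # uniform mixture weights
    theta = rng.dirichlet(alpha, size=K)         # noisy copies of the marginal
    return lam, theta
```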
Initialization #3: HAC
• Data-dependent
• Perform HAC on a random sample of the data
  • Using all of the data is often intractable
  • Inaccuracies from sub-sampling may be corrected by EM or CEM
• Extract a set of sufficient statistics (counts)
• Set the model parameters to the MAP estimate given the counts obtained from agglomeration and a uniform (Dirichlet) prior distribution (sketched below)
• Cluster refinement
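A sketch of this initialization using scikit-learn's agglomerative clustering as a stand-in for the probabilistic HAC described earlier; the slides do not prescribe a library, and the sub-sample size and names are assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hac_init(X, K, sample_size=500, prior=1.0, seed=0):
    """Initialization #3: cluster a random sub-sample with HAC, then set
    MAP parameters from the resulting hard counts (uniform Dirichlet prior)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
    Xs = X[idx]
    labels = AgglomerativeClustering(n_clusters=K).fit_predict(Xs)

    V = X.shape[1]
    lam = np.zeros(K)
    theta = np.zeros((K, V))
    for k in range(K):
        members = Xs[labels == k]
        # MAP mixture weight and word probabilities under a uniform Dirichlet prior
        lam[k] = (len(members) + prior) / (len(Xs) + prior * K)
        theta[k] = (members.sum(axis=0) + prior) / (members.sum() + prior * V)
    return lam, theta
```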
Next
• MCMC
• Gibbs sampling
• LDA