This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

CS 679: Text Mining
Lecture #6: Expectation Maximization for Mixture of Multinomials
Slides by Eric Ringger
Announcements
• Reading Report #4: Russell & Norvig 14.5 (due today)
• Potential Publications (Assignment #3): rank and select 10, with a focus on the best 2
• Reading Report #5: MacKay on MCMC (due next Wednesday)
Objectives
• Explore EM for the MM model
• Discuss marginal likelihood
• Consider three possible initialization methods for EM
Mixture of Multinomials Model in One Slide
• Mixture model: [plate diagram: class variable c_i and observed word counts x_{i,j} for each document, with plates of size N (documents) and V (vocabulary)]
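For reference, a minimal write-out of the likelihood that the plate diagram encodes; the mixture-weight symbol λ_k and word-probability symbol θ_{k,j} are assumed notation, not taken from the original slide:

```latex
% Mixture of multinomials for document x_i with word counts x_{i,j}
% (assumed notation: \lambda_k = mixture weights, \theta_{k,j} = word probabilities)
P(x_i) \;=\; \sum_{k=1}^{K} P(c_i = k)\, P(x_i \mid c_i = k)
       \;=\; \sum_{k=1}^{K} \lambda_k \prod_{j=1}^{V} \theta_{k,j}^{\,x_{i,j}}
```

The multinomial coefficient is omitted since it does not depend on the cluster and cancels in the posteriors used below.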
Clustering Methods
• Algorithms compared by Meila & Heckerman:
  • Probabilistic HAC (like Ward’s method)
  • EM = Expectation Maximization
  • CEM = Classification EM = “Hard EM”: winner-take-all E-step; analogous to probabilistic k-means (using a mixture of multinomials instead of a mixture of Gaussians)
M&H’s Probabilistic Approach to Hierarchical Agglomerative Clustering
• Likelihood of the data D given an assignment of the data to clusters (see the reconstruction below)
• Where:
  • θ_k are the ML parameters of the (multinomial) distribution P(x_i | c_k)
  • L_k(θ_k) depends only on the cases in cluster c_k
• When merging clusters k and l:
  • Assign all cases to the newly formed cluster c_⟨k,l⟩
  • The log-likelihood L decreases by the quantity d(k, l) sketched below
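A hedged reconstruction of the formulas referenced above, following Meila & Heckerman; the exact notation on the original slide is not recoverable:

```latex
% Log-likelihood of the data D under a hard assignment to clusters (assumed form)
L(D) \;=\; \sum_{k} L_k(\theta_k),
\qquad
L_k(\theta_k) \;=\; \sum_{x_i \in c_k} \log P(x_i \mid \theta_k)

% Decrease in log-likelihood when clusters k and l are merged into c_{\langle k,l \rangle}
d(k,l) \;=\; L_k(\theta_k) + L_l(\theta_l) \;-\; L_{\langle k,l \rangle}\!\left(\theta_{\langle k,l \rangle}\right)
```

At each step, the agglomeration greedily merges the pair of clusters whose merge causes the smallest decrease d(k, l).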
High Level: Expectation-Maximization (EM)
• Iterative procedure (sketched below):
  1. Choose some initial parameters for the model
  2. Use the model to estimate partial label counts for all docs (E-step)
  3. Use the new complete data to learn a better model (M-step)
  4. Repeat steps 2 and 3 until convergence
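A minimal Python sketch of this loop; e_step and m_step are hypothetical helper functions standing in for the steps described on the following slides:

```python
def run_em(docs, init_params, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM driver: alternate the E and M steps until the data
    log-likelihood stops improving. e_step/m_step are hypothetical helpers."""
    params = init_params
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E-step: posterior cluster distributions and fractional counts
        posteriors, counts, log_likelihood = e_step(docs, params)
        # M-step: re-estimate model parameters from the fractional counts
        params = m_step(counts)
        if log_likelihood - prev_ll < tol:   # converged: likelihood flattened out
            break
        prev_ll = log_likelihood
    return params
```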
EM
• D: posterior distribution over clusters p(c|d) for all documents
• C: fractional total counts based on D
• θ: model parameter estimates, computed from C
• When to stop?
EM for Naïve Bayes: E Step
Toy documents: “Dog dog dog cat”, “Canine dog woof”, “Cat cat cat cat”, “Dog dog dog dog”, “Feline cat meow cat”, “Dog dog cat cat”, “Dog dog dog dog”, “Feline cat”, “Cat dog cat”, “Dog collar”, “Dog dog cat”, “Raining cats and dogs”, “Dog dog”, “Puppy dog litter”, “Cat kitten cub”, “Kennel dog pound”, “Kitten cat litter”, “Dog cat cat cat cat”, “Year of the dog”, “Dog cat mouse kitten”
EM for Naïve Bayes: E Step
Document j: “Dog cat cat cat cat”
1. Suppose: p(c_j = k1 | x_j) = 0.2 and p(c_j = k2 | x_j) = 0.8
2. Then:
  Count(k1) += 0.2, Count(k2) += 0.8
  Count(k1, dog) += 1 × 0.2, Count(k2, dog) += 1 × 0.8
  Count(k1, cat) += 4 × 0.2, Count(k2, cat) += 4 × 0.8
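A small Python sketch of this fractional-count bookkeeping; the document and posteriors mirror the example above, while the dictionary layout and names are illustrative assumptions:

```python
from collections import defaultdict

# Toy document and its (given) posterior over the two clusters
doc = ["dog", "cat", "cat", "cat", "cat"]
posterior = {"k1": 0.2, "k2": 0.8}        # p(c_j = k | x_j) from the E-step

class_counts = defaultdict(float)          # Count(k)
word_counts = defaultdict(float)           # Count(k, w)

for k, p in posterior.items():
    class_counts[k] += p                   # Count(k)   += p(c_j = k | x_j)
    for w in doc:
        word_counts[(k, w)] += p           # Count(k,w) += (occurrences of w) * posterior

print(dict(class_counts))                  # {'k1': 0.2, 'k2': 0.8}
print(word_counts[("k2", "cat")])          # 4 * 0.8 = 3.2
```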
EM for Naïve Bayes: M Step • Once you have your partial counts, re-estimate parameters like you would with regular counts
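Continuing that example, a sketch of the re-estimation from fractional counts as plain maximum-likelihood normalization (counts copied from the single-document example above; variable names are illustrative):

```python
# Fractional counts from the E-step (values from the example above)
class_counts = {"k1": 0.2, "k2": 0.8}
word_counts = {("k1", "dog"): 0.2, ("k1", "cat"): 0.8,
               ("k2", "dog"): 0.8, ("k2", "cat"): 3.2}

vocab = ["dog", "cat"]
clusters = ["k1", "k2"]
n_docs = sum(class_counts.values())        # total (fractional) document count

# Mixture weights: Count(k) / N
mixture_weights = {k: class_counts[k] / n_docs for k in clusters}

# Word probabilities: Count(k, w) / sum over w' of Count(k, w')
word_probs = {}
for k in clusters:
    total = sum(word_counts[(k, w)] for w in vocab)
    word_probs[k] = {w: word_counts[(k, w)] / total for w in vocab}
```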
Etc.
• In the next E step, the partial counts will be different, because the parameters have changed (e.g., the posteriors p(c_j | x_j) will shift toward whichever cluster now explains each document better)
• And the likelihood will increase!
EM for Multinomial Mixture Model
• Initialize model parameters
• E-step:
  1. Calculate posterior distributions over clusters for each document x_i (see the sketch below)
  2. Use the posteriors as fractional labels; recount your counts!
• CEM: winner-take-all instead, i.e., assign each document entirely to its most probable cluster rather than using fractional labels
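A minimal sketch of the posterior computation, done in log space for numerical stability; the array names lam and theta and the toy numbers are assumptions, not code from the slides:

```python
import numpy as np

def posterior_over_clusters(x_counts, lam, theta):
    """E-step posterior p(c_i = k | x_i) for one document.

    x_counts : (V,) vector of word counts for document x_i
    lam      : (K,) mixture weights
    theta    : (K, V) per-cluster word probabilities (rows sum to 1)
    """
    # log p(c=k) + sum_j x_{i,j} * log theta_{k,j}
    log_joint = np.log(lam) + x_counts @ np.log(theta).T   # shape (K,)
    log_joint -= log_joint.max()                           # avoid underflow
    post = np.exp(log_joint)
    return post / post.sum()

# Example with K=2 clusters and the toy vocabulary ["dog", "cat"]
lam = np.array([0.5, 0.5])
theta = np.array([[0.8, 0.2],      # cluster 1 favors "dog"
                  [0.3, 0.7]])     # cluster 2 favors "cat"
x = np.array([1, 4])               # the document "Dog cat cat cat cat"
print(posterior_over_clusters(x, lam, theta))
```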
M-Step
• Re-estimate the mixture weights and the per-cluster multinomial parameters from the fractionally labeled data
• What about smoothing? (One option is sketched below.)
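One common answer to the smoothing question is a symmetric add-α (Dirichlet) prior on each estimate; a hedged sketch, since the slide does not show its own smoothing choice (λ_k and θ_{k,w} are assumed notation):

```latex
% M-step with add-\alpha (Dirichlet) smoothing; \alpha = 1 gives add-one smoothing
\hat{\lambda}_k \;=\; \frac{\operatorname{Count}(k) + \alpha}{N + \alpha K},
\qquad
\hat{\theta}_{k,w} \;=\; \frac{\operatorname{Count}(k,w) + \alpha}{\sum_{w'} \operatorname{Count}(k,w') + \alpha V}
```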
Convergence & EM Properties
• Each step of EM is guaranteed to increase the data likelihood: a hill-climbing procedure
• Not guaranteed to find the global maximum of the data likelihood
• The data likelihood typically has many local maxima for a general model class and rich feature set
• Many “patterns” in the data that we can fit our model to…
[Figure: data likelihood plotted over model parameterizations, with multiple local maxima]
EM in General
• EM is a technique for learning any time we have incomplete data (x, y)
  • Induction scenario (ours): we actually want to know y (e.g., clustering)
  • Convenience scenario: we want the marginal P(x), and including y just makes the model simpler (e.g., mixing weights)
• General approach: learn y and θ
  • E-step: make a guess at the posteriors P(y | x, θ)
    • This means scoring all completions with the current parameters
    • This is actually the easy part: treat the completions as (fractional) complete data, and count
  • M-step: fit θ to maximize the log-likelihood log P(x, y | θ)
    • Then compute (smoothed) ML estimates of the model parameters
• EM is only locally optimal (why?)
Marginal Likelihood of the Data
• Make the model choice (MM with K mixture components) explicit as part of the equation
• The posterior:
Marginal Likelihood of the Data
• Marginalize the (factored) joint probability of the data and the model parameters (see the sketch below)
• The first term involves a distribution over model parameters!
• Assumes a model class and a model size (K)
• Can now vary K and choose the best: “model selection”
• Must approximate this integral
  • Meila & Heckerman use the Cheeseman-Stutz approximation
  • Others exist
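A hedged reconstruction of the marginalization these bullets refer to, in its standard form (the exact factorization shown on the slide is not recoverable):

```latex
% Marginal likelihood of the data under model class/size M_K
P(D \mid M_K) \;=\; \int P(D, \theta \mid M_K)\, d\theta
             \;=\; \int P(\theta \mid M_K)\, P(D \mid \theta, M_K)\, d\theta
```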
Log Marginal Likelihood of the Data & Model Selection
• Compare models using the log marginal likelihood, NOT the ordinary likelihood function
• [Graph from Meila & Heckerman]
Marginal Likelihood
• Great deck of slides by Dan Walker
• Will be available as optional reading, linked from the schedule entry for today
• We may use them in class later (after we learn about Gibbs sampling)
Initialization #1: Random
• Initialize the parameters of the model independently of the data
• Mixture weights (parameters of the distribution of the hidden class variable): initialize to the uniform distribution
• θ_k (1 ≤ k ≤ K), the parameters of the distribution of the data vectors given the k-th hidden class variable: sample from an uninformative distribution, namely a uniform Dirichlet distribution with α_j = 1 (1 ≤ j ≤ V); see the sketch below
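A minimal NumPy sketch of this random initialization; K, V, the seed, and the names lam/theta are illustrative assumptions:

```python
import numpy as np

def random_init(K, V, seed=0):
    """Initialization #1: uniform mixture weights, and each cluster's word
    distribution sampled from a uniform Dirichlet (all alphas equal to 1)."""
    rng = np.random.default_rng(seed)
    lam = np.full(K, 1.0 / K)                    # uniform mixture weights
    theta = rng.dirichlet(np.ones(V), size=K)    # (K, V), each row sums to 1
    return lam, theta
```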
Initialization #2: Noisy-Marginal (a.k.a. “Marginal”)
• Mixture weights: same as in Initialization #1
• θ_k, the parameters of the distribution of the data vectors given the k-th hidden class variable: initialize the model in a data-dependent fashion
• Estimate the parameters of a single Dirichlet distribution as follows (sketched below):
  1. Compute the count of each feature w_j (1 ≤ j ≤ V) over the whole data set (this is the ML estimate), or the MAP estimate when a Dirichlet prior is employed (e.g., [1, 1, 1, …, 1] == add-one smoothing)
  2. Normalize by the total count of observed features (i.e., the sum of x_j over all j)
  3. Multiply by a user-specified sample size (“equivalent sample size”; they use 2), which raises the height of the density peak around the mean
• For each component k, sample parameters θ_k from this Dirichlet(α)
• (Thiesson, Meek, Chickering, and Heckerman)
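A hedged Python sketch of the steps just listed, assuming X is an (N, V) matrix of word counts, an equivalent sample size of 2 as mentioned above, and an add-one prior for the MAP variant; function and variable names are illustrative:

```python
import numpy as np

def noisy_marginal_init(X, K, equivalent_sample_size=2.0, prior=1.0, seed=0):
    """Initialization #2 (Noisy-Marginal): fit one Dirichlet around the data
    marginal, then sample each cluster's word distribution from it."""
    rng = np.random.default_rng(seed)
    V = X.shape[1]
    # MAP estimate of the marginal word distribution (add-`prior` smoothing)
    marginal = (X.sum(axis=0) + prior) / (X.sum() + prior * V)
    # Dirichlet concentration: marginal mean scaled by the equivalent sample size
    alpha = equivalent_sample_size * marginal
    lam = np.full(K, 1.0 / K)                    # uniform mixture weights
    theta = rng.dirichlet(alpha, size=K)         # noisy copies of the marginal
    return lam, theta
```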
Initialization #3: HAC
• Data-dependent
• Perform HAC on a random sample of the data
  • Using all of the data is often intractable
  • Inaccuracies from sub-sampling may be corrected by EM or CEM
• Extract a set of sufficient statistics (counts)
• Set the model parameters to the MAP estimate given the counts obtained from agglomeration and a uniform (Dirichlet) prior distribution (sketched below)
• Cluster refinement
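A sketch of this initialization using scikit-learn's agglomerative clustering as a stand-in for the probabilistic HAC described earlier; the slides do not prescribe a library, and the sub-sample size and names are assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hac_init(X, K, sample_size=500, prior=1.0, seed=0):
    """Initialization #3: cluster a random sub-sample with HAC, then set
    MAP parameters from the resulting hard counts (uniform Dirichlet prior)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
    Xs = X[idx]
    labels = AgglomerativeClustering(n_clusters=K).fit_predict(Xs)

    V = X.shape[1]
    lam = np.zeros(K)
    theta = np.zeros((K, V))
    for k in range(K):
        members = Xs[labels == k]
        # MAP mixture weight and word probabilities under a uniform Dirichlet prior
        lam[k] = (len(members) + prior) / (len(Xs) + prior * K)
        theta[k] = (members.sum(axis=0) + prior) / (members.sum() + prior * V)
    return lam, theta
```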
Next
• MCMC
• Gibbs sampling
• LDA