CS 679 : Text Mining

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 679: Text Mining Lecture #5: Conjugate Priors Slides by Eric Ringger

Announcements • Reading Report #3: • Meila & Heckerman on EM • Still discussing fundamental ideas • Potential Publications • Due today • In preparation for pre-proposal • This week: rank and select with me • Reading Report #4: • Russell & Norvig 14.5 • Due: Wednesday

Objectives • Introduce the Beta and Dirichlet distributions • Explain the idea of a “conjugate prior”

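Probability Simplex Q: What space do the parameters of a categorical or multinomial distribution live in? A: The probability simplex. • 1-D simplex / 2 parameters (Bernoulli / binomial): a line segment • 2-D simplex / 3 parameters: a triangle • 3-D simplex / 4 parameters: a tetrahedron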
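
To make the simplex constraint concrete, here is a minimal sketch of a membership check: a parameter vector lies on the probability simplex when its entries are non-negative and sum to one. The function name `on_simplex` and the tolerance are illustrative assumptions, not part of the original slides.

```python
import math

def on_simplex(p, tol=1e-9):
    """Return True if p is non-negative and sums to 1 (within tol)."""
    non_negative = all(x >= -tol for x in p)
    sums_to_one = math.isclose(sum(p), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

# A Bernoulli parameter theta corresponds to the 2-vector (theta, 1 - theta),
# which lies on the 1-D simplex (a line segment).
print(on_simplex([0.3, 0.7]))
# Three parameters live on the 2-D simplex (a triangle).
print(on_simplex([0.2, 0.3, 0.5]))
# Sums to one but has a negative entry, so it is off the simplex.
print(on_simplex([0.6, 0.6, -0.2]))
```

Note that a K-parameter vector has only K-1 free dimensions, which is why two parameters trace out a 1-D simplex.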

Beta Distribution • Parameters α, β determine the form of the density • On the 1-D simplex • For these examples, we have a symmetric Beta with α = β • [Figure: three plots of the density f]

Preliminaries: Functions Credit: Wikipedia
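
The formulas on this slide did not survive extraction. Assuming the preliminaries are the Gamma and Beta functions (the two functions needed to normalize the Beta and Dirichlet densities), their standard definitions are given below. The symbols z, t, α, and β are chosen here for illustration.

```latex
\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt,
\qquad \Gamma(n) = (n-1)! \quad \text{for positive integers } n

B(\alpha, \beta) = \int_0^1 t^{\alpha-1} (1-t)^{\beta-1}\,dt
= \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}
```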

Beta Distribution Revisited • Parameters α, β determine the form of the density • On the 1-D simplex • For these examples, we have a symmetric Beta with α = β • [Figure: three plots of the density f]
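
As a rough illustration of how the parameters shape the density, the sketch below evaluates a symmetric Beta at a few points using only the standard library. The function name `beta_pdf`, the argument name `theta`, and the particular parameter values are assumptions made for this example; the slide's own plots are not preserved.

```python
import math

def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    # Log of the normalizer Gamma(a + b) / (Gamma(a) * Gamma(b)).
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_kernel = (alpha - 1) * math.log(theta) + (beta - 1) * math.log(1 - theta)
    return math.exp(log_norm + log_kernel)

# Symmetric case alpha = beta: values below 1 push mass toward the ends,
# 1 gives the uniform density, and values above 1 peak at 0.5.
for a in (0.5, 1.0, 2.0, 10.0):
    values = [round(beta_pdf(t, a, a), 3) for t in (0.1, 0.5, 0.9)]
    print(f"alpha = beta = {a}: f(0.1), f(0.5), f(0.9) = {values}")
```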

Dirichlet Distribution • [Figure: probability density of the Dirichlet distribution when K=3 for various parameter vectors α: α=(6, 2, 2), α=(3, 7, 5), α=(6, 2, 6), α=(2, 3, 4)] Credit: Wikipedia

Symmetric Dirichlet How the log of the density function changes when K=3 as we change the vector α from α=(0.3, 0.3, 0.3) to (2.0, 2.0, 2.0), keeping all the individual αi's equal to one another. Credit: Wikipedia

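Dirichlet Distribution • Density function • Dirichlet distribution of order K: $f(x_1, \dots, x_K; \alpha_1, \dots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$ • Where: • $x_i \ge 0$ and $\sum_{i=1}^{K} x_i = 1$ • $\alpha_i > 0$ • Denominator: $B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}$ Credit: Wikipedia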
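
The density can be evaluated directly from its standard form: a product of powers of the x_i, divided by a normalizer built from Gamma functions. The sketch below works in log space with the standard library; the function name `dirichlet_log_pdf` is an assumption for this example, and it assumes every x_i is strictly positive so the logarithms are defined.

```python
import math

def dirichlet_log_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the simplex.

    Assumes every x_i > 0, every alpha_i > 0, and sum(x) == 1.
    """
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    # Negative log of the normalizer: log Gamma(sum alpha) - sum log Gamma(alpha_i).
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1) * math.log(xi) for a, xi in zip(alpha, x))
    return log_norm + log_kernel

# With alpha = (1, 1, 1) the density is flat over the 2-D simplex.
print(math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1, 1, 1])))
print(math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [2, 3, 4])))
```

With K = 2 this reduces to the Beta density, which is the sense in which the Dirichlet generalizes the Beta.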

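Beta-Binomial Conjugacy 1. Assume a binomial model of our data $x$ (the number of successes in $n$ trials): $p(x \mid \theta) = \binom{n}{x} \theta^{x} (1-\theta)^{n-x}$ 2. Given data $x$, use Bayes' law to update the model: $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$ 3. Employ a prior distribution over $\theta$ with parameters $\alpha, \beta$: $p(\theta \mid \alpha, \beta) = \mathrm{Beta}(\theta; \alpha, \beta)$ 4. New version of (2): $p(\theta \mid x, \alpha, \beta) = \frac{p(x \mid \theta)\, p(\theta \mid \alpha, \beta)}{p(x \mid \alpha, \beta)}$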

Beta-Binomial Conjugacy
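
The body of this slide was not preserved. A sketch of the standard derivation it most likely carried is given below, under assumed notation: x successes in n binomial trials, success probability θ, and a Beta(α, β) prior on θ. Factors that do not depend on θ are absorbed into the proportionality sign.

```latex
\begin{align*}
p(\theta \mid x, \alpha, \beta)
  &\propto p(x \mid \theta)\, p(\theta \mid \alpha, \beta) \\
  &= \binom{n}{x} \theta^{x} (1-\theta)^{n-x}
     \cdot \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)} \\
  &\propto \theta^{\alpha + x - 1} (1-\theta)^{\beta + n - x - 1}
\end{align*}
```

The last line is the kernel of a Beta(α + x, β + n - x) density, so the posterior has the same form as the prior. That closure under Bayesian updating is what makes the Beta a conjugate prior for the binomial.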

Dirichlet-Multinomial Conjugacy • Generalization of Beta-Binomial Conjugacy
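
Under a Dirichlet(α) prior and multinomial (or categorical) observations, the standard conjugate update simply adds the observed counts to the prior parameters. A minimal sketch follows; the function names `dirichlet_posterior` and `posterior_mean` are hypothetical helpers introduced for illustration.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters: alpha_i + n_i for each outcome i."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

def posterior_mean(alpha):
    """Mean of Dirichlet(alpha): alpha_i divided by the sum of all alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Uniform prior over three outcomes, then 10 observations.
posterior = dirichlet_posterior([1, 1, 1], [3, 0, 7])
print(posterior)
print(posterior_mean(posterior))

# The Beta-binomial case is the K = 2 special case:
# Beta(2, 2) prior, 6 successes and 4 failures.
print(dirichlet_posterior([2, 2], [6, 4]))
```

The prior parameters act like pseudo-counts: the outcome that was never observed still gets non-zero posterior mean because of its prior count of 1.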

Mixture of Multinomials Model in One Slide • Mixture model • [Figure: graphical model with nodes c_i and x_i,j and plates V and N]
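
The slide's diagram is not recoverable, so the following is only a sketch of the usual generative story for a mixture of multinomials: each of N documents draws a cluster label c_i, then draws its words from that cluster's distribution over a vocabulary of size V. The names `sample_corpus`, `mixing`, and `word_dists`, the fixed document length, and the bag-of-counts representation are all assumptions made for this example.

```python
import random

def sample_corpus(mixing, word_dists, doc_length, num_docs, seed=0):
    """Sample (cluster label, word-count vector) pairs from a mixture of multinomials.

    mixing[k]        : prior probability of cluster k
    word_dists[k][v] : probability of vocabulary item v under cluster k
    """
    rng = random.Random(seed)
    vocab_size = len(word_dists[0])
    corpus = []
    for _ in range(num_docs):
        # Draw the document's cluster label c_i.
        c = rng.choices(range(len(mixing)), weights=mixing)[0]
        # Draw doc_length words from that cluster's word distribution.
        words = rng.choices(range(vocab_size), weights=word_dists[c], k=doc_length)
        counts = [0] * vocab_size
        for w in words:
            counts[w] += 1
        corpus.append((c, counts))
    return corpus

# Two clusters over a three-word vocabulary.
docs = sample_corpus([0.5, 0.5], [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
                     doc_length=20, num_docs=5)
for label, counts in docs:
    print(label, counts)
```

In clustering, the labels are hidden and only the counts are observed.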

Clustering Methods • Algorithms compared by Meila & Heckerman: • Probabilistic HAC (hierarchical agglomerative clustering) • Like Ward’s method • EM = Expectation Maximization • CEM = Classification EM = “Hard EM” • Winner-take-all E step • Analogous to probabilistic k-means (using a mixture of multinomials instead of a mixture of Gaussians)
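
The difference between the EM and CEM E steps can be sketched for a single data item. Given the log joint score of the item under each cluster, standard EM normalizes the scores into soft responsibilities, while the winner-take-all step assigns all of the weight to the best cluster. The function names and the input format are illustrative assumptions, not taken from Meila & Heckerman.

```python
import math

def soft_e_step(log_joint):
    """EM E step: responsibilities proportional to exp(log_joint[k]).

    log_joint[k] is assumed to be log p(c = k) + log p(x | c = k).
    """
    m = max(log_joint)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def hard_e_step(log_joint):
    """CEM (hard EM) E step: all weight goes to the highest-scoring cluster."""
    best = max(range(len(log_joint)), key=lambda k: log_joint[k])
    return [1.0 if k == best else 0.0 for k in range(len(log_joint))]

scores = [math.log(0.2), math.log(0.5), math.log(0.3)]
print(soft_e_step(scores))
print(hard_e_step(scores))
```

The M step is the same in both cases: re-estimate the parameters from the (soft or hard) weighted counts.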

Next • EM!
