Formal Multinomial and Multiple-Bernoulli Language Models Don Metzler
Overview • Two formal estimation techniques • MAP estimates [Zaragoza, Hiemstra, Tipping, SIGIR’03] • Posterior expectations • Language models considered • Multinomial • Multiple-Bernoulli (2 models)
Bayesian Framework (MAP Estimation) • Assume textual data X (document, query, etc.) is generated by sampling from some distribution P(X | θ) parameterized by θ • Assume some prior P(θ) over θ • For each X, we want to find the maximum a posteriori (MAP) estimate (sketched below) • θX is our (language) model for data X
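As a sketch of the estimate the last bullet refers to (the slide's own equation is not reproduced here), the MAP model for X is:

```latex
\hat{\theta}_X = \arg\max_{\theta} P(\theta \mid X)
              = \arg\max_{\theta} \frac{P(X \mid \theta)\,P(\theta)}{P(X)}
              = \arg\max_{\theta} P(X \mid \theta)\,P(\theta)
```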
Multinomial • Modeling assumptions: X is a sequence of terms drawn from a multinomial distribution over the vocabulary, parameterized by θ, with a Dirichlet(α) prior on θ • Why Dirichlet? • Conjugate prior to the multinomial • Easy to work with
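For reference, with term counts c(w, X), document length |X|, and a Dirichlet(α) prior, conjugacy gives the standard closed-form MAP estimate (a sketch consistent with the slide, not its original equation):

```latex
P(w \mid \hat{\theta}_X) = \frac{c(w, X) + \alpha_w - 1}{|X| + \sum_{v \in V} (\alpha_v - 1)}
```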
How do we set α? • α = 1 => uniform prior => ML estimate • α = 2 => Laplacian smoothing • Dirichlet-like smoothing (see the sketch below)
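The missing formula is presumably the usual choice that ties the prior to the collection model C; under that assumption the MAP estimate becomes the familiar Dirichlet-smoothed probability:

```latex
\alpha_w = \mu\, P(w \mid C) + 1
\quad\Longrightarrow\quad
P(w \mid \hat{\theta}_X) = \frac{c(w, X) + \mu\, P(w \mid C)}{|X| + \mu}
```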
[Figure: multinomial estimates for X = A B B B, with P(A | C) = 0.45 and P(B | C) = 0.55. Left: ML estimate (α = 1); center: Laplace (α = 2); right: α = μP(w | C) with μ = 10]
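A small Python sketch reproducing the three multinomial estimates for this example (illustrative only; the numbers follow from the formulas above, not from the original figure):

```python
from collections import Counter

X = ["A", "B", "B", "B"]          # observed text
P_C = {"A": 0.45, "B": 0.55}      # collection (background) model P(w | C)
V = sorted(P_C)                   # vocabulary
counts = Counter(X)
n = len(X)                        # document length |X|

def ml(w):                        # alpha = 1: maximum likelihood
    return counts[w] / n

def laplace(w):                   # alpha = 2: add-one (Laplace) smoothing
    return (counts[w] + 1) / (n + len(V))

def dirichlet(w, mu=10):          # alpha_w = mu * P(w|C) + 1
    return (counts[w] + mu * P_C[w]) / (n + mu)

for w in V:
    print(w, ml(w), laplace(w), dirichlet(w))
```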
Multiple-Bernoulli • Assume vocabulary V = A B C D • How do we model text X = D B B D? • In multinomial, we represent X as the sequence D B B D • In multiple-Bernoulli we represent X as the vector [0 1 0 1] denoting terms B and D occur in X • Each X represented by single binary vector
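A minimal Python illustration of this representation, assuming the fixed vocabulary ordering A, B, C, D:

```python
V = ["A", "B", "C", "D"]
X = ["D", "B", "B", "D"]

# Model A: one binary vector per document; entry is 1 if the term occurs at all
binary_vector = [1 if v in X else 0 for v in V]
print(binary_vector)  # [0, 1, 0, 1]
```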
Multiple-Bernoulli(Model A) • Modeling assumptions: • Each X is a single sample from a multiple-Bernoulli distribution parameterized by θ • Use conjugate prior (multiple-Beta)
Problems with Model A • Ignores document length • This may be desirable in some applications • Ignores term frequencies • How to solve this? • Model X as a collection of samples (one per word occurrence) from an underlying multiple-Bernoulli distribution • Example: V = A B C D, X = B D D B • Representation: { [0 1 0 0], [0 0 0 1], [0 0 0 1], [0 1 0 0] }
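The same representation can be written down programmatically, one indicator vector per word occurrence (a sketch of the example above):

```python
V = ["A", "B", "C", "D"]
X = ["B", "D", "D", "B"]

# Model B: one indicator vector per word occurrence, so document length
# and term frequencies are preserved
vectors = [[1 if v == w else 0 for v in V] for w in X]
print(vectors)  # [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0]]
```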
Multiple-Bernoulli(Model B) • Modeling assumptions: • Each X is a collection (multiset) of indicator vectors sampled from a multiple-Bernoulli distribution parameterized by θ • Use conjugate prior (multiple-Beta)
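With a Beta(α_w, β_w) prior on each term, conjugacy again gives a closed-form MAP estimate; stated as a sketch (the slide's equation is not shown), with c(w, X) the term frequency and |X| the number of word occurrences:

```latex
P(w \mid \hat{\theta}_X) = \frac{c(w, X) + \alpha_w - 1}{|X| + \alpha_w + \beta_w - 2}
```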
How do we set α, β? • α = β = 1 => uniform prior => ML estimate • But we want smoothed probabilities… • One possibility (sketched below)
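One plausible reading of the missing "one possibility", chosen here so that the result matches the multinomial's Dirichlet-like smoothing, is to tie both Beta parameters to the collection model:

```latex
\alpha_w = \mu\, P(w \mid C) + 1, \qquad
\beta_w  = \mu\,\bigl(1 - P(w \mid C)\bigr) + 1
\quad\Longrightarrow\quad
P(w \mid \hat{\theta}_X) = \frac{c(w, X) + \mu\, P(w \mid C)}{|X| + \mu}
```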
[Figure: Multiple-Bernoulli Model B estimates for X = A B B B, with P(A | C) = 0.45 and P(B | C) = 0.55. Left: ML estimate (α = β = 1); center: smoothed (μ = 1); right: smoothed (μ = 10)]
Another approach… • Another way to formally estimate language models is via the expectation over the posterior (sketched below) • Takes more uncertainty into account than the MAP estimate • Because we chose conjugate priors, the integral can be evaluated analytically
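For the Dirichlet-multinomial case the posterior expectation also has a closed form; the standard result (a sketch, not the slide's own derivation) is:

```latex
P(w \mid X) = \mathbb{E}\bigl[\theta_w \mid X\bigr]
            = \int \theta_w \, p(\theta \mid X)\, d\theta
            = \frac{c(w, X) + \alpha_w}{|X| + \sum_{v \in V} \alpha_v}
```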
Multinomial / Multiple-Bernoulli Connection • Multinomial • Multiple-Bernoulli • Dirichlet smoothing
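The slide's equations are not reproduced here, but one reading of the connection is that, under posterior expectation with the prior tied to the collection model (α_w = μ P(w | C) for the multinomial, and the analogous Beta parameters for Model B), both models yield the same Dirichlet-smoothed estimate:

```latex
P(w \mid X) = \frac{c(w, X) + \mu\, P(w \mid C)}{|X| + \mu}
```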
Bayesian Framework(Ranking) • Query likelihood • estimate model θD for each document D • score document D by P(Q | θD) • measures likelihood of observing query Q given model θD • KL-divergence • estimate model for both query and document • score document D by KL(θQ || θD) • measures “distance” between two models • Predictive density
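A minimal Python sketch of the query-likelihood option with Dirichlet-smoothed document models (an illustration under the assumptions above, not the talk's actual implementation; the helper names are hypothetical):

```python
import math
from collections import Counter

def collection_model(docs):
    """Background term probabilities P(w | C) estimated from the whole collection."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, doc, p_c, mu=1000):
    """log P(Q | theta_D) with Dirichlet-smoothed term estimates."""
    tf = Counter(doc)
    n = len(doc)
    score = 0.0
    for w in query:
        p = (tf[w] + mu * p_c.get(w, 0.0)) / (n + mu)
        if p > 0:
            score += math.log(p)
    return score

docs = [["a", "b", "b", "b"], ["a", "a", "c", "d"]]
p_c = collection_model(docs)
query = ["a", "b"]

# Rank documents by query likelihood (higher log-probability first)
ranked = sorted(range(len(docs)),
                key=lambda i: query_likelihood(query, docs[i], p_c, mu=10),
                reverse=True)
print(ranked)
```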
Conclusions • Both estimation and smoothing can be achieved using Bayesian estimation techniques • Little difference between MAP and posterior expectation estimates; the result mostly depends on μ • Not much difference between multinomial and multiple-Bernoulli language models • Scoring the multinomial is cheaper • No good reason to choose multiple-Bernoulli over multinomial in general