
Formal Multinomial and Multiple-Bernoulli Language Models


Presentation Transcript


  1. Formal Multinomial and Multiple-Bernoulli Language Models Don Metzler

  2. Overview • Two formal estimation techniques • MAP estimates [Zaragoza, Hiemstra, Tipping, SIGIR’03] • Posterior expectations • Language models considered • Multinomial • Multiple-Bernoulli (2 models)

  3. Bayesian Framework (MAP Estimation) • Assume textual data X (document, query, etc.) is generated by sampling from some distribution P(X | θ) parameterized by θ • Assume some prior over θ • For each X, we want to find the maximum a posteriori (MAP) estimate (see the sketch below) • θX is our (language) model for data X
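The MAP equation on this slide was an image and is not in the transcript; as a sketch, the standard objective it refers to is

    \hat{\theta}_X = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)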

  4. Multinomial • Modeling assumptions: • Why Dirichlet? • Conjugate prior to multinomial • Easy to work with
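The equations for these assumptions were slide images; in the usual multinomial/Dirichlet setup they would read

    P(X \mid \theta) \propto \prod_{w \in V} \theta_w^{tf_{w,X}}, \qquad P(\theta) = \mathrm{Dirichlet}(\theta; \alpha) \propto \prod_{w \in V} \theta_w^{\alpha_w - 1}

where tf_{w,X} is the number of times term w occurs in X and V is the vocabulary.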

  5. Multinomial
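The derivation on this slide did not survive the transcript; combining the likelihood and conjugate prior above, the posterior is Dirichlet with parameters tf_{w,X} + α_w, so the MAP estimate would be

    \hat{\theta}_w = \frac{tf_{w,X} + \alpha_w - 1}{|X| + \sum_{w'} (\alpha_{w'} - 1)}

which is the form the next slide's choices of α plug into.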

  6. How do we set α? • α = 1 => uniform prior => ML estimate • α = 2 => Laplace smoothing • Dirichlet-like smoothing (see below):
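The Dirichlet-like smoothing formula itself is missing; presumably the choice is α_w = μ P(w | C) + 1 (or α_w = μ P(w | C) if the prior convention absorbs the −1), which turns the MAP estimate above into the familiar Dirichlet-smoothed form

    \hat{\theta}_w = \frac{tf_{w,X} + \mu P(w \mid C)}{|X| + \mu}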

  7. [Figure: multinomial estimates for X = A B B B with P(A | C) = 0.45, P(B | C) = 0.55. Left: ML estimate (α = 1); center: Laplace (α = 2); right: α = μP(w | C) with μ = 10.]
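As a concrete check, here is a small Python sketch that computes the three estimates for this slide's example, assuming the MAP formula above and the α_w = μ P(w | C) + 1 convention (the exact convention is an assumption, not stated on the slide):

    # MAP estimates for X = A B B B over V = {A, B}, with the slide's
    # background probabilities P(A|C) = 0.45, P(B|C) = 0.55.
    from collections import Counter

    X = ["A", "B", "B", "B"]
    coll = {"A": 0.45, "B": 0.55}            # background model P(w | C)
    tf, n, V = Counter(X), len(X), list(coll)

    def map_estimate(alpha):
        # MAP of the Dirichlet posterior: (tf_w + alpha_w - 1) / (|X| + sum(alpha) - |V|)
        denom = n + sum(alpha.values()) - len(V)
        return {w: (tf[w] + alpha[w] - 1) / denom for w in V}

    mu = 10
    print(map_estimate({w: 1 for w in V}))                 # ML:       A ~ 0.25,  B ~ 0.75
    print(map_estimate({w: 2 for w in V}))                 # Laplace:  A ~ 0.333, B ~ 0.667
    print(map_estimate({w: mu * coll[w] + 1 for w in V}))  # smoothed: A ~ 0.393, B ~ 0.607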

  8. Multiple-Bernoulli • Assume vocabulary V = A B C D • How do we model text X = D B B D? • In the multinomial model, we represent X as the sequence D B B D • In multiple-Bernoulli, we represent X as the vector [0 1 0 1], denoting that terms B and D occur in X • Each X is represented by a single binary vector
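A tiny Python illustration of the two representations just described (variable names are illustrative only):

    V = ["A", "B", "C", "D"]
    X = ["D", "B", "B", "D"]
    multinomial_repr = X                                 # the sequence D B B D
    multiple_bernoulli_repr = [int(w in X) for w in V]   # [0, 1, 0, 1]: B and D occur
    print(multiple_bernoulli_repr)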

  9. Multiple-Bernoulli (Model A) • Modeling assumptions: • Each X is a single sample from a multiple-Bernoulli distribution parameterized by θ • Use conjugate prior (multiple-Beta)
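These equations were also slide images; with δ_w(X) = 1 if term w occurs in X and 0 otherwise, the standard forms would be

    P(X \mid \theta) = \prod_{w \in V} \theta_w^{\delta_w(X)} (1 - \theta_w)^{1 - \delta_w(X)}, \qquad P(\theta) \propto \prod_{w \in V} \theta_w^{\alpha_w - 1} (1 - \theta_w)^{\beta_w - 1}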

  10. Multiple-Bernoulli (Model A)
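The estimate shown on this slide is missing from the transcript; under Model A each term contributes a single Bernoulli observation, so the MAP of the resulting Beta posterior would be

    \hat{\theta}_w = \frac{\delta_w(X) + \alpha_w - 1}{\alpha_w + \beta_w - 1}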

  11. Problems with Model A • Ignores document length • This may be desirable in some applications • Ignores term frequencies • How to solve this? • Model X as a collection of samples (one per word occurrence) from an underlying multiple-Bernoulli distribution • Example: V = A B C D, X = B D D B • Representation: { [0 1 0 0], [0 0 0 1], [0 0 0 1], [0 1 0 0] }
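A short Python sketch of this per-occurrence representation:

    V = ["A", "B", "C", "D"]
    X = ["B", "D", "D", "B"]
    # one indicator vector per word occurrence
    vectors = [[int(v == w) for v in V] for w in X]
    print(vectors)   # [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0]]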

  12. Multiple-Bernoulli (Model B) • Modeling assumptions: • Each X is a collection (multiset) of indicator vectors sampled from a multiple-Bernoulli distribution parameterized by θ • Use conjugate prior (multiple-Beta)

  13. Multiple-Bernoulli (Model B)
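The equations on this slide were images; under Model B, term w is observed in tf_{w,X} of the |X| indicator vectors, so the likelihood and MAP estimate would presumably be

    P(X \mid \theta) = \prod_{w \in V} \theta_w^{tf_{w,X}} (1 - \theta_w)^{|X| - tf_{w,X}}, \qquad \hat{\theta}_w = \frac{tf_{w,X} + \alpha_w - 1}{|X| + \alpha_w + \beta_w - 2}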

  14. How do we set α, β? • α = β = 1 => uniform prior => ML estimate • But we want smoothed probabilities… • One possibility (see below):
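The "one possibility" formula did not survive; a choice consistent with the α = β = 1 case and with the smoothed curves on the next slide would be α_w = μ P(w | C) + 1 and β_w = μ (1 − P(w | C)) + 1, which turns the Model B MAP estimate into

    \hat{\theta}_w = \frac{tf_{w,X} + \mu P(w \mid C)}{|X| + \mu}

mirroring the multinomial case.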

  15. Multiple-Bernoulli (Model B) [Figure: estimates for X = A B B B with P(A | C) = 0.45, P(B | C) = 0.55. Left: ML estimate (α = β = 1); center: smoothed (μ = 1); right: smoothed (μ = 10).]

  16. Another approach… • Another way to formally estimate language models is via: • Expectation over the posterior • Takes more uncertainty into account than the MAP estimate • Because we chose to use conjugate priors, the integral can be evaluated analytically
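The integral itself is missing from the transcript; the posterior expectation referred to is

    \hat{\theta}_X = E[\theta \mid X] = \int \theta \, P(\theta \mid X) \, d\theta

and for the multinomial/Dirichlet pair it evaluates to \hat{\theta}_w = (tf_{w,X} + \alpha_w) / (|X| + \sum_{w'} \alpha_{w'}), i.e. the posterior mean rather than its mode.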

  17. Multinomial / Multiple-Bernoulli Connection • Multinomial • Multiple-Bernoulli • Dirichlet smoothing
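The formulas behind this connection did not survive the transcript; the point is presumably that with α_w = μ P(w | C) for the multinomial, or α_w = μ P(w | C) and β_w = μ (1 − P(w | C)) for multiple-Bernoulli Model B, the posterior-expectation estimates of both models collapse to the same Dirichlet-smoothed form

    \hat{\theta}_w = \frac{tf_{w,X} + \mu P(w \mid C)}{|X| + \mu}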

  18. Bayesian Framework (Ranking) • Query likelihood • estimate model θD for each document D • score document D by P(Q | θD) • measures likelihood of observing query Q given model θD • KL-divergence • estimate a model for both the query and the document • score document D by KL(θQ || θD) • measures “distance” between the two models • Predictive density
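A minimal Python sketch of query-likelihood ranking with a Dirichlet-smoothed multinomial document model (the function name, μ value, and toy data are illustrative assumptions, not from the slides):

    import math
    from collections import Counter

    def query_likelihood(query, doc, coll_prob, mu=1000):
        # log P(Q | theta_D) with Dirichlet-smoothed term probabilities;
        # assumes every query term has a background probability in coll_prob
        tf, dlen = Counter(doc), len(doc)
        return sum(math.log((tf[q] + mu * coll_prob[q]) / (dlen + mu)) for q in query)

    coll_prob = {"formal": 0.01, "language": 0.04, "models": 0.03}
    doc = ["formal", "language", "models", "language"]
    print(query_likelihood(["language", "models"], doc, coll_prob))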

  19. Results

  20. Conclusions • Both estimation and smoothing can be achieved using Bayesian estimation techniques • Little difference between MAP and posterior expectation estimates – mostly depends on μ • Not much difference between multinomial and multiple-Bernoulli language models • Scoring the multinomial is cheaper • No good reason to choose multiple-Bernoulli over multinomial in general
