
Corpora and Statistical Methods – Lecture 7

Corpora and Statistical Methods – Lecture 7. Albert Gatt. Part 2: Smoothing (aka discounting) techniques. Overview: smoothing methods (simple smoothing; Witten-Bell & Good-Turing estimation; held-out estimation and cross-validation) and combining several n-gram models (back-off models).


Presentation Transcript


  1. Corpora and Statistical Methods – Lecture 7 Albert Gatt

  2. Part 2 Smoothing (aka discounting) techniques

  3. Overview… • Smoothing methods: • Simple smoothing • Witten-Bell & Good-Turing estimation • Held-out estimation and cross-validation • Combining several n-gram models: • back-off models

  4. Rationale behind smoothing • Sample frequencies: • seen events with probability P • unseen events (including “grammatical” zeroes) with probability 0 • Real population frequencies: • seen events (including events unseen in our sample) • Sample frequencies + smoothing approximate the real population frequencies, which results in: • lower probabilities for seen events (discounting) • left-over probability mass distributed over unseen events (smoothing)

  5. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law

  6. Instances in the Training Corpus: “inferior to ________” [Chart: F(w)]

  7. Maximum Likelihood Estimate • Unknowns are assigned 0% probability mass [Chart: F(w)]

  8. Actual Probability Distribution • These are non-zero probabilities in the real distribution [Chart: F(w)]

  9. Laplace’s Law (Add-one smoothing) [Chart: F(w)]

  10. Laplace’s Law (Add-one smoothing) [Chart: F(w)]

  11. Laplace’s Law • NB: this method ends up assigning most of the probability mass to unseen events [Chart: F(w)]

  12. Generalisation: Lidstone’s Law • $P_{Lid}(x) = \frac{C(x) + \lambda}{N + \lambda V}$ • P = probability of specific n-gram x • C(x) = count of n-gram x in training data • N = total n-grams in training data • V = number of “bins” (possible n-grams) • $\lambda$ = small positive number • M.L.E.: $\lambda = 0$ • Laplace’s Law: $\lambda = 1$ (add-one smoothing) • Jeffreys-Perks Law: $\lambda = \frac{1}{2}$
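As a minimal sketch of Lidstone’s Law, the function below computes (C(x) + λ) / (N + λV) and shows the three special cases. The name `lidstone_prob`, the toy counts and the value V = 1000 are illustrative assumptions, not taken from the lecture.

```python
from collections import Counter


def lidstone_prob(ngram, counts, vocab_size, lam=1.0):
    """Lidstone estimate: (C(x) + lam) / (N + lam * V)."""
    n_tokens = sum(counts.values())  # N: total n-grams in the training data
    return (counts.get(ngram, 0) + lam) / (n_tokens + lam * vocab_size)


# Toy data (hypothetical): 10 observed unigram tokens, V = 1000 possible bins.
toy_counts = Counter({"the": 5, "cat": 3, "sat": 2})
V = 1000

mle = lidstone_prob("the", toy_counts, V, lam=0.0)       # lam = 0: M.L.E.
laplace = lidstone_prob("the", toy_counts, V, lam=1.0)   # lam = 1: add-one
jeffreys = lidstone_prob("the", toy_counts, V, lam=0.5)  # lam = 1/2

# Total mass that add-one smoothing hands to the V - 3 unseen bins.
unseen_bins = V - len(toy_counts)
unseen_mass_laplace = unseen_bins * lidstone_prob("dog", toy_counts, V, lam=1.0)
```

With a large V and a small sample, `unseen_mass_laplace` comes out close to 1, which is the sense in which add-one smoothing gives most of the probability mass to unseen events.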

  13. Jeffreys-Perks Law [Chart: F(w)]

  14. Objections to Lidstone’s Law • Need an a priori way to determine $\lambda$. • Predicts all unseen events to be equally likely • Gives probability estimates linear in the M.L.E. frequency
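The third objection can be checked numerically. Rearranging (C(x) + λ) / (N + λV) gives μ · C(x)/N + (1 − μ) · 1/V with μ = N / (N + λV), i.e. a linear interpolation between the M.L.E. and a uniform distribution over the V bins. The sketch below compares the two forms; the function names and the values of N, V and λ are hypothetical.

```python
def lidstone(count, n_tokens, vocab_size, lam):
    """Direct form: (C + lam) / (N + lam * V)."""
    return (count + lam) / (n_tokens + lam * vocab_size)


def interpolated(count, n_tokens, vocab_size, lam):
    """Same estimate written as mu * MLE + (1 - mu) * uniform."""
    mu = n_tokens / (n_tokens + lam * vocab_size)
    return mu * (count / n_tokens) + (1 - mu) * (1 / vocab_size)


# Hypothetical values chosen only for illustration.
N, V, LAM = 10, 1000, 0.5

# Absolute difference between the two forms for counts 0..5.
gaps = [abs(lidstone(c, N, V, LAM) - interpolated(c, N, V, LAM)) for c in range(6)]
```

Every entry of `gaps` is zero up to floating-point error, so the smoothed estimate is a straight-line function of the M.L.E. frequency C(x)/N.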

  15. Witten-Bell discounting

  16. Main intuition • A zero-frequency event can be thought of as an event which hasn’t happened (yet). • The probability of it happening can be estimated from the probability of something happening for the first time. • The count of things which are seen only once can be used to estimate the count of things that are never seen.

  17. Witten-Bell method • T = no. of times we saw an event for the first time = no. of different n-gram types (bins) • NB: T is the no. of types actually attested (unlike V, the no. of possible types in add-one smoothing) • Estimate total probability mass of unseen n-grams: $\sum_{x: C(x)=0} P_{WB}(x) = \frac{T}{N + T}$, where the denominator is the no. of actual n-grams (N) + the no. of actual types (T) • each token is an event & each new type is an event • so the above equation gives the MLE of the probability of a new type event occurring (“being seen for the first time”) • This is the total probability mass to be distributed among all zero events (unseens)

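  18. Witten-Bell method • Divide the total probability mass among all the zero n-grams. Can distribute it equally: $P_{WB}(x) = \frac{T}{Z(N + T)}$ if $C(x) = 0$, where Z = no. of n-grams with zero count • Remove this probability mass from the non-zero n-grams (discounting): $P_{WB}(x) = \frac{C(x)}{N + T}$ if $C(x) > 0$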
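The two steps above can be sketched for unigrams as follows. This is only an illustration under stated assumptions: the number of possible types V is taken as known, so that the number of zero-count types is Z = V − T, and the unseen mass T / (N + T) is shared equally. The function name and the toy vocabulary are hypothetical.

```python
from collections import Counter


def witten_bell_unigram(counts, vocab_size):
    """Return (prob_fn, unseen_mass) for Witten-Bell unigram smoothing."""
    n_tokens = sum(counts.values())                     # N: tokens seen
    n_types = sum(1 for c in counts.values() if c > 0)  # T: types seen
    n_zero = vocab_size - n_types                       # Z: assumes V is known
    unseen_mass = n_types / (n_tokens + n_types)        # T / (N + T)

    def prob(word):
        c = counts.get(word, 0)
        if c > 0:
            return c / (n_tokens + n_types)  # discounted seen estimate
        return unseen_mass / n_zero          # equal share (requires Z > 0)

    return prob, unseen_mass


# Toy data (hypothetical): 3 seen types, 7 unseen types, V = 10.
toy_counts = Counter({"the": 5, "cat": 3, "sat": 2})
vocab = ["the", "cat", "sat"] + ["w%d" % i for i in range(7)]
wb_prob, wb_unseen = witten_bell_unigram(toy_counts, len(vocab))
total = sum(wb_prob(w) for w in vocab)
```

Here N = 10 and T = 3, so the seen types keep 10/13 of the mass and the seven unseen types share the remaining 3/13, and `total` sums to 1.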

  19. Witten-Bell vs. Add-one • If we work with unigrams, Witten-Bell and Add-one smoothing give very similar results. • The difference is with n-grams for n>1. • Main idea: estimate probability of an unseen bigram <w1,w2> from the probability of seeing a bigram starting with w1 for the first time.

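  20. Witten-Bell with bigrams • Generalised total probability mass estimate: $\sum_{i: C(w_x w_i)=0} P_{WB}(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}$ • $T(w_x)$ = no. of bigram types beginning with $w_x$ • $N(w_x)$ = no. of bigram tokens beginning with $w_x$ • The left-hand side is the estimated total probability of unseen bigrams starting with some word $w_x$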

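  21. Witten-Bell with bigrams • Non-zero bigrams get discounted as before, but again conditioning on history: $P_{WB}(w_i \mid w_x) = \frac{C(w_x w_i)}{N(w_x) + T(w_x)}$ if $C(w_x w_i) > 0$, with $N(w_x)$ and $T(w_x)$ the no. of bigram tokens and types beginning with $w_x$ • Note: Witten-Bell won’t assign the same probability mass to all unseen n-grams. • The amount assigned will depend on the first word in the bigram (first n-1 words in the n-gram).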
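A rough sketch of the bigram case, conditioning every quantity on the history word. Two assumptions are made here that the slides do not spell out: the unseen mass for each history is divided equally among the Z(w1) = V − T(w1) unseen continuations (carried over from the unigram case), and histories that never occur in training are simply rejected. All names and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def witten_bell_bigram(bigram_counts, vocab):
    """Build P(w2 | w1) under Witten-Bell, conditioning on the history w1."""
    tokens = Counter()            # N(w1): bigram tokens beginning with w1
    followers = defaultdict(set)  # distinct words seen after w1; size is T(w1)
    for (w1, w2), c in bigram_counts.items():
        if c > 0:
            tokens[w1] += c
            followers[w1].add(w2)

    def prob(w2, w1):
        n = tokens[w1]
        t = len(followers.get(w1, ()))
        if n == 0:
            # Unseen histories are outside what this sketch covers.
            raise ValueError("history %r was never seen" % (w1,))
        c = bigram_counts.get((w1, w2), 0)
        if c > 0:
            return c / (n + t)         # discounted seen bigram
        z = len(vocab) - t             # Z(w1); assumes w2 is in vocab
        return t / (z * (n + t))       # equal share of the unseen mass

    return prob


# Toy data (hypothetical).
vocab = ["a", "b", "c", "d"]
toy_bigrams = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 2})
p = witten_bell_bigram(toy_bigrams, vocab)
total_after_a = sum(p(w, "a") for w in vocab)
total_after_b = sum(p(w, "b") for w in vocab)
```

In this toy example an unseen word after "a" gets 1/6 while an unseen word after "b" gets 1/9, so the mass given to unseen bigrams depends on the first word, as the note on the slide says.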
