Corpora and Statistical Methods – Lecture 7
Albert Gatt

Part 2: Smoothing (aka discounting) techniques
Overview
• Smoothing methods:
  • Simple smoothing
  • Witten-Bell & Good-Turing estimation
  • Held-out estimation and cross-validation
• Combining several n-gram models:
  • back-off models
Rationale behind smoothing
• Sample frequencies:
  • seen events have probability P
  • unseen events (including "grammatical" zeroes) have probability 0
• Real population frequencies:
  • seen events and the events unseen in our sample both have non-zero probability
• Smoothing approximates the real population frequencies. The result:
  • lower probabilities for seen events (discounting)
  • the left-over probability mass distributed over unseens (smoothing)
Instances in the training corpus: "inferior to ________"
[Figure: observed counts F(w) for the words following "inferior to"]
Maximum Likelihood Estimate
[Figure: MLE distribution over F(w); unknown words are assigned 0% probability mass]
Actual Probability Distribution
[Figure: the real distribution over F(w); the unseen words have non-zero probabilities]
Laplace's Law
[Figure: add-one smoothed distribution over F(w)]
NB: this method ends up assigning most of the probability mass to unseens.
Generalisation: Lidstone's Law

$$P(x) = \frac{C(x) + \lambda}{N + \lambda V}$$

• P = probability of a specific n-gram
• C(x) = count of n-gram x in the training data
• N = total number of n-grams in the training data
• V = number of "bins" (possible n-grams)
• λ = a small positive number

M.L.E.: λ = 0
Laplace's Law: λ = 1 (add-one smoothing)
Jeffreys-Perks Law: λ = ½
Jeffreys-Perks Law
[Figure: smoothed distribution over F(w) with λ = ½]
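To make the three special cases concrete, here is a minimal Python sketch of Lidstone smoothing; the function name, the toy counts, and the vocabulary size are invented for illustration:

```python
from collections import Counter

def lidstone(counts, vocab_size, lam):
    """Lidstone-smoothed probabilities: P(x) = (C(x) + lam) / (N + lam*V).

    lam = 0   -> maximum likelihood estimate
    lam = 1   -> Laplace's law (add-one smoothing)
    lam = 0.5 -> Jeffreys-Perks law
    """
    n = sum(counts.values())        # N: total n-gram tokens in training data
    denom = n + lam * vocab_size    # N + lambda * V
    seen = {x: (c + lam) / denom for x, c in counts.items()}
    unseen = lam / denom            # probability of any single unseen n-gram
    return seen, unseen

# Toy counts for words observed after "inferior to" (invented for illustration)
counts = Counter({"others": 5, "expectations": 3, "forecasts": 1})
seen, unseen = lidstone(counts, vocab_size=10000, lam=0.5)
print(seen["others"], unseen)
```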
Objections to Lidstone's Law
• We need an a priori way to determine λ.
• It predicts all unseen events to be equally likely.
• It gives probability estimates that are linear in the M.L.E. frequency.
Main intuition
• A zero-frequency event can be thought of as an event which hasn't happened (yet).
• The probability of it happening can be estimated from the probability of something happening for the first time.
• The count of things which are seen only once can be used to estimate the count of things that are never seen.
Witten-Bell method
• T = the number of times we saw an event for the first time = the number of different n-gram types (bins) actually attested
  (NB: T is the number of types actually attested, unlike V, the number of possible types in add-one smoothing)
• Estimate the total probability mass of unseen n-grams:

$$\sum_{i : c_i = 0} p_i^* = \frac{T}{N + T}$$

where N is the number of actual n-gram tokens and T the number of actual types.
• Each token is an event, and each new type is an event, so the equation above gives the MLE of the probability of a new-type event occurring ("being seen for the first time").
• This is the total probability mass to be distributed among all zero events (unseens).
Witten-Bell method
• Divide the total probability mass among all the zero n-grams. We can distribute it equally: if Z is the number of zero-count bins,

$$p_i^* = \frac{T}{Z(N + T)} \quad \text{if } c_i = 0$$

• Remove this probability mass from the non-zero n-grams (discounting):

$$p_i^* = \frac{c_i}{N + T} \quad \text{if } c_i > 0$$
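A minimal Python sketch of the unigram case under the two formulas above; the function and variable names are invented, and `num_bins` plays the role of V, the total number of possible types:

```python
from collections import Counter

def witten_bell(counts, num_bins):
    """Witten-Bell smoothing for unigrams.

    T = number of types seen at least once, N = number of tokens,
    Z = number of zero-count bins. The unseen mass T / (N + T) is
    shared equally among the Z unseen bins; seen counts are
    discounted from c / N to c / (N + T).
    """
    t = len(counts)                  # T: attested types
    n = sum(counts.values())         # N: tokens
    z = num_bins - t                 # Z: unseen bins
    seen = {w: c / (n + t) for w, c in counts.items()}
    unseen = t / (z * (n + t)) if z else 0.0    # per unseen bin
    return seen, unseen

counts = Counter({"the": 6, "cat": 2, "sat": 1})          # toy data
seen, unseen = witten_bell(counts, num_bins=10)
assert abs(sum(seen.values()) + 7 * unseen - 1.0) < 1e-9  # mass sums to 1
```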
Witten-Bell vs. Add-one
• If we work with unigrams, Witten-Bell and add-one smoothing give very similar results.
• The difference shows with n-grams for n > 1.
• Main idea: estimate the probability of an unseen bigram <w1, w2> from the probability of seeing a bigram starting with w1 for the first time.
Witten-Bell with bigrams
• Generalised total probability mass estimate: the estimated total probability of unseen bigrams starting with some word w_x is

$$\sum_{i : c(w_x w_i) = 0} p^*(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}$$

where T(w_x) is the number of bigram types beginning with w_x, and N(w_x) is the number of bigram tokens beginning with w_x.
Witten-Bell with bigrams
• Non-zero bigrams get discounted as before, but again conditioning on the history:

$$p^*(w_i \mid w_x) = \frac{c(w_x w_i)}{N(w_x) + T(w_x)} \quad \text{if } c(w_x w_i) > 0$$

• Note: Witten-Bell won't assign the same probability mass to all unseen n-grams.
• The amount assigned depends on the first word of the bigram (the first n-1 words of the n-gram).
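A minimal sketch of the conditional version, assuming bigram counts keyed by (w1, w2) pairs; the data layout and names are illustrative, not from the lecture:

```python
from collections import Counter, defaultdict

def witten_bell_bigram(bigram_counts, vocab):
    """Conditional Witten-Bell for bigrams.

    For each history w_x:
      T(w_x) = distinct continuations seen after w_x
      N(w_x) = bigram tokens starting with w_x
      Z(w_x) = |vocab| - T(w_x), the words never seen after w_x
    """
    by_history = defaultdict(Counter)
    for (w1, w2), c in bigram_counts.items():
        by_history[w1][w2] += c

    def prob(w_i, w_x):
        conts = by_history[w_x]
        t, n = len(conts), sum(conts.values())
        if t == 0:
            return 1.0 / len(vocab)       # unknown history: uniform fallback
        if w_i in conts:
            return conts[w_i] / (n + t)   # discounted seen bigram
        z = len(vocab) - t
        return t / (z * (n + t))          # this history's share of unseen mass

    return prob

vocab = {"the", "cat", "sat", "on", "mat"}
p = witten_bell_bigram(Counter({("the", "cat"): 3, ("the", "mat"): 1}), vocab)
print(p("cat", "the"), p("sat", "the"))   # seen vs. unseen continuation
```

Note how the unseen mass T(w_x) / (N(w_x) + T(w_x)) differs per history, which is exactly why Witten-Bell does not treat all unseen bigrams alike.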