Natural Language Processing Lecture 7—9/19/2013 Jim Martin
Today • More Language modeling (N-grams) • Smoothing • Finish Good-Turing • Pretty good smoothing • Bayesian prior smoothing • Word classes • Part of speech tagging
Smoothing: Dealing w/ Zero Counts • Back to Shakespeare • Recall that Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams... • So, 99.96% of the possible bigrams were never seen (have zero entries in the table) • Does that mean that any sentence that contains one of those bigrams should have a probability of 0? • For generation (Shannon game) it means we'll never emit those bigrams • But for analysis it's problematic because if we run across a new bigram in the future then we have no choice but to assign it a probability of zero.
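The 99.96% figure follows from the two rounded numbers on the slide. A quick check, using only those figures (the variable names are just for illustration):

```python
# Both figures are the rounded values quoted on the slide.
seen_bigram_types = 300_000
possible_bigrams = 844_000_000  # V^2

# Fraction of the bigram table that is still zero after training.
unseen_fraction = 1 - seen_bigram_types / possible_bigrams
print(f"{unseen_fraction:.2%} of possible bigrams were never seen")
```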
Zero Counts • Some of those zeros are really zeros... • Things that really aren't ever going to happen • Fewer of these than you might think • On the other hand, some of them are just rare events. • If the training corpus had been a little bigger they would have had a count • What would that count be in all likelihood?
Zero Counts • Zipf's Law (long tail phenomenon) • A small number of events occur with high frequency • A large number of events occur with low frequency • You can quickly collect statistics on the high frequency events • You might have to wait an arbitrarily long time to get good statistics on low frequency events • Result • Our estimates are necessarily sparse! We have no counts at all for the vast number of events we want to estimate. • Answer • Estimate the likelihood of unseen (zero count) N-grams!
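Laplace Smoothing • Also called Add-One smoothing • Just add one to all the counts! • Very simple • MLE estimate: $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}$ • Laplace estimate: $P_{Laplace}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$ • Reconstructed counts: $c^*(w_{i-1} w_i) = \frac{(c(w_{i-1} w_i) + 1)\, c(w_{i-1})}{c(w_{i-1}) + V}$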
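A minimal sketch of add-one smoothing for bigrams, assuming whitespace tokenization and sentence-boundary markers `<s>` and `</s>`. The toy corpus and all function names are made up for illustration; this is not the BERP data.

```python
from collections import Counter

def bigram_counts(sentences):
    """Return (unigram Counter, bigram Counter) for whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def p_mle(w_prev, w, unigrams, bigrams):
    """Unsmoothed estimate: c(w_prev w) / c(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_laplace(w_prev, w, unigrams, bigrams):
    """Add-one estimate: (c(w_prev w) + 1) / (c(w_prev) + V)."""
    V = len(unigrams)  # vocabulary size, including the boundary markers
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

# Hypothetical toy corpus.
toy = ["i want chinese food", "i want to eat", "i want to spend"]
uni, bi = bigram_counts(toy)

print(p_mle("want", "to", uni, bi), p_laplace("want", "to", uni, bi))
print(p_mle("want", "eat", uni, bi), p_laplace("want", "eat", uni, bi))
```

The unseen bigram "want eat" gets probability 0 under MLE but a small non-zero probability after smoothing, and the probability of the seen bigram "want to" drops to pay for it.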
BERP Bigram Counts
Laplace-Smoothed Bigram Counts
Laplace-Smoothed Bigram Probabilities
Reconstituted Counts
Reconstituted Counts (2)
Big Change to the Counts! • C(want to) went from 608 to 238! • P(to|want) from .66 to .26! • Discount d = c*/c • d for "chinese food" = .10!!! A 10x reduction • So in general, Laplace is a blunt instrument • Could use a more fine-grained method (add-k) • But Laplace smoothing is not generally used for N-grams, as we have much better methods • Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially • For pilot studies • In document classification • Information retrieval • In domains where the number of zeros isn't so huge.
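The figures on this slide can be reproduced with a few lines of arithmetic. Only C(want to) = 608 is stated on the slide. The other inputs below (c(want) = 927, c(chinese) = 158, c(chinese food) = 82, V = 1446) are assumptions taken from the textbook version of the BERP example, so check them against the count tables before relying on them.

```python
def reconstituted_count(c_bigram, c_context, vocab_size):
    """Laplace-adjusted count: c* = (c + 1) * c(context) / (c(context) + V)."""
    return (c_bigram + 1) * c_context / (c_context + vocab_size)

C_WANT_TO = 608               # from the slide
V = 1446                      # ASSUMED vocabulary size
C_WANT, C_CHINESE = 927, 158  # ASSUMED unigram counts
C_CHINESE_FOOD = 82           # ASSUMED bigram count

c_star_want_to = reconstituted_count(C_WANT_TO, C_WANT, V)
p_mle = C_WANT_TO / C_WANT
p_laplace = (C_WANT_TO + 1) / (C_WANT + V)

# Discount d = c*/c for "chinese food".
discount_chinese_food = reconstituted_count(C_CHINESE_FOOD, C_CHINESE, V) / C_CHINESE_FOOD

print(round(c_star_want_to), round(p_mle, 2), round(p_laplace, 2),
      round(discount_chinese_food, 2))
```

Under these assumed inputs the rounded results agree with the slide's 238, .66, .26 and .10.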
Fun with Unix • Thanks to Ken Church • Unix for Poets
Better Smoothing • Intuition used by many smoothing algorithms • Good-Turing • Kneser-Ney • Witten-Bell • Use the count of things we've seen once to help estimate the count of things we've never seen
One Fish Two Fish • Imagine you are fishing • There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass • Not sure where this fishing hole is... • You have caught up to now • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish • How likely is it that the next fish to be caught is an eel? • How likely is it that the next fish caught will be a member of a newly seen species? • Now how likely is it that the next fish caught will be an eel? • Slide adapted from Josh Goodman
Good-Turing • Notation: N_x is the frequency-of-frequency-x • So N_10 = 1 • Number of fish species seen 10 times is 1 (carp) • N_1 = 3 • Number of fish species seen once is 3 (trout, salmon, eel) • To estimate the probability of an unseen species • Use number of species (words) we've seen once • $c_0^* = c_1$ • $p_0 = N_1/N = 3/18$ • All other estimates are adjusted downward to account for unseen probabilities • $c^*(\text{eel}) = c^*(1) = (1+1)\,\frac{N_2}{N_1} = 2 \cdot \frac{1}{3} = \frac{2}{3}$ • Slide from Josh Goodman
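The fishing example can be worked through in code. This sketch uses the catch from the slide and the re-estimate c* = (c+1) N_{c+1} / N_c. The function and variable names are illustrative, and exact fractions are used so the results match the hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Catch so far, from the slide (catfish and bass are still unseen).
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

N = sum(catch.values())       # 18 fish in total
Nc = Counter(catch.values())  # frequency of frequencies: Nc[1] == 3, Nc[10] == 1

def gt_count(c):
    """Good-Turing re-estimate c* = (c + 1) * N_{c+1} / N_c (None if N_c is 0)."""
    if Nc[c] == 0:
        return None
    return Fraction((c + 1) * Nc[c + 1], Nc[c])

p_mle_eel = Fraction(catch["eel"], N)  # 1/18, before smoothing
p_unseen = Fraction(Nc[1], N)          # mass reserved for new species: 3/18
c_star_eel = gt_count(1)               # 2/3
p_gt_eel = c_star_eel / N              # 1/27, adjusted downward

print(p_mle_eel, p_unseen, c_star_eel, p_gt_eel)
```

Calling `gt_count(3)` on this data returns 0 because N_4 is 0, which shows why the raw N_c values need smoothing before they can be used for larger counts.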
Bigram Frequencies of Frequencies and GT Re-estimates
Bigram Frequencies of Frequencies and GT Re-estimates • 3* = 4 × (381/642) = 4 × .593 = 2.37
GT Smoothed Bigram Probabilities
GT Complications • In practice, assume large counts (c > k for some k) are reliable: $c^* = c$ for $c > k$ • Also, need all the N_k to be non-zero, so we need to smooth (interpolate) the N_k counts before computing c* from them
Pretty Good Smoothing • Maximum Likelihood Estimation • Laplace Smoothing • Bayesian prior Smoothing
Pretty Good Smoothing • Bayesian prior smoothing • Why is there a 1 here?
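The equation for this slide did not survive in the text. One standard form of Bayesian (unigram) prior smoothing, assumed here to be the one intended, adds the unigram probability of the word, rather than 1, to each bigram count:

```latex
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + P(w_i)}{c(w_{i-1}) + 1}
\qquad\text{since}\qquad
\sum_{w_i} \bigl[ c(w_{i-1} w_i) + P(w_i) \bigr] = c(w_{i-1}) + 1
```

Under this assumption, the 1 in the denominator is the total mass added across the vocabulary, because the unigram probabilities sum to 1. Laplace smoothing adds 1 for each of the V word types, which is why its denominator gains V instead.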
Toolkits • With FSAs/FSTs... • Openfst.org • For language modeling • SRILM • SRI Language Modeling Toolkit • All the bells and whistles you can imagine
Word Classes: Parts of Speech • 8 (ish) traditional parts of speech • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. • Also known as • parts-of-speech, lexical categories, word classes, morphological classes, lexical tags... • Lots of debate within the linguistics and cognitive science communities about the number, nature, and universality of these • We'll completely ignore this debate
POS examples • N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adjective purple, tall, ridiculous • ADV adverb unfortunately, slowly • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those
POS Tagging • The process of assigning a part-of-speech marker to each word in some text. • WORD/tag: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
Why POS Tagging is Useful • First step of a vast number of practical tasks • Speech synthesis • How to pronounce "lead"? • INsult inSULT • OBject obJECT • OVERflow overFLOW • DIScount disCOUNT • CONtent conTENT • Parsing • Helpful to know parts of speech before you start parsing • Analogy to lex/yacc (flex/bison) • Information extraction • Finding names, relations, etc. • Machine Translation
Open and Closed Classes • Closed class: a small(ish) fixed membership • Usually function words (short common words which play a role in grammar) • Open class: new ones can be created all the time • English has 4: Nouns, Verbs, Adjectives, Adverbs • Many languages have these 4, but not all! • Nouns are typically where the bulk of the action is with respect to new items
Open Class Words • Nouns • Proper nouns (Boulder, Granby, Beyoncé, Cairo) • English capitalizes these • Common nouns (the rest) • Count nouns and mass nouns • Count: have plurals, get counted: goat/goats, one goat, two goats • Mass: don't get counted (snow, salt, communism) (*two snows) • Adverbs: tend to modify things • Unfortunately, John walked home extremely slowly yesterday • Directional/locative adverbs (here, home, downhill) • Degree adverbs (extremely, very, somewhat) • Manner adverbs (slowly, slinkily, delicately) • Verbs • In English, have morphological affixes (eat/eats/eaten) • With differing patterns of regularity
Closed Class Words Examples: • prepositions: on, under, over, … • particles: up, down, on, off, … • determiners: a, an, the, … • pronouns: she, who, I, … • conjunctions: and, but, or, … • auxiliary verbs: can, may, should, … • numerals: one, two, three, third, …
Prepositions from CELEX
English Particles
Conjunctions
POS Tagging: Choosing a Tagset • There are many potential distinctions we can draw, leading to potentially large tagsets • To do POS tagging, we need to choose a standard set of tags to work with • Could pick very coarse tagsets • N, V, Adj, Adv. • A more commonly used set is the finer-grained "Penn TreeBank tagset", 45 tags • PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist
Penn TreeBank POS Tagset
POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.
How Hard is POS Tagging? Measuring Ambiguity
Two Methods for POS Tagging • Rule-based tagging • See the text • Stochastic • Probabilistic sequence models • HMM (Hidden Markov Model) tagging • MEMMs (Maximum Entropy Markov Models)
POS Tagging as Sequence Classification • We are given a sentence (an "observation" or "sequence of observations") • Secretariat is expected to race tomorrow • What is the best sequence of tags that corresponds to this sequence of observations? • Probabilistic view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Getting to HMMs • We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest: $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$ • Hat ^ means "our estimate of the best one" • argmax_x f(x) means "the x such that f(x) is maximized"
Getting to HMMs • This equation is guaranteed to give us the best tag sequence • But how to make it operational? How to compute this value? • Intuition of Bayesian inference • Use Bayes rule to transform this equation into a set of other probabilities that are easier to compute
Using Bayes Rule
Likelihood and Prior
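The equations for the "Using Bayes Rule" and "Likelihood and Prior" slides are missing from the text. The usual derivation for HMM tagging, given here as a reconstruction and not as the slides' exact wording, runs as follows. The two independence approximations in the last step are the standard HMM assumptions: each word depends only on its own tag, and each tag depends only on the previous tag.

```latex
\hat{t}_1^n
  = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
  = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
  = \operatorname*{argmax}_{t_1^n}
      \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\;
      \underbrace{P(t_1^n)}_{\text{prior}}
  \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The denominator P(w_1^n) can be dropped because it is the same for every candidate tag sequence.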
Two Kinds of Probabilities • Tag transition probabilities P(t_i|t_{i-1}) • Determiners likely to precede adjs and nouns • That/DT flight/NN • The/DT yellow/JJ hat/NN • So we expect P(NN|DT) and P(JJ|DT) to be high • But P(DT|JJ) to be: • Compute P(NN|DT) by counting in a labeled corpus: $P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}$
Two Kinds of Probabilities • Word likelihood probabilities P(w_i|t_i) • VBZ (3sg Pres verb) likely to be "is" • Compute P(is|VBZ) by counting in a labeled corpus: $P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)}$
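Both kinds of probability are relative-frequency estimates over a hand-tagged corpus. A minimal sketch, assuming word/TAG input with whitespace-separated tokens. The three tagged sentences are invented for illustration and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical hand-tagged sentences (word/TAG), made up for this example.
tagged = [
    "That/DT flight/NN is/VBZ late/JJ",
    "The/DT yellow/JJ hat/NN is/VBZ here/RB",
    "The/DT koala/NN has/VBZ the/DT keys/NNS",
]

tag_count, tag_bigram, word_tag = Counter(), Counter(), Counter()
for sent in tagged:
    pairs = [tok.rsplit("/", 1) for tok in sent.split()]
    tags = ["<s>"] + [t for _, t in pairs]  # <s> marks the sentence start
    tag_count.update(tags)
    tag_bigram.update(zip(tags, tags[1:]))
    word_tag.update((w.lower(), t) for w, t in pairs)

def p_transition(tag, prev_tag):
    """P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return tag_bigram[(prev_tag, tag)] / tag_count[prev_tag]

def p_emission(word, tag):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return word_tag[(word.lower(), tag)] / tag_count[tag]

print(p_transition("NN", "DT"), p_transition("JJ", "DT"), p_transition("DT", "JJ"))
print(p_emission("is", "VBZ"))
```

On a corpus this small, many transitions get probability 0, which is the same sparsity problem smoothing addresses for word N-grams.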
Example: The Verb "race" • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • How do we pick the right tag?
Disambiguating "race"
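To choose between race/VB and race/NN in "to race tomorrow", only the factors that involve the tag of "race" matter, since every other factor is shared by both readings. A sketch of that comparison follows. All probability values are HYPOTHETICAL placeholders chosen to illustrate the mechanics. They are not estimates from any corpus.

```python
# HYPOTHETICAL probabilities, for illustration only.
transition = {                # P(tag | previous tag), keyed as (previous, tag)
    ("TO", "VB"): 0.8,
    ("TO", "NN"): 0.001,
    ("VB", "NR"): 0.003,
    ("NN", "NR"): 0.001,
}
emission = {                  # P(word | tag), keyed as (word, tag)
    ("race", "VB"): 0.0001,
    ("race", "NN"): 0.0006,
}

def local_score(prev_tag, tag, word, next_tag):
    """P(tag|prev_tag) * P(word|tag) * P(next_tag|tag): the factors that differ."""
    return (transition[(prev_tag, tag)]
            * emission[(word, tag)]
            * transition[(tag, next_tag)])

scores = {tag: local_score("TO", tag, "race", "NR") for tag in ("VB", "NN")}
best = max(scores, key=scores.get)
print(scores, best)
```

With these placeholder values the verb reading wins even though "race" is more likely as a noun in isolation, because the transition from TO strongly favors VB.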