
Natural Language Processing

Natural Language Processing. Lecture 7—9/19/2013 Jim Martin. Today. More Language modeling (N-grams) Smoothing Finish Good-Turing Pretty good smoothing Bayesian prior smoothing Word classes Part of speech tagging. Smoothing Dealing w/ Zero Counts. Back to Shakespeare


Presentation Transcript


  1. Natural Language Processing Lecture 7—9/19/2013 Jim Martin

  2. Today • More Language modeling (N-grams) • Smoothing • Finish Good-Turing • Pretty good smoothing • Bayesian prior smoothing • Word classes • Part of speech tagging Speech and Language Processing - Jurafsky and Martin

  3. Smoothing: Dealing w/ Zero Counts • Back to Shakespeare • Recall that Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams... • So, 99.96% of the possible bigrams were never seen (have zero entries in the table) • Does that mean that any sentence that contains one of those bigrams should have a probability of 0? • For generation (Shannon game) it means we’ll never emit those bigrams • But for analysis it’s problematic because if we run across a new bigram in the future then we have no choice but to assign it a probability of zero.
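The 99.96% figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a vocabulary of V = 29,066 word types, the figure usually quoted for Shakespeare in the Jurafsky and Martin text; that number does not appear on the slide.

```python
# Assumed vocabulary size for Shakespeare (not stated on the slide).
V = 29_066
seen_bigram_types = 300_000  # from the slide

possible_bigrams = V ** 2
unseen_fraction = 1 - seen_bigram_types / possible_bigrams

print(f"possible bigrams: {possible_bigrams:,}")
print(f"never seen: {unseen_fraction:.2%}")
```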

  4. Zero Counts • Some of those zeros are really zeros... • Things that really aren’t ever going to happen • Fewer of these than you might think • On the other hand, some of them are just rare events. • If the training corpus had been a little bigger they would have had a count • What would that count be in all likelihood?

  5. Zero Counts • Zipf’s Law (long tail phenomenon) • A small number of events occur with high frequency • A large number of events occur with low frequency • You can quickly collect statistics on the high frequency events • You might have to wait an arbitrarily long time to get good statistics on low frequency events • Result • Our estimates are necessarily sparse! We have no counts at all for the vast number of events we want to estimate. • Answer • Estimate the likelihood of unseen (zero count) N-grams!

  6. Laplace Smoothing • Also called Add-One smoothing • Just add one to all the counts! • Very simple • MLE estimate: • Laplace estimate: • Reconstructed counts:
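The three quantities named on this slide can be sketched for the unigram case. The formulas used here are the standard textbook ones and are an assumption about what the slide showed: MLE P(w) = c/N, Laplace P(w) = (c + 1)/(N + V), and reconstructed count c* = (c + 1) · N/(N + V), where N is the number of tokens and V the vocabulary size. The toy sentence and the function name are hypothetical.

```python
from collections import Counter

def laplace_unigram(tokens):
    """Return MLE probabilities, add-one probabilities, and reconstructed counts."""
    counts = Counter(tokens)
    N = len(tokens)   # total number of tokens
    V = len(counts)   # vocabulary size (here: observed types only, an assumption)
    mle = {w: c / N for w, c in counts.items()}
    laplace = {w: (c + 1) / (N + V) for w, c in counts.items()}
    reconstructed = {w: (c + 1) * N / (N + V) for w, c in counts.items()}
    return mle, laplace, reconstructed

# Hypothetical toy corpus: 10 tokens, 6 types.
tokens = "i want to eat chinese food i want to eat".split()
mle, laplace, recon = laplace_unigram(tokens)

for w in sorted(mle):
    print(f"{w:8s} mle={mle[w]:.3f} laplace={laplace[w]:.3f} c*={recon[w]:.3f}")
```

Every seen word gives up some probability mass: "i" drops from a count of 2 to a reconstructed count below 2, and the difference is what an unseen word would receive.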

  7. BERP Bigram Counts

  8. Laplace-Smoothed Bigram Counts

  9. Laplace-Smoothed Bigram Probabilities

  10. Reconstituted Counts

  11. Reconstituted Counts (2)

  12. Big Change to the Counts! • C(want to) went from 608 to 238! • P(to|want) from .66 to .26! • Discount d = c*/c • d for “chinese food” = .10!!! A 10x reduction • So in general, Laplace is a blunt instrument • Could use more fine-grained method (add-k) • But Laplace smoothing not generally used for N-grams, as we have much better methods • Despite its flaws Laplace (add-k) is however still used to smooth other probabilistic models in NLP, especially • For pilot studies • In document classification • Information retrieval • In domains where the number of zeros isn’t so huge.
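The drop from 608 to 238 follows from the add-one reconstituted count for bigrams, c* = (c + 1) · C(w_prev) / (C(w_prev) + V). A minimal sketch: the count 608 is from the slide, while C(want) = 927 and V = 1446 are assumed from the BERP tables in the Jurafsky and Martin text and are not in this transcript.

```python
c_want_to = 608   # C(want to), from the slide
c_want = 927      # C(want), assumed from the textbook's BERP table
V = 1446          # BERP vocabulary size, assumed from the textbook

p_mle = c_want_to / c_want                          # unsmoothed P(to|want)
p_laplace = (c_want_to + 1) / (c_want + V)          # add-one P(to|want)
c_star = (c_want_to + 1) * c_want / (c_want + V)    # reconstituted count
discount = c_star / c_want_to                       # d = c*/c

print(round(p_mle, 2), round(p_laplace, 2), round(c_star), round(discount, 2))
```

With a vocabulary larger than the count of the conditioning word, the added V in the denominator dominates, which is why frequent bigrams lose so much mass.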

  13. Fun with Unix • Thanks to Ken Church • Unix for Poets

  14. Better Smoothing • Intuition used by many smoothing algorithms • Good-Turing • Kneser-Ney • Witten-Bell • Use the count of things we’ve seen once to help estimate the count of things we’ve never seen

  15. One Fish Two Fish • Imagine you are fishing • There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass • Not sure where this fishing hole is... • You have caught up to now • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish • How likely is it that the next fish to be caught is an eel? • How likely is it that the next fish caught will be a member of a newly seen species? • Now how likely is it that the next fish caught will be an eel? Slide adapted from Josh Goodman

  16. Good-Turing • Notation: N_x is the frequency of frequency x • So N_10 = 1 • Number of fish species seen 10 times is 1 (carp) • N_1 = 3 • Number of fish species seen once is 3 (trout, salmon, eel) • To estimate the probability of an unseen species • Use number of species (words) we’ve seen once • c_0* = c_1; p_0 = N_1/N = 3/18 • All other estimates are adjusted downward to account for unseen probabilities • c*(eel) = c*(1) = (1+1) × N_2/N_1 = 2 × 1/3 = 2/3 Slide from Josh Goodman
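The fishing example can be worked through in code. This is a sketch of the basic Good-Turing re-estimate c* = (c + 1) · N_{c+1} / N_c with no smoothing of the N_c values; the variable and function names are illustrative.

```python
from collections import Counter

# Catch so far, from the slide.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

N = sum(catch.values())                  # 18 fish in total
freq_of_freq = Counter(catch.values())   # N_c: number of species seen c times

def gt_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

p_unseen = freq_of_freq[1] / N   # mass reserved for new species: N_1 / N = 3/18
c_star_eel = gt_count(1)         # (1 + 1) * N_2 / N_1 = 2 * 1/3
p_eel = c_star_eel / N           # smoothed probability that the next fish is an eel

print(p_unseen, c_star_eel, p_eel)
```

Calling `gt_count(3)` on this data returns 0 because no species was seen four times (N_4 = 0), which shows why the raw N_c values cannot be used directly for higher counts.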

  17. Bigram Frequencies of Frequencies and GT Re-estimates

  18. Bigram Frequencies of Frequencies and GT Re-estimates • c*(3) = (3+1) × N_4/N_3 = 4 × (381/642) = 4 × .593 = 2.37

  19. GT Smoothed Bigram Probabilities

  20. GT Complications • In practice, assume large counts (c > k for some k) are reliable: • Also, need all the N_k to be non-zero, so we need to smooth (interpolate) the N_k counts before computing c* from them

  21. Pretty Good Smoothing • Maximum Likelihood Estimation • Laplace Smoothing • Bayesian prior Smoothing

  22. Pretty Good Smoothing • Why is there a 1 here? • Bayesian prior smoothing

  23. Toolkits • With FSAs/FSTs... • Openfst.org • For language modeling • SRILM • SRI Language Modeling Toolkit • All the bells and whistles you can imagine

  24. Break • HW Questions?

  25. Break • Quiz is Thursday Oct 3. • Chapters 1 to 6 • I’ll post specific readings (when enough people remind (nag) me)

  26. Back to Some Linguistics

  27. Word Classes: Parts of Speech • 8 (ish) traditional parts of speech • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc. • Also known as • parts-of-speech, lexical categories, word classes, morphological classes, lexical tags... • Lots of debate within linguistics and cognitive science community about the number, nature, and universality of these • We’ll completely ignore this debate

  28. POS examples • N noun chair, bandwidth, pacing • V verb study, debate, munch • ADJ adjective purple, tall, ridiculous • ADV adverb unfortunately, slowly • P preposition of, by, to • PRO pronoun I, me, mine • DET determiner the, a, that, those

  29. POS Tagging • The process of assigning a part-of-speech marker to each word in some text. • WORD/tag: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N

  30. Why POS Tagging is Useful • First step of a vast number of practical tasks • Speech synthesis • How to pronounce “lead”? • INsult inSULT • OBject obJECT • OVERflow overFLOW • DIScount disCOUNT • CONtent conTENT • Parsing • Helpful to know parts of speech before you start parsing • Analogy to lex/yacc (flex/bison) • Information extraction • Finding names, relations, etc. • Machine Translation

  31. Open and Closed Classes • Closed class: a small(ish) fixed membership • Usually function words (short common words which play a role in grammar) • Open class: new ones can be created all the time • English has 4: Nouns, Verbs, Adjectives, Adverbs • Many languages have these 4, but not all! • Nouns are typically where the bulk of the action is with respect to new items

  32. Open Class Words • Nouns • Proper nouns (Boulder, Granby, Beyoncé, Cairo) • English capitalizes these • Common nouns (the rest) • Count nouns and mass nouns • Count: have plurals, get counted: goat/goats, one goat, two goats • Mass: don’t get counted (snow, salt, communism) (*two snows) • Adverbs: tend to modify things • Unfortunately, John walked home extremely slowly yesterday • Directional/locative adverbs (here, home, downhill) • Degree adverbs (extremely, very, somewhat) • Manner adverbs (slowly, slinkily, delicately) • Verbs • In English, have morphological affixes (eat/eats/eaten) • With differing patterns of regularity

  33. Closed Class Words Examples: • prepositions: on, under, over, … • particles: up, down, on, off, … • determiners: a, an, the, … • pronouns: she, who, I, … • conjunctions: and, but, or, … • auxiliary verbs: can, may, should, … • numerals: one, two, three, third, …

  34. Prepositions from CELEX

  35. English Particles

  36. Conjunctions

  37. POS Tagging: Choosing a Tagset • There are many potential distinctions we can draw leading to potentially large tagsets • To do POS tagging, we need to choose a standard set of tags to work with • Could pick very coarse tagsets • N, V, Adj, Adv. • More commonly used set is the finer grained, “Penn TreeBank tagset”, 45 tags • PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist

  38. Penn TreeBank POS Tagset

  39. POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.

  40. How Hard is POS Tagging? Measuring Ambiguity

  41. Two Methods for POS Tagging • Rule-based tagging • See the text • Stochastic • Probabilistic sequence models • HMM (Hidden Markov Model) tagging • MEMMs (Maximum Entropy Markov Models)

  42. POS Tagging as Sequence Classification • We are given a sentence (an “observation” or “sequence of observations”) • Secretariat is expected to race tomorrow • What is the best sequence of tags that corresponds to this sequence of observations? • Probabilistic view: • Consider all possible sequences of tags • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1…w_n.

  43. Getting to HMMs • We want, out of all sequences of n tags t_1…t_n, the single tag sequence such that P(t_1…t_n|w_1…w_n) is highest. • Hat ^ means “our estimate of the best one” • argmax_x f(x) means “the x such that f(x) is maximized”

  44. Getting to HMMs • This equation is guaranteed to give us the best tag sequence • But how to make it operational? How to compute this value? • Intuition of Bayesian inference • Use Bayes rule to transform this equation into a set of other probabilities that are easier to compute

  45. Using Bayes Rule

  46. Likelihood and Prior
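The Bayes-rule step can be written out as follows. This is the standard textbook derivation, given here under the assumption that the slides use the same notation, with t_1^n for the tag sequence and w_1^n for the word sequence. The denominator P(w_1^n) is dropped because it is the same for every candidate tag sequence, and the last line applies the usual HMM independence assumptions (each word depends only on its own tag, each tag only on the previous tag).

```latex
\begin{align*}
\hat{t}_1^n &= \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \\
            &= \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} \\
            &= \operatorname*{argmax}_{t_1^n} \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\; \underbrace{P(t_1^n)}_{\text{prior}} \\
            &\approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{align*}
```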

  47. Two Kinds of Probabilities • Tag transition probabilities P(t_i|t_{i-1}) • Determiners likely to precede adjs and nouns • That/DT flight/NN • The/DT yellow/JJ hat/NN • So we expect P(NN|DT) and P(JJ|DT) to be high • But P(DT|JJ) to be: • Compute P(NN|DT) by counting in a labeled corpus:

  48. Two Kinds of Probabilities • Word likelihood probabilities P(w_i|t_i) • VBZ (3sg Pres verb) likely to be “is” • Compute P(is|VBZ) by counting in a labeled corpus:
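Both kinds of probability are relative frequencies in a tagged corpus: P(t_i|t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) and P(w_i|t_i) = C(t_i, w_i) / C(t_i). A minimal sketch on a tiny hand-tagged corpus; the three sentences and the function names are hypothetical, and no smoothing or sentence-boundary tags are modeled.

```python
from collections import Counter

# Tiny hand-tagged corpus (hypothetical, for illustration only).
tagged = [
    [("that", "DT"), ("flight", "NN")],
    [("the", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
]

tag_counts = Counter()
transition_counts = Counter()  # keyed by (previous tag, tag)
emission_counts = Counter()    # keyed by (tag, word)

for sentence in tagged:
    prev = None
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[(tag, word)] += 1
        if prev is not None:
            transition_counts[(prev, tag)] += 1
        prev = tag

def p_transition(tag, prev):
    """P(tag | prev) = C(prev, tag) / C(prev)."""
    return transition_counts[(prev, tag)] / tag_counts[prev]

def p_emission(word, tag):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emission_counts[(tag, word)] / tag_counts[tag]

print(p_transition("NN", "DT"), p_transition("JJ", "DT"), p_transition("DT", "JJ"))
print(p_emission("is", "VBZ"))
```

On this toy data a determiner is never seen after an adjective, so the unsmoothed estimate of P(DT|JJ) is exactly zero, which is the same zero-count problem smoothing addresses for N-grams.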

  49. Example: The Verb “race” • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • How do we pick the right tag?

  50. Disambiguating “race”
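For "to race tomorrow", the two candidate taggings differ only in the tag of "race", so it is enough to compare the three factors that involve that tag: P(tag|TO) · P(NR|tag) · P(race|tag). A sketch using the probability values given in the Jurafsky and Martin text for this example; those numbers are an assumption here, since the slide's figure is not preserved in this transcript.

```python
# Probabilities assumed from the textbook's corpus estimates;
# they do not appear in this transcript.
p_trans = {
    ("TO", "VB"): 0.83,     # P(VB|TO)
    ("TO", "NN"): 0.00047,  # P(NN|TO)
    ("VB", "NR"): 0.0027,   # P(NR|VB)
    ("NN", "NR"): 0.0012,   # P(NR|NN)
}
p_word = {
    ("race", "VB"): 0.00012,  # P(race|VB)
    ("race", "NN"): 0.00057,  # P(race|NN)
}

def score(tag):
    """Product of the factors that differ between the two readings of 'race'."""
    return p_trans[("TO", tag)] * p_trans[(tag, "NR")] * p_word[("race", tag)]

scores = {tag: score(tag) for tag in ("VB", "NN")}
best = max(scores, key=scores.get)
print(best, scores)
```

Although "race" is more likely as a noun in isolation, the transition from TO strongly favors a verb, so the VB reading wins by several hundred to one.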
