SIMS 290-2: Applied Natural Language Processing

SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 15, 2004

Class Pace and Schedule • Need a foundation before you can do anything interesting. • Tokenizing, Tagging, Regex’s • Text Classification Principles and Techniques • Training vs. Testing, processing corpora • Through (approximately) the 6th week, keep doing exercises from the NLTK tutorials to build that foundation. • 2 more homeworks • I’m trying to make them bite-sized pieces • 7th – 10th Group Miniproject on Enron Corpus • Will involve classification or Information Extraction • Different groups will do different things • May have a homework within this timeframe • 11th – 15th Another Miniproject • Either on Enron project or your choices • I will suggest ideas; you can propose them too • May also have 1-2 other homeworks in this timeframe

Language Modeling • A fundamental concept in NLP • Main idea: • For a given language, some words are more likely than others to follow each other, or • You can predict (with some degree of accuracy) the probability that a given word will follow another word. • Illustration: • Distributions of words in class-participation exercise.

Next Word Prediction • From a NY Times story... • Stocks ... • Stocks plunged this …. • Stocks plunged this morning, despite a cut in interest rates • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ... • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began Adapted from slide by Bonnie Dorr

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last … • Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks. Adapted from slide by Bonnie Dorr

Human Word Prediction • Clearly, at least some of us have the ability to predict future words in an utterance. • How? • Domain knowledge • Syntactic knowledge • Lexical knowledge Adapted from slide by Bonnie Dorr

Claim • A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques • In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence) Adapted from slide by Bonnie Dorr

Applications • Why do we want to predict a word, given some preceding words? • Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR: Theatre owners say popcorn/unicorn sales have doubled... • Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation: The doctor recommended a cat scan. El doctor recomendó una exploración del gato. Adapted from slide by Bonnie Dorr

N-Gram Models of Language • Use the previous N-1 words in a sequence to predict the next word • Language Model (LM) • unigrams, bigrams, trigrams,… • How do we train these models? • Very large corpora Adapted from slide by Bonnie Dorr

Simple N-Grams • Assume a language has V word types in its lexicon, how likely is word x to follow word y? • Simplest model of word probability: 1/V • Alternative 1: estimate likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability) popcorn is more likely to occur than unicorn • Alternative 2: condition the likelihood of x occurring in the context of previous words (bigrams, trigrams,…) mythical unicorn is more likely than mythical popcorn Adapted from slide by Bonnie Dorr
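
The three estimates on this slide (uniform, unigram, bigram) can be compared on a small example. The following is a minimal sketch, assuming a toy corpus invented for illustration; the function names are hypothetical and not part of NLTK.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "the mythical unicorn ate popcorn and the child ate popcorn".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_uniform(vocab_size):
    """Simplest model: every word type is equally likely (1/V)."""
    return 1.0 / vocab_size

def p_unigram(word):
    """Relative frequency of `word` in the corpus."""
    return unigram_counts[word] / len(corpus)

def p_bigram(word, prev):
    """Relative frequency of `word` among the tokens that follow `prev`."""
    prev_total = sum(c for (first, _), c in bigram_counts.items() if first == prev)
    return bigram_counts[(prev, word)] / prev_total if prev_total else 0.0

vocab_size = len(unigram_counts)
print(p_uniform(vocab_size))
print(p_unigram("popcorn"), p_unigram("unicorn"))
print(p_bigram("unicorn", "mythical"), p_bigram("popcorn", "mythical"))
```

In this toy corpus popcorn is the more frequent unigram, but after mythical only unicorn has been observed, which is the contrast the slide describes.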

A Word on Notation • P(unicorn) • Read this as “The probability of seeing the token unicorn” • Unigram tagger uses this. • P(unicorn|mythical) • Called the Conditional Probability. • Read this as “The probability of seeing the token unicorn given that you’ve seen the token mythical” • Bigram tagger uses this. • Related to the conditional frequency distributions that we’ve been working with.

Computing the Probability of a Word Sequence • Compute the product of component conditional probabilities? • P(the mythical unicorn) = P(the) P(mythical|the) P(unicorn|the mythical) • The longer the sequence, the less likely we are to find it in a training corpus P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal) • Solution: approximate using n-grams Adapted from slide by Bonnie Dorr

Bigram Model • Approximate P(unicorn|the mythical) by P(unicorn|mythical) • Markov assumption: • The probability of a word depends only on a limited history • Generalization: • The probability of a word depends only on the n previous words • trigrams, 4-grams, … • the higher n is, the more data needed to train • backoff models Adapted from slide by Bonnie Dorr

Using N-Grams • For N-gram models • P(w_{n-1}, w_n) = P(w_n | w_{n-1}) P(w_{n-1}) • By the Chain Rule we can decompose a joint probability, e.g. P(w_1, w_2, w_3): P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) … P(w_n | w_1, ..., w_{n-1}) • For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams: P(the, mythical, unicorn) = P(unicorn|mythical) P(mythical|the) P(the|<start>) Adapted from slide by Bonnie Dorr
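
The bigram product can be sketched in a few lines of Python. This is an illustration only, assuming invented training sentences and hypothetical helper names (train_bigrams, sequence_probability); unseen bigrams simply get probability 0 here, with no smoothing or backoff.

```python
from collections import Counter, defaultdict

START = "<start>"

def train_bigrams(sentences):
    """Count bigrams, then normalize each row into P(word | prev)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        prev = START
        for word in sentence.split():
            counts[prev][word] += 1
            prev = word
    return {
        prev: {w: c / sum(following.values()) for w, c in following.items()}
        for prev, following in counts.items()
    }

def sequence_probability(model, sentence):
    """Product of P(w_i | w_{i-1}); unseen bigrams get probability 0."""
    prob = 1.0
    prev = START
    for word in sentence.split():
        prob *= model.get(prev, {}).get(word, 0.0)
        prev = word
    return prob

# Toy training data, invented for illustration.
model = train_bigrams([
    "the mythical unicorn",
    "the mythical beast",
    "a mythical unicorn",
    "the unicorn",
])
p_seen = sequence_probability(model, "the mythical unicorn")
p_unseen = sequence_probability(model, "the popcorn")
print(p_seen, p_unseen)
```

A sequence containing any bigram absent from the training data scores zero, which is one reason backoff models are used in practice.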

Training and Testing • N-Gram probabilities come from a training corpus • overly narrow corpus: probabilities don't generalize • overly general corpus: probabilities don't reflect task or domain • A separate test corpus is used to evaluate the model, typically using standard metrics • held out test set; development test set • cross validation • results tested for statistical significance Adapted from slide by Bonnie Dorr
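
Cross validation can be sketched as follows. This is a minimal illustration, assuming placeholder data and a hypothetical helper k_fold_splits; it only shows how the corpus is partitioned so that each item is held out exactly once, not how a model is trained or scored.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Placeholder "corpus" of ten sentences.
sentences = ["sentence %d" % i for i in range(10)]
splits = list(k_fold_splits(sentences, 5))
for train, test in splits:
    print(len(train), len(test))
```

Each round would train N-gram counts on the train portion only and report the metric on the corresponding test portion.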

A Simple Example • From BeRP: The Berkeley Restaurant Project • A testbed for a Speech Recognition project • System prompts user for information in order to fill in slots in a restaurant database. • Type of food, hours open, how expensive • After getting lots of input, can compute how likely it is that someone will say X given that they already said Y. P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) Adapted from slide by Bonnie Dorr

A Bigram Grammar Fragment from BeRP
Eat on        .16     Eat Thai        .03
Eat some      .06     Eat breakfast   .03
Eat lunch     .06     Eat in          .02
Eat dinner    .05     Eat Chinese     .02
Eat at        .04     Eat Mexican     .02
Eat a         .04     Eat tomorrow    .01
Eat Indian    .04     Eat dessert     .007
Eat today     .03     Eat British     .001
Adapted from slide by Bonnie Dorr

<start> I     .25     Want some           .04
<start> I’d   .06     Want Thai           .01
<start> Tell  .04     To eat              .26
<start> I’m   .02     To have             .14
I want        .32     To spend            .09
I would       .29     To be               .02
I don’t       .08     British food        .60
I have        .04     British restaurant  .15
Want to       .65     British cuisine     .01
Want a        .05     British lunch       .01
Adapted from slide by Bonnie Dorr

P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25*.32*.65*.26*.001*.60 = .0000081 • vs. P(I want to eat Chinese food) = .00015 • Probabilities seem to capture “syntactic” facts, “world knowledge” • eat is often followed by an NP • British food is not too popular • N-gram models can be trained by counting and normalization Adapted from slide by Bonnie Dorr
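
The product for the British food sentence can be checked mechanically. The sketch below assumes only the six bigram probabilities quoted from the BeRP fragment; the dictionary and function names are hypothetical, and math.prod requires Python 3.8 or later.

```python
import math

# Bigram probabilities copied from the BeRP fragment in these slides.
bigram_prob = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

def sentence_probability(sentence):
    """Multiply P(w_i | w_{i-1}) over the sentence, starting from <start>."""
    words = ["<start>"] + sentence.split()
    return math.prod(bigram_prob[pair] for pair in zip(words, words[1:]))

p = sentence_probability("I want to eat British food")
print(f"{p:.7f}")
```

For longer sentences, implementations usually sum log probabilities instead of multiplying, to avoid numerical underflow.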

Tagging with lexical frequencies • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Problem: assign a tag to race given its lexical frequency • Solution: choose the tag with the greater probability • P(race|VB): probability of the word “race” given the tag VB • P(race|NN): probability of the word “race” given the tag NN • Actual estimates from the Switchboard corpus: • P(race|NN) = .00041 • P(race|VB) = .00003 Modified from Massimo Poesio's lecture
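
Choosing the tag by lexical frequency amounts to a single comparison. A minimal sketch, assuming only the two Switchboard estimates quoted on the slide; most_likely_tag is a hypothetical helper name.

```python
# Estimates quoted on the slide (Switchboard corpus).
p_word_given_tag = {
    ("race", "NN"): 0.00041,
    ("race", "VB"): 0.00003,
}

def most_likely_tag(word, candidate_tags):
    """Pick the candidate tag with the largest P(word | tag)."""
    return max(candidate_tags, key=lambda tag: p_word_given_tag.get((word, tag), 0.0))

chosen = most_likely_tag("race", ["NN", "VB"])
print(chosen)
```

Because the context is ignored, this picks NN for both example sentences, including "to race tomorrow", where VB is correct. That limitation motivates taggers that also look at the preceding tag.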

Combining Taggers • Use more accurate algorithms when we can, backoff to wider coverage when needed. • Try tagging the token with the 1st order tagger. • If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger. • If the 0th order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag. Modified from Diane Litman's version of Steve Bird's notes

BackoffTagger class
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
>>> # Construct the taggers
>>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')
>>> tagger2 = UnigramTagger()   # 0th order
>>> tagger3 = NN_CD_Tagger()
>>> # Train the taggers
>>> for tok in train_toks:
...     tagger1.train(tok)
...     tagger2.train(tok)
Modified from Diane Litman's version of Steve Bird's notes

Backoff (continued)
>>> # Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
>>> # Use the combined tagger
>>> accuracy = tagger_accuracy(tagger, unseen_tokens)
Modified from Diane Litman's version of Steve Bird's notes
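
The backoff idea itself does not depend on the NLTK classes. The following is a self-contained sketch in plain Python, assuming hypothetical class names (BigramSketch, UnigramSketch, DefaultSketch) and a two-sentence training set invented for illustration; it is not the NLTK implementation. The last-resort tagger is assumed to behave like NN_CD_Tagger: CD for digit strings, NN otherwise.

```python
from collections import Counter, defaultdict

class UnigramSketch:
    """Most frequent tag per word (0th order)."""
    def __init__(self):
        self.counts = defaultdict(Counter)
    def train(self, tagged_sentence):
        for word, tag in tagged_sentence:
            self.counts[word][tag] += 1
    def tag(self, word, prev_tag):
        seen = self.counts.get(word)
        return seen.most_common(1)[0][0] if seen else None

class BigramSketch:
    """Most frequent tag per (previous tag, word) pair (1st order)."""
    def __init__(self):
        self.counts = defaultdict(Counter)
    def train(self, tagged_sentence):
        prev_tag = None
        for word, tag in tagged_sentence:
            self.counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    def tag(self, word, prev_tag):
        seen = self.counts.get((prev_tag, word))
        return seen.most_common(1)[0][0] if seen else None

class DefaultSketch:
    """Last resort: CD for digit strings, NN for everything else."""
    def tag(self, word, prev_tag):
        return "CD" if word.isdigit() else "NN"

def backoff_tag(words, taggers):
    """Ask each tagger in turn; keep the first answer that is not None."""
    tagged, prev_tag = [], None
    for word in words:
        answers = (tagger.tag(word, prev_tag) for tagger in taggers)
        tag = next(a for a in answers if a is not None)
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Tiny training set, invented for illustration.
train = [[("the", "DT"), ("race", "NN")], [("to", "TO"), ("race", "VB")]]
bigram, unigram = BigramSketch(), UnigramSketch()
for sentence in train:
    bigram.train(sentence)
    unigram.train(sentence)

result = backoff_tag("to race the 42 unicorn".split(),
                     [bigram, unigram, DefaultSketch()])
print(result)
```

The taggers are listed from most specific to most general, so the final tagger must always return a tag; otherwise the chain could fail on unseen words.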

Rule-Based Tagger • The Linguistic Complaint • Where is the linguistic knowledge of a tagger? • Just a massive table of numbers • Aren’t there any linguistic insights that could emerge from the data? • Could thus use handcrafted sets of rules to tag input sentences, for example: if a word follows a determiner, tag it as a noun. Modified from Diane Litman's version of Steve Bird's notes

The Brill tagger • An example of TRANSFORMATION-BASED LEARNING • Very popular (freely available, works fairly well) • A SUPERVISED method: requires a tagged corpus • Basic idea: do a quick job first (using frequency), then revise it using contextual rules Slide modified from Massimo Poesio's

Brill Tagging: In more detail • Start with simple (less accurate) rules…learn better ones from tagged corpus • Tag each word initially with most likely POS • Examine set of transformations to see which improves tagging decisions compared to tagged corpus • Re-tag corpus using best transformation • Repeat until, e.g., performance doesn’t improve • Result: tagging procedure (ordered list of transformations) which can be applied to new, untagged text
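
The learning loop can be sketched as a greedy search over candidate transformations. This is a simplified illustration, not Brill's implementation: it assumes a single rule template (change tag A to B when the previous tag is C), a hand-written list of candidate rules, and a five-token toy corpus; all names are hypothetical.

```python
def apply_rule(tags, rule):
    """Rule (from_tag, to_tag, prev_tag): retag from_tag as to_tag after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags

def errors(tags, gold):
    """Number of positions where the current tags disagree with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(initial_tags, gold, candidate_rules):
    """Greedily keep the rule that leaves the fewest errors until none helps."""
    tags, learned = list(initial_tags), []
    while True:
        best = min(candidate_rules, key=lambda r: errors(apply_rule(tags, r), gold))
        if errors(apply_rule(tags, best), gold) >= errors(tags, gold):
            return learned
        tags = apply_rule(tags, best)
        learned.append(best)

# Toy data for "to race tomorrow the race"; "race" starts out as NN everywhere.
gold    = ["TO", "VB", "NN", "DT", "NN"]
initial = ["TO", "NN", "NN", "DT", "NN"]
candidates = [("NN", "VB", "TO"), ("NN", "VB", "DT"), ("NN", "JJ", "NN")]
learned = learn_rules(initial, gold, candidates)
print(learned)
```

The returned list is ordered, so tagging new text means applying the initial tagger and then each learned rule in sequence.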

An example • Examples: • They are expected to race tomorrow. • The race for outer space. • Tagging algorithm: • Tag all uses of “race” as NN (most likely tag in the Brown corpus) • They are expected to race/NN tomorrow • the race/NN for outer space • Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO: • They are expected to race/VB tomorrow • the race/NN for outer space Slide modified from Massimo Poesio's
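
The two tagging steps in this example can be traced in code. A minimal sketch, assuming a tiny hand-written lexicon of most likely tags (invented for illustration) and a rule that applies to any NN after TO rather than to "race" alone; the function names are hypothetical.

```python
# Most likely tag per word; a tiny hand-written lexicon for illustration.
most_likely = {"they": "PRP", "are": "VBP", "expected": "VBN", "to": "TO",
               "race": "NN", "tomorrow": "NN", "the": "DT", "for": "IN",
               "outer": "JJ", "space": "NN"}

def initial_tags(words):
    """Step 1: give every word its most likely tag (NN if unknown)."""
    return [(w, most_likely.get(w.lower(), "NN")) for w in words]

def nn_to_vb_after_to(tagged):
    """Step 2: change NN to VB when the previous tag is TO."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == "NN" and result[i - 1][1] == "TO":
            result[i] = (word, "VB")
    return result

tagged_1 = nn_to_vb_after_to(initial_tags("They are expected to race tomorrow".split()))
tagged_2 = nn_to_vb_after_to(initial_tags("the race for outer space".split()))
print(tagged_1)
print(tagged_2)
```

Only the first sentence changes: "race" follows to/TO there, while in the second it follows the/DT and keeps the NN tag.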

First 20 Transformation Rules From: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging Eric Brill. Computational Linguistics. December, 1995.

Transformation Rules for Tagging Unknown Words From: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging Eric Brill. Computational Linguistics. December, 1995.

Additional issues • Most of the difference in performance between POS algorithms depends on their treatment of UNKNOWN WORDS • Class-based N-grams Adapted from Massimo Poesio's

Evaluating a Tagger • Tagged tokens – the original data • Untag (exclude) the data • Tag the data with your own tagger • Compare the original and new tags • Iterate over the two lists checking for identity and counting • Accuracy = fraction correct Modified from Diane Litman's version of Steve Bird's notes
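
The comparison step can be sketched as follows, assuming tokens are represented as (word, tag) pairs and that the gold and predicted lists are aligned token by token. The function name tagger_accuracy_sketch is hypothetical and is not the NLTK tagger_accuracy function.

```python
def tagger_accuracy_sketch(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("token lists must be the same length")
    correct = sum(g_tag == p_tag for (_, g_tag), (_, p_tag) in zip(gold, predicted))
    return correct / len(gold) if gold else 0.0

# Invented example: the tagger gets the verb use of "race" wrong.
gold = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB")]
predicted = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "NN")]
accuracy = tagger_accuracy_sketch(gold, predicted)
print(accuracy)
```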

Assessing the Errors Why the tuple method? Dictionaries cannot be indexed by lists, so convert lists to tuples. exclude returns a new token containing only the properties that are not named in the given list.
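
The point about tuples can be illustrated with a small error tally. This is a sketch in plain Python, assuming aligned lists of gold and predicted tags invented for illustration; confusion_counts is a hypothetical helper, not the code from the original slide.

```python
from collections import Counter

def confusion_counts(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs for tokens the tagger got wrong."""
    errors = Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        if gold != predicted:
            pair = [gold, predicted]
            # A list is unhashable and cannot be a dictionary key,
            # so it is converted to a tuple first.
            errors[tuple(pair)] += 1
    return errors

gold_tags = ["DT", "NN", "TO", "VB", "NN", "VB"]
predicted_tags = ["DT", "NN", "TO", "NN", "JJ", "NN"]
errors = confusion_counts(gold_tags, predicted_tags)
print(errors.most_common())
```

Sorting the tally by count shows which confusions (here VB mistaken for NN) account for most of the errors.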

Assessing the Errors

Upcoming • First assignment due 8pm tonight • Turn in on course Assignments page • For next week: • Read the Chunking tutorial. • (The pdf version has the missing images) • http://nltk.sourceforge.net/tutorial/chunking.pdf • We’ll have an assignment getting practice with this.
