1 / 30

LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing

LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing. Lecture 6: Part of Speech Tagging (II): October 14, 2004 Neal Snider. Thanks to Dan Jurafsky, Jim Martin, Dekang Lin, and Bonnie Dorr for some of the examples and details in these slides!.

jreaves
Download Presentation

LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. LING 138/238 SYMBSYS 138Intro to Computer Speech and Language Processing Lecture 6: Part of Speech Tagging (II): October 14, 2004 Neal Snider Thanks to Dan Jurafsky, Jim Martin, Dekang Lin, and Bonnie Dorr for some of the examples and details in these slides! LING 138/238 Autumn 2004

  2. Week 3: Part of Speech tagging • Part of speech tagging • Parts of speech • What’s POS tagging good for anyhow? • Tag sets • Rule-based tagging • Statistical tagging • “TBL” tagging LING 138/238 Autumn 2004

  3. Rule-based tagging • Start with a dictionary • Assign all possible tags to words from the dictionary • Write rules by hand to selectively remove tags • Leaving the correct tag for each word. LING 138/238 Autumn 2004

  4. 3 methods for POS tagging • Rule-based tagging • (ENGTWOL) • Stochastic (=Probabilistic) tagging • HMM (Hidden Markov Model) tagging • Transformation-based tagging • Brill tagger LING 138/238 Autumn 2004

  5. Statistical Tagging • Based on probability theory • Today we’ll go over a few basic ideas of probability theory • Then we’ll do HMM and TBL tagging. LING 138/238 Autumn 2004

  6. Probability and part of speech tags • What’s the probability of drawing a 2 from a deck of 52 cards with four 2s? • What’s the probability of a random word (from a random dictionary page) being a verb? LING 138/238 Autumn 2004

  7. Probability and part of speech tags • What’s the probability of a random word (from a random dictionary page) being a verb? • How to compute each of these • All words = just count all the words in the dictionary • # of ways to get a verb: number of words which are verbs! • If a dictionary has 50,000 entries, and 10,000 are verbs…. P(V) is 10000/50000 = 1/5 = .20 LING 138/238 Autumn 2004

  8. Probability and Independent Events • What’s the probability of picking two verbs randomly from the dictionary • Events are independent, so multiply probs P(w1=V,w2=V) = P(V) * P(V) = 1/5 * 1/5 = 0.04 • What if events are not independent? LING 138/238 Autumn 2004

  9. Conditional Probability • Written P(A|B). • Let’s say A is “it’s raining”. • Let’s say P(A) in drought-stricken California is .01 • Let’s say B is “it was sunny ten minutes ago” • P(A|B) means “what is the probability of it raining now if it was sunny 10 minutes ago” • P(A|B) is probably way less than P(A) • Let’s say P(A|B) is .0001 LING 138/238 Autumn 2004

  10. Conditional Probability and Tags • P(Verb) is the probability of a randomly selected word being a verb. • P(Verb|race) is “what’s the probability of a word being a verb given that it’s the word “race”? • Race can be a noun or a verb. • It’s more likely to be a noun. • P(Verb|race) can be estimated by looking at some corpus and saying “out of all the times we saw ‘race’, how many were verbs? • In Brown corpus, P(Verb|race) = 96/98 = .98 • How to calculate for a tag sequence, say P(NN|DT)? LING 138/238 Autumn 2004

  11. Stochastic Tagging • Based on probability of certain tag occurring given various possibilities • Necessitates a training corpus • No probabilities for words not in corpus. • Training corpus may be too different from test corpus. LING 138/238 Autumn 2004

  12. Stochastic Tagging (cont.) Simple Method: Choose most frequent tag in training text for each word! • Result: 90% accuracy • Why? • Baseline: Others will do better • HMM is an example LING 138/238 Autumn 2004

  13. HMM Tagger • Intuition: Pick the most likely tag for this word. • HMM Taggers choose tag sequence that maximizes this formula: • P(word|tag) × P(tag|previous n tags) • Let T = t1,t2,…,tnLet W = w1,w2,…,wnFind POS tags that generate a sequence of words, i.e., look for most probable sequence of tags T underlying the observed words W. LING 138/238 Autumn 2004

  14. HMM Tagger • argmaxT P(T|W) • argmaxTP(W|T)P(T) Bayes Rule • argmaxTP(w1…wn|t1…tn)P(t1…tn) Remember, we are trying to find the sequence T that will maximize P(T|W) so this equation is calculated over the whole sentence. LING 138/238 Autumn 2004

  15. HMM Tagger • Assume word is dependent only on its own POS tag: it is independent of the others around it • argmaxT[P(w1|t1)P(w2|t2)…P(wn|tn)][P(t1)P(t2|t1)…P(tn|tn-1)] LING 138/238 Autumn 2004

  16. Bigram HMM Tagger • Also assume that probability is dependent only on previous tag • For each word and possible tag, need to calculate: P(ti) = P(wi|ti)P(ti|ti-1) then multiply this over each possible tag and each word over the sequence of tags LING 138/238 Autumn 2004

  17. Bigram HMM Tagger • How do we compute P(ti|ti-1)? • c(ti-1ti)/c(ti-1) • How do we compute P(wi|ti)? • c(wi,ti)/c(ti) • How do we compute the most probable tag sequence? • Viterbi algorithm LING 138/238 Autumn 2004

  18. An Example • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • to/TO race/???the/DT race/??? • ti = argmaxj P(tj|ti-1)P(wi|tj) • i = num of word in sequence, j = num among possible tags • max[P(VB|TO)P(race|VB) , P(NN|TO)P(race|NN)] • Brown: • P(NN|TO) = .021 × P(race|NN) = .00041 = .000007 • P(VB|TO) = .34 × P(race|VB) = .00003 = .00001 LING 138/238 Autumn 2004

  19. Viterbi Algorithm S1 S2 S3 S4 S5 LING 138/238 Autumn 2004

  20. Transformation-Based Tagging (Brill Tagging) • Combination of Rule-based and stochastic tagging methodologies • Like rule-based because rules are used to specify tags in a certain environment • Like stochastic approach because machine learning is used—with tagged corpus as input • Input: • tagged corpus • dictionary (with most frequent tags) LING 138/238 Autumn 2004

  21. Transformation-Based Tagging (cont.) • Basic Idea: • Set the most probable tag for each word as a start value • Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order • Training is done on tagged corpus: • Write a set of rule templates • Among the set of rules, find one with highest score • Continue from 2 until lowest score threshold is passed • Keep the ordered set of rules • Rules make errors that are corrected by later rules LING 138/238 Autumn 2004

  22. TBL Rule Application • Tagger labels every word with its most-likely tag • For example: race has the following probabilities in the Brown corpus: • P(NN|race) = .98 • P(VB|race)= .02 • Transformation rules make changes to tags • “Change NN to VB when previous tag is TO”… is/VBZ expected/VBN to/TO race/NN tomorrow/NNbecomes… is/VBZ expected/VBN to/TO race/VB tomorrow/NN LING 138/238 Autumn 2004

  23. TBL: Rule Learning • 2 parts to a rule • Triggering environment • Rewrite rule • The range of triggering environments of templates (from Manning & Schutze 1999:363) Schema ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * LING 138/238 Autumn 2004

  24. TBL: The Tagging Algorithm • Step 1: Label every word with most likely tag (from dictionary) • Step 2: Check every possible transformation & select one which most improves tagging • Step 3: Re-tag corpus applying the rules • Repeat 2-3 until some criterion is reached, e.g., X% correct with respect to training corpus • RESULT: Sequence of transformation rules LING 138/238 Autumn 2004

  25. TBL: Rule Learning (cont.) • Problem: Could apply transformations ad infinitum! • Constrain the set of transformations with “templates”: • Replace tag X with tag Y, provided tag Z or word Z’ appears in some position • Rules are learned in ordered sequence • Rules may interact. • Rules are compact and can be inspected by humans LING 138/238 Autumn 2004

  26. Templates for TBL LING 138/238 Autumn 2004

  27. TBL: Problems • First 100 rules achieve 96.8% accuracyFirst 200 rules achieve 97.0% accuracy • Execution Speed: TBL tagger is slower than HMM approach • Learning Speed: Brill’s implementation over a day (600k tokens) BUT … (1) Learns small number of simple, non-stochastic rules (2) Can be made to work faster with FST (3) Best performing algorithm on unknown words LING 138/238 Autumn 2004

  28. Tagging Unknown Words • New words added to (newspaper) language 20+ per month • Plus many proper names … • Increases error rates by 1-2% • Method 1: assume they are nouns • Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set. • Method 3: Use morphological information, e.g., words ending with –ed tend to be tagged VBN. LING 138/238 Autumn 2004

  29. Using Morphological Information LING 138/238 Autumn 2004

  30. Evaluation • The result is compared with a manually coded “Gold Standard” • Typically accuracy reaches 96-97% • This may be compared with result for a baseline tagger (one that uses no context). • Important: 100% is impossible even for human annotators. LING 138/238 Autumn 2004

More Related