LING 138/238 / SYMBSYS 138: Intro to Computer Speech and Language Processing • Lecture 6: Part of Speech Tagging (II), October 14, 2004 • Neal Snider • Thanks to Dan Jurafsky, Jim Martin, Dekang Lin, and Bonnie Dorr for some of the examples and details in these slides!
Week 3: Part of Speech tagging • Part of speech tagging • Parts of speech • What’s POS tagging good for anyhow? • Tag sets • Rule-based tagging • Statistical tagging • “TBL” tagging
Rule-based tagging • Start with a dictionary • Assign all possible tags to words from the dictionary • Write rules by hand to selectively remove tags • Leaving the correct tag for each word.
3 methods for POS tagging • Rule-based tagging • (ENGTWOL) • Stochastic (=Probabilistic) tagging • HMM (Hidden Markov Model) tagging • Transformation-based tagging • Brill tagger
Statistical Tagging • Based on probability theory • Today we’ll go over a few basic ideas of probability theory • Then we’ll do HMM and TBL tagging.
Probability and part of speech tags • What’s the probability of drawing a 2 from a deck of 52 cards with four 2s? • What’s the probability of a random word (from a random dictionary page) being a verb?
Probability and part of speech tags • What’s the probability of a random word (from a random dictionary page) being a verb? • How to compute each of these: • All words = just count all the words in the dictionary • # of ways to get a verb: the number of words which are verbs! • If a dictionary has 50,000 entries, and 10,000 are verbs, then P(V) = 10,000/50,000 = 1/5 = .20
Probability and Independent Events • What’s the probability of picking two verbs randomly from the dictionary? • The events are independent, so multiply the probabilities: P(w1=V, w2=V) = P(V) × P(V) = 1/5 × 1/5 = .04 • What if events are not independent?
Conditional Probability • Written P(A|B). • Let’s say A is “it’s raining”. • Let’s say P(A) in drought-stricken California is .01 • Let’s say B is “it was sunny ten minutes ago” • P(A|B) means “what is the probability of it raining now if it was sunny 10 minutes ago” • P(A|B) is probably way less than P(A) • Let’s say P(A|B) is .0001
Conditional Probability and Tags • P(Verb) is the probability of a randomly selected word being a verb. • P(Verb|race) is “what’s the probability of a word being a verb, given that it’s the word race?” • Race can be a noun or a verb. • It’s more likely to be a noun. • P(Verb|race) can be estimated by looking at some corpus and asking “out of all the times we saw ‘race’, how many were verbs?” • In the Brown corpus, P(Noun|race) = 96/98 = .98, so P(Verb|race) = 2/98 = .02 • How to calculate for a tag sequence, say P(NN|DT)? (A counting sketch follows.)
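A minimal sketch of this relative-frequency estimate, assuming the corpus is represented as a list of (word, tag) pairs. The function name and the mini-corpus are illustrative, not real Brown data:

```python
from collections import Counter

def tag_given_word_prob(tagged_corpus, word, tag):
    """Estimate P(tag | word) as a relative frequency in a tagged corpus."""
    tags_for_word = Counter(t for w, t in tagged_corpus if w == word)
    total = sum(tags_for_word.values())
    return tags_for_word[tag] / total if total else 0.0

# Toy illustration (hypothetical counts, not the real Brown corpus):
corpus = [("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
          ("a", "DT"), ("race", "NN")]
print(tag_given_word_prob(corpus, "race", "NN"))  # 2/3 in this toy corpus
```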
Stochastic Tagging • Based on the probability of a certain tag occurring, given the various possibilities • Necessitates a training corpus • Limitation: no probabilities for words not in the corpus • Limitation: the training corpus may be too different from the test corpus
Stochastic Tagging (cont.) • Simple method: choose the most frequent tag in the training text for each word! • Result: about 90% accuracy • Why mention it? It is a baseline that other methods should beat • The HMM tagger is one example (a sketch of the baseline follows)
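A minimal sketch of this most-frequent-tag baseline, again assuming training data as (word, tag) pairs; the names and the NN default for unknown words are assumptions:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Baseline tagger: remember each word's most frequent training tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, lexicon, default="NN"):
    """Tag each word with its most frequent tag; unknown words get a default."""
    return [(w, lexicon.get(w, default)) for w in words]
```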
HMM Tagger • Intuition: Pick the most likely tag for this word. • HMM taggers choose the tag sequence that maximizes this formula: P(word|tag) × P(tag|previous n tags) • Let T = t1,t2,…,tn and let W = w1,w2,…,wn • Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.
HMM Tagger • argmaxT P(T|W) • = argmaxT P(W|T)P(T) (Bayes’ Rule; the denominator P(W) is the same for every candidate tag sequence, so it can be dropped) • = argmaxT P(w1…wn|t1…tn)P(t1…tn) • Remember, we are trying to find the sequence T that maximizes P(T|W), so this equation is calculated over the whole sentence.
HMM Tagger • Assume each word depends only on its own POS tag: it is independent of the other words and tags around it • argmaxT [P(w1|t1)P(w2|t2)…P(wn|tn)] [P(t1)P(t2|t1)…P(tn|tn-1)]
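Under these independence assumptions, the quantity being maximized is a product of emission terms P(wi|ti) and transition terms P(ti|ti-1). A sketch of the scoring function, assuming emit_p and trans_p are probability tables keyed by pairs (both hypothetical names):

```python
def sequence_score(words, tags, emit_p, trans_p):
    """P(W|T)P(T) under the HMM assumptions: a product of P(w_i|t_i)
    and P(t_i|t_{i-1}), using "<s>" as a pseudo-tag so P(t1) = P(t1|<s>)."""
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= emit_p.get((w, t), 0.0) * trans_p.get((prev, t), 0.0)
        prev = t
    return score
```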
Bigram HMM Tagger • Also assume that each tag depends only on the previous tag • For each word position i and each candidate tag ti, compute P(wi|ti)P(ti|ti-1); the probability of a whole tag sequence is the product of these terms over all the words in the sentence
Bigram HMM Tagger • How do we compute P(ti|ti-1)? • c(ti-1,ti)/c(ti-1) • How do we compute P(wi|ti)? • c(wi,ti)/c(ti) • How do we compute the most probable tag sequence? • The Viterbi algorithm (a training sketch follows; Viterbi is sketched two slides below)
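These maximum-likelihood estimates are just ratios of counts over a tagged training corpus. A minimal training sketch; the input format and the "<s>" start symbol are assumptions:

```python
from collections import Counter

def train_bigram_hmm(tagged_sentences):
    """MLE tables: P(t_i|t_{i-1}) = c(t_{i-1},t_i)/c(t_{i-1})
    and P(w_i|t_i) = c(w_i,t_i)/c(t_i)."""
    tag_c, bigram_c, emit_c = Counter(), Counter(), Counter()
    for sent in tagged_sentences:          # each sent: list of (word, tag)
        prev = "<s>"
        tag_c[prev] += 1
        for word, tag in sent:
            bigram_c[(prev, tag)] += 1
            emit_c[(word, tag)] += 1
            tag_c[tag] += 1
            prev = tag
    trans_p = {bg: c / tag_c[bg[0]] for bg, c in bigram_c.items()}
    emit_p = {wt: c / tag_c[wt[1]] for wt, c in emit_c.items()}
    return trans_p, emit_p
```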
An Example • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • to/TO race/??? vs. the/DT race/??? • ti = argmaxj P(tj|ti-1)P(wi|tj), where i is the position of the word in the sequence and j ranges over the possible tags • max[P(VB|TO)P(race|VB), P(NN|TO)P(race|NN)] • From the Brown corpus: • P(NN|TO) × P(race|NN) = .021 × .00041 ≈ .0000086 • P(VB|TO) × P(race|VB) = .34 × .00003 ≈ .0000102 • So after to, race is tagged VB
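The same comparison as a few lines of Python, plugging in the Brown-corpus figures quoted on the slide:

```python
# Probabilities quoted from the Brown corpus on this slide.
trans_p = {("TO", "NN"): 0.021, ("TO", "VB"): 0.34}
emit_p = {("race", "NN"): 0.00041, ("race", "VB"): 0.00003}

scores = {t: trans_p[("TO", t)] * emit_p[("race", t)] for t in ("NN", "VB")}
print(scores)                       # VB's score (~1.0e-05) beats NN's (~8.6e-06)
print(max(scores, key=scores.get))  # "VB": "to race" is tagged as a verb
```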
Viterbi Algorithm • (The slide shows a trellis diagram over states S1–S5.)
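Moving left to right through the trellis, each cell stores the best score of any tag path ending in that tag at that position, plus a backpointer for recovering the path. A minimal sketch, assuming the trans_p/emit_p tables from the training sketch above (all names illustrative):

```python
def viterbi(words, tags, trans_p, emit_p):
    """Find the most probable tag sequence by dynamic programming."""
    V = [{}]     # V[i][t] = best score of any tag path ending in t at position i
    back = [{}]  # backpointers for recovering the best path
    for t in tags:
        V[0][t] = trans_p.get(("<s>", t), 0.0) * emit_p.get((words[0], t), 0.0)
        back[0][t] = None
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i-1][p] * trans_p.get((p, t), 0.0))
            V[i][t] = (V[i-1][best_prev] * trans_p.get((best_prev, t), 0.0)
                       * emit_p.get((words[i], t), 0.0))
            back[i][t] = best_prev
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```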
Transformation-Based Tagging (Brill Tagging) • Combination of rule-based and stochastic tagging methodologies • Like rule-based tagging because rules are used to specify tags in a certain environment • Like the stochastic approach because machine learning is used, with a tagged corpus as input • Input: • tagged corpus • dictionary (with most frequent tags)
Transformation-Based Tagging (cont.) • Basic Idea: • Set the most probable tag for each word as a start value • Change tags according to rules of the type “if word-1 is a determiner and word is a verb then change the tag to noun”, applied in a specific order • Training is done on a tagged corpus: • 1. Write a set of rule templates • 2. Among the rules the templates generate, find the one with the highest score • 3. Repeat from 2 until the best score falls below some threshold • 4. Keep the ordered set of rules • Rules make errors that are corrected by later rules
TBL Rule Application • The tagger labels every word with its most likely tag • For example, race has the following probabilities in the Brown corpus: • P(NN|race) = .98 • P(VB|race) = .02 • Transformation rules then make changes to tags • “Change NN to VB when the previous tag is TO”: … is/VBZ expected/VBN to/TO race/NN tomorrow/NN becomes … is/VBZ expected/VBN to/TO race/VB tomorrow/NN (see the sketch below)
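A minimal sketch of applying one such transformation to a tagged sentence; the function name and rule encoding are assumptions:

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Apply one Brill-style transformation:
    change from_tag to to_tag when the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
        ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sent, "NN", "VB", "TO"))
# [..., ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```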
TBL: Rule Learning • 2 parts to a rule: • Triggering environment • Rewrite rule • The triggering environments of the templates examine the tags in a window from ti-3 to ti+3 around the current tag ti; each of the nine schemas tests a different position or combination of positions in this window (table from Manning & Schütze 1999:363)
TBL: The Tagging Algorithm • Step 1: Label every word with its most likely tag (from the dictionary) • Step 2: Check every possible transformation and select the one that most improves the tagging • Step 3: Re-tag the corpus applying the rules • Repeat 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus • RESULT: an ordered sequence of transformation rules (a sketch of this loop follows)
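A sketch of this greedy loop, assuming candidate rules can be applied to the whole tag sequence and scored against the gold tags; re-scoring every candidate each round is part of why naive TBL training is slow. All names are assumptions:

```python
def tbl_learn(initial_tags, gold_tags, candidates, apply_rule, min_gain=1):
    """Greedy TBL training: repeatedly adopt the transformation that most
    reduces errors against the gold standard, until no rule helps enough."""
    current, learned = list(initial_tags), []
    while True:
        def gain(rule):
            after = apply_rule(current, rule)
            return (sum(a != g for a, g in zip(current, gold_tags))
                    - sum(a != g for a, g in zip(after, gold_tags)))
        best = max(candidates, key=gain)
        if gain(best) < min_gain:
            break
        current = apply_rule(current, best)
        learned.append(best)          # rule order matters at tagging time
    return learned
```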
TBL: Rule Learning (cont.) • Problem: Could apply transformations ad infinitum! • Constrain the set of transformations with “templates”: • Replace tag X with tag Y, provided tag Z or word Z’ appears in some position • Rules are learned in ordered sequence • Rules may interact. • Rules are compact and can be inspected by humans
Templates for TBL
TBL: Problems • The first 100 rules achieve 96.8% accuracy; the first 200 rules achieve 97.0% accuracy • Execution speed: the TBL tagger is slower than the HMM approach • Learning speed: Brill’s implementation took over a day (600k tokens) • BUT … (1) it learns a small number of simple, non-stochastic rules, (2) it can be made to work faster with FSTs, and (3) it is the best performing algorithm on unknown words
Tagging Unknown Words • New words are added to (newspaper) language at a rate of 20+ per month • Plus many proper names … • Unknown words increase error rates by 1-2% • Method 1: assume they are nouns • Method 2: assume unknown words have a probability distribution similar to words occurring only once in the training set • Method 3: use morphological information, e.g., words ending in -ed tend to be tagged VBN (a toy sketch follows)
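A toy sketch combining Methods 1 and 3; the suffix rules are illustrative guesses, not a tuned system:

```python
def guess_unknown_tag(word):
    """Heuristic POS guess for an out-of-vocabulary word using morphology
    (Method 3), falling back to a noun default (Method 1)."""
    if word[:1].isupper():
        return "NNP"          # likely a proper name
    if word.endswith("ed"):
        return "VBN"          # e.g., "rehydrated"
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("s"):
        return "NNS"
    return "NN"               # Method 1: default to noun
```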
Using Morphological Information
Evaluation • The result is compared with a manually annotated “Gold Standard” • Typically accuracy reaches 96-97% • This may be compared with the result for a baseline tagger (one that uses no context) • Important: 100% agreement is impossible even for human annotators
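Accuracy against the gold standard is simply the fraction of tokens whose predicted tag matches the manual annotation; a minimal sketch:

```python
def accuracy(predicted_tags, gold_tags):
    """Per-token tagging accuracy against a manually annotated gold standard."""
    assert len(predicted_tags) == len(gold_tags)
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)
```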