80 likes | 223 Views
Part of Speech Tagging with MaxEnt Re-ranked Hidden Markov Model. Brian Highfill. Part of Speech Tagging. Train a model on a set of hand-tagged sentences Find best sequence of POS tags for new sentence Generative Models Hidden Markov Model HMM Discriminative Models
E N D
Part of Speech Tagging with MaxEnt Re-ranked Hidden Markov Model Brian Highfill
Part of Speech Tagging • Train a model on a set of hand-tagged sentences • Find best sequence of POS tags for new sentence • Generative Models • Hidden Markov Model HMM • Discriminative Models • Maximum Entropy Markov Model (MEMM) • Brown Corpus • ~57,000 tagged sentences • 87 tags (reduced to 45 for Penn TreeBank tagging) • ~300 tags including compound tags • that_DT fire's_NN+BEZ too_QL big_JJ ._. • “fire’s” = fire_NN is_BEZ
Hidden Markov Models • Set of hidden states (POS tags) • Set of observations (word tokens) • Dependents ONLY on current tag • HMM parameters • Transition probabilities : P(ti|t0…ti) = P(ti|ti-1) • Observation probabilities: P(wi|t0…tn,w0…wn) = P(wi|ti) • Initial tag distribution: P(t0)
HMM Best Tag Sequence • For HMM, the Viterbi algorithm finds the most probable tagging for a new sentence • For re-ranking later, we want not the best tagging but the k best tagging for each sentence
HMM Beam Search • Step1 • Enumerate all possible tags for the first word • Step 2 • Evaluate each tagging using trained HMM keep only the best k (first word sentence taggings) • Step 3 • For each of the k taggings of the previous step, enumerate all possible tags for the second word • Step 4 • Evaluate each two-word sentence tagging and discard all the k best. • Repeat for all words in the sentence
MaxEnt Re-ranking • After beam search, we have the k “best” taggings for our sentence • Use trained MaxEnt model to select most probable sequence of tags
Results • Feature • Current word • Previous tag • <Word AND previous Tag> • Word contains a numeral • “-ing” • “-ness” • “-ity” • “-ed” • “-able” • “-s” • “-ion” • “-al” • “-ive” • “-ly” • Word is capitalized • Word is hyphenated • Word is all uppercase • Word is all uppercase with a numeral • Word is capitalized and a word ending in “Co.” or “Inc.” is found within 3 words ahead
Results • Baseline “Most frequent class tagger”: 73.41% (24%) • HMM Viterbi tagger: 92.96% (32.76% on <UNK>)