Ling 570 Day 6: HMM POS Taggers
Overview • Open Questions • HMM POS Tagging • Review Viterbi algorithm • Training and Smoothing • HMM Implementation Details
HMM Tagger P(t_i | t_{i-1}, …, t_{i-n}): • How likely is this tag given the n previous tags? • Often we use just one previous tag • Can model with a tag-tag matrix
HMM Tagger P(w_i | t_i): • The probability of the word given the tag (not vice versa!) • We model this with a word-tag matrix
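To make the two matrices concrete, here is a minimal Python sketch of both tables as nested dictionaries. The TO/VB/NN numbers for "race" are the corpus figures quoted later in these slides; the DT row is an invented placeholder, not a corpus estimate.

    # Transition (tag-tag) matrix: P(t_i | t_{i-1})
    # Emission (word-tag) matrix:  P(w_i | t_i)
    transition = {
        "TO": {"VB": 0.34, "NN": 0.021},   # P(next tag | TO), from the slide below
        "DT": {"NN": 0.49, "JJ": 0.10},    # invented numbers, for illustration only
    }
    emission = {
        "VB": {"race": 0.00003},           # P(race | VB)
        "NN": {"race": 0.00041},           # P(race | NN)
    }

    # Probability of tagging "race" as VB right after TO:
    p = transition["TO"]["VB"] * emission["VB"]["race"]
    print(p)  # about 1.02e-05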
HMM Tagger: Why P(w|t) and not P(t|w)? • Take the following examples (from J&M): • Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN
HMM Tagger Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN • Maximize P(t_i | t_{i-1}) x P(w_i | t_i) over the tag for race • We can choose between • Pr(VB|TO) x Pr(race|VB) x Pr(NN|VB) • Pr(NN|TO) x Pr(race|NN) x Pr(NN|NN)
The good HMM Tagger • From the Brown/Switchboard corpus: • P(VB|TO) = .34 • P(NN|TO) = .021 • P(race|VB) = .00003 • P(race|NN) = .00041 • P(VB|TO) x P(race|VB) = .34 x .00003 = .00001 • P(NN|TO) x P(race|NN) = .021 x .00041 = .000007 • So in example (a), TO followed by VB is more probable in the context of race ('race' itself really has no effect here).
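The comparison can be reproduced directly from the corpus figures on this slide; a quick sanity check (the conclusion only depends on which product is larger):

    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
    print("VB wins" if p_vb > p_nn else "NN wins")  # VB wins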
HMM Philosophy • Imagine: the author, when creating this sentence, also had in mind the parts-of-speech of each of these words. • After the fact, we’re now trying to recover those parts of speech. • They’re the hidden part of the Markov model.
What happens when we do it the wrong way? • Invert word and tag, i.e. use P(t|w) instead of P(w|t): • P(VB|race) = .02 • P(NN|race) = .98 • The .98 would drown out virtually any other probability: we'd always tag race with NN! • Also, it would predict every tag twice (once from the previous tag and once from the word): • This is not a well-formed model!
N-gram POS tagging
N-gram model: P(t_1 … t_T, w_1 … w_T) ≈ ∏_i P(t_i | t_{i-1}, …, t_{i-n+1}) · P(w_i | t_i)
• Predict current tag conditioned on prior n-1 tags
• Predict word conditioned on current tag
Bigram model: ∏_i P(t_i | t_{i-1}) · P(w_i | t_i)
Trigram model: ∏_i P(t_i | t_{i-2}, t_{i-1}) · P(w_i | t_i)
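A minimal Python sketch of scoring one tagged sentence under the bigram factorization above. The trans/emit dictionaries and the <s> start symbol are illustrative assumptions; the entries are assumed to be nonzero.

    import math

    def score_bigram(words, tags, trans, emit, start="<s>"):
        """Log-probability of a tagged sentence under the bigram factorization:
        sum_i  log P(t_i | t_{i-1}) + log P(w_i | t_i)."""
        logp = 0.0
        prev = start
        for w, t in zip(words, tags):
            logp += math.log(trans[prev][t]) + math.log(emit[t][w])
            prev = t
        return logp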
HMM bigram tagger • Consists of • States: POS tags • Observations: words in the vocabulary • Transitions: a_ij = P(t_j | t_i) • Emissions: b_j(w) = P(w | t_j) • Initial distribution: π_j = P(t_j at the start of the sentence)
HMM trigram tagger • Consists of • States: pairs of tags (t', t) • Observations: still words in the vocabulary • Transition probabilities: a_{(t'',t'),(t',t)} = P(t | t'', t'), where the previous state's second tag must match the current state's first tag • Emissions: b_{(t',t)}(w) = P(w | t), i.e. the word depends only on the second tag of the pair • Initial distribution: over tag pairs at the start of the sentence
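One way to see this reduction: a trigram tagger can be run as an ordinary bigram-style HMM whose states are tag pairs. A small sketch of building that state space (the tag list and function names are illustrative):

    from itertools import product

    def pair_states(tags):
        """States of the trigram tagger: all ordered pairs of tags."""
        return list(product(tags, tags))

    def compatible(prev_state, state):
        """Transition (t'', t') -> (t', t) is allowed only if the shared tag t' matches."""
        return prev_state[1] == state[0]

    tags = ["DT", "NN", "VB"]
    print(len(pair_states(tags)))                   # 9 pair states
    print(compatible(("DT", "NN"), ("NN", "VB")))   # True
    print(compatible(("DT", "NN"), ("VB", "NN")))   # False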
Training • An HMM needs to be trained on the following: • The initial state probabilities • The state transition probabilities (the tag-tag matrix) • The emission probabilities (the tag-word matrix)
Implementation • Once trained, the model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags:
argmax over t_1 … t_T of ∏_i P(t_i | t_{i-1}) · P(w_i | t_i)
where P(t_i | t_{i-1}) is the transition distribution and P(w_i | t_i) is the emission distribution • We use the Viterbi algorithm
Consider two examples • Mariners/N hit/V a/DT home/N run/N • Mariners/N hit/N made/V the/DT news/N
Parameters • As probabilities, they get very small • As log probabilities, they won’t underflow… • …and we can just add them
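A quick illustration of why log space helps; the 100 identical probabilities are purely illustrative:

    import math

    probs = [1e-5] * 100                         # many small probabilities
    direct = 1.0
    for p in probs:
        direct *= p                              # underflows to 0.0 in floating point
    log_sum = sum(math.log(p) for p in probs)    # stays representable
    print(direct)    # 0.0
    print(log_sum)   # about -1151.3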
Viterbi • Initialization: v_1(j) = π_j · b_j(o_1), bt_1(j) = 0 • Recursion: v_t(j) = max_i v_{t-1}(i) · a_ij · b_j(o_t), bt_t(j) = argmax_i v_{t-1}(i) · a_ij • Termination: best = argmax_i v_T(i); read the best tag sequence off the backpointers bt
Pseudocode
function Viterbi(observations o_1 … o_T, states 1 … N)
  v  ← N × T matrix of probabilities
  bt ← N × T matrix of backpointers
  for each state j:                       // initialize
    v[j,1] ← π_j · b_j(o_1);  bt[j,1] ← 0
  for each time t = 2 … T:                // update
    for each state j:
      v[j,t]  ← max_i v[i,t-1] · a_ij · b_j(o_t)
      bt[j,t] ← argmax_i v[i,t-1] · a_ij
  best ← argmax_i v[i,T]                  // max final
  return RecoverBestSequence(bt, best, T)
Pseudocode
function RecoverBestSequence(bt, best, T)
  path = array()
  path.add(best)
  q ← best;  t ← T
  while (t > 1)
    q ← bt[q, t]
    path.add(q)
    t ← t − 1
  return reverse(path)
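A runnable Python sketch of the same algorithm, working in log space as suggested above. The dictionary tables init, trans, and emit are assumptions for illustration, and the 1e-10 floor for unseen words is a crude stand-in for the smoothing discussed later, not part of the algorithm itself.

    import math

    def viterbi(words, tags, init, trans, emit):
        """Most probable tag sequence for `words`.
        init[t], trans[t_prev][t], emit[t][w] are probabilities (assumed > 0)."""
        T = len(words)
        v = [{} for _ in range(T)]    # v[i][t]: best log-prob of a path ending in tag t at i
        bt = [{} for _ in range(T)]   # bt[i][t]: best previous tag

        for t in tags:                # initialize
            v[0][t] = math.log(init[t]) + math.log(emit[t].get(words[0], 1e-10))
            bt[0][t] = None
        for i in range(1, T):         # update
            for t in tags:
                best_prev, best_score = None, float("-inf")
                for tp in tags:
                    score = v[i - 1][tp] + math.log(trans[tp][t])
                    if score > best_score:
                        best_prev, best_score = tp, score
                v[i][t] = best_score + math.log(emit[t].get(words[i], 1e-10))
                bt[i][t] = best_prev

        last = max(tags, key=lambda t: v[T - 1][t])   # max final
        path = [last]                                 # recover best sequence
        for i in range(T - 1, 0, -1):
            path.append(bt[i][path[-1]])
        return list(reversed(path))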
Training • Maximum Likelihood estimates for POS tagging:
P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
P(w_i | t_i) = C(t_i, w_i) / C(t_i)
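A minimal sketch of collecting those counts from a tagged corpus. The corpus format (a list of sentences, each a list of (word, tag) pairs) and the <s> start symbol are assumptions.

    from collections import defaultdict

    def mle_estimates(tagged_sentences, start="<s>"):
        """MLE estimates: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}),
        P(w | t) = C(t, w) / C(t)."""
        tag_bigram = defaultdict(lambda: defaultdict(int))
        tag_word = defaultdict(lambda: defaultdict(int))
        for sent in tagged_sentences:
            prev = start
            for word, tag in sent:
                tag_bigram[prev][tag] += 1
                tag_word[tag][word] += 1
                prev = tag
        trans = {p: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                 for p, nxt in tag_bigram.items()}
        emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                for t, ws in tag_word.items()}
        return trans, emit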
Why Smoothing? • Zero counts • Handle missing tag sequences: • Smooth transition probabilities • Handle unseen words: • Smooth observation probabilities • Handle unseen (word,tag) pairs where both are known
Smoothing Tag Sequences • Haven't seen the tag bigram (t_{i-1}, t_i) in training • How can we estimate P(t_i | t_{i-1})? • Add some fake counts!
• MLE estimate: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
• Add-one smoothing: P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + 1) / (C(t_{i-1}) + N)
• What is N if we want a normalized distribution? N is the number of tags; then the distribution still sums to 1.
• In general this is not a good way to smooth, but it's enough to get you by for your next assignment.
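A sketch of add-one smoothing applied to the transition counts; the count-table layout matches the hypothetical mle_estimates sketch above.

    def add_one_transitions(tag_bigram_counts, tagset):
        """P(t | t_prev) = (C(t_prev, t) + 1) / (C(t_prev) + |tagset|)."""
        trans = {}
        for prev in tagset:
            row = tag_bigram_counts.get(prev, {})
            total = sum(row.values())
            trans[prev] = {t: (row.get(t, 0) + 1) / (total + len(tagset))
                           for t in tagset}
        return trans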
Smoothing Emission Probabilities • What about unseen words? • Add-one doesn't work so well here • We still need an estimate of P(w|t) for words never seen in training • Problems: • We don't know how many words there are – the vocabulary is potentially unbounded! • Add-one adds the same amount of mass for all categories • What categories are likely for an unknown word? • Most likely: Noun, Verb • Least likely: Determiner, Interjection • Use evidence from words that occur once to model unseen words
Smoothing Emission Probabilities • Preprocessing the training corpus: • Count occurrences of all words • Replace singleton words with the magic token <UNK> • Gather counts on the modified data, estimate parameters • Preprocessing the test set: • For each test-set word: • If seen at least twice in the training set, leave it alone • Otherwise replace with <UNK> • Run Viterbi on this modified input
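A sketch of that preprocessing step. The <UNK> token is the one named on the slide; the function names and corpus format are illustrative.

    from collections import Counter

    UNK = "<UNK>"

    def replace_singletons(train_sentences):
        """Replace words seen only once in training with <UNK>."""
        counts = Counter(w for sent in train_sentences for w in sent)
        vocab = {w for w, c in counts.items() if c >= 2}
        train = [[w if w in vocab else UNK for w in sent] for sent in train_sentences]
        return train, vocab

    def map_test(sentence, vocab):
        """Map test words outside the (count >= 2) vocabulary to <UNK> before Viterbi."""
        return [w if w in vocab else UNK for w in sentence]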
Unknown Words • Is there other information we could use for P(w|t)? • Information in the words themselves? • Morphology: • -able: JJ • -tion: NN • -ly: RB • Case: John → NP, etc. • Augment models: • Add to 'context' of tags • Include as features in classifier models • We'll come back to this idea!
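A toy sketch of turning those cues into tag preferences for an unknown word. Only the suffix/case cues come from the slide; the specific numbers are invented placeholders, not trained weights.

    def unknown_word_tag_prior(word):
        """Rough tag preferences for an unseen word, based on the cues above."""
        if word[:1].isupper():
            return {"NP": 0.8, "NN": 0.2}          # capitalized: likely a proper noun
        if word.endswith("able"):
            return {"JJ": 0.8, "NN": 0.2}
        if word.endswith("tion"):
            return {"NN": 0.9, "VB": 0.1}
        if word.endswith("ly"):
            return {"RB": 0.9, "JJ": 0.1}
        return {"NN": 0.5, "VB": 0.3, "JJ": 0.2}   # default guess: open-class tags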
HMM Implementation: Storing an HMM • Approach #1: Hash table (direct): • Store π_i, a_ij, and b_j(w) in hash tables keyed directly by state (and word)
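For instance, a direct hash-table (dictionary) layout might look like the sketch below. The key scheme is an assumption for illustration; the race emission numbers are from the earlier slide, while the pi and DT rows are invented.

    # Direct hash-table storage for an HMM, keyed by tag (and word).
    hmm = {
        "pi":    {"DT": 0.30, "NN": 0.25, "VB": 0.05},               # initial: pi[tag]
        "trans": {"DT": {"NN": 0.49, "JJ": 0.10}},                   # transition: trans[prev][tag]
        "emit":  {"NN": {"race": 0.00041}, "VB": {"race": 0.00003}}, # emission: emit[tag][word]
    }

    # Lookup is a constant-time hash access:
    p = hmm["trans"]["DT"]["NN"] * hmm["emit"]["NN"]["race"]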