
Ling 570 Day 6: HMM POS Taggers



  1. Ling 570 Day 6: HMM POS Taggers

  2. Overview • Open Questions • HMM POS Tagging • Review Viterbi algorithm • Training and Smoothing • HMM Implementation Details

  3. HMM POS Tagging

  4. HMM Tagger P(t_i | t_{i-n}, …, t_{i-1}): • How likely is this tag given the n previous tags? • Often we use just one previous tag: P(t_i | t_{i-1}) • Can model with a tag-tag matrix

  5. HMM Tagger P(w_i | t_i): • The probability of the word given a tag (not vice versa!) • We model this with a word-tag matrix

  6. HMM Tagger Why P(w_i | t_i) and not P(t_i | w_i)? • Take the following examples (from J&M): • Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN

  7. HMM Tagger Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN • Maximize Pr(t|TO) x Pr(race|t) x Pr(NN|t) over the candidate tag t for race • We can choose between • Pr(VB|TO) x Pr(race|VB) x Pr(NN|VB) • Pr(NN|TO) x Pr(race|NN) x Pr(NN|NN)

  8. The good HMM Tagger • From the Brown/Switchboard corpus: • P(VB|TO) = .34 • P(NN|TO) = .021 • P(race|VB) = .00003 • P(race|NN) = .00041 • P(VB|TO) x P(race|VB) = .34 x .00003 = .00001 • P(NN|TO) x P(race|NN) = .021 x .00041 = .000007 • → TO followed by VB in the context of ‘race’ is more probable (‘race’ really has no effect here).
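
A quick way to reproduce that comparison is to multiply the quoted estimates directly. A minimal Python sketch, using only the two factors shown on this slide (the variable names are illustrative):

    # Compare the two candidate taggings of "race" after "to/TO",
    # using the corpus estimates quoted above.
    p_vb_given_to = 0.34        # P(VB | TO)
    p_nn_given_to = 0.021       # P(NN | TO)
    p_race_given_vb = 0.00003   # P(race | VB)
    p_race_given_nn = 0.00041   # P(race | NN)

    score_vb = p_vb_given_to * p_race_given_vb
    score_nn = p_nn_given_to * p_race_given_nn
    best = "VB" if score_vb > score_nn else "NN"
    print(best)   # VB: the transition term dominates the comparison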

  9. HMM Philosophy • Imagine: the author, when creating this sentence, also had in mind the parts-of-speech of each of these words. • After the fact, we’re now trying to recover those parts of speech. • They’re the hidden part of the Markov model.

  10. What happens when we do it the wrong way? • Invert word and tag, P(t|w) instead of P(w|t): • P(VB|race) = .02 • P(NN|race) = .98 • The second probability (.98) would drown out virtually any other probability! We’d always tag race with NN!

  11. What happens when we do it the wrong way? • Invert word and tag, P(t|w) instead of P(w|t): • P(VB|race) = .02 • P(NN|race) = .98 • The second probability (.98) would drown out virtually any other probability! We’d always tag race with NN! • Also, it would double-predict every tag: each tag would be generated once from the previous tag and again from the word • This is not a well-formed model!

  12. N-gram POS tagging • N-gram model: maximize Π_i P(t_i | t_{i-n+1}, …, t_{i-1}) P(w_i | t_i) over the tag sequence

  13. N-gram POS tagging • N-gram model: maximize Π_i P(t_i | t_{i-n+1}, …, t_{i-1}) P(w_i | t_i) • Predict current tag conditioned on prior n-1 tags

  14. N-gram POS tagging • N-gram model: maximize Π_i P(t_i | t_{i-n+1}, …, t_{i-1}) P(w_i | t_i) • Predict current tag conditioned on prior n-1 tags • Predict word conditioned on current tag

  15. N-gram POS tagging • N-gram model: Π_i P(t_i | t_{i-n+1}, …, t_{i-1}) P(w_i | t_i) • Bigram model: Π_i P(t_i | t_{i-1}) P(w_i | t_i)

  16. N-gram POS tagging • N-gram model: Π_i P(t_i | t_{i-n+1}, …, t_{i-1}) P(w_i | t_i) • Trigram model: Π_i P(t_i | t_{i-2}, t_{i-1}) P(w_i | t_i)

  17. HMM bigram tagger • Consists of • States: POS tags • Observations: words in the vocabulary • Transitions: P(t_i | t_{i-1}) (the tag-tag matrix) • Emissions: P(w_i | t_i) (the word-tag matrix) • Initial distribution: π_i = P(t_i | <s>), the probability that a sentence starts with tag t_i

  18. HMM trigram tagger • Consists of • States: pairs of tags (t_{i-1}, t_i) • Observations: still words in the vocabulary • Transition probabilities: P((t_{i-1}, t_i) | (t_{i-2}, t_{i-1})) = P(t_i | t_{i-2}, t_{i-1}), where the two pairs must agree on the shared tag t_{i-1} • Emissions: P(w_i | (t_{i-1}, t_i)) = P(w_i | t_i), i.e. the word depends only on the current tag • Initial distribution over the starting pairs (<s>, t_1)
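
One way to read this slide: a trigram tagger is just a bigram HMM whose states are tag pairs. A minimal sketch of that state-space construction, assuming trigram tag probabilities and word-given-tag probabilities are available as dictionaries (the names p_trigram and p_word_given_tag are illustrative, not from the slides):

    from itertools import product

    def make_pair_states(tags):
        """States of the trigram tagger: all ordered pairs of tags."""
        return list(product(tags, repeat=2))

    def pair_transition(p_trigram, prev_state, state):
        """P((t1,t2) -> (t2,t3)) = P(t3 | t1,t2); zero if the pairs don't chain."""
        (t1, t2), (t2b, t3) = prev_state, state
        if t2 != t2b:
            return 0.0
        return p_trigram.get((t1, t2, t3), 0.0)

    def pair_emission(p_word_given_tag, state, word):
        """Emission depends only on the current (second) tag of the pair."""
        _, t = state
        return p_word_given_tag.get((t, word), 0.0)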

  19. Training • An HMM needs to be trained on the following: • The initial state probabilities • The state transition probabilities • The tag-tag matrix • The emission probabilities • The tag-word matrix

  20. Implementation • Once trained, model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags • We use the Viterbi algorithm

  21. Implementation • Once trained, model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags • We use the Viterbi algorithm to maximize Π_i P(t_i | t_{i-1}) P(w_i | t_i), where P(t_i | t_{i-1}) is the transition distribution

  22. Implementation • Once trained, model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags • We use the Viterbi algorithm to maximize Π_i P(t_i | t_{i-1}) P(w_i | t_i), where P(w_i | t_i) is the emission distribution

  23. Implementation • Once trained, model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags • We use the Viterbi algorithm

  24. Implementation • Once trained, model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags • We use the Viterbi algorithm

  25. Review Viterbi Algorithm

  26. Consider two examples • Mariners hit a home run • Mariners hit made the news

  27. Consider two examples • Mariners/N hit/V a/DT home/N run/N • Mariners/N hit/N made/V the/DT news/N

  28. Parameters • As probabilities, they get very small

  29. Parameters • As probabilities, they get very small • As log probabilities, they won’t underflow… • …and we can just add them
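
A small illustration of the underflow point, assuming (for illustration only) a constant per-step probability of 0.0001 over 100 tagging steps:

    import math

    p = 0.0001                    # a typical small per-step probability
    prob = 1.0
    logprob = 0.0
    for _ in range(100):
        prob *= p                 # eventually underflows to exactly 0.0
        logprob += math.log(p)    # stays a perfectly ordinary float

    print(prob)      # 0.0
    print(logprob)   # about -921.03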

  30. Viterbi • Initialization: v_1(j) = π_j · b_j(o_1) • Recursion: v_t(j) = max_i [ v_{t-1}(i) · a_ij ] · b_j(o_t), with backpointer bt_t(j) = argmax_i v_{t-1}(i) · a_ij • Termination: best final state = argmax_i v_T(i), with score max_i v_T(i)

  31. Pseudocode
      function Viterbi(observations of length T, states 1..N)
        v  = matrix of size N x T        // best path scores
        bt = matrix of size N x T        // backpointers
        for each state j                 // initialize
          v[j, 1]  = π_j * b_j(o_1)
          bt[j, 1] = 0
        for each time t = 2 .. T         // update
          for each state j
            v[j, t]  = max over i of  v[i, t-1] * a_ij * b_j(o_t)
            bt[j, t] = argmax over i of  v[i, t-1] * a_ij
        best = argmax over i of v[i, T]  // max final
        return RecoverBestSequence(bt, best, T)

  32. Pseudocode
      function RecoverBestSequence(bt, best, T)
        path = array()
        path.add(best)
        for t = T down to 2
          best = bt[best, t]
          path.add(best)
        return reverse(path)
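
A runnable rendering of the pseudocode above, in log space as motivated on slide 29. The dictionary-shaped parameters (pi, trans, emit) are an assumption of this sketch, not something specified in the slides:

    import math

    NEG_INF = float("-inf")

    def viterbi(observations, states, pi, trans, emit):
        """pi[s], trans[prev][s], emit[s][word] are probabilities; missing = 0."""
        def logp(x):
            return math.log(x) if x > 0 else NEG_INF

        if not observations:
            return []
        T = len(observations)
        v = [{} for _ in range(T)]    # v[t][s]: best log score of a path ending in s at t
        bt = [{} for _ in range(T)]   # bt[t][s]: best previous state

        for s in states:              # initialization
            v[0][s] = logp(pi.get(s, 0.0)) + logp(emit.get(s, {}).get(observations[0], 0.0))
            bt[0][s] = None

        for t in range(1, T):         # recursion
            for s in states:
                best_prev = max(states,
                                key=lambda r: v[t - 1][r] + logp(trans.get(r, {}).get(s, 0.0)))
                v[t][s] = (v[t - 1][best_prev]
                           + logp(trans.get(best_prev, {}).get(s, 0.0))
                           + logp(emit.get(s, {}).get(observations[t], 0.0)))
                bt[t][s] = best_prev

        last = max(states, key=lambda s: v[T - 1][s])   # termination
        path = [last]
        for t in range(T - 1, 0, -1):                   # follow backpointers
            path.append(bt[t][path[-1]])
        return list(reversed(path))

Given tables estimated as on the training slides below, viterbi("Mariners hit a home run".split(), tags, pi, trans, emit) would return one tag per word.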

  33. Smoothing

  34. Training • Maximum Likelihood estimates for POS tagging: • P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) • P(w_i | t_i) = C(t_i, w_i) / C(t_i)
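
A sketch of how those counts can be collected, assuming the training data is a hypothetical list of sentences, each a list of (word, tag) pairs:

    from collections import Counter

    def mle_estimates(tagged_sentences):
        """Relative-frequency estimates of P(t_i | t_{i-1}) and P(w_i | t_i)."""
        bigram = Counter()    # C(t_{i-1}, t_i)
        context = Counter()   # C(t_{i-1}) used as a conditioning context
        word_tag = Counter()  # C(t_i, w_i)
        tag = Counter()       # C(t_i)

        for sent in tagged_sentences:
            prev = "<s>"                      # sentence-initial pseudo-tag
            for word, t in sent:
                bigram[(prev, t)] += 1
                context[prev] += 1
                word_tag[(t, word)] += 1
                tag[t] += 1
                prev = t

        trans = {(p, t): c / context[p] for (p, t), c in bigram.items()}
        emit = {(t, w): c / tag[t] for (t, w), c in word_tag.items()}
        return trans, emit

    # trans[("TO", "VB")] -> P(VB | TO);  emit[("VB", "race")] -> P(race | VB)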

  35. Why Smoothing? • Zero counts

  36. Why Smoothing? • Zero counts • Handle missing tag sequences: • Smooth transition probabilities

  37. Why Smoothing? • Zero counts • Handle missing tag sequences: • Smooth transition probabilities • Handle unseen words: • Smooth observation probabilities

  38. Why Smoothing? • Zero counts • Handle missing tag sequences: • Smooth transition probabilities • Handle unseen words: • Smooth observation probabilities • Handle unseen (word,tag) pairs where both are known

  39. Smoothing Tag Sequences • Haven’t seen the tag bigram (t_{i-1}, t_i) in training, so C(t_{i-1}, t_i) = 0 • How can we estimate P(t_i | t_{i-1})?

  40. Smoothing Tag Sequences • Haven’t seen the tag bigram (t_{i-1}, t_i) in training, so C(t_{i-1}, t_i) = 0 • How can we estimate P(t_i | t_{i-1})? • Add some fake counts! • MLE estimate: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

  41. Smoothing Tag Sequences • Haven’t seen the tag bigram (t_{i-1}, t_i) in training, so C(t_{i-1}, t_i) = 0 • How can we estimate P(t_i | t_{i-1})? • Add some fake counts! • Add-one smoothing: P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + 1) / (C(t_{i-1}) + ???) • What is ??? if we want a normalized distribution?

  42. Smoothing Tag Sequences • Haven’t seen the tag bigram (t_{i-1}, t_i) in training, so C(t_{i-1}, t_i) = 0 • How can we estimate P(t_i | t_{i-1})? • Add some fake counts! • Add-one smoothing: P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + 1) / (C(t_{i-1}) + ???) • ??? is the number of tags – then it still sums to 1. • In general this is not a good way to smooth, but it’s enough to get you by for your next assignment.
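
A minimal sketch of the add-one estimate, assuming the bigram and context Counters from the training sketch above and a known tag set:

    def add_one_transition(prev_tag, t, bigram, context, tagset):
        """Add-one smoothed P(t | prev_tag); missing counts default to zero."""
        return (bigram[(prev_tag, t)] + 1) / (context[prev_tag] + len(tagset))

    # For any prev_tag, summing this over every t in the tag set gives exactly 1.0,
    # which is why the denominator adds the number of tags.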

  43. Smoothing Emission Probabilities • What about unseen words? • Add one doesn’t work so well here • We would need P(w_i | t_i) = (C(t_i, w_i) + 1) / (C(t_i) + |V|) • Problems: • We don’t know how many words there are – potentially unbounded! • This adds the same amount of mass for all categories • What categories are likely for an unknown word? • Most likely: Noun, Verb • Least likely: Determiner, Interjection

  44. Smoothing Emission Probabilities • What about unseen words? • Add one doesn’t work so well here • We would need P(w_i | t_i) = (C(t_i, w_i) + 1) / (C(t_i) + |V|) • Problems: • We don’t know how many words there are – potentially unbounded! • This adds the same amount of mass for all categories • What categories are likely for an unknown word? • Most likely: Noun, Verb • Least likely: Determiner, Interjection • Use evidence from words that occur once to model unseen words

  45. Smoothing Emission Probabilities • Preprocessing the training corpus: • Count occurrences of all words • Replace singleton words with the magic token <UNK> • Gather counts on modified data, estimate parameters • Preprocessing the test set: • For each test set word • If seen at least twice in training set, leave it alone • Otherwise replace with <UNK> • Run Viterbi on this modified input
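
A sketch of that preprocessing step on the word sequences (tags, where present, would be carried along unchanged); the threshold and the <UNK> token are as described on the slide:

    from collections import Counter

    UNK = "<UNK>"

    def replace_rare_words(train_sents, test_sents):
        """Map training singletons, and unseen test words, to <UNK>."""
        counts = Counter(w for sent in train_sents for w in sent)
        keep = {w for w, c in counts.items() if c >= 2}    # seen at least twice

        train = [[w if w in keep else UNK for w in sent] for sent in train_sents]
        test = [[w if w in keep else UNK for w in sent] for sent in test_sents]
        return train, test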

  46. Unknown Words • Is there other information we could use for P(w|t)? • Information in words themselves? • Morphology: • -able → JJ • -tion → NN • -ly → RB • Case: John → NP, etc. • Augment models • Add to ‘context’ of tags • Include as features in classifier models • We’ll come back to this idea!
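
The morphological and case cues above could be wired in as a crude fallback guess for unknown words; the ordering of the checks and the default tag below are illustrative assumptions, not part of the lecture:

    SUFFIX_HINTS = [("able", "JJ"), ("tion", "NN"), ("ly", "RB")]   # cues from the slide

    def guess_unknown_tag(word):
        """Very rough tag guess for an out-of-vocabulary word."""
        if word[:1].isupper():
            return "NP"                  # capitalized, e.g. John
        for suffix, t in SUFFIX_HINTS:
            if word.endswith(suffix):
                return t
        return "NN"                      # assumed default: nouns dominate unknown words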

  47. HMM Implementation

  48. HMM Implementation: Storing an HMM • Approach #1: • Hash table (direct): • store π_i, a_ij, and b_j(w) in hash tables keyed directly by tags (and words)
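
In Python terms, the "hash table (direct)" approach might look like nested dictionaries keyed by tags and words, with missing entries read as zero; the class and attribute names here are illustrative:

    class HashTableHMM:
        """Direct hash-table (dict) storage of HMM parameters."""

        def __init__(self, pi, trans, emit):
            self.pi = pi        # pi[tag]          = initial probability
            self.trans = trans  # trans[prev][tag] = transition probability
            self.emit = emit    # emit[tag][word]  = emission probability

        def initial(self, tag):
            return self.pi.get(tag, 0.0)

        def transition(self, prev_tag, tag):
            return self.trans.get(prev_tag, {}).get(tag, 0.0)

        def emission(self, tag, word):
            return self.emit.get(tag, {}).get(word, 0.0)

Unseen (tag, word) pairs cost no storage and simply read back as 0.0, which matters because the word-tag matrix is extremely sparse.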
