Ling 570 Day 6: HMM POS Taggers
Overview • Open Questions • HMM POS Tagging • Review Viterbi algorithm • Training and Smoothing • HMM Implementation Details
HMM Tagger P(t_i | t_{i-1}, …, t_{i-n}): • How likely is this tag given the n previous tags? • Often we use just one previous tag • Can model with a tag-tag matrix
HMM Tagger P(w_i | t_i): • The probability of the word given the tag (not vice versa!) • We model this with a word-tag matrix
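To make the two matrices concrete, here is a minimal Python sketch of both tables as nested dictionaries. The TO/VB/NN numbers for "race" are the corpus figures quoted later in these slides; the DT row is an invented placeholder, not a corpus estimate.

    # Transition (tag-tag) matrix: P(t_i | t_{i-1})
    # Emission (word-tag) matrix:  P(w_i | t_i)
    transition = {
        "TO": {"VB": 0.34, "NN": 0.021},   # P(next tag | TO), from the slide below
        "DT": {"NN": 0.49, "JJ": 0.10},    # invented numbers, for illustration only
    }
    emission = {
        "VB": {"race": 0.00003},           # P(race | VB)
        "NN": {"race": 0.00041},           # P(race | NN)
    }

    # Probability of tagging "race" as VB right after TO:
    p = transition["TO"]["VB"] * emission["VB"]["race"]
    print(p)  # about 1.02e-05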
HMM Tagger: Why P(w|t) and not P(t|w)? • Take the following examples (from J&M): • Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN
HMM Tagger Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN • Maximize P(t_i | t_{i-1}) x P(w_i | t_i) over the tag for race • We can choose between • Pr(VB|TO) x Pr(race|VB) x Pr(NN|VB) • Pr(NN|TO) x Pr(race|NN) x Pr(NN|NN)
The good HMM Tagger • From the Brown/Switchboard corpus: • P(VB|TO) = .34 • P(NN|TO) = .021 • P(race|VB) = .00003 • P(race|NN) = .00041 • P(VB|TO) x P(race|VB) = .34 x .00003 = .00001 • P(NN|TO) x P(race|NN) = .021 x .00041 = .000007 • So in example (a), TO followed by VB is more probable in the context of race ('race' itself really has no effect here).
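The comparison can be reproduced directly from the corpus figures on this slide; a quick sanity check (the conclusion only depends on which product is larger):

    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
    print("VB wins" if p_vb > p_nn else "NN wins")  # VB wins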
HMM Philosophy • Imagine: the author, when creating this sentence, also had in mind the parts-of-speech of each of these words. • After the fact, we’re now trying to recover those parts of speech. • They’re the hidden part of the Markov model.
What happens when we do it the wrong way? • Invert word and tag, i.e. use P(t|w) instead of P(w|t): • P(VB|race) = .02 • P(NN|race) = .98 • The .98 would drown out virtually any other probability: we'd always tag race with NN! • Also, it would predict every tag twice (once from the previous tag and once from the word): • This is not a well-formed model!
N-gram POS tagging
N-gram model: P(t_1 … t_T, w_1 … w_T) ≈ ∏_i P(t_i | t_{i-1}, …, t_{i-n+1}) · P(w_i | t_i)
• Predict current tag conditioned on prior n-1 tags
• Predict word conditioned on current tag
Bigram model: ∏_i P(t_i | t_{i-1}) · P(w_i | t_i)
Trigram model: ∏_i P(t_i | t_{i-2}, t_{i-1}) · P(w_i | t_i)
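A minimal Python sketch of scoring one tagged sentence under the bigram factorization above. The trans/emit dictionaries and the <s> start symbol are illustrative assumptions; the entries are assumed to be nonzero.

    import math

    def score_bigram(words, tags, trans, emit, start="<s>"):
        """Log-probability of a tagged sentence under the bigram factorization:
        sum_i  log P(t_i | t_{i-1}) + log P(w_i | t_i)."""
        logp = 0.0
        prev = start
        for w, t in zip(words, tags):
            logp += math.log(trans[prev][t]) + math.log(emit[t][w])
            prev = t
        return logp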
HMM bigram tagger • Consists of • States: POS tags • Observations: words in the vocabulary • Transitions: a_ij = P(t_j | t_i) • Emissions: b_j(w) = P(w | t_j) • Initial distribution: π_j = P(t_j at the start of the sentence)
HMM trigram tagger • Consists of • States: pairs of tags (t', t) • Observations: still words in the vocabulary • Transition probabilities: a_{(t'',t'),(t',t)} = P(t | t'', t'), where the previous state's second tag must match the current state's first tag • Emissions: b_{(t',t)}(w) = P(w | t), i.e. the word depends only on the second tag of the pair • Initial distribution: over tag pairs at the start of the sentence
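One way to see this reduction: a trigram tagger can be run as an ordinary bigram-style HMM whose states are tag pairs. A small sketch of building that state space (the tag list and function names are illustrative):

    from itertools import product

    def pair_states(tags):
        """States of the trigram tagger: all ordered pairs of tags."""
        return list(product(tags, tags))

    def compatible(prev_state, state):
        """Transition (t'', t') -> (t', t) is allowed only if the shared tag t' matches."""
        return prev_state[1] == state[0]

    tags = ["DT", "NN", "VB"]
    print(len(pair_states(tags)))                   # 9 pair states
    print(compatible(("DT", "NN"), ("NN", "VB")))   # True
    print(compatible(("DT", "NN"), ("VB", "NN")))   # False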
Training • An HMM needs to be trained on the following: • The initial state probabilities • The state transition probabilities (the tag-tag matrix) • The emission probabilities (the tag-word matrix)
Implementation • Once trained, the model assigns probabilities to POS-tagged word sequences • To tag a new sentence, we want to find the best sequence of POS tags:
argmax over t_1 … t_T of ∏_i P(t_i | t_{i-1}) · P(w_i | t_i)
where P(t_i | t_{i-1}) is the transition distribution and P(w_i | t_i) is the emission distribution • We use the Viterbi algorithm
Consider two examples • Mariners/N hit/V a/DT home/N run/N • Mariners/N hit/N made/V the/DT news/N
Parameters • As probabilities, they get very small • As log probabilities, they won’t underflow… • …and we can just add them
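A quick illustration of why log space helps; the 100 identical probabilities are purely illustrative:

    import math

    probs = [1e-5] * 100                         # many small probabilities
    direct = 1.0
    for p in probs:
        direct *= p                              # underflows to 0.0 in floating point
    log_sum = sum(math.log(p) for p in probs)    # stays representable
    print(direct)    # 0.0
    print(log_sum)   # about -1151.3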
Viterbi • Initialization: v_1(j) = π_j · b_j(o_1), bt_1(j) = 0 • Recursion: v_t(j) = max_i v_{t-1}(i) · a_ij · b_j(o_t), bt_t(j) = argmax_i v_{t-1}(i) · a_ij • Termination: best = argmax_i v_T(i); read the best tag sequence off the backpointers bt
Pseudocode
function Viterbi(observations o_1 … o_T, states 1 … N)
  v  ← N × T matrix of probabilities
  bt ← N × T matrix of backpointers
  for each state j:                       // initialize
    v[j,1] ← π_j · b_j(o_1);  bt[j,1] ← 0
  for each time t = 2 … T:                // update
    for each state j:
      v[j,t]  ← max_i v[i,t-1] · a_ij · b_j(o_t)
      bt[j,t] ← argmax_i v[i,t-1] · a_ij
  best ← argmax_i v[i,T]                  // max final
  return RecoverBestSequence(bt, best, T)
Pseudocode
function RecoverBestSequence(bt, best, T)
  path = array()
  path.add(best)
  q ← best;  t ← T
  while (t > 1)
    q ← bt[q, t]
    path.add(q)
    t ← t − 1
  return reverse(path)
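A runnable Python sketch of the same algorithm, working in log space as suggested above. The dictionary tables init, trans, and emit are assumptions for illustration, and the 1e-10 floor for unseen words is a crude stand-in for the smoothing discussed later, not part of the algorithm itself.

    import math

    def viterbi(words, tags, init, trans, emit):
        """Most probable tag sequence for `words`.
        init[t], trans[t_prev][t], emit[t][w] are probabilities (assumed > 0)."""
        T = len(words)
        v = [{} for _ in range(T)]    # v[i][t]: best log-prob of a path ending in tag t at i
        bt = [{} for _ in range(T)]   # bt[i][t]: best previous tag

        for t in tags:                # initialize
            v[0][t] = math.log(init[t]) + math.log(emit[t].get(words[0], 1e-10))
            bt[0][t] = None
        for i in range(1, T):         # update
            for t in tags:
                best_prev, best_score = None, float("-inf")
                for tp in tags:
                    score = v[i - 1][tp] + math.log(trans[tp][t])
                    if score > best_score:
                        best_prev, best_score = tp, score
                v[i][t] = best_score + math.log(emit[t].get(words[i], 1e-10))
                bt[i][t] = best_prev

        last = max(tags, key=lambda t: v[T - 1][t])   # max final
        path = [last]                                 # recover best sequence
        for i in range(T - 1, 0, -1):
            path.append(bt[i][path[-1]])
        return list(reversed(path))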
Training • Maximum Likelihood estimates for POS tagging:
P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
P(w_i | t_i) = C(t_i, w_i) / C(t_i)
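A minimal sketch of collecting those counts from a tagged corpus. The corpus format (a list of sentences, each a list of (word, tag) pairs) and the <s> start symbol are assumptions.

    from collections import defaultdict

    def mle_estimates(tagged_sentences, start="<s>"):
        """MLE estimates: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}),
        P(w | t) = C(t, w) / C(t)."""
        tag_bigram = defaultdict(lambda: defaultdict(int))
        tag_word = defaultdict(lambda: defaultdict(int))
        for sent in tagged_sentences:
            prev = start
            for word, tag in sent:
                tag_bigram[prev][tag] += 1
                tag_word[tag][word] += 1
                prev = tag
        trans = {p: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                 for p, nxt in tag_bigram.items()}
        emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                for t, ws in tag_word.items()}
        return trans, emit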
Why Smoothing? • Zero counts • Handle missing tag sequences: • Smooth transition probabilities • Handle unseen words: • Smooth observation probabilities • Handle unseen (word,tag) pairs where both are known
Smoothing Tag Sequences • Haven't seen the tag bigram (t_{i-1}, t_i) in training • How can we estimate P(t_i | t_{i-1})? • Add some fake counts!
• MLE estimate: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
• Add-one smoothing: P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + 1) / (C(t_{i-1}) + N)
• What is N if we want a normalized distribution? N is the number of tags; then the distribution still sums to 1.
• In general this is not a good way to smooth, but it's enough to get you by for your next assignment.
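A sketch of add-one smoothing applied to the transition counts; the count-table layout matches the hypothetical mle_estimates sketch above.

    def add_one_transitions(tag_bigram_counts, tagset):
        """P(t | t_prev) = (C(t_prev, t) + 1) / (C(t_prev) + |tagset|)."""
        trans = {}
        for prev in tagset:
            row = tag_bigram_counts.get(prev, {})
            total = sum(row.values())
            trans[prev] = {t: (row.get(t, 0) + 1) / (total + len(tagset))
                           for t in tagset}
        return trans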
Smoothing Emission Probabilities • What about unseen words? • Add-one doesn't work so well here • We still need an estimate of P(w|t) for words never seen in training • Problems: • We don't know how many words there are – the vocabulary is potentially unbounded! • Add-one adds the same amount of mass for all categories • What categories are likely for an unknown word? • Most likely: Noun, Verb • Least likely: Determiner, Interjection • Use evidence from words that occur once to model unseen words
Smoothing Emission Probabilities • Preprocessing the training corpus: • Count occurrences of all words • Replace singleton words with the magic token <UNK> • Gather counts on the modified data, estimate parameters • Preprocessing the test set: • For each test-set word: • If seen at least twice in the training set, leave it alone • Otherwise replace with <UNK> • Run Viterbi on this modified input
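A sketch of that preprocessing step. The <UNK> token is the one named on the slide; the function names and corpus format are illustrative.

    from collections import Counter

    UNK = "<UNK>"

    def replace_singletons(train_sentences):
        """Replace words seen only once in training with <UNK>."""
        counts = Counter(w for sent in train_sentences for w in sent)
        vocab = {w for w, c in counts.items() if c >= 2}
        train = [[w if w in vocab else UNK for w in sent] for sent in train_sentences]
        return train, vocab

    def map_test(sentence, vocab):
        """Map test words outside the (count >= 2) vocabulary to <UNK> before Viterbi."""
        return [w if w in vocab else UNK for w in sentence]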
Unknown Words • Is there other information we could use for P(w|t)? • Information in the words themselves? • Morphology: • -able: JJ • -tion: NN • -ly: RB • Case: John → NP, etc. • Augment models: • Add to 'context' of tags • Include as features in classifier models • We'll come back to this idea!
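A toy sketch of turning those cues into tag preferences for an unknown word. Only the suffix/case cues come from the slide; the specific numbers are invented placeholders, not trained weights.

    def unknown_word_tag_prior(word):
        """Rough tag preferences for an unseen word, based on the cues above."""
        if word[:1].isupper():
            return {"NP": 0.8, "NN": 0.2}          # capitalized: likely a proper noun
        if word.endswith("able"):
            return {"JJ": 0.8, "NN": 0.2}
        if word.endswith("tion"):
            return {"NN": 0.9, "VB": 0.1}
        if word.endswith("ly"):
            return {"RB": 0.9, "JJ": 0.1}
        return {"NN": 0.5, "VB": 0.3, "JJ": 0.2}   # default guess: open-class tags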
HMM Implementation: Storing an HMM • Approach #1: Hash table (direct): • Store π_i, a_ij, and b_j(w) in hash tables keyed directly by state (and word)
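For instance, a direct hash-table (dictionary) layout might look like the sketch below. The key scheme is an assumption for illustration; the race emission numbers are from the earlier slide, while the pi and DT rows are invented.

    # Direct hash-table storage for an HMM, keyed by tag (and word).
    hmm = {
        "pi":    {"DT": 0.30, "NN": 0.25, "VB": 0.05},               # initial: pi[tag]
        "trans": {"DT": {"NN": 0.49, "JJ": 0.10}},                   # transition: trans[prev][tag]
        "emit":  {"NN": {"race": 0.00041}, "VB": {"race": 0.00003}}, # emission: emit[tag][word]
    }

    # Lookup is a constant-time hash access:
    p = hmm["trans"]["DT"]["NN"] * hmm["emit"]["NN"]["race"]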