300 likes | 460 Views
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License . CS 479, section 1: Natural Language Processing. Lecture # 23: Part of Speech Tagging, Hidden Markov Models (cont.).
E N D
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, section 1:Natural Language Processing Lecture #23: Part of Speech Tagging, Hidden Markov Models (cont.) Thanks to Dan Klein of UC Berkeley for some of the materials used in this lecture.
Announcements • Project #2, Part 2 • Early: today • Due: Monday • Thanks for your ongoing feedback • Coming up: Project #3 • Feature engineering for POS tagging • Reading Report #10on the tagging paper by Toutanova and Manning • Due: Wednesday • Mid-term exam Question #4 • Optional practice exercises can replace q. #4!
Objectives • Use Hidden Markov Models efficiently for POS tagging • Understand the Viterbi algorithm • If time permits, think about using HMM as a language model
Possible Scenarios Training Test / Run-time Unlabeled data* Partially labeled data • Labeled data* • Unlabeled data • Partially labeled data * We’re focusing on straight-forwardsupervised learning for now.
Disambiguation • Tagging is disambiguation: Finding the best tag sequence • Given an HMM (i.e., distributions for transitions and emissions), we can score any word sequence and tag sequence together (jointly): • What is the score? NNP VBZ NN NNS CD NN . STOP Fed raises interest rates 0.5 percent .
Option 1: Enumeration • Exhaustive enumeration: list all possible tag sequences, score each one, pick the best one (the “Viterbi” state sequence) Fed raises interest rates … • What’s wrong with this approach? • What to do about it? logP = -23 NNP VBZ NN NNS CD NN NNP NNS NN NNS CD NN logP = -29 NNP VBZ VB NNS CD NN logP = -27 …
Option 2: Breadth-First Search • Exhaustive tree-structured search • Start with just the single empty tagged sentence • At each derivation step, consider all extensions of previous hypotheses • What’s wrong with this approach? • What to do about it? Fed:NNPraises:NNSinterest:NN Fed:NNPraises:NNS Fed:NNPraises:NNSinterest:VB Fed:NNP Fed:NNP raises:VBZ <> Fed:VBN Fed:VBN raises:NNS Fed:VBD Fed:VBN raises:VBZ Fed raises interest
Possibilities • Best-first search • A* • Close cousin: • Branch and Bound • Also possible: • Dynamic programming – we’ll focus our efforts here for now • Fast, approximate searches
Option 3: Dynamic Programming Recall the main ideas: • Typically: Solve an optimization problem • Devise a minimal description (address) for any problem instance and sub-problem • Divide problems into sub-problems: define the recurrence to specify the relationship of problems to sub-problems • Check that the optimality property holds: An optimal solution to a problem is built from optimal solutions to sub-problems. • Store results – typically in a table – and re-use the solutions to sub-problems in the table as you build up to the overall solution. • Back-trace / analyze the table to extract the composition of the final solution.
Partial Taggings as Paths • Step #2: Devise a minimal description (address) for any problem instance and sub-problem • At address (𝑡, 𝑖): Optimal tagging of tokens 1 through 𝑖, ending in tag 𝑡. • Partial taggings from position 1 through are represented as paths through a trellis of tag states, ending in some tag state. • Each arc connects two tag states and has a weight, which is the combination of: and NNS:2 NN:3 NNP:1 :0 VBN:1 VB:3 VBZ:2 Fed raises interest
The Recurrence • Step #3: Divide problems into sub-problems: define the recurrence to specify the relationship of problems to sub-problems • Score of a best path (tagging) up to position ending in state : • Step-by-step: • Also store a back-pointer:
Key idea for the Viterbi algorithm Optimality Property • Step #4: Check that the optimality property holds: An optimal solution to a problem is built from optimal solutions to sub-problems. • What does the optimality property have to say about computingthe score of the best tagging ending in the highlighted state? • The optimal tagging for is constructed from optimal taggings for NNS:2 NN:3 NNP:1 :0 VBN:1 VB:3 VBZ:2 Fed raises interest
Iterative Algorithm • Step #5: Store results – typically in a table – and re-use the solutions to sub-problems in the table as you build up to the overall solution. • For each state and word position , compute and : For = 1 to do: For each in tag-set do:
Viterbi: DP Table • Step #6: Back-trace / analyze the table to extract the composition of the final solution. … VB NN VBZ NNS NNP VBN Fed raises interest …
Viterbi: DP Table … VB NN VBZ NNS NNP VBN Fed raises interest …
Viterbi: DP Table … VB NN VBZ NNS NNP VBN Fed raises interest …
Viterbi: DP Table Start at max andbacktrace … VB NN VBZ NNS NNP VBN Fed raises interest …
Viterbi: DP Table Start at max andbacktrace … VB NN VBZ Read off the best sequence: NNS NNP VBN Fed raises interest …
Efficiency … VB NN VBZ NNS NNP VBN Fed raises interest …
Implementation Trick • For higher-order (>1) HMMs, we need to facilitate dynamic programming • Allow scores to be computed locally without worrying about earlier structure of the trellis • Define one state for each tuple of tags • E.g., for order 2, = • This corresponds to a constrained 1st order HMM over tag n-grams
Implementation Trick • A constrained 1st order HMM over tag n-grams: • Constraints: s0 s1 s2 sn <,> < , t1> < t1, t2> < tn-1, tn> w1 w2 wn
How Well Does It Work? • Choose the most common tag • 90.3% with a bad unknown word model • 93.7% with a good one! • TnT (Brants, 2000): • A carefully smoothed trigram tagger • 96.7% on WSJ text • State-of-Art is approx. 97.3% • Noise in the data • Many errors in the training and test corpora • Probably about 2% guaranteed error from noise (on this data) JJ JJ NN chief executive officer NN JJ NN chief executive officer JJ NN NN chief executive officer NN NN NN chief executive officer DT NN IN NNVBDNNS VBD The average of interbankofferedrates plummeted …
NNP,NNS:2 NNS,NN:3 ,NNP:1 NNP,VBZ:2 NNS,VB:3 ,:0 VBN,NNS:2 VBZ,NN:3 ,VBN:1 VBN,VBZ:2 VBZ,VB:3 Option 4: Beam Search • Adopt a simple pruning strategy in the Viterbi trellis: • A beam is a set of partial hypotheses: • Keep top k (use a priority queue) • Keep those within a factor of the best • Keep widest variety • or some combination of the above • Is this OK? • Beam search works well in practice • Will give you the optimal answer if the beam is wide enough • … and you need optimal answers to validate your beam search Fed raises interest
HMMs: Basic Problems and Solutions Given an input sequence and an HMM , • Decoding / Tagging: Choose optimal tag sequence for using • Solution: Viterbi Algorithm • Evaluation / Computing the marginal probability: Compute , the marginal probability of using • Solution: Forward Algorithm • Just use a sum instead of max in the Viterbi algorithm! Re-estimation / Unsupervised or Semi-supervised Training: Adjust to maximize , the probability of an entire data set, either entirely or partially unlabeled, respectively • Solution: Baum-Welch algorithm • Also known as the Forward-Backward algorithm • Another example of EM
Speech Recognition Revisited Given an input sequence of acoustic feature vectors and a (composed) HMM, • Decoding / Tagging: Choose optimal word (tag) sequence for feature sequence • Solution: Viterbi Algorithm
Next • Better Tagging Features using Maxent • Dealing with unknown words • Adjacent words • Longer-distance features • Soon: Named-Entity Recognition
How’s the HMM as a LM? • Back to tags t for the moment (instead of states s). • We have a generative model of tagged sentences: • We can turn this into a distribution over sentences by marginalizing out the tag sequences: • Problem: too many sequences! • And beam search isn’t going to help this time. Why not?
How’s the HMM as a LM? • POS tagging HMMs are terrible as LMs! • Don’t capture long-distance effects like a parser could • Don’t capture local collocational effects like n-grams • But other HMM-like LMs can work very well I bought an ice cream ___ The computer that I set up yesterday just ___ c1 c2 cn START w1 w2 wn