MaxEnt POS Tagging • Shallow Processing Techniques for NLP • Ling570 • November 21, 2011
Roadmap • MaxEnt POS Tagging • Features • Beam Search vs. Viterbi • Named Entity Tagging
MaxEnt Feature Template • Words: • Current word: w0 • Previous word: w-1 • Word two back: w-2 • Next word: w+1 • Next next word: w+2 • Tags: • Previous tag: t-1 • Previous tag pair: t-2 t-1 • How many features? 5|V| + |T| + |T|²
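A minimal sketch of this feature template in Python, using the name=value feature style shown in the example slide below; the helper name and exact signature are assumptions, not the course's reference code:

```python
def context_features(words, tags, i):
    """Return MaxEnt context features for position i as 'name=value' strings."""
    feats = []
    # Word features: w-2, w-1, w0, w+1, w+2 -- five templates over |V| words
    for offset, name in [(-2, "prev2W"), (-1, "prevW"), (0, "curW"),
                         (1, "nextW"), (2, "next2W")]:
        j = i + offset
        if 0 <= j < len(words):
            feats.append(f"{name}={words[j]}")
    # Tag features: previous tag (|T| values) and previous tag pair (|T|^2 values)
    if i >= 1:
        feats.append(f"preT={tags[i - 1]}")
    if i >= 2:
        feats.append(f"pre2T={tags[i - 2]}-{tags[i - 1]}")
    return feats
```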
Representing Orthographic Patterns • How can we represent morphological patterns as features? • Character sequences • Which sequences? Prefixes/suffixes • e.g. suffix(wi)=ing or prefix(wi)=well • Specific characters or character types • Which? • is-capitalized • is-hyphenated
Examples • well-heeled: rare word, tagged JJ • Features: prevW=about:1 prev2W=stories:1 nextW=communities:1 next2W=and:1 pref=w:1 pref=we:1 pref=wel:1 pref=well:1 suff=d:1 suff=ed:1 suff=led:1 suff=eled:1 is-hyphenated:1 preT=IN:1 pre2T=NNS-IN:1
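The prefix, suffix, and character-type features in this example can be sketched the same way; the function below is hypothetical, but it reproduces the pref/suff/is-hyphenated features shown above for "well-heeled":

```python
def orthographic_features(word, max_len=4):
    """Prefix/suffix and character-type features for a single word."""
    feats = []
    # Prefix and suffix character sequences up to length max_len
    for k in range(1, min(max_len, len(word)) + 1):
        feats.append(f"pref={word[:k]}")
        feats.append(f"suff={word[-k:]}")
    # Character-type features
    if word[:1].isupper():
        feats.append("is-capitalized")
    if "-" in word:
        feats.append("is-hyphenated")
    return feats

# orthographic_features("well-heeled") -> pref=w ... pref=well,
# suff=d ... suff=eled, is-hyphenated
```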
Finding Features • In training, where do features come from? • From the words and gold-standard tags in the hand-labeled training data • Where do features come from in testing? • Word features come from the sentence itself; tag features come from the classification of prior words
Sequence Labeling • Goal: Find most probable labeling of a sequence • Many sequence labeling tasks • POS tagging • Word segmentation • Named entity tagging • Story/spoken sentence segmentation • Pitch accent detection • Dialog act tagging
Solving Sequence Labeling • Direct: Use a sequence labeling algorithm • E.g., HMM, CRF, MEMM • Via classification: Use a classification algorithm • Issue: What about tag features? • Features that use class labels depend on the classifier's own output • Solutions: • Don't use features that depend on class labels (loses information) • Use some other process to generate class labels, then use them as features • Perform incremental classification: classify left to right, using earlier predicted labels as features for later instances (see the sketch below)
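The third option (incremental classification) can be pictured as a rough greedy tagger; the classifier interface below (predict over a feature list) is an assumption, and it reuses the hypothetical feature helpers sketched earlier:

```python
def greedy_tag(words, classifier):
    """Tag left to right, feeding earlier predictions back in as features."""
    tags = []
    for i, word in enumerate(words):
        feats = context_features(words, tags, i) + orthographic_features(word)
        # Predictions already stored in `tags` supply the preT / pre2T features
        tags.append(classifier.predict(feats))
    return tags
```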
HMM Trellis • Example: <s> time flies like an arrow • [Trellis diagram: one column per word, with candidate tags N, V, P, D at each position, starting from BOS] • Adapted from F. Xia
Viterbi • Initialization: δ1(j) = πj · bj(o1) • Recursion: δt(j) = maxi [δt−1(i) · aij] · bj(ot), recording a backpointer to the maximizing i • Termination: best score = maxi δT(i); follow backpointers to recover the tag sequence
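As a reference point, a compact Viterbi sketch for an HMM tagger; the pi/trans/emit probability tables and their nesting are assumptions, and this is illustrative rather than the course's implementation:

```python
import math

def viterbi(words, tags, pi, trans, emit):
    """Return the highest-probability tag sequence under an HMM."""
    # delta[i][t]: log-prob of the best path ending in tag t at word i
    delta = [{t: math.log(pi[t]) + math.log(emit[t][words[0]]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: delta[i - 1][p] + math.log(trans[p][t]))
            delta[i][t] = (delta[i - 1][prev] + math.log(trans[prev][t])
                           + math.log(emit[t][words[i]]))
            back[i][t] = prev
    # Termination: pick the best final tag, then follow backpointers
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```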
Decoding • Goal: Identify the highest-probability tag sequence • Issues: • Features include the tags of previous words, which are not immediately available at test time • The model uses tag history, so knowing only the single highest-probability preceding tag is not sufficient
Decoding • Approach: Retain multiple candidate tag sequences • Essentially search through the tagging choices • Which sequences? • All sequences? • No. Why not? • Branching factor: N (# tags); depth: T (# words), so there are N^T possible sequences • e.g., ~45 Penn Treebank tags over a 20-word sentence gives 45^20 ≈ 10^33 sequences • Instead, keep the top K highest-probability sequences
Breadth-First Search • [Trellis diagrams expanding every candidate tag (N, V, P, ...) for each successive word of the example: <s> time flies like an arrow, starting from BOS]
Breadth-First Search • Is breadth-first search efficient? • No: it explores every possible path
Beam Search • Intuition: • Breadth-first search explores all paths • Lots of paths are (pretty obviously) bad • Why explore bad paths? • Restrict the search to the apparently best paths • Approach: • Perform breadth-first search, but • Retain only the k 'best' paths so far • k: the beam width (see the sketch below)
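A rough beam-search sketch for the MaxEnt tagger, assuming a classifier whose predict_proba(features) returns a tag-to-probability dict; the helper names are hypothetical and reuse the feature functions sketched earlier:

```python
import math

def beam_search_tag(words, classifier, k=3):
    """Keep only the k best partial tag sequences at each word."""
    beam = [(0.0, [])]  # (log-probability, tag sequence so far)
    for i, word in enumerate(words):
        candidates = []
        for logp, tags in beam:
            feats = context_features(words, tags, i) + orthographic_features(word)
            for tag, p in classifier.predict_proba(feats).items():
                candidates.append((logp + math.log(p), tags + [tag]))
        # Prune: retain only the k highest-probability hypotheses
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:k]
    return max(beam, key=lambda h: h[0])[1]
```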
Beam Search, k=3 • [Trellis diagrams as above, but only the 3 highest-probability partial tag sequences are kept at each word of: <s> time flies like an arrow]
Beam Search • W = {w1, w2, …, wn}: the test sentence • sij: the j-th highest-probability tag sequence up to and including word wi