MaxEnt POS Tagging • Shallow Processing Techniques for NLP • Ling570 • November 21, 2011
Roadmap • MaxEnt POS Tagging • Features • Beam Search vs. Viterbi • Named Entity Tagging
MaxEnt Feature Template • Words: • Current word: w0 • Previous word: w-1 • Word two back: w-2 • Next word: w+1 • Next next word: w+2 • Tags: • Previous tag: t-1 • Previous tag pair: t-2 t-1 • How many features? 5|V| + |T| + |T|²
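A minimal sketch of this feature template in Python, using the name=value feature style shown in the example slide below; the helper name and exact signature are assumptions, not the course's reference code:

```python
def context_features(words, tags, i):
    """Return MaxEnt context features for position i as 'name=value' strings."""
    feats = []
    # Word features: w-2, w-1, w0, w+1, w+2 -- five templates over |V| words
    for offset, name in [(-2, "prev2W"), (-1, "prevW"), (0, "curW"),
                         (1, "nextW"), (2, "next2W")]:
        j = i + offset
        if 0 <= j < len(words):
            feats.append(f"{name}={words[j]}")
    # Tag features: previous tag (|T| values) and previous tag pair (|T|^2 values)
    if i >= 1:
        feats.append(f"preT={tags[i - 1]}")
    if i >= 2:
        feats.append(f"pre2T={tags[i - 2]}-{tags[i - 1]}")
    return feats
```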
Representing Orthographic Patterns • How can we represent morphological patterns as features? • Character sequences • Which sequences? Prefixes/suffixes • e.g. suffix(wi)=ing or prefix(wi)=well • Specific characters or character types • Which? • is-capitalized • is-hyphenated
Examples • well-heeled: rare word, tagged JJ • Features: prevW=about:1 prev2W=stories:1 nextW=communities:1 next2W=and:1 pref=w:1 pref=we:1 pref=wel:1 pref=well:1 suff=d:1 suff=ed:1 suff=led:1 suff=eled:1 is-hyphenated:1 preT=IN:1 pre2T=NNS-IN:1
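The prefix, suffix, and character-type features in this example can be sketched the same way; the function below is hypothetical, but it reproduces the pref/suff/is-hyphenated features shown above for "well-heeled":

```python
def orthographic_features(word, max_len=4):
    """Prefix/suffix and character-type features for a single word."""
    feats = []
    # Prefix and suffix character sequences up to length max_len
    for k in range(1, min(max_len, len(word)) + 1):
        feats.append(f"pref={word[:k]}")
        feats.append(f"suff={word[-k:]}")
    # Character-type features
    if word[:1].isupper():
        feats.append("is-capitalized")
    if "-" in word:
        feats.append("is-hyphenated")
    return feats

# orthographic_features("well-heeled") -> pref=w ... pref=well,
# suff=d ... suff=eled, is-hyphenated
```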
Finding Features • In training, where do features come from? • From the words and gold-standard tags in the hand-labeled training data • Where do features come from in testing? • Word features come from the sentence itself; tag features come from the classification of prior words
Sequence Labeling • Goal: Find most probable labeling of a sequence • Many sequence labeling tasks • POS tagging • Word segmentation • Named entity tagging • Story/spoken sentence segmentation • Pitch accent detection • Dialog act tagging
Solving Sequence Labeling • Direct: Use a sequence labeling algorithm • E.g., HMM, CRF, MEMM • Via classification: Use a classification algorithm • Issue: What about tag features? • Features that use class labels depend on the classifier's own output • Solutions: • Don't use features that depend on class labels (loses information) • Use some other process to generate class labels, then use them as features • Perform incremental classification: classify left to right, using earlier predicted labels as features for later instances (see the sketch below)
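The third option (incremental classification) can be pictured as a rough greedy tagger; the classifier interface below (predict over a feature list) is an assumption, and it reuses the hypothetical feature helpers sketched earlier:

```python
def greedy_tag(words, classifier):
    """Tag left to right, feeding earlier predictions back in as features."""
    tags = []
    for i, word in enumerate(words):
        feats = context_features(words, tags, i) + orthographic_features(word)
        # Predictions already stored in `tags` supply the preT / pre2T features
        tags.append(classifier.predict(feats))
    return tags
```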
HMM Trellis • Example: <s> time flies like an arrow • [Trellis diagram: one column per word, with candidate tags N, V, P, D at each position, starting from BOS] • Adapted from F. Xia
Viterbi • Initialization: δ1(j) = πj · bj(o1) • Recursion: δt(j) = maxi [δt−1(i) · aij] · bj(ot), recording a backpointer to the maximizing i • Termination: best score = maxi δT(i); follow backpointers to recover the tag sequence
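As a reference point, a compact Viterbi sketch for an HMM tagger; the pi/trans/emit probability tables and their nesting are assumptions, and this is illustrative rather than the course's implementation:

```python
import math

def viterbi(words, tags, pi, trans, emit):
    """Return the highest-probability tag sequence under an HMM."""
    # delta[i][t]: log-prob of the best path ending in tag t at word i
    delta = [{t: math.log(pi[t]) + math.log(emit[t][words[0]]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: delta[i - 1][p] + math.log(trans[p][t]))
            delta[i][t] = (delta[i - 1][prev] + math.log(trans[prev][t])
                           + math.log(emit[t][words[i]]))
            back[i][t] = prev
    # Termination: pick the best final tag, then follow backpointers
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```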
Decoding • Goal: Identify the highest-probability tag sequence • Issues: • Features include the tags of previous words, which are not immediately available at test time • The model uses tag history, so knowing only the single highest-probability preceding tag is not sufficient
Decoding • Approach: Retain multiple candidate tag sequences • Essentially search through the tagging choices • Which sequences? • All sequences? • No. Why not? • Branching factor: N (# tags); depth: T (# words), so there are N^T possible sequences • e.g., ~45 Penn Treebank tags over a 20-word sentence gives 45^20 ≈ 10^33 sequences • Instead, keep the top K highest-probability sequences
Breadth-First Search • [Trellis diagrams expanding every candidate tag (N, V, P, ...) for each successive word of the example: <s> time flies like an arrow, starting from BOS]
Breadth-First Search • Is breadth-first search efficient? • No: it explores every possible path
Beam Search • Intuition: • Breadth-first search explores all paths • Lots of paths are (pretty obviously) bad • Why explore bad paths? • Restrict the search to the apparently best paths • Approach: • Perform breadth-first search, but • Retain only the k 'best' paths so far • k: the beam width (see the sketch below)
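A rough beam-search sketch for the MaxEnt tagger, assuming a classifier whose predict_proba(features) returns a tag-to-probability dict; the helper names are hypothetical and reuse the feature functions sketched earlier:

```python
import math

def beam_search_tag(words, classifier, k=3):
    """Keep only the k best partial tag sequences at each word."""
    beam = [(0.0, [])]  # (log-probability, tag sequence so far)
    for i, word in enumerate(words):
        candidates = []
        for logp, tags in beam:
            feats = context_features(words, tags, i) + orthographic_features(word)
            for tag, p in classifier.predict_proba(feats).items():
                candidates.append((logp + math.log(p), tags + [tag]))
        # Prune: retain only the k highest-probability hypotheses
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:k]
    return max(beam, key=lambda h: h[0])[1]
```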
Beam Search, k=3 • [Trellis diagrams as above, but only the 3 highest-probability partial tag sequences are kept at each word of: <s> time flies like an arrow]
Beam Search • W = {w1, w2, …, wn}: the test sentence • sij: the j-th highest-probability tag sequence up to and including word wi