220 likes | 332 Views
Klein and Manning on CRFs vs CMMs. 2-20-2007. Announcements. Projects are all in and approved People working on weird datasets should send me some examples so I understand the task. Anything else?. MEMMs:
E N D
Klein and Manning on CRFs vs CMMs 2-20-2007
Announcements • Projects are all in and approved • People working on weird datasets should send me some examples so I understand the task. • Anything else?
MEMMs: Sequence classification f:xy is reduced to many cases of ordinary classification, f:xiyi …combined with Viterbi or beam search CRFs: Sequence classification f:xy is done by: Converting x,Y to a MRF Using “flow” computations on the MRF to compute some best y|x x1 x2 x3 x4 x5 x6 MRF: φ(Y1,Y2), φ(Y2,Y3),…. y1 y2 y3 y4 y5 y6 CRF wrap-up x1 x2 x3 x4 x5 x6 … Pr(Y|x2,y1’) Pr(Y|x4,y3) … Pr(Y|x5,y5) Pr(Y|x2,y1) … y1 y2 y3 y4 y5 y6
CRFs: Sequence classification f:xy is done by: Converting x,Y to a MRF Using “flow” computations on the MRF to compute some best y|x x1 x2 x3 x4 x5 x6 MRF: φ(Y1,Y2), φ(Y2,Y3),…. y1 y2 y3 y4 y5 y6 CRF wrap-up
CRFs: Sequence classification f:xy is done by: Converting x,Y to a MRF Using “flow” computations on the MRF to compute some best y|x x1 x2 x3 x4 x5 x6 MRF: φ(Y1,Y2), φ(Y2,Y3),…. y1 y2 y3 y4 y5 y6 CRF wrap-up ?
Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.
Task 1: WSD (Word Sense Disambiguation) Model 1: Naive Bayes multinomial model: Use conditional rule to predict sense s from context-word observations o. Standard NB training maximizes “joint likelihood” under independence assumption
Task 1: WSD (Word Sense Disambiguation) Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?) or maybe SenseEval score: or maybe even:
In other words… MaxEnt Naïve Bayes Different “optimization goals”… … or, dropping a constraint about f’s and λ’s
Task 1: WSD (Word Sense Disambiguation) • Optimize JL with std NB learning • Optimize SCL, CL with conjugate gradient • Also over “non-deficient models” (?) using Lagrange penalties to enforce “soft” version of deficiency constraint • I think this makes sure non-conditional version is a valid probability • “Punt” on optimizing accuracy • Penalty for extreme predictions in SCL
Conclusion: maxent beats NB? All generalizations are wrong?
Task 2: POS Tagging • Sequential problem • Replace NB with HMM model. • Standard algorithms maximize joint likelihood • Claim: keeping the same model but maximizing conditional likelihood leads to a CRF • Is this true? • Alternative is conditional structure (CMM)
CRF HMM
Using conditional structure vs maximizing conditional likelihood CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM model, adding dependencies btwn observations does not change Pr(s|o), ie JL estimate =CL estimate for Pr(s|o)
Task 2: POS Tagging Experiments with a simple feature set: For fixed model, CL is preferred to JL (CRF beats HMM) For fixed objective, HMM is preferred to MEMM/CMM
Error analysis for POS tagging • Label bias is not the issue: • state-state dependencies are weakcompared to observation-state dependencies • too much emphasis on observation, not enough on previous states (“observation bias”) • put another way: label bias predicts overprediction of states with few outgoing transitions, of more generally, low entropy...