
Klein and Manning on CRFs vs CMMs



Presentation Transcript


  1. Klein and Manning on CRFs vs CMMs 2-20-2007

  2. Announcements • Projects are all in and approved • People working on weird datasets should send me some examples so I understand the task. • Anything else?

  3. CRF wrap-up. MEMMs: sequence classification f: x → y is reduced to many cases of ordinary classification, f: x_i → y_i, combined with Viterbi or beam search. CRFs: sequence classification f: x → y is done by converting x, Y to an MRF and using “flow” computations on the MRF to compute some best y|x. [Diagrams: an MEMM chain over x1 … x6 and y1 … y6 labeled with local conditionals such as Pr(Y|x2,y1), Pr(Y|x4,y3), Pr(Y|x5,y5); an MRF chain over the same variables with potentials φ(Y1,Y2), φ(Y2,Y3), ….]
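The MEMM recipe on this slide (local classifiers combined with Viterbi) can be sketched in a few lines. This is a minimal illustration, not the lecture's code: `viterbi`, `toy_local_prob`, `STATES`, and the `"<s>"` start symbol are hypothetical names, and the hand-set probabilities stand in for a trained classifier Pr(y_i | x_i, y_{i-1}).

```python
import math

STATES = ["UP", "LOW"]  # hypothetical tag set for the toy example


def toy_local_prob(y, x, y_prev):
    """Hypothetical local classifier Pr(y | x, y_prev), normalized over STATES."""
    def score(tag):
        # Favor the tag that matches the case of x ...
        match = 0.8 if (tag == "UP") == x.isupper() else 0.2
        # ... with a mild preference for repeating the previous tag.
        stay = 0.6 if tag == y_prev else 0.4
        return match * stay
    return score(y) / sum(score(t) for t in STATES)


def viterbi(xs, states, local_prob, start="<s>"):
    """Return the most probable tag sequence under a product of local conditionals."""
    # best[y] = (log-probability of the best path ending in y, that path)
    best = {start: (0.0, [])}
    for x in xs:
        new_best = {}
        for y in states:
            candidates = []
            for y_prev, (score, path) in best.items():
                p = local_prob(y, x, y_prev)
                if p > 0.0:
                    candidates.append((score + math.log(p), path + [y]))
            if candidates:
                new_best[y] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


tags = viterbi(["A", "b", "C"], STATES, toy_local_prob)
```

Beam search, the other option the slide mentions, would keep only the top few entries of `best` at each step instead of one entry per state.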

  4. CRF wrap-up. CRFs: sequence classification f: x → y is done by converting x, Y to an MRF and using “flow” computations on the MRF to compute some best y|x. [Diagram: an MRF chain over x1 … x6 and y1 … y6 with potentials φ(Y1,Y2), φ(Y2,Y3), ….]

  5. CRF wrap-up. CRFs: sequence classification f: x → y is done by converting x, Y to an MRF and using “flow” computations on the MRF to compute some best y|x. [Diagram: an MRF chain over x1 … x6 and y1 … y6 with potentials φ(Y1,Y2), φ(Y2,Y3), …, with a “?” overlaid.]
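One way to read the “flow” computations on a chain MRF is as a forward pass that sums the product of potentials over all label sequences, giving the normalizer for Pr(y|x). The sketch below assumes potentials of the form φ(y_prev, y, x_i); `toy_phi` and its numbers are hypothetical, and the brute-force sum is included only to check the forward pass.

```python
import itertools

STATES = ["UP", "LOW"]  # hypothetical label set


def toy_phi(y_prev, y, x):
    """Hypothetical positive potential on (previous label, label, observation)."""
    emit = 3.0 if (y == "UP") == x.isupper() else 1.0
    trans = 2.0 if y == y_prev else 1.0
    return emit * trans


def chain_score(ys, xs, phi):
    """Unnormalized score of one label sequence: product of potentials along the chain."""
    score, y_prev = 1.0, None
    for y, x in zip(ys, xs):
        score *= phi(y_prev, y, x)
        y_prev = y
    return score


def partition(xs, states, phi):
    """Forward pass: sum of chain_score over all label sequences, in O(n * |states|^2)."""
    alpha = {None: 1.0}  # alpha[y] = total score of all prefixes ending in label y
    for x in xs:
        alpha = {y: sum(a * phi(y_prev, y, x) for y_prev, a in alpha.items())
                 for y in states}
    return sum(alpha.values())


xs = ["A", "b", "C"]
z = partition(xs, STATES, toy_phi)
brute = sum(chain_score(ys, xs, toy_phi)
            for ys in itertools.product(STATES, repeat=len(xs)))
prob = chain_score(("UP", "LOW", "UP"), xs, toy_phi) / z  # Pr(y | x) for one labeling
```

Replacing the sum in `partition` with a max gives the Viterbi-style computation of a best y|x on the same chain.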

  6. CRF wrap-up

  7. CRF wrap-up

  8. Klein & Manning: Conditional Structure vs Estimation

  9. Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.

  10. Task 1: WSD (Word Sense Disambiguation). Model 1: Naive Bayes multinomial model. Use the conditional Pr(s|o), obtained from the joint by Bayes’ rule, to predict sense s from context-word observations o. Standard NB training maximizes “joint likelihood” under the independence assumption.
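A minimal sketch of Model 1, assuming a bag-of-words multinomial Naive Bayes. The function names and the toy data are hypothetical. Relative-frequency counts are what maximize joint likelihood; the add-alpha smoothing is an extra assumption added here only to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(data, alpha=1.0):
    """data: list of (sense, context_words). Returns log-priors and per-sense log-likelihoods."""
    sense_counts = Counter(s for s, _ in data)
    word_counts = defaultdict(Counter)
    vocab = set()
    for s, words in data:
        word_counts[s].update(words)
        vocab.update(words)
    log_prior = {s: math.log(c / len(data)) for s, c in sense_counts.items()}
    log_like = {}
    for s in sense_counts:
        # Assumption: add-alpha smoothing over the observed vocabulary.
        total = sum(word_counts[s].values()) + alpha * len(vocab)
        log_like[s] = {w: math.log((word_counts[s][w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def posterior(words, log_prior, log_like):
    """Pr(s | o) by Bayes' rule; words outside the training vocabulary are ignored."""
    scores = {s: log_prior[s] + sum(log_like[s][w] for w in words if w in log_like[s])
              for s in log_prior}
    m = max(scores.values())
    z = sum(math.exp(v - m) for v in scores.values())
    return {s: math.exp(v - m) / z for s, v in scores.items()}


toy_data = [("sense1", ["election", "campaign"]),
            ("sense2", ["whacking", "trails"])]
log_prior, log_like = train_nb(toy_data)
post = posterior(["campaign"], log_prior, log_like)
```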

  11. Task 1: WSD (Word Sense Disambiguation) Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?) or maybe SenseEval score: or maybe even:
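The equations on this slide were not captured in the transcript. As a hedged reconstruction from the names used on the following slides (JL, CL, SCL, accuracy), the objectives being compared are plausibly of this form, with D the training set of (sense, observation) pairs and θ the model parameters; the exact notation is an assumption.

```latex
\begin{align*}
JL(\theta)  &= \sum_{(s,o)\in D} \log \Pr\nolimits_\theta(s, o)
  && \text{joint likelihood} \\
CL(\theta)  &= \sum_{(s,o)\in D} \log \Pr\nolimits_\theta(s \mid o)
  && \text{conditional likelihood} \\
SCL(\theta) &= \sum_{(s,o)\in D} \Pr\nolimits_\theta(s \mid o)
  && \text{sum of conditional likelihoods (SenseEval-style score)} \\
Acc(\theta) &= \sum_{(s,o)\in D} \mathbf{1}\!\left[\, s = \arg\max_{s'} \Pr\nolimits_\theta(s' \mid o) \right]
  && \text{accuracy}
\end{align*}
```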

  12. In other words… MaxEnt vs. Naïve Bayes: different “optimization goals”… or, dropping a constraint about the f’s and λ’s
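The equations behind this slide are missing from the transcript. The standard identity it appears to rely on is that the Naive Bayes conditional is already log-linear; the indexing of the λ's below is an assumption chosen for illustration.

```latex
\begin{align*}
\Pr(s \mid o)
  &= \frac{\Pr(s)\prod_{j}\Pr(o_j \mid s)}{\sum_{s'}\Pr(s')\prod_{j}\Pr(o_j \mid s')}
   = \frac{\exp\!\big(\lambda_{s} + \sum_{j}\lambda_{o_j,s}\big)}
          {\sum_{s'}\exp\!\big(\lambda_{s'} + \sum_{j}\lambda_{o_j,s'}\big)},
\\
\lambda_{s} &= \log \Pr(s), \qquad \lambda_{w,s} = \log \Pr(w \mid s).
\end{align*}
% Naive Bayes requires the weights to be log-probabilities:
%   \sum_{s} e^{\lambda_{s}} = 1, \qquad \sum_{w} e^{\lambda_{w,s}} = 1 \ \text{for every } s.
% Dropping these constraints and fitting the \lambda's freely gives the maxent form.
```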

  13. Task 1: WSD (Word Sense Disambiguation) • Optimize JL with std NB learning • Optimize SCL, CL with conjugate gradient • Also over “non-deficient models” (?) using Lagrange penalties to enforce “soft” version of deficiency constraint • I think this makes sure non-conditional version is a valid probability • “Punt” on optimizing accuracy • Penalty for extreme predictions in SCL
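To make "optimize CL" concrete, here is a minimal sketch that fits the log-linear form by maximizing conditional likelihood. Plain gradient ascent stands in for the conjugate gradient optimizer named on the slide, and neither the Lagrange penalties nor the SCL penalty for extreme predictions is modeled. All names, the learning rate, and the toy data are hypothetical.

```python
import math


def cond_probs(words, senses, lam):
    """Pr(s | o) for a log-linear model: one bias weight per sense, one weight per (word, sense)."""
    scores = {s: lam.get((None, s), 0.0) + sum(lam.get((w, s), 0.0) for w in words)
              for s in senses}
    m = max(scores.values())
    z = sum(math.exp(v - m) for v in scores.values())
    return {s: math.exp(v - m) / z for s, v in scores.items()}


def conditional_ll(data, senses, lam):
    """CL: sum of log Pr(gold sense | context words) over the data."""
    return sum(math.log(cond_probs(words, senses, lam)[gold]) for gold, words in data)


def fit_cl(data, senses, steps=200, lr=0.1):
    """Gradient ascent on CL; the gradient is observed minus expected feature counts."""
    lam = {}
    for _ in range(steps):
        grad = {}
        for gold, words in data:
            p = cond_probs(words, senses, lam)
            for s in senses:
                err = (1.0 if s == gold else 0.0) - p[s]
                # (None, s) is the bias feature for sense s.
                for key in [(None, s)] + [(w, s) for w in words]:
                    grad[key] = grad.get(key, 0.0) + err
        for key, g in grad.items():
            lam[key] = lam.get(key, 0.0) + lr * g
    return lam


SENSES = ["sense1", "sense2"]
toy_data = [("sense1", ["election", "campaign"]),
            ("sense2", ["whacking", "trails"])]
lam = fit_cl(toy_data, SENSES)
```

On separable toy data like this the unpenalized weights keep growing with more steps, which is one motivation for the penalties the slide mentions.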

  14. Conclusion: maxent beats NB? All generalizations are wrong?

  15. Task 2: POS Tagging • Sequential problem • Replace NB with HMM model. • Standard algorithms maximize joint likelihood • Claim: keeping the same model but maximizing conditional likelihood leads to a CRF • Is this true? • Alternative is conditional structure (CMM)
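For reference, the two objectives for the HMM can be written side by side. This is a sketch in standard first-order HMM notation with states s_i, observations o_i, and a fixed start state s_0, none of which is spelled out in the transcript.

```latex
\begin{align*}
\Pr(s, o) &= \prod_{i=1}^{n} \Pr(s_i \mid s_{i-1})\,\Pr(o_i \mid s_i)
  && \text{HMM joint model} \\
JL &= \sum_{(s,o)\in D} \log \Pr(s, o)
  && \text{what standard HMM training maximizes} \\
CL &= \sum_{(s,o)\in D} \log \frac{\Pr(s, o)}{\sum_{s'} \Pr(s', o)}
  && \text{same model, conditional objective}
\end{align*}
```

The conditional in the CL objective is a globally normalized product of pairwise factors over the chain, which is the sense in which the slide's claim connects it to a CRF.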

  16. CRF vs. HMM

  17. Using conditional structure vs maximizing conditional likelihood: the CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate = the CL estimate for Pr(s|o).
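The step from the factorization to "JL estimate = CL estimate" can be spelled out as follows, assuming the local conditionals have the first-order form Pr(s_i | s_{i-1}, o_i) and that Pr(o) has its own separate parameters.

```latex
\begin{align*}
\Pr(s, o) &= \underbrace{\prod_{i=1}^{n} \Pr(s_i \mid s_{i-1}, o_i)}_{\Pr(s \mid o)} \;\cdot\; \Pr(o), \\
JL &= \sum_{(s,o)\in D} \log \Pr(s \mid o) \;+\; \sum_{(s,o)\in D} \log \Pr(o)
    \;=\; CL \;+\; \sum_{(s,o)\in D} \log \Pr(o).
\end{align*}
% The second term does not involve the parameters of \Pr(s \mid o), so the
% parameters that maximize JL also maximize CL, however \Pr(o) is modeled.
```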

  18. Task 2: POS Tagging • Experiments with a simple feature set: • For fixed model, CL is preferred to JL (CRF beats HMM) • For fixed objective, HMM is preferred to MEMM/CMM

  19. Error analysis for POS tagging • Label bias is not the issue: • state-state dependencies are weak compared to observation-state dependencies • too much emphasis on observation, not enough on previous states (“observation bias”) • put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
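For readers who want the mechanics behind the label bias prediction, the toy sketch below shows why a locally normalized model cannot penalize a transition out of a state with a single successor. It illustrates what label bias would predict, not the observation-bias effect the slide argues is the actual issue. The state names and scores are hypothetical.

```python
import math


def local_conditional(scores):
    """Softmax over the successors allowed from the current state (local normalization)."""
    m = max(scores.values())
    z = sum(math.exp(v - m) for v in scores.values())
    return {y: math.exp(v - m) / z for y, v in scores.items()}


# A state with one allowed successor: the observation score has no effect.
good_fit = local_conditional({"B": 5.0})
bad_fit = local_conditional({"B": -5.0})

# A state with two successors does respond to the observation.
two_way = local_conditional({"B": -5.0, "C": 5.0})
```

Both `good_fit["B"]` and `bad_fit["B"]` come out as probability 1, so paths through low-entropy states are never penalized for fitting the observations badly.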

  20. Error analysis for POS tagging

  21. Next: Cohen & Carvalho (from IJCAI-2005, Edinburgh)
