IE With Undirected Models: the saga continues

William W. Cohen, CALD

Presentation Transcript


  1. IE With Undirected Models: the saga continues. William W. Cohen, CALD

  2. Announcements • Upcoming assignments: • Mon 2/23: Toutanova et al • Wed 2/25: Klein & Manning, intro to max margin theory • Mon 3/1: no writeup due • Wed 3/3: project proposal due: • personnel + 1-2 page • Spring break week, no class

  3. Motivation for CMMs • Observation features can include: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; … • [Figure: states S(t-1), S(t), S(t+1) above observations O(t-1), O(t), O(t+1); example features of one observation: is “Wisniewski”, part of noun phrase, ends in “-ski”] • Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
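The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the label set, the feature function, and the weights are all made up for the example. It shows a maxent (softmax) local model for Pr(s_t | s_{t-1}, o_t) in which arbitrary, overlapping features of the observation can be used.

```python
import math

STATES = ["PERSON", "OTHER"]  # hypothetical label set, for illustration only


def features(prev_state, word, state):
    """Binary features of (previous state, observation), each conjoined with the candidate state."""
    active = []
    if word[:1].isupper():
        active.append(("is_capitalized", state))
    if word.endswith("ski"):
        active.append(("ends_in_ski", state))
    active.append(("prev=" + prev_state, state))
    return active


def local_distribution(weights, prev_state, word):
    """Pr(s_t | s_{t-1}, o_t): a softmax over candidate states of the summed feature weights."""
    scores = {
        s: sum(weights.get(f, 0.0) for f in features(prev_state, word, s))
        for s in STATES
    }
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}


# Made-up weights; in practice these would be fit by maximizing conditional likelihood.
WEIGHTS = {
    ("ends_in_ski", "PERSON"): 2.0,
    ("is_capitalized", "PERSON"): 1.0,
    ("prev=OTHER", "OTHER"): 0.5,
}

dist = local_distribution(WEIGHTS, "OTHER", "Wisniewski")
```

Here `dist` is a proper distribution over the next state. Overlapping features such as capitalization and the “-ski” suffix simply add to the score, and no generative model of the observation is needed.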

  4. Implications of the model • Does this do what we want? • Q: does Y[i-1] depend on X[i+1]? • “a node is conditionally independent of its non-descendants given its parents”

  5. CRF model • [Figure: label nodes y1, y2, y3, y4 and observation x]

  6. Dependency Nets

  7. Dependency nets • Learning is simple and elegant (if you know each node’s Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X. • Local models: Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3) • [Figure: label nodes y1, y2, y3, y4 and observation x] • Learning is local, but inference is not, and need not be unidirectional

  8. Toutanova, Klein, Manning, Singer • Dependency nets for POS tagging vs CMMs. • Maxent is used for the local conditional model. • Goals: • An easy-to-train bidirectional model • A really good POS tagger

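  9. Toutanova et al • Don’t use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence) • Example: D = {11, 11, 11, 12, 21, 33}, where each item is a pair of values (a, b) • ML state: {11} • But P(a=1|b=1)P(b=1|a=1) < 1, while P(a=3|b=3)P(b=3|a=3) = 1, so the product of local conditionals prefers 33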
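The counterexample on this slide can be checked directly. The sketch below reads each item of D as a pair (a, b), which is an assumption about the slide's notation, and estimates the two local conditionals by relative frequency. It then compares the most frequent joint state with the state that maximizes the product of local conditionals.

```python
from collections import Counter
from fractions import Fraction

# The data set from the slide, read as (a, b) pairs.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
joint = Counter(D)


def p_a_given_b(a, b):
    """Relative-frequency estimate of P(a | b)."""
    return Fraction(joint[(a, b)], sum(c for (_, b2), c in joint.items() if b2 == b))


def p_b_given_a(b, a):
    """Relative-frequency estimate of P(b | a)."""
    return Fraction(joint[(a, b)], sum(c for (a2, _), c in joint.items() if a2 == a))


def local_score(a, b):
    """Product of the two local conditionals, the quantity a Viterbi-style search would maximize."""
    return p_a_given_b(a, b) * p_b_given_a(b, a)


ml_state = max(joint, key=joint.get)                               # most frequent joint state
best_by_local_score = max(joint, key=lambda ab: local_score(*ab))  # argmax of the product
```

With these counts, `local_score(1, 1)` is 3/4 × 3/4 = 9/16 and `local_score(3, 3)` is 1. The product of conditionals therefore picks (3, 3), even though (1, 1) accounts for half of the data.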

  10. Results with model • Final test-set results: MXPost: 47.6, 96.4, 86.2; CRF+: 95.7, 76.4

  11. Klein & Manning: Conditional Structure vs Estimation

  12. Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.

  13. Task 1: WSD (Word Sense Disambiguation) Model 1: Naive Bayes multinomial model: Use conditional rule to predict sense s from context-word observations o. Standard NB training maximizes “joint likelihood” under independence assumption
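A minimal sketch of Model 1 follows, on hypothetical toy data: the training examples and the function names are invented for illustration. Unsmoothed relative-frequency counts are the joint-likelihood maximizer for multinomial naive Bayes. The sketch adds add-alpha smoothing so that unseen (sense, word) pairs do not get zero probability, which is a departure from pure JL training.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples, alpha=1.0):
    """examples: list of (sense, [context words]). Returns counts for a multinomial NB model."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab, alpha


def log_joint(model, sense, words):
    """log P(s, o) = log P(s) + sum_j log P(o_j | s), under the NB independence assumption."""
    sense_counts, word_counts, vocab, alpha = model
    lp = math.log(sense_counts[sense] / sum(sense_counts.values()))
    denom = sum(word_counts[sense].values()) + alpha * len(vocab)
    for w in words:
        if w in vocab:  # words never seen in training are skipped
            lp += math.log((word_counts[sense][w] + alpha) / denom)
    return lp


def predict(model, words):
    """Predict the sense with the highest joint score (equivalently, highest P(s | o))."""
    return max(model[0], key=lambda s: log_joint(model, s, words))


# Hypothetical training data, loosely modeled on the two example sentences.
TRAIN = [
    ("sense1", ["election", "campaign", "ad"]),
    ("sense1", ["president", "campaign"]),
    ("sense2", ["trails", "hiking", "tiring"]),
    ("sense2", ["whacking", "trails"]),
]
model = train_nb(TRAIN)
```

For instance, `predict(model, ["campaign", "summer"])` selects `sense1` and `predict(model, ["trails"])` selects `sense2`.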

  14. Task 1: WSD (Word Sense Disambiguation) Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?) or maybe SenseEval score: or maybe even:
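The formulas on this slide did not survive in the transcript. As a hedged reconstruction, assuming the slide follows the objectives named later in the deck (JL, CL, SCL, accuracy) and writing the training set as pairs (s_i, o_i) with parameters θ, the candidates would look like this:

```latex
\begin{align*}
P(s \mid o; \theta) &= \frac{P(s;\theta)\prod_j P(o_j \mid s;\theta)}
                            {\sum_{s'} P(s';\theta)\prod_j P(o_j \mid s';\theta)}
  && \text{(same NB functional form)} \\
\mathrm{JL}(\theta)  &= \sum_i \log P(s_i, o_i; \theta)
  && \text{(joint likelihood)} \\
\mathrm{CL}(\theta)  &= \sum_i \log P(s_i \mid o_i; \theta)
  && \text{(conditional likelihood)} \\
\mathrm{SCL}(\theta) &= \sum_i P(s_i \mid o_i; \theta)
  && \text{(sum of conditional likelihoods, SenseEval-style score)} \\
\mathrm{Acc}(\theta) &= \sum_i \mathbf{1}\!\left[\, s_i = \arg\max_{s} P(s \mid o_i; \theta) \right]
  && \text{(accuracy)}
\end{align*}
```

The symbols θ, s_i, and o_i are notation introduced here, not taken from the slide.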

  15. Task 1: WSD (Word Sense Disambiguation) • Optimize JL with standard NB learning • Optimize SCL, CL with conjugate gradient • Also over “non-deficient models” (?) using Lagrange penalties to enforce “soft” version of deficiency constraint • I think this makes sure the non-conditional version is a valid probability • Don’t even try optimizing accuracy • Penalty for extreme predictions in SCL

  16. Task 2: POS Tagging • Sequential problem • Replace NB with HMM model. • Standard algorithms maximize joint likelihood • Claim: keeping the same model but maximizing conditional likelihood leads to a CRF • Is this true? • Alternative is conditional structure (CMM)

  17. Using conditional structure vs maximizing conditional likelihood • CMM factors Pr(s,o) into Pr(s|o)Pr(o). • For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., JL estimate = CL estimate for Pr(s|o)
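The claim can be spelled out in one step. This assumes the two factors have separate parameters, written here as θ for Pr(s|o) and φ for Pr(o); both symbols are introduced for this sketch and are not on the slide.

```latex
\begin{align*}
\mathrm{JL}(\theta, \phi)
  &= \sum_i \log P(s_i, o_i; \theta, \phi) \\
  &= \sum_i \log P(s_i \mid o_i; \theta) \;+\; \sum_i \log P(o_i; \phi) \\
  &= \mathrm{CL}(\theta) \;+\; \sum_i \log P(o_i; \phi), \\[4pt]
\arg\max_{\theta} \mathrm{JL}(\theta, \phi)
  &= \arg\max_{\theta} \mathrm{CL}(\theta)
  \quad \text{for every } \phi .
\end{align*}
```

Because the second sum does not involve θ, changing the model of the observations (for example, adding dependencies between them) only changes φ and leaves the estimate of Pr(s|o) untouched.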

  18. Task 2: POS Tagging • Experiments with a simple feature set: • For fixed model, CL is preferred to JL (CRF beats HMM) • For fixed objective, HMM is preferred to MEMM/CMM

  19. Error analysis for POS tagging • Label bias is not the issue: • state-state dependencies are weak compared to observation-state dependencies • too much emphasis on observation, not enough on previous states (“observation bias”) • put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...

  20. Error analysis for POS tagging

  21. Background for next week: the last 20 years of learning theory

  22. Milestones in learning theory • Valiant 1984 CACM: • Turing machines and Turing tests: formal analysis of AI problems • Chernoff bound shows that if error of h > ε, then Prob(h consistent with m examples) < δ • So given m examples, can afford to examine 2^m hypotheses
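A small numeric sketch of this argument, under the assumption that the slide has the standard finite-hypothesis-class bound in mind: a hypothesis with error above ε is consistent with m independent examples with probability at most (1 - ε)^m, and a union bound over |H| hypotheses gives a failure probability of at most |H|(1 - ε)^m. The function names are invented for the example.

```python
import math


def occam_sample_size(num_hypotheses, epsilon, delta):
    """Smallest m with |H| * exp(-epsilon * m) <= delta, i.e. m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)


def prob_bad_hypothesis_survives(num_hypotheses, epsilon, m):
    """Union bound on the chance that some hypothesis with error > epsilon fits all m examples."""
    return num_hypotheses * (1.0 - epsilon) ** m


# Example: 2^10 hypotheses, target error 0.1, failure probability 0.05.
m = occam_sample_size(2 ** 10, epsilon=0.1, delta=0.05)
bound = prob_bad_hypothesis_survives(2 ** 10, 0.1, m)
```

Read in the other direction, the same inequality allows ln|H| to grow linearly in m for fixed ε and δ. That is one way to read the slide's remark that m examples can support a hypothesis space exponential in m.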

  23. Milestones in learning theory • Haussler AAAI86: • Pick a small hypothesis from a large set • Given m examples, can learn hypothesis of size O(m) bits • Blumer, Ehrenfeucht, Haussler, Warmuth, STOC88: • Generalize notion of “hypothesis size” to VC-dimension.

  24. More milestones.... • Littlestone MLJ88: Winnow algorithm • Learning “small” hypothesis in many dimensions, in mistake bounded model • Mistake bound ~= VCdim. • Blum COLT91: • Learning over infinitely many attributes in mistake-bounded model • Learning as compression as learning...
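As a concrete reference point for the Winnow bullet, here is a minimal sketch of the multiplicative-update variant usually called Winnow2. It learns a monotone disjunction over Boolean attributes with promotion factor 2 and threshold n. The data set, the target disjunction, and the function names are made up for the example.

```python
import random


def winnow_predict(weights, x, threshold):
    """Predict 1 if the total weight of the active attributes reaches the threshold."""
    return 1 if sum(w for w, xi in zip(weights, x) if xi) >= threshold else 0


def winnow_train(examples, n, alpha=2.0, max_epochs=100):
    """Cycle through the examples, updating only on mistakes, until one pass is mistake-free."""
    weights = [1.0] * n
    threshold = float(n)
    total_mistakes = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if winnow_predict(weights, x, threshold) == y:
                continue
            mistakes += 1
            # Promote active attributes on a false negative, demote them on a false positive.
            factor = alpha if y == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        total_mistakes += mistakes
        if mistakes == 0:
            break
    return weights, threshold, total_mistakes


# Hypothetical target: x[0] OR x[3], hidden among 20 attributes.
rng = random.Random(0)
N = 20
RELEVANT = (0, 3)
examples = []
for _ in range(200):
    x = [1 if rng.random() < 0.2 else 0 for _ in range(N)]
    examples.append((x, 1 if any(x[i] for i in RELEVANT) else 0))

weights, threshold, total_mistakes = winnow_train(examples, N)
```

The number of mistakes grows with the number of relevant attributes times log n rather than with n itself, which is the sense in which Winnow learns a “small” hypothesis in many dimensions.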

  25. More milestones.... • Freund & Schapire 1996: • boosting C4.5, even to extremes, does not overfit data (!?) -- how does this reconcile with Occam’s razor? • Vapnik’s support vector machines: • kernel representation of a function • “true” optimization in machine learning • boosting as iterative “margin maximization”

  26. Comments • For bag of words text, R^2=|words in doc| • Vocabulary size matters not
