IE With Undirected Models: the saga continues. William W. Cohen, CALD
Announcements • Upcoming assignments: • Mon 2/23: Toutanova et al • Wed 2/25: Klein & Manning, intro to max margin theory • Mon 3/1: no writeup due • Wed 3/3: project proposal due: • personnel + 1-2 page • Spring break week, no class
Motivation for CMMs • [Figure: an HMM-style chain of states S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1); each observation carries rich features: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, ... e.g., the word "Wisniewski" is part of a noun phrase and ends in "-ski"] • Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state (a sketch of the factorization follows).
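A minimal sketch of that factorization, in my notation rather than the slides' (the features f_k may look at the whole observation sequence o and the position t):

$$P(s_1,\dots,s_n \mid o) \;=\; \prod_{t=1}^{n} P(s_t \mid s_{t-1}, o),
\qquad
P(s_t \mid s_{t-1}, o) \;=\; \frac{\exp\big(\sum_k \lambda_k f_k(s_t, s_{t-1}, o, t)\big)}{\sum_{s'} \exp\big(\sum_k \lambda_k f_k(s', s_{t-1}, o, t)\big)}$$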
Implications of the model • Does this do what we want? • Q: does Y[i-1] depend on X[i+1]? • "a node is conditionally independent of its non-descendants given its parents"
CRF model • [Figure: an undirected chain over labels y1, y2, y3, y4, with each label also connected to the observation x]
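A minimal sketch of the distribution this linear-chain picture stands for (standard CRF form, notation mine):

$$P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big(\sum_{t}\sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\Big(\sum_{t}\sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t)\Big)$$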
Dependency nets • Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X. • [Figure: chain y1, y2, y3, y4 over observation x, with local conditionals Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3)] • Learning is local, but inference is not, and need not be unidirectional (a Gibbs-sampling sketch follows).
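To make "inference is not local" concrete, here is a hypothetical Gibbs-sampling loop over the learned local conditionals; the function local_conditional and all names are illustrative assumptions, not anything from the papers:

```python
import random

def gibbs_sample(x, local_conditional, labels, n_positions, n_sweeps=100):
    """Approximate inference in a dependency net by Gibbs sampling.

    local_conditional(i, y, x) is assumed to return P(y_i = label | x, y_-i)
    as a dict {label: prob}, computed from node i's learned local classifier.
    """
    y = [random.choice(labels) for _ in range(n_positions)]   # random start
    for _ in range(n_sweeps):
        for i in range(n_positions):
            probs = local_conditional(i, y, x)                # depends on neighbors
            r, acc = random.random(), 0.0
            for label, p in probs.items():                    # resample y_i
                acc += p
                if r <= acc:
                    y[i] = label
                    break
    return y
```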
Toutanova, Klein, Manning, Singer • Dependency nets for POS tagging vs CMMs. • Maxent is used for the local conditional models. • Goals: • An easy-to-train bidirectional model • A really good POS tagger
Toutanova et al • Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence). • Example: D = {11, 11, 11, 12, 21, 33}. The ML state is {11}, but P(a=1|b=1)·P(b=1|a=1) = (3/4)·(3/4) < 1, while P(a=3|b=3)·P(b=3|a=3) = 1, so scoring states by the product of local conditionals prefers 33.
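A quick sketch reproducing that toy calculation (the code and counting are mine; the numbers follow directly from D):

```python
from collections import Counter

# toy dataset of (a, b) pairs from the slide
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
pairs = Counter(D)
a_counts = Counter(a for a, _ in D)
b_counts = Counter(b for _, b in D)

def p_a_given_b(a, b):                 # empirical P(a | b)
    return pairs[(a, b)] / b_counts[b]

def p_b_given_a(b, a):                 # empirical P(b | a)
    return pairs[(a, b)] / a_counts[a]

# score each state by the product of its local conditionals,
# as the Viterbi-style dependency-net inference would
for a, b in [(1, 1), (3, 3)]:
    print((a, b), p_a_given_b(a, b) * p_b_given_a(b, a))
# (1, 1) 0.5625  -> the most frequent state scores below 1
# (3, 3) 1.0     -> the rare state 33 wins under this scoring
```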
Results with model • Final test-set results: • MXPost: 47.6, 96.4, 86.2 • CRF+: 95.7, 76.4
Task 1: WSD (Word Sense Disambiguation) Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.
Task 1: WSD (Word Sense Disambiguation) Model 1: the Naive Bayes multinomial model: use the conditional rule to predict sense s from the context-word observations o. Standard NB training maximizes the "joint likelihood" under an independence assumption.
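A minimal sketch of the two pieces, in standard Naive Bayes notation (not copied from the slides):

$$P(s, o) \;=\; P(s)\prod_{i} P(o_i \mid s),
\qquad
\hat{s} \;=\; \arg\max_{s} P(s \mid o) \;=\; \arg\max_{s} P(s)\prod_{i} P(o_i \mid s)$$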
Task 1: WSD (Word Sense Disambiguation) Model 2: keep the same functional form, but maximize the conditional likelihood (sound familiar?), or maybe the SenseEval score, or maybe even accuracy directly (sketches of these objectives follow).
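A hedged reconstruction of the objectives this slide gestures at (standard forms in my notation; SCL is read here as the summed conditional likelihood, which matches the SenseEval soft-scoring metric):

$$JL(\theta) = \prod_{d} P_\theta(s_d, o_d),
\qquad
CL(\theta) = \prod_{d} P_\theta(s_d \mid o_d),
\qquad
SCL(\theta) = \sum_{d} P_\theta(s_d \mid o_d),
\qquad
Acc(\theta) = \sum_{d} \mathbf{1}\big[\hat{s}_d = s_d\big]$$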
Task 1: WSD (Word Sense Disambiguation) • Optimize JL with standard NB learning • Optimize SCL, CL with conjugate gradient • Also over "non-deficient models" (?) using Lagrange penalties to enforce a "soft" version of the deficiency constraint • I think this makes sure the non-conditional version is a valid probability distribution • Don't even try optimizing accuracy directly • Penalty for extreme predictions in SCL
Task 2: POS Tagging • Sequential problem • Replace NB with HMM model. • Standard algorithms maximize joint likelihood • Claim: keeping the same model but maximizing conditional likelihood leads to a CRF • Is this true? • Alternative is conditional structure (CMM)
Using conditional structure vs maximizing conditional likelihood • A CMM factors Pr(s,o) into Pr(s|o)Pr(o). • For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate = the CL estimate for Pr(s|o) (a one-line sketch follows).
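A one-line sketch of why, assuming (my notation) the conditional and marginal pieces carry disjoint parameters θ and φ:

$$\max_{\theta,\phi} \prod_d P_\theta(s_d \mid o_d)\, P_\phi(o_d)
\;\Longrightarrow\;
\hat{\theta} \;=\; \arg\max_\theta \prod_d P_\theta(s_d \mid o_d)$$

The Pr(o) factor has no θ in it, so maximizing the joint likelihood and maximizing the conditional likelihood give the same estimate of Pr(s|o).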
Task 2: POS Tagging • Experiments with a simple feature set: • For a fixed model, CL is preferred to JL (CRF beats HMM) • For a fixed objective, HMM is preferred to MEMM/CMM
Error analysis for POS tagging • Label bias is not the issue: • state-state dependencies are weak compared to observation-state dependencies • too much emphasis on the observation, not enough on previous states ("observation bias") • put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
Background for next week: the last 20 years of learning theory
Milestones in learning theory • Valiant 1984 CACM: • Turing machines and Turing tests: a formal analysis of AI problems • A Chernoff/union bound shows that if error(h) > ε, then Prob(h is consistent with m examples) < δ • So given m examples, one can afford to examine about 2^m hypotheses
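A sketch of the standard argument behind that bullet (my reconstruction, not the slide's): for a finite hypothesis class H,

$$\Pr\big[\exists\, h \in H:\ \mathrm{error}(h) > \epsilon \ \text{and}\ h \text{ is consistent with } m \text{ examples}\big]
\;\le\; |H|\,(1-\epsilon)^m \;\le\; |H|\, e^{-\epsilon m} \;<\; \delta
\quad \text{when } m \ge \tfrac{1}{\epsilon}\big(\ln|H| + \ln\tfrac{1}{\delta}\big)$$

so |H| can be exponential in m and the guarantee still holds.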
Milestones in learning theory • Haussler AAAI86: • Pick a small hypothesis from a large set • Given m examples, can learn a hypothesis of size O(m) bits • Blumer, Ehrenfeucht, Haussler, Warmuth, STOC88: • Generalize the notion of "hypothesis size" to VC-dimension.
More milestones.... • Littlestone MLJ88: the Winnow algorithm • Learning a "small" hypothesis in many dimensions, in the mistake-bound model (a sketch of Winnow follows) • Mistake bound ~= VC dimension • Blum COLT91: • Learning over infinitely many attributes in the mistake-bound model • Learning as compression, compression as learning...
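A minimal sketch of the Winnow update, following the standard description (the doubling/halving factors and the threshold n are the usual defaults; nothing here is taken from the slides):

```python
def winnow(examples, n, threshold=None):
    """Winnow (Littlestone '88) for learning a monotone disjunction over
    n Boolean attributes in the mistake-bound model.

    examples: iterable of (x, y) with x a length-n list of 0/1 and y in {0, 1}.
    Mistakes grow as O(k log n) for a k-literal target, so the dimension n
    enters the bound only logarithmically.
    """
    theta = threshold if threshold is not None else n    # usual threshold = n
    w = [1.0] * n                                         # uniform initial weights
    mistakes = 0
    for x, y in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if pred != y:
            mistakes += 1
            if y == 1:                                    # promotion: double active weights
                w = [wi * 2 if xi else wi for wi, xi in zip(w, x)]
            else:                                         # demotion: halve active weights
                w = [wi / 2 if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes
```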
More milestones.... • Freund Schapire 1996: • boosting C4.5, even to extremes, does not overfit data (!?) --how does this reconcile with Occam’s razor? • Vapnik’s support vector machines: • kernel representation of a function • “true” optimization in machine learning • boosting as iterative “margin maximization”
Comments • For bag-of-words text, R^2 = |words in the doc| • Vocabulary size does not matter
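For context (my addition, assuming the R here is the example radius in the classic perceptron/Winnow-style mistake bounds): the perceptron bound is

$$\#\text{mistakes} \;\le\; \frac{R^2}{\gamma^2},
\qquad R = \max_d \|x_d\|, \quad \gamma = \text{margin of a separating hyperplane},$$

and for 0/1 bag-of-words vectors ||x_d||^2 is just the number of distinct words in document d, so the bound scales with document length rather than vocabulary size.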