Inference and Learning via Integer Linear Programming • Vasin, Dan, Scott, Dav
Outline • Problem Definition • Integer Linear Programming (ILP) • Its generality • Learning and Inference via ILP • Experiments • Extension to hierarchical learning • Future Direction • Hidden Variables
Problem Definition • X = (X1,...,Xk) ∈ X1 × … × Xk = X • Y = (Y1,...,Yl) ∈ Y1 × … × Yl = Y • Given X = x, find Y = y • Notation agreements • Capital letters denote variables • Lower-case letters denote values • Bold indicates vectors or matrices • Calligraphic X, Y denote sets
Example (Text Chunking) • x = The guy presenting now is so tired • y = NP ADJP VP ADVP VP
Classifiers • A classifier h: X × Y^(l−1) × Y × {1,..,l} → R • Example • score(x, y−3, NP, 3) = 0.3 • score(x, y−3, VP, 3) = 0.5 • score(x, y−3, ADVP, 3) = 0.2 • score(x, y−3, ADJP, 3) = 1.2 • score(x, y−3, NULL, 3) = 0.1
Inference • Goal: x → y • Given • x: input • score(x, y−t, y, t) for all (y−t, y) ∈ Y^l, t ∈ {1,..,l} • C: a set of constraints over Y • Find y that • maximizes the global function score(x,y) = Σt score(x, y−t, yt, t) • satisfies the constraints C
Integer Linear Programming • Boolean variables: U = (U1,...,Ud) ∈ {0,1}^d • Cost vector: p = (p1,…,pd) ∈ R^d • Cost function: p·U • Constraint matrix: c ∈ R^(e×d) • Maximize p·U • Subject to c·U ≥ 0 (c·U = 0, c·U ≥ 3, etc. also possible)
ILP (Example) • U = (U1, U2, U3) • p = (0.3, 0.5, 0.8) • c = [ 1 2 3 ; −1 −2 2 ; 0 −3 2 ] • Maximize p·U • Subject to c·U ≥ 0
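A minimal sketch of this three-variable ILP, assuming the PuLP library (with its bundled CBC solver) is available; the variable names and printed optimum are illustrative only.

```python
import pulp

p = [0.3, 0.5, 0.8]                        # cost vector
c = [[1, 2, 3], [-1, -2, 2], [0, -3, 2]]   # constraint matrix

prob = pulp.LpProblem("ilp_example", pulp.LpMaximize)
U = [pulp.LpVariable(f"U{i+1}", cat="Binary") for i in range(3)]
prob += pulp.lpSum(p[i] * U[i] for i in range(3))            # maximize p.U
for row in c:                                                # subject to c.U >= 0
    prob += pulp.lpSum(row[i] * U[i] for i in range(3)) >= 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(u.varValue) for u in U], pulp.value(prob.objective))
# expected: [1, 0, 1] with objective 1.1
```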
Boolean Functions as Linear Constraints • Conjunction • U1 ∧ U2 ∧ U3 ⇔ U1 = 1, U2 = 1, U3 = 1 • Disjunction • U1 ∨ U2 ∨ U3 ⇔ U1 + U2 + U3 ≥ 1 • CNF • (U1 ∨ U2) ∧ (U3 ∨ U4) ⇔ U1 + U2 ≥ 1, U3 + U4 ≥ 1
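A quick brute-force check (plain Python, no solver) that the CNF encoding above admits exactly the assignments that satisfy the Boolean formula; the variable names are just the ones from the slide.

```python
from itertools import product

# CNF: (U1 or U2) and (U3 or U4)  vs.  linear encoding: U1+U2 >= 1 and U3+U4 >= 1
for U1, U2, U3, U4 in product([0, 1], repeat=4):
    boolean = (U1 or U2) and (U3 or U4)
    linear = (U1 + U2 >= 1) and (U3 + U4 >= 1)
    assert bool(boolean) == linear   # the two formulations agree on every assignment
print("CNF encoding verified on all 16 assignments")
```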
Text Chunking • Indicator variables • U1,NP, U1,NULL, U2,VP, ... correspond to y1 = NP, y1 = NULL, y2 = VP, ... • U1,NP indicates that phrase 1 is labeled NP • Cost vector • p1,NP = score(x, NP, 1) • p1,NULL = score(x, NULL, 1) • p2,VP = score(x, VP, 2) • ... • p·U = score(x,y) = Σt score(x, yt, t), subject to constraints
Structural Constraints • Coherency • yt can take only one value • Σy∈{NP,..,NULL} Ut,y = 1 • Non-overlapping • if phrases 1 and 2 overlap • U1,NULL + U2,NULL ≥ 1
Linguistic Constraints • Every sentence must have at least one VP • Σt Ut,VP ≥ 1 • Every sentence must have at least one NP • Σt Ut,NP ≥ 1 • ... (see the end-to-end sketch below)
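A minimal sketch of the chunking formulation with coherency, non-overlap, and linguistic constraints, using brute-force enumeration instead of an ILP solver; the phrase scores, the overlapping pair, and the number of candidate phrases are made up for illustration.

```python
from itertools import product

LABELS = ["NP", "VP", "ADVP", "ADJP", "NULL"]
# hypothetical scores score(x, y, t) for 3 candidate phrases
score = [
    {"NP": 0.3, "VP": 0.5, "ADVP": 0.2, "ADJP": 1.2, "NULL": 0.1},
    {"NP": 0.9, "VP": 0.1, "ADVP": 0.2, "ADJP": 0.1, "NULL": 0.4},
    {"NP": 0.2, "VP": 0.8, "ADVP": 0.3, "ADJP": 0.2, "NULL": 0.6},
]
overlapping = [(0, 1)]          # assume phrases 0 and 1 overlap

best, best_y = float("-inf"), None
for y in product(LABELS, repeat=3):             # coherency: exactly one label per phrase
    if any(y[a] != "NULL" and y[b] != "NULL" for a, b in overlapping):
        continue                                # non-overlap: overlapping phrases can't both be real
    if "VP" not in y or "NP" not in y:
        continue                                # linguistic: at least one VP and one NP
    total = sum(score[t][y[t]] for t in range(3))
    if total > best:
        best, best_y = total, y
print(best_y, best)                             # e.g. ('NULL', 'NP', 'VP') 1.8
```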
Interacting Classifiers • The classifier for an output yt uses the other outputs y−t as inputs • score(x, y−t, y, t) • Need to ensure that the final output from the ILP is computed from a consistent y • Introduce additional variables • Introduce additional coherency constraints
Interacting Classifiers • Additional variables • one indicator UY,y for each possible assignment (y−t, y) • Additional coherency constraints • UY,y = 1 iff Ut,yt = 1 for all yt in y • Σyt in y Ut,yt − UY,y ≤ l − 1 • Σyt in y Ut,yt − l·UY,y ≥ 0
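A brute-force verification that the two inequalities really force the global indicator to equal the conjunction of its component indicators; l = 3 is picked only for the demo.

```python
from itertools import product

l = 3  # number of component indicators U_{t,y_t}

for bits in product([0, 1], repeat=l):
    for U_glob in (0, 1):
        feasible = (sum(bits) - U_glob <= l - 1) and (sum(bits) - l * U_glob >= 0)
        # the only feasible U_glob is the conjunction of the component bits
        assert feasible == (U_glob == int(all(bits)))
print("linking constraints verified")
```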
Learning Classifiers • score(x, y−t, y, t) = wy·Φ(x, y−t, t), a linear function over features Φ • Learn a weight vector wy for each y ∈ Y • Multi-class learning • Each example (x,y) yields the l training examples {(Φ(x, y−t, t), yt)}t=1..l • Learn each classifier independently
Learn with Inference Feedback • Learn by observing global behavior • For each example (x,y) • Make a prediction with the current classifiers and ILP • y′ = argmaxy Σt score(x, y−t, yt, t) • For each t, update • If y′t ≠ yt • Promote score(x, y−t, yt, t) • Demote score(x, y′−t, y′t, t)
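A schematic of this training loop written as a structured-perceptron-style update over linear scorers; `ilp_inference` and the feature map `phi` are placeholders for an actual solver and feature extractor, and all names are assumptions of this sketch.

```python
import numpy as np

def without(ys, t):
    """y_minus_t: all outputs except position t."""
    return ys[:t] + ys[t + 1:]

def train_with_inference_feedback(data, labels, phi, ilp_inference, dim, epochs=5):
    """data: inputs x; labels: gold structures y (length-l tuples).
    phi(x, y_minus_t, t) -> feature vector; ilp_inference(x, W) -> predicted structure.
    W maps each output label y to a weight vector, so score(x, y_minus_t, y, t) = W[y].phi(...)."""
    W = {y: np.zeros(dim) for ys in labels for y in ys}
    for _ in range(epochs):
        for x, y in zip(data, labels):
            y_pred = ilp_inference(x, W)              # global prediction under current weights
            for t, (gold, pred) in enumerate(zip(y, y_pred)):
                if pred != gold:                      # feedback only where the global output is wrong
                    W[gold] += phi(x, without(y, t), t)        # promote the gold label's scorer
                    W.setdefault(pred, np.zeros(dim))
                    W[pred] -= phi(x, without(y_pred, t), t)   # demote the predicted label's scorer
    return W
```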
Experiments • Semantic Role Labeling • Assume correct argument boundaries are given • Only sentences with more than 5 arguments are included
Experimental Results (Winnow and Perceptron) • For the difficult task: inference feedback during training improves performance • For the easy task: learning without inference feedback is better
Conservative Updating • Update only if necessary • Example • Constraint: U1 + U2 = 1 • Predicted (U1, U2) = (1, 0) • Correct (U1, U2) = (0, 1) • Naive feedback: demote class 1, promote class 2 • But U1 = 0 already forces U2 = 1, so it suffices to demote class 1 only
Conservative Updating • S = minset(Constraints) • The minimal set of classifiers that, if changed, would make the global prediction correct • Promote (demote) only the classifiers in the minset S
Hierarchical Learning • Given x • Compute hierarchically • z1 = h1(x) • z2 = h2(x,z1) • … • y = hs+1(x,z1,…,zs) • Assume all z are known in training
Hierarchical Learning • Assume each hj can be computed via ILP: pj, Uj, cj • y = argmaxy maxz1,…,zs Σj λj pj·Uj • Subject to • c1·U1 ≥ 0, c2·U2 ≥ 0, …, cs+1·Us+1 ≥ 0 • where λj is a constant large enough to preserve the hierarchy
Hidden Variables • Given x • y = h(x,z) • z is not known in training • y = argmaxy maxz Σt score(x, z, y−t, yt, t) • Subject to some constraints
Learning with Hidden Variables • Truncated EM-style learning • For each example (x,y) • Compute z with the current classifiers and ILP • z = argmaxz Σt score(x, z, y−t, yt, t) • Make a prediction with the current classifiers and ILP • (y′, z′) = argmaxy,z Σt score(x, z, y−t, yt, t) • For each t, update • If y′t ≠ yt • Promote score(x, z, y−t, yt, t) • Demote score(x, z′, y′−t, y′t, t)
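A schematic of this truncated-EM-style loop, mirroring the earlier inference-feedback sketch; `ilp_complete_hidden`, `ilp_predict`, `promote`, and `demote` are hypothetical stand-ins for the two constrained argmax problems and for the per-classifier updates.

```python
def train_with_hidden_variables(data, labels, ilp_complete_hidden, ilp_predict,
                                promote, demote, epochs=5):
    """ilp_complete_hidden(x, y) -> z: best hidden assignment given the gold y.
    ilp_predict(x) -> (y_pred, z_pred): joint prediction under the constraints.
    promote/demote update one position's scorer, as in the inference-feedback sketch."""
    for _ in range(epochs):
        for x, y in zip(data, labels):
            z = ilp_complete_hidden(x, y)          # fill in hidden variables for the gold output
            y_pred, z_pred = ilp_predict(x)        # current global prediction
            for t, (gold, pred) in enumerate(zip(y, y_pred)):
                if pred != gold:                   # perceptron-style feedback on mistakes
                    promote(x, z, y, gold, t)
                    demote(x, z_pred, y_pred, pred, t)
```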
Conclusion • ILP is • powerful • general • learnable • useful • fast (or at least not too slow) • extensible
Boolean Functions as Linear Constraints • Conjunction • a ∧ b ∧ c ⇔ Ua + Ub + Uc ≥ 3 • Disjunction • a ∨ b ∨ c ⇔ Ua + Ub + Uc ≥ 1 • DNF • (a ∧ b) ∨ (c ∧ d) ⇔ Iab + Icd ≥ 1 • Introduce new variables Iab, Icd
Helper Variables • We must link Ia, Ib, and Iab so that Iab ⇔ Ia ∧ Ib • Ia ∧ Ib ⇒ Iab • Ia + Ib ≤ Iab + 1 • Iab ⇒ Ia ∧ Ib • 2·Iab ≤ Ia + Ib
Semantic Role Labeling • Indicator variables • Ia, Ib, Ic, ... correspond to ph1 = A0, ph1 = A1, ph2 = A0, ... • Ia indicates that phrase 1 is labeled A0 • Cost vector • pa = score(ph1 = A0) • pb = score(ph1 = A1) • ... • pa·Ia = 0.3 if Ia = 1 and 0 otherwise
Learning • X = (X1,...,Xk) ∈ X1 × … × Xk = X • Y−t = (Y1,...,Yt−1, Yt+1,...,Yl) ∈ Y1 × … × Yt−1 × Yt+1 × … × Yl = Y−t • Yt ∈ Yt • Given X = x and Y−t = y−t, find Yt = yt, or a score for each possible yt • h: X × Y−t → Yt or h: X × Y−t × Yt → R
Outline • Find potential argument candidates • Classify arguments to types • Inference for Argument Structure • Integer linear programming (ILP) • Cost Function • Constraints • Features
Find Potential Arguments • [Example: candidate argument spans bracketed over "I left my nice pearls to her"] • Every chunk can be an argument • Restrict potential arguments • BEGIN(word) • BEGIN(word) = 1 ⇔ "word begins an argument" • END(word) • END(word) = 1 ⇔ "word ends an argument" • Argument • (wi,...,wj) is a potential argument iff • BEGIN(wi) = 1 and END(wj) = 1 • This reduces the set of potential arguments
Details... • BEGIN(word) • Learn a function B(word, context, structure) → {0,1} • END(word) • Learn a function E(word, context, structure) → {0,1} • POTARG = {arg | BEGIN(first(arg)) and END(last(arg))} (see the sketch below)
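A small sketch of candidate generation from the BEGIN/END decisions; the two predictors are faked with fixed 0/1 lists here, since the slides do not specify their features.

```python
# Hypothetical BEGIN/END decisions for "I left my nice pearls to her".
words = ["I", "left", "my", "nice", "pearls", "to", "her"]
begin = [1, 0, 1, 1, 0, 1, 1]   # BEGIN(word) = 1: word can start an argument
end   = [1, 0, 0, 0, 1, 0, 1]   # END(word)   = 1: word can end an argument

# POTARG = every span (i, j) whose first word can begin and last word can end an argument.
potarg = [(i, j) for i in range(len(words)) if begin[i]
                 for j in range(i, len(words)) if end[j]]
print([" ".join(words[i:j + 1]) for i, j in potarg])
```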
Argument Type Likelihood • [Example: the candidate spans over "I left my nice pearls to her" again] • Assign a type likelihood to each candidate • How likely is it that arg a is of type t? • For all a ∈ POTARG, t ∈ T, estimate P(argument a = type t) • e.g., for types (A0, CA1, A1, Ø): P(a1 = ·) = (0.3, 0.2, 0.2, 0.3), P(a2 = ·) = (0.6, 0.0, 0.0, 0.4)
Details... • Learn a classifier ARGTYPE(arg) • Φ(arg) → {A0, A1, ..., CA0, ..., LOC, ...} • ARGTYPE(arg) = argmaxt∈{A0,A1,...,CA0,...,LOC,...} wt·Φ(arg) • Estimate probabilities • P(a = t) = wt·Φ(a) / Z
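A tiny sketch of turning per-type scores into the normalized likelihoods used by the inference step; the scores are invented, and plain sum-normalization is used as the slide indicates (a softmax over scores would be a common alternative, not what the slide states).

```python
def type_probabilities(scores):
    """scores: nonnegative per-type scores w_t.phi(a) for one candidate argument.
    Returns P(a = t) = score_t / Z, with Z the sum of all scores."""
    Z = sum(scores.values())
    return {t: s / Z for t, s in scores.items()}

# hypothetical scores for one candidate argument
print(type_probabilities({"A0": 0.6, "CA1": 0.2, "A1": 0.8, "NULL": 0.4}))
```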
What is a Good Assignment? • Likelihood of being correct • P(Arg a = Type t), where t is the correct type for argument a • For a set of arguments a1, a2, ..., an • Expected number of correct arguments: Σi P(ai = ti) • We search for the assignment that maximizes the expected number correct
Inference • Maximize the expected number correct • T* = argmaxT Σi P(ai = ti) • Subject to some constraints • Structural and linguistic • Example scores for four candidates over "I left my nice pearls to her" (types A0, CA1, A1, Ø): (0.3, 0.2, 0.2, 0.3), (0.6, 0.0, 0.0, 0.4), (0.1, 0.3, 0.5, 0.1), (0.1, 0.2, 0.3, 0.4) • Independent max: cost = 0.3 + 0.6 + 0.5 + 0.4 = 1.8 • Non-overlapping: cost = 0.3 + 0.4 + 0.5 + 0.4 = 1.6 • Non-overlapping + linguistic: cost = 0.3 + 0.4 + 0.3 + 0.4 = 1.4
Everything is Linear • Cost function • Σa∈POTARG P(a = ta) = Σa∈POTARG, t∈T P(a = t)·Iat • Constraints • Non-overlapping: if a and a′ overlap, IaØ + Ia′Ø ≥ 1 • Linguistic: CA0 ⇒ A0, e.g. Σa Ia,A0 − Σa Ia,CA0 ≥ 0 • So the whole problem is an Integer Linear Program
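A brute-force sketch of the whole SRL inference problem: pick one type per candidate to maximize Σ P(a = t) under non-overlap and a continuation-needs-its-base constraint. The probability table reuses the example above; the overlapping pair and the CA1 ⇒ A1 rule are assumptions of this sketch.

```python
from itertools import product

TYPES = ["A0", "CA1", "A1", "NULL"]
# P(a = t) for four candidate arguments (from the example table above)
P = [
    {"A0": 0.3, "CA1": 0.2, "A1": 0.2, "NULL": 0.3},
    {"A0": 0.6, "CA1": 0.0, "A1": 0.0, "NULL": 0.4},
    {"A0": 0.1, "CA1": 0.3, "A1": 0.5, "NULL": 0.1},
    {"A0": 0.1, "CA1": 0.2, "A1": 0.3, "NULL": 0.4},
]
overlapping = [(1, 2)]   # assume candidates 1 and 2 overlap

best, best_T = float("-inf"), None
for T in product(TYPES, repeat=4):
    if any(T[a] != "NULL" and T[b] != "NULL" for a, b in overlapping):
        continue                                  # non-overlap
    if "CA1" in T and "A1" not in T:
        continue                                  # continuation needs its base argument
    expected_correct = sum(P[i][T[i]] for i in range(4))
    if expected_correct > best:
        best, best_T = expected_correct, T
print(best_T, best)
```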
Features are Important • Here, a discussion of the features should go • Which are most important? • Comparison with other systems