Inference and Learning via Integer Linear Programming • Vasin, Dan, Scott, Dav
Outline • Problem Definition • Integer Linear Programming (ILP) • Its generality • Learning and Inference via ILP • Experiments • Extension to hierarchical learning • Future Direction • Hidden Variables
Problem Definition • X = (X1,...,Xk) ∈ X1 × … × Xk = X • Y = (Y1,...,Yl) ∈ Y1 × … × Yl = Y • Given X = x, find Y = y • Notation agreements • Capital letters denote variables • Lower-case letters denote values • Bold indicates vectors or matrices • Calligraphic X, Y denote sets
Example (Text Chunking) • x = The guy presenting now is so tired • y = NP ADJP VP ADVP VP
Classifiers • A classifier h: X × Y^(l−1) × Y × {1,..,l} → R • Example • score(x, y−3, NP, 3) = 0.3 • score(x, y−3, VP, 3) = 0.5 • score(x, y−3, ADVP, 3) = 0.2 • score(x, y−3, ADJP, 3) = 1.2 • score(x, y−3, NULL, 3) = 0.1
Inference • Goal: x → y • Given • x: input • score(x, y−t, y, t) for all (y−t, y) ∈ Y^l, t ∈ {1,..,l} • C: a set of constraints over Y • Find y that • maximizes the global function score(x,y) = Σt score(x, y−t, yt, t) • satisfies the constraints C
Integer Linear Programming • Boolean variables: U = (U1,...,Ud) ∈ {0,1}^d • Cost vector: p = (p1,…,pd) ∈ R^d • Cost function: p·U • Constraint matrix: c ∈ R^(e×d) • Maximize p·U • Subject to c·U ≥ 0 (c·U = 0, c·U ≥ 3, etc. also possible)
ILP (Example) • U = (U1, U2, U3) • p = (0.3, 0.5, 0.8) • c = [ 1 2 3 ; −1 −2 2 ; 0 −3 2 ] • Maximize p·U • Subject to c·U ≥ 0
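A minimal sketch of this three-variable ILP, assuming the PuLP library (with its bundled CBC solver) is available; the variable names and printed optimum are illustrative only.

```python
import pulp

p = [0.3, 0.5, 0.8]                        # cost vector
c = [[1, 2, 3], [-1, -2, 2], [0, -3, 2]]   # constraint matrix

prob = pulp.LpProblem("ilp_example", pulp.LpMaximize)
U = [pulp.LpVariable(f"U{i+1}", cat="Binary") for i in range(3)]
prob += pulp.lpSum(p[i] * U[i] for i in range(3))            # maximize p.U
for row in c:                                                # subject to c.U >= 0
    prob += pulp.lpSum(row[i] * U[i] for i in range(3)) >= 0

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([int(u.varValue) for u in U], pulp.value(prob.objective))
# expected: [1, 0, 1] with objective 1.1
```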
Boolean Functions as Linear Constraints • Conjunction • U1 ∧ U2 ∧ U3 ⇔ U1 = 1, U2 = 1, U3 = 1 • Disjunction • U1 ∨ U2 ∨ U3 ⇔ U1 + U2 + U3 ≥ 1 • CNF • (U1 ∨ U2) ∧ (U3 ∨ U4) ⇔ U1 + U2 ≥ 1, U3 + U4 ≥ 1
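A quick brute-force check (plain Python, no solver) that the CNF encoding above admits exactly the assignments that satisfy the Boolean formula; the variable names are just the ones from the slide.

```python
from itertools import product

# CNF: (U1 or U2) and (U3 or U4)  vs.  linear encoding: U1+U2 >= 1 and U3+U4 >= 1
for U1, U2, U3, U4 in product([0, 1], repeat=4):
    boolean = (U1 or U2) and (U3 or U4)
    linear = (U1 + U2 >= 1) and (U3 + U4 >= 1)
    assert bool(boolean) == linear   # the two formulations agree on every assignment
print("CNF encoding verified on all 16 assignments")
```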
Text Chunking • Indicator variables • U1,NP, U1,NULL, U2,VP, ... correspond to y1 = NP, y1 = NULL, y2 = VP, ... • U1,NP indicates that phrase 1 is labeled NP • Cost vector • p1,NP = score(x, NP, 1) • p1,NULL = score(x, NULL, 1) • p2,VP = score(x, VP, 2) • ... • p·U = score(x,y) = Σt score(x, yt, t), subject to constraints
Structural Constraints • Coherency • yt can take only one value • Σy∈{NP,..,NULL} Ut,y = 1 • Non-overlapping • if phrases 1 and 2 overlap • U1,NULL + U2,NULL ≥ 1
Linguistic Constraints • Every sentence must have at least one VP • Σt Ut,VP ≥ 1 • Every sentence must have at least one NP • Σt Ut,NP ≥ 1 • ... (see the end-to-end sketch below)
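A minimal sketch of the chunking formulation with coherency, non-overlap, and linguistic constraints, using brute-force enumeration instead of an ILP solver; the phrase scores, the overlapping pair, and the number of candidate phrases are made up for illustration.

```python
from itertools import product

LABELS = ["NP", "VP", "ADVP", "ADJP", "NULL"]
# hypothetical scores score(x, y, t) for 3 candidate phrases
score = [
    {"NP": 0.3, "VP": 0.5, "ADVP": 0.2, "ADJP": 1.2, "NULL": 0.1},
    {"NP": 0.9, "VP": 0.1, "ADVP": 0.2, "ADJP": 0.1, "NULL": 0.4},
    {"NP": 0.2, "VP": 0.8, "ADVP": 0.3, "ADJP": 0.2, "NULL": 0.6},
]
overlapping = [(0, 1)]          # assume phrases 0 and 1 overlap

best, best_y = float("-inf"), None
for y in product(LABELS, repeat=3):             # coherency: exactly one label per phrase
    if any(y[a] != "NULL" and y[b] != "NULL" for a, b in overlapping):
        continue                                # non-overlap: overlapping phrases can't both be real
    if "VP" not in y or "NP" not in y:
        continue                                # linguistic: at least one VP and one NP
    total = sum(score[t][y[t]] for t in range(3))
    if total > best:
        best, best_y = total, y
print(best_y, best)                             # e.g. ('NULL', 'NP', 'VP') 1.8
```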
Interacting Classifiers • The classifier for an output yt uses the other outputs y−t as inputs • score(x, y−t, y, t) • Need to ensure that the final output from the ILP is computed from a consistent y • Introduce additional variables • Introduce additional coherency constraints
Interacting Classifiers • Additional variables • one indicator UY,y for each possible assignment (y−t, y) • Additional coherency constraints • UY,y = 1 iff Ut,yt = 1 for all yt in y • Σyt in y Ut,yt − UY,y ≤ l − 1 • Σyt in y Ut,yt − l·UY,y ≥ 0
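A brute-force verification that the two inequalities really force the global indicator to equal the conjunction of its component indicators; l = 3 is picked only for the demo.

```python
from itertools import product

l = 3  # number of component indicators U_{t,y_t}

for bits in product([0, 1], repeat=l):
    for U_glob in (0, 1):
        feasible = (sum(bits) - U_glob <= l - 1) and (sum(bits) - l * U_glob >= 0)
        # the only feasible U_glob is the conjunction of the component bits
        assert feasible == (U_glob == int(all(bits)))
print("linking constraints verified")
```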
Learning Classifiers • score(x, y−t, y, t) = wy·Φ(x, y−t, t), a linear function over features Φ • Learn a weight vector wy for each y ∈ Y • Multi-class learning • Each example (x,y) yields the l training examples {(Φ(x, y−t, t), yt)}t=1..l • Learn each classifier independently
Learn with Inference Feedback • Learn by observing global behavior • For each example (x,y) • Make a prediction with the current classifiers and ILP • y′ = argmaxy Σt score(x, y−t, yt, t) • For each t, update • If y′t ≠ yt • Promote score(x, y−t, yt, t) • Demote score(x, y′−t, y′t, t)
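A schematic of this training loop written as a structured-perceptron-style update over linear scorers; `ilp_inference` and the feature map `phi` are placeholders for an actual solver and feature extractor, and all names are assumptions of this sketch.

```python
import numpy as np

def without(ys, t):
    """y_minus_t: all outputs except position t."""
    return ys[:t] + ys[t + 1:]

def train_with_inference_feedback(data, labels, phi, ilp_inference, dim, epochs=5):
    """data: inputs x; labels: gold structures y (length-l tuples).
    phi(x, y_minus_t, t) -> feature vector; ilp_inference(x, W) -> predicted structure.
    W maps each output label y to a weight vector, so score(x, y_minus_t, y, t) = W[y].phi(...)."""
    W = {y: np.zeros(dim) for ys in labels for y in ys}
    for _ in range(epochs):
        for x, y in zip(data, labels):
            y_pred = ilp_inference(x, W)              # global prediction under current weights
            for t, (gold, pred) in enumerate(zip(y, y_pred)):
                if pred != gold:                      # feedback only where the global output is wrong
                    W[gold] += phi(x, without(y, t), t)        # promote the gold label's scorer
                    W.setdefault(pred, np.zeros(dim))
                    W[pred] -= phi(x, without(y_pred, t), t)   # demote the predicted label's scorer
    return W
```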
Experiments • Semantic Role Labeling • Assume correct argument boundaries are given • Only sentences with more than 5 arguments are included
Experimental Results (Winnow and Perceptron) • For the difficult task: inference feedback during training improves performance • For the easy task: learning without inference feedback is better
Conservative Updating • Update only if necessary • Example • Constraint: U1 + U2 = 1 • Predicted (U1, U2) = (1, 0) • Correct (U1, U2) = (0, 1) • Naive feedback: demote class 1, promote class 2 • But U1 = 0 already forces U2 = 1, so it suffices to demote class 1 only
Conservative Updating • S = minset(Constraints) • The minimal set of classifiers that, if changed, would make the global prediction correct • Promote (demote) only the classifiers in the minset S
Hierarchical Learning • Given x • Compute hierarchically • z1 = h1(x) • z2 = h2(x,z1) • … • y = hs+1(x,z1,…,zs) • Assume all z are known in training
Hierarchical Learning • Assume each hj can be computed via ILP: pj, Uj, cj • y = argmaxy maxz1,…,zs Σj λj pj·Uj • Subject to • c1·U1 ≥ 0, c2·U2 ≥ 0, …, cs+1·Us+1 ≥ 0 • where λj is a constant large enough to preserve the hierarchy
Hidden Variables • Given x • y = h(x,z) • z is not known in training • y = argmaxy maxz Σt score(x, z, y−t, yt, t) • Subject to some constraints
Learning with Hidden Variables • Truncated EM-style learning • For each example (x,y) • Compute z with the current classifiers and ILP • z = argmaxz Σt score(x, z, y−t, yt, t) • Make a prediction with the current classifiers and ILP • (y′, z′) = argmaxy,z Σt score(x, z, y−t, yt, t) • For each t, update • If y′t ≠ yt • Promote score(x, z, y−t, yt, t) • Demote score(x, z′, y′−t, y′t, t)
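A schematic of this truncated-EM-style loop, mirroring the earlier inference-feedback sketch; `ilp_complete_hidden`, `ilp_predict`, `promote`, and `demote` are hypothetical stand-ins for the two constrained argmax problems and for the per-classifier updates.

```python
def train_with_hidden_variables(data, labels, ilp_complete_hidden, ilp_predict,
                                promote, demote, epochs=5):
    """ilp_complete_hidden(x, y) -> z: best hidden assignment given the gold y.
    ilp_predict(x) -> (y_pred, z_pred): joint prediction under the constraints.
    promote/demote update one position's scorer, as in the inference-feedback sketch."""
    for _ in range(epochs):
        for x, y in zip(data, labels):
            z = ilp_complete_hidden(x, y)          # fill in hidden variables for the gold output
            y_pred, z_pred = ilp_predict(x)        # current global prediction
            for t, (gold, pred) in enumerate(zip(y, y_pred)):
                if pred != gold:                   # perceptron-style feedback on mistakes
                    promote(x, z, y, gold, t)
                    demote(x, z_pred, y_pred, pred, t)
```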
Conclusion • ILP is • powerful • general • learnable • useful • fast (or at least not too slow) • extensible
Boolean Functions as Linear Constraints • Conjunction • a ∧ b ∧ c ⇔ Ua + Ub + Uc ≥ 3 • Disjunction • a ∨ b ∨ c ⇔ Ua + Ub + Uc ≥ 1 • DNF • (a ∧ b) ∨ (c ∧ d) ⇔ Iab + Icd ≥ 1 • Introduce new variables Iab, Icd
Helper Variables • We must link Ia, Ib, and Iab so that Iab ⇔ Ia ∧ Ib • Ia ∧ Ib ⇒ Iab • Ia + Ib ≤ Iab + 1 • Iab ⇒ Ia ∧ Ib • 2·Iab ≤ Ia + Ib
Semantic Role Labeling • Indicator variables • Ia, Ib, Ic, ... correspond to ph1 = A0, ph1 = A1, ph2 = A0, ... • Ia indicates that phrase 1 is labeled A0 • Cost vector • pa = score(ph1 = A0) • pb = score(ph1 = A1) • ... • pa·Ia = 0.3 if Ia = 1 and 0 otherwise
Learning • X = (X1,...,Xk) ∈ X1 × … × Xk = X • Y−t = (Y1,...,Yt−1, Yt+1,...,Yl) ∈ Y1 × … × Yt−1 × Yt+1 × … × Yl = Y−t • Yt ∈ Yt • Given X = x and Y−t = y−t, find Yt = yt, or a score for each possible yt • h: X × Y−t → Yt or h: X × Y−t × Yt → R
Outline • Find potential argument candidates • Classify arguments to types • Inference for Argument Structure • Integer linear programming (ILP) • Cost Function • Constraints • Features
Find Potential Arguments • [Example: candidate argument spans bracketed over "I left my nice pearls to her"] • Every chunk can be an argument • Restrict potential arguments • BEGIN(word) • BEGIN(word) = 1 ⇔ "word begins an argument" • END(word) • END(word) = 1 ⇔ "word ends an argument" • Argument • (wi,...,wj) is a potential argument iff • BEGIN(wi) = 1 and END(wj) = 1 • This reduces the set of potential arguments
Details... • BEGIN(word) • Learn a function B(word, context, structure) → {0,1} • END(word) • Learn a function E(word, context, structure) → {0,1} • POTARG = {arg | BEGIN(first(arg)) and END(last(arg))} (see the sketch below)
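A small sketch of candidate generation from the BEGIN/END decisions; the two predictors are faked with fixed 0/1 lists here, since the slides do not specify their features.

```python
# Hypothetical BEGIN/END decisions for "I left my nice pearls to her".
words = ["I", "left", "my", "nice", "pearls", "to", "her"]
begin = [1, 0, 1, 1, 0, 1, 1]   # BEGIN(word) = 1: word can start an argument
end   = [1, 0, 0, 0, 1, 0, 1]   # END(word)   = 1: word can end an argument

# POTARG = every span (i, j) whose first word can begin and last word can end an argument.
potarg = [(i, j) for i in range(len(words)) if begin[i]
                 for j in range(i, len(words)) if end[j]]
print([" ".join(words[i:j + 1]) for i, j in potarg])
```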
Argument Type Likelihood • [Example: the candidate spans over "I left my nice pearls to her" again] • Assign a type likelihood to each candidate • How likely is it that arg a is of type t? • For all a ∈ POTARG, t ∈ T, estimate P(argument a = type t) • e.g., for types (A0, CA1, A1, Ø): P(a1 = ·) = (0.3, 0.2, 0.2, 0.3), P(a2 = ·) = (0.6, 0.0, 0.0, 0.4)
Details... • Learn a classifier ARGTYPE(arg) • Φ(arg) → {A0, A1, ..., CA0, ..., LOC, ...} • ARGTYPE(arg) = argmaxt∈{A0,A1,...,CA0,...,LOC,...} wt·Φ(arg) • Estimate probabilities • P(a = t) = wt·Φ(a) / Z
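A tiny sketch of turning per-type scores into the normalized likelihoods used by the inference step; the scores are invented, and plain sum-normalization is used as the slide indicates (a softmax over scores would be a common alternative, not what the slide states).

```python
def type_probabilities(scores):
    """scores: nonnegative per-type scores w_t.phi(a) for one candidate argument.
    Returns P(a = t) = score_t / Z, with Z the sum of all scores."""
    Z = sum(scores.values())
    return {t: s / Z for t, s in scores.items()}

# hypothetical scores for one candidate argument
print(type_probabilities({"A0": 0.6, "CA1": 0.2, "A1": 0.8, "NULL": 0.4}))
```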
What is a Good Assignment? • Likelihood of being correct • P(Arg a = Type t), where t is the correct type for argument a • For a set of arguments a1, a2, ..., an • Expected number of correct arguments: Σi P(ai = ti) • We search for the assignment that maximizes the expected number correct
Inference • Maximize the expected number correct • T* = argmaxT Σi P(ai = ti) • Subject to some constraints • Structural and linguistic • Example scores for four candidates over "I left my nice pearls to her" (types A0, CA1, A1, Ø): (0.3, 0.2, 0.2, 0.3), (0.6, 0.0, 0.0, 0.4), (0.1, 0.3, 0.5, 0.1), (0.1, 0.2, 0.3, 0.4) • Independent max: cost = 0.3 + 0.6 + 0.5 + 0.4 = 1.8 • Non-overlapping: cost = 0.3 + 0.4 + 0.5 + 0.4 = 1.6 • Non-overlapping + linguistic: cost = 0.3 + 0.4 + 0.3 + 0.4 = 1.4
Everything is Linear • Cost function • Σa∈POTARG P(a = ta) = Σa∈POTARG, t∈T P(a = t)·Iat • Constraints • Non-overlapping: if a and a′ overlap, IaØ + Ia′Ø ≥ 1 • Linguistic: CA0 ⇒ A0, e.g. Σa Ia,A0 − Σa Ia,CA0 ≥ 0 • So the whole problem is an Integer Linear Program
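A brute-force sketch of the whole SRL inference problem: pick one type per candidate to maximize Σ P(a = t) under non-overlap and a continuation-needs-its-base constraint. The probability table reuses the example above; the overlapping pair and the CA1 ⇒ A1 rule are assumptions of this sketch.

```python
from itertools import product

TYPES = ["A0", "CA1", "A1", "NULL"]
# P(a = t) for four candidate arguments (from the example table above)
P = [
    {"A0": 0.3, "CA1": 0.2, "A1": 0.2, "NULL": 0.3},
    {"A0": 0.6, "CA1": 0.0, "A1": 0.0, "NULL": 0.4},
    {"A0": 0.1, "CA1": 0.3, "A1": 0.5, "NULL": 0.1},
    {"A0": 0.1, "CA1": 0.2, "A1": 0.3, "NULL": 0.4},
]
overlapping = [(1, 2)]   # assume candidates 1 and 2 overlap

best, best_T = float("-inf"), None
for T in product(TYPES, repeat=4):
    if any(T[a] != "NULL" and T[b] != "NULL" for a, b in overlapping):
        continue                                  # non-overlap
    if "CA1" in T and "A1" not in T:
        continue                                  # continuation needs its base argument
    expected_correct = sum(P[i][T[i]] for i in range(4))
    if expected_correct > best:
        best, best_T = expected_correct, T
print(best_T, best)
```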
Features are Important • Here, a discussion of the features should go • Which are most important? • Comparison with other systems