Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani and Dan Roth, University of Illinois at Urbana-Champaign

Structured Prediction
• Predict y = {y1, y2, …, yn} ∈ Y given input x
• Features: Φ(x, y); weight parameters: w
• Inference: argmax_{y ∈ Y} f(x, y) = w · Φ(x, y)
• Learning: estimate w

Global Learning (GL)
• Structural SVMs learn by minimizing a regularized structured hinge loss, with global inference as an intermediate step
• [Figure: GL training loop alternating exact inference and weight updates]
• Global inference is slow ⇒ global learning is time consuming!

Baseline: Local Learning (LL)
• Approximations to GL that ignore certain structural interactions so that the remaining structure becomes easy to learn
• E.g. ignoring global constraints or pairwise interactions in a Markov network
• Another baseline: LL+C, where we learn the pieces independently and apply full structural inference (e.g. constraints, if available) at test time
• Fast, but oblivious to rich structural information!

DecL: Learning via Decompositions
• Learn by varying a subset of the output variables while fixing the remaining variables to their gold labels in yj
• [Figure: example decompositions over output variables y1–y6]
• A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables over which we perform the argmax: Sj = {s1, …, sl | ∀i, si ⊆ {1, …, n}; ∀i, k, si ⊄ sk}
• Learning with decompositions in which all subsets of size k are considered: DecL-k (see the sketch below)
• In practice, decompositions based on domain knowledge group highly coupled variables together
• Reduce inference to a neighborhood nbr(yj) around yj
• Small neighborhoods ⇒ efficient learning
• We show, both theoretically and experimentally, that Decomposed Learning with small neighborhoods can be identical to Global Learning (GL)
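The poster contains no code; the following is a minimal Python sketch, purely for illustration, of what the DecL-k neighborhood nbr(yj) looks like for binary output variables and of a perceptron-style loss-augmented update whose argmax is restricted to nbr(yj) rather than the full output space Y. The paper itself trains a structural SVM; the function names (decl_k_neighborhood, decl_update, phi, hamming) and the perceptron-style update are assumptions made here, not the authors' implementation.

```python
# Illustrative sketch only (not from the poster/paper): the DecL-k
# neighborhood nbr(yj) for binary output variables, and a perceptron-style
# loss-augmented update restricted to that neighborhood instead of the
# full output space Y.
from itertools import combinations
import numpy as np

def decl_k_neighborhood(y_gold, k):
    """nbr(y_gold) for DecL-k over binary variables: every labeling that
    differs from the gold labeling in at most k positions (each subset of
    size <= k is re-assigned; the remaining variables stay at gold)."""
    nbr = []
    for size in range(1, k + 1):
        for subset in combinations(range(len(y_gold)), size):
            y = list(y_gold)
            for i in subset:
                y[i] = 1 - y[i]
            nbr.append(tuple(y))
    return nbr

def hamming(y_a, y_b):
    """Hamming distance, used as the loss term Delta(yj, y)."""
    return sum(a != b for a, b in zip(y_a, y_b))

def decl_update(w, phi, x, y_gold, k, lr=0.1):
    """One decomposed loss-augmented update: find the most violating y
    inside nbr(y_gold) and move w toward the gold features whenever the
    margin f(x, y_gold) >= f(x, y) + Delta(y_gold, y) is violated there."""
    best_y, best_violation = None, 0.0
    for y in decl_k_neighborhood(y_gold, k):
        violation = (w @ phi(x, y) + hamming(y_gold, y)) - w @ phi(x, y_gold)
        if violation > best_violation:
            best_y, best_violation = y, violation
    if best_y is not None:
        w = w + lr * (phi(x, y_gold) - phi(x, best_y))
    return w
```

For DecL-1 the neighborhood contains only the n single-variable flips, which corresponds to pseudomax-style training (DecL-1 aka Pseudomax in the experiments below); larger k moves the restricted argmax closer to global learning.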
Theoretical Results: Decompositions which yield Exactness
• W* = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ Y, ∀(xj, yj) in the training data}
• Wdecl = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ nbr(yj), ∀(xj, yj) in the training data}
• Exactness: DecL is exact if it has the same set of separating weights as GL, i.e. Wdecl = W*
• Exactness with finite data is much more useful than asymptotic consistency
• Main Theorem: DecL is exact if ∀w ∈ W*, ∃ε > 0 such that ∀w′ ∈ B(w, ε) and ∀(xj, yj) ∈ D: if ∃y ∈ Y with f(xj, y; w′) + Δ(yj, y) > f(xj, yj; w′), then ∃y′ ∈ nbr(yj) with f(xj, y′; w′) + Δ(yj, y′) > f(xj, yj; w′)
• Intuition: for weights immediately outside W*, global inseparability ⇒ DecL inseparability

Exactness for Special Cases
• Pairwise Markov Network (PMN) over a graph with edge set E
• Assume domain knowledge on W*: we know that for every separating w, each pairwise potential Φi,k(·; w) is either
• Submodular: Φi,k(0,0) + Φi,k(1,1) > Φi,k(0,1) + Φi,k(1,0), OR
• Supermodular: Φi,k(0,0) + Φi,k(1,1) < Φi,k(0,1) + Φi,k(1,0)
• [Figure: constructing the edge subset Ej ⊆ E from the gold labels (0/1) and the sub(Φ)/sup(Φ) type of each edge]
• Theorem: the Spair decomposition consisting of the connected components of Ej yields exactness (see the supplementary sketch at the end)
• Linear scoring functions structured via constraints on Y
• For simple constraints, it is possible to show exactness for decompositions with set sizes independent of n
• Theorem: if Y is specified by k OR-constraints, then DecL-(k+1) is exact
• Example consequence: when Y is specified by k horn clauses
  y1,1 ∧ y1,2 ∧ … ∧ y1,r → y1,r+1,
  y2,1 ∧ y2,2 ∧ … ∧ y2,r → y2,r+1,
  …,
  yk,1 ∧ yk,2 ∧ … ∧ yk,r → yk,r+1,
  decompositions with set size (k+1), i.e. independent of the number of variables r in the constraints, yield exactness

Experiments
• Synthetic data: random linear scoring function with random constraints
  • [Figure: Avg. Hamming Loss vs. number of training examples for Local Learning (LL) baselines, DecL-1 (aka Pseudomax), Global Learning (GL), and DecL-2,3]
• Information extraction:
  • Given a citation, extract author, book-title, title, etc.
  • Given ads text, extract features, size, neighborhood, etc.
  • Constraints like: ‘Title’ tokens are likely to appear together in a single block; a paper should have at most one ‘title’
  • Domain knowledge: the HMM transition matrix is diagonal-heavy (a generalization of submodular pairwise potentials)
  • [Figure: Accuracy and Training Time (hours)]
• Multi-label document classification:
  • Experiments on Reuters data
  • Documents with multiple labels: corn, crude, earn, grain, interest, …
  • Modeled as a PMN over a complete graph over the labels, with singleton and pairwise components
  • [Figure: F1 Scores and Training Time (hours)]

Supported by the Army Research Laboratory (ARL), the Defense Advanced Research Projects Agency (DARPA), and the Office of Naval Research (ONR).
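Supplementary sketch (not from the paper): two ingredients of the Spair result in "Exactness for Special Cases" above, assuming each pairwise potential is given as a 2x2 table over binary labels. It classifies a potential as submodular or supermodular using the inequalities stated on the poster and takes connected components of an edge subset Ej to form the decomposition blocks; the rule that derives Ej from the gold labeling (the 0/1 vs. sub(Φ)/sup(Φ) table in the poster figure) is not reproduced here, so the Ej in the example is hypothetical.

```python
# Supplementary sketch (not from the paper).  Assumes pairwise potentials
# over binary labels given as 2x2 tables; the edge subset Ej is taken as
# given rather than derived from the gold labeling.

def potential_type(table):
    """Classify phi_{i,k} using the poster's inequalities:
    submodular   if phi(0,0) + phi(1,1) > phi(0,1) + phi(1,0),
    supermodular if phi(0,0) + phi(1,1) < phi(0,1) + phi(1,0)."""
    diag = table[0][0] + table[1][1]
    off = table[0][1] + table[1][0]
    if diag > off:
        return "submodular"
    if diag < off:
        return "supermodular"
    return "neither"

def connected_components(nodes, edges):
    """Connected components of the graph (nodes, edges); each component is
    one set of the Spair decomposition, i.e. one block of variables over
    which DecL performs its argmax jointly."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for i, k in edges:
        parent[find(i)] = find(k)
    components = {}
    for v in nodes:
        components.setdefault(find(v), set()).add(v)
    return list(components.values())

# Hypothetical 3-variable example: edge (0, 1) is submodular, (1, 2) supermodular.
phi = {(0, 1): [[2, 0], [0, 2]], (1, 2): [[0, 3], [3, 0]]}
print({e: potential_type(t) for e, t in phi.items()})
# A hypothetical Ej containing only edge (0, 1) gives blocks {0, 1} and {2}.
print(connected_components({0, 1, 2}, [(0, 1)]))
```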