Efficient Decomposed Learning for Structured Prediction Rajhans Samdani Joint work with Dan Roth University of Illinois at Urbana-Champaign
Structured Prediction • Structured prediction: predicting a structured output variable y based on the input variable x • y = {y1, y2, …, yn}: the output variables form a structure • Structure comes from interactions between the output variables through mutual correlations and constraints • Such problems occur frequently in • NLP – e.g. predicting the tree-structured parse of a sentence, predicting the entity-relation structure from a document • Computer vision – scene segmentation, body-part identification • Speech processing – capturing relations between phonemes • Computational Biology – protein folding and interactions between different sub-structures • Etc.
Example Problem: Information Extraction • Given citation text, extract author, booktitle, title, etc. Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…. • Given ad text, extract features, size, neighborhood, etc. Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines … • Structure is introduced by correlations between words, e.g. if treated as a sequence-tagging problem • Structure is also introduced by declarative constraints that define the set of feasible assignments • E.g. the 'author' tokens are likely to appear together in a single block • A paper should have at most one 'title' (a feasibility check is sketched below)
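A minimal sketch, not from the paper, of how such declarative constraints could be checked on a candidate tag sequence; the tag names and helper functions are illustrative assumptions.

```python
def is_contiguous(tags, label):
    """True if all tokens carrying the given label form one contiguous block."""
    positions = [i for i, t in enumerate(tags) if t == label]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)

def satisfies_constraints(tags):
    # Constraint 1: the 'author' tokens appear together in a single block.
    # Constraint 2 (simplified): at most one token is labeled 'title'.
    return is_contiguous(tags, "author") and tags.count("title") <= 1

# Example: the second sequence violates both constraints.
print(satisfies_constraints(["author", "author", "title", "booktitle"]))  # True
print(satisfies_constraints(["author", "title", "author", "title"]))      # False
```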
Example problem: Body Part Identification • Count the number of people • Predict the body parts • Correlations • Position of shoulders and heads correlated • Position of torso and legs correlated
Structured Prediction: Inference • Predict the variables in y = {y1, y2, …, yn} ∈ Y together to leverage dependencies (e.g. entity-relation, shoulders-head, information fields, document labels, etc.) between these variables • Inference constitutes predicting the best-scoring structure: argmax over y ∈ Y of f(x, y) • f(x, y) = w · φ(x, y) is called the scoring function, where φ(x, y) are features on the input-output pair, w are the weight parameters (to be estimated during learning), and Y is the set of allowed structures, often specified by constraints (see the sketch below)
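A minimal sketch, not the authors' code, of the linear scoring function and of argmax inference by exhaustive search over a small feasible set; phi and the enumeration of the feasible outputs are illustrative assumptions (real inference would use ILP, dynamic programming, or similar).

```python
import numpy as np

def score(w, phi, x, y):
    """Linear scoring function f(x, y) = w . phi(x, y)."""
    return float(np.dot(w, phi(x, y)))

def predict(w, phi, x, feasible_ys):
    """Inference: return the highest-scoring structure among the allowed outputs Y."""
    return max(feasible_ys, key=lambda y: score(w, phi, x, y))
```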
Structural Learning: Quick Overview • Consider a big, monolithic structured prediction problem [figure: a graph over the output variables y1, …, y6] • Given labeled data pairs (xj, yj = {yj1, yj2, …, yjn}), how do we learn w and perform inference?
Learning w: Two Extreme Styles • Global Learning (GL): consider all the variables together (Collins '02; Taskar et al. '04; Tsochantaridis et al. '04), but this is expensive • Local Learning (LL): ignore hard-to-learn structural aspects, e.g. global constraints, and consider variables in isolation (Punyakanok et al. '05; Roth and Yih '05; Koo et al. '10), but this is inconsistent • LL+C: apply constraints, if available, only at test-time inference [figure: the connected graph over y1, …, y6 for GL vs. isolated variables for LL]
Our Contribution: Decomposed Learning • We consider learning with subsets of the output variables at a time [figure: subsets of {y1, …, y6}, with the possible binary assignments enumerated within each subset] • We give conditions under which this decomposed learning is actually identical to global learning, and exhibit the advantage of our learning paradigm experimentally • Related work: Pseudolikelihood – Besag, '77; Piecewise Pseudolikelihood – Sutton and McCallum, '07; Pseudomax – Sontag et al., '10
Outline • Existing Global Structural learning algorithms • Decomposed Learning (DecL): Efficient structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation
Supervised Structural Learning • We focus on structural SVM style algorithms, which learn w by minimizing the regularized structured hinge loss:
min_w ½||w||² + C Σ_j max_{y ∈ Y} [ f(xj, y; w) + Δ(yj, y) − f(xj, yj; w) ]
• Here f(xj, yj; w) is the score of the ground truth yj, f(xj, y; w) is the score of a non-ground-truth y, Δ(yj, y) is the loss-based margin, the max over Y is global inference over all the variables, and the bracketed term is the structured hinge loss (see the sketch below) • Literature: Taskar et al. '04; Tsochantaridis et al. '04
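A sketch of the structured hinge loss for one example, assuming exhaustive loss-augmented inference over a small feasible set Y; the names phi, hamming, and feasible_ys are illustrative assumptions, not the paper's API.

```python
import numpy as np

def hamming(y_gold, y):
    """Hamming distance, used here as the loss-based margin Delta(y_gold, y)."""
    return sum(a != b for a, b in zip(y_gold, y))

def structured_hinge_loss(w, phi, x, y_gold, feasible_ys):
    """max_y [ f(x, y; w) + Delta(y_gold, y) ] - f(x, y_gold; w), floored at 0."""
    gold_score = float(np.dot(w, phi(x, y_gold)))
    augmented = max(float(np.dot(w, phi(x, y))) + hamming(y_gold, y)
                    for y in feasible_ys)
    return max(0.0, augmented - gold_score)
```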
Limitations of Global Learning • Exact global inference is needed as an intermediate step • Expressive models don't admit exact and efficient (poly-time) inference algorithms, e.g. HMMs with global constraints, arbitrary Pairwise Markov Networks • Hence Global Learning is expensive for expressive features (φ(x, y)) and constraints (y ∈ Y) • The problem is using inference as a black box during learning • Our proposal: change the inference-during-learning to inference over a smaller output space: decomposed inference for learning
Outline • Existing Structural learning algorithms • Decomposed Learning (DecL): Efficient structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation
Decomposed Structural Learning (DecL) • GENERAL IDEA: for (xj, yj), reduce the argmax inference from the intractable output space Y to a "neighborhood" around yj: nbr(yj) ⊆ Y • Small and tractable nbr(yj) ⇒ efficient learning • Use domain knowledge to create neighborhoods which preserve the structure of the problem [figure: nested sets nbr(y) ⊆ Y ⊆ {0,1}^n over the n outputs in y]
Neighborhoods via Decompositions • Generate nbr(yj) by varying a subset of the output variables, while fixing the rest of them to their gold labels in yj… • … and repeat the same for different subsets of the output variables (see the sketch below) • A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables which vary together: Sj = {s1, …, sl | ∀i, si ⊆ {1, …, n}; ∀i, k, si ⊄ sk} • Inference could be exponential in the size of the sets • Smaller set sizes yield efficient learning • Under some conditions, DecL with smaller set sizes is identical to Global Learning
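A sketch of building nbr(yj) from a decomposition Sj = {s1, …, sl}: for each set s, vary the variables indexed by s over all assignments while fixing the rest to their gold labels. Binary variables and the tuple representation are assumptions made for illustration.

```python
from itertools import product

def neighborhood(y_gold, decomposition):
    nbr = set()
    for s in decomposition:                        # s: indices that vary together
        for assignment in product([0, 1], repeat=len(s)):
            y = list(y_gold)
            for idx, val in zip(s, assignment):
                y[idx] = val
            nbr.add(tuple(y))
    nbr.discard(tuple(y_gold))                     # keep only non-gold structures
    return nbr

# Example: vary {y1, y2} jointly and y3 alone around the gold labeling (1, 0, 1).
print(sorted(neighborhood((1, 0, 1), [(0, 1), (2,)])))
```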
Creating Decompositions • Allow different decompositions Sj for different training instances yj • Aim to get results close to doing exact inference: we need decompositions which yield exactness (next few slides) • Example: learning with decompositions in which all subsets of size k are considered: DecL-k (sketched below) • DecL-1 is the same as Pseudomax (Sontag et al., 2010), which is similar to Pseudolikelihood (Besag, '77) learning • In practice, decompositions should be based on domain knowledge – put highly coupled variables in the same set
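A sketch of the DecL-k decomposition: every subset of k output variables varies together; DecL-1 recovers the Pseudomax-style neighborhood in which one variable is perturbed at a time. The function name is illustrative.

```python
from itertools import combinations

def decl_k_decomposition(n_vars, k):
    """All size-k subsets of the variable indices {0, ..., n_vars - 1}."""
    return list(combinations(range(n_vars), k))

# Example: DecL-2 over 4 output variables gives 6 sets of two jointly varied variables.
print(decl_k_decomposition(4, 2))
```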
Outline • Existing Structural learning algorithms • DecL: Efficient decomposed structural learning • Intuition • Formalization • Theoretical results: exactness • Experimental evaluation
Theoretical Results: Assume Separability • Ideally we want Decomposed Learning with decompositions having small sets to give the same results as Global Learning • For analyzing the equivalence between DecL and GL, we assume that the training data is separable • Separability: existence of a set of weights W* that satisfy
W* = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ Y}
(the score of the ground truth yj beats the score of every non-ground-truth y by the loss-based margin) • Separating weights for DecL:
Wdecl = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ nbr(yj)}
• Naturally, W* ⊆ Wdecl
Theoretical Results: Exactness • The property we desire is Exactness: Wdecl = W*, expressed as a property of the constraints, the ground truth yj, and the globally separating weights W* • This is different from the asymptotic consistency results of Pseudolikelihood/Pseudomax! • Exactness is much more useful: learning with DecL yields the same weights as GL • Main theorem in the paper: provides a general exactness condition
One Example of Exactness: Pairwise Markov Networks • Scoring function defined over a graph with edges E, with singleton/vertex components φi(yi) and pairwise/edge components φi,k(yi, yk) [figure: a pairwise Markov network over y1, …, y6] • Assume domain knowledge on W*: we know that for a correct (separating) w, each φi,k(·; w) is either • Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), OR • Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0)
Decomposition for PMNs • Define Ej ⊆ E based on the gold labels (0/1) of each edge's endpoints and whether its potential is submodular (sub(φ)) or supermodular (sup(φ)) [figure: the definition of Ej] • Theorem: the Spair decomposition, consisting of the connected components of Ej, yields Exactness (a construction sketch follows below)
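A sketch, under our reading of this slide, of building the Spair decomposition: keep an edge (i, k) of E in Ej when a test on the edge's potential type (submodular or supermodular) and the gold labels of its endpoints says so, then take the connected components of Ej as the sets that vary together. The keep_edge predicate is an illustrative stand-in for the paper's precise definition of Ej.

```python
def s_pair_decomposition(n_vars, edges, y_gold, keep_edge):
    """Connected components of the subgraph Ej = {(i, k) in edges : keep_edge(i, k, y_gold)}."""
    parent = list(range(n_vars))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for (i, k) in edges:
        if keep_edge(i, k, y_gold):         # (i, k) is retained in Ej
            parent[find(i)] = find(k)

    components = {}
    for v in range(n_vars):
        components.setdefault(find(v), []).append(v)
    return [tuple(sorted(c)) for c in components.values()]
```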
Outline • Existing Structural learning algorithms • DecL: Efficient decomposed structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation
Experiments • Experimentally compare Decomposed Learning (DecL) to • Global Learning (GL), • Local Learning (LL) and • Local Learning + Constraints (if available, during test-time inference) (LL+C) • Study the robustness of DecL in conditions where our theoretical assumptions may not hold
Synthetic Experiments • Experiments on random synthetic data with 10 binary variables • Labels assigned with random singleton scoring functions and random linear constraints [figure: Avg. Hamming Loss vs. no. of training examples, comparing the Local Learning (LL) baselines, DecL-1 (aka Pseudomax), and Global Learning (GL) together with DecL-2 and DecL-3]
Multi-label Document Classification • Experiments on multi-label document classification • Documents carry multiple labels: corn, crude, earn, grain, interest, … • Modeled as a Pairwise Markov Network over a complete graph over all the labels, with singleton and pairwise components • LL – a local learning baseline that ignores the pairwise interactions
Results: Per-Instance F1 and Training Time (hours) [figures: F1 scores and time taken to train, in hours]
Example Problem: Information Extraction • Given citation text, extract author, booktitle, title, etc. Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…. • Given ad text, extract features, size, neighborhood, etc. Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines … • Constraints like: • The ‘title’ tokens are likely to appear together in a single block, • A paper should have at most one ‘title’
Information Extraction: Modeling • Modeled as an HMM with additional constraints • The constraints make inference with the HMM hard • Local Learning (LL) in this case is the HMM with no constraints • Domain knowledge: the HMM transition matrix is likely to be diagonal-heavy, a generalization of submodular pairwise potentials for Pairwise Markov Networks (a simple check is sketched below) • ⇒ use the decomposition Spair • Bottom line: DecL is 2 to 8 times faster than GL and gives the same accuracies
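An illustrative check for a "diagonal-heavy" HMM transition matrix, i.e. adjacent tokens tend to keep the same label; the row-wise argmax test below is our own simplification, not the paper's formal condition.

```python
import numpy as np

def is_diagonal_heavy(transitions):
    """True if every diagonal entry is the largest in its row."""
    T = np.asarray(transitions)
    return bool(np.all(T.argmax(axis=1) == np.arange(T.shape[0])))

# Example: a 3-label transition matrix that strongly favors self-transitions.
T = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.15, 0.05, 0.80]])
print(is_diagonal_heavy(T))  # True
```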
Citation Info. Extraction: Accuracy and Training Time [figures: F1 scores and time taken to train, in hours]
Ads. Info. Extraction: Accuracy and Training Time [figures: F1 scores and time taken to train, in hours]
Take Home: Efficient Structural Learning with DecL • We presented Decomposed Learning (DecL): efficient learning by reducing the inference to a small output space • Exactness: provided conditions under which DecL is provably identical to global structural learning (GL) • Experiments: DecL performs as well as GL on real-world data, with a significant cost reduction (50%–90% reduction in training time) • QUESTIONS?