Efficient Decomposed Learning for Structured Prediction Rajhans Samdani Joint work with Dan Roth University of Illinois at Urbana-Champaign
Structured Prediction • Structured prediction: predicting a structured output variable y based on the input variable x • y = {y1, y2, …, yn}: the output variables form a structure • Structure comes from interactions between the output variables through mutual correlations and constraints • Such problems occur frequently in • NLP – e.g. predicting the tree-structured parse of a sentence, predicting the entity-relation structure from a document • Computer vision – scene segmentation, body-part identification • Speech processing – capturing relations between phonemes • Computational Biology – protein folding and interactions between different sub-structures • Etc.
Example Problem: Information Extraction • Given citation text, extract author, booktitle, title, etc. Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…. • Given ad text, extract features, size, neighborhood, etc. Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines … • Structure is introduced by correlations between words, e.g. if treated as a sequence-tagging problem • Structure is also introduced by declarative constraints that define the set of feasible assignments • E.g. the 'author' tokens are likely to appear together in a single block • A paper should have at most one 'title' (a feasibility check is sketched below)
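A minimal sketch, not from the paper, of how such declarative constraints could be checked on a candidate tag sequence; the tag names and helper functions are illustrative assumptions.

```python
def is_contiguous(tags, label):
    """True if all tokens carrying the given label form one contiguous block."""
    positions = [i for i, t in enumerate(tags) if t == label]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)

def satisfies_constraints(tags):
    # Constraint 1: the 'author' tokens appear together in a single block.
    # Constraint 2 (simplified): at most one token is labeled 'title'.
    return is_contiguous(tags, "author") and tags.count("title") <= 1

# Example: the second sequence violates both constraints.
print(satisfies_constraints(["author", "author", "title", "booktitle"]))  # True
print(satisfies_constraints(["author", "title", "author", "title"]))      # False
```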
Example problem: Body Part Identification • Count the number of people • Predict the body parts • Correlations • Position of shoulders and heads correlated • Position of torso and legs correlated
Structured Prediction: Inference • Predict the variables in y = {y1, y2, …, yn} ∈ Y together to leverage dependencies (e.g. entity-relation, shoulders-head, information fields, document labels, etc.) between these variables • Inference constitutes predicting the best-scoring structure: argmax over y ∈ Y of f(x, y) • f(x, y) = w · φ(x, y) is called the scoring function, where φ(x, y) are features on the input-output pair, w are the weight parameters (to be estimated during learning), and Y is the set of allowed structures, often specified by constraints (see the sketch below)
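A minimal sketch, not the authors' code, of the linear scoring function and of argmax inference by exhaustive search over a small feasible set; phi and the enumeration of the feasible outputs are illustrative assumptions (real inference would use ILP, dynamic programming, or similar).

```python
import numpy as np

def score(w, phi, x, y):
    """Linear scoring function f(x, y) = w . phi(x, y)."""
    return float(np.dot(w, phi(x, y)))

def predict(w, phi, x, feasible_ys):
    """Inference: return the highest-scoring structure among the allowed outputs Y."""
    return max(feasible_ys, key=lambda y: score(w, phi, x, y))
```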
Structural Learning: Quick Overview • Consider a big, monolithic structured prediction problem [figure: a graph over the output variables y1, …, y6] • Given labeled data pairs (xj, yj = {yj1, yj2, …, yjn}), how do we learn w and perform inference?
Learning w: Two Extreme Styles • Global Learning (GL): consider all the variables together (Collins '02; Taskar et al. '04; Tsochantaridis et al. '04), but this is expensive • Local Learning (LL): ignore hard-to-learn structural aspects, e.g. global constraints, and consider variables in isolation (Punyakanok et al. '05; Roth and Yih '05; Koo et al. '10), but this is inconsistent • LL+C: apply constraints, if available, only at test-time inference [figure: the connected graph over y1, …, y6 for GL vs. isolated variables for LL]
Our Contribution: Decomposed Learning • We consider learning with subsets of the output variables at a time [figure: subsets of {y1, …, y6}, with the possible binary assignments enumerated within each subset] • We give conditions under which this decomposed learning is actually identical to global learning, and exhibit the advantage of our learning paradigm experimentally • Related work: Pseudolikelihood – Besag, '77; Piecewise Pseudolikelihood – Sutton and McCallum, '07; Pseudomax – Sontag et al., '10
Outline • Existing Global Structural learning algorithms • Decomposed Learning (DecL): Efficient structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation
Supervised Structural Learning • We focus on structural SVM style algorithms, which learn w by minimizing the regularized structured hinge loss:
min_w ½||w||² + C Σ_j max_{y ∈ Y} [ f(xj, y; w) + Δ(yj, y) − f(xj, yj; w) ]
• Here f(xj, yj; w) is the score of the ground truth yj, f(xj, y; w) is the score of a non-ground-truth y, Δ(yj, y) is the loss-based margin, the max over Y is global inference over all the variables, and the bracketed term is the structured hinge loss (see the sketch below) • Literature: Taskar et al. '04; Tsochantaridis et al. '04
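A sketch of the structured hinge loss for one example, assuming exhaustive loss-augmented inference over a small feasible set Y; the names phi, hamming, and feasible_ys are illustrative assumptions, not the paper's API.

```python
import numpy as np

def hamming(y_gold, y):
    """Hamming distance, used here as the loss-based margin Delta(y_gold, y)."""
    return sum(a != b for a, b in zip(y_gold, y))

def structured_hinge_loss(w, phi, x, y_gold, feasible_ys):
    """max_y [ f(x, y; w) + Delta(y_gold, y) ] - f(x, y_gold; w), floored at 0."""
    gold_score = float(np.dot(w, phi(x, y_gold)))
    augmented = max(float(np.dot(w, phi(x, y))) + hamming(y_gold, y)
                    for y in feasible_ys)
    return max(0.0, augmented - gold_score)
```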
Limitations of Global Learning • Exact global inference is needed as an intermediate step • Expressive models don't admit exact and efficient (poly-time) inference algorithms, e.g. HMMs with global constraints, arbitrary Pairwise Markov Networks • Hence Global Learning is expensive for expressive features (φ(x, y)) and constraints (y ∈ Y) • The problem is using inference as a black box during learning • Our proposal: change the inference-during-learning to inference over a smaller output space: decomposed inference for learning
Outline • Existing Structural learning algorithms • Decomposed Learning (DecL): Efficient structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation
Decomposed Structural Learning (DecL) • GENERAL IDEA: for (xj, yj), reduce the argmax inference from the intractable output space Y to a "neighborhood" around yj: nbr(yj) ⊆ Y • Small and tractable nbr(yj) ⇒ efficient learning • Use domain knowledge to create neighborhoods which preserve the structure of the problem [figure: nested sets nbr(y) ⊆ Y ⊆ {0,1}^n over the n outputs in y]
Neighborhoods via Decompositions • Generate nbr(yj) by varying a subset of the output variables, while fixing the rest of them to their gold labels in yj… • … and repeat the same for different subsets of the output variables (see the sketch below) • A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables which vary together: Sj = {s1, …, sl | ∀i, si ⊆ {1, …, n}; ∀i, k, si ⊄ sk} • Inference could be exponential in the size of the sets • Smaller set sizes yield efficient learning • Under some conditions, DecL with smaller set sizes is identical to Global Learning
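A sketch of building nbr(yj) from a decomposition Sj = {s1, …, sl}: for each set s, vary the variables indexed by s over all assignments while fixing the rest to their gold labels. Binary variables and the tuple representation are assumptions made for illustration.

```python
from itertools import product

def neighborhood(y_gold, decomposition):
    nbr = set()
    for s in decomposition:                        # s: indices that vary together
        for assignment in product([0, 1], repeat=len(s)):
            y = list(y_gold)
            for idx, val in zip(s, assignment):
                y[idx] = val
            nbr.add(tuple(y))
    nbr.discard(tuple(y_gold))                     # keep only non-gold structures
    return nbr

# Example: vary {y1, y2} jointly and y3 alone around the gold labeling (1, 0, 1).
print(sorted(neighborhood((1, 0, 1), [(0, 1), (2,)])))
```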
Creating Decompositions • Allow different decompositions Sj for different training instances yj • Aim to get results close to doing exact inference: we need decompositions which yield exactness (next few slides) • Example: learning with decompositions in which all subsets of size k are considered: DecL-k (sketched below) • DecL-1 is the same as Pseudomax (Sontag et al., 2010), which is similar to Pseudolikelihood (Besag, '77) learning • In practice, decompositions should be based on domain knowledge – put highly coupled variables in the same set
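A sketch of the DecL-k decomposition: every subset of k output variables varies together; DecL-1 recovers the Pseudomax-style neighborhood in which one variable is perturbed at a time. The function name is illustrative.

```python
from itertools import combinations

def decl_k_decomposition(n_vars, k):
    """All size-k subsets of the variable indices {0, ..., n_vars - 1}."""
    return list(combinations(range(n_vars), k))

# Example: DecL-2 over 4 output variables gives 6 sets of two jointly varied variables.
print(decl_k_decomposition(4, 2))
```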
Outline • Existing Structural learning algorithms • DecL: Efficient decomposed structural learning • Intuition • Formalization • Theoretical results: exactness • Experimental evaluation
Theoretical Results: Assume Separability • Ideally we want Decomposed Learning with decompositions having small sets to give the same results as Global Learning • For analyzing the equivalence between DecL and GL, we assume that the training data is separable • Separability: existence of a set of weights W* that satisfy
W* = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ Y}
(the score of the ground truth yj beats the score of every non-ground-truth y by the loss-based margin) • Separating weights for DecL:
Wdecl = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ nbr(yj)}
• Naturally, W* ⊆ Wdecl
Theoretical Results: Exactness • The property we desire is Exactness: Wdecl = W*, expressed as a property of the constraints, the ground truth yj, and the globally separating weights W* • This is different from the asymptotic consistency results of Pseudolikelihood/Pseudomax! • Exactness is much more useful: learning with DecL yields the same weights as GL • Main theorem in the paper: provides a general exactness condition
One Example of Exactness: Pairwise Markov Networks • Scoring function defined over a graph with edges E, with singleton/vertex components φi(yi) and pairwise/edge components φi,k(yi, yk) [figure: a pairwise Markov network over y1, …, y6] • Assume domain knowledge on W*: we know that for a correct (separating) w, each φi,k(·; w) is either • Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), OR • Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0)
Decomposition for PMNs • Define Ej ⊆ E based on the gold labels (0/1) of each edge's endpoints and whether its potential is submodular (sub(φ)) or supermodular (sup(φ)) [figure: the definition of Ej] • Theorem: the Spair decomposition, consisting of the connected components of Ej, yields Exactness (a construction sketch follows below)
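A sketch, under our reading of this slide, of building the Spair decomposition: keep an edge (i, k) of E in Ej when a test on the edge's potential type (submodular or supermodular) and the gold labels of its endpoints says so, then take the connected components of Ej as the sets that vary together. The keep_edge predicate is an illustrative stand-in for the paper's precise definition of Ej.

```python
def s_pair_decomposition(n_vars, edges, y_gold, keep_edge):
    """Connected components of the subgraph Ej = {(i, k) in edges : keep_edge(i, k, y_gold)}."""
    parent = list(range(n_vars))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for (i, k) in edges:
        if keep_edge(i, k, y_gold):         # (i, k) is retained in Ej
            parent[find(i)] = find(k)

    components = {}
    for v in range(n_vars):
        components.setdefault(find(v), []).append(v)
    return [tuple(sorted(c)) for c in components.values()]
```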
Outline • Existing Structural learning algorithms • DecL: Efficient decomposed structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation
Experiments • Experimentally compare Decomposed Learning (DecL) to • Global Learning (GL), • Local Learning (LL) and • Local Learning + Constraints (if available, during test-time inference) (LL+C) • Study the robustness of DecL in conditions where our theoretical assumptions may not hold
Synthetic Experiments • Experiments on random synthetic data with 10 binary variables • Labels assigned with random singleton scoring functions and random linear constraints [figure: Avg. Hamming Loss vs. no. of training examples, comparing the Local Learning (LL) baselines, DecL-1 (aka Pseudomax), and Global Learning (GL) together with DecL-2 and DecL-3]
Multi-label Document Classification • Experiments on multi-label document classification • Documents carry multiple labels: corn, crude, earn, grain, interest, … • Modeled as a Pairwise Markov Network over a complete graph over all the labels, with singleton and pairwise components • LL – a local learning baseline that ignores the pairwise interactions
Results: Per-Instance F1 and Training Time (hours) [figures: F1 scores and time taken to train, in hours]
Example Problem: Information Extraction • Given citation text, extract author, booktitle, title, etc. Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…. • Given ad text, extract features, size, neighborhood, etc. Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines … • Constraints like: • The ‘title’ tokens are likely to appear together in a single block, • A paper should have at most one ‘title’
Information Extraction: Modeling • Modeled as an HMM with additional constraints • The constraints make inference with the HMM hard • Local Learning (LL) in this case is the HMM with no constraints • Domain knowledge: the HMM transition matrix is likely to be diagonal-heavy, a generalization of submodular pairwise potentials for Pairwise Markov Networks (a simple check is sketched below) • ⇒ use the decomposition Spair • Bottom line: DecL is 2 to 8 times faster than GL and gives the same accuracies
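An illustrative check for a "diagonal-heavy" HMM transition matrix, i.e. adjacent tokens tend to keep the same label; the row-wise argmax test below is our own simplification, not the paper's formal condition.

```python
import numpy as np

def is_diagonal_heavy(transitions):
    """True if every diagonal entry is the largest in its row."""
    T = np.asarray(transitions)
    return bool(np.all(T.argmax(axis=1) == np.arange(T.shape[0])))

# Example: a 3-label transition matrix that strongly favors self-transitions.
T = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.15, 0.05, 0.80]])
print(is_diagonal_heavy(T))  # True
```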
Citation Info. Extraction: Accuracy and Training Time [figures: F1 scores and time taken to train, in hours]
Ads. Info. Extraction: Accuracy and Training Time [figures: F1 scores and time taken to train, in hours]
Take Home: Efficient Structural Learning with DecL • We presented Decomposed Learning (DecL): efficient learning by reducing the inference to a small output space • Exactness: provided conditions under which DecL is provably identical to global structural learning (GL) • Experiments: DecL performs as well as GL on real-world data, with a significant cost reduction (50%–90% reduction in training time) • QUESTIONS?