Efficient Decomposed Learning for Structured Prediction

  1. Efficient Decomposed Learning for Structured Prediction Rajhans Samdani Joint work with Dan Roth University of Illinois at Urbana-Champaign

  2. Structured Prediction • Structured prediction: predicting a structured output variable y based on the input variable x • y = {y1, y2, …, yn}: the output variables form a structure • Structure comes from interactions between the output variables through mutual correlations and constraints • Such problems occur frequently in • NLP – e.g. predicting the tree-structured parse of a sentence, predicting the entity-relation structure from a document • Computer vision – scene segmentation, body-part identification • Speech processing – capturing relations between phonemes • Computational Biology – protein folding and interactions between different sub-structures • Etc.

  3. Example Problem: Information Extraction • Given citation text, extract author, booktitle, title, etc. Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…. • Given ad text, extract features, size, neighborhood, etc. Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines … • Structure introduced by correlations between words • E.g. if treated as sequence-tagging • Structure is also introduced by declarative constraints that define the set of feasible assignments • E.g. the ‘author’ tokens are likely to appear together in a single block • A paper should have at most one ‘title’
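
To make the declarative constraints concrete, here is a minimal feasibility check in the spirit of the constraints above (a sketch of my own, not code from the talk; the tag names and the block-contiguity formalization are assumptions):

```python
def is_feasible(tags):
    """Check a candidate tag sequence against two illustrative constraints:
      1. 'title' tokens must form a single contiguous block.
      2. A citation has at most one 'title' block (implied by 1 here)."""
    title_positions = [i for i, t in enumerate(tags) if t == "title"]
    if title_positions:
        span = title_positions[-1] - title_positions[0] + 1
        if span != len(title_positions):
            return False  # 'title' tokens are split into more than one block
    return True

# Example usage on a toy tagging of six tokens.
print(is_feasible(["author", "author", "title", "title", "booktitle", "booktitle"]))  # True
print(is_feasible(["title", "author", "title", "booktitle", "booktitle", "booktitle"]))  # False
```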

  4. Example problem: Body Part Identification • Count the number of people • Predict the body parts • Correlations • Position of shoulders and heads correlated • Position of torso and legs correlated

  5. Structured Prediction: Inference • Predict the variables in y = {y1, y2, …, yn} ∈ Y together to leverage dependencies (e.g. entity-relation, shoulders-head, information fields, document labels etc.) between these variables • Inference constitutes predicting the best scoring structure: argmax over y ∈ Y of f(x,y) = w·φ(x,y), which is called the scoring function • Here φ(x,y) are features on the input-output pair, w are the weight parameters (to be estimated during learning), and Y, the set of allowed structures, is often specified by constraints
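
A minimal sketch of the scoring function and the argmax inference (my own illustration, not the talk's implementation; it assumes binary output variables, a toy feature function phi, and a feasible set Y small enough to enumerate):

```python
import itertools

def score(w, phi, x, y):
    """Linear scoring function f(x, y) = w . phi(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, phi(x, y)))

def inference(w, phi, x, feasible_set):
    """argmax over y in Y of w . phi(x, y), by enumerating the feasible set.
    Only viable for tiny Y; real systems use ILP or dynamic programming."""
    return max(feasible_set, key=lambda y: score(w, phi, x, y))

# Toy example: 3 binary outputs, feasible set = all assignments with at most one 1.
def phi(x, y):
    # Hypothetical features: one per output variable plus one pairwise agreement feature.
    return [x[i] * y[i] for i in range(3)] + [float(y[0] == y[1])]

Y = [y for y in itertools.product([0, 1], repeat=3) if sum(y) <= 1]
w = [1.0, -0.5, 0.2, 0.3]
print(inference(w, phi, [1.0, 2.0, 0.5], Y))  # best scoring feasible structure
```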

  6. Structural Learning: Quick Overview [Figure: a graph over output variables y1, …, y6] • Consider a big monolithic structured prediction problem • Given labeled data pairs (xj, yj = {yj1, yj2, …, yjn}), how do we learn w and perform inference?

  7. Learning w: Two Extreme Styles • Global Learning (GL): consider all the variables together (Collins'02; Taskar et al'04; Tsochantaridis et al'04) – expensive • Local Learning (LL): ignore hard-to-learn structural aspects, e.g. global constraints, or consider variables in isolation (Punyakanok et al'05; Roth and Yih'05; Koo et al'10…) – inconsistent • LL+C: apply constraints, if available, only at test-time inference

  8. Our Contribution: Decomposed Learning [Figure: the output variables y1, …, y6 are split into small subsets, and all assignments to each subset are enumerated separately] • We consider learning with subsets of variables at a time • We give conditions under which this decomposed learning is actually identical to global learning, and exhibit the advantage of our learning paradigm experimentally • Related work: Pseudolikelihood – Besag, 77; Piecewise Pseudolikelihood – Sutton and McCallum, 07; Pseudomax – Sontag et al, 10

  9. Outline • Existing Global Structural learning algorithms • Decomposed Learning (DecL): Efficient structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation

  10. Supervised Structural Learning • We focus on structural SVM style algorithms which learn w by minimizing the regularized structured hinge loss: for each training pair, the loss is max over y ∈ Y of [ f(xj, y; w) + Δ(yj, y) ] − f(xj, yj; w), where f(xj, yj; w) is the score of the ground truth yj, f(xj, y; w) is the score of a non-ground-truth y, Δ(yj, y) is the loss-based margin, and the max is a global inference over all the variables • Literature: Taskar et al'04; Tsochantaridis et al'04
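
A minimal sketch of this structured hinge loss with brute-force loss-augmented inference (my own illustration; the Hamming-distance Δ and the tiny enumerable Y are assumptions):

```python
import itertools

def dot(w, feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

def hamming(y_gold, y):
    """Loss-based margin Delta(y_gold, y): Hamming distance here (an assumption)."""
    return sum(int(a != b) for a, b in zip(y_gold, y))

def structured_hinge_loss(w, phi, x, y_gold, feasible_set):
    """max over y in Y of [ f(x, y; w) + Delta(y_gold, y) ] - f(x, y_gold; w):
    loss-augmented global inference minus the score of the ground truth."""
    gold_score = dot(w, phi(x, y_gold))
    augmented = max(dot(w, phi(x, y)) + hamming(y_gold, y) for y in feasible_set)
    return augmented - gold_score

# Toy example: 2 binary outputs, Y = all assignments, one feature per variable plus a bias.
phi = lambda x, y: [x[0] * y[0], x[1] * y[1], 1.0]
Y = list(itertools.product([0, 1], repeat=2))
print(structured_hinge_loss([1.0, -1.0, 0.1], phi, [2.0, 3.0], (1, 0), Y))  # 0.0: separable
```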

  11. Limitations of Global Learning • Exact global inference as an intermediate step • Expressive models don't admit exact and efficient (poly-time) inference algorithms, e.g. • HMM with global constraints, • Arbitrary Pairwise Markov Networks • Hence Global Learning is expensive for expressive features (φ(x,y)) and constraints (y ∈ Y) • The problem is using inference as a black box during learning • Our proposal: change the inference-during-learning to inference over a smaller output space: decomposed inference for learning

  12. Outline • Existing Structural learning algorithms • Decomposed Learning (DecL): Efficient structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation

  13. Decomposed Structural Learning (DecL) GENERAL IDEA: For (xj, yj), reduce the argmax inference from the intractable output space Y to a “neighborhood” around yj: nbr(yj) ⊆ Y • Small and tractable nbr(yj) ⇒ efficient learning • Use domain knowledge to create neighborhoods which preserve the structure of the problem [Figure: nbr(y) ⊆ Y ⊆ {0,1}^n, for the n outputs in y]

  14. Neighborhoods via Decompositions • Generate nbr(yj) by varying a subset of the output variables, while fixing the rest of them to their gold labels in yj… • … and repeat the same for different subsets of the output variables • A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables which vary together: Sj = {s1, …, sl | ∀i, si ⊆ {1, …, n}; ∀i, k, si ⊄ sk} • Inference could be exponential in the size of the sets • Smaller set sizes yield efficient learning • Under some conditions, DecL with smaller set sizes is identical to Global Learning
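
A sketch of how nbr(yj) can be built from a decomposition over binary variables (my own illustration; the talk does not prescribe this particular enumeration): for each set s in Sj, enumerate all assignments to the variables in s while keeping the remaining variables fixed at their gold labels.

```python
import itertools

def neighborhood(y_gold, decomposition):
    """nbr(y_gold): for every set s in the decomposition, enumerate all assignments
    to the variables in s while fixing the other variables to their gold labels."""
    nbr = set()
    for s in decomposition:                      # s is a tuple of variable indices
        for values in itertools.product([0, 1], repeat=len(s)):
            y = list(y_gold)
            for idx, v in zip(s, values):
                y[idx] = v
            nbr.add(tuple(y))
    return nbr

# Toy example: 4 binary variables, decomposition Sj = {{0, 1}, {2, 3}}.
y_gold = (1, 0, 1, 1)
print(sorted(neighborhood(y_gold, [(0, 1), (2, 3)])))
```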

  15. Creating Decompositions • Allow different decompositions Sj for different training instances yj • Aim to get results close to doing exact inference: we need decompositions which yield exactness (next few slides) • Example: learning with decompositions in which all subsets of size k are considered: DecL-k • DecL-1 is the same as Pseudomax (Sontag et al, 2010), which is similar to Pseudolikelihood (Besag, 77) learning • In practice, decompositions should be based on domain knowledge – put highly coupled variables in the same set
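
For DecL-k the decomposition is simply all size-k subsets of the output variable indices; a small sketch (my own, using only the standard library):

```python
import itertools

def decl_k_decomposition(n, k):
    """DecL-k: every subset of k output variables forms one set of the decomposition.
    DecL-1 recovers the Pseudomax-style setting of varying one variable at a time."""
    return list(itertools.combinations(range(n), k))

print(decl_k_decomposition(4, 1))  # [(0,), (1,), (2,), (3,)]
print(decl_k_decomposition(4, 2))  # all 6 pairs of the 4 variables
```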

  16. Outline • Existing Structural learning algorithms • DecL: Efficient decomposed structural learning • Intuition • Formalization • Theoretical results: exactness • Experimental evaluation

  17. Theoretical Results: Assume Separability • Ideally we want Decomposed Learning with decompositions having small sets to give the same results as Global Learning • For analyzing the equivalence between DecL and GL, we assume that the training data is separable • Separability: existence of a set of weights W* that satisfy W* = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ Y}, i.e. the score of the ground truth yj beats the score of every non-ground-truth y by the loss-based margin • Separating weights for DecL: Wdecl = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀y ∈ nbr(yj)} • Naturally: W* ⊆ Wdecl
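
The two weight sets differ only in which candidate set the margin inequalities must hold over; here is a small membership check in the spirit of the definitions above (the fixed linear score and Hamming margin in the usage example are assumptions):

```python
def separates(score, delta, y_gold, candidate_set):
    """True iff score(y_gold) >= score(y) + delta(y_gold, y) for every y in
    candidate_set. With candidate_set = Y this tests membership in W*;
    with candidate_set = nbr(y_gold) it tests membership in Wdecl.
    Since nbr(y_gold) is a subset of Y, every w in W* is also in Wdecl."""
    gold = score(y_gold)
    return all(score(y) + delta(y_gold, y) <= gold
               for y in candidate_set if tuple(y) != tuple(y_gold))

# Toy usage: 2 binary outputs, a fixed linear score, Hamming margin (assumptions).
score = lambda y: 2.0 * y[0] - 1.0 * y[1]
delta = lambda a, b: sum(int(u != v) for u, v in zip(a, b))
Y = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(separates(score, delta, (1, 0), Y))   # True: this score separates (1, 0) on all of Y
```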

  18. Theoretical Results: Exactness • The property we desire is Exactness: Wdecl = W* • Exactness is a property of the constraints, the ground truth yj, and the globally separating weights W* • Different from the asymptotic consistency results of Pseudolikelihood/Pseudomax! • Exactness is much more useful – learning with DecL yields the same weights as GL • Main theorem in the paper: provides a general exactness condition

  19. One Example of Exactness: Pairwise Markov Networks • Scoring function defined over a graph with edges E: f(x, y) = Σi φi(x, yi; w) + Σ(i,k)∈E φi,k(x, yi, yk; w), with singleton/vertex components φi and pairwise/edge components φi,k • Assume domain knowledge on W*: we know, for each edge (i,k), that the correct (separating) w makes φi,k(·; w) either • Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0), OR • Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0)
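
A tiny helper that classifies a binary pairwise component according to the two conditions on this slide (a sketch; the potential table in the example is hypothetical):

```python
def classify_pairwise(phi_ik):
    """phi_ik is a dict mapping (yi, yk) in {0,1}^2 to a score.
    Returns 'submodular', 'supermodular', or 'neither', using the slide's
    convention: submodular if phi(0,0) + phi(1,1) > phi(0,1) + phi(1,0)."""
    agree = phi_ik[(0, 0)] + phi_ik[(1, 1)]
    disagree = phi_ik[(0, 1)] + phi_ik[(1, 0)]
    if agree > disagree:
        return "submodular"
    if agree < disagree:
        return "supermodular"
    return "neither"

print(classify_pairwise({(0, 0): 1.0, (1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.1}))  # submodular
```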

  20. Decomposition for PMNs • Define an edge subset Ej ⊆ E using the gold labels (0/1) of each edge's endpoints in yj and whether the edge potential is submodular (sub(φ)) or supermodular (sup(φ)) [table on the slide] • Theorem: the Spair decomposition, consisting of the connected components of Ej, yields Exactness
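
Once the edge subset Ej is chosen, the Spair sets are its connected components; here is a minimal union-find sketch (Ej is taken as given, since its precise definition is in the paper; treating isolated variables as singleton sets is my assumption):

```python
def connected_components(n, edges):
    """Connected components of a graph on nodes 0..n-1 with the given edge list,
    via union-find. Each returned component is one set of the Spair decomposition."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i, k in edges:
        parent[find(i)] = find(k)

    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return list(comps.values())

# Toy example: 6 variables, Ej = {(0,1), (1,2), (4,5)} -> components {0,1,2}, {3}, {4,5}.
print(connected_components(6, [(0, 1), (1, 2), (4, 5)]))
```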

  21. Outline • Existing Structural learning algorithms • DecL: Efficient decomposed structural learning • Intuition • Formalization • Theoretical properties of DecL • Experimental evaluation

  22. Experiments • Experimentally compare Decomposed Learning (DecL) to • Global Learning (GL), • Local Learning (LL) and • Local Learning + Constraints (if available, during test-time inference) (LL+C) • Study the robustness of DecL in conditions where our theoretical assumptions may not hold

  23. Synthetic Experiments [Plot: Avg. Hamming Loss vs. no. of training examples for the Local Learning (LL) baselines, DecL-1 aka Pseudomax, and Global Learning (GL) together with Decomposed Learning DecL-2 and DecL-3] • Experiments on random synthetic data with 10 binary variables • Labels assigned with random singleton scoring functions and random linear constraints
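
One way such synthetic instances could be generated (the exact protocol is not in the slides; the distributions and the single linear constraint below are assumptions): draw random singleton scores and a random linear constraint, then label by brute-force constrained argmax.

```python
import itertools
import random

def make_instance(n=10, seed=0):
    """One synthetic instance: random singleton scores s[i][v] for each variable/value,
    a random linear constraint a . y <= b, and the gold y as the constrained argmax."""
    rng = random.Random(seed)
    scores = [[rng.gauss(0, 1) for _ in (0, 1)] for _ in range(n)]
    a = [rng.choice([-1, 0, 1]) for _ in range(n)]
    b = rng.randint(0, n // 2)

    feasible = [y for y in itertools.product([0, 1], repeat=n)
                if sum(ai * yi for ai, yi in zip(a, y)) <= b]
    y_gold = max(feasible, key=lambda y: sum(scores[i][y[i]] for i in range(n)))
    return scores, (a, b), y_gold

scores, constraint, y_gold = make_instance()
print(y_gold)
```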

  24. Multi-label Document Classification • Experiments on multi-label document classification • Documents with multi-labels corn, crude, earn, grain, interest… • Modeled as a Pairwise Markov Network over a complete graph over all the labels – singleton and pairwise components • LL – local learning baseline that ignores pairwise interactions

  25. Results: Per-Instance F1 and Training Time (hours) [Charts: F1 scores and time taken to train (hours) for each method]


  27. Example Problem: Information Extraction • Given citation text, extract author, booktitle, title, etc. Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…. • Given ad text, extract features, size, neighborhood, etc. Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines … • Constraints like: • The ‘title’ tokens are likely to appear together in a single block, • A paper should have at most one ‘title’

  28. Information Extraction: Modeling • Modeled as an HMM with additional constraints • The constraints make inference with the HMM hard • Local Learning (LL) in this case is the HMM with no constraints • Domain knowledge: the HMM transition matrix is likely to be diagonal heavy – a generalization of submodular pairwise potentials for Pairwise Markov Networks • ⇒ use the decomposition Spair • Bottom line: DecL is 2 to 8 times faster than GL and gives the same accuracies
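
One plausible way to formalize "diagonal heavy" for the HMM transition scores (my own reading, offered as an assumption): each state's self-transition score dominates its transitions to other states, playing the role that submodularity plays for pairwise Markov networks.

```python
def is_diagonal_heavy(transition):
    """transition[i][j] is the (log-)score of moving from state i to state j.
    Returns True if each state's self-transition outweighs all of its
    transitions to other states (one possible reading of 'diagonal heavy')."""
    n = len(transition)
    return all(
        transition[i][i] > max(transition[i][j] for j in range(n) if j != i)
        for i in range(n)
    )

# Toy 3-state example.
print(is_diagonal_heavy([[0.9, 0.05, 0.05],
                         [0.1, 0.8, 0.1],
                         [0.2, 0.2, 0.6]]))   # True
```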

  29. Citation Info. Extraction: Accuracy and Training Time [Charts: F1 scores and time taken to train (hours)]

  30. Ads Info. Extraction: Accuracy and Training Time [Charts: F1 scores and time taken to train (hours)]

  31. Take Home: Efficient Structural Learning with DecL We presented Decomposed Learning (DecL): efficient learning by reducing the inference to a small output space Exactness: provided conditions under which DecL is provably identical to global structural learning (GL) Experiments: DecL performs as well as GL on real-world data, at significantly lower cost (a 50%–90% reduction in training time) QUESTIONS?
