Decomposing Structured Prediction via Constrained Conditional Models
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others
Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP)
June 2013, SLG Workshop, ICML, Atlanta, GA
Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
This is an Inference Problem.
Learning and Inference
• Global decisions in which several local decisions play a role, but there are mutual dependencies among their outcomes.
• In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
• As we move up the problem hierarchy (textual entailment, QA, …), not all component models can be learned simultaneously
• We need to think about (learned) models for different sub-problems, often pipelined
• Knowledge relating the sub-problems (constraints) becomes more essential and may appear only at evaluation time
• Goal: incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Outline
• Constrained Conditional Models
  • A formulation for global inference with knowledge modeled as expressive structural constraints
  • A structured prediction perspective
• Decomposed Learning (DecL)
  • Efficient structural learning by reducing the learning-time inference to a small output space
  • Providing conditions for when DecL is provably identical to global structural learning (GL)
Three Ideas Underlying Constrained Conditional Models
• Idea 1 (Modeling): Separate modeling and problem formulation from algorithms, similar to the philosophy of probabilistic modeling
• Idea 2 (Inference): Keep models simple, make expressive decisions (via constraints), unlike probabilistic modeling, where the models themselves become more expressive
• Idea 3 (Learning): Expressive structured decisions can be supported by simply learned models, amplified and minimally supervised by exploiting dependencies among the models' outcomes
Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations
Example: "Dole 's wife, Elizabeth , is a native of N.C." with entity variables E1, E2, E3 and relation variables R12, R23.
An objective function that incorporates learned models with knowledge (constraints): a Constrained Conditional Model.
y = argmax_y Σ score(y = v) · 1[y = v]
  = argmax [ score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + …
             + score(R12 = S-of) · 1[R12 = S-of] + … ]
  subject to constraints
Note: a non-sequential model. Significant performance improvement.
Key questions:
• How to guide the global inference over independently learned or pipelined models?
• How to learn? Independently, pipelined, or jointly?
Models could be learned separately; constraints may come up only at decision time.
Constrained Conditional Models
y = argmax_y  w^T φ(x, y)  −  Σ_k ρ_k · d(y, 1_{C_k(x, y)})
• w: weight vector for the "local" models; φ: features, classifiers; log-linear models (HMM, CRF) or a combination
• ρ_k: penalty for violating constraint C_k; d(y, 1_{C_k}): how far y is from a "legal" assignment
• The second term is the (soft) constraints component
How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; cutting planes, dual decomposition, and other search techniques are also possible.
How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision? (Inferning workshop)
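To make the objective concrete, here is a minimal sketch of CCM inference by exhaustive search rather than an ILP solver; the label set, local scoring function, penalty weight, and the "at most one PER" constraint are illustrative assumptions, not the models used in this work.

```python
# Minimal CCM inference sketch: maximize  sum_i score(x, i, y_i) - rho * #violations(y)
# by enumerating the output space (feasible only for tiny outputs; an ILP solver
# replaces the enumeration in practice).
from itertools import product

LABELS = ["PER", "LOC", "O"]

def local_score(x, i, label, w):
    """Stand-in for a learned local model: w maps (token, label) pairs to scores."""
    return w.get((x[i], label), 0.0)

def constraint_violations(y):
    """Illustrative declarative constraint: at most one PER in the output."""
    return max(0, y.count("PER") - 1)

def ccm_infer(x, w, rho=2.0):
    best_y, best_score = None, float("-inf")
    for y in product(LABELS, repeat=len(x)):
        score = sum(local_score(x, i, yi, w) for i, yi in enumerate(y))
        score -= rho * constraint_violations(y)        # (soft) constraints component
        if score > best_score:
            best_y, best_score = list(y), score
    return best_y, best_score

x = "Dole 's wife Elizabeth".split()
w = {("Dole", "PER"): 1.0, ("Elizabeth", "PER"): 1.0, ("'s", "O"): 0.5, ("wife", "O"): 0.5}
print(ccm_infer(x, w))   # the penalty biases the decision away from two PER labels
```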
Placing in context: a crash course in structured prediction
Structured Prediction: Inference
• Inference: given input x (a document, a sentence), predict the best structure y = {y1, y2, …, yn} ∈ Y (entities & relations)
• Assign values to y1, y2, …, yn, accounting for dependencies among the yi
• Inference is expressed as a maximization of a scoring function: y' = argmax_{y ∈ Y} w^T φ(x, y)
  (φ: joint features on inputs and outputs; Y: set of allowed structures; w: feature weights, estimated during learning)
• Inference requires, in principle, enumerating all y ∈ Y at decision time, when we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w
• For some structures, inference is computationally easy, e.g., using the Viterbi algorithm
• In general, it is NP-hard (and can be formulated as an ILP)
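As a concrete instance of the "computationally easy" case, here is a minimal Viterbi sketch for a linear-chain scoring function; the emission and transition score tables stand in for w^T φ(x, y) and are assumptions for illustration.

```python
# Minimal Viterbi sketch: exact argmax over label sequences for a linear chain.
import numpy as np

def viterbi(emis, trans):
    """emis: (n, L) per-position label scores; trans: (L, L) label-transition scores.
    Returns the highest-scoring label sequence and its score."""
    n, L = emis.shape
    dp = np.full((n, L), -np.inf)
    back = np.zeros((n, L), dtype=int)
    dp[0] = emis[0]
    for t in range(1, n):
        cand = dp[t - 1][:, None] + trans + emis[t][None, :]   # rows: previous label, cols: current label
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0)
    y = [int(dp[-1].argmax())]
    for t in range(n - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1], float(dp[-1].max())

emis = np.array([[2.0, 0.1], [0.1, 2.0], [1.5, 0.2]])   # illustrative scores: 3 positions, 2 labels
trans = np.array([[0.5, -1.0], [-1.0, 0.5]])
print(viterbi(emis, trans))
```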
Structured Prediction: Learning
• Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss.
• Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)   ∀y
  (the score of the annotated structure beats the score of any other structure by at least the penalty for predicting that other structure)
• We call these conditions the learning constraints.
• In most structured learning algorithms used today, the update of the weight vector w is done in an online fashion.
• W.l.o.g. (almost) we can thus write the generic structured learning algorithm as follows.
• What follows is a Structured Perceptron, but with minor variations this procedure applies to CRFs and linear structured SVMs.
Structured Prediction: Learning Algorithm
In the structured case, the prediction (inference) step is often intractable and needs to be done many times.
• For each example (xi, yi), do (with the current weight vector w):
  • Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w^T φ(xi, y)
  • Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
  • If yes (a mistaken prediction): update w
  • Otherwise: no need to update w on this example
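A minimal sketch of the loop above; the phi feature map and the infer argmax routine are assumed helpers supplied by the caller (any inference procedure, such as the Viterbi or constrained-inference sketches earlier, can be plugged in).

```python
# Minimal structured-perceptron sketch: predict with the current weights, and update
# only when the learning constraint is violated (a mistaken prediction).
def structured_perceptron(data, phi, infer, epochs=5, lr=1.0):
    """data: list of (x, y_gold); phi(x, y) -> dict of feature counts; infer(x, w) -> y."""
    w = {}
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = infer(x, w)                        # inference with the current weight vector
            if y_pred != y_gold:                        # mistaken prediction: update w
                for f, v in phi(x, y_gold).items():     # promote gold-structure features
                    w[f] = w.get(f, 0.0) + lr * v
                for f, v in phi(x, y_pred).items():     # demote predicted-structure features
                    w[f] = w.get(f, 0.0) - lr * v
    return w
```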
Structured Prediction: Learning Algorithm
Solution I: decompose the scoring function into EASY and HARD parts.
EASY: could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting dependence on y at learning time). This may not be enough if the HARD part is still part of each inference step.
• For each example (xi, yi), do:
  • Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
  • Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
  • If yes (a mistaken prediction): update w
  • Otherwise: no need to update w on this example
Structured Prediction: Learning Algorithm
Solution II: disregard some of the dependencies, i.e., assume a simple model.
• For each example (xi, yi), do:
  • Predict with the simple model only: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y)
  • Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
  • If yes (a mistaken prediction): update w
  • Otherwise: no need to update w on this example
Structured Prediction: Learning Algorithm
Solution III: disregard some of the dependencies during learning; take them into account at decision time. This is the most commonly used solution in NLP today.
• For each example (xi, yi), do:
  • Predict with the simple model: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y)
  • Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
  • If yes (a mistaken prediction): update w
  • Otherwise: no need to update w on this example
• At decision time: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
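A minimal sketch of this "learn locally, add the HARD part only at decision time" recipe; all of the helpers passed in (train_local, local_score, constraint_penalty, enumerate_outputs) are assumptions for illustration, not code from this work.

```python
# Minimal L+I sketch: train on the EASY part only, then decode with the EASY scores
# plus a soft-constraint (HARD) penalty.
def learn_plus_inference(train_data, test_inputs, train_local, local_score,
                         constraint_penalty, enumerate_outputs, rho=2.0):
    w = train_local(train_data)                         # e.g., per-token classifiers, HMM, CRF
    predictions = []
    for x in test_inputs:
        best_y, best = None, float("-inf")
        for y in enumerate_outputs(x):                  # an ILP or search procedure in practice
            s = local_score(x, y, w) - rho * constraint_penalty(y)   # EASY score minus HARD penalty
            if s > best:
                best_y, best = y, s
        predictions.append(best_y)
    return predictions
```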
CCM Formulations
The (soft) constraints component is more general, since constraints can be declarative, non-grounded statements. CCMs can be viewed as a general interface for easily combining declarative domain knowledge with data-driven statistical models.
Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints)
   Objective (HMM/CRF based): argmax Σ λ_ij x_ij
   Linguistic constraint: cannot have both A states and B states in an output sequence.
2. Sentence compression/summarization (language model + global constraints)
   Objective (language model based): argmax Σ λ_ijk x_ijk
   Linguistic constraints: if a modifier is chosen, include its head; if a verb is chosen, include its arguments.
3. SRL (independent classifiers + global constraints)
Constrained Conditional Models allow:
• Learning a simple model (or multiple models, or pipelines)
• Making decisions with a more complex model
• Accomplished by directly incorporating constraints to bias/re-rank global decisions composed of simpler models' decisions
• More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. '07, '12; PR: Ganchev et al. '10; UEM: Samdani et al. '12]
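As one way to ground formulation 1, here is a minimal sequence-tagging ILP sketch with the "cannot have both A states and B states" constraint, assuming the PuLP package; the score table is an illustrative assumption.

```python
# Minimal ILP sketch: argmax sum lambda_il * x_il subject to one label per token and a
# declarative mutual-exclusion constraint between A states and B states.
import pulp

tokens = range(4)
labels = ["A", "B", "O"]
score = {(i, l): 1.0 if (i % 2 == 0 and l == "A") or (i % 2 == 1 and l == "B") else 0.0
         for i in tokens for l in labels}

prob = pulp.LpProblem("ccm_sequence", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", (tokens, labels), cat="Binary")    # x[i][l] = 1 iff token i gets label l

prob += pulp.lpSum(score[i, l] * x[i][l] for i in tokens for l in labels)
for i in tokens:
    prob += pulp.lpSum(x[i][l] for l in labels) == 1              # exactly one label per token

use_A = pulp.LpVariable("use_A", cat="Binary")                    # indicator: any A in the sequence
use_B = pulp.LpVariable("use_B", cat="Binary")                    # indicator: any B in the sequence
for i in tokens:
    prob += x[i]["A"] <= use_A
    prob += x[i]["B"] <= use_B
prob += use_A + use_B <= 1                                        # cannot have both A and B states

prob.solve()
print([l for i in tokens for l in labels if x[i][l].value() == 1])
```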
Outline
• Constrained Conditional Models
  • A formulation for global inference with knowledge modeled as expressive structural constraints
  • A structured prediction perspective
• Decomposed Learning (DecL)
  • Efficient structural learning by reducing the learning-time inference to a small output space
  • Providing conditions for when DecL is provably identical to global structural learning (GL)
Training Constrained Conditional Models
• Training options:
  • Independently of the constraints (L+I): decompose the model from the constraints
  • Jointly, in the presence of the constraints (IBT, GL)
  • Decomposed into simpler models
• Not surprisingly, decomposition is good; see [Chang et al., Machine Learning Journal 2012]
• Little can be said theoretically about the quality/generalization of predictions made with a decomposed model
• Next: an algorithmic approach to decomposition that is both good and comes with interesting guarantees.
Decomposed Structured Prediction
• Inference: y = argmax_{y ∈ Y} w^T φ(x, y)
• Learning is driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)   ∀y
• In Global Learning, the output space is exponential in the number of variables, so accurate learning can be intractable
• "Standard" ways to decompose it forget some of the structure and bring it back only at decision time
(Figure: an entity-relation output graph over variables y1, …, y6, learned with separate entity weights w_e and relation weights w_r.)
Decomposed Structural Learning (DecL) [Samdani & Roth, ICML '12]
• Algorithm: restrict the 'argmax' inference to a small subset of the variables while fixing the remaining variables to their ground-truth values in yj
• ... and repeat for different subsets of the output variables: a decomposition
• The resulting set of assignments considered for yj is called a neighborhood, nbr(yj)
• Key contribution: we give conditions under which DecL is provably equivalent to Global Learning (GL)
• We show experimentally that DecL provides results close to GL when such conditions do not exactly hold
• Related work: Pseudolikelihood (Besag, '77); Piecewise Pseudolikelihood (Sutton and McCallum, '07); Pseudomax (Sontag et al., '10)
(Figure: the output variables are split into small subsets; within each subset all joint assignments are enumerated while the remaining variables are fixed to their gold values.)
What are good neighborhoods? DecL vs. Global Learning (GL)
• GL: separate the ground truth from all y ∈ Y:
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)   ∀y ∈ Y
  (in the six-variable example: 2^6 = 64 outputs)
• DecL: separate the ground truth from all y ∈ nbr(yj):
  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)   ∀y ∈ nbr(yj)
  (in the example, the subsets contribute 16 outputs in total)
• Likely scenario: nbr(yj) ≪ Y
Creating Decompositions
• DecL allows different decompositions Sj for different training instances yj
• Example: learning with decompositions in which all subsets of size k are considered (DecL-k)
  • For each k-subset of the variables, enumerate its assignments; keep the remaining n − k variables at their gold values
  • k = 1 is Pseudomax [Sontag et al., 2010]
  • k = 2 is Constraint Classification [Har-Peled, Zimak, Roth 2002; Crammer, Singer 2002]
• In practice, neighborhoods should be determined based on domain knowledge: put highly coupled variables in the same set
• The goal is to get results that are close to doing exact inference. Are there small and good neighborhoods? (See the sketch below.)
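A minimal sketch of DecL-k neighborhood generation, assuming a fixed label set per variable; the six-variable binary example is illustrative.

```python
# Minimal DecL-k sketch: for every size-k subset of output variables, enumerate its
# assignments while fixing the remaining variables to their gold values.
from itertools import combinations, product

def decl_k_neighborhood(y_gold, k, labels):
    """All outputs that differ from y_gold only on some subset of k positions."""
    nbr = set()
    for subset in combinations(range(len(y_gold)), k):
        for assignment in product(labels, repeat=k):
            y = list(y_gold)
            for pos, lab in zip(subset, assignment):
                y[pos] = lab
            nbr.add(tuple(y))
    nbr.discard(tuple(y_gold))          # keep only the competing (non-gold) outputs
    return nbr

# Six binary variables: DecL-2 yields a small neighborhood instead of all 2^6 outputs.
print(len(decl_k_neighborhood([0, 1, 0, 0, 1, 1], k=2, labels=[0, 1])))
```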
Exactness of DecL
• Key result: YES. Under "reasonable conditions", DecL with small neighborhoods nbr(yj) gives the same results as Global Learning.
• For analyzing the equivalence between DecL and GL, we need a notion of 'separability' of the data
• Separability: existence of a set of weights W* that separate the score of the ground truth yj from the score of every non-ground-truth y:
  W* = {w | w · φ(xj, yj) ≥ w · φ(xj, y) + Δ(yj, y), ∀y ∈ Y}
• Separating weights for DecL:
  Wdecl = {w | w · φ(xj, yj) ≥ w · φ(xj, y) + Δ(yj, y), ∀y ∈ nbr(yj)}
• Naturally: W* ⊆ Wdecl
• Exactness result: the set of separating weights for DecL is equal to the set of separating weights for GL, W* = Wdecl
Example of Exactness: Pairwise Markov Networks
• The scoring function is defined over a graph with edges E, with singleton/vertex components and pairwise/edge components
• Assume domain knowledge on W*: for a correct (separating) w ∈ W*, we know which of the pairwise potentials φ_{i,k}(·; w) are:
  • Submodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0)
  • Supermodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0)
Decomposition for Pairwise Markov Networks
• For an example (xj, yj), define Ej by removing from E the edges whose gold labels disagree with the preference of the corresponding φ (submodular or supermodular)
• Theorem: decomposing the variables as the connected components of Ej yields exactness.
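A minimal sketch of this decomposition, assuming the networkx package and the reading that, with the scoring convention above, submodular edges prefer equal gold labels and supermodular edges prefer unequal ones; the toy graph and labels are illustrative.

```python
# Minimal sketch: drop edges whose gold labels disagree with the potential's preference,
# then use the connected components of the remaining graph as the DecL subsets.
import networkx as nx

def decl_components(edges, y_gold, kind):
    """edges: list of (u, v); kind[(u, v)]: 'sub' or 'sup'; y_gold: {node: 0/1}."""
    g = nx.Graph()
    g.add_nodes_from(y_gold)
    for u, v in edges:
        agree = (y_gold[u] == y_gold[v])
        prefers_equal = (kind[(u, v)] == "sub")
        if agree == prefers_equal:       # keep only edges consistent with the gold labeling
            g.add_edge(u, v)
    return [sorted(c) for c in nx.connected_components(g)]

y = {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 1}
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
kind = {(1, 2): "sub", (2, 3): "sub", (3, 4): "sub", (4, 5): "sup", (5, 6): "sup"}
print(decl_components(edges, y, kind))   # e.g., [[1, 2], [3, 4, 5, 6]]
```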
Experiments: Information Extraction
Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Prediction result of a trained HMM: the citation is segmented across the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] with incorrect boundaries.
It violates lots of natural constraints!
Adding Expressivity via Constraints
• Each field must be a consecutive list of words and can appear at most once in a citation.
• State transitions must occur on punctuation marks.
• The citation can only start with AUTHOR or EDITOR.
• The words pp., pages correspond to PAGE.
• Four digits starting with 20xx or 19xx are a DATE.
• Quotations can appear only in TITLE.
• …
Information Extraction with Constraints
Experimental goal:
• Investigate DecL with small neighborhoods
• Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks
• Use neighborhoods similar to the pairwise-Markov-network case
Adding constraints, we get correct results:
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
Typical Results: Information Extraction (Ads Data)
(Figures: F1 scores, and time taken to train in minutes, for the compared methods.)
Conclusion
• Presented Constrained Conditional Models:
  • An ILP formulation for structured prediction that augments statistically learned models with declarative constraints, as a way to incorporate knowledge and support decisions in expressive output spaces
  • Supports joint inference while maintaining modularity and tractability of training
  • Interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo declarative constraints
• Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space
  • Provided conditions for when DecL is provably identical to global structural learning (GL)
• Interesting open questions remain in developing further understanding of how to support efficient joint inference
Thank you! Check out our tools, demos, and tutorials.