Constrained Conditional Models for Natural Language Processing
Ming-Wei Chang, Lev Ratinov, Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign
EACL, March 2009
Nice to Meet You
Constrained Conditional Models (CCMs)
Informally: Everything that has to do with constraints (and learning models).
Formally: We typically make decisions based on learned models; with CCMs we make decisions based on the same models, guided by constraints (the contrast is sketched below).
We do not define the learning method, but we will discuss it and make suggestions.
CCMs make predictions in the presence of, or guided by, constraints.
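The two decision rules referred to above appeared as figures in the original slides; a minimal sketch of the intended contrast (the notation is an assumption, following the standard CCM write-up) is:

```latex
% Standard model: pick the highest-scoring output.
y^{*} = \arg\max_{y \in \mathcal{Y}} \; \mathbf{w}^{\top}\phi(x, y)

% CCM: the same learned score, but the decision is guided/restricted by constraints C.
y^{*} = \arg\max_{y \in \mathcal{Y},\; y \models C} \; \mathbf{w}^{\top}\phi(x, y)
```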
Constraints Driven Learning and Decision Making • Why Constraints? • The Goal: Building good NLP systems easily • We have prior knowledge at hand • How can we use it? • We suggest that often knowledge can be injected directly • Can use it to guide learning • Can use it to improve decision making • Can use it to simplify the models we need to learn • How useful are constraints? • Useful for supervised learning • Useful for semi-supervised learning • Sometimes more efficient than labeling data directly
Comprehension: a process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin's dad was a magician. 4. Christopher Robin must be at least 65 now.
This is an Inference Problem.
This Tutorial: Constrained Conditional Models • Part 1: Introduction to CCMs • Examples: • NE + Relations • Information extraction – correcting models with CCMs • First summary: why are CCMs important • Problem Setting • Features and Constraints; Some hints about training issues • Part 2: Introduction to Integer Linear Programming • What is ILP; use of ILP in NLP • Part 3: Detailed examples of using CCMs • Semantic Role Labeling in Detail • Coreference Resolution • Sentence Compression BREAK
This Tutorial: Constrained Conditional Models (2nd part) • Part 4: More on Inference • Other inference algorithms • Search (SRL); Dynamic Programming (Transliteration); Cutting Planes • Using hard and soft constraints • Part 5: Training issues when working with CCMs • Formalism (again) • Choices of training paradigms -- Tradeoffs • Examples in Supervised learning • Examples in Semi-Supervised learning • Part 6: Conclusion • Building CCMs • Features and Constraints; Objective functions; Different Learners • Mixed models vs. Joint models; where does the knowledge come from
Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. E.g., Structured Output Problems: multiple dependent output variables.
(Learned) models/classifiers for different sub-problems. In some cases, not all local models can be learned simultaneously; key examples in NLP are Textual Entailment and QA. In these cases, constraints may appear only at evaluation time.
Incorporate the models' information, along with prior knowledge/constraints, in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Inference with General Constraint Structure [Roth & Yih '04]: Recognizing Entities and Relations
Example: "Dole 's wife, Elizabeth , is a native of N.C." (entities E1, E2, E3; relations R12, R23)
Improvement over no inference: 2-5%
Some questions: How to guide the global inference? Why not learn jointly?
Note: non-sequential model. Models could be learned separately; constraints may come up only at decision time.
Tasks of Interest: Structured Output • For each instance, assign values to a set of variables • Output variables depend on each other • Common tasks in • Natural language processing • Parsing; Semantic Parsing; Summarization; Transliteration; Co-reference resolution, … • Information extraction • Entities, Relations, … • Many pure machine learning approaches exist • Hidden Markov Models (HMMs); CRFs • Structured Perceptrons and SVMs … • However, …
Motivation II: Information Extraction via Hidden Markov Models
Input citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Prediction result of a trained HMM (fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]; the segmentation appears as a figure in the slide): Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Unsatisfactory results!
Strategies for Improving the Results
(Pure) Machine Learning Approaches: increase the model complexity • Higher order HMM/CRF? • Increasing the window size? • Adding a lot of new features • Requires a lot of labeled examples • What if we only have a few labeled examples?
Any other options? • Humans can immediately tell bad outputs • The output does not make sense
Can we keep the learned model simple and still make expressive decisions?
Information Extraction without Prior Knowledge
Prediction result of a trained HMM (fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]): Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Violates lots of natural constraints!
Examples of Constraints
Easy to express pieces of "knowledge"; non-propositional; may use quantifiers (one of them is written as a linear constraint below).
• Each field must be a consecutive list of words and can appear at most once in a citation.
• State transitions must occur on punctuation marks.
• The citation can only start with AUTHOR or EDITOR.
• The words pp., pages correspond to PAGE.
• Four digits starting with 20xx and 19xx are DATE.
• Quotations can appear only in TITLE.
• …
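As a hedged illustration (this encoding is one common choice, not taken verbatim from the slides): introduce indicator variables x_{i,label} ∈ {0,1} meaning "token i carries this label", and x^start_{i,label} meaning "a segment with this label starts at token i". Then "the citation can only start with AUTHOR or EDITOR" and "DATE appears at most once" become linear constraints:

```latex
x_{1,\mathrm{AUTHOR}} + x_{1,\mathrm{EDITOR}} = 1
\qquad\qquad
\sum_{i} x^{\mathrm{start}}_{i,\mathrm{DATE}} \le 1
```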
Information Extraction with Constraints
• Adding constraints, we get correct results! • Without changing the model
• [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
Problem Setting
(Figure: a graph over variables y1, …, y8 and observations, with constraints C(y1, y4) and C(y2, y3, y6, y7, y8), possibly with weights W_C.)
• Random variables Y
• Conditional distributions P (learned by models/classifiers)
• Constraints C: any Boolean function defined over partial assignments (possibly with weights W)
• Goal: find the "best" assignment, the one that achieves the highest global performance:
Y* = argmax_Y P(Y) subject to constraints C
• This is an Integer Programming Problem
Formal Model
The objective combines a collection of classifiers or log-linear models (HMM, CRF), or a combination of them, with a (soft) constraints component:
• a weight vector for the "local" models,
• a penalty for violating each constraint,
• a measure of how far y is from a "legal" assignment.
(A sketch of the objective is given below.)
How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
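The equation itself appeared as a figure in the slides; a minimal sketch, following the standard CCM formulation (exact notation is an assumption), is:

```latex
y^{*} \;=\; \arg\max_{y}\;
\underbrace{\mathbf{w}^{\top}\phi(x,y)}_{\text{local models}}
\;-\; \sum_{k=1}^{K}
\underbrace{\rho_{k}}_{\text{violation penalty}}\,
\underbrace{d_{C_{k}}(x,y)}_{\text{distance of } y \text{ from a ``legal'' assignment}}
```

Setting a penalty ρ_k to infinity turns C_k into a hard constraint.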
Features Versus Constraints
φ_i : X × Y → R;  C_i : X × Y → {0,1};  d : X × Y → R
• In principle, constraints and features can encode the same properties; in practice, they are used very differently
• Features • Local, short-distance properties, to allow tractable inference • Propositional (grounded): e.g., true if "the" followed by a Noun occurs in the sentence
• Constraints • Global properties • Quantified, first-order logic expressions • E.g., true if all y_i's in the sequence y are assigned different values
Encoding Prior Knowledge (a form of supervision)
Consider encoding the knowledge that entities of type A and B cannot occur simultaneously in a sentence.
• The "Feature" Way (needs more training data) • Results in a higher-order HMM or CRF • May require designing a model tailored to the knowledge/constraints • Large number of new features: might require more labeled data • Wastes parameters to learn indirectly the knowledge we already have
• The Constraints Way • Keeps the model simple; adds expressive constraints directly • A small set of constraints • Allows for decision-time incorporation of constraints
Constrained Conditional Models – 1st Summary
• Everything that has to do with constraints and learning models
• In both examples, we started with learning models • Either for components of the problem (classifiers for relations and entities) • Or for the whole problem (citations)
• We then included constraints on the output • As a way to "correct" the output of the model
• In both cases this allows us to learn simpler models than we would otherwise
• As presented, global constraints did not take part in training • Global constraints were used only at the output
• We will later call this training paradigm L+I
Inference with Constraints
We start with adding constraints to existing models. It's a good place to start: conceptually, all you do is add constraints to what you were doing before, and performance improves.
Constraints and Integer Linear Programming (ILP)
• ILP is powerful (NP-complete in general)
• ILP is popular: inference for many models, such as Viterbi for CRFs, has already been implemented as ILP
• Powerful off-the-shelf solvers exist
• All we need is to write down the objective function and the constraints; there is no need to write inference code
Linear Programming
Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann.
An optimization technique with a linear objective function and linear constraints. Note that the word "Integer" is absent: variables may take fractional values. (The standard form is given below.)
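For reference, a linear program in its standard maximization form, which the examples that follow instantiate:

```latex
\max_{x \in \mathbb{R}^{n}} \; c^{\top} x
\qquad \text{subject to} \qquad A x \le b, \;\; x \ge 0
```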
Example (thanks to James Clarke)
• Telfa Co. produces tables and chairs
• Each table makes $8 profit, each chair makes $5 profit
• A table requires 1 hour of labor and 9 sq. feet of wood
• A chair requires 1 hour of labor and 5 sq. feet of wood
• We have only 6 hours of labor and 45 sq. feet of wood
• We want to maximize the profit (a small solver sketch follows below)
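A quick way to compare the LP relaxation with the integer program is an off-the-shelf solver. A minimal sketch, assuming the open-source PuLP package (the tutorial itself does not prescribe a solver):

```python
# Telfa example: maximize 8*tables + 5*chairs subject to labor and wood limits.
from pulp import LpProblem, LpMaximize, LpVariable, value

def solve(integer: bool):
    cat = "Integer" if integer else "Continuous"
    prob = LpProblem("telfa", LpMaximize)
    tables = LpVariable("tables", lowBound=0, cat=cat)
    chairs = LpVariable("chairs", lowBound=0, cat=cat)
    prob += 8 * tables + 5 * chairs         # objective: profit
    prob += tables + chairs <= 6            # labor: 6 hours available
    prob += 9 * tables + 5 * chairs <= 45   # wood: 45 sq. feet available
    prob.solve()
    return tables.varValue, chairs.varValue, value(prob.objective)

print(solve(integer=False))  # LP relaxation: (3.75, 2.25), profit 41.25
print(solve(integer=True))   # ILP: (5.0, 0.0), profit 40.0
```

The fractional optimum (3.75 tables, 2.25 chairs) also shows why the "Integer" part matters: rounding it gives either an infeasible point (4, 2) or a suboptimal one (3, 2), while the true integer optimum is 5 tables and 0 chairs.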
Integer Linear Programming
• In NLP we are dealing with discrete outputs, therefore we are almost always interested in integer solutions
• ILP is NP-complete, but often efficient in practice for large NLP problems
• In some cases, the solutions to the LP relaxation are integral (e.g., a totally unimodular constraint matrix)
• Next, we show an example of using ILP with constraints; the matrix is not totally unimodular, but LP gives integral solutions
Back to the Example: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C. (entities E1, E2, E3; relations R12, R23)
• Key issues for NLP with ILP: • Write down the objective function • Write down the constraints as linear inequalities
Back to the Example: Variables
Entities: E1 (Dole), E2 (Elizabeth), E3 (N.C.); relations: R12, R21, R23, R32, R13, R31.
Indicator variables: x{E1 = per}, x{E1 = loc}, …, x{R12 = spouse_of}, x{R12 = born_in}, …, x{R12 = none}, … ∈ {0,1}
Back to the Example: Cost Function
Cost function: c{E1 = per}· x{E1 = per} + c{E1 = loc}· x{E1 = loc} + … + c{R12 = spouse_of}· x{R12 = spouse_of} + … + c{R12 = none}· x{R12 = none} + …
Adding Constraints
• Each entity is a person, an organization, a location, or none:
x{E1 = per} + x{E1 = loc} + x{E1 = org} + x{E1 = none} = 1
• (R12 = spouse_of) ⇒ (E1 = person) ∧ (E2 = person):
x{R12 = spouse_of} ≤ x{E1 = per}
x{R12 = spouse_of} ≤ x{E2 = per}
• We need more consistency constraints.
• Any Boolean constraint can be written as a set of linear inequalities, and an efficient algorithm exists [Rizzolo 2007].
A small code sketch of this ILP appears below.
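To make the "objective plus linear constraints, no inference code" point concrete, here is a minimal sketch of the entity/relation ILP above. It assumes PuLP and uses made-up classifier scores; in the actual system these scores come from the trained local classifiers.

```python
# Hypothetical sketch: the scores below are placeholders for learned classifier outputs.
from itertools import permutations
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

ENT_LABELS = ["per", "loc", "org", "none"]
REL_LABELS = ["spouse_of", "born_in", "none"]
entities = ["E1", "E2", "E3"]
relations = list(permutations(entities, 2))  # R12, R21, R13, R31, R23, R32

# Made-up scores c{...}; real values would come from the local models.
ent_score = {(e, l): 0.0 for e in entities for l in ENT_LABELS}
rel_score = {(r, l): 0.0 for r in relations for l in REL_LABELS}
ent_score[("E1", "per")] = 2.0; ent_score[("E2", "per")] = 1.5
ent_score[("E3", "loc")] = 1.8; rel_score[(("E1", "E2"), "spouse_of")] = 1.0

prob = LpProblem("ner_relations", LpMaximize)
xe = {k: LpVariable(f"x_{k[0]}_{k[1]}", cat="Binary") for k in ent_score}
xr = {k: LpVariable(f"x_{k[0][0]}{k[0][1]}_{k[1]}", cat="Binary") for k in rel_score}

# Objective: sum of score * indicator over all entity and relation assignments.
prob += lpSum(ent_score[k] * xe[k] for k in xe) + lpSum(rel_score[k] * xr[k] for k in xr)

# Each entity and each relation takes exactly one label.
for e in entities:
    prob += lpSum(xe[(e, l)] for l in ENT_LABELS) == 1
for r in relations:
    prob += lpSum(xr[(r, l)] for l in REL_LABELS) == 1

# Consistency: spouse_of implies both arguments are persons.
for (a, b) in relations:
    prob += xr[((a, b), "spouse_of")] <= xe[(a, "per")]
    prob += xr[((a, b), "spouse_of")] <= xe[(b, "per")]

prob.solve()
print({k: v.varValue for k, v in xe.items() if v.varValue == 1})
```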
CCM for NE-Relations
Objective: c{E1 = per}· x{E1 = per} + c{E1 = loc}· x{E1 = loc} + … + c{R12 = spouse_of}· x{R12 = spouse_of} + … + c{R12 = none}· x{R12 = none} + …
• We showed a CCM formulation for the NE-Relation problem
• In this case, the cost of each variable was learned independently, using trained classifiers
• Other expressive problems can be formulated as Integer Linear Programs; for example, HMM/CRF inference (the Viterbi algorithm), sketched below
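One standard way to write Viterbi decoding as an ILP (a sketch, not taken verbatim from the slides; assume a designated start label at position 0) uses binary variables x_{i,y',y} indicating that position i-1 has label y' and position i has label y:

```latex
\max_{x \in \{0,1\}} \;\; \sum_{i,\,y',\,y} x_{i,y',y}\,
      \bigl(\log P(y \mid y') + \log P(w_i \mid y)\bigr)
\quad \text{s.t.} \quad
\sum_{y',\,y} x_{i,y',y} = 1 \;\; \forall i,
\qquad
\sum_{y'} x_{i,y',y} \;=\; \sum_{y''} x_{i+1,y,y''} \;\; \forall i, y
```

The second set of constraints enforces that consecutive transitions agree on the shared label, so the selected transitions form a single label sequence.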
TSP as an Integer Linear Program • Dantzig et al. were the first to suggest a practical solution to the problem using ILP.
Reduction from Traveling Salesman to ILP
(Figure: a directed graph with nodes 1, 2, 3 and edge variables x_ij with costs c_ij, e.g., x12 with cost c12, x32 with cost c32.)
• Every node has exactly ONE outgoing edge
• Every node has exactly ONE incoming edge
• The solutions are binary (integer)
(These constraints are written out as an ILP below.)
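A minimal sketch of these degree constraints as an ILP (note that the full Dantzig-Fulkerson-Johnson formulation also adds subtour-elimination constraints, which the slides do not show):

```latex
\min_{x \in \{0,1\}^{n \times n}} \;\; \sum_{i \ne j} c_{ij}\, x_{ij}
\quad \text{s.t.} \quad
\sum_{j \ne i} x_{ij} = 1 \;\; \forall i \;\; \text{(one outgoing edge)},
\qquad
\sum_{i \ne j} x_{ij} = 1 \;\; \forall j \;\; \text{(one incoming edge)}
```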