
Constrained Conditional Models for Natural Language Processing



  1. Constrained Conditional Models for Natural Language Processing Ming-Wei Chang, Lev Ratinov, Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign March 2009 EACL

  2. Nice to Meet You

  3. Constrained Conditional Models (CCMs). Informally: everything that has to do with constraints (and learning models). Formally: we typically make decisions based on models such as y* = argmax_y w · φ(x,y). With CCMs we make decisions based on models such as y* = argmax_y w · φ(x,y) − Σ_i ρ_i d(y, 1_{C_i}). We do not define the learning method, but we will discuss it and make suggestions. CCMs make predictions in the presence of / guided by constraints.

  4. Constraints Driven Learning and Decision Making • Why Constraints? • The Goal: Building good NLP systems easily • We have prior knowledge at hand • How can we use it? • We suggest that often knowledge can be injected directly • Can use it to guide learning • Can use it to improve decision making • Can use it to simplify the models we need to learn • How useful are constraints? • Useful for supervised learning • Useful for semi-supervised learning • Sometimes more efficient than labeling data directly

  5. Inference

  6. Comprehension: a process that maintains and updates a collection of propositions about the state of affairs. (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now. This is an Inference Problem.

  7. This Tutorial: Constrained Conditional Models • Part 1: Introduction to CCMs • Examples: • NE + Relations • Information extraction – correcting models with CCMs • First summary: why are CCMs important • Problem setting • Features and constraints; some hints about training issues • Part 2: Introduction to Integer Linear Programming • What is ILP; use of ILP in NLP • Part 3: Detailed examples of using CCMs • Semantic Role Labeling in detail • Coreference Resolution • Sentence Compression BREAK

  8. This Tutorial: Constrained Conditional Models (2nd part) • Part 4: More on Inference • Other inference algorithms • Search (SRL); Dynamic Programming (Transliteration); Cutting Planes • Using hard and soft constraints • Part 5: Training issues when working with CCMs • Formalism (again) • Choices of training paradigms – tradeoffs • Examples in supervised learning • Examples in semi-supervised learning • Part 6: Conclusion • Building CCMs • Features and constraints; objective functions; different learners • Mixed models vs. joint models; where does the knowledge come from THE END

  9. Learning and Inference • Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome. E.g., structured output problems – multiple dependent output variables • (Learned) models/classifiers for different sub-problems • In some cases, not all local models can be learned simultaneously. Key examples in NLP are Textual Entailment and QA. In these cases, constraints may appear only at evaluation time • Incorporate the models’ information, along with prior knowledge/constraints, in making coherent decisions – decisions that respect the local models as well as domain- and context-specific knowledge/constraints.

  10. This Tutorial: Constrained Conditional Models • Part 1: Introduction to CCMs • Examples: • NE + Relations • Information extraction – correcting models with CCMs • First summary: why are CCMs important • Problem setting • Features and constraints; some hints about training issues • Part 2: Introduction to Integer Linear Programming • What is ILP; use of ILP in NLP • Semantic Role Labeling in detail • Part 3: Detailed examples of using CCMs • Coreference Resolution • Sentence Compression BREAK

  11. This Tutorial: Constrained Conditional Models (2nd part) • Part 4: More on Inference • Other inference algorithms • Search (SRL); Dynamic Programming (Transliteration); Cutting Planes • Using hard and soft constraints • Part 5: Training issues when working with CCMs • Formalism (again) • Choices of training paradigms – tradeoffs • Examples in supervised learning • Examples in semi-supervised learning • Part 6: Conclusion • Building CCMs • Features and constraints; objective functions; different learners • Mixed models vs. joint models; where does the knowledge come from THE END

  12. Inference with General Constraint Structure [Roth & Yih ’04]: Recognizing Entities and Relations. Example: Dole ’s wife, Elizabeth , is a native of N.C. (entities E1, E2, E3; relations R12, R23). Improvement over no inference: 2-5%. Some questions: How do we guide the global inference? Why not learn jointly? Note: this is a non-sequential model. Models could be learned separately; constraints may come up only at decision time.

  13. Task of Interest: Structured Output • For each instance, assign values to a set of variables • Output variables depend on each other • Common tasks in • Natural language processing • Parsing; Semantic Parsing; Summarization; Transliteration; Co-reference resolution, … • Information extraction • Entities, Relations, … • Many pure machine learning approaches exist • Hidden Markov Models (HMMs); CRFs • Structured Perceptrons and SVMs … • However, …

  14. Motivation II: Information Extraction via Hidden Markov Models. Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 . Prediction result of a trained HMM (fields: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]): unsatisfactory results!

  15. Strategies for Improving the Results • (Pure) machine learning approaches: increasing the model complexity • Higher-order HMM/CRF? • Increasing the window size? • Adding a lot of new features • Requires a lot of labeled examples • What if we only have a few labeled examples? • Any other options? • Humans can immediately tell bad outputs: the output does not make sense • Can we keep the learned model simple and still make expressive decisions?

  16. Information Extraction without Prior Knowledge. Prediction result of a trained HMM on: Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 . (fields: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]) The prediction violates lots of natural constraints!

  17. Examples of Constraints. Easy to express pieces of “knowledge”; non-propositional; may use quantifiers. Each field must be a consecutive list of words and can appear at most once in a citation. State transitions must occur on punctuation marks. The citation can only start with AUTHOR or EDITOR. The words pp. and pages correspond to PAGE. Four digits starting with 20xx or 19xx are DATE. Quotations can appear only in TITLE. …
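
To make this concrete, here is a minimal sketch (not from the tutorial; the field names, helper functions, and sample sequence are illustrative) of how such pieces of knowledge can be written as Boolean functions over a predicted label sequence:

```python
# Constraints as Boolean functions over a predicted field sequence.
# Illustrative sketch; field names and inputs are hypothetical.

def fields_are_consecutive(labels):
    """Each field must be a consecutive block and appear at most once."""
    seen = set()
    for i, lab in enumerate(labels):
        if i == 0 or labels[i - 1] != lab:   # start of a new block
            if lab in seen:                  # field already appeared earlier
                return False
            seen.add(lab)
    return True

def starts_with_author_or_editor(labels):
    """The citation can only start with AUTHOR or EDITOR."""
    return labels[0] in ("AUTHOR", "EDITOR")

def pages_keyword_is_page(tokens, labels):
    """The words 'pp.' and 'pages' must be labeled PAGE."""
    return all(lab == "PAGE"
               for tok, lab in zip(tokens, labels)
               if tok.lower() in ("pp.", "pages"))

tokens = ["Lars", "Ole", "Andersen", ".", "Program", "analysis", "."]
labels = ["AUTHOR", "AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE"]
print(fields_are_consecutive(labels),
      starts_with_author_or_editor(labels),
      pages_keyword_is_page(tokens, labels))
```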

  18. Information Extraction with Constraints • Adding constraints, we get correct results! • Without changing the model • [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

  19. Problem Setting (figure: a graph over output variables y1 … y8 and observations, with constraints such as C(y2,y3,y6,y7,y8) and C(y1,y4), possibly weighted) • Random variables Y • Conditional distributions P (learned by models/classifiers) • Constraints C – any Boolean function defined over partial assignments (possibly with weights W) • Goal: find the “best” assignment – the assignment that achieves the highest global performance: Y* = argmax_Y P(Y) subject to constraints C • This is an Integer Programming problem

  20. Formal Model: y* = argmax_y w · φ(x,y) − Σ_j ρ_j d(y, 1_{C_j}) • w · φ(x,y) is the “local” models component: w is a weight vector over a collection of classifiers, log-linear models (HMM, CRF), or a combination • The (soft) constraints component: ρ_j is the penalty for violating constraint C_j, and d(y, 1_{C_j}) measures how far y is from a “legal” assignment • How to solve? This is an Integer Linear Program; solving with ILP packages gives an exact solution, and search techniques are also possible • How to train? How do we decompose the global objective function? Should we incorporate constraints in the learning process?
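
As a minimal illustration of this objective (a sketch, not the tutorial's code; the feature function, the constraint, and all weights are made-up placeholders), the CCM score of a candidate assignment is the local model score minus weighted constraint-violation distances:

```python
import numpy as np

# CCM scoring sketch: score(y) = w · phi(x, y) - rho * d(y)
# phi, the constraint, and rho are illustrative placeholders.

def phi(x, y):
    """Toy feature vector over a token sequence x and label sequence y."""
    return np.array([
        sum(1 for tok, lab in zip(x, y) if tok[0].isupper() and lab == "AUTHOR"),
        sum(1 for tok, lab in zip(x, y) if tok.isdigit() and lab == "DATE"),
        sum(1 for i in range(1, len(y)) if y[i] != y[i - 1]),  # label transitions
    ], dtype=float)

def d_starts_with_author(y):
    """Distance from the constraint 'sequence starts with AUTHOR' (0 or 1 here)."""
    return 0.0 if y[0] == "AUTHOR" else 1.0

def ccm_score(x, y, w, rho):
    return w @ phi(x, y) - rho * d_starts_with_author(y)

x = ["Andersen", "1994"]
w = np.array([1.0, 1.0, -0.2])
rho = 2.0
for y in (["AUTHOR", "DATE"], ["TITLE", "DATE"]):
    print(y, ccm_score(x, y, w, rho))
```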

  21. Features Versus Constraints • φ_i : X × Y → R; C_i : X × Y → {0,1}; d : X × Y → R • In principle, constraints and features can encode the same properties; in practice, they are used very differently • Features • Local, short-distance properties – to allow tractable inference • Propositional (grounded): • E.g., true if “the” followed by a noun occurs in the sentence • Constraints • Global properties • Quantified, first-order logic expressions • E.g., true if all y_i's in the sequence y are assigned different values

  22. Encoding Prior Knowledge • Consider encoding the knowledge that: entities of type A and B cannot occur simultaneously in a sentence • The “feature” way (needs more training data) • Results in a higher-order HMM, CRF • May require designing a model tailored to the knowledge/constraints • Large number of new features: might require more labeled data • Wastes parameters to learn indirectly knowledge we already have • The “constraints” way (a form of supervision) • Keeps the model simple; adds expressive constraints directly • A small set of constraints • Allows for decision-time incorporation of constraints

  23. Constrained Conditional Models – 1st Summary • Everything that has to do with Constraints and Learning models • In both examples, we started with learning models • Either for components of the problem • Classifiers for Relations and Entities • Or the whole problem • Citations • We then included constraints on the output • As a way to “correct” the output of the model • In both cases this allows us to • Learn simpler models than we would otherwise • As presented, global constraints did not take part in training • Global constraints were used only at the output. • We will later call this training paradigm L+I

  24. This Tutorial: Constrained Conditional Models • Part 1: Introduction to CCMs • Examples: • NE + Relations • Information extraction – correcting models with CCMs • First summary: why are CCMs important • Problem setting • Features and constraints; some hints about training issues • Part 2: Introduction to Integer Linear Programming • What is ILP; use of ILP in NLP • Semantic Role Labeling in detail • Part 3: Detailed examples of using CCMs • Coreference Resolution • Sentence Compression BREAK

  25. This Tutorial: Constrained Conditional Models (2nd part) • Part 4: More on Inference • Other inference algorithms • Search (SRL); Dynamic Programming (Transliteration); Cutting Planes • Using hard and soft constraints • Part 5: Training issues when working with CCMs • Formalism (again) • Choices of training paradigms – tradeoffs • Examples in supervised learning • Examples in semi-supervised learning • Part 6: Conclusion • Building CCMs • Features and constraints; objective functions; different learners • Mixed models vs. joint models; where does the knowledge come from THE END

  26. Formal Model (recap): y* = argmax_y w · φ(x,y) − Σ_j ρ_j d(y, 1_{C_j}) • w · φ(x,y) is the “local” models component: w is a weight vector over a collection of classifiers, log-linear models (HMM, CRF), or a combination • The (soft) constraints component: ρ_j is the penalty for violating constraint C_j, and d(y, 1_{C_j}) measures how far y is from a “legal” assignment • How to solve? This is an Integer Linear Program; solving with ILP packages gives an exact solution, and search techniques are also possible • How to train? How do we decompose the global objective function? Should we incorporate constraints in the learning process?

  27. Inference with constraints. We start by adding constraints to existing models. It is a good place to start because, conceptually, all you do is add constraints to what you were doing before, and the performance improves.

  28. Constraints and Integer Linear Programming (ILP) • ILP is powerful (NP-complete) • ILP is popular – inference for many models, such as Viterbi for CRF, has already been implemented • Powerful off-the-shelf solvers exist • All we need is to write down the objective function and the constraints; there is no need to write inference code

  29. Linear Programming. Key contributors: Leonid Kantorovich, George B. Dantzig, John von Neumann. An optimization technique with a linear objective function and linear constraints. Note that the word “Integer” is absent.

  30. Example (thanks to James Clarke) • Telfa Co. produces tables and chairs • Each table makes $8 profit, each chair makes $5 profit • We want to maximize the profit

  31. Example (thanks to James Clarke) • Telfa Co. produces tables and chairs • Each table makes $8 profit, each chair makes $5 profit • A table requires 1 hour of labor and 9 sq. feet of wood • A chair requires 1 hour of labor and 5 sq. feet of wood • We have only 6 hours of labor and 45 sq. feet of wood • We want to maximize the profit
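
As a minimal sketch of this LP (a tooling assumption, not part of the tutorial, using scipy.optimize.linprog; linprog minimizes, so the profit is negated):

```python
from scipy.optimize import linprog

# Telfa LP: maximize 8*tables + 5*chairs
# subject to    tables +   chairs <= 6   (hours of labor)
#             9*tables + 5*chairs <= 45  (sq. feet of wood)
c = [-8, -5]                      # negate: linprog minimizes
A_ub = [[1, 1], [9, 5]]
b_ub = [6, 45]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, -res.fun)            # expect roughly [3.75, 2.25] and profit 41.25
```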

  32–37. Solving LP problems (figures only).

  38. Integer Linear Programming – integer solutions.

  39. Integer Linear Programming • In NLP we are dealing with discrete outputs, therefore we are almost always interested in integer solutions • ILP is NP-complete, but often efficient in practice for large NLP problems • In some cases the solutions to the LP relaxation are integral (e.g., when the constraint matrix is totally unimodular) • Next, we show an example of using ILP with constraints; the matrix is not totally unimodular, but LP gives integral solutions
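
Continuing the Telfa sketch under the same tooling assumption (here assuming SciPy ≥ 1.9, which supports the `integrality` argument), requiring integer solutions changes the optimum, and it is not simply the rounded LP solution:

```python
from scipy.optimize import linprog

# Same Telfa problem, now as an ILP (integrality requires SciPy >= 1.9, method="highs").
c = [-8, -5]
A_ub = [[1, 1], [9, 5]]
b_ub = [6, 45]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              integrality=[1, 1], method="highs")
print(res.x, -res.fun)   # expect roughly [5, 0] and profit 40;
                         # rounding the LP optimum (3.75, 2.25) up to (4, 2) is infeasible
```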

  40. Back to the Example: Recognizing Entities and Relations. Dole ’s wife, Elizabeth , is a native of N.C. (entities E1, E2, E3; relations R12, R23)

  41. Back to the Example: Recognizing Entities and Relations. Dole ’s wife, Elizabeth , is a native of N.C. (entities E1, E2, E3; relations R12, R23) • NLP with ILP – the key issues: • Write down the objective function • Write down the constraints as linear inequalities

  42. Back to the Example: Recognizing Entities and Relations. Dole ’s wife, Elizabeth , is a native of N.C. (entities E1, E2, E3; relations R12, R23)

  43. Back to the Example: the cost function. Introduce a binary indicator variable for each possible label of each entity (E1 = Dole, E2 = Elizabeth, E3 = N.C.) and each relation (R12, R21, R13, R31, R23, R32): x{E1 = per}, x{E1 = loc}, …, x{R12 = spouse_of}, x{R12 = born_in}, …, x{R12 = } , … ∈ {0,1}

  44. Back to the Example: the cost function, over the same entity and relation variables: c{E1 = per} · x{E1 = per} + c{E1 = loc} · x{E1 = loc} + … + c{R12 = spouse_of} · x{R12 = spouse_of} + … + c{R12 = } · x{R12 = } + …

  45. Adding Constraints • Each entity is either a person, organization or location: • x{E1 = per} + x{E1 = loc} + x{E1 = org} + x{E1 = } = 1 • (R12 = spouse_of) ⇒ (E1 = person) ∧ (E2 = person): • x{R12 = spouse_of} ≤ x{E1 = per} • x{R12 = spouse_of} ≤ x{E2 = per} • We need more consistency constraints • Any Boolean constraint can be written as a set of linear inequalities, and efficient algorithms exist [Rizzolo, 2007]
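
Here is a minimal sketch of these constraints as an ILP, assuming the PuLP package as the solver front end (a tooling assumption); the scores are made-up classifier outputs, not the tutorial's, and only one pair of entities and one relation are modeled:

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

ent_labels = ["per", "loc", "org", "none"]
rel_labels = ["spouse_of", "born_in", "none"]

# Made-up classifier scores for entities E1, E2 and relation R12.
e1_score = {"per": 0.4, "loc": 0.3, "org": 0.2, "none": 0.1}
e2_score = {"per": 0.5, "loc": 0.2, "org": 0.2, "none": 0.1}
r12_score = {"spouse_of": 0.6, "born_in": 0.3, "none": 0.1}

xE1 = {l: LpVariable(f"E1_{l}", cat=LpBinary) for l in ent_labels}
xE2 = {l: LpVariable(f"E2_{l}", cat=LpBinary) for l in ent_labels}
xR12 = {l: LpVariable(f"R12_{l}", cat=LpBinary) for l in rel_labels}

prob = LpProblem("ne_relations", LpMaximize)
prob += (lpSum(e1_score[l] * xE1[l] for l in ent_labels)
         + lpSum(e2_score[l] * xE2[l] for l in ent_labels)
         + lpSum(r12_score[l] * xR12[l] for l in rel_labels))

# Each entity / relation takes exactly one label.
prob += lpSum(xE1.values()) == 1
prob += lpSum(xE2.values()) == 1
prob += lpSum(xR12.values()) == 1

# (R12 = spouse_of) => (E1 = per) and (E2 = per), as linear inequalities.
prob += xR12["spouse_of"] <= xE1["per"]
prob += xR12["spouse_of"] <= xE2["per"]

prob.solve()
print({v.name: 1 for v in prob.variables() if v.value() == 1})
```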

  46. CCM for NE-Relations. The objective: c{E1 = per} · x{E1 = per} + c{E1 = loc} · x{E1 = loc} + … + c{R12 = spouse_of} · x{R12 = spouse_of} + … + c{R12 = } · x{R12 = } + … • We showed a CCM formulation for the NE-Relations problem • In this case, the cost of each variable was learned independently, using trained classifiers • Other expressive problems can be formulated as Integer Linear Programs • For example, HMM/CRF inference (the Viterbi algorithm)

  47. TSP as an Integer Linear Program • Dantzig et al. were the first to suggest a practical solution to the traveling salesman problem using ILP.

  48. Reduction from Traveling Salesman to ILP (figure: a 3-node graph with binary edge variables x_ij and edge costs c_ij, e.g., x12 with cost c12). Constraint: every node has exactly ONE outgoing edge.

  49. Reduction from Traveling Salesman to ILP (same 3-node graph). Constraint: every node has exactly ONE incoming edge.

  50. Reduction from Traveling Salesman to ILP (same 3-node graph). Constraint: the solutions are binary (integer), i.e., each x_ij ∈ {0,1}.
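
A minimal sketch of this reduction (illustrative only; the costs are made up, PuLP is a tooling assumption, and only the degree and binary constraints from these slides are encoded, omitting the subtour-elimination constraints of Dantzig, Fulkerson and Johnson):

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

# Toy TSP on 3 nodes; edge costs are illustrative.
nodes = [1, 2, 3]
cost = {(i, j): abs(i - j) + 1 for i in nodes for j in nodes if i != j}

x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary)
     for i in nodes for j in nodes if i != j}

prob = LpProblem("tsp_degree_relaxation", LpMinimize)
prob += lpSum(cost[e] * x[e] for e in x)                     # total tour cost

for i in nodes:
    prob += lpSum(x[i, j] for j in nodes if j != i) == 1     # one outgoing edge
    prob += lpSum(x[j, i] for j in nodes if j != i) == 1     # one incoming edge

# NOTE: a full TSP reduction also needs subtour-elimination constraints;
# with only 3 nodes and no self-loops, subtours cannot occur.

prob.solve()
print([e for e in x if x[e].value() == 1], value(prob.objective))
```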
