Learning and Inference for Natural Language Understanding. Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign. With thanks to: Collaborators: Ming-Wei Chang, James Clarke, Michael Connor, Dan Goldwasser ,
Learning and Inference for Natural Language Understanding. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign. With thanks to: Collaborators: Ming-Wei Chang, James Clarke, Michael Connor, Dan Goldwasser, Lev Ratinov, Vivek Srikumar, many others. Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP). February 2012, Princeton Plasma Physics Laboratory
Learning and Inference in Natural Language • Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. • Structured Output Problems – multiple dependent output variables • (Learned) models/classifiers for different sub-problems • In some cases, not all local models can be learned simultaneously • In these cases, constraints may appear only at evaluation time • Incorporate models’ information, along with prior knowledge (constraints), in making coherent decisions • decisions that respect the local models as well as domain & context specific knowledge/constraints.
A process that maintains and updates a collection of propositions about the state of affairs. Comprehension (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now. This is an Inference Problem
Why is it difficult? Mapping between Meaning and Language: Ambiguity; Variability
Context Sensitive Paraphrasing • Candidate paraphrases: Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce • He used a Phillips head to tighten the screw. • The bank owner tightened security after a spate of local crimes. • The Federal Reserve will aggressively tighten monetary policy.
Variability in Natural Language Expressions. Example: Relation Extraction: “Works_for” • Jim Carpenter works for the U.S. Government. • The American government employed Jim Carpenter. • Jim Carpenter was fired by the US Government. • Jim Carpenter worked in a number of important positions. …. As a press liaison for the IRS, he made contacts in the White House. • Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter. • Former US Secretary of Defence Jim Carpenter spoke today…
Textual Entailment A key problem in natural language understanding is to abstract over the inherent syntactic and semantic variability in natural language. Is it true that…? (Textual Entailment) Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year Yahoo acquired Overture Overture is a search company Google is a search company Google owns Overture ……….
Why? • A law office wants to get the list of all people that were mentioned in email correspondence with the office. • For each name, determine whether it was mentioned adversarially or not. • A political scientist studies Climate Change and its effect on Societal instability. He wants to identify all events related to demonstrations, protests, parades, elections, analyze them (who, when, where, why) and generate a timeline • An electronic health record (EHR) is a personal health record in digital format. It includes information relating to: • Current and historical health, medical conditions and medical tests; medical referrals, treatments, medications, demographic information etc. • Today: a write-only document • Can we use it in medical advice systems; medication selection and tracking (Vioxx…); disease outbreak and control; science – correlating response to drugs with other conditions
Background • It’s difficult to program predicates of interest due to • Ambiguity (everything has multiple meanings) • Variability (everything you want to say you can say in many ways) • Consequently: all of Natural Language Processing is driven by Statistical Machine Learning • Even simple predicates like: • What is the part of speech of the word “can”, or • Correct: I’d like a peace of cake • Not to mention harder problems like co-reference resolution, parsing, semantic parsing, named entity recognition,… • Machine Learning problems in NLP are large: • Often, $10^6$ features (due to lexical items, conjunctions of them, etc.) • We are pretty good for some classes of problems.
Classification: Ambiguity Resolution • Illinois’ bored of education [board] • Nissan Car and truck plant; plant and animal kingdom • (This Art) (can N) (will MD) (rust V) V,N,N • The dog bit the kid. He was taken to a veterinarian; a hospital • Tiger was in Washington for the PGA Tour: Finance; Banking; World News; Sports • Important or not important; love or hate
Classification: learn a function $f : X \to Y$ that maps observations in a domain to one of several categories. Classification is Well Understood • Theoretically: generalization bounds • How many examples does one need to see in order to guarantee good behavior on previously unobserved examples? • Algorithmically: good learning algorithms for linear representations. • Can deal with very high dimensionality ($10^6$ features) • Very efficient in terms of computation and # of examples. On-line. • Key issues remaining: • Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; adaptation to new domains. • What are the features? No good theoretical understanding here. • Is it sufficient for making progress in NLP?
Coherency in Semantic Role Labeling • Predicate-arguments generated should be consistent across phenomena • The touchdown scored by Cutler cemented the victory of the Bears. • Linguistic Constraints: • A0: the Bears; Sense(of): 11(6) • A0: Cutler; Sense(by): 1(1)
Semantic Parsing • X: “What is the largest state that borders New York and Maryland?” • Y: largest(state(next_to(state(NY)) AND next_to(state(MD)))) • Successful interpretation involves multiple decisions • What entities appear in the interpretation? • “New York” refers to a state or a city? • How to compose fragments together? • state(next_to()) >< next_to(state())
Learning and Inference • Natural Language Decisions are Structured • Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. • It is essential to make coherent decisions in a way that takes the interdependencies into account. Joint, Global Inference. • Today: • How to support Structured Predictions in NLP • Using declarative constraints in Natural Language Processing • Learning and Inference issues • Mostly Examples
Outline • Background: • Natural Language Processing: problems and difficulties • Global Inference with expressive structural constraints in NLP • Constrained Conditional Models • Some Learning Issues in the presence of minimal supervision • Constraints Driven Learning • Learning with Indirect Supervision • Response based Learning • More Examples
Statistics or Linguistics? • Statistical approaches were very successful in NLP • But, it has become clear that there is a need to move from strictly Data Driven approaches to Knowledge Driven approaches • Knowledge: Linguistics, Background world knowledge • How to incorporate Knowledge into Statistical Learning & Decision Making? • In many respects Structured Prediction addresses this question. • This also distinguishes it from the “standard” study of probabilistic models.
Inference with General Constraint Structure: Recognizing Entities and Relations • Dole’s wife, Elizabeth, is a native of N.C. • Entity variables E1, E2, E3; relation variables R12, R23 • How to guide the global inference? Why not learn jointly?
Pipeline • Most problems are not single classification problems • Stages, starting from Raw Data: POS Tagging, Phrases, Semantic Entities, Relations, Parsing, WSD, Semantic Role Labeling • Conceptually, pipelining is a crude approximation • Interactions occur across levels, and downstream decisions often interact with previous decisions. • Leads to propagation of errors • Occasionally, later-stage problems are easier but cannot correct earlier errors. • But, there are good reasons to use pipelines • Putting everything in one basket may not be right • How about choosing some stages and thinking about them jointly?
Semantic Role Labeling: Who did what to whom, when, where, why,… • I left my pearls to my daughter in my will. • [I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC. • A0: Leaver • A1: Things left • A2: Benefactor • AM-LOC: Location • Constraints on the decisions: no overlapping arguments; if A2 is present, A1 must also be present. • How to express the constraints on the decisions? How to “enforce” them?
Semantic Role Labeling (2/2) • PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. • It adds a layer of generic semantic labels to Penn Treebank II. • (Almost) all the labels are on the constituents of the parse trees. • Core arguments: A0-A5 and AA • different semantics for each verb • specified in the PropBank Frame files • 13 types of adjuncts labeled as AM-arg • where arg specifies the adjunct type
Algorithmic Approach (running example: “I left my nice pearls to her”; the bracketed spans over the sentence are the candidate arguments) • Identify argument candidates • Pruning [Xue&Palmer, EMNLP’04] • Argument Identifier • Binary classification (A-Perc) • Classify argument candidates • Argument Classifier • Multi-class classification (A-Perc) • Inference • Use the estimated probability distribution given by the argument classifier • Use structural and linguistic constraints • Infer the optimal global output: [I] left [my nice pearls] [to her]
Semantic Role Labeling (SRL) • I left my pearls to my daughter in my will. • One inference problem for each verb predicate.
Integer Linear Programming Inference • For each argument $a_i$ • Set up a Boolean variable $a_{i,t}$ indicating whether $a_i$ is classified as $t$ • Goal is to maximize $\sum_{i,t} \mathrm{score}(a_i = t)\, a_{i,t}$ • Subject to the (linear) constraints • If $\mathrm{score}(a_i = t) = P(a_i = t)$, the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints. • The Constrained Conditional Model is completely decomposed during training
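The inference step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: exhaustive enumeration stands in for an ILP solver such as Xpress-MP, and the label set, the scores, and the names `srl_inference` and `no_duplicate_core` are hypothetical.

```python
from itertools import product

def srl_inference(scores, labels, is_legal):
    """Return the legal label assignment with the highest summed local score.

    scores[i][t] is the local score for giving candidate i the label t.
    Exhaustive search stands in for an ILP solver here, so this is only
    practical for a handful of candidates.
    """
    best, best_value = None, float("-inf")
    for assignment in product(labels, repeat=len(scores)):
        if not is_legal(assignment):
            continue
        value = sum(scores[i][t] for i, t in enumerate(assignment))
        if value > best_value:
            best, best_value = assignment, value
    return best, best_value

def no_duplicate_core(assignment, core=("A0", "A1", "A2")):
    # Assumed constraint: each core label is used at most once.
    return all(assignment.count(t) <= 1 for t in core)

# Hypothetical classifier scores for three candidate arguments.
LABELS = ("A0", "A1", "A2", "NONE")
scores = [
    {"A0": 0.7, "A1": 0.2, "A2": 0.05, "NONE": 0.05},
    {"A0": 0.6, "A1": 0.3, "A2": 0.05, "NONE": 0.05},
    {"A0": 0.1, "A1": 0.2, "A2": 0.6, "NONE": 0.1},
]
best, value = srl_inference(scores, LABELS, no_duplicate_core)
```

Taken independently, the first two candidates would both be labeled A0; the constraint forces the second one to fall back to A1, which is the kind of correction global inference is meant to provide.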
Constraints • Any Boolean rule can be encoded as a (collection of) linear constraints. • Universally quantified rules • No duplicate argument classes: $\sum_{a \in \mathrm{POTARG}} x_{\{a = A0\}} \le 1$ • R-ARG (if there is an R-ARG phrase, there is an ARG phrase): $\forall a_2 \in \mathrm{POTARG},\ \sum_{a \in \mathrm{POTARG}} x_{\{a = A0\}} \ge x_{\{a_2 = \text{R-A0}\}}$ • C-ARG (if there is a C-ARG phrase, there is an ARG before it): $\forall a_2 \in \mathrm{POTARG},\ \sum_{(a \in \mathrm{POTARG}) \wedge (a \text{ is before } a_2)} x_{\{a = A0\}} \ge x_{\{a_2 = \text{C-A0}\}}$ • Many other possible constraints: • Unique labels • No overlapping or embedding • Relations between number of arguments; order constraints • If verb is of type A, no argument of type B • LBJ: allows a developer to encode constraints in FOL to be compiled into linear inequalities automatically. • Joint inference can also be used to combine different (SRL) systems.
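To make the encoding concrete, the three rules above can be checked as inequalities over 0/1 indicator variables. This is an illustrative sketch only: the helper names are hypothetical, candidates are assumed to be indexed in sentence order, and nothing here reflects how LBJ compiles constraints.

```python
def indicators(assignment, labels):
    """x[(i, t)] = 1 if candidate i is given label t, else 0."""
    return {(i, t): int(assignment[i] == t)
            for i in range(len(assignment)) for t in labels}

def no_duplicate(x, n, label):
    # sum_a x[a = label] <= 1
    return sum(x[(a, label)] for a in range(n)) <= 1

def r_arg(x, n, label):
    # for every a2: sum_a x[a = label] >= x[a2 = R-label]
    total = sum(x[(a, label)] for a in range(n))
    return all(total >= x[(a2, "R-" + label)] for a2 in range(n))

def c_arg(x, n, label):
    # for every a2: sum over a before a2 of x[a = label] >= x[a2 = C-label]
    return all(
        sum(x[(a, label)] for a in range(a2)) >= x[(a2, "C-" + label)]
        for a2 in range(n)
    )

LABELS = ("A0", "R-A0", "C-A0", "NONE")
ok = indicators(("A0", "R-A0", "NONE"), LABELS)
bad = indicators(("R-A0", "NONE", "NONE"), LABELS)    # R-A0 without any A0
early = indicators(("C-A0", "A0", "NONE"), LABELS)    # C-A0 before its A0
```

Each function is a direct transcription of one inequality, so the same expressions could be handed to an ILP solver as linear constraints instead of being evaluated after the fact.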
SRL: Formulation & Outcomes • Demo: http://cogcomp.cs.illinois.edu/page/demos • 2) Produces a very good semantic parser. F1~90% • 3) Easy and fast: ~7 Sent/Sec (using Xpress-MP) • Top ranked system in CoNLL’05 shared task • Key difference is the Inference
Constrained Conditional Models (aka ILP Inference) • Weight vector for “local” models • Features, classifiers; log-linear models (HMM, CRF) or a combination • (Soft) constraints component: penalty for violating the constraint; how far y is from a “legal” assignment • How to solve? This is an Integer Linear Program. Solving using ILP packages gives an exact solution. Cutting Planes, Dual Decomposition & other search techniques are possible • How to train? Training is learning the objective function. Decouple? Decompose? How to exploit the structure to minimize supervision?
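The formula these labels annotate did not survive in the transcript. A plausible reconstruction, pieced together from the labels on this slide and from the inference step of the CoDL algorithm later in the deck, is given below; the indexing of the constraints by $k$ is an assumption.

```latex
\hat{y} \;=\; \arg\max_{y} \;
  \underbrace{w^{T}\phi(x, y)}_{\text{``local'' models}}
  \;-\;
  \underbrace{\sum_{k} \rho_{k}\, d_{C_k}(x, y)}_{\text{(soft) constraints component}}
```

Here $w$ is the weight vector for the local models, $\phi(x,y)$ the features, $\rho_k$ the penalty for violating constraint $C_k$, and $d_{C_k}(x,y)$ measures how far $y$ is from a legal assignment.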
Three Ideas • Idea 1 (Modeling): Separate modeling and problem formulation from algorithms • Similar to the philosophy of probabilistic modeling • Idea 2 (Inference): Keep model simple, make expressive decisions (via constraints) • Unlike probabilistic modeling, where models become more expressive • Idea 3 (Learning): Expressive structured decisions can be supervised indirectly via related simple binary decisions • Global Inference can be used to amplify the minimal supervision.
Examples: CCM Formulations (aka ILP for NLP) • CCMs can be viewed as a general interface to easily combine declarative domain knowledge with data driven statistical models • Formulate NLP Problems as ILP problems (inference may be done otherwise) • 1. Sequence tagging (HMM/CRF + Global constraints) • 2. Sentence Compression (Language Model + Global Constraints) • 3. SRL (Independent classifiers + Global Constraints) • Sequential Prediction, HMM/CRF based: $\arg\max \sum \lambda_{ij} x_{ij}$ • Linguistics Constraints: cannot have both A states and B states in an output sequence. • Sentence Compression/Summarization, Language Model based: $\arg\max \sum \lambda_{ijk} x_{ijk}$ • Linguistics Constraints: if a modifier is chosen, include its head; if a verb is chosen, include its arguments.
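The sequence-tagging formulation can be illustrated with a toy decoder. This is a sketch under stated assumptions: the emission and transition scores are made up, the state names follow the slide’s “A states and B states” example, and enumeration replaces the ILP solver or constrained Viterbi that would be used in practice.

```python
from itertools import product

def best_sequence(emit, trans, states, allowed):
    """Highest-scoring state sequence among those accepted by `allowed`.

    emit[i][s]: local score of state s at position i.
    trans[(s, s2)]: score of moving from s to s2 (0 if absent).
    """
    def score(seq):
        local = sum(emit[i][s] for i, s in enumerate(seq))
        pairwise = sum(trans.get(pair, 0) for pair in zip(seq, seq[1:]))
        return local + pairwise

    legal = [seq for seq in product(states, repeat=len(emit)) if allowed(seq)]
    return max(legal, key=score)

def not_both_a_and_b(seq):
    # Global constraint: an output sequence may not mix A and B states.
    return not ("A" in seq and "B" in seq)

STATES = ("A", "B", "O")
emit = [{"A": 3, "B": 1, "O": 0},
        {"A": 0, "B": 4, "O": 2},
        {"A": 1, "B": 0, "O": 2}]
trans = {("A", "O"): 1}

free = best_sequence(emit, trans, STATES, lambda seq: True)
constrained = best_sequence(emit, trans, STATES, not_both_a_and_b)
```

The local model is untouched between the two calls; only the set of admissible outputs changes, which is the point of keeping the model simple and putting the expressiveness into the constraints.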
Outline • Background: • Natural Language Processing: problems and difficulties • Global Inference with expressive structural constraints in NLP • Constrained Conditional Models • Some Learning Issues in the presence of minimal supervision • Constraints Driven Learning • Learning with Indirect Supervision • Response based Learning • More Examples Learning structured models requires annotating structures Interdependencies among decision variables should be exploited in Inference & Learning. Goal: learn from minimal, indirect supervision. Amplify it using variables’ interdependencies
Information extraction without Prior Knowledge Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 . Prediction result of a trained HMM Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 . [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] Violates lots of natural constraints!
Strategies for Improving the Results • (Pure) Machine Learning Approaches: increasing the model complexity increases the difficulty of learning • Higher Order HMM/CRF? • Increasing the window size? • Adding a lot of new features • Requires a lot of labeled examples • What if we only have a few labeled examples? • Other options? • Constrain the output to make sense • Push the (simple) model in a direction that makes sense • Can we keep the learned model simple and still make expressive decisions?
Examples of Constraints • Easy to express pieces of “knowledge”; Non Propositional; May use Quantifiers • Each field must be a consecutive list of words and can appear at most once in a citation. • State transitions must occur on punctuation marks. • The citation can only start with AUTHOR or EDITOR. • The words pp., pages correspond to PAGE. • Four digits starting with 20xx and 19xx are DATE. • Quotations can appear only in TITLE • …….
Information Extraction with Constraints • Constrained Conditional Models Allow: • Learning a simple model • Making decisions with a more complex model • Accomplished by directly incorporating constraints to bias/re-rank decisions made by the simpler model • Adding constraints, we get correct results! • Without changing the model • [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
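A small sketch of what “bias/re-rank decisions made by the simpler model” can look like for the citation task. It is hypothetical throughout: the candidate labelings and their model scores are invented, only three of the listed constraints are implemented, and the penalty weight `rho` is an arbitrary choice.

```python
PUNCT = {".", ",", ";"}

def violations(tokens, labels):
    """Count violations of three of the citation constraints."""
    count = 0
    # The citation can only start with AUTHOR or EDITOR.
    if labels[0] not in ("AUTHOR", "EDITOR"):
        count += 1
    # State transitions must occur on punctuation marks.
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1] and tokens[i - 1] not in PUNCT:
            count += 1
    # Each field is one consecutive block, so a label may not reappear later.
    seen, previous = set(), None
    for label in labels:
        if label != previous and label in seen:
            count += 1
        seen.add(label)
        previous = label
    return count

def rerank(tokens, candidates, rho=10.0):
    """Pick the (labels, score) pair maximizing score - rho * violations."""
    return max(candidates, key=lambda c: c[1] - rho * violations(tokens, c[0]))

tokens = ["Lars", "Andersen", ".", "Program", "analysis", ".", "1994"]
hmm_guess = ["AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "EDITOR", "EDITOR", "DATE"]
fixed = ["AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE", "DATE"]
# Hypothetical model scores: the unconstrained model slightly prefers hmm_guess.
best_labels, _ = rerank(tokens, [(hmm_guess, 5.0), (fixed, 4.0)])
```

The model’s scores are never changed; the constraint penalty alone moves the decision to the labeling that keeps each field contiguous.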
Guiding (Semi-Supervised) Learning with Constraints • In traditional Semi-Supervised learning the model can drift away from the correct one. • Constraints can be used to generate better training data • At training time, to improve labeling of un-labeled data (and thus improve the model) • At decision time, to bias the objective function towards favoring constraint satisfaction. • (Diagram: Model, Constraints, Un-labeled Data, Decision Time Constraints)
Constraints Driven Learning (CoDL) [Chang, Ratinov, Roth, ACL’07; ICML’08; ML, to appear]. Related to Ganchev et al. [PR work 09, 10]. Several Training Paradigms.
$(w_0, \rho_0) = \mathrm{learn}(L)$ (supervised learning algorithm parameterized by $(w, \rho)$; learning can be justified as an optimization procedure for an objective function)
For $N$ iterations do:
  $T = \emptyset$
  For each $x$ in the unlabeled dataset:
    $h \leftarrow \arg\max_y \; w^T \phi(x, y) - \sum_k \rho_k\, d_{C_k}(x, y)$ (inference with constraints: augment the training set)
    $T = T \cup \{(x, h)\}$
  $(w, \rho) = \gamma\,(w_0, \rho_0) + (1 - \gamma)\,\mathrm{learn}(T)$ (learn from new training data; weigh supervised & unsupervised models)
Excellent experimental results showing the advantages of using constraints, especially with small amounts of labeled data [Chang et al., others]. Can be viewed as a Constrained Expectation Maximization (EM) algorithm. This is the hard EM version; it can be generalized in several directions.
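The loop can be made concrete with a toy version in Python. This is a sketch, not the published CoDL implementation: the “learner” is a relative-frequency counter, the penalty `rho` is held fixed rather than learned together with `w`, the single constraint (exactly one NAME per sequence) is invented for illustration, and enumeration replaces constrained inference.

```python
from collections import Counter
from itertools import product

LABELS = ("OTHER", "NAME")

def learn(data):
    """Toy supervised learner: relative frequency of each (token, label)."""
    counts = Counter((tok, lab) for toks, labs in data
                     for tok, lab in zip(toks, labs))
    totals = Counter(tok for toks, _ in data for tok in toks)
    return {key: n / totals[key[0]] for key, n in counts.items()}

def violations(labels):
    # Hypothetical constraint: exactly one token per sequence is a NAME.
    return abs(labels.count("NAME") - 1)

def infer(w, tokens, rho):
    """argmax_y  w . phi(x, y) - rho * d_C(x, y), by enumeration."""
    def objective(labels):
        local = sum(w.get((t, l), 0.0) for t, l in zip(tokens, labels))
        return local - rho * violations(labels)
    return max(product(LABELS, repeat=len(tokens)), key=objective)

def codl(labeled, unlabeled, rho=5.0, gamma=0.5, iterations=3):
    w0 = learn(labeled)
    w = dict(w0)
    for _ in range(iterations):
        # Constrained inference labels the unlabeled data (hard EM, top-1).
        pseudo = [(x, infer(w, x, rho)) for x in unlabeled]
        wt = learn(pseudo)
        # Interpolate the supervised model with the newly learned one.
        w = {k: gamma * w0.get(k, 0.0) + (1 - gamma) * wt.get(k, 0.0)
             for k in set(w0) | set(wt)}
    return w

labeled = [(("hi", "Dan"), ("OTHER", "NAME"))]
unlabeled = [("hi", "Ming")]
w = codl(labeled, unlabeled)
```

In this toy run the supervised model knows nothing about “Ming”; it is the constraint that forces one NAME into the sequence, and that pseudo-label then becomes part of the model.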
Constraints Driven Learning (CoDL) [Chang, Ratinov, Roth, ACL’07; ICML’08; ML, to appear]. Related to Ganchev et al. [PR work 09, 10] • Semi-Supervised Learning Paradigm that makes use of constraints to bootstrap from a small number of examples • Constraints are used to: • Bootstrap a semi-supervised learner • Correct weak models’ predictions on unlabeled data, which in turn are used to keep training the model. • (Figure; surviving labels: “Objective function”, “Learning w 10 Constraints”, “Poor model + constraints”, “Learning w/o Constraints: 300 examples”; x-axis: # of available labeled examples)
Constrained Conditional Models [AAAI’08, MLJ’12] • Constrained Conditional Models – ILP formulations – have been shown useful in the context of many NLP problems [Roth&Yih, 04,07; Chang et al. 07,08,…] • SRL, Summarization; Co-reference; Information & Relation Extraction; Event Identifications; Transliteration; Textual Entailment; Knowledge Acquisition • Some theoretical work on training paradigms [Punyakanok et al., 05 more] • See a NAACL’10 tutorial on my web page & a NAACL’09 ILPNLP workshop • Summary of work & a bibliography: http://L2R.cs.uiuc.edu/tutorials.html
Connecting Language to the World [CoNLL’10, ACL’11, IJCAI’11] • Input: “Can I get a coffee with no sugar and just a bit of milk” • Semantic Parser output: MAKE(COFFEE,SUGAR=NO,MILK=LITTLE) • Response: “Great!” or “Arggg” • Can we rely on this interaction to provide supervision? This requires that we use the minimal binary supervision (“good structure” / “bad structure”) as a way to learn how to generate good structures.
Key Ideas in Learning Structures • Idea 1: Simple, easy to supervise, binary decisions often depend on the structure you care about. Learning to do well on the binary task can drive the structure learning. • Idea 2: Global Inference can be used to amplify the minimal supervision. • Idea 2 ½: There are several settings where a binary label can be used to replace a structured label. Perhaps the most intriguing is where you use the world’s response to the model’s actions.
I. Paraphrase Identification • Given an input $x \in X$, learn a model $f : X \to \{-1, 1\}$ • Consider the following sentences: • S1: Druce will face murder charges, Conte said. • S2: Conte said Druce will be charged with murder. • Are S1 and S2 a paraphrase of each other? • There is a need for an intermediate representation to justify this decision • We need latent variables that explain why this is a positive example. • Given an input $x \in X$, learn a model $f : X \to H \to \{-1, 1\}$
Algorithms: Two Conceptual Approaches • Two stage approach (a pipeline; typically used for TE, paraphrase id, others) • Learn hidden variables; fix them • Need supervision for the hidden layer (or heuristics) • For each example, extract features over x and (the fixed) h. • Learn a binary classifier for the target task • Proposed Approach: Joint Learning • Drive the learning of h from the binary labels • Find the best h(x) • An intermediate structure representation is good to the extent it supports better final prediction. • Algorithm? How to drive learning a good H? (X → H → Y)
Algorithmic Intuition • If x is positive • There must exist a good explanation (intermediate representation) • $\exists h,\ w^T \phi(x, h) \ge 0$ • or, $\max_h w^T \phi(x, h) \ge 0$ • If x is negative • No explanation is good enough to support the answer • $\forall h,\ w^T \phi(x, h) \le 0$ • or, $\max_h w^T \phi(x, h) \le 0$ • Altogether, this can be combined into an objective function: $\min_w \frac{\lambda}{2}\|w\|^2 + C \sum_i L\big(1 - z_i \max_{h \in C} w^T \sum_s h_s \phi_s(x_i)\big)$ • Inference: best h subject to constraints C. The chosen h selects a representation; $\sum_s h_s \phi_s(x_i)$ is the new feature vector for the final decision. • Why does inference help? • Constrains intermediate representations supporting good predictions
Optimization • Non-convex, due to the maximization term inside the global minimization problem • In each iteration: • Find the best feature representation h* for all positive examples (off-the-shelf ILP solver) • Having fixed the representation for the positive examples, update w solving the convex optimization problem: • Not the standard SVM/LR: need inference • Asymmetry: Only positive examples require a good intermediate representation that justifies the positive label. • Consequently, the objective function decreases monotonically
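The alternating procedure can be sketched on a toy problem. Everything specific here is an assumption rather than the authors’ method: each hidden structure is represented directly by its feature vector, enumeration replaces the ILP solver, the loss `L` is taken to be the hinge loss, and plain subgradient steps stand in for whatever convex solver is used for the inner problem.

```python
def dot(w, phi):
    return sum(wi * pi for wi, pi in zip(w, phi))

def best_h(w, candidates):
    """Inference step: the candidate representation with the highest score.
    `candidates` plays the role of the constrained set C."""
    return max(candidates, key=lambda phi: dot(w, phi))

def predict(w, candidates):
    return 1 if dot(w, best_h(w, candidates)) >= 0 else -1

def train(examples, dim, lam=0.1, c=1.0, lr=0.1, outer=5, inner=20):
    """examples: list of (candidates, z) with z in {+1, -1}."""
    w = [0.0] * dim
    for _ in range(outer):
        # Step 1: fix the best representation h* of every positive example.
        fixed = [best_h(w, cands) if z > 0 else None for cands, z in examples]
        # Step 2: subgradient steps on the resulting convex objective.
        for _ in range(inner):
            grad = [lam * wi for wi in w]
            for (cands, z), h_star in zip(examples, fixed):
                # Negatives still need inference under the current w.
                phi = h_star if z > 0 else best_h(w, cands)
                if 1 - z * dot(w, phi) > 0:  # hinge loss is active
                    grad = [g - c * z * p for g, p in zip(grad, phi)]
            w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

positive = ([(1.0, 0.0), (0.0, 1.0)], 1)
negative = ([(0.0, 1.0), (-1.0, 0.0)], -1)
w = train([positive, negative], dim=2)
```

The asymmetry from the slide shows up in Step 2: positives use their fixed h*, while negatives are re-scored against every candidate, since no explanation of a negative example is allowed to score well.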
Iterative Objective Function Learning • Formalized as Structured SVM + Constrained Hidden Structure • LCLR: Learning Constrained Latent Representation • Steps in the loop: • Initial objective function • Inference: best h subj. to C (ILP inference discussed earlier; restricts the possible hidden structures considered) • Generate features • Prediction with inferred h • Training w/r to binary decision label (feedback relative to binary problem) • Update weight vector