This talk explores global inference over local models/classifiers with expressive constraints, including training paradigms, semi-supervised learning, and examples in semantic parsing and information extraction.
Learning and Global Inference for Information Access and Natural Language Understanding
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
May 2007, DIMACS ONR Workshop
Learning and Inference • Global decisions in which several local decisions play a role, but with mutual dependencies among their outcomes. • (Learned) models/classifiers for different sub-problems. • Incorporate the classifiers' information, along with constraints, to make coherent decisions: decisions that respect the local models as well as domain- and context-specific constraints. • Global inference for the best assignment to all variables of interest.
Comprehension
A process that maintains and updates a collection of propositions about the state of affairs.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin’s dad was a magician.
4. Christopher Robin must be at least 65 now.
What We Know: Stand-Alone Ambiguity Resolution
• Illinois’ bored of education [board]
• ...Nissan Car and truck plant is ... / ...divide life into plant and animal kingdom
• (This Art) (can N) (will MD) (rust V) vs. V, N, N
• The dog bit the kid. He was taken to a veterinarian / a hospital.
Learn a function f: X → Y that maps observations in a domain to one of several categories. Broad coverage.
Comprehension (same passage and questions as above): This is an Inference Problem.
This Talk • Global Inference over Local Models/Classifiers + Expressive Constraints • Model • Generality of the framework • Training Paradigms • Global vs. Local training • Semi-Supervised Learning • Examples • Semantic Parsing • Information Extraction • Pipeline processes
Problem Setting • Random variables Y: y1, ..., y8 • Conditional distributions P (learned by models/classifiers) • Constraints C: any Boolean function defined on partial assignments, e.g., C(y1, y4), C(y2, y3, y6, y7, y8) (possibly with weights W) • Goal: find the “best” assignment, the assignment that achieves the highest global accuracy. • This is an Integer Programming problem: Y* = argmax_Y P(Y) subject to constraints C. Other, more general ways to incorporate soft constraints exist [ACL’07].
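The "best assignment subject to constraints" objective can be illustrated with a tiny brute-force search. This is a toy sketch with made-up scores and a made-up constraint; the real systems in this talk use an ILP solver instead of enumeration.

```python
from itertools import product

# Hypothetical per-variable label scores (e.g., log-probabilities from local classifiers).
scores = {
    "y1": {"A": -0.1, "B": -2.3},
    "y2": {"A": -1.6, "B": -0.2},
    "y3": {"A": -0.7, "B": -0.7},
}

# A constraint is any Boolean function over a (partial) assignment.
def c_not_all_same(y):
    return len(set(y.values())) > 1  # forbid assigning one label everywhere

constraints = [c_not_all_same]

def best_assignment(scores, constraints):
    """argmax_Y of the summed local scores, subject to all constraints C."""
    names = sorted(scores)
    best, best_score = None, float("-inf")
    for labels in product(*(scores[n] for n in names)):
        y = dict(zip(names, labels))
        if not all(c(y) for c in constraints):
            continue  # infeasible assignment: skip
        s = sum(scores[n][y[n]] for n in names)
        if s > best_score:
            best, best_score = y, s
    return best

print(best_assignment(scores, constraints))  # {'y1': 'A', 'y2': 'B', 'y3': 'A'}
```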
Formal Model • Weight vector w for the “local” models: a collection of classifiers; log-linear models (HMM, CRF) or a combination. • (Soft) constraints component: a penalty for violating each constraint, scaled by how far y is from a “legal” assignment. • Putting the pieces together, the objective is: y* = argmax_y w · φ(x, y) − Σ_k ρ_k d(y, 1_{C_k}) • How to solve? This is an Integer Linear Program. In many of our applications, large-scale problems were solved efficiently using a commercial ILP package to yield exact solutions. Search techniques are also possible.
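A minimal sketch of the soft-constraint objective: the local model score minus a penalty ρ_k for each constraint C_k, scaled by a distance d(y, 1_{C_k}). All names and numbers below are hypothetical, and the distance used is the simplest possible one (0/1 violation).

```python
# Sketch:  score(y) = local_score(y) - sum_k rho_k * d(y, C_k)

def local_score(y, scores):
    return sum(scores[i][label] for i, label in enumerate(y))

def violation_distance(y, constraint):
    # Simplest distance: 0 if the constraint is satisfied, 1 if violated.
    return 0 if constraint(y) else 1

def soft_score(y, scores, soft_constraints):
    penalty = sum(rho * violation_distance(y, c) for c, rho in soft_constraints)
    return local_score(y, scores) - penalty

scores = [{"A": 0.9, "B": 0.4}, {"A": 0.8, "B": 0.3}]
no_duplicate_A = lambda y: y.count("A") <= 1   # soft constraint, weight rho = 1.0

y1 = ("A", "A")   # higher local score, but violates the constraint
y2 = ("A", "B")   # lower local score, satisfies it
print(soft_score(y1, scores, [(no_duplicate_A, 1.0)]))  # 0.9 + 0.8 - 1.0 = 0.7
print(soft_score(y2, scores, [(no_duplicate_A, 1.0)]))  # 0.9 + 0.3 - 0.0 = 1.2
```

With a large enough ρ the penalty acts like a hard constraint; with a small ρ the local models can override it, which is exactly the soft/hard trade-off on the slide.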
A General Inference Setting • Essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth ’99; Collins ’02; Lafferty et al. ’02] • Linear objective functions can be derived from a probabilistic perspective: • Markov Random Field [standard; Kleinberg & Tardos] • Optimization problem (Metric Labeling) [Chekuri et al. ’01] • Linear programming problems • Inference as constrained optimization [Yih & Roth, CoNLL’04] [Punyakanok et al., COLING’04] ... • The probabilistic perspective supports finding the most likely assignment, which is not necessarily what we want. • Our integer linear programming (ILP) formulation: • Allows the incorporation of more general cost functions • General (non-sequential) constraint structure • Better exploitation (computationally) of hard constraints • Can find the optimal solution if desired
Example 1: Semantic Role Labeling
Who did what to whom, when, where, why, ...
I left my pearls to my daughter in my will .
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC .
• A0: Leaver • A1: Things left • A2: Benefactor • AM-LOC: Location
• Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. • Implications on training paradigms.
Constraints: no overlapping arguments; if A2 is present, A1 must also be present.
Semantic Role Labeling (2/2) • PropBank [Palmer et al. ’05] provides a large human-annotated corpus of semantic verb-argument relations. • It adds a layer of generic semantic labels to Penn Treebank II. • (Almost) all the labels are on the constituents of the parse trees. • Core arguments: A0-A5 and AA • different semantics for each verb • specified in the PropBank Frame files • 13 types of adjuncts labeled as AM-arg, where arg specifies the adjunct type.
Algorithmic Approach (running example: I left my nice pearls to her)
• Identify argument candidates (identify the candidate-argument vocabulary) • Pruning [Xue & Palmer, EMNLP’04] • Argument Identifier: binary classification (SNoW)
• Classify argument candidates • Argument Classifier: multi-class classification (SNoW)
• Inference (over the old and new vocabulary) • Use the estimated probability distribution given by the argument classifier • Use structural and linguistic constraints • Infer the optimal global output
Argument Identification & Classification • Both the argument identifier and the argument classifier are trained phrase-based classifiers. • Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse] • Learning algorithm: SNoW • Sparse network of linear functions • Weights learned by the regularized Winnow multiplicative update rule • Probability conversion is done via softmax: p_i = exp(act_i) / Σ_j exp(act_j)
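The softmax conversion from raw activations to probabilities can be written in a few lines. This is a generic sketch (with the standard max-shift for numerical stability), not SNoW's actual implementation.

```python
import math

def softmax(activations):
    """p_i = exp(act_i) / sum_j exp(act_j), shifted by the max for stability."""
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # probabilities summing to 1, largest for the largest activation
```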
Inference (running example: I left my nice pearls to her) • The output of the argument classifier often violates some constraints, especially when the sentence is long. • Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et al. ’04; Roth & Yih ’04] • Input: the probability estimates (from the argument classifier) and structural and linguistic constraints. • Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Integer Linear Programming Inference • For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t. • Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t} • Subject to the (linear) constraints; any Boolean constraint can be encoded as linear constraint(s). • If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
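On a toy instance, the ILP objective can be checked by exhaustive search (standing in for the solver). The probabilities, labels, and the single "no duplicate core labels" constraint below are all hypothetical.

```python
from itertools import product

# Hypothetical classifier probabilities P(a_i = t) for 3 candidate arguments.
LABELS = ["A0", "A1", "NONE"]
P = [
    {"A0": 0.6, "A1": 0.3, "NONE": 0.1},
    {"A0": 0.5, "A1": 0.4, "NONE": 0.1},
    {"A0": 0.1, "A1": 0.2, "NONE": 0.7},
]

def no_duplicates(y):
    # Linear form: sum_i a_{i,t} <= 1 for each core label t.
    return all(y.count(t) <= 1 for t in ("A0", "A1"))

# Brute force stands in for the ILP solver on this toy problem:
# maximize sum_i P(a_i = y_i), i.e., the expected number of correct
# arguments, subject to the constraint.
best = max(
    (y for y in product(LABELS, repeat=len(P)) if no_duplicates(y)),
    key=lambda y: sum(P[i][t] for i, t in enumerate(y)),
)
print(best)  # ('A0', 'A1', 'NONE')
```

Note that the unconstrained maximizer would label both of the first two candidates A0; the constraint forces the second one to the next-best label, exactly the kind of correction the ILP inference performs.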
Constraints
Any Boolean rule can be encoded as a linear constraint; the rules are universally quantified.
• No duplicate argument classes: Σ_{a ∈ POTARG} x_{a = A0} ≤ 1
• R-ARG (if there is an R-ARG phrase, there is an ARG phrase): ∀ a2 ∈ POTARG: Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0}
• C-ARG (if there is a C-ARG phrase, there is an ARG before it): ∀ a2 ∈ POTARG: Σ_{a ∈ POTARG, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}
• Many other possible constraints: unique labels; no overlapping or embedding; relations between numbers of arguments; if the verb is of type A, no argument of type B.
Joint inference can also be used to combine different SRL systems.
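The claim that Boolean rules translate to linear constraints over 0/1 variables can be verified mechanically. The sketch below takes the R-ARG rule as an example (with a made-up number of candidate slots) and checks the Boolean rule against its linear encoding on every assignment.

```python
from itertools import product

# Boolean rule: "if an R-A0 argument exists, an A0 argument must exist".
def rule(a0_flags, ra0_flag):
    return (not ra0_flag) or any(a0_flags)

# Linear encoding over 0/1 variables:  sum_a x_{a=A0}  >=  x_{a2=R-A0}
def linear(a0_flags, ra0_flag):
    return sum(a0_flags) >= int(ra0_flag)

# The two agree on every 0/1 assignment (3 hypothetical A0 slots here):
for bits in product([0, 1], repeat=4):
    *a0, ra0 = bits
    assert rule(a0, ra0) == linear(a0, ra0)
print("Boolean rule and linear constraint agree on all assignments")
```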
Semantic Parsing: Summary I • This approach produces a very good semantic parser: F1 ≈ 90%. Top-ranked system in the CoNLL’05 shared task; the key difference is the inference. • Easy and fast: ~7 sentences/second (using Xpress-MP). • A lot of room for improvement (additional constraints). • Demo available: http://L2R.cs.uiuc.edu/~cogcomp
Extracting Relations via Semantic Analysis • Semantic parsing reveals several relations in the sentence along with their arguments. Screenshot from a CCG demo: http://L2R.cs.uiuc.edu/~cogcomp
Semantic Parsing: Summary I (cont.) • So far, only declarative (deterministic) constraints have been shown. • In fact, this approach can be used with both statistical and declarative constraints.
ILP as a Unified Algorithmic Scheme • Consider a common model for sequential inference: HMM/CRF. Inference in this model is done via the Viterbi algorithm. • Viterbi is a special case of linear-programming-based inference: it is a shortest-path problem, which is an LP with a canonical constraint matrix that is totally unimodular; therefore, you get integrality for free. • One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix. • The extension reduces to a polynomial scheme under some conditions (e.g., when constraints are sequential, when the solution space does not change, etc.). • Does not necessarily increase complexity, and is very efficient in practice. [Roth & Yih, ICML’05] Learn a rather simple model; make decisions with a more expressive model.
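For reference, the special case the slide starts from, Viterbi over a chain of emission and transition scores, is a short dynamic program. A minimal sketch with made-up scores (two states A/B over three positions):

```python
def viterbi(obs_scores, trans_scores):
    """Highest-scoring label sequence in a chain model (additive log scores).
    obs_scores[i][s]: emission score of state s at position i;
    trans_scores[s][t]: transition score from s to t."""
    states = list(obs_scores[0])
    # delta[s] = best score of any path ending in state s
    delta = {s: obs_scores[0][s] for s in states}
    back = []
    for i in range(1, len(obs_scores)):
        new_delta, ptr = {}, {}
        for t in states:
            s_best = max(states, key=lambda s: delta[s] + trans_scores[s][t])
            new_delta[t] = delta[s_best] + trans_scores[s_best][t] + obs_scores[i][t]
            ptr[t] = s_best
        delta, back = new_delta, back + [ptr]
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

obs = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 0.8}, {"A": 0.2, "B": 0.9}]
trans = {"A": {"A": 0.1, "B": 0.5}, "B": {"A": 0.0, "B": 0.4}}
print(viterbi(obs, trans))  # ['A', 'B', 'B']
```

The slide's point is that this same problem, written as a shortest path over the trellis, is an LP whose constraint matrix is totally unimodular; adding non-sequential constraints modifies that matrix, which is exactly where the ILP formulation takes over.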
Integer Linear Programming Inference - Summary • An Inference method for the “best explanation”, used here to induce a semantic representation of a sentence. • A general Information Integration framework. • Allows expressive constraints • Any Boolean rule can be represented by a set of linear (in)equalities • Combining acquired (statistical) constraints with declarative constraints • Start with shortest path matrix and constraints • Add new constraints to the basic integer linear program. • Solved using off-the-shelf packages • If the additional constraints don’t change the solution, LP is enough • Otherwise, the computational time depends on sparsity; fast in practice • Demo available http://L2R.cs.uiuc.edu/~cogcomp
Example 2: Pipeline • The vocabulary is generated in phases: Raw Data → POS Tagging → Phrases → Semantic Entities → Relations (with Parsing, WSD, and Semantic Role Labeling along the way). • Left-to-right processing of sentences is also a pipeline process. • Pipelining is a crude approximation; interactions occur across levels, and downstream decisions often interact with previous decisions. • Leads to propagation of errors. • Occasionally, later-stage problems are easier, but upstream mistakes will not be corrected. • There are good reasons for pipelining decisions. • Global inference over the outcomes of different levels can be used to break away from this paradigm [between pipeline & fully global]. • Allows a flexible way to incorporate linguistic and structural constraints. The objective function can be modified to support pipelines; Deep Pipelines for Dependency Parsing [Chang, Do, Roth, ACL’06].
This Talk • Global Inference over Local Models/Classifiers + Expressive Constraints • Model • Generality of the framework • Training Paradigms • Global vs. Local training • Semi-Supervised Learning • Examples • Semantic Parsing • Information Extraction • Pipeline processes
Training Paradigms that Support Global Inference • Incorporating general constraints (Algorithmic Approach) • Allow both statistical and expressive declarative constraints • Allow non-sequential constraints (generally difficult) • Coupling vs. Decoupling Training and Inference. • Incorporating global constraints is important but • Should it be done only at evaluation time or also at training time? • Issues related to: • modularity, efficiency and performance, availability of training data May not be relevant in some problems.
Training in the Presence of Constraints • General training paradigm: • First term: learning from data • Second term: guiding the model with constraints • One can choose whether constraints are included in training or only at evaluation time.
L+I vs. IBT (Cartoon: local models f1(x), ..., f5(x) map inputs X to output variables Y; each model can be more complex and may have a view on a set of output variables.) • L+I: Learning plus Inference. Training without constraints; testing: inference with constraints. • IBT: Inference-Based Training. Learning the components together, with inference in the training loop. Which one is better? When and why?
Perceptron-Based Global Learning (local models f1(x), ..., f5(x) over input X) • True global labeling Y: -1 1 -1 -1 1 • Local predictions Y': -1 1 -1 1 1 • After applying constraints, Y': -1 1 1 1 1
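The update scheme behind IBT can be sketched as a structured perceptron on a toy problem: run constrained inference, then update only the local weight vectors whose prediction disagrees with the truth. Everything below (features, constraint, data, learning rate) is hypothetical, and brute-force enumeration stands in for the inference step.

```python
from itertools import product

def predict(w, x, constraint):
    """Inference with constraints: pick the feasible +/-1 vector
    maximizing sum_i y_i * (w_i . x)."""
    n = len(w)
    feasible = (y for y in product([-1, 1], repeat=n) if constraint(y))
    return max(feasible, key=lambda y: sum(
        yi * sum(wi_j * x_j for wi_j, x_j in zip(wi, x))
        for yi, wi in zip(y, w)))

def ibt_update(w, x, y_true, constraint, rate=0.1):
    """Perceptron update driven by the constrained global prediction:
    only components that disagree with the truth are updated."""
    y_hat = predict(w, x, constraint)
    for i, (yt, yh) in enumerate(zip(y_true, y_hat)):
        if yt != yh:
            w[i] = [wj + rate * yt * xj for wj, xj in zip(w[i], x)]
    return w

at_most_two_positive = lambda y: sum(1 for v in y if v == 1) <= 2
w = [[0.0, 0.0] for _ in range(3)]   # one weight vector per output variable
x = [1.0, -1.0]                      # shared input features
y_true = (-1, 1, -1)
for _ in range(10):
    w = ibt_update(w, x, y_true, at_most_two_positive)
print(predict(w, x, at_most_two_positive))  # (-1, 1, -1)
```

The L+I alternative would train each w_i independently against its own label and apply `predict` with the constraint only at test time; this sketch differs exactly in that the constraint sits inside the training loop.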
Claims • When the local classification problems are “easy”, L+I outperforms IBT. • In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER). • Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples. • When data is scarce and the problems are not easy, constraints can be used, along with a “weak” model, to label unlabeled data and improve the model. • Will show experimental results and theoretical intuition to support these claims. L+I: cheaper computationally; modular. IBT is better in the limit and in other extreme cases. Combinations (L+I, and then IBT) are possible.
Bounds (Simulated Data)
• Bound prediction (ε_opt is an indication of the hardness of the problem; d is the dimension, m the sample size, δ the confidence):
  • Local:  error ≤ ε_opt + ((d log m + log 1/δ) / m)^(1/2)
  • Global: error ≤ 0 + ((c·d log m + c²·d + log 1/δ) / m)^(1/2)
• Simulated data at ε_opt = 0, 0.1, 0.2. L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
Relative Merits: SRL • L+I is better. When the problem is artificially made harder, the tradeoff is clearer. (Axis: difficulty of the learning problem (# features), from easy to hard.)
Semi-Supervised Learning with Constraints • Experiments are done in the context of information extraction. • Baseline objective function: learning without constraints, with 300 labeled examples. • Constraints are used to bootstrap a semi-supervised learner: a poor model plus constraints is used to annotate unlabeled data, which in turn is used to keep training the model. (Plot: performance with vs. without constraints as a function of the number of available labeled examples.)
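The bootstrapping loop on the slide can be sketched in a few lines: a weak model labels unlabeled sentences, constraints filter out incoherent labelings, and the surviving labelings are fed back as training data. Everything below (the frequency-count "model", the data, and the one-NAME-per-sentence constraint) is a toy stand-in for the actual information-extraction setup.

```python
# Sketch of constraint-driven self-training: a weak model labels
# unlabeled data, a domain constraint filters the labelings, and only
# constraint-satisfying labelings are added back as training data.

def weak_model(token, counts):
    # Naive scorer: the label seen most often with this token; default "O".
    votes = counts.get(token, {})
    return max(votes, key=votes.get) if votes else "O"

def satisfies_constraints(labels):
    # Hypothetical domain constraint: at most one "NAME" per sentence.
    return labels.count("NAME") <= 1

def bootstrap(labeled, unlabeled, rounds=3):
    counts = {}
    def learn(sentence, labels):
        for tok, lab in zip(sentence, labels):
            counts.setdefault(tok, {}).setdefault(lab, 0)
            counts[tok][lab] += 1
    for sent, labs in labeled:
        learn(sent, labs)
    for _ in range(rounds):
        for sent in unlabeled:
            guess = [weak_model(t, counts) for t in sent]
            if satisfies_constraints(guess):   # keep only coherent labelings
                learn(sent, guess)
    return counts

labeled = [(["john", "runs"], ["NAME", "O"])]
unlabeled = [["john", "sleeps"], ["john", "john"]]
counts = bootstrap(labeled, unlabeled)
print(weak_model("sleeps", counts))  # 'O'
```

The second unlabeled sentence is labeled NAME NAME by the weak model, violates the constraint, and is never added, which is the filtering role the constraints play in the semi-supervised setting.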
Example 3: Textual Entailment
By “semantically entailed” we mean: most people would agree that one sentence implies the other.
• Given Q: Who acquired Overture?
• Determine: A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
• The text entails “Yahoo acquired Overture”; candidate hypotheses such as “Overture is a search company”, “Google is a search company”, “Google owns Overture”, ... are checked for being subsumed by the text.
• Components: phrasal verb paraphrasing [Connor & Roth ’06]; entity matching [Li et al., AAAI’04, NAACL’04]; Semantic Role Labeling.
Conclusions • Discussed a general paradigm for learning and inference in the context of natural language understanding tasks. • A general constrained-optimization approach for the integration of learned models with additional (declarative or statistical) expressivity. • How to train? Learn locally and make use globally (via global inference). [Punyakanok et al., IJCAI’05] • Ability to make use of domain knowledge & constraints to drive supervision. [Klementiev & Roth, ACL’06; Chang, Do, Roth, ACL’07] • How to drive component identification from the global decisions? LBJ (Learning Based Java): a modeling language that supports programming along with building learned models, and allows the incorporation of, and inference with, constraints.