Two Aspects of the Problem of Natural Language Inference
Bill MacCartney
NLP Group, Stanford University
8 October 2008
Two talks for the price of one!
Both concern the problem of natural language inference (NLI)
• Modeling semantic containment and exclusion in NLI
  • Presented at Coling-08; won the best paper award
  • A computational model of natural logic for NLI
  • Doesn't solve all NLI problems, but handles an interesting subset
  • Depends on alignments from other sources
• A phrase-based model of alignment for NLI
  • To be presented at EMNLP-08
  • Addresses the problem of alignment for NLI and relates it to MT
  • Made possible by annotated data produced here at MSR
Modeling Semantic Containment and Exclusion in Natural Language Inference
Bill MacCartney and Christopher D. Manning
NLP Group, Stanford University
8 October 2008
Natural language inference (NLI)
• Aka recognizing textual entailment (RTE)
• Does premise P justify an inference to hypothesis H?
  • An informal, intuitive notion of inference: not strict logic
  • Emphasis on variability of linguistic expression
P: Every firm polled saw costs grow more than expected, even after adjusting for inflation.
H: Every big company in the poll reported cost increases.  → yes
• Necessary to the goal of natural language understanding (NLU)
• Can also enable semantic search, question answering, …
NLI: a spectrum of approaches
• Robust but shallow: lexical/semantic overlap [Jijkoun & de Rijke 2005]
  • Problem: imprecise; easily confounded by negation, quantifiers, conditionals, factive & implicative verbs, etc.
• In between: patterned relation extraction [Romano et al. 2006], semantic graph matching [Hickl et al. 2006, MacCartney et al. 2006]
• Deep but brittle: FOL & theorem proving [Bos & Markert 2006]
  • Problem: hard to translate NL to FOL: idioms, anaphora, ellipsis, intensionality, tense, aspect, vagueness, modals, indexicals, reciprocals, propositional attitudes, scope ambiguities, anaphoric adjectives, non-intersective adjectives, temporal & causal relations, unselective quantifiers, adverbs of quantification, donkey sentences, generic determiners, comparatives, phrasal verbs, …
• Solution? Natural logic (this work)
Outline
• Introduction
• A Theory of Natural Logic
• The NatLog System
• Experiments with FraCaS
• Experiments with RTE
• Conclusion
What is natural logic? (≠ natural deduction)
• Characterizes valid patterns of inference via surface forms
  • Precise, yet sidesteps the difficulties of translating to FOL
• A long history
  • Traditional logic: Aristotle's syllogisms, scholastics, Leibniz, …
  • Modern natural logic begins with Lakoff (1970)
  • van Benthem & Sánchez Valencia (1986–91): monotonicity calculus
  • Nairn et al. (2006): an account of implicatives & factives
• We introduce a new theory of natural logic
  • Extends the monotonicity calculus to account for negation & exclusion
  • Incorporates elements of Nairn et al.'s model of implicatives
7 basic entailment relations
Relations are defined for all semantic types:
tiny ⊏ small, hover ⊏ fly, kick ⊏ strike, this morning ⊏ today, in Beijing ⊏ in China, everyone ⊏ someone, all ⊏ most ⊏ some
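The slide's table of the seven relations is not reproduced here. As a rough guide, the seven symbols used throughout these slides can be sketched as follows (a minimal sketch; the relation names follow the Coling-08 paper, and the dict layout is an illustrative assumption):

```python
# Sketch of the 7 basic entailment relations, using the symbols that appear
# in the projectivity signatures later in these slides.
RELATIONS = {
    "=": "equivalence        (couch = sofa)",
    "⊏": "forward entailment (crow ⊏ bird)",
    "⊐": "reverse entailment (bird ⊐ crow)",
    "^": "negation           (human ^ nonhuman)",
    "|": "alternation        (cat | dog)",
    "_": "cover              (animal _ nonhuman)",
    "#": "independence       (hungry # hippo)",
}
```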
Entailment & semantic composition
• Ordinarily, semantic composition preserves entailment relations:
  eat pork ⊏ eat meat, big bird | big fish
• But many semantic functions behave differently:
  tango ⊏ dance, yet refuse to tango ⊐ refuse to dance
  French | German, yet not French _ not German
• We categorize functions by how they project entailment
  • A generalization of monotonicity classes and implication signatures
  • E.g., not has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:^, |:_, _:|, #:#}
  • E.g., refuse has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#}
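To make the idea of a projectivity signature concrete, here is a minimal sketch (the lookup-table representation and function name are my own illustration, not NatLog's actual code; the two signatures are taken from the slide):

```python
# Projectivity signatures as lookup tables: how a function maps the relation
# between its arguments to the relation between the resulting compounds.
PROJECTIVITY = {
    "not":    {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "^", "|": "_", "_": "|", "#": "#"},
    "refuse": {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "|", "|": "#", "_": "#", "#": "#"},
}

def project(function_word, relation):
    """Project the entailment relation between arguments through the given function word."""
    return PROJECTIVITY.get(function_word, {}).get(relation, "#")

# tango ⊏ dance, so refuse to tango ⊐ refuse to dance:
assert project("refuse", "⊏") == "⊐"
```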
Projecting entailment relations upward
[Figure: composition trees for "nobody can enter without a shirt" and "nobody can enter without clothes"; a shirt ⊏ clothes is projected upward through without, can, and nobody, yielding ⊏ at the root.]
• If two compound expressions differ by a single atom, their entailment relation can be determined compositionally
• Assume idealized semantic composition trees
• Propagate the entailment relation between atoms upward, according to the projectivity class of each node on the path to the root
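A minimal sketch of this upward propagation (the list-of-tables representation of the path is an assumption for illustration; it reuses the table format sketched above):

```python
def project_up(projectivities, leaf_relation):
    """Propagate a lexical entailment relation from the edited atom to the root.

    `projectivities` lists the projectivity table of each node on the path,
    innermost first (e.g. the tables for without, can, nobody in the slide's
    example).  Unknown cases default to independence ("#"), a simplification."""
    relation = leaf_relation
    for table in projectivities:
        relation = table.get(relation, "#")
    return relation
```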
A (weak) inference procedure
• Find a sequence of edits connecting P and H
  • Insertions, deletions, substitutions, …
• Determine the lexical entailment relation for each edit
  • Substitutions: depends on the meaning of the substituends: cat | dog
  • Deletions: ⊏ by default: red socks ⊏ socks
  • But some deletions are special: not ill ^ ill, refuse to go | go
  • Insertions are symmetric to deletions: ⊐ by default
• Project up to find the entailment relation across each edit
• Join entailment relations across the sequence of edits
  • à la Tarski's relation algebra
The NatLog system
NLI problem → 1. linguistic analysis → 2. alignment → 3. lexical entailment classification → 4. entailment projection → 5. entailment joining → prediction
Running example
P: Jimmy Dean refused to move without blue jeans.
H: James Dean didn't dance without pants.  → yes
OK, the example is contrived, but it compactly exhibits containment, exclusion, and implicativity.
Step 1: Linguistic analysis
• Tokenize & parse input sentences (future: NER, coref, …)
• Identify items with special projectivity & determine their scope
  • Example lexicon entry for the –/o implicatives (refuse, forbid, prohibit, …): scope = S complement; pattern = __ > (/VB.*/ > VP $. S=arg); projectivity = {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#}
• Problem: a PTB-style parse tree is not a semantic structure!
  [Figure: PTB parse of "Jimmy Dean refused to move without blue jeans", with +/– monotonicity marks on the nodes.]
• Solution: specify scope in PTB trees using Tregex [Levy & Andrew 06]
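For concreteness, one such lexicon entry might be represented roughly like this (the dict layout and field names are illustrative assumptions; only the values come from the slide):

```python
# Hypothetical representation of one projectivity-lexicon entry.
REFUSE_CLASS = {
    "category": "-/o implicative",
    "examples": ["refuse", "forbid", "prohibit"],
    "scope": "S complement",
    "tregex": "__ > (/VB.*/ > VP $. S=arg)",   # Tregex pattern marking the scope
    "projectivity": {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "|", "|": "#", "_": "#", "#": "#"},
}
```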
Step 2: Alignment
• Alignment as a sequence of atomic phrase edits
  • The ordering of edits defines a path through intermediate forms
  • Need not correspond to sentence order
  • Decomposes the problem into atomic inference problems
• We haven't (yet) invested much effort here
  • Experimental results use alignments from other sources
Step 3: Lexical entailment classification
• Goal: predict the entailment relation for each edit, based solely on lexical features, independent of context
• Approach: use lexical resources & machine learning
• Feature representation:
  • WordNet features: synonymy (=), hyponymy (⊏/⊐), antonymy (|)
  • Other relatedness features: Jiang-Conrath (WordNet-based), NomBank
  • Fallback: string similarity (based on Levenshtein edit distance)
  • Also lexical category, quantifier category, implication signature
• Decision tree classifier
  • Trained on 2,449 hand-annotated lexical entailment problems
  • E.g., SUB(gun, weapon): ⊏, SUB(big, small): |, DEL(often): ⊏
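As an illustration, the feature representation for a single edit might look roughly like this (the feature names and values are hypothetical; only the kinds of features come from the slide):

```python
# Hypothetical feature vector for the edit SUB(gun, weapon).
edit_features = {
    "edit_type": "SUB",
    "wn_synonym": 0.0,        # WordNet synonymy
    "wn_hyponym": 1.0,        # gun is a hyponym of weapon, suggesting ⊏
    "wn_antonym": 0.0,
    "jiang_conrath": 0.8,     # WordNet-based relatedness (value made up)
    "string_sim": 0.1,        # Levenshtein-based fallback
    "lexical_category": "noun",
}
# A trained decision tree maps such a vector to one of the 7 relations, here ⊏.
```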
Step 4: Entailment projection
[Figure: projection table, showing which contexts preserve each relation and which invert it.]
Step 5: Entailment joining
[Figure: join table for entailment relations; the final joined relation gives the answer.]
For example: from fish | human and human ^ nonhuman, we obtain fish ⊏ nonhuman.
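A minimal sketch of joining, covering only the slide's example (the full join table from the paper is omitted, and defaulting unknown joins to independence is a simplification of this sketch):

```python
JOIN = {("|", "^"): "⊏"}   # alternation joined with negation yields forward entailment

def join(r1, r2):
    """Join two entailment relations across consecutive edits."""
    return JOIN.get((r1, r2), "#")

# fish | human, human ^ nonhuman  =>  fish ⊏ nonhuman
assert join("|", "^") == "⊏"
```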
The FraCaS test suite
• FraCaS: a project in computational semantics [Cooper et al. 96]
• 346 "textbook" examples of NLI problems
• 3 possible answers: yes, no, unknown (not balanced!)
• 55% single-premise, 45% multi-premise (excluded)
Results on FraCaS
[Table of results not reproduced here.]
• 27% error reduction
• In the largest category, all but one problem correct
• High accuracy in the sections most amenable to natural logic
• High precision even outside areas of expertise
The RTE3 test suite
• Somewhat more "natural", but not ideal for NatLog
• Many kinds of inference not addressed by NatLog: paraphrase, temporal reasoning, relation extraction, …
• A big edit distance leads to propagation of errors from the atomic model
Results on RTE3: NatLog (each data set contains 800 problems)
• Accuracy is unimpressive, but precision is relatively high
• Strategy: hybridize with the Stanford RTE system
  • As in Bos & Markert 2006
  • But NatLog makes a positive prediction far more often (~25% vs. 4%)
Results on RTE3: hybrid system (each data set contains 800 problems)
• 4% gain (significant, p < 0.05)
Conclusion: what natural logic can't do
• Not a universal solution for NLI
• Many types of inference are not amenable to natural logic
  • Paraphrase: Eve was let go = Eve lost her job
  • Verb/frame alternation: he drained the oil ⊏ the oil drained
  • Relation extraction: Aho, a trader at UBS… ⊏ Aho works for UBS
  • Common-sense reasoning: the sink overflowed ⊏ the floor got wet
  • etc.
• Also, it has a weaker proof theory than FOL
  • Can't explain, e.g., de Morgan's laws for quantifiers:
    Not all birds fly = Some birds don't fly
Conclusion: what natural logic can do
Natural logic enables precise reasoning about containment, exclusion, and implicativity, while sidestepping the difficulties of translating to FOL. The NatLog system successfully handles a broad range of such inferences, as demonstrated on the FraCaS test suite. Ultimately, open-domain NLI is likely to require combining disparate reasoners, and a facility for natural logic is a good candidate to be a component of such a system.
A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
8 October 2008
Natural language inference (NLI) (aka RTE)
• Does premise P justify an inference to hypothesis H?
  • An informal notion of inference; variability of linguistic expression
P: In 1963, JFK was assassinated during a visit to Dallas.
H: Kennedy was killed in 1963.  → yes
• Like MT, NLI depends on a facility for alignment
  • I.e., linking corresponding words/phrases in two related sentences
• Alignment is addressed variously by current NLI systems
  • Implicit alignment: NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • Implicit alignment: NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
  • Explicit alignment followed by entailment classification [Marsi & Kramer 05, MacCartney et al. 06]
Contributions of this paper
In this paper, we:
• Undertake the first systematic study of alignment for NLI
  • Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
• Propose a new model of alignment for NLI: MANLI
  • Uses a phrase-based alignment representation
  • Exploits external lexical resources
  • Capitalizes on new supervised training data
• Examine the relation between alignment in NLI and MT
  • Can existing MT aligners be applied in the NLI setting?
NLI alignment vs. MT alignment
• Alignment is familiar in MT, with an extensive literature
• Can the tools & techniques of MT alignment transfer to NLI?
• Doubtful; NLI alignment differs in several respects:
  • Monolingual: can exploit resources like WordNet
  • Asymmetric: P is often longer & has content unrelated to H
  • Cannot assume semantic equivalence: an NLI aligner must accommodate frequent unaligned content
  • Little training data available: MT aligners use unsupervised training on massive amounts of bitext; NLI aligners must rely on supervised training & much less data
The MSR RTE2 alignment data
• Previously, little supervised data was available
• Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
• Token-based, but many-to-many
  • Allows implicit alignment of phrases
• 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • Merged using majority rule
The MANLI aligner
A new model of alignment for natural language inference:
• Phrase-based representation
• Feature-based scoring function
• Decoding using simulated annealing
• Perceptron learning
Phrase-based alignment representation
Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS
  DEL(In1) DEL(most2) DEL(Pacific3) DEL(countries4) DEL(there5)
  EQ(are6, are2) SUB(very7 few8, poorly3 represented4)
  EQ(women9, Women1) EQ(in10, in5) EQ(parliament11, parliament6) EQ(.12, .7)
• One-to-one at the phrase level (but many-to-many at the token level)
• Avoids arbitrary alignment choices; can use phrase-based resources
• For training (only!), converted the MSR data to this form
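A minimal sketch of this representation (the class layout is an illustrative assumption; the edit types and the example come from the slide):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PhraseEdit:
    """One phrase edit in a MANLI-style alignment (sketch, not MANLI's actual code)."""
    kind: str                 # "EQ", "SUB", "DEL", or "INS"
    p_tokens: List[str]       # premise-side phrase (empty for INS)
    h_tokens: List[str]       # hypothesis-side phrase (empty for DEL)

# The slide's example, abbreviated:
alignment = [
    PhraseEdit("DEL", ["In"], []),
    PhraseEdit("EQ",  ["are"], ["are"]),
    PhraseEdit("SUB", ["very", "few"], ["poorly", "represented"]),
    PhraseEdit("EQ",  ["women"], ["Women"]),
]
```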
A feature-based scoring function
• Score each edit as a linear combination of features, then sum over edits
• Edit type features: EQ, SUB, DEL, INS
• Phrase features: phrase sizes, non-constituents
• Lexical similarity feature: max over similarity scores
  • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  • Distributional similarity à la Dekang Lin
  • Various measures of string/lemma similarity
• Contextual features: distortion, matching neighbors
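In other words, roughly (a sketch reusing the PhraseEdit structure above; the toy feature extractor is a stand-in, not MANLI's feature set):

```python
def edit_features(edit):
    """Toy feature extractor; the real features include edit type, phrase size,
    lexical similarity, and contextual features as listed on the slide."""
    return {
        "type=" + edit.kind: 1.0,
        "phrase_size": float(max(len(edit.p_tokens), len(edit.h_tokens))),
    }

def score_alignment(alignment, weights):
    """Score = sum over edits of the dot product of weights and edit features."""
    return sum(weights.get(name, 0.0) * value
               for edit in alignment
               for name, value in edit_features(edit).items())
```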
Decoding using simulated annealing
• Start with an initial alignment
• Repeat 100 times:
  • Generate successors
  • Score them
  • Smooth/sharpen the distribution: P(A) ← P(A)^(1/T)
  • Sample a successor
  • Lower the temperature: T ← 0.9 · T
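A generic sketch of this loop (the successor generator and scoring function are passed in as parameters, since the slide does not specify them; the softmax-style sharpening is one way to realize P(A)^(1/T)):

```python
import math
import random

def decode(initial, successors, score, iterations=100, temp=1.0, cooling=0.9):
    """Simulated-annealing decoding sketch: sample successors with probabilities
    sharpened/smoothed by 1/T, lowering T each round per the slide's schedule."""
    current = initial
    for _ in range(iterations):
        candidates = successors(current)
        if not candidates:
            break
        weights = [math.exp(score(c) / temp) for c in candidates]
        current = random.choices(candidates, weights=weights, k=1)[0]
        temp *= cooling
    return current
```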
Perceptron learning of feature weights
We use a variant of averaged perceptron [Collins 2002]

  Initialize weight vector w = 0, learning rate R0 = 1
  For training epoch i = 1 to 50:
    For each problem (Pj, Hj) with gold alignment Ej:
      Set Êj = ALIGN(Pj, Hj, w)
      Set w = w + Ri (Φ(Ej) – Φ(Êj))
    Set w = w / ‖w‖2 (L2 normalization)
    Set w[i] = w (store weight vector for this epoch)
    Set Ri = 0.8 Ri–1 (reduce learning rate)
  Throw away weight vectors from the first 20% of epochs
  Return the average weight vector

Training runs require about 20 hours (for 800 problems)
Evaluation on MSR data
• We evaluate several systems on the MSR data
  • Baseline, GIZA++ & Cross-EM, Stanford RTE, MANLI
• How well do they recover the gold-standard alignments?
• We report per-link precision, recall, and F1
  • Note that AER = 1 – F1
  • For MANLI, two tokens are considered aligned iff they are contained within phrases which are aligned
• We also report the exact match rate
  • What proportion of guessed alignments match the gold exactly?
Baseline: bag-of-words aligner
• Match each H token to the most similar P token [cf. Glickman et al. 2005]
• Surprisingly good recall, despite extreme simplicity
• But very mediocre precision, F1, & exact match rate
• Main problem: it aligns every token in H
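A rough sketch of such a baseline (the similarity function here is a trivial stand-in, not Glickman et al.'s actual scoring):

```python
def bag_of_words_align(p_tokens, h_tokens, sim=None):
    """Baseline sketch: link every hypothesis token to its most similar
    premise token (hence high recall but poor precision)."""
    if sim is None:
        # Trivial stand-in similarity: exact lowercase match or nothing.
        sim = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
    return {j: max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h))
            for j, h in enumerate(h_tokens)}

# Every H token gets linked to some P token, even when unrelated.
links = bag_of_words_align("JFK was assassinated in Dallas".split(),
                           "Kennedy was killed in 1963".split())
```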
MT aligners: GIZA++ & Cross-EM
• Why not use an off-the-shelf MT aligner for NLI?
• Run GIZA++ via Moses, with default parameters
  • Asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic
• Initial results are very poor: 56% F1
  • It doesn't even align equal words
  • Remedy: add a lexicon of equal words as extra training data
• Similar experiments with the Berkeley Cross-EM aligner
Results: MT aligners
• Similar F1, but GIZA++ wins on precision, Cross-EM on recall
• Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
• Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources
The Stanford RTE aligner
• Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments are not directly representable
  • But named entities & collocations are collapsed in pre-processing
• Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string similarity, …
• Syntax-based features to promote aligning corresponding predicate-argument structures
• Decoding & learning similar to MANLI
Results: Stanford RTE aligner
• Better F1 than the MT aligners*, but recall lags precision
• Stanford does a poor job aligning function words
  • 13% of links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI only 10%)
• Also, Stanford fails to align multi-word phrases:
  peace activists ~ protestors, hackers ~ non-authorized personnel
* includes a (generous) correction for missed punctuation
Results: MANLI aligner
• MANLI outperforms all the others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
• Good balance of precision & recall
• Matched >20% of alignments exactly
MANLI results: discussion
• Three factors contribute to its success:
  • Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  • Contextual features enable matching function words
  • Phrases: death penalty ~ capital punishment, abdicate ~ give up
• But phrases help less than expected!
  • If we set max phrase size = 1, we lose just 0.2% in F1
• Recall errors: room to improve
  • 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
• Precision errors are harder to reduce
  • function words (49%), be (21%), punctuation (7%), equal lemmas (18%)
Can aligners predict RTE answers?
• We've been evaluating against gold-standard alignments
• But alignment is just one component of an NLI system
• Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, …
  • But the alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
• Using the alignment score to predict RTE answers:
  • Predict YES if score > threshold
  • Tune the threshold on development data
  • Evaluate on test data
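A minimal sketch of that thresholding procedure (the data format and function names are assumptions for illustration):

```python
def tune_threshold(dev_scores, dev_labels):
    """Pick the score threshold that maximizes accuracy on the dev set.
    dev_scores: alignment scores; dev_labels: booleans (YES = True)."""
    candidates = sorted(set(dev_scores))
    def accuracy(t):
        return sum((s > t) == y for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
    return max(candidates, key=accuracy)

def predict(score, threshold):
    return "YES" if score > threshold else "NO"
```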
Results: predicting RTE answers
• No NLI aligner rivals the top LCC system
• But Stanford & MANLI beat the average entry for RTE2
• Many NLI systems could benefit from better alignments!
Related work
• Lots of past work on phrase-based MT
  • But most systems extract phrases from word-aligned data, despite the assumption that many translations are non-compositional
  • Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
• However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren't appropriate
  • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  • MT systems don't model word insertions & deletions
Conclusion
• MT aligners are not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
• MANLI succeeds by:
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
• The phrase-based representation shows potential
  • But not yet proven: need better phrase-based lexical resources

Thanks! Questions? :-)