Two Aspects of the Problem of Natural Language Inference
Bill MacCartney
NLP Group, Stanford University
8 October 2008
Two talks for the price of one!
Both concern the problem of natural language inference (NLI)
• Modeling semantic containment and exclusion in NLI
  • Presented at Coling-08; won the best paper award
  • A computational model of natural logic for NLI
  • Doesn't solve all NLI problems, but handles an interesting subset
  • Depends on alignments from other sources
• A phrase-based model of alignment for NLI
  • To be presented at EMNLP-08
  • Addresses the problem of alignment for NLI and relates it to MT
  • Made possible by annotated data produced here at MSR
Modeling Semantic Containment and Exclusion in Natural Language Inference
Bill MacCartney and Christopher D. Manning
NLP Group, Stanford University
8 October 2008
Natural language inference (NLI)
• Aka recognizing textual entailment (RTE)
• Does premise P justify an inference to hypothesis H?
  • An informal, intuitive notion of inference: not strict logic
  • Emphasis on variability of linguistic expression
P: Every firm polled saw costs grow more than expected, even after adjusting for inflation.
H: Every big company in the poll reported cost increases.  → yes
• Necessary to the goal of natural language understanding (NLU)
• Can also enable semantic search, question answering, …
NLI: a spectrum of approaches
• Robust but shallow: lexical/semantic overlap [Jijkoun & de Rijke 2005]
  • Problem: imprecise; easily confounded by negation, quantifiers, conditionals, factive & implicative verbs, etc.
• In between: patterned relation extraction [Romano et al. 2006], semantic graph matching [Hickl et al. 2006, MacCartney et al. 2006]
• Deep but brittle: FOL & theorem proving [Bos & Markert 2006]
  • Problem: hard to translate NL to FOL: idioms, anaphora, ellipsis, intensionality, tense, aspect, vagueness, modals, indexicals, reciprocals, propositional attitudes, scope ambiguities, anaphoric adjectives, non-intersective adjectives, temporal & causal relations, unselective quantifiers, adverbs of quantification, donkey sentences, generic determiners, comparatives, phrasal verbs, …
• Solution? Natural logic (this work)
Outline
• Introduction
• A Theory of Natural Logic
• The NatLog System
• Experiments with FraCaS
• Experiments with RTE
• Conclusion
What is natural logic? (≠ natural deduction)
• Characterizes valid patterns of inference via surface forms
  • Precise, yet sidesteps the difficulties of translating to FOL
• A long history
  • Traditional logic: Aristotle's syllogisms, scholastics, Leibniz, …
  • Modern natural logic begins with Lakoff (1970)
  • van Benthem & Sánchez Valencia (1986–91): monotonicity calculus
  • Nairn et al. (2006): an account of implicatives & factives
• We introduce a new theory of natural logic
  • Extends the monotonicity calculus to account for negation & exclusion
  • Incorporates elements of Nairn et al.'s model of implicatives
7 basic entailment relations
Relations are defined for all semantic types:
tiny ⊏ small, hover ⊏ fly, kick ⊏ strike, this morning ⊏ today, in Beijing ⊏ in China, everyone ⊏ someone, all ⊏ most ⊏ some
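The slide's table of the seven relations is not reproduced here. As a rough guide, the seven symbols used throughout these slides can be sketched as follows (a minimal sketch; the relation names follow the Coling-08 paper, and the dict layout is an illustrative assumption):

```python
# Sketch of the 7 basic entailment relations, using the symbols that appear
# in the projectivity signatures later in these slides.
RELATIONS = {
    "=": "equivalence        (couch = sofa)",
    "⊏": "forward entailment (crow ⊏ bird)",
    "⊐": "reverse entailment (bird ⊐ crow)",
    "^": "negation           (human ^ nonhuman)",
    "|": "alternation        (cat | dog)",
    "_": "cover              (animal _ nonhuman)",
    "#": "independence       (hungry # hippo)",
}
```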
Entailment & semantic composition
• Ordinarily, semantic composition preserves entailment relations:
  eat pork ⊏ eat meat, big bird | big fish
• But many semantic functions behave differently:
  tango ⊏ dance, yet refuse to tango ⊐ refuse to dance
  French | German, yet not French _ not German
• We categorize functions by how they project entailment
  • A generalization of monotonicity classes and implication signatures
  • E.g., not has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:^, |:_, _:|, #:#}
  • E.g., refuse has projectivity {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#}
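To make the idea of a projectivity signature concrete, here is a minimal sketch (the lookup-table representation and function name are my own illustration, not NatLog's actual code; the two signatures are taken from the slide):

```python
# Projectivity signatures as lookup tables: how a function maps the relation
# between its arguments to the relation between the resulting compounds.
PROJECTIVITY = {
    "not":    {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "^", "|": "_", "_": "|", "#": "#"},
    "refuse": {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "|", "|": "#", "_": "#", "#": "#"},
}

def project(function_word, relation):
    """Project the entailment relation between arguments through the given function word."""
    return PROJECTIVITY.get(function_word, {}).get(relation, "#")

# tango ⊏ dance, so refuse to tango ⊐ refuse to dance:
assert project("refuse", "⊏") == "⊐"
```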
Projecting entailment relations upward
[Figure: composition trees for "nobody can enter without a shirt" and "nobody can enter without clothes"; a shirt ⊏ clothes is projected upward through without, can, and nobody, yielding ⊏ at the root.]
• If two compound expressions differ by a single atom, their entailment relation can be determined compositionally
• Assume idealized semantic composition trees
• Propagate the entailment relation between atoms upward, according to the projectivity class of each node on the path to the root
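A minimal sketch of this upward propagation (the list-of-tables representation of the path is an assumption for illustration; it reuses the table format sketched above):

```python
def project_up(projectivities, leaf_relation):
    """Propagate a lexical entailment relation from the edited atom to the root.

    `projectivities` lists the projectivity table of each node on the path,
    innermost first (e.g. the tables for without, can, nobody in the slide's
    example).  Unknown cases default to independence ("#"), a simplification."""
    relation = leaf_relation
    for table in projectivities:
        relation = table.get(relation, "#")
    return relation
```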
A (weak) inference procedure
• Find a sequence of edits connecting P and H
  • Insertions, deletions, substitutions, …
• Determine the lexical entailment relation for each edit
  • Substitutions: depends on the meaning of the substituends: cat | dog
  • Deletions: ⊏ by default: red socks ⊏ socks
  • But some deletions are special: not ill ^ ill, refuse to go | go
  • Insertions are symmetric to deletions: ⊐ by default
• Project up to find the entailment relation across each edit
• Join entailment relations across the sequence of edits
  • à la Tarski's relation algebra
The NatLog system
NLI problem → 1. linguistic analysis → 2. alignment → 3. lexical entailment classification → 4. entailment projection → 5. entailment joining → prediction
Running example
P: Jimmy Dean refused to move without blue jeans.
H: James Dean didn't dance without pants.  → yes
OK, the example is contrived, but it compactly exhibits containment, exclusion, and implicativity.
Step 1: Linguistic analysis
• Tokenize & parse input sentences (future: NER, coref, …)
• Identify items with special projectivity & determine their scope
  • Example lexicon entry for the –/o implicatives (refuse, forbid, prohibit, …): scope = S complement; pattern = __ > (/VB.*/ > VP $. S=arg); projectivity = {=:=, ⊏:⊐, ⊐:⊏, ^:|, |:#, _:#, #:#}
• Problem: a PTB-style parse tree is not a semantic structure!
  [Figure: PTB parse of "Jimmy Dean refused to move without blue jeans", with +/– monotonicity marks on the nodes.]
• Solution: specify scope in PTB trees using Tregex [Levy & Andrew 06]
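For concreteness, one such lexicon entry might be represented roughly like this (the dict layout and field names are illustrative assumptions; only the values come from the slide):

```python
# Hypothetical representation of one projectivity-lexicon entry.
REFUSE_CLASS = {
    "category": "-/o implicative",
    "examples": ["refuse", "forbid", "prohibit"],
    "scope": "S complement",
    "tregex": "__ > (/VB.*/ > VP $. S=arg)",   # Tregex pattern marking the scope
    "projectivity": {"=": "=", "⊏": "⊐", "⊐": "⊏", "^": "|", "|": "#", "_": "#", "#": "#"},
}
```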
Step 2: Alignment
• Alignment as a sequence of atomic phrase edits
  • The ordering of edits defines a path through intermediate forms
  • Need not correspond to sentence order
  • Decomposes the problem into atomic inference problems
• We haven't (yet) invested much effort here
  • Experimental results use alignments from other sources
Step 3: Lexical entailment classification
• Goal: predict the entailment relation for each edit, based solely on lexical features, independent of context
• Approach: use lexical resources & machine learning
• Feature representation:
  • WordNet features: synonymy (=), hyponymy (⊏/⊐), antonymy (|)
  • Other relatedness features: Jiang-Conrath (WordNet-based), NomBank
  • Fallback: string similarity (based on Levenshtein edit distance)
  • Also lexical category, quantifier category, implication signature
• Decision tree classifier
  • Trained on 2,449 hand-annotated lexical entailment problems
  • E.g., SUB(gun, weapon): ⊏, SUB(big, small): |, DEL(often): ⊏
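As an illustration, the feature representation for a single edit might look roughly like this (the feature names and values are hypothetical; only the kinds of features come from the slide):

```python
# Hypothetical feature vector for the edit SUB(gun, weapon).
edit_features = {
    "edit_type": "SUB",
    "wn_synonym": 0.0,        # WordNet synonymy
    "wn_hyponym": 1.0,        # gun is a hyponym of weapon, suggesting ⊏
    "wn_antonym": 0.0,
    "jiang_conrath": 0.8,     # WordNet-based relatedness (value made up)
    "string_sim": 0.1,        # Levenshtein-based fallback
    "lexical_category": "noun",
}
# A trained decision tree maps such a vector to one of the 7 relations, here ⊏.
```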
Step 4: Entailment projection
[Figure: projection table, showing which contexts preserve each relation and which invert it.]
Step 5: Entailment joining
[Figure: join table for entailment relations; the final joined relation gives the answer.]
For example: from fish | human and human ^ nonhuman, we obtain fish ⊏ nonhuman.
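A minimal sketch of joining, covering only the slide's example (the full join table from the paper is omitted, and defaulting unknown joins to independence is a simplification of this sketch):

```python
JOIN = {("|", "^"): "⊏"}   # alternation joined with negation yields forward entailment

def join(r1, r2):
    """Join two entailment relations across consecutive edits."""
    return JOIN.get((r1, r2), "#")

# fish | human, human ^ nonhuman  =>  fish ⊏ nonhuman
assert join("|", "^") == "⊏"
```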
The FraCaS test suite
• FraCaS: a project in computational semantics [Cooper et al. 96]
• 346 "textbook" examples of NLI problems
• 3 possible answers: yes, no, unknown (not balanced!)
• 55% single-premise, 45% multi-premise (excluded)
Results on FraCaS
[Table of results not reproduced here.]
• 27% error reduction
• In the largest category, all but one problem correct
• High accuracy in the sections most amenable to natural logic
• High precision even outside areas of expertise
The RTE3 test suite
• Somewhat more "natural", but not ideal for NatLog
• Many kinds of inference not addressed by NatLog: paraphrase, temporal reasoning, relation extraction, …
• A big edit distance leads to propagation of errors from the atomic model
Results on RTE3: NatLog (each data set contains 800 problems)
• Accuracy is unimpressive, but precision is relatively high
• Strategy: hybridize with the Stanford RTE system
  • As in Bos & Markert 2006
  • But NatLog makes a positive prediction far more often (~25% vs. 4%)
Results on RTE3: hybrid system (each data set contains 800 problems)
• 4% gain (significant, p < 0.05)
Conclusion: what natural logic can't do
• Not a universal solution for NLI
• Many types of inference are not amenable to natural logic
  • Paraphrase: Eve was let go = Eve lost her job
  • Verb/frame alternation: he drained the oil ⊏ the oil drained
  • Relation extraction: Aho, a trader at UBS… ⊏ Aho works for UBS
  • Common-sense reasoning: the sink overflowed ⊏ the floor got wet
  • etc.
• Also, it has a weaker proof theory than FOL
  • Can't explain, e.g., de Morgan's laws for quantifiers:
    Not all birds fly = Some birds don't fly
Conclusion: what natural logic can do
Natural logic enables precise reasoning about containment, exclusion, and implicativity, while sidestepping the difficulties of translating to FOL. The NatLog system successfully handles a broad range of such inferences, as demonstrated on the FraCaS test suite. Ultimately, open-domain NLI is likely to require combining disparate reasoners, and a facility for natural logic is a good candidate to be a component of such a system.
A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
8 October 2008
Natural language inference (NLI) (aka RTE)
• Does premise P justify an inference to hypothesis H?
  • An informal notion of inference; variability of linguistic expression
P: In 1963, JFK was assassinated during a visit to Dallas.
H: Kennedy was killed in 1963.  → yes
• Like MT, NLI depends on a facility for alignment
  • I.e., linking corresponding words/phrases in two related sentences
• Alignment is addressed variously by current NLI systems
  • Implicit alignment: NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • Implicit alignment: NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
  • Explicit alignment followed by entailment classification [Marsi & Kramer 05, MacCartney et al. 06]
Contributions of this paper
In this paper, we:
• Undertake the first systematic study of alignment for NLI
  • Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
• Propose a new model of alignment for NLI: MANLI
  • Uses a phrase-based alignment representation
  • Exploits external lexical resources
  • Capitalizes on new supervised training data
• Examine the relation between alignment in NLI and MT
  • Can existing MT aligners be applied in the NLI setting?
NLI alignment vs. MT alignment
• Alignment is familiar in MT, with an extensive literature
• Can the tools & techniques of MT alignment transfer to NLI?
• Doubtful; NLI alignment differs in several respects:
  • Monolingual: can exploit resources like WordNet
  • Asymmetric: P is often longer & has content unrelated to H
  • Cannot assume semantic equivalence: an NLI aligner must accommodate frequent unaligned content
  • Little training data available: MT aligners use unsupervised training on massive amounts of bitext; NLI aligners must rely on supervised training & much less data
The MSR RTE2 alignment data
• Previously, little supervised data was available
• Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
• Token-based, but many-to-many
  • Allows implicit alignment of phrases
• 3 independent annotators
  • 3 of 3 agreed on 70% of proposed links
  • 2 of 3 agreed on 99.7% of proposed links
  • Merged using majority rule
The MANLI aligner
A new model of alignment for natural language inference:
• Phrase-based representation
• Feature-based scoring function
• Decoding using simulated annealing
• Perceptron learning
Phrase-based alignment representation
Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS
  DEL(In1) DEL(most2) DEL(Pacific3) DEL(countries4) DEL(there5)
  EQ(are6, are2) SUB(very7 few8, poorly3 represented4)
  EQ(women9, Women1) EQ(in10, in5) EQ(parliament11, parliament6) EQ(.12, .7)
• One-to-one at the phrase level (but many-to-many at the token level)
• Avoids arbitrary alignment choices; can use phrase-based resources
• For training (only!), converted the MSR data to this form
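A minimal sketch of this representation (the class layout is an illustrative assumption; the edit types and the example come from the slide):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PhraseEdit:
    """One phrase edit in a MANLI-style alignment (sketch, not MANLI's actual code)."""
    kind: str                 # "EQ", "SUB", "DEL", or "INS"
    p_tokens: List[str]       # premise-side phrase (empty for INS)
    h_tokens: List[str]       # hypothesis-side phrase (empty for DEL)

# The slide's example, abbreviated:
alignment = [
    PhraseEdit("DEL", ["In"], []),
    PhraseEdit("EQ",  ["are"], ["are"]),
    PhraseEdit("SUB", ["very", "few"], ["poorly", "represented"]),
    PhraseEdit("EQ",  ["women"], ["Women"]),
]
```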
A feature-based scoring function
• Score each edit as a linear combination of features, then sum over edits
• Edit type features: EQ, SUB, DEL, INS
• Phrase features: phrase sizes, non-constituents
• Lexical similarity feature: max over similarity scores
  • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
  • Distributional similarity à la Dekang Lin
  • Various measures of string/lemma similarity
• Contextual features: distortion, matching neighbors
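In other words, roughly (a sketch reusing the PhraseEdit structure above; the toy feature extractor is a stand-in, not MANLI's feature set):

```python
def edit_features(edit):
    """Toy feature extractor; the real features include edit type, phrase size,
    lexical similarity, and contextual features as listed on the slide."""
    return {
        "type=" + edit.kind: 1.0,
        "phrase_size": float(max(len(edit.p_tokens), len(edit.h_tokens))),
    }

def score_alignment(alignment, weights):
    """Score = sum over edits of the dot product of weights and edit features."""
    return sum(weights.get(name, 0.0) * value
               for edit in alignment
               for name, value in edit_features(edit).items())
```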
Decoding using simulated annealing
• Start with an initial alignment
• Repeat 100 times:
  • Generate successors
  • Score them
  • Smooth/sharpen the distribution: P(A) ← P(A)^(1/T)
  • Sample a successor
  • Lower the temperature: T ← 0.9 · T
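A generic sketch of this loop (the successor generator and scoring function are passed in as parameters, since the slide does not specify them; the softmax-style sharpening is one way to realize P(A)^(1/T)):

```python
import math
import random

def decode(initial, successors, score, iterations=100, temp=1.0, cooling=0.9):
    """Simulated-annealing decoding sketch: sample successors with probabilities
    sharpened/smoothed by 1/T, lowering T each round per the slide's schedule."""
    current = initial
    for _ in range(iterations):
        candidates = successors(current)
        if not candidates:
            break
        weights = [math.exp(score(c) / temp) for c in candidates]
        current = random.choices(candidates, weights=weights, k=1)[0]
        temp *= cooling
    return current
```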
Perceptron learning of feature weights
We use a variant of averaged perceptron [Collins 2002]

  Initialize weight vector w = 0, learning rate R0 = 1
  For training epoch i = 1 to 50:
    For each problem (Pj, Hj) with gold alignment Ej:
      Set Êj = ALIGN(Pj, Hj, w)
      Set w = w + Ri (Φ(Ej) – Φ(Êj))
    Set w = w / ‖w‖2 (L2 normalization)
    Set w[i] = w (store weight vector for this epoch)
    Set Ri = 0.8 Ri–1 (reduce learning rate)
  Throw away weight vectors from the first 20% of epochs
  Return the average weight vector

Training runs require about 20 hours (for 800 problems)
Evaluation on MSR data
• We evaluate several systems on the MSR data
  • Baseline, GIZA++ & Cross-EM, Stanford RTE, MANLI
• How well do they recover the gold-standard alignments?
• We report per-link precision, recall, and F1
  • Note that AER = 1 – F1
  • For MANLI, two tokens are considered aligned iff they are contained within phrases which are aligned
• We also report the exact match rate
  • What proportion of guessed alignments match the gold exactly?
Baseline: bag-of-words aligner
• Match each H token to the most similar P token [cf. Glickman et al. 2005]
• Surprisingly good recall, despite extreme simplicity
• But very mediocre precision, F1, & exact match rate
• Main problem: it aligns every token in H
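A rough sketch of such a baseline (the similarity function here is a trivial stand-in, not Glickman et al.'s actual scoring):

```python
def bag_of_words_align(p_tokens, h_tokens, sim=None):
    """Baseline sketch: link every hypothesis token to its most similar
    premise token (hence high recall but poor precision)."""
    if sim is None:
        # Trivial stand-in similarity: exact lowercase match or nothing.
        sim = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
    return {j: max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h))
            for j, h in enumerate(h_tokens)}

# Every H token gets linked to some P token, even when unrelated.
links = bag_of_words_align("JFK was assassinated in Dallas".split(),
                           "Kennedy was killed in 1963".split())
```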
MT aligners: GIZA++ & Cross-EM
• Why not use an off-the-shelf MT aligner for NLI?
• Run GIZA++ via Moses, with default parameters
  • Asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic
• Initial results are very poor: 56% F1
  • It doesn't even align equal words
  • Remedy: add a lexicon of equal words as extra training data
• Similar experiments with the Berkeley Cross-EM aligner
Results: MT aligners
• Similar F1, but GIZA++ wins on precision, Cross-EM on recall
• Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
• Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources
The Stanford RTE aligner
• Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments are not directly representable
  • But named entities & collocations are collapsed in pre-processing
• Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string similarity, …
• Syntax-based features to promote aligning corresponding predicate-argument structures
• Decoding & learning similar to MANLI
Results: Stanford RTE aligner
• Better F1 than the MT aligners*, but recall lags precision
• Stanford does a poor job aligning function words
  • 13% of links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI only 10%)
• Also, Stanford fails to align multi-word phrases:
  peace activists ~ protestors, hackers ~ non-authorized personnel
* includes a (generous) correction for missed punctuation
Results: MANLI aligner
• MANLI outperforms all the others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
• Good balance of precision & recall
• Matched >20% of alignments exactly
MANLI results: discussion
• Three factors contribute to its success:
  • Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  • Contextual features enable matching function words
  • Phrases: death penalty ~ capital punishment, abdicate ~ give up
• But phrases help less than expected!
  • If we set max phrase size = 1, we lose just 0.2% in F1
• Recall errors: room to improve
  • 40% need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
• Precision errors are harder to reduce
  • function words (49%), be (21%), punctuation (7%), equal lemmas (18%)
Can aligners predict RTE answers?
• We've been evaluating against gold-standard alignments
• But alignment is just one component of an NLI system
• Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, …
  • But the alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
• Using the alignment score to predict RTE answers:
  • Predict YES if score > threshold
  • Tune the threshold on development data
  • Evaluate on test data
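A minimal sketch of that thresholding procedure (the data format and function names are assumptions for illustration):

```python
def tune_threshold(dev_scores, dev_labels):
    """Pick the score threshold that maximizes accuracy on the dev set.
    dev_scores: alignment scores; dev_labels: booleans (YES = True)."""
    candidates = sorted(set(dev_scores))
    def accuracy(t):
        return sum((s > t) == y for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
    return max(candidates, key=accuracy)

def predict(score, threshold):
    return "YES" if score > threshold else "NO"
```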
Results: predicting RTE answers
• No NLI aligner rivals the top LCC system
• But Stanford & MANLI beat the average entry for RTE2
• Many NLI systems could benefit from better alignments!
Related work
• Lots of past work on phrase-based MT
  • But most systems extract phrases from word-aligned data, despite the assumption that many translations are non-compositional
  • Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
• However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren't appropriate
  • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  • MT systems don't model word insertions & deletions
Conclusion
• MT aligners are not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
• MANLI succeeds by:
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
• The phrase-based representation shows potential
  • But not yet proven: need better phrase-based lexical resources

Thanks! Questions? :-)