A Phrase-Based Model of Alignment for Natural Language Inference
Bill MacCartney, Michel Galley, and Christopher D. Manning
Stanford University
26 October 2008
Outline
• Introduction • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion

Natural language inference (NLI) (aka RTE)
• Does premise P justify an inference to hypothesis H?
• An informal notion of inference; variability of linguistic expression
  P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.
  H: Gazprom will double Georgia’s gas bill.  [Answer: yes]
• Like MT, NLI depends on a facility for alignment
  • i.e., linking corresponding words/phrases in two related sentences
Alignment example
[Figure: token-level alignment grid between P (premise) and H (hypothesis), illustrating:]
• unaligned content: “deletions” from P
• approximate match: price ~ bill
• phrase alignment: two-fold increase ~ double
Approaches to NLI alignment
• Alignment is addressed variously by current NLI systems
• In some approaches to NLI, alignments are implicit:
  • NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
  • NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
• Other NLI systems make the alignment step explicit:
  • Align first, then determine inferential validity [Marsi & Kramer 05, MacCartney et al. 06]
• What about using an MT aligner?
  • Alignment is familiar in MT, with an extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  • Can the tools & techniques of MT alignment transfer to NLI?
NLI alignment vs. MT alignment
Doubtful: NLI alignment differs in several respects:
• Monolingual: can exploit resources like WordNet
• Asymmetric: P is often longer & has content unrelated to H
• Cannot assume semantic equivalence
  • An NLI aligner must accommodate frequent unaligned content
• Little training data available
  • MT aligners use unsupervised training on huge amounts of bitext
  • NLI aligners must rely on supervised training & much less data
Contributions of this paper
In this paper, we:
• Undertake the first systematic study of alignment for NLI
  • Existing NLI aligners use idiosyncratic methods, are poorly documented, and use proprietary data
• Examine the relation between alignment in NLI and MT
  • How do existing MT aligners perform on the NLI alignment task?
• Propose a new model of alignment for NLI: MANLI
  • Outperforms existing MT & NLI aligners on the NLI alignment task
The MANLI aligner
A model of alignment for NLI consisting of four components:
• Phrase-based representation
• Feature-based scoring function
• Decoding using simulated annealing
• Perceptron learning
Phrase-based alignment representation
Represent an alignment as a sequence of phrase edits: EQ, SUB, DEL, INS
  EQ(Gazprom1, Gazprom1), INS(will2), DEL(today2), DEL(confirmed3), DEL(a4), SUB(two-fold5 increase6, double3), DEL(in7), DEL(its8), …
  (subscripts are token indices in P and H)
• One-to-one at the phrase level (but many-to-many at the token level)
• Avoids arbitrary alignment choices; can use phrase-based resources
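To make the representation concrete, here is a minimal Python sketch of how such phrase edits might be encoded. The class and field names (PhraseEdit, p_span, h_span) are illustrative assumptions, not MANLI's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # half-open token range [start, end) within a sentence

@dataclass
class PhraseEdit:
    """One edit in a MANLI-style alignment: EQ, SUB, DEL (premise only), or INS (hypothesis only)."""
    kind: str                # "EQ" | "SUB" | "DEL" | "INS"
    p_span: Optional[Span]   # phrase in the premise (None for INS)
    h_span: Optional[Span]   # phrase in the hypothesis (None for DEL)

# The example alignment from the slide, with 0-based token indices:
P = "Gazprom today confirmed a two-fold increase in its gas price for Georgia".split()
H = "Gazprom will double Georgia 's gas bill".split()

alignment = [
    PhraseEdit("EQ",  (0, 1), (0, 1)),   # Gazprom ~ Gazprom
    PhraseEdit("INS", None,   (1, 2)),   # will
    PhraseEdit("DEL", (1, 2), None),     # today
    PhraseEdit("DEL", (2, 3), None),     # confirmed
    PhraseEdit("DEL", (3, 4), None),     # a
    PhraseEdit("SUB", (4, 6), (2, 3)),   # two-fold increase ~ double
    # ...
]
```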
A feature-based scoring function
• Score each edit as a linear combination of features, then sum over edits:
  • Edit type features: EQ, SUB, DEL, INS
  • Phrase features: phrase sizes, non-constituents
  • Lexical similarity feature: max over similarity scores
    • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
    • Distributional similarity à la Dekang Lin
    • Various measures of string/lemma similarity
  • Contextual features: distortion, matching neighbors
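A rough sketch of this kind of linear scoring, reusing the PhraseEdit sketch above. The feature names and the trivial lexical-similarity placeholder are assumptions for illustration, not the paper's feature set.

```python
def lexical_similarity(p_phrase: str, h_phrase: str) -> float:
    # Placeholder: the real model takes a max over WordNet, distributional
    # similarity, and string/lemma similarity measures.
    return 1.0 if p_phrase.lower() == h_phrase.lower() else 0.0

def edit_features(edit, P, H):
    """A toy feature map for one phrase edit (feature names are illustrative)."""
    f = {"type=" + edit.kind: 1.0}                       # edit type features
    if edit.p_span:
        f["p_phrase_size"] = edit.p_span[1] - edit.p_span[0]
    if edit.h_span:
        f["h_phrase_size"] = edit.h_span[1] - edit.h_span[0]
    if edit.kind in ("EQ", "SUB"):
        p = " ".join(P[edit.p_span[0]:edit.p_span[1]])
        h = " ".join(H[edit.h_span[0]:edit.h_span[1]])
        f["lex_sim"] = lexical_similarity(p, h)
    return f

def score_alignment(alignment, P, H, weights):
    """Alignment score = sum over edits of (weights . features(edit))."""
    total = 0.0
    for edit in alignment:
        for name, value in edit_features(edit, P, H).items():
            total += weights.get(name, 0.0) * value
    return total
```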
Decoding using simulated annealing
• Start with an initial alignment; then repeat 100 times:
  • Generate successors of the current alignment and score them
  • Smooth/sharpen the distribution: P(A) := P(A)^(1/T) (renormalized)
  • Sample a successor from the sharpened distribution
  • Lower the temperature: T := 0.9 T
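A schematic version of this decoding loop, assuming alignment scores act as unnormalized log-probabilities (so that sharpening P(A)^(1/T) corresponds to dividing scores by T). The successor-generation function and the schedule details are placeholders, not the actual decoder.

```python
import math
import random

def anneal_decode(initial_alignment, successors, score, steps=100, t0=10.0, cooling=0.9):
    """Simulated-annealing search over alignments (schematic sketch).
    At each step: score the successors of the current alignment, sharpen the
    distribution by 1/T, sample one successor, and lower the temperature."""
    current, T = initial_alignment, t0
    best = current
    for _ in range(steps):
        candidates = successors(current)
        if not candidates:
            break
        # Softmax over score/T: low T makes the choice nearly greedy.
        logits = [score(a) / T for a in candidates]
        m = max(logits)
        probs = [math.exp(l - m) for l in logits]
        current = random.choices(candidates, weights=probs, k=1)[0]
        if score(current) > score(best):
            best = current
        T *= cooling  # e.g. T := 0.9 T
    return best
```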
Perceptron learning of feature weights
We use a variant of averaged perceptron [Collins 2002]:
  Initialize weight vector w = 0, learning rate R_0 = 1
  For training epoch i = 1 to 50:
    For each problem (P_j, H_j) with gold alignment E_j:
      Set Ê_j = ALIGN(P_j, H_j, w)
      Set w = w + R_i (Φ(E_j) − Φ(Ê_j))
    Set w = w / ‖w‖_2  (L2 normalization)
    Set w[i] = w  (store the weight vector for this epoch)
    Set R_i = 0.8 R_(i−1)  (reduce the learning rate)
  Throw away the weight vectors from the first 20% of epochs
  Return the average of the remaining weight vectors
Training runs require about 20 hours (on 800 RTE problems)
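A runnable sketch of this averaged-perceptron variant. Here `align` and `features` stand in for the MANLI decoder and feature map, and the hyperparameter names are assumptions based on the pseudocode above.

```python
import math
from collections import defaultdict

def train_perceptron(problems, align, features, epochs=50, r0=1.0, decay=0.8, burn_in=0.2):
    """Averaged perceptron roughly following the slide's pseudocode.
    `problems` is a list of (P, H, gold_alignment); `align(P, H, w)` decodes
    the best alignment under weights w; `features(alignment)` returns a dict."""
    w = defaultdict(float)
    stored, rate = [], r0
    for _ in range(epochs):
        for P, H, gold in problems:
            guess = align(P, H, w)
            # w := w + R_i * (Phi(gold) - Phi(guess))
            for name, v in features(gold).items():
                w[name] += rate * v
            for name, v in features(guess).items():
                w[name] -= rate * v
        # L2-normalize and store this epoch's weight vector
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        w = defaultdict(float, {k: v / norm for k, v in w.items()})
        stored.append(dict(w))
        rate *= decay  # R_i = 0.8 * R_(i-1)
    # Discard the first 20% of epochs, then average the rest
    kept = stored[int(burn_in * len(stored)):]
    avg = defaultdict(float)
    for wv in kept:
        for k, v in wv.items():
            avg[k] += v / len(kept)
    return dict(avg)
```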
The MSR RTE2 alignment data
• Previously, little supervised data
• Now, MSR gold alignments for RTE2 [Brockett 2007]
  • dev & test sets, 800 problems each
  • Token-based, but many-to-many
    • allows implicit alignment of phrases
  • 3 independent annotators
    • 3 of 3 agreed on 70% of proposed links
    • 2 of 3 agreed on 99.7% of proposed links
    • merged using majority rule
Evaluation on MSR data
• We evaluate several systems on the MSR data:
  • A simple baseline aligner
  • MT aligners: GIZA++ & Cross-EM
  • NLI aligners: Stanford RTE, MANLI
• How well do they recover the gold-standard alignments?
  • We report per-link precision, recall, and F1
  • We also report the exact match rate for complete alignments
Baseline: bag-of-words aligner
Match each H token to its most similar P token [cf. Glickman et al. 2005]
• Surprisingly good recall, despite extreme simplicity
• But very mediocre precision, F1, & exact match rate
• Main problem: aligns every token in H
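A minimal sketch of such a bag-of-words baseline; the similarity function is left abstract, and the exact-match example is only illustrative.

```python
def bag_of_words_align(P_tokens, H_tokens, sim):
    """Baseline in the spirit of the slide (cf. Glickman et al. 2005):
    link every hypothesis token to its most similar premise token.
    Note that this aligns every H token, which is the weakness noted above."""
    links = []
    for j, h in enumerate(H_tokens):
        best_i = max(range(len(P_tokens)), key=lambda i: sim(P_tokens[i], h))
        links.append((best_i, j))  # (premise index, hypothesis index)
    return links

# Toy usage with exact-match similarity:
# bag_of_words_align("Gazprom today confirmed a two-fold increase".split(),
#                    "Gazprom will double".split(),
#                    lambda p, h: 1.0 if p.lower() == h.lower() else 0.0)
```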
MT aligners: GIZA++ & Cross-EM
• Can we show that MT aligners aren’t suitable for NLI?
• Run GIZA++ via Moses, with default parameters
  • Train on the dev set, evaluate on dev & test sets
  • Produce asymmetric alignments in both directions
  • Then symmetrize using the INTERSECTION heuristic (sketched below)
• Initial results are very poor: 56% F1
  • Doesn’t even align equal words
  • Remedy: add a lexicon of equal words as extra training data
• Do similar experiments with the Berkeley Cross-EM aligner
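For reference, the INTERSECTION symmetrization heuristic amounts to keeping only the links present in both asymmetric alignments, roughly as in this generic sketch (not Moses' actual code):

```python
def intersect_symmetrize(p2h_links, h2p_links):
    """Keep a link (i, j) only if it appears in both asymmetric alignments
    (P->H and H->P). Links are (premise_index, hypothesis_index) pairs."""
    return set(p2h_links) & set(h2p_links)
```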
Results: MT aligners
• Similar F1, but GIZA++ wins on precision, Cross-EM on recall
• Both do best with the lexicon & the INTERSECTION heuristic
  • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
  • All achieve better recall, but much worse precision & F1
• Problem: too little data for unsupervised learning
  • Need to compensate by exploiting external lexical resources
The Stanford RTE aligner
• Token-based alignments: a map from H tokens to P tokens
  • Phrase alignments are not directly representable
  • (But named entities & collocations are collapsed in pre-processing)
• Exploits external lexical resources
  • WordNet, LSA, distributional similarity, string similarity, …
• Syntax-based features promote aligning corresponding predicate-argument structures
• Decoding & learning are similar to MANLI’s
Results: Stanford RTE aligner
• Better F1 than the MT aligners, but recall lags precision*
• Stanford does a poor job aligning function words
  • 13% of links in the gold data are prepositions & articles
  • Stanford misses 67% of these (MANLI only 10%)
• Also, Stanford fails to align multi-word phrases
  • peace activists ~ protestors, hackers ~ non-authorized personnel
* includes a (generous) correction for missed punctuation
Results: MANLI aligner
• MANLI outperforms all the others on every measure
  • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
• Good balance of precision & recall
• Matched >20% exactly
MANLI results: discussion
• Three factors contribute to its success:
  • Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
  • Contextual features enable matching function words
  • Phrases: death penalty ~ capital punishment, abdicate ~ give up
• But phrases help less than expected!
  • If we set the max phrase size to 1, we lose just 0.2% in F1
• Recall errors: room to improve
  • 40%: need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
• Precision errors are harder to reduce
  • equal function words (49%), forms of be (21%), punctuation (7%)
Can aligners predict RTE answers?
• We’ve been evaluating against gold-standard alignments
• But alignment is just one component of an NLI system
• Does a good alignment indicate a valid inference?
  • Not necessarily: negations, modals, non-factives & implicatives, …
  • But the alignment score can be strongly predictive
  • And many NLI systems rely solely on alignment
• Using the alignment score to predict RTE answers:
  • Predict YES if score > threshold
  • Tune the threshold on development data
  • Evaluate on test data
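A sketch of this score-thresholding procedure; the accuracy-maximizing sweep is an assumption about how the threshold is tuned on the development data.

```python
def tune_threshold(dev_scores, dev_labels):
    """Pick the threshold on alignment score that maximizes accuracy on the
    development set; labels are True for YES (entailment)."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(dev_scores)):
        acc = sum((s > t) == y for s, y in zip(dev_scores, dev_labels)) / len(dev_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict_rte(score, threshold):
    """Predict YES iff the alignment score exceeds the tuned threshold."""
    return score > threshold
```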
Results: predicting RTE answers
• No NLI aligner rivals the best complete RTE system
  • (Most) complete systems do a lot more than just alignment!
• But Stanford & MANLI beat the average entry for RTE2
  • Many NLI systems could benefit from better alignments!
Conclusion
• MT aligners are not directly applicable to NLI
  • They rely on unsupervised learning from massive amounts of bitext
  • They assume semantic equivalence of P & H
• MANLI succeeds by:
  • Exploiting (manually & automatically constructed) lexical resources
  • Accommodating frequent unaligned phrases
• The phrase-based representation shows potential
  • But not yet proven: we need better phrase-based lexical resources
Thanks! Questions? :-)
Related work
• Lots of past work on phrase-based MT
  • But most systems extract phrases from word-aligned data
  • Despite the assumption that many translations are non-compositional
• Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
• However, this is of limited applicability to the NLI task
  • MANLI uses phrases only when words aren’t appropriate
  • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
  • MT systems don’t model word insertions & deletions