Baselines for Recognizing Textual Entailment Ling 541 Final Project Terrence Szymanski
What is Textual Entailment? • Informally: A text T entails a hypothesis H if the meaning of H can be inferred from the meaning of T. • Example: • T: Profits nearly doubled to nearly $1.8 billion. • H: Profits grew to nearly $1.8 billion. • Entailment holds (is true).
Types of Entailment • For many entailments, H is simply a paraphrase of all or part of T. • Other entailments are less obvious: • T: Jorma Ollila joined Nokia in 1985 and held a variety of key management positions before taking the helm in 1992 • H: Jorma Ollila is the CEO of Nokia. • ~95% human level of agreement on entailment judgments
The PASCAL RTE Challenge • First challenge held in 2005 (RTE1) • 16 entries • System performances ranged from 50% to 59% accuracy. • Wide array of approaches, using word overlap, synonymy/word distance, statistical lexical relations, dependency tree matching… • Second challenge is underway (RTE2)
What is BLEU? • BLEU was designed as a metric to measure the accuracy of machine-generated translations by comparing them to human-generated gold standards. • Scores are based on n-gram overlap (typically for n = 1, 2, 3, and 4), with a penalty for overly brief translations. • Application for RTE?
Using the BLEU Algorithm for RTE • Proposed by Perez & Alfonseca in RTE1. • Use the traditional BLEU algorithm to capture n-gram overlap between T-H pairs. • Find a cutoff score such that a BLEU score above the cutoff implies a TRUE entailment (otherwise FALSE) • Roughly 50% accuracy: simple baseline. • However: intuitively, the BLEU algorithm is not ideal for RTE • BLEU was designed for evaluating MT systems • BLEU could be adjusted to better suit the RTE task.
Modifying the BLEU Algorithm • Hypotheses are normally short; thus it does not make sense to penalize them for being short. • BLEU uses a geometric mean to average the n-gram overlap for n = 1, 2, 3, and 4. • If any value of n produces a zero score, the entire score is nullified. • Therefore: modify the algorithm to not penalize for brevity, and use a linear weighted average instead of the geometric mean.
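Modifying the BLEU Algorithm • Original BLEU: $\mathrm{BLEU} = b \cdot \exp\left( \sum_{i=1}^{N} w_i \log \frac{c_{test,ref}}{c_{test}} \right)$ • Modified BLEU: $\mathrm{BLEU}_{mod} = \sum_{i=1}^{N} w_i \, \frac{c_{test,ref}}{c_{test}}$ • $w_i$ is the weighting factor (universally set to $1/N$). • $b$ is the brevity factor (see paper for details). • $c_{test,ref}$ is the count of n-grams appearing in both test and ref, and $c_{test}$ is the total count of n-grams appearing in test; in each term of the sum, both are counted over n-grams of length $i$.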
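The two scoring rules described on these slides can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's actual implementation: the hypothesis H is treated as the "test" string and the text T as the "ref", matched n-gram counts are clipped as in standard BLEU, and the brevity factor uses the standard BLEU form exp(1 - r/c), since the slides defer its definition to the paper. The function names (`tokenize`, `ngram_precision`, `original_bleu`, `modified_bleu`) are illustrative.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Naive tokenizer (assumption): lowercase, drop final period, split on spaces.
    return sentence.lower().rstrip(".").split()


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(test, ref, n):
    # c_test,ref / c_test for n-grams of length n, with clipped matches.
    test_counts = Counter(ngrams(test, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(test_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in test_counts.items())
    return matched / total


def original_bleu(test, ref, max_n=4):
    # Geometric mean of the n-gram precisions, times a brevity factor.
    precisions = [ngram_precision(test, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # a single zero precision nullifies the whole score
    # Assumed brevity factor: standard BLEU penalty for a short test string.
    brevity = 1.0 if len(test) > len(ref) else math.exp(1 - len(ref) / len(test))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


def modified_bleu(test, ref, max_n=4):
    # Linear average of the n-gram precisions, with no brevity factor.
    return sum(ngram_precision(test, ref, n) for n in range(1, max_n + 1)) / max_n


t = tokenize("Profits nearly doubled to nearly $1.8 billion.")
h = tokenize("Profits grew to nearly $1.8 billion.")
print(original_bleu(h, t), modified_bleu(h, t))
```

With uniform weights 1/N, the linear average stays positive whenever any n-gram order matches, which is why the modified score yields a continuum of values where the original yields many zeros.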
Performance Comparison • Ran both the unmodified and the modified BLEU algorithm on the RTE1 data sets. • Used the development set to obtain the cutoff score. • Used the test set as the evaluation data.
Cutoff Score for BLEU • The unmodified algorithm produces a high percentage of zero scores (67%). • Not surprisingly, the cutoff score is zero!
Cutoff Score for BLEU • Two equivalent cutoff scores: 0 and 0.13. • Both offer 53.8% accuracy, but the zero cutoff was used because it is a natural candidate for a cutoff.
Cutoff Score for Modified BLEU • Modified BLEU produces a continuum of scores, unlike the original BLEU • Need to find the optimal cutoff score that maximizes accuracy.
Cutoff Score for Modified BLEU • The optimal cutoff score is found to be 0.221.
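The cutoff search on the development set can be sketched as a simple sweep over candidate thresholds. This is a minimal sketch, not the procedure actually used in the project: the scores and labels below are made-up toy values, the function names are illustrative, and breaking ties in favor of the lower cutoff is an assumption.

```python
def accuracy(scores, labels, cutoff):
    # Predict TRUE when the score is strictly above the cutoff, else FALSE.
    correct = sum((s > cutoff) == y for s, y in zip(scores, labels))
    return correct / len(labels)


def best_cutoff(scores, labels):
    # Candidate cutoffs: zero plus every score observed in the development set.
    candidates = sorted(set(scores) | {0.0})
    # Highest accuracy wins; ties go to the lower cutoff (assumption).
    return max(candidates, key=lambda c: (accuracy(scores, labels, c), -c))


# Hypothetical development-set scores and gold entailment labels.
dev_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
dev_labels = [False, False, True, False, True, True]

cutoff = best_cutoff(dev_scores, dev_labels)
print(cutoff, accuracy(dev_scores, dev_labels, cutoff))
```

The cutoff chosen this way is then frozen and applied unchanged to the test set, so that the reported test accuracy is not tuned on the evaluation data.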
Validity of cutoff scores? • The original BLEU seems to have a good natural cutoff score of zero. • The optimal cutoff for modified BLEU varies depending on the data set, although 0.221 is an acceptable value (more data may be needed to optimize it; the cutoff may also be task-specific).
Results on RTE1 Data • Original BLEU: development set, cutoff score = 0, accuracy = 53.8%; test set, accuracy = 52.0%. • Modified BLEU: development set, cutoff score = 0.221, accuracy = 57.8%; test set, accuracy = 53.8%.
Results on RTE2 Data • Original BLEU: development set, cutoff score = 0, accuracy = 56.0%; test set, not yet available. • Modified BLEU: development set, cutoff score = 0.221, accuracy = 60.4%; development set, cutoff score = 0.25, accuracy = 61.4%; test set, not yet available. • RTE2 test set will be released in January.
Comparison of Results • Accuracy scores for four systems: Original BLEU, Modified BLEU, Perez & Alfonseca’s implementation of BLEU, and the best submission to the RTE1 Challenge. • Modified BLEU is better than the other versions of BLEU, but nowhere near the best system performance.
End Results • Modified BLEU algorithm outperforms the original BLEU algorithm for RTE. • Consistent increase in accuracy of roughly 2-4 percentage points. • Does this mean that modified BLEU is a candidate system for RTE applications?
NO: BLEU is a baseline algorithm • “Don’t climb a tree to get to the moon.” • BLEU (and other n-gram based methods) are good baselines, but lack the potential for future improvement. • Example: • T: It is not the case that John likes ice cream. • H: John likes ice cream. • Perfect n-gram overlap, but entailment is FALSE.
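The negation example can be checked directly. The sketch below computes, for each n-gram length, the fraction of the hypothesis n-grams that also occur in the text; the naive tokenizer and the function names are assumptions made for illustration.

```python
def tokenize(sentence):
    # Naive tokenizer (assumption): lowercase, drop final period, split on spaces.
    return sentence.lower().rstrip(".").split()


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap(hyp, text, n):
    # Fraction of the hypothesis n-grams that also appear in the text.
    hyp_grams = ngrams(hyp, n)
    if not hyp_grams:
        return 0.0
    text_grams = set(ngrams(text, n))
    return sum(g in text_grams for g in hyp_grams) / len(hyp_grams)


text = tokenize("It is not the case that John likes ice cream.")
hyp = tokenize("John likes ice cream.")

for n in range(1, 5):
    print(n, overlap(hyp, text, n))
```

Every n-gram of H occurs in T for n = 1 to 4, so any overlap-based score ranks this pair as maximally similar even though the entailment is false.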
Future Improvements • Potential exists to add word-similarity enhancements, such as synonym substitution, etc. • Rather than think of these as enhancements to the BLEU algorithm, we should think of the BLEU algorithm as a baseline for measuring the benefit offered by such improvements. • i.e. Performance of BLEU vs. Performance of BLEU after synonym substitution. • => Evaluate the benefit synonym substitution can have on a larger RTE system.
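The baseline-versus-enhancement comparison described here can be sketched as scoring the same pair before and after synonym substitution. Everything in this sketch is hypothetical: the synonym table, the sentence pair, and the function names are invented for illustration, and unigram overlap stands in for a full BLEU score to keep the example short.

```python
# Hypothetical toy synonym table mapping words to a canonical form.
SYNONYMS = {"purchased": "bought", "automobile": "car"}


def normalize(tokens):
    # Replace each token with its canonical synonym, if it has one.
    return [SYNONYMS.get(tok, tok) for tok in tokens]


def unigram_overlap(hyp, text):
    # Fraction of hypothesis tokens that also appear in the text.
    text_set = set(text)
    return sum(tok in text_set for tok in hyp) / len(hyp)


# Hypothetical text/hypothesis pair.
text = "mary purchased an automobile".split()
hyp = "mary bought a car".split()

before = unigram_overlap(hyp, text)
after = unigram_overlap(normalize(hyp), normalize(text))
print(before, after)
```

Running the same comparison over a whole data set, with accuracy in place of a single score, would give the kind of measurement the slide proposes: the change in baseline performance attributable to the substitution step alone.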
Conclusions • The BLEU algorithm can be modified to better suit the RTE task. • Modifications are theory-motivated: eliminate the brevity penalty, and use a linear rather than a geometric mean. • Performance benefits: Modified BLEU consistently has roughly 2-4 percentage points higher accuracy. • Still, BLEU is only a baseline algorithm. • Lacks the capacity to incorporate future developments. • Can be used to measure performance benefits of various enhancements.