The Significance of Precision and Recall in Automatic Metrics for MT Evaluation Alon Lavie, Kenji Sagae, Shyamsundar Jayaraman Language Technologies Institute Carnegie Mellon University
Outline • Similarity-based metrics for MT evaluation • Weaknesses in Precision-based MT metrics (BLEU, NIST) • Simple unigram-based MT evaluation metrics • METEOR • Evaluation Methodology • Experimental Evaluation • Recent Related Work • Future Directions AMTA 2004
Similarity-based MT Evaluation Metrics • Assess the “quality” of an MT system by comparing its output with human-produced “reference” translations • Premise: the more similar (in meaning) the translation is to the reference, the better • Goal: an algorithm that is capable of accurately approximating the similarity • Wide range of metrics, mostly focusing on word-level correspondences: • Edit-distance metrics: Levenshtein, WER, PIWER, … • Ngram-based metrics: Precision, Recall, F1-measure, BLEU, NIST, GTM… • Main Issue: perfect word matching is a very crude estimate of sentence-level similarity in meaning AMTA 2004
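The edit-distance family mentioned above can be illustrated with a minimal sketch; the function names and whitespace tokenization are assumptions for illustration, not part of the slides.

```python
# Word-level Levenshtein distance and a WER-style normalization (illustrative sketch).
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions, and substitutions."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edit distance between first i hyp words and first j ref words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(h)][len(r)]

def wer(hyp, ref):
    """Word Error Rate: edit distance normalized by reference length."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)
```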
Desirable Automatic Metric • High levels of correlation with quantified human notions of translation quality • Sensitive to small differences in MT quality between systems and versions of systems • Consistent – the same MT system on similar texts should produce similar scores • Reliable – MT systems that score similarly will perform similarly • General – applicable to a wide range of domains and scenarios • Fast and lightweight – easy to run AMTA 2004
Weaknesses in BLEU (and NIST) • BLEU matches word ngrams of the MT translation with multiple reference translations simultaneously → a Precision-based metric • Is this better than matching with each reference translation separately and selecting the best match? • BLEU compensates for Recall by factoring in a “Brevity Penalty” (BP) • Is the BP adequate in compensating for lack of Recall? • BLEU’s ngram matching requires exact word matches • Can stemming and synonyms improve the similarity measure and improve correlation with human scores? • All matched words weigh equally in BLEU • Can a scheme for weighing word contributions improve correlation with human scores? • BLEU’s higher-order ngrams account for fluency and grammaticality, and ngrams are geometrically averaged • Geometric ngram averaging is volatile to “zero” scores. Can we account for fluency/grammaticality via other means? AMTA 2004
Roadmap to a Desirable Metric • Establishing a metric with much improved correlation with human judgment scores at the sentence level will go a long way towards our overall goals • Our Approach: • Explicitly align the words in the MT translation with their corresponding matches in the reference translation, allowing for: • Exact matches, stemmed word matches, synonym and semantically-related word matches • Combine unigram Precision and Recall to account for the similarity in “content” (translation adequacy) • Weigh the contribution of matched words based on a measure related to their importance • Estimate translation fluency/grammaticality based on an explicit measure related to word order, fragmentation and/or the average length of matched ngrams AMTA 2004
METEOR vs. BLEU: Sentence-level Scores (CMU System, 2003 Data) [Scatter plots of metric vs. human scores: BLEU R=0.2466, METEOR R=0.4129] AMTA 2004
Unigram-based Metrics • Unigram Precision: fraction of words in the MT output that appear in the reference • Unigram Recall: fraction of words in the reference translation that appear in the MT output • F1 = P*R / (0.5*(P+R)) • Fmean = P*R / (0.9*P + 0.1*R) • With and without word stemming • Match with each reference separately and select the best match for each sentence AMTA 2004
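A short sketch of these unigram metrics, matching against each reference separately and keeping the best; the clipped-count matching and the helper names are assumptions for illustration.

```python
from collections import Counter

def unigram_prf(hyp, ref):
    """Unigram Precision, Recall, F1, and Fmean for one hypothesis/reference pair."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    # Count each hypothesis word at most as often as it appears in the reference
    matches = sum((Counter(hyp_toks) & Counter(ref_toks)).values())
    p = matches / len(hyp_toks) if hyp_toks else 0.0
    r = matches / len(ref_toks) if ref_toks else 0.0
    f1 = p * r / (0.5 * (p + r)) if p + r > 0 else 0.0
    fmean = p * r / (0.9 * p + 0.1 * r) if p + r > 0 else 0.0
    return p, r, f1, fmean

def best_reference_scores(hyp, refs):
    """Score against each reference separately and keep the best match (by Fmean)."""
    return max((unigram_prf(hyp, ref) for ref in refs), key=lambda t: t[3])
```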
The METEOR Metric • Matcher explicitly aligns matched words between MT and reference • Multiple stages: exact matches, stemmed matches, (synonym matches) • Matcher returns fragment count – used to calculate average fragmentation (frag) • METEOR score calculated as a discounted Fmean score • Discounting factor: DF = 0.5 * (frag**3) • Final score: Fmean * (1- DF) AMTA 2004
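A minimal sketch of the scoring step described above, assuming the matcher has already returned the match and fragment counts, and that frag is the fragment count divided by the number of matched unigrams; the parameter names are illustrative.

```python
def meteor_score(matches, fragments, hyp_len, ref_len):
    """Discounted Fmean score given the matcher's output."""
    if matches == 0:
        return 0.0
    p = matches / hyp_len
    r = matches / ref_len
    fmean = p * r / (0.9 * p + 0.1 * r)
    frag = fragments / matches          # average fragmentation
    discount = 0.5 * (frag ** 3)        # DF = 0.5 * frag^3
    return fmean * (1.0 - discount)
```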
Evaluation Methodology • Correlation of metric scores with human scores at the system level • Human scores are adequacy+fluency [2-10] • Pearson correlation coefficients • Confidence ranges for the correlation coefficients • Correlation of score differentials between all pairs of systems [Coughlin 2003] • Assumes a linear relationship between the score differentials AMTA 2004
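For the correlation step, a plain Pearson coefficient over per-system metric and human scores suffices; the score lists below are hypothetical placeholders.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# one metric score and one human adequacy+fluency score per system (hypothetical values)
metric_scores = [0.28, 0.31, 0.25, 0.34]
human_scores  = [6.1, 6.8, 5.9, 7.2]
print(pearson_r(metric_scores, human_scores))
```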
Evaluation Setup • Data: DARPA/TIDES 2002 and 2003 Chinese-to-English MT evaluation data • 2002 data: • ~900 sentences, 4 reference translations • 7 systems • 2003 data: • ~900 sentences, 4 reference translations • 6 systems • Metrics Compared: BLEU, NIST, P, R, F1, Fmean, GTM, B&H, METEOR AMTA 2004
Recent Related Work • [Lin & Hovy 2003] • [Turian Shen & Melamed 2003] • [Babych & Hartley 2004] • [Soricut & Brill 2004] AMTA 2004
Current and Future Directions AMTA 2004
METEOR: Metric for Evaluation of Translation with Explicit Ordering. An Improved Automatic Metric for MT Evaluation. Faculty: Alon Lavie, Jaime Carbonell. Student: Rachel Reynolds
Automatic Metrics for MT Evaluation • Human evaluations are costly and time consuming • Automatic evaluation is cheap and fast • Can support both objective comparison between systems and incremental development of systems based on performance effects • Essential to have a “good” metric: • Metric scores that correlate as closely as possible to human judgments AMTA 2004
The BLEU Metric • Proposed by IBM [Papineni et al, 2002] • First serious metric proposed, used extensively over the last couple of years (DARPA/TIDES, various other groups) • Main concepts in BLEU: • Compare system output with a set of reference translations • Calculate score based on n-gram matches (length 1-4) between the translation and the reference translations • Aggregate Precision = # n-gram matches / total # of n-grams in system output (for each n-gram length) • Weighted geometric averaging of the n-gram orders • Brevity Penalty if system output is too short • Compensation for not taking recall into account • Recall = # n-gram matches / total # n-grams in reference translations (for each n-gram length) AMTA 2004
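A rough, single-reference sketch of the computation outlined above, with no smoothing and uniform n-gram weights; it simplifies [Papineni et al, 2002] for illustration and is not the official implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    hyp_toks, ref_toks = hyp.split(), ref.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp_toks, n), ngrams(ref_toks, n)
        matches = sum((hyp_ng & ref_ng).values())   # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append(matches / total)
    if min(precisions) == 0:
        return 0.0                                  # geometric mean collapses to zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(ref_toks) / len(hyp_toks)))  # brevity penalty
    return bp * geo_mean
```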
BLEU: Main Problems • Recall is not explicitly accounted for • The Brevity Penalty is not effective compensation for lack of recall • The notion of grammaticality (particularly word order) is only indirectly measured via higher-order n-grams (2,3,4) • No explicit matching between words in translation and reference, so order cannot be explicitly compared • Matching is only on exact words (no morphological variants, synonyms, etc.) Result: the metric is crude and not sensitive to anything except major differences between systems; correlation with human judgments is not very high AMTA 2004
METEOR Metric: Characteristics • Explicit word matching of the translation with one or more reference translations • Calculate unigram precision/recall/F1 for each match • Calculate an explicit penalty based on how out-of-order the matched words in the translation are • Sentence score [0,1] is the score of the best match with the best reference translation • Aggregate scores for the entire set similar to BLEU AMTA 2004
METEOR: Algorithm • Align translation with references • Find the reference translation with the most matches • Calculate precision, recall and F1 using the best reference translation • Precision = # matches / length of translation • Recall = # matches / length of reference • F1 = 2 * P * R / (P + R) AMTA 2004
METEOR: Algorithm • Calculate the reordering penalty factor • Sort the matched words in the hypothesis to match the order of words in the reference • Sort factor = # flips required for sort / maximum number of flips • Max = k * (k-1) / 2, where k is the number of matched words • Final Score = F1 * (1 - sort factor) • For the aggregate score, keep track of total hyp length, ref length, # matches, and # flips and calculate aggregate P/R/F1 and Final score AMTA 2004
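A sketch of the algorithm on these two slides. The greedy one-to-one word alignment (each hypothesis word matched to the first unused identical reference word) and the tie-handling are simplifying assumptions, so its alignment may differ from the original matcher's on some sentences.

```python
def align(hyp_toks, ref_toks):
    """Return (hyp_index, ref_index) pairs for matched words (greedy, exact matches)."""
    used, pairs = set(), []
    for i, w in enumerate(hyp_toks):
        for j, r in enumerate(ref_toks):
            if j not in used and r == w:
                pairs.append((i, j))
                used.add(j)
                break
    return pairs

def count_flips(ref_positions):
    """Adjacent swaps (inversions) needed to sort matched words into reference order."""
    flips = 0
    for a in range(len(ref_positions)):
        for b in range(a + 1, len(ref_positions)):
            if ref_positions[a] > ref_positions[b]:
                flips += 1
    return flips

def meteor_flip_score(hyp, refs):
    hyp_toks = hyp.split()
    # choose the reference translation with the most word matches
    best_pairs, best_ref = [], []
    for ref in refs:
        ref_toks = ref.split()
        pairs = align(hyp_toks, ref_toks)
        if len(pairs) > len(best_pairs):
            best_pairs, best_ref = pairs, ref_toks
    k = len(best_pairs)
    if k == 0:
        return 0.0
    p, r = k / len(hyp_toks), k / len(best_ref)
    f1 = 2 * p * r / (p + r)
    max_flips = k * (k - 1) / 2
    sort_factor = count_flips([j for _, j in best_pairs]) / max_flips if max_flips else 0.0
    return f1 * (1 - sort_factor)
```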
Example • Sentences: • Hyp: Iraq to army work came to ones weapons charge for 2 weeks • Ref: the weapons in hands of Iraq that will be used by the army are to be handed over within 2 weeks • Alignment: • Hyp matches: Iraq army to weapons 2 weeks • Ref matches: weapons Iraq army to 2 weeks • Flips: weapons must be shifted 3 places • Sort factor: 3/(6*5/2) = 3/15 = 0.2 • Precision = 6/12 = 0.5 • Recall = 6/21 = 0.2857 • F1 = 2*P*R/(P+R) = 0.2857/0.7857 = 0.3636 • Score = (1 - 0.2) * 0.3636 = 0.2909 AMTA 2004
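The example's arithmetic can be re-checked directly with the counts quoted on the slide:

```python
# Values copied from the example slide: 6 matches, 12 hyp words, 21 ref words, 3 flips.
matches, hyp_len, ref_len, flips = 6, 12, 21, 3
p = matches / hyp_len                                 # 0.5
r = matches / ref_len                                 # 0.2857...
f1 = 2 * p * r / (p + r)                              # 0.3636...
sort_factor = flips / (matches * (matches - 1) / 2)   # 3/15 = 0.2
print(round(f1 * (1 - sort_factor), 4))               # 0.2909
```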
Extensions to the Basic Metric • Exponential Penalty • The linear sorting penalty appeared to be very harsh – we want at least minimal credit for getting all the words right • Switch to an exponential penalty factor, currently (1/2)^penalty • Completely correct order: penalty-factor = 1 • Completely opposite order: penalty-factor = 0.5 • Greater distinction between systems with small penalties • We are still working on fine-tuning the penalty to find the base with the best properties AMTA 2004
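A tiny comparison of the linear and exponential penalty factors, assuming the penalty argument is the normalized sort factor in [0, 1]; the function names are illustrative.

```python
def linear_factor(penalty):
    return 1.0 - penalty       # 1 for perfect order, 0 for fully reversed

def exponential_factor(penalty, base=0.5):
    return base ** penalty     # 1 for perfect order, 0.5 for fully reversed

for penalty in (0.0, 0.1, 0.2, 0.5, 1.0):
    print(penalty, linear_factor(penalty), round(exponential_factor(penalty), 3))
```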
Extensions to the Basic Metric • Grouping of Phrases • Consecutive words in the right order are grouped together during sorting • The groups are treated as units that are moved together and can be jumped over with one flip • Sentences that are mostly in the correct order are not penalized severely for one word that is far out of place – no longer penalized for jumping over each word in the group • Overall penalty is much less harsh than before AMTA 2004
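A sketch of the grouping idea, assuming the alignment is given as the reference positions of the matched hypothesis words in hypothesis order; the chunk-counting convention here is an assumption.

```python
def count_chunks(ref_positions):
    """Consecutive matched words that are also consecutive and in order in the
    reference form one chunk; each chunk can then be moved as a single unit."""
    if not ref_positions:
        return 0
    chunks = 1
    for prev, cur in zip(ref_positions, ref_positions[1:]):
        if cur != prev + 1:    # order breaks here: start a new chunk
            chunks += 1
    return chunks

# e.g. matches at reference positions [3, 4, 5, 0, 1] form 2 chunks
print(count_chunks([3, 4, 5, 0, 1]))   # 2
```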
Methods for Evaluating and Comparing Metrics • Several different experimental designs for assessing the correlation between the metric scores and human judgments • System score correlations: plot metric scores vs. human scores for collection of systems tested on same data and obtain correlation coefficient • Pairwise comparison of systems: for each pair of systems plot diff. in metric score vs. diff. in human score, and calculate correlation coefficient • Sentence score correlations: plot metric scores vs. human score at the sentence-level for a particular system, and obtain correlation coefficient • Score human (reference) translations against other refs and compare the distributions of metric scores: high scores with low variance indicate a better metric AMTA 2004
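A sketch of the pairwise-comparison design: compute score differentials for every pair of systems and correlate the metric differentials with the human differentials (e.g. with a Pearson coefficient as sketched earlier); the per-system scores below are hypothetical.

```python
from itertools import combinations

def pairwise_differentials(scores):
    """All pairwise differences between per-system scores."""
    return [scores[a] - scores[b] for a, b in combinations(range(len(scores)), 2)]

# hypothetical per-system scores
metric_scores = [0.28, 0.31, 0.25, 0.34]
human_scores  = [6.1, 6.8, 5.9, 7.2]
metric_diffs = pairwise_differentials(metric_scores)
human_diffs  = pairwise_differentials(human_scores)
# correlate metric_diffs against human_diffs, e.g. with a Pearson coefficient
print(metric_diffs, human_diffs)
```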
Comparative Evaluation of METEOR vs. BLEU • Two data sets: 2002 and 2003 Chinese-to-English TIDES MT system evaluations • Human Evaluators • 11 evaluators • Documents randomly distributed among evaluators • Each document graded by 2 evaluators • Human Scores • Each sentence given a fluency and adequacy score • Both scores range from 1 to 5 (5 = highest) • Quality of the human evaluation is somewhat questionable (relatively low inter-grader agreement) AMTA 2004
METEOR vs. BLEU: 2002 Data, System Scores [Scatter plots of metric vs. human scores for METEOR and BLEU; R² values 0.1827 and 0.2081] AMTA 2004
METEOR vs. BLEU: 2002 Data, Select Systems [Scatter plots of metric vs. human scores: METEOR R²=0.9998, BLEU R²=0.7793] AMTA 2004
METEOR vs. BLEU: 2003 Data, System Scores [Scatter plots of metric vs. human scores: BLEU R²=0.6717, METEOR R²=0.8039] AMTA 2004
METEOR: F1 vs. Full Metric, 2002 Data, System Scores [Scatter plots for F1-only and full METEOR; R² values 0.2051 and 0.1853] AMTA 2004
METEOR vs. BLEU: 2003 Data, Pairwise System Scores [Scatter plots of pairwise score differentials for METEOR and BLEU; R² values 0.6818 and 0.8320] AMTA 2004
METEOR vs. BLEU: Histogram of Scores of Reference Translations, 2003 Data [Histograms: METEOR Mean=0.6504, STD=0.1310; BLEU Mean=0.3727, STD=0.2138] AMTA 2004
Summary of Results • METEOR has significantly better correlation with human judgments: • System Level, 2002 Data, select systems: +28% relative improvement (METEOR has perfect correlation) • System Level, 2003 Data: +20% relative improvement • Sentence Level, 2003 Data: 10 times better correlation • Scores of Reference Translations: dramatically higher and with far less variance AMTA 2004
Work in Progress • Tuning the exponential penalty for the right degree of differentiation and maximum penalty • Using a lattice to create new “synthetic” reference translations so that the best reference can be a combination of references • Weighing matches based on part of speech so that content words can influence the score more than function words • Matching with partial credit for words similar in meaning to a word in the reference translation: morphological variants, synonyms AMTA 2004