Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation or Orange: a Method for Automatically Evaluating Automatic Evaluation Metrics for Machine Translation Chin-Yew Lin & Franz Josef Och (presented by Bilmes)
Summary • 1. Introduces ORANGE, a way to automatically evaluate automatic evaluation methods • 2. Introduces 3 new ways to automatically evaluate MT systems: ROUGE-L, ROUGE-W, and ROUGE-S • 3. Uses ORANGE to evaluate many different evaluation methods, and finds that their new one, ROUGE-S4, is the best evaluator
Reminder: Adequacy & Fluency • Adequacy refers to the degree to which the translation communicates the information present in the original. Roughly, a translation using the same words (1-grams) as the reference tends to satisfy adequacy. • Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language. Roughly, longer n-gram matches between a translation and a reference tend to improve fluency.
Reminder: BLEU • unigram precision = (number of candidate unigrams that appear in a reference translation) / (candidate translation length) • modified unigram precision = clipped(number of candidate unigrams matched in the references) / (candidate translation length) • clipping caps each unigram's count at its maximum count in any single reference translation • modified n-gram precision: the same idea applied to n-grams • Computed over blocks of text • Brevity penalty (since short candidates can otherwise achieve high n-gram precision) • Finally: BLEU = BP · exp(Σ_n w_n log p_n), with BP = 1 if c > r and BP = e^(1 − r/c) otherwise (c = candidate length, r = reference length).
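A minimal sentence-level sketch of the modified n-gram precision and brevity penalty just described (illustrative only; the helper names and the simple whitespace tokenization are my own, not BLEU's reference implementation):

from collections import Counter
import math

def modified_precision(candidate, references, n):
    # n-gram counts in the candidate translation
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    # clip each count at the maximum count seen in any single reference
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, references, max_n=4):
    # geometric mean of modified n-gram precisions, times the brevity penalty
    log_p = 0.0
    for n in range(1, max_n + 1):
        p = modified_precision(candidate, references, n)
        if p == 0:
            return 0.0  # unsmoothed BLEU is zero if any n-gram precision is zero
        log_p += math.log(p) / max_n
    c = len(candidate)
    r = len(min(references, key=lambda ref: abs(len(ref) - c)))  # closest reference length
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(log_p)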
Still Other Reminders • Pearson's product-moment correlation coefficient: just the normal correlation coefficient; for mean-centered X and Y, r² = (E[XY])² / (E[X²]·E[Y²]) • Spearman's rank-order correlation coefficient: the same thing computed on ranks, or equivalently ρ = 1 − 6·Σ Di² / (n(n² − 1)), where Di = rankA(i) − rankB(i) • Bootstrap method to compute confidence intervals: resample with replacement from the data N times, compute the mean, and report val ± 2·se(val) as a 95% confidence interval.
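A small sketch of these quantities (pearsonr/spearmanr from scipy compute the two correlation coefficients; the bootstrap below is a plain normal-approximation version of the resampling idea, and the scores are made-up placeholders):

import numpy as np
from scipy.stats import pearsonr, spearmanr

def bootstrap_ci(values, n_resamples=1000, seed=0):
    # resample with replacement, then report mean +/- 2 standard errors (~95% CI)
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_resamples)]
    se = np.std(means)
    return values.mean() - 2 * se, values.mean() + 2 * se

human = [0.61, 0.55, 0.52, 0.48, 0.40, 0.35, 0.30, 0.28]   # hypothetical adequacy scores
metric = [0.33, 0.31, 0.30, 0.29, 0.22, 0.20, 0.18, 0.17]  # hypothetical metric scores

r, _ = pearsonr(human, metric)     # Pearson's product-moment correlation
rho, _ = spearmanr(human, metric)  # Spearman's rank-order correlation
print(r, rho, bootstrap_ci(metric))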
On to the paper: lots of ways to evaluate MT quality • BLEU • RED • WER – length-normalized edit distance • PER – position-independent word error rate (a bag-of-words approach) • GTM – general text matcher, based on balancing recall, precision, and their F-measure combination (we should do this paper) • This paper introduces three more such metrics: ROUGE-L, ROUGE-W, and ROUGE-S (defined below).
Correlation coefficients & 95% confidence intervals for 8 MT systems on the NIST 2003 Chinese-English evaluation, using various MT evaluation methods
Problem: we need a way to automatically evaluate these automatic evaluation methods. • Since we don't know which one is best, which one to use, or how and when to choose among them. • Try to break out of the region of insignificant difference. • Question (the meta-regress): do we need a way to automatically evaluate automatic evaluations of automatic evaluation methods? • Anyway, the goal of this paper (besides introducing new automatic evaluation methods) is to introduce ORANGE: Oracle Ranking for Gisting Evaluation (the first automatic evaluation of automatic MT evaluation methods).
ORANGE • Intuitively: uses each translation's rank as scored by the MT evaluation metric (good translations should rank near the top, poor ones near the bottom); reference translations should rank high. • Key quantity: average rank of the reference translations within the combined list of machine and reference translations. • ORANGE = average rank / N for an N-best list; smaller is better. • Example ranked list (the reference translations were ranked 2 and 3, so their average rank is 2.5): 1. The bank was visited by me yesterday. 2. I went to the bank yesterday. 3. Yesterday, I went to the bank. 4. Yesterday, the bank had the opportunity to be visited by me, and in fact this did indeed occur. 5. There was once this bank that at least as of yesterday existed, and so did I, and a funny thing happened …
ORANGE • The way they calculate ORANGE in this work: ORANGE = (1/S) Σ_{i=1..S} Rank(Oracle_i) / N, where • Oracle_i = the reference translations for source sentence i • N = size of each N-best list • S = number of source sentences in the corpus • Rank(Oracle_i) = average rank of source sentence i's reference translations within N-best list i (after the references are merged into the list and the combined list is sorted by the metric under evaluation).
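A small sketch of this calculation as the slide describes it (the averaging of per-sentence oracle ranks and the division by N follow the slide; the data layout and example numbers are my own):

def orange_score(oracle_ranks, n_best_size):
    # oracle_ranks[i]: average rank of sentence i's reference translations after
    # they are merged into sentence i's N-best list and the combined list is
    # sorted by the metric under evaluation (rank 1 = best)
    s = len(oracle_ranks)
    avg_rank = sum(oracle_ranks) / s
    return avg_rank / n_best_size  # smaller is better: references ranked near the top

# hypothetical example: 3 source sentences, 1024-best lists
print(orange_score([2.5, 10.0, 1.0], 1024))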
Three new metrics • ROUGE-L • ROUGE-W • ROUGE-S
Computing: Longest Common Subsequences • Key thing: this does not require consecutive matches between the strings. • Example: X = "police killed the gunman", Y = "police kill the gunman"; LCS(X,Y) = 3.
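A standard dynamic-programming sketch of the LCS length over word sequences (illustrative, not the paper's implementation):

def lcs_length(x, y):
    # c[i][j] = length of the LCS of x[:i] and y[:j]
    c = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            c[i][j] = c[i-1][j-1] + 1 if xi == yj else max(c[i-1][j], c[i][j-1])
    return c[len(x)][len(y)]

# the example above: matches need not be consecutive
print(lcs_length("police killed the gunman".split(),
                 "police kill the gunman".split()))  # -> 3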
ROUGE-L • Basically, an "F-measure" (combination) of two normalized LCS values: R_lcs = LCS(X,Y)/m and P_lcs = LCS(X,Y)/n, where m and n are the reference and candidate lengths, combined as F_lcs = (1 + β²)·R_lcs·P_lcs / (R_lcs + β²·P_lcs) • Again, no consecutive matches necessary • automatically includes the longest in-sequence common n-gram.
ROUGE-L example (figure on slide): one reference sentence and two candidate translations; the first candidate scores ROUGE-L = 3/5 and the second ROUGE-L = 1/2.
ROUGE-L • Basically, an "F-measure" (combination) of two normalized LCS values, as above • Again, no consecutive matches necessary • automatically includes the longest in-sequence common n-gram • Problem: it counts only the main in-sequence words; other LCSs and shorter common subsequences are not counted.
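A sketch of the ROUGE-L F-measure built on the lcs_length function from the earlier sketch (the F-measure form follows the definition above; the β default is my own choice):

def rouge_l(candidate, reference, beta=1.0):
    lcs = lcs_length(reference, candidate)  # lcs_length as sketched earlier
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)      # LCS normalized by reference length
    precision = lcs / len(candidate)   # LCS normalized by candidate length
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)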
Computing: ROUGE-W, a weighted-LCS score designed so that consecutive matches are rewarded more than non-consecutive matches.
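One way to realize that weighting is the weighted-LCS dynamic program sketched below; the quadratic weight f(k) = k² is a common illustrative choice, not necessarily the setting used in the paper:

def wlcs(x, y, f=lambda k: k * k):
    # c[i][j]: best weighted score so far; w[i][j]: length of the consecutive
    # match ending exactly at positions (i, j)
    c = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    w = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                k = w[i-1][j-1]
                c[i][j] = c[i-1][j-1] + f(k + 1) - f(k)  # extending a run earns more than a lone match
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])
                w[i][j] = 0
    return c[len(x)][len(y)]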
ROUGE-S • Another "F-measure", but here using skip-bigram co-occurrence statistics (i.e., non-consecutive word pairs that preserve sentence order). The goal is to measure the overlap of skip-bigrams. • We use the function SKIP2(X,Y) to count the number of skip-bigrams common to X and Y.
ROUGE-S • Using the SKIP2() function: • No consecutive matches required, but word order is still respected • counts *all* in-order matching word pairs (LCS counts only the longest common subsequence) • Can impose a limit on the maximum skip distance • ROUGE-Sn has a maximum skip distance of n (e.g., ROUGE-S4)
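A sketch of the skip-bigram overlap and the corresponding F-measure (the SKIP2 counting follows the slide's description; the surrounding plumbing, the exact gap convention for the skip limit, and the β default are my own assumptions):

from collections import Counter
from itertools import combinations

def skip_bigrams(words, max_skip=None):
    # all in-order word pairs, optionally limited to a maximum gap between the two words
    pairs = Counter()
    for i, j in combinations(range(len(words)), 2):
        if max_skip is None or j - i <= max_skip:
            pairs[(words[i], words[j])] += 1
    return pairs

def rouge_s(candidate, reference, max_skip=4, beta=1.0):
    cand, ref = skip_bigrams(candidate, max_skip), skip_bigrams(reference, max_skip)
    skip2 = sum(min(c, ref[g]) for g, c in cand.items())  # SKIP2(X, Y)
    if skip2 == 0:
        return 0.0
    recall = skip2 / sum(ref.values())
    precision = skip2 / sum(cand.values())
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)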
Setup • ISI's AlTemp SMT system • 2002 NIST Chinese-English evaluation corpus • 872 source sentences, with 4 reference translations each • 1024-best lists used
Evaluating BLEU with ORANGE • smoothed BLEU: add one to both the matched n-gram count and the total n-gram count for every n-gram order greater than one, so that sentence-level scores remain non-zero even when some higher-order n-gram has no match.
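A sentence-level sketch of that smoothing (single reference, simplified clipping; the names and the guard for the no-overlap case are my own):

import math
from collections import Counter

def ngram_counts(words, n):
    return Counter(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

def smoothed_bleu(candidate, reference, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        match = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if n > 1:  # add-one smoothing for the higher-order n-grams only
            match, total = match + 1, total + 1
        if match == 0 or total == 0:
            return 0.0  # no unigram overlap at all
        log_p += math.log(match / total) / max_n
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_p)

print(smoothed_bleu("police kill the gunman".split(),
                    "police killed the gunman".split()))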