Meteor & M-BLEU: Evaluation Metrics for High Correlation with Human Rankings of MT Output
Abhaya Agarwal & Alon Lavie
Language Technologies Institute, Carnegie Mellon University

Objective
• MT evaluation has become integral to the development of SMT systems, which can be tuned directly towards the evaluation metrics.
• To be tuned against, a metric should be fast and easy to compute.
• In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute.

Introduction to METEOR
• Computes a one-to-one word alignment between reference and hypothesis using a matching module.
• Uses unigram precision and unigram recall, along with a fragmentation penalty, as score components.
• In case of multiple references, the best-scoring reference-hypothesis pair is chosen.

Matching Module
• The matcher uses various word-mapping modules to identify all possible word matches:
  • exact: match words with the same surface form
  • porter_stem: match words with the same stem
  • wn_synonymy: match words belonging to the same WordNet synset
• Among the possible matches, compute the alignment having the fewest crossing edges (see the matcher sketch below).

Matcher Example
• The Sri Lanka prime minister criticizes the leader of the country
• President of Sri Lanka criticized by prime minister of the country

Score Computation
• Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed.
• P and R are combined into a parameterized harmonic mean:
  Fmean = P · R / (α · P + (1 − α) · R)
• To account for differences in the order of the matched unigrams, a fragmentation penalty is computed from the ratio of the number of consecutive matched segments (chunks) to the total number of matched unigrams (m).
• The fragmentation penalty and final score are computed as follows (see the scoring sketch below):
  Penalty = γ · (chunks / m)^β
  Score = Fmean · (1 − Penalty)

Parameter Tuning
• The earlier metric was tuned for good correlation with adequacy- and fluency-style human judgements.
• It has been re-tuned to optimize correlation with the human ranking data from last year's WMT shared task.
• The 3 free parameters of the metric (α, β, γ) are tuned to obtain maximum correlation with human judgements.
• Since the ranges of the parameters are bounded, exhaustive search is used (see the tuning sketch below).

Computing Ranking Correlation
• Convert binary judgements into full rankings: build a directed graph with nodes representing individual hypotheses and edges representing binary judgements, then topologically sort the graph.
• For one source sentence with N hypotheses, compute the Spearman rank correlation between the metric ranking and the human ranking: ρ = 1 − 6 · Σ d_i² / (N · (N² − 1)), where d_i is the rank difference for hypothesis i.
• Average across all source sentences (see the ranking-correlation sketch below).

Results (Meteor)
• 3-fold cross-validation results on the WMT 2007 data (average Spearman correlation).
• Results on the WMT 2008 data (% of correct binary judgements).

Flexible Matching for BLEU (M-BLEU)
• The flexible matching in METEOR can be used to extend any metric that needs word-to-word matching.
• Compute the alignment between reference and hypothesis using the Meteor matcher.
• Rewrite the reference by replacing matched words with the corresponding words from the hypothesis.
• Compute BLEU with the new references (see the M-BLEU sketch below).

Results (M-BLEU)
• Average BLEU and M-BLEU scores on the WMT 2008 data.
• No consistent gains across languages are seen in segment-level correlations on the WMT 2007 data.
• Similar mixed patterns are seen in the WMT 2008 data as well (as reported in [Callison-Burch et al. 2008]).
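Matcher sketch. A minimal illustration of the matching-module idea, not the actual Meteor matcher: exact surface matching plus a crude prefix-based stand-in for the porter_stem module (wn_synonymy is omitted), followed by a brute-force search for the one-to-one alignment with the most matches and fewest crossing edges. All function names are illustrative.

```python
def candidate_matches(ref, hyp):
    """Collect (ref_pos, hyp_pos) pairs that some matching stage accepts."""
    pairs = set()
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            if r.lower() == h.lower():              # "exact" stage
                pairs.add((i, j))
            elif r.lower()[:4] == h.lower()[:4]:    # crude stand-in for porter_stem
                pairs.add((i, j))
    return pairs

def crossings(alignment):
    """Count edge pairs whose reference order and hypothesis order disagree."""
    edges = sorted(alignment)   # sorted by ref position (unique in a 1-1 alignment)
    return sum(1
               for a in range(len(edges))
               for b in range(a + 1, len(edges))
               if edges[a][1] > edges[b][1])

def best_alignment(pairs):
    """Brute force over one-to-one alignments: prefer more matches, then fewer
    crossing edges. Exponential, but fine for short example sentences."""
    pairs = sorted(pairs)
    best = {"align": [], "key": (0, 0)}
    def search(start, chosen, used_i, used_j):
        key = (len(chosen), -crossings(chosen))
        if key > best["key"]:
            best["align"], best["key"] = list(chosen), key
        for k in range(start, len(pairs)):
            i, j = pairs[k]
            if i not in used_i and j not in used_j:
                search(k + 1, chosen + [(i, j)], used_i | {i}, used_j | {j})
    search(0, [], set(), set())
    return best["align"]

# The poster's example sentence pair:
ref = "the sri lanka prime minister criticizes the leader of the country".split()
hyp = "president of sri lanka criticized by prime minister of the country".split()
alignment = best_alignment(candidate_matches(ref, hyp))
print([(ref[i], hyp[j]) for i, j in sorted(alignment)])
```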
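Scoring sketch. A small implementation of the score computation described above, under the stated formulas. The alignment format and the default alpha/beta/gamma values are illustrative placeholders, not the tuned parameters from the paper.

```python
def meteor_score(alignment, hyp_len, ref_len, alpha=0.8, beta=2.5, gamma=0.4):
    """alignment: list of (hyp_pos, ref_pos) pairs for the matched unigrams.
    alpha/beta/gamma defaults are illustrative placeholders, not tuned values."""
    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / hyp_len
    recall = m / ref_len
    # Parameterized harmonic mean of unigram precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Chunks: maximal runs of matches that are contiguous and in the same
    # order in both hypothesis and reference.
    pairs = sorted(alignment)
    chunks = 1
    for (h1, r1), (h2, r2) in zip(pairs, pairs[1:]):
        if not (h2 == h1 + 1 and r2 == r1 + 1):
            chunks += 1
    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)

# Identical 5-word sentences: P = R = 1, a single chunk, so the score is close to 1.
print(meteor_score([(k, k) for k in range(5)], hyp_len=5, ref_len=5))
```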
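Tuning sketch. One way to realize the exhaustive search over the bounded parameter ranges is a simple grid search, as below. The grid step, the assumed beta range, and the correlation_with_human callable are illustrative assumptions; the real objective is the average Spearman correlation of the metric against the development ranking data.

```python
from itertools import product

def tune(correlation_with_human, step=0.05, beta_max=4.0):
    """Exhaustive grid search over bounded parameter ranges.
    correlation_with_human(alpha, beta, gamma) should return the average
    per-sentence Spearman correlation on the development ranking data."""
    unit = [round(k * step, 2) for k in range(int(1 / step) + 1)]           # 0 .. 1
    betas = [round(k * step, 2) for k in range(int(beta_max / step) + 1)]   # 0 .. beta_max
    best_params, best_corr = None, float("-inf")
    for alpha, beta, gamma in product(unit, betas, unit):
        corr = correlation_with_human(alpha, beta, gamma)
        if corr > best_corr:
            best_params, best_corr = (alpha, beta, gamma), corr
    return best_params, best_corr

# Toy stand-in objective; the real one would re-score the dev set for each setting.
print(tune(lambda a, b, g: -((a - 0.8) ** 2 + (b - 2.5) ** 2 + (g - 0.4) ** 2)))
```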
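Ranking-correlation sketch. A minimal illustration of converting binary judgements into a full ranking by topological sort and then computing Spearman's rho against a metric ranking for one source sentence. Ties and cyclic judgements are not handled, and the identifiers and judgement data are made up.

```python
from collections import defaultdict, deque

def full_ranking(hypotheses, binary_judgements):
    """Topologically sort hypotheses using (better, worse) pairwise judgements."""
    succ = defaultdict(set)
    indegree = {h: 0 for h in hypotheses}
    for better, worse in binary_judgements:
        if worse not in succ[better]:
            succ[better].add(worse)
            indegree[worse] += 1
    # Kahn's algorithm: repeatedly emit a hypothesis nothing is judged better than.
    queue = deque(h for h in hypotheses if indegree[h] == 0)
    order = []
    while queue:
        h = queue.popleft()
        order.append(h)
        for worse in succ[h]:
            indegree[worse] -= 1
            if indegree[worse] == 0:
                queue.append(worse)
    if len(order) != len(hypotheses):
        raise ValueError("inconsistent (cyclic) judgements")
    return order  # best ... worst

def spearman(ranking_a, ranking_b):
    """Spearman's rho between two full rankings given as ordered lists."""
    n = len(ranking_a)
    pos_b = {h: i for i, h in enumerate(ranking_b)}
    d2 = sum((i - pos_b[h]) ** 2 for i, h in enumerate(ranking_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# One source sentence with 4 hypotheses and some binary human judgements:
hyps = ["h1", "h2", "h3", "h4"]
human = full_ranking(hyps, [("h1", "h2"), ("h2", "h3"), ("h1", "h3"), ("h3", "h4")])
metric = ["h1", "h3", "h2", "h4"]       # ranking induced by the metric's scores
print(human, spearman(human, metric))   # rho for this sentence; average over sentences
```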
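M-BLEU sketch. A small illustration of the reference-rewriting step behind M-BLEU. The sentence pair and the single alignment pair reuse the matcher example above; the rewritten reference would then be scored with any standard BLEU implementation.

```python
def rewrite_reference(ref_tokens, hyp_tokens, alignment):
    """Replace each matched reference word with the hypothesis word it aligns to,
    so that BLEU's exact n-gram matching can credit stem/synonym matches."""
    new_ref = list(ref_tokens)
    for ref_pos, hyp_pos in alignment:
        new_ref[ref_pos] = hyp_tokens[hyp_pos]
    return new_ref

ref = "the sri lanka prime minister criticizes the leader of the country".split()
hyp = "president of sri lanka criticized by prime minister of the country".split()
# One flexible match from the matcher example: criticizes ~ criticized (stem match).
new_ref = rewrite_reference(ref, hyp, [(5, 4)])
print(new_ref)
# The rewritten reference is then scored with an ordinary BLEU implementation,
# e.g. nltk.translate.bleu_score.sentence_bleu([new_ref], hyp).
```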