1 / 1

Meteor & M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output

Objective. MT Evaluation has become integral to the development of SMT systems which can be tuned towards the evaluation metrics directly. To be tuned against, a metric should be fast and easy to compute

ulla-mckay
Download Presentation

Meteor & M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Objective • MT Evaluation has become integral to the development of SMT systems which can be tuned towards the evaluation metrics directly. • To be tuned against, a metric should be fast and easy to compute • In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute. Computing Ranking Correlation • Convert binary judgements into full rankings • Build a directed graph with nodes representing individual hypothesis and edges representing binary judgements • Topologically sort the graph • For one source sentence with N hypothesis Introduction to METEOR • Computes a one to one word alignment between reference and hypothesis using a matching module • Uses unigram precision and unigram recall along with a fragmentation penalty as score components • In case of multiple references, the best scoring reference-hypothesis pair is chosen • Average across all source sentences Matching Module Results • Matcher uses various word mapping modules to identify all possible word matches • Exact : Match words with same surface forms • porter_stem: Match words with same root • wn_synonymy: Match words based on WordNet synsets • Compute the alignment having fewest number of crossing edges • 3-fold Cross validation results on the WMT 2007 data (Average Spearman Correlation)‏ • Results on the WMT 2008 data( % of correct binary judgements)‏ Matcher Example The Sri Lanka prime minister criticizes the leader of the country Flexible Matching for BLEU President of Sri Lankacriticized by prime minister of the country • The flexible matching in METEOR can be used to extend any metric that needs word to word matching • Compute the alignment between reference and hypothesis using Meteor matcher • Re-write the reference by replacing matched words with the words from hypothesis. • Compute BLEU with the new references Score Computation • Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed • P and R are combined into a parametrized harmonic mean • To account for the differences in the order of the unigrams matched, a fragmentation penalty is computed as the ratio of the consecutive segments matched to the total number of unigrams matched • Fragmentation penalty and final score are computed as follows. Meteor & M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output Abhaya Agarwal & Alon Lavie Language Technologies Institute Carnegie Mellon University Parameter Tuning • Earlier metric was tuned to get good correlations with adequacy and fluency style human judgments • Re-tuned to optimize correlation with human ranking data from last year's WMT shared task Results • Average BLEU and M-BLEU scores on WMT 2008 data‏ Parameter Tuning • No consistent gains across languages seen in correlations at the segment level on WMT 2007 data • Similar mixed patterns seen in WMT 2008 data as well. (as reported in [Callison-Burch et al 2008])‏ • The 3 free parameters in the metric are tuned to obtain maximum correlation with human judgements. • Since the ranges of the parameters are bounded, exhaustive search is use

More Related