Meteor & M-BLEU: Evaluation Metrics for High Correlation with Human Rankings of MT Output
Abhaya Agarwal & Alon Lavie
Language Technologies Institute, Carnegie Mellon University

Objective
• MT evaluation has become integral to the development of SMT systems, which can be tuned directly towards the evaluation metrics.
• To be tuned against, a metric should be fast and easy to compute.
• In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute.

Introduction to METEOR
• Computes a one-to-one word alignment between reference and hypothesis using a matching module.
• Uses unigram precision and unigram recall, along with a fragmentation penalty, as score components.
• In case of multiple references, the best-scoring reference-hypothesis pair is chosen.

Matching Module
• The matcher uses various word-mapping modules to identify all possible word matches:
  • exact: match words with the same surface form
  • porter_stem: match words with the same stem
  • wn_synonymy: match words belonging to the same WordNet synset
• Among the possible matches, compute the alignment having the fewest crossing edges (see the matcher sketch below).

Matcher Example
• The Sri Lanka prime minister criticizes the leader of the country
• President of Sri Lanka criticized by prime minister of the country

Score Computation
• Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed.
• P and R are combined into a parameterized harmonic mean:
  Fmean = P · R / (α · P + (1 − α) · R)
• To account for differences in the order of the matched unigrams, a fragmentation penalty is computed from the ratio of the number of consecutive matched segments (chunks) to the total number of matched unigrams (m).
• The fragmentation penalty and final score are computed as follows (see the scoring sketch below):
  Penalty = γ · (chunks / m)^β
  Score = Fmean · (1 − Penalty)

Parameter Tuning
• The earlier metric was tuned for good correlation with adequacy- and fluency-style human judgements.
• It has been re-tuned to optimize correlation with the human ranking data from last year's WMT shared task.
• The 3 free parameters of the metric (α, β, γ) are tuned to obtain maximum correlation with human judgements.
• Since the ranges of the parameters are bounded, exhaustive search is used (see the tuning sketch below).

Computing Ranking Correlation
• Convert binary judgements into full rankings: build a directed graph with nodes representing individual hypotheses and edges representing binary judgements, then topologically sort the graph.
• For one source sentence with N hypotheses, compute the Spearman rank correlation between the metric ranking and the human ranking: ρ = 1 − 6 · Σ d_i² / (N · (N² − 1)), where d_i is the rank difference for hypothesis i.
• Average across all source sentences (see the ranking-correlation sketch below).

Results (Meteor)
• 3-fold cross-validation results on the WMT 2007 data (average Spearman correlation).
• Results on the WMT 2008 data (% of correct binary judgements).

Flexible Matching for BLEU (M-BLEU)
• The flexible matching in METEOR can be used to extend any metric that needs word-to-word matching.
• Compute the alignment between reference and hypothesis using the Meteor matcher.
• Rewrite the reference by replacing matched words with the corresponding words from the hypothesis.
• Compute BLEU with the new references (see the M-BLEU sketch below).

Results (M-BLEU)
• Average BLEU and M-BLEU scores on the WMT 2008 data.
• No consistent gains across languages are seen in segment-level correlations on the WMT 2007 data.
• Similar mixed patterns are seen in the WMT 2008 data as well (as reported in [Callison-Burch et al. 2008]).
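Matcher sketch. A minimal illustration of the matching-module idea, not the actual Meteor matcher: exact surface matching plus a crude prefix-based stand-in for the porter_stem module (wn_synonymy is omitted), followed by a brute-force search for the one-to-one alignment with the most matches and fewest crossing edges. All function names are illustrative.

```python
def candidate_matches(ref, hyp):
    """Collect (ref_pos, hyp_pos) pairs that some matching stage accepts."""
    pairs = set()
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            if r.lower() == h.lower():              # "exact" stage
                pairs.add((i, j))
            elif r.lower()[:4] == h.lower()[:4]:    # crude stand-in for porter_stem
                pairs.add((i, j))
    return pairs

def crossings(alignment):
    """Count edge pairs whose reference order and hypothesis order disagree."""
    edges = sorted(alignment)   # sorted by ref position (unique in a 1-1 alignment)
    return sum(1
               for a in range(len(edges))
               for b in range(a + 1, len(edges))
               if edges[a][1] > edges[b][1])

def best_alignment(pairs):
    """Brute force over one-to-one alignments: prefer more matches, then fewer
    crossing edges. Exponential, but fine for short example sentences."""
    pairs = sorted(pairs)
    best = {"align": [], "key": (0, 0)}
    def search(start, chosen, used_i, used_j):
        key = (len(chosen), -crossings(chosen))
        if key > best["key"]:
            best["align"], best["key"] = list(chosen), key
        for k in range(start, len(pairs)):
            i, j = pairs[k]
            if i not in used_i and j not in used_j:
                search(k + 1, chosen + [(i, j)], used_i | {i}, used_j | {j})
    search(0, [], set(), set())
    return best["align"]

# The poster's example sentence pair:
ref = "the sri lanka prime minister criticizes the leader of the country".split()
hyp = "president of sri lanka criticized by prime minister of the country".split()
alignment = best_alignment(candidate_matches(ref, hyp))
print([(ref[i], hyp[j]) for i, j in sorted(alignment)])
```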
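Scoring sketch. A small implementation of the score computation described above, under the stated formulas. The alignment format and the default alpha/beta/gamma values are illustrative placeholders, not the tuned parameters from the paper.

```python
def meteor_score(alignment, hyp_len, ref_len, alpha=0.8, beta=2.5, gamma=0.4):
    """alignment: list of (hyp_pos, ref_pos) pairs for the matched unigrams.
    alpha/beta/gamma defaults are illustrative placeholders, not tuned values."""
    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / hyp_len
    recall = m / ref_len
    # Parameterized harmonic mean of unigram precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Chunks: maximal runs of matches that are contiguous and in the same
    # order in both hypothesis and reference.
    pairs = sorted(alignment)
    chunks = 1
    for (h1, r1), (h2, r2) in zip(pairs, pairs[1:]):
        if not (h2 == h1 + 1 and r2 == r1 + 1):
            chunks += 1
    penalty = gamma * (chunks / m) ** beta
    return f_mean * (1 - penalty)

# Identical 5-word sentences: P = R = 1, a single chunk, so the score is close to 1.
print(meteor_score([(k, k) for k in range(5)], hyp_len=5, ref_len=5))
```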
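Tuning sketch. One way to realize the exhaustive search over the bounded parameter ranges is a simple grid search, as below. The grid step, the assumed beta range, and the correlation_with_human callable are illustrative assumptions; the real objective is the average Spearman correlation of the metric against the development ranking data.

```python
from itertools import product

def tune(correlation_with_human, step=0.05, beta_max=4.0):
    """Exhaustive grid search over bounded parameter ranges.
    correlation_with_human(alpha, beta, gamma) should return the average
    per-sentence Spearman correlation on the development ranking data."""
    unit = [round(k * step, 2) for k in range(int(1 / step) + 1)]           # 0 .. 1
    betas = [round(k * step, 2) for k in range(int(beta_max / step) + 1)]   # 0 .. beta_max
    best_params, best_corr = None, float("-inf")
    for alpha, beta, gamma in product(unit, betas, unit):
        corr = correlation_with_human(alpha, beta, gamma)
        if corr > best_corr:
            best_params, best_corr = (alpha, beta, gamma), corr
    return best_params, best_corr

# Toy stand-in objective; the real one would re-score the dev set for each setting.
print(tune(lambda a, b, g: -((a - 0.8) ** 2 + (b - 2.5) ** 2 + (g - 0.4) ** 2)))
```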
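Ranking-correlation sketch. A minimal illustration of converting binary judgements into a full ranking by topological sort and then computing Spearman's rho against a metric ranking for one source sentence. Ties and cyclic judgements are not handled, and the identifiers and judgement data are made up.

```python
from collections import defaultdict, deque

def full_ranking(hypotheses, binary_judgements):
    """Topologically sort hypotheses using (better, worse) pairwise judgements."""
    succ = defaultdict(set)
    indegree = {h: 0 for h in hypotheses}
    for better, worse in binary_judgements:
        if worse not in succ[better]:
            succ[better].add(worse)
            indegree[worse] += 1
    # Kahn's algorithm: repeatedly emit a hypothesis nothing is judged better than.
    queue = deque(h for h in hypotheses if indegree[h] == 0)
    order = []
    while queue:
        h = queue.popleft()
        order.append(h)
        for worse in succ[h]:
            indegree[worse] -= 1
            if indegree[worse] == 0:
                queue.append(worse)
    if len(order) != len(hypotheses):
        raise ValueError("inconsistent (cyclic) judgements")
    return order  # best ... worst

def spearman(ranking_a, ranking_b):
    """Spearman's rho between two full rankings given as ordered lists."""
    n = len(ranking_a)
    pos_b = {h: i for i, h in enumerate(ranking_b)}
    d2 = sum((i - pos_b[h]) ** 2 for i, h in enumerate(ranking_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# One source sentence with 4 hypotheses and some binary human judgements:
hyps = ["h1", "h2", "h3", "h4"]
human = full_ranking(hyps, [("h1", "h2"), ("h2", "h3"), ("h1", "h3"), ("h3", "h4")])
metric = ["h1", "h3", "h2", "h4"]       # ranking induced by the metric's scores
print(human, spearman(human, metric))   # rho for this sentence; average over sentences
```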
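M-BLEU sketch. A small illustration of the reference-rewriting step behind M-BLEU. The sentence pair and the single alignment pair reuse the matcher example above; the rewritten reference would then be scored with any standard BLEU implementation.

```python
def rewrite_reference(ref_tokens, hyp_tokens, alignment):
    """Replace each matched reference word with the hypothesis word it aligns to,
    so that BLEU's exact n-gram matching can credit stem/synonym matches."""
    new_ref = list(ref_tokens)
    for ref_pos, hyp_pos in alignment:
        new_ref[ref_pos] = hyp_tokens[hyp_pos]
    return new_ref

ref = "the sri lanka prime minister criticizes the leader of the country".split()
hyp = "president of sri lanka criticized by prime minister of the country".split()
# One flexible match from the matcher example: criticizes ~ criticized (stem match).
new_ref = rewrite_reference(ref, hyp, [(5, 4)])
print(new_ref)
# The rewritten reference is then scored with an ordinary BLEU implementation,
# e.g. nltk.translate.bleu_score.sentence_bleu([new_ref], hyp).
```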