
BLEU, Its Variants & Its Critics

Arthur Chan. Prepared for Advanced MT Seminar.


Presentation Transcript


  1. BLEU, Its Variants & Its Critics Arthur Chan Prepared for Advanced MT Seminar

  2. This Talk • Original BLEU scores (Papineni 2002) • Motivation • Procedure • NIST: as a major BLEU variant • Critics of BLEU • From alternate evaluation metrics • METEOR: (Lavie 2004, Banerjee 2005) • From analysis of BLEU (Culy 2002) • METEOR will be covered by Alon (next talk)

  3. Bilingual Evaluation Understudy (BLEU)

  4. Motivation of Automatic Evaluation in MT • Human evaluations of MT weigh many aspects, such as • Adequacy • Fidelity • Fluency • Human evaluations are expensive • Human evaluations can take a long time • while systems need to change daily • Good automatic evaluation could save human effort

  5. BLEU – Why is it Important? • Some reasons: • It was proposed by IBM • IBM has a long history of proposing evaluation standards • Verified and improved by NIST • So a variant of it is used in evaluations • Widely used • Appears everywhere in the MT literature after 2001 • It is quite useful • It gives good feedback on the adequacy and fluency of translation results • It is not perfect • It is a subject of criticism (the criticisms make some sense in this case) • It is a subject of extension

  6. BLEU – Its Motivation • Central Idea: • “The closer a machine translation is to a professional human translation, the better it is.” • Implication • An evaluation metric can itself be evaluated • If it correlates with human evaluation, it is a useful metric • BLEU was proposed • as an aid • as a quick substitute for humans when needed

  7. BLEU – What is it? A Big Picture • Requires multiple good reference translations • Depends on modified n-gram precision (or co-occurrence) • Co-occurrence: an n-gram of the translated sentence matches an n-gram in any reference sentence • Per-corpus n-gram co-occurrence is computed • n can take several values, and a weighted sum of the log precisions is computed • Brevity of translation is penalized
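
For reference, the pieces listed on this slide combine as follows in Papineni et al. (2002). The symbols are taken from that paper rather than from the slide: p_n is the corpus-level modified n-gram precision, g ranges over the n-grams of a candidate sentence C, w_n are the weights, c is the total length of the candidate corpus, and r is the effective reference corpus length. The paper's baseline uses N = 4 with uniform weights w_n = 1/N.

```latex
p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{g \in C} \mathrm{Count}_{\mathrm{clip}}(g)}
           {\sum_{C' \in \text{Candidates}} \sum_{g' \in C'} \mathrm{Count}(g')}

\mathrm{BP} =
\begin{cases}
  1             & \text{if } c > r \\
  e^{\,1 - r/c} & \text{if } c \le r
\end{cases}

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```

Because the precisions are combined inside the exponential, the "weighted sum" is a sum of log precisions, which makes BLEU a weighted geometric mean of the p_n scaled by the brevity penalty.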

  8. BLEU – N-gram Precision: a Motivating Example • Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party. • Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct. • Reference 1: It is a guide to action that ensures that the military will forever heed Party commands. • Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party. • Reference 3: It is the practical guide for the army always to heed the directions of the party.
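
The point of this example can be checked with a few lines of code. The sketch below is only an illustration: it uses the sentences as worded in Papineni et al. (2002), a simple lowercase, letters-only tokenizer (an assumption, not the tokenization used in any official scorer), and hypothetical helper names. It counts how many of a candidate's n-grams occur in at least one reference.

```python
import re

CANDIDATE_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
CANDIDATE_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")
REFERENCES = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions "
    "of the party.",
]


def tokens(sentence):
    """Lowercase the sentence and keep alphabetic words only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", sentence.lower())


def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_precision(candidate, references, n):
    """Return (hits, total): candidate n-grams found in any reference."""
    cand = ngrams(tokens(candidate), n)
    seen = set()
    for ref in references:
        seen.update(ngrams(tokens(ref), n))
    hits = sum(1 for gram in cand if gram in seen)
    return hits, len(cand)


for name, cand in [("Candidate 1", CANDIDATE_1), ("Candidate 2", CANDIDATE_2)]:
    for n in (1, 2):
        hits, total = ngram_precision(cand, REFERENCES, n)
        print(f"{name}, {n}-grams: {hits}/{total}")
```

Candidate 1 shares far more unigrams and bigrams with the references than Candidate 2, which is the intuition that n-gram precision is meant to capture.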

  9. BLEU – Modified N-gram Precision • Issues with n-gram precision • It gives a very good score to over-generated n-grams
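
The over-generation problem, and the clipping that fixes it, can be sketched as below. The example sentences are the well-known "the the the" case from Papineni et al. (2002), not from this slide, and the function names are hypothetical. Clipping caps each candidate word's count at the largest count of that word in any single reference.

```python
from collections import Counter


def plain_unigram_precision(candidate, references):
    """Unclipped precision: every candidate word found in any reference counts."""
    vocab = set()
    for ref in references:
        vocab.update(ref)
    hits = sum(1 for word in candidate if word in vocab)
    return hits, len(candidate)


def modified_unigram_precision(candidate, references):
    """Clipped precision: a word is credited at most as often as it
    appears in the single reference that contains it most."""
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    cand_counts = Counter(candidate)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped, len(candidate)


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

print("plain:   ", plain_unigram_precision(candidate, references))
print("modified:", modified_unigram_precision(candidate, references))
```

Plain precision credits all seven occurrences of "the", while the clipped version credits only two, since no reference contains "the" more than twice.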

  10. BLEU – Brevity Penalty

  11. BLEU – The “Trouble” with Recall

  12. BLEU – Recall and Brevity Penalty

  13. BLEU – Paradigm of Evaluation

  14. BLEU – Evaluation of the Metric

  15. BLEU – The Human Evaluation

  16. BLEU – BLEU vs Human Evaluation

  17. NIST – As a BLEU Variant

  18. Usage of BLEU on Character-based Languages

  19. Critics of BLEU – From Analysis of BLEU

  20. Critics of BLEU – A Glance at Metrics Beyond BLEU

  21. Critics of BLEU – Summary of BLEU’s Issues

  22. Discussion - Should BLEU be the Standard Metric of MT?

  23. References • Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL-02. 2002 • George Doddington, Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. • Etienne Denoual, Yves Lepage, BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters. • Alon Lavie, Kenji Sagae, Shyamsundar Jayaraman, The Significance of Recall in Automatic Metrics for MT Evaluation. • Christopher Culy, Susanne Z. Riehemann, The Limits of N-Gram Translation Evaluation Metrics. • Satanjeev Banerjee, Alon Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.
