Predicting HTER using Automatic Scoring Metrics
Matthew Snover¹, Richard Schwartz², Bonnie J. Dorr¹
¹University of Maryland, College Park ²BBN Technologies, Inc.
Motivation and Goal • HTER measures the number of edits needed to correct a system output so that it is both fluent and adequate • It requires a human to create a targeted reference, which is expensive and slow • HTER is therefore impractical for tuning and regular evaluation • The choice of automatic evaluation metric (BLEU, TER, METEOR) for tuning and development is unclear • Goal: find a new automatic measure that appropriately weights scores from existing automatic metrics to better predict HTER
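For intuition, here is a minimal sketch of an edit-rate computation in the spirit of TER/HTER: word-level insertions, deletions, and substitutions divided by reference length. Real TER also counts block shifts, and HTER scores against a human-created targeted reference; both are omitted here.

```python
def edit_rate(hyp, ref):
    """Word-level edit rate: insertions, deletions, and substitutions
    divided by reference length (block shifts omitted for brevity)."""
    h, r = hyp.split(), ref.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(h)][len(r)] / len(r)

print(edit_rate("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words ≈ 0.167
```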
Automatic Metrics vs. HTER • BLEU, METEOR, & TER correlate well with HTER on the document level • A combination of metrics may be ideal [Scatter plots: BLEU vs. HTER, TER vs. HTER, METEOR vs. HTER]
Automatic Metrics vs. HTER • Pearson correlation between each automatic metric and HTER on GALE 2006 data • The best metric varies across language and data type
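A sketch of the document-level correlation computation; the per-document scores below are hypothetical, illustrative values, not GALE data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical per-document scores (illustrative only).
bleu = [0.31, 0.28, 0.40, 0.22, 0.35]
hter = [0.42, 0.47, 0.33, 0.55, 0.38]
print(pearson_r(bleu, hter))  # negative: higher BLEU, fewer edits needed
```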
Artificial Neural Network (ANN) • 3 hidden nodes, with tan-sigmoid transfer functions • Feed-forward network trained with back-propagation [Diagram: input features f1…fn (BLEU, METEOR, TER, other features) feed hidden nodes h1, h2, h3, which combine at a single output node o to produce the predicted HTER]
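A minimal sketch of a comparable network using scikit-learn's MLPRegressor: one hidden layer of 3 tanh units (tanh is the standard tan-sigmoid transfer function) trained by backpropagation. The feature matrix and targets below are random placeholders standing in for the real segment-level metric features and HTER scores.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data: one row of metric features per segment (BLEU,
# METEOR, TER, ...) and corresponding segment-level HTER targets.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * rng.random(200)

net = MLPRegressor(hidden_layer_sizes=(3,),  # 3 hidden nodes
                   activation='tanh',        # tan-sigmoid transfer
                   max_iter=5000, random_state=0)
net.fit(X, y)                                # trained by back-propagation
predicted_hter = net.predict(X[:5])
```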
Features Used • TER • TER score, insertion rate, deletion rate, substitution rate, shift rate, number of words shifted • BLEU • BLEU(1), BLEU(2), BLEU(3), BLEU(4), 1-gram precision, 2-gram precision, 3-gram precision, 4-gram precision • METEOR • METEOR score, match rate, chunk rate, precision, recall, f-mean, 1-factor, fragmentation, length penalty • Output Features • # hypothesis words, # reference words • Source Features (only for text data) • OOV rate, 1-gram hit rate, 2-gram hit rate, 3-gram hit rate, log perplexity
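A sketch of how the per-segment feature vector might be assembled from the list above. The three component functions are hypothetical stand-ins (returning placeholder zeros) for whatever toolkit computes the underlying metrics; only the vector layout follows the slide.

```python
# Hypothetical stand-ins for real TER/BLEU/METEOR implementations;
# each returns the component scores listed above (placeholder zeros).
def ter_components(hyp, ref):    return [0.0] * 6
def bleu_components(hyp, ref):   return [0.0] * 8
def meteor_components(hyp, ref): return [0.0] * 9

def segment_features(hyp, ref, src_feats=None):
    """Assemble the ANN input vector for one segment."""
    feats = []
    feats += ter_components(hyp, ref)     # TER score, ins/del/sub/shift rates, # words shifted
    feats += bleu_components(hyp, ref)    # BLEU(1)..BLEU(4), 1..4-gram precisions
    feats += meteor_components(hyp, ref)  # score, match/chunk rates, P, R, f-mean, ...
    feats += [len(hyp.split()), len(ref.split())]   # output features
    if src_feats is not None:             # source features (text data only)
        feats += src_feats                # OOV rate, n-gram hit rates, log perplexity
    return feats
```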
Experiment • GALE 2006 system outputs from all teams • Separated by language and type (text vs. audio) • 10-fold cross validation used to train and test • Neural net trained on segment level features to predict segment HTER • Predicted HTER of segments combined in weighted average to obtain predicted HTER of documents • Predicting document HTER from segment level features outperformed prediction from document level features • HTER-ANN outputs predicted HTER scores for documents
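A sketch of the cross-validation and segment-to-document aggregation described above. The slide does not specify the weighting used in the weighted average; segment length is assumed here, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

def predict_document_hter(X, y, seg_lens, doc_ids):
    """Train/test with 10-fold cross-validation at the segment level,
    then combine segment predictions into document-level HTER with a
    weighted average (segment length assumed as the weight)."""
    seg_pred = np.zeros(len(y))
    for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        net = MLPRegressor(hidden_layer_sizes=(3,), activation='tanh',
                           max_iter=2000, random_state=0)
        net.fit(X[train], y[train])
        seg_pred[test] = net.predict(X[test])
    return {doc: np.average(seg_pred[doc_ids == doc],
                            weights=seg_lens[doc_ids == doc])
            for doc in np.unique(doc_ids)}

# Toy usage with synthetic data: 300 segments across 30 documents.
rng = np.random.default_rng(0)
X = rng.random((300, 10))
y = 0.6 * X[:, 0] + rng.normal(0, 0.05, 300)
doc_ids = np.repeat(np.arange(30), 10)
seg_lens = rng.integers(5, 40, 300)
doc_hter = predict_document_hter(X, y, seg_lens, doc_ids)
```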
Results (Arabic and Chinese Text) • r is Pearson correlation of HTER with automatic scores • Original r: correlation with original measure • ANN r: correlation with HTER predicted by ANN • Best single metric varies with language and data type • Additional features improve prediction and correlation
Perplexity to predict HTER • Using only perplexity and other source features to predict HTER gives surprisingly good results • No features used from actual translation • Source features reflect document difficulty
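For illustration, a minimal sketch of source-only features under a unigram language model with add-one smoothing; the original presumably used a stronger n-gram LM, and the toy counts here are assumptions.

```python
import math
from collections import Counter

def source_features(doc_words, lm_counts, lm_total):
    """OOV rate and per-word log perplexity of a source document
    under a unigram LM with add-one smoothing (toy simplification)."""
    vocab = len(lm_counts)
    oov = sum(1 for w in doc_words if w not in lm_counts) / len(doc_words)
    log_prob = sum(math.log((lm_counts.get(w, 0) + 1) / (lm_total + vocab))
                   for w in doc_words)
    log_ppl = -log_prob / len(doc_words)  # average negative log-likelihood
    return oov, log_ppl

lm = Counter("the cat sat on the mat".split())
print(source_features("the dog sat".split(), lm, sum(lm.values())))
```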
Results (Arabic and Chinese Audio) • Different single metrics correlate best across conditions • Larger gains in correlation when all features are used
Conclusions • HTER-ANN always provides a gain over any single metric, though the gain is typically not large • Gains are higher for Chinese than for Arabic • The best single metric varies with language and data type • HTER-ANN provides a mechanism for choosing which automatic scoring metrics to use and how to weight them • No single automatic scoring metric performed as well across all languages and data types as the HTER-ANN • While HTER-ANN cannot replace humans in the HTER process, it frees researchers from worrying about the choice of evaluation metric when developing and tuning.