Predicting HTER using Automatic Scoring Metrics
Matthew Snover¹, Richard Schwartz², Bonnie J. Dorr¹
¹University of Maryland, College Park
²BBN Technologies, Inc.
Motivation and Goal
• HTER measures the number of edits needed to correct a system output so that it is both fluent and adequate.
• Requires a human to create a targeted reference
• Expensive and slow
• HTER is impractical for tuning and regular evaluation
• Choice of automatic evaluation metric (BLEU, TER, METEOR) for tuning and development is unclear
• Goal: Find a new automatic measure that appropriately weights scores from existing automatic scoring metrics to better predict HTER
Automatic Metrics vs. HTER
• BLEU, METEOR, and TER correlate well with HTER at the document level
• A combination of metrics may be ideal
[Figure panels: BLEU vs. HTER, TER vs. HTER, METEOR vs. HTER]
Automatic Metrics vs. HTER
• Pearson correlation between each automatic metric and HTER on GALE 2006 data
• Best metric varies across language and data type
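For readers who want to reproduce this kind of comparison, the following is a minimal sketch of the Pearson correlation between a list of automatic metric scores and a list of HTER scores. The function name `pearson_r` and the example values are illustrative only; they are not taken from the slides or from the GALE 2006 data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up document-level scores, for illustration only.
ter_scores = [0.30, 0.45, 0.50, 0.62]
hter_scores = [0.20, 0.33, 0.41, 0.48]
r = pearson_r(ter_scores, hter_scores)
```

Note that the sign of r depends on the metric's direction: error rates such as TER would be expected to rise with HTER, while BLEU and METEOR would be expected to fall as HTER rises, so comparing the magnitude of r is one reasonable reading.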
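Artificial Neural Network (ANN)
• 3 hidden nodes, with tan-sigmoid (hyperbolic tangent) transfer functions
• Feed-forward network trained with back-propagation
[Diagram: input features f1 ... fn (BLEU, METEOR, TER, and other features) feed hidden nodes h1, h2, and h3, which feed a single output node o that produces the predicted HTER]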
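The forward pass of such a network can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slide gives three tanh hidden nodes and one output node, but the linear output node, the weight layout, and the name `predict_hter` are assumptions made here. Training by back-propagation is not shown.

```python
import math


def predict_hter(features, hidden_weights, hidden_biases,
                 output_weights, output_bias):
    """Forward pass: n input features -> tanh hidden nodes -> one output.

    hidden_weights holds one weight list per hidden node (three lists in
    the network described on the slide). The linear output node is an
    assumption; the slide does not state the output transfer function.
    """
    hidden = [
        math.tanh(sum(w * f for w, f in zip(node_weights, features)) + bias)
        for node_weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical weights for a 3-feature input (e.g. BLEU, METEOR, TER).
example = predict_hter(
    features=[0.25, 0.50, 0.40],
    hidden_weights=[[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [-0.3, 0.2, 0.5]],
    hidden_biases=[0.0, 0.1, -0.1],
    output_weights=[0.5, -0.5, 0.25],
    output_bias=0.3,
)
```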
Features Used
• TER
    • TER score, insertion rate, deletion rate, substitution rate, shift rate, number of words shifted
• BLEU
    • BLEU(1), BLEU(2), BLEU(3), BLEU(4), 1-gram precision, 2-gram precision, 3-gram precision, 4-gram precision
• METEOR
    • METEOR score, match rate, chunk rate, precision, recall, f-mean, 1-factor, fragmentation, length penalty
• Output Features
    • # hypothesis words, # reference words
• Source Features (only for text data)
    • OOV rate, 1-gram hit rate, 2-gram hit rate, 3-gram hit rate, log perplexity
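As an illustration of one feature family in this list, the sketch below computes a clipped n-gram precision of the kind BLEU is built on, for a single hypothesis against a single reference. How the authors extracted their precision features (number of references, tokenization, segment vs. corpus level) is not stated on the slide, so everything here, including the name `ngram_precision`, is an assumption.

```python
from collections import Counter


def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a tokenized hypothesis vs. one reference."""
    hyp_ngrams = Counter(
        tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)
    )
    total = sum(hyp_ngrams.values())
    if total == 0:
        # Hypothesis is shorter than n tokens; report zero precision.
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it
    # appears in the reference (clipping).
    matched = sum(min(count, ref_ngrams[gram])
                  for gram, count in hyp_ngrams.items())
    return matched / total


hyp = "the the cat".split()
ref = "the cat sat".split()
precisions = [ngram_precision(hyp, ref, n) for n in (1, 2, 3, 4)]
```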
Experiment
• GALE 2006 system outputs from all teams
• Separated by language and type (text vs. audio)
• 10-fold cross-validation used to train and test
• Neural net trained on segment-level features to predict segment HTER
• Predicted HTER of segments combined in a weighted average to obtain predicted HTER of documents
• Predicting document HTER from segment-level features outperformed prediction from document-level features
• HTER-ANN outputs predicted HTER scores for documents
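Two steps of this setup can be sketched in a few lines: splitting items into 10 folds, and combining segment predictions into a document score. The slide does not say what the averaging weights are or whether folds were formed over segments or documents; weighting by segment length in reference words and the round-robin fold assignment below are assumptions, and the function names are hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds round-robin; each fold is held out once.
    """
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out
                 for i in fold]
        yield train, test


def document_hter(segment_predictions, segment_weights):
    """Weighted average of per-segment predicted HTER for one document.

    segment_weights is assumed to be each segment's reference word count.
    """
    total_weight = sum(segment_weights)
    if total_weight <= 0:
        raise ValueError("segment weights must sum to a positive value")
    weighted = sum(p * w for p, w in zip(segment_predictions, segment_weights))
    return weighted / total_weight


# Two segments of 10 and 30 words with predicted HTER 0.2 and 0.4.
doc_score = document_hter([0.2, 0.4], [10, 30])
splits = list(k_fold_indices(20, k=10))
```

Under these assumptions, longer segments contribute proportionally more to the document score, which mirrors how an edit rate over a whole document is dominated by its longer segments.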
Results (Arabic and Chinese Text)
• r is the Pearson correlation of HTER with automatic scores
• Original r: correlation with the original measure
• ANN r: correlation with HTER predicted by the ANN
• Best single metric varies with language and data type
• Additional features improve prediction and correlation
Perplexity to predict HTER
• Using only perplexity and other source features to predict HTER gives surprisingly good results
• No features used from the actual translation
• Source features reflect document difficulty
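Two of the source features named earlier, OOV rate and log perplexity, can be sketched as below. The slides do not specify the language model, vocabulary, or logarithm base used, so this assumes natural-log token probabilities supplied by some external language model and a plain set as the vocabulary; both function names are hypothetical.

```python
import math


def log_perplexity(token_log_probs):
    """Average negative log-probability per token (natural log assumed)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


def oov_rate(tokens, vocabulary):
    """Fraction of source tokens that are not in the given vocabulary."""
    if not tokens:
        raise ValueError("need at least one token")
    return sum(1 for token in tokens if token not in vocabulary) / len(tokens)


# Illustrative values: every token has probability 0.5 under the model.
lp = log_perplexity([math.log(0.5)] * 4)
rate = oov_rate(["a", "b", "c", "d"], {"a", "b"})
```

Both values depend only on the source text and the models applied to it, which is consistent with the slide's point that no features from the actual translation are needed.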
Results (Arabic and Chinese Audio)
• Different single metrics correlate best
• Larger gains in correlation for all features
Conclusions
• HTER-ANN always provides a gain over a single metric
• The best single metric varies with language and data type
• Gain typically not large
• Higher gains for Chinese than Arabic
• HTER-ANN provides a mechanism for choosing which automatic scoring metrics to use and how to weight them
• No single automatic scoring metric performed as well across all languages and data types as HTER-ANN
• While HTER-ANN cannot replace humans in the HTER process, it frees researchers from having to worry about which evaluation metric to choose when developing and tuning.