Dependency-Based Automatic Evaluation for Machine Translation
Karolina Owczarzak, Josef van Genabith, Andy Way
National Centre for Language Technology, School of Computing, Dublin City University
Overview
• Automatic evaluation for Machine Translation (MT): BLEU, NIST, GTM, METEOR, TER
• Lexical-Functional Grammar (LFG) in language processing: parsing to simple logical forms
• LFG in MT evaluation
  • assessing the level of parser noise: the adjunct attachment experiment
  • checking for bias: the Europarl experiment
  • correlation with human judgement: the MultiTrans experiment
• Future work
Automatic MT evaluation
• Automatic MT metrics: a fast and cheap way to evaluate your MT system
• Basic and most popular: BLEU, NIST
  John resigned yesterday vs. Yesterday, John quit
  1-grams: 2/3 (john, yesterday)
  2-grams: 0/2
  3-grams: 0/1
  Total = 2/6 n-grams = 0.33 (counted in the sketch below)
• String comparison: not sensitive to legitimate syntactic and lexical variation
• Needs large test sets and/or multiple references
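A minimal sketch of the n-gram counting behind the example above; it is not the full BLEU formula (which adds a brevity penalty and geometric averaging), and the lowercasing and punctuation stripping are simplifying assumptions.

```python
from collections import Counter

def ngram_matches(candidate, reference, n):
    """Return (matched, total) candidate n-grams of order n, with counts
    clipped to how often each n-gram occurs in the reference."""
    cand_grams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_grams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref_grams[gram]) for gram, count in cand_grams.items())
    return matched, sum(cand_grams.values())

# Punctuation dropped and text lowercased for simplicity.
candidate = "yesterday john quit".split()
reference = "john resigned yesterday".split()

matched = total = 0
for n in (1, 2, 3):
    m, t = ngram_matches(candidate, reference, n)
    matched, total = matched + m, total + t
print(f"{matched}/{total} n-grams matched")   # 2/6, as in the example above
```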
Automatic MT evaluation
• Other attempts to include more variation in evaluation:
  • General Text Matcher (GTM): precision and recall on translation-reference pairs, weighting contiguous matches more heavily
  • Translation Edit Rate (TER): edit distance for the translation-reference pair, counting insertions, deletions, substitutions and shifts (simplified sketch below)
  • METEOR: matching words against exact string forms, stemmed forms, and WordNet synonyms
  • Kauchak and Barzilay (2006): using WordNet synonyms with BLEU
  • Owczarzak et al. (2006): using paraphrases derived from the test set through word/phrase alignment with BLEU and NIST
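For illustration, a word-level edit distance in the spirit of TER; this is only a sketch under the assumption that block shifts (which TER also counts) are ignored, and it is not the official TER implementation.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions needed
    to turn the hypothesis into the reference (standard Levenshtein DP)."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i                      # delete all hypothesis words
    for j in range(len(ref) + 1):
        dist[0][j] = j                      # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,       # deletion
                             dist[i][j - 1] + 1,       # insertion
                             dist[i - 1][j - 1] + sub) # substitution or match
    return dist[len(hyp)][len(ref)]

edits = word_edit_distance("Yesterday John quit", "John resigned yesterday")
ter_like = edits / len("John resigned yesterday".split())   # edits per reference word
print(edits, ter_like)
```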
Lexical-Functional Grammar
• Sentence structure representation:
  • c-structure (constituent): CFG trees, reflects surface word order and structural hierarchy
  • f-structure (functional): abstract grammatical (syntactic) relations
• Example: John resigned yesterday vs. Yesterday, John resigned
  • c-structures: two different CFG trees, differing only in where the adjunct "yesterday" attaches
  • f-structure (identical for both orders): PRED resign, TENSE past, SUBJ [PRED john, NUM sg, PERS 3], ADJ [PRED yesterday]
  • triples: SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
  • triples – preds only: SUBJ(resign, john), ADJ(resign, yesterday) (filtering illustrated in the sketch below)
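A small sketch of the two triple views above, assuming triples are stored as (relation, head, dependent) tuples and that the predicate-only view keeps just grammatical-function relations; the exact relation inventory used for the split is an assumption based on the example.

```python
# Grammatical-function relations assumed to define the "preds only" view;
# the full LFG annotation contains more relations than this example shows.
PRED_RELATIONS = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP", "ADJ"}

triples = [
    ("SUBJ", "resign", "john"), ("PERS", "john", "3"), ("NUM", "john", "sg"),
    ("TENSE", "resign", "past"), ("ADJ", "resign", "yesterday"),
    ("PERS", "yesterday", "3"), ("NUM", "yesterday", "sg"),
]

preds_only = [t for t in triples if t[0] in PRED_RELATIONS]
print(preds_only)   # [('SUBJ', 'resign', 'john'), ('ADJ', 'resign', 'yesterday')]
```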
LFG Parser
• Cahill et al. (2004): LFG parser based on the Penn-II Treebank (demo at http://lfg-demo.computing.dcu.ie/lfgparser.html)
• Automatic annotation of Charniak's/Bikel's output parses with attribute-value equations, which are resolved into f-structures
• Evaluation of parser quality: the dependencies produced by the parser are compared, using precision and recall, against the dependencies in a human annotation of the same text
• Our LFG parser reaches high precision and recall scores
LFG in MT evaluation
• Parse translation and reference into LFG f-structures rendered as dependency triples
• Compare translation and reference at the structural (dependency) level
• Calculate precision and recall on the translation and reference dependency sets (see the sketch below)
• Comparison of two automatically produced outputs: how much noise does the parser introduce?
• Example: John resigned yesterday vs. Yesterday, John resigned
  SUBJ(resign, john), PERS(john, 3), NUM(john, sg), TENSE(resign, past), ADJ(resign, yesterday), PERS(yesterday, 3), NUM(yesterday, sg)
  (both sentences yield the same triple set, so the match is 100%)
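A minimal sketch of the precision/recall comparison, assuming each sentence is reduced to a set of (relation, head, dependent) tuples; the published metric may count duplicate triples rather than treating them as a set.

```python
def dependency_prf(translation_triples, reference_triples):
    """Precision, recall and F-score over two collections of dependency triples."""
    trans, ref = set(translation_triples), set(reference_triples)
    matches = len(trans & ref)
    precision = matches / len(trans) if trans else 0.0
    recall = matches / len(ref) if ref else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# "Yesterday, John resigned" vs. "John resigned yesterday": identical f-structures
translation = [("SUBJ", "resign", "john"), ("PERS", "john", "3"),
               ("NUM", "john", "sg"), ("TENSE", "resign", "past"),
               ("ADJ", "resign", "yesterday"), ("PERS", "yesterday", "3"),
               ("NUM", "yesterday", "sg")]
reference = list(translation)
print(dependency_prf(translation, reference))   # (1.0, 1.0, 1.0)
```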
The adjunct attachment experiment
• 100 English Europarl sentences containing adjuncts or coordinated structures
• Hand-modified to change the placement of the adjunct or the order of coordinated elements, with no change in meaning or grammaticality:
  Schengen, on the other hand, is not organic. <- original "reference"
  On the other hand, Schengen is not organic. <- modified "translation"
• The change is limited to c-structure; there is no change in f-structure
• A perfect parser should therefore give identical sets of dependencies for both
The adjunct attachment experiment - results

              ---- Ideal parser ----           ------- Parser -------
  metric      baseline   modified              baseline   modified
  TER         0.0        6.417                 0.0        7.841
  METEOR      1.0        0.9956                1.0        0.9970
  BLEU        1.0        0.8725                1.0        0.8485
  NIST        11.1690    10.7422 (96.18%)      11.5232    11.1704 (96.94%)
  GTM         100        99.35                 100        99.18
  dep         100        100                   100        96.56
  dep_preds   100        100                   100        94.13
The Europarl experiment
• N-gram-based metrics (BLEU, NIST) favour n-gram-based translation (statistical MT)
• Owczarzak et al. (2006):
  BLEU: Pharaoh > Logomedia (0.0349)
  NIST: Pharaoh > Logomedia (0.6219)
  Human: Pharaoh < Logomedia (0.19)
• 4000 sentences from Spanish-English Europarl
• Two translations: Logomedia and Pharaoh
• Evaluated with BLEU, NIST, GTM, TER, METEOR (with and without WordNet), and the dependency-based method (basic, predicate-only, with and without WordNet, with and without bitext-generated paraphrases)
• WordNet and paraphrases are used to create a new best-matching reference for the translation, which is then evaluated with the dependency-based method (an illustrative synonym-lookup sketch follows below)
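The WordNet step above can be pictured with NLTK's WordNet interface; this is only an illustration of synonym lookup, not the actual reference-construction procedure, and it assumes the NLTK WordNet data has been downloaded.

```python
from nltk.corpus import wordnet as wn   # assumes nltk and its WordNet data are installed

def wordnet_synonyms(word):
    """All lemma names that share at least one synset with the given word."""
    return {lemma.name().lower()
            for synset in wn.synsets(word)
            for lemma in synset.lemmas()}

def lexical_match(word_a, word_b):
    """True if the two words are identical or are WordNet synonyms."""
    a, b = word_a.lower(), word_b.lower()
    return a == b or a in wordnet_synonyms(b) or b in wordnet_synonyms(a)

# With such a match function, SUBJ(resign, john) in the reference could be
# matched against SUBJ(quit, john) in the translation.
print(lexical_match("resign", "quit"))
```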
The Europarl experiment - results
Europarl 4000, Logomedia vs Pharaoh (metrics ordered into sets of increasing similarity):

  metric               PH score – LM score
  TER                  1.997
  BLEU                 7.16%
  NIST                 6.58%
  dep                  4.93%
  dep+paraphr          4.80%
  GTM                  3.89%
  METEOR               3.80%
  dep_preds            3.79%
  dep+paraphr_preds    3.70%
  dep+WordNet          3.55%
  dep+WordNet_preds    2.60%
  METEOR+WordNet       1.56%
The MultiTrans experiment
• Correlation of the dependency-based method with human evaluation
• Compared with the correlations of BLEU, NIST, GTM, METEOR, TER
• Linguistic Data Consortium Multiple Translation Chinese Parts 2 and 4:
  • multiple translations of Chinese newswire text
  • four human-produced references
  • segment-level human scores for a subset of the translations
  • total: 16,800 translation-reference-human score segments
• Pearson's correlation coefficient (sketch below): -1 = negative correlation, 0 = no correlation, 1 = positive correlation
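Pearson's r, mentioned above, can be computed directly from segment-level metric scores and human scores; the numbers below are toy values, not data from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy example: per-segment metric scores against per-segment human judgements
metric_scores = [0.12, 0.43, 0.55, 0.71, 0.80]
human_scores = [2, 3, 3, 4, 5]
print(round(pearson(metric_scores, human_scores), 3))
```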
The MultiTrans experiment - results
Correlation with human judgement of translation quality
• The dependency-based method is sensitive to the grammatical structure of the sentence: a more grammatical translation is a more fluent translation
• A different position for a word means a different local (and global) structure: the word appears in dependency triples that do not match the reference
Future work
• Use n-best parses to reduce parser noise and increase the number of matches
• Generate a paraphrase set through word alignment from a large bitext (Europarl) and use it instead of WordNet
• Create weights for the individual dependency scores that contribute to the segment-level score, trained to maximize correlation with human judgement
Conclusions
• A new automatic method for evaluation of MT output
• LFG dependency triples: a simple logical form
• Evaluation at the structural level, not on the surface string form
• Allows legitimate syntactic variation
• Allows legitimate lexical variation when used with WordNet or paraphrases
• Correlates with human evaluation of fluency more highly than the other metrics
References
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization: 65-73.
Aoife Cahill, Michael Burke, Ruth O'Donovan, Josef van Genabith, and Andy Way. 2004. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. Proceedings of ACL 2004: 320-327.
George Doddington. 2002. Automatic Evaluation of MT Quality using N-gram Co-occurrence Statistics. Proceedings of HLT 2002: 138-145.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. Proceedings of HLT-NAACL 2006: 455-462.
Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. Proceedings of HLT-NAACL 2003: 48-54.
Philipp Koehn. 2004. Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. Proceedings of the AMTA 2004 Workshop on Machine Translation: From Real Users to Research: 115-124.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Proceedings of MT Summit 2005: 79-86.
Karolina Owczarzak, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. Proceedings of the HLT-NAACL 2006 Workshop on Statistical Machine Translation: 86-93.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of ACL 2002: 311-318.
Matthew Snover, Bonnie Dorr, Richard Schwartz, John Makhoul, and Linnea Micciulla. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of AMTA 2006: 223-231.
Joseph P. Turian, Luke Shen, and I. Dan Melamed. 2003. Evaluation of Machine Translation and Its Evaluation. Proceedings of MT Summit 2003: 386-393.