Linguistic Information for Measuring Translation Quality
Lucia Specia
L.Specia@wlv.ac.uk
http://pers-www.wlv.ac.uk/~in1316/
LIHMT Workshop, Barcelona, November 18, 2011
In an ideal world... • Linguistic information is seamlessly combined with statistical information within translation systems to produce perfect translations • We are moving in that direction: • Morphology • Syntax • Semantics (SRL): • (Wu & Fung 2009) • (Liu & Gildea 2010) • (Aziz et al. 2011) Meanwhile…
Outline • Linguistic information to evaluate MT quality • Based on reference translations • Linguistic information to estimate MT quality • Using machine learning • Linguistic information to detect errors in MT • Automatic post-editing
MT evaluation • Handle variations in MT (words and structure) wrt reference, or identify differences between MT and reference • METEOR (Denkowski & Lavie 2011): words and phrases • (Giménez & Màrquez 2010): matching of lexical, syntactic, semantic and discourse units • (Lo & Wu 2011): SRL and manual matching of 'who' did 'what' to 'whom', etc. • (Rios et al. 2011): automatic SRL with automatic (inexact) matching of predicates and arguments
MT evaluation • Essentially: matching of linguistic units (see the sketch below) • Similar to n-gram matching metrics, but the units are not only words • Metrics based on lexical units perform better • Issues: • Lack of (good) resources for certain languages • Unreliable processing of incorrect translations • Sparsity at sentence level, depending on the actual features, e.g. matching of named entities
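At their core, these metrics all compute overlap between sets of linguistic units extracted from the hypothesis and the reference. A minimal sketch of that shared idea, with hypothetical SRL-style units; this is not the implementation of any of the metrics above:

```python
from collections import Counter

def unit_f_measure(hyp_units, ref_units, beta=1.0):
    """Generic F-measure over multisets of linguistic units.

    hyp_units / ref_units: lists of hashable units extracted from the MT
    output and the reference, e.g. lemmas, dependency triples, or
    (predicate, role, argument-head) tuples from an SRL analyzer.
    """
    if not hyp_units or not ref_units:
        return 0.0
    hyp, ref = Counter(hyp_units), Counter(ref_units)
    matched = sum((hyp & ref).values())  # clipped multiset overlap
    precision = matched / len(hyp_units)
    recall = matched / len(ref_units)
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example with hypothetical (predicate, role, argument-head) units:
hyp = [("take", "A0", "student"), ("take", "A1", "exam")]
ref = [("take", "A0", "student"), ("take", "A1", "exam"), ("choose", "A1", "course")]
print(unit_f_measure(hyp, ref))  # 0.8
```

Swapping the unit extractor changes the metric family: surface tokens give an n-gram-style metric, while parses or SRL analyses give the syntactic/semantic variants discussed above.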
MT Quality Estimation • Goal: given the output of an MT system for a given input, provide an estimate of its quality • Uses: • Filter out bad-quality translations before post-editing • Select "perfect" translations for publishing • Warn readers who only know the target language about unreliable translations • Select the best translation for a given input when multiple MT/TM systems are available
The task of QE for MT • NOT standard MT evaluation: • Reference translations are NOT available • Estimation is for unseen translations • My approach: • Translation unit: the sentence • Independent of the MT system
General approach • Define the aspect of quality to estimate and how to represent it • Identify and extract features that explain that aspect of quality • Collect examples of translations with different levels of quality and annotate them • Learn a model to predict quality scores for new translations and evaluate it
Features (diagram): the source text, the MT system, and the translation each contribute indicators of quality • Complexity indicators: from the source text • Confidence indicators: from the MT system • Fluency indicators: from the translation • Adequacy indicators: from the source-translation pair. Features can be shallow or linguistically motivated.
Shallow features
These do well for estimating general quality wrt post-editing needs, but are not enough for other aspects of quality…
• (S/T/S-T) Sentence length • (S/T) Language model score • (S/T) Type/token ratio • (S) Readability metrics: Flesch, etc. • (S) Average number of possible translations per word • (S) % of n-grams belonging to different frequency quartiles of a source-language corpus • (T) Untranslated/OOV words • (T) Mismatched brackets and quotation marks • (S-T) Preservation of punctuation • (S-T) Word alignment score • etc. (a few of these are sketched below)
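A minimal sketch of a few of these shallow features. The `.score()` method on the language model objects is a hypothetical interface standing in for whatever LM toolkit is actually used:

```python
import string

def shallow_features(src_tokens, tgt_tokens, src_lm, tgt_lm):
    """A handful of the shallow features listed above.

    src_lm / tgt_lm: assumed language-model objects exposing a
    .score(tokens) method returning a sentence log-probability.
    """
    return {
        # (S/T/S-T) sentence lengths and their ratio
        "src_len": len(src_tokens),
        "tgt_len": len(tgt_tokens),
        "len_ratio": len(tgt_tokens) / max(len(src_tokens), 1),
        # (S/T) type/token ratio
        "src_ttr": len(set(src_tokens)) / max(len(src_tokens), 1),
        "tgt_ttr": len(set(tgt_tokens)) / max(len(tgt_tokens), 1),
        # (S/T) language model scores
        "src_lm": src_lm.score(src_tokens),
        "tgt_lm": tgt_lm.score(tgt_tokens),
        # (S-T) preservation of punctuation, approximated as the absolute
        # difference in counts of punctuation-only tokens
        "punct_diff": abs(sum(t in string.punctuation for t in src_tokens)
                          - sum(t in string.punctuation for t in tgt_tokens)),
    }
```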
Linguistic features: count-based • (S/T/S-T) Content/non-content words • (S/T/S-T) Nouns/verbs/… NP/VP/… • (S/T/S-T) Deictics (references) • (S/T/S-T) Discourse markers (references) • (S/T/S-T) Named entities • (S/T/S-T) Zero subjects • (S/T/S-T) Pronominal subjects • (S/T/S-T) Negation indicators • (T) Subject-verb / adjective-noun agreement • (T) Language model of POS tags • (T) Grammar checking (dangling words) • (T) Coherence (see the sketch below for a few of these counts)
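As an illustration of the count-based side, here is a sketch using spaCy, a convenient modern stand-in for the taggers, parsers and NE recognizers the talk assumes (it requires the `en_core_web_sm` model to be installed); it extracts a few of these counts for one side of the pair:

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def count_linguistic_features(text):
    """Count-based linguistic features for one side (source or target)."""
    doc = nlp(text)
    return {
        "n_nouns": sum(t.pos_ == "NOUN" for t in doc),
        "n_verbs": sum(t.pos_ == "VERB" for t in doc),
        "n_content": sum(t.pos_ in {"NOUN", "VERB", "ADJ", "ADV"} for t in doc),
        "n_named_entities": len(doc.ents),
        "n_pron_subjects": sum(t.dep_ == "nsubj" and t.pos_ == "PRON" for t in doc),
        "n_negations": sum(t.dep_ == "neg" for t in doc),
    }

print(count_linguistic_features("The student still has not chosen the course."))
```

The same counts would be computed on the source with a tagger/parser for that language, which is exactly where the resource-availability issues below come in.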
Linguistic features: alignment-based • (S-T) Correct translation of pronouns • (S-T) Matching of dependency relations • (S-T) Matching of named entities • (S-T) Alignment of parse trees • (S-T) Alignment of predicates & arguments • etc.
Some features are language-dependent; others need language-dependent resources but apply to most languages, e.g. an LM of POS tags.
Linguistic features • How to model different linguistic phenomena? • Count-based feature representation: • Source/target only: count or proportion • Contrastive features (S-T): very important, but not a simple matching of linguistic units • Alignment may not be possible (e.g. clauses/phrases) • Force the same linguistic phenomena in S and T? • E.g. verbs (Vs) may be translated as nouns (Ns)
(S = linguistic unit in source; T = linguistic unit in target)
Linguistic features • Count-based feature representation: • Monotonicity of features • Sparsity: is 0-0 as good as 10-10? • Our representation: precision and recall (sketched below) • Does not rely on alignment • Upper bound = 1 (also holds for S, T = 0) • Lower bound = 0
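A minimal sketch of this precision/recall representation for a single count-based feature. The convention for the S = T = 0 case follows the slide's note that the upper bound of 1 also holds there; the function name is illustrative:

```python
def count_precision_recall(s_count, t_count):
    """Contrastive representation of a count-based feature (e.g. number
    of named entities in source vs target) without requiring alignment.

    Assumed convention: when both counts are zero the phenomenon is
    trivially preserved, so both values are 1.
    """
    if s_count == 0 and t_count == 0:
        return 1.0, 1.0
    matched = min(s_count, t_count)
    precision = matched / t_count if t_count else 0.0  # how much of T is licensed by S
    recall = matched / s_count if s_count else 0.0     # how much of S is preserved in T
    return precision, recall

print(count_precision_recall(3, 5))  # (0.6, 1.0): target over-generates the phenomenon
print(count_precision_recall(0, 0))  # (1.0, 1.0): nothing to preserve
```

Both values are bounded in [0, 1] regardless of sentence length, which addresses the monotonicity and sparsity concerns above.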
Linguistic features: other work • S-T: (Pighin and Màrquez 2011): learn the expected projection of SRL structures from source to target • S-T: (Xiong et al. 2010): target LM of words and POS tags, dangling words (Link Grammar parser), word posterior probabilities • S-T: (Bach et al. 2011): sequences of words and POS tags, context, dependency structures, alignment info. Fine-grained, so it needs a lot of training data: 72K sentences, 2.2M words and their manual correction (!)
Quality Aspect & Annotation • Estimating post-editing effort • Human scores (1-4): how much post-editing effort? • Estimating adequacy • Human scores (1-4): to which degree does the translation convey the meaning of the original text?
Learning framework • Machine learning algorithm: SVM for regression (a sketch follows) • Evaluation: Root Mean Square Error (RMSE)
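A minimal sketch of this learning setup using scikit-learn's SVR. The random data, feature count and hyperparameters are purely illustrative stand-ins, not those of the actual experiments:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: one row of (shallow + linguistic) feature values per translation;
# y: human quality scores in [1, 4]. Random data stands in for the corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 96))
y = rng.uniform(1, 4, size=4000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
rmse = np.sqrt(np.mean((pred - y_te) ** 2))
print(f"RMSE: {rmse:.3f}")  # deviation on the 1-4 quality scale
```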
Post-editing effort estimation • English-Spanish Europarl data • 4 SMT systems → 4 sets of 4,000 {source, translation, score} triples • Quality score: 1-4 post-editing effort • Features: 96 shallow versus 169 shallow + linguistic:
Post-editing effort estimation • Distribution of post-editing effort scores:
Post-editing effort estimation • RMSE: a deviation of 17-22%
Adequacy estimation
SRC: A estudante ainda tem pretensão de prestar vestibular no fim do ano – embora não tenha escolhido o curso
MT: The student still has claimed to take the exam at the end of the year - although she has not chosen course.
REF: The student still has the intention to take the exam at the end of the year – although she has not chosen the course.
Adequacy estimation • Arabic-English newswire data (GALE) • 2 SMT systems (Rosetta team) → 2 sets of 2,585 {source, translation, score} triples • Quality score: 1-4 adequacy • Features: 82 shallow versus 122 shallow + linguistic:
Adequacy estimation • Distribution of adequacy scores:
Adequacy estimation • RMSE: a deviation of 14-26%
Feature analysis • Best performing: • Length (words, content words, etc.) • Language model / corpus frequency • Ambiguity of source words • Shallow features are better than linguistic features • Except for one adequacy estimation system • Source/target features are better than contrastive features (both shallow and linguistic) • Absolute numbers are better than proportions
Linguistic features • Issues: • Feature representation • Sparsity • Deeper features are needed for adequacy estimation • Annotation: • 1-4 post-editing effort: could be more objective • 1-4 adequacy: can we isolate adequacy from fluency? • Language dependency • Reliability of resources on low-quality translations • Availability of resources
Error detection • General vs specific errors • Bottom-up approach: word-based CE • (Xiong et al. 2010): word posterior probability, dangling words (Link Grammar parser), target word & POS patterns • (Bach et al. 2011): dependency relations, word and POS patterns, e.g. relating target words to patterns of POS tags in the source
Error detection • (Bach et al. 2011): the best features are source-based (a sketch of word-level CE in this spirit follows)
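A minimal sketch of word-level CE as binary classification, in the spirit of (but not reproducing) the systems above. The feature names, helper function and toy data are all hypothetical:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def word_features(tgt_tokens, tgt_pos, src_pos_spans, i):
    """Features for the i-th target word: surface form, POS context, and
    the POS pattern of its aligned source span (names are illustrative)."""
    return {
        "word": tgt_tokens[i],
        "pos": tgt_pos[i],
        "prev_pos": tgt_pos[i - 1] if i > 0 else "<s>",
        "next_pos": tgt_pos[i + 1] if i + 1 < len(tgt_pos) else "</s>",
        "src_pos_pattern": "_".join(src_pos_spans[i]),
    }

# Toy training data: one feature dict per target token,
# labelled 1 for an erroneous word and 0 for a correct one.
X_train = [
    {"word": "has", "pos": "VBZ", "prev_pos": "NN",
     "next_pos": "VBN", "src_pos_pattern": "V"},
    {"word": "claimed", "pos": "VBN", "prev_pos": "VBZ",
     "next_pos": "TO", "src_pos_pattern": "N"},
]
y_train = [0, 1]

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.predict(X_train))  # [0 1] on the training toys
```

The fine granularity is exactly why such systems need the large manually corrected corpora mentioned earlier.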
Error detection • Top-down approach (ongoing work) • Corpus-based analysis: generalize errors into categories • Portuguese-English • 150 sentences (2 domains, 2 MT systems) • ~700 errors / 150 sentences • 42 error categories: a few rules per category… • RBMT: more systematic errors
Conclusions • It is possible to estimate the quality of MT systems wrt post-editing needs using shallow, language- and system-independent features • Adequacy estimation is a harder problem • It needs more complex linguistic features… • Linguistic features are relevant: • Directly useful for error detection (word-level CE) • Directly useful for automatic post-editing • But… for sentence-level CE: • Issues with sparsity • Issues with representation: length bias
Thanks! Lucia Specia l.specia@wlv.ac.uk
References
Aziz, W., Rios, M. and Specia, L. 2011. Shallow Semantic Trees for SMT. WMT.
Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. WMT.
Giménez, J. and Màrquez, L. 2010. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, 24(3-4).
Hardmeier, C. 2011. Improving Machine Translation Quality Prediction with Syntactic Tree Kernels. EAMT-2011.
Liu, D. and Gildea, D. 2010. Semantic Role Features for Machine Translation. 23rd International Conference on Computational Linguistics.
Pado, S., Galley, M., Jurafsky, D. and Manning, C. 2009. Robust Machine Translation Evaluation with Entailment Features. ACL.
References
Pighin, D. and Màrquez, L. 2011. Automatic Projection of Semantic Structures: an Application to Pairwise Translation Ranking. SSST-5.
Tatsumi, M. and Roturier, J. 2010. Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship? 2nd JEC Workshop, 43-51.
Wu, D. and Fung, P. 2009. Semantic Roles for SMT: A Hybrid Two-Pass Model. HLT/NAACL.
Xiong, D., Zhang, M. and Li, H. 2010. Error Detection for SMT Using Linguistic Features. ACL-2010.
En-Es Europarl [1-4] • Best features (Pearson's correlation) (S3 en-es):
En-Es Europarl [1-4] • Filtering out bad translations: 1-2 (S3 en-es) • Average human scores in the top n translations:
En-Es Europarl [1-4] • QE x MT metrics: Pearson's correlation (S3 en-es)
En-Es Europarl [1-4] • QE score x MT metrics: Pearson's correlation across MT systems:
MT features (confidence) • SMT model global score and internal features • Distortion count, phrase probability, … • % of search nodes aborted, pruned, recombined, … • Language model using the n-best list as corpus • Distance to the centre hypothesis in the n-best list • Relative frequency of the translation's words in the n-best list • Ratio of the SMT model score of the top translation to the sum of the scores of all hypotheses in the n-best list, … (two of these are sketched below)
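A minimal sketch of two of these n-best-list confidence features. The data format, function name and scores are illustrative assumptions, not the interface of any particular decoder:

```python
import math
from collections import Counter

def nbest_confidence_features(nbest):
    """Two confidence features from an SMT n-best list.

    nbest: list of (tokens, model_score) pairs, best hypothesis first,
    where model_score is assumed to be a log-probability.
    """
    top_tokens, top_score = nbest[0]

    # Ratio of the top hypothesis' score to the sum over all hypotheses
    # (scores are exponentiated since they are log-probabilities).
    total = sum(math.exp(s) for _, s in nbest)
    score_ratio = math.exp(top_score) / total

    # Relative frequency of the top translation's words across the n-best list.
    counts = Counter(tok for tokens, _ in nbest for tok in tokens)
    n_tokens = sum(counts.values())
    avg_word_freq = sum(counts[t] for t in top_tokens) / (len(top_tokens) * n_tokens)

    return {"score_ratio": score_ratio, "avg_word_freq": avg_word_freq}

nbest = [("the house is small".split(), -2.1),
         ("the house is little".split(), -2.4),
         ("the home is small".split(), -3.0)]
print(nbest_confidence_features(nbest))
```

Note that these features require access to the decoder's internals, which is why the system-independent approach above avoids them.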