Rationale for a multilingual corpus for machine translation evaluation
Debbie Elliott, Anthony Hartley, Eric Atwell
Corpus Linguistics 2003, Lancaster, England
Outline • Brief introduction to machine translation evaluation methods • Corpus content for MT evaluation by end-users • Why compile a new corpus? • How large should our new corpus be? • Which language pairs should be included? • Which text types should be included? • Conclusions
Machine translation evaluation methods (1) Evaluation by developers • Test suites are used to evaluate the translation of specific linguistic phenomena (e.g. before and after system modifications) • Test suites contain short annotated test items with correct target translations • They are used to test the handling of grammatical phenomena • Vocabulary is limited • Items are not rated in terms of frequency or relevance to a particular application • Scoring is objective
Machine translation evaluation methods (2) Evaluation by end-users • Texts are translated by different MT systems (and often humans) for comparison • Texts can be selected to reflect user needs A number of methods can be used to evaluate MT output … • Fidelity (the preservation of original content) can be evaluated by comparing segments of MT output with segments from the source text (bilingual evaluators) or from expert human translations (monolingual evaluators). Each segment is given a score • Fluency (the extent to which the translation reads like an original text) can be evaluated by scoring each target text sentence
Machine translation evaluation methods (3) Evaluation by end-users • Scoring by human evaluators is subjective, so: • Several evaluators are used and a mean score is calculated for each text • Evaluators rate a number of texts translated by each system • Human evaluation is expensive, so: • Recent research has investigated automated evaluation methods
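To make the averaging step concrete, here is a minimal Python sketch of computing a mean score per text across several evaluators, then an overall mean per system. The data layout, system names and ratings are illustrative assumptions, not taken from any evaluation described here.

```python
# Minimal sketch of averaging subjective human judgements.
# Systems, text IDs and ratings are hypothetical placeholders.
from statistics import mean

# scores[system][text_id] -> ratings from several evaluators
scores = {
    "system_a": {"text_01": [3, 4, 3], "text_02": [2, 3, 3]},
    "system_b": {"text_01": [4, 5, 4], "text_02": [4, 4, 5]},
}

def mean_score_per_text(system_scores):
    """Average the evaluators' ratings for each text."""
    return {text: mean(ratings) for text, ratings in system_scores.items()}

for system, per_text in scores.items():
    text_means = mean_score_per_text(per_text)
    overall = mean(text_means.values())
    print(system, text_means, round(overall, 2))
```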
Corpus content for MT evaluation by end-users Essential: • Source texts in one or more languages • Machine translations of the source texts by the systems under evaluation Not always essential: • One or more expert human translations in the selected target language(s), to be used as reference translations or for inclusion in the evaluation alongside MT output • Existing evaluation scores, if the corpus is to be used to validate new automated evaluation methods
Why compile a new corpus for MT evaluation? (1) Existing corpora have limitations: many projects have relied on small numbers of texts in only one language pair
Why compile a new corpus for MT evaluation? (2) Much research has made use of the DARPA 1994 corpus: • Source texts: 100 French, 100 Spanish, 100 Japanese • All newspaper articles of approx. 300-400 words (800 characters for Japanese) • 2 English human translations of each source text • 5 machine translations of each source text • Adequacy, fluency and informativeness scores for the translations of all 100 source texts in each language pair, produced by 5 MT systems and 1 human translator
Why compile a new corpus for MT evaluation? (3) We need: • a corpus that reflects user needs (not just newspaper articles) • a larger number of language pairs, with English as both a source and a target language • sub-corpora (for each language pair) large enough to provide reliable evaluation results • at least one human translation and several machine translations of each source text • human evaluation results for selected attributes (e.g. fidelity and fluency) for the validation of new automated evaluation methods • a corpus available to all for MT evaluation research
How large should our new corpus be? (1) The corpus must not be unnecessarily large: • human MT evaluations are time-consuming and expensive • expert human translations of each source text, if not already available, will be costly to produce However: • we need enough words to obtain reliable MT evaluation results
How large should our new corpus be? (2) We carried out a statistical analysis of the DARPA 1994 scores for all three language pairs, calculating the mean score for each attribute and the overall score for each system with varying numbers of texts (1 to 100). A sketch of the computation and the resulting French-English curves follow:
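A minimal sketch of that computation, using synthetic stand-in scores (the real DARPA data and the five systems are not reproduced here): for each n from 1 to 100, take the mean score over the first n texts and watch where the curve settles.

```python
# Sketch: how a mean attribute score evolves as more texts are included.
# The per-text scores below are synthetic stand-ins, not DARPA data.
import random

random.seed(0)
N_TEXTS = 100
# Per-text adequacy scores for one hypothetical system
adequacy = [random.gauss(3.5, 0.8) for _ in range(N_TEXTS)]

running_means = []
total = 0.0
for n, score in enumerate(adequacy, start=1):
    total += score
    running_means.append(total / n)

# Inspect the mean after 1, 10, 30, 40 and 100 texts
for n in (1, 10, 30, 40, 100):
    print(f"mean after {n:3d} texts: {running_means[n - 1]:.3f}")
```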
DARPA 1994 (French-English): mean adequacy scores for varying numbers of texts [chart]
DARPA 1994 (French-English): mean overall scores for varying numbers of texts [chart]
How large should our new corpus be? (3) Results from statistical analysis: • 10 texts (3,500 words), and often fewer, allow us to identify the highest-ranking (human) and lowest-ranking system for individual attributes and overall scores • 10 texts also allow us to identify the highest-ranking MT system (but up to 30 texts are required for informativeness) • After approx. 30 texts (10,500 words) scores begin to stabilise, fluctuating within a relatively narrow range • After approx. 40 texts (14,000 words) we have a clearer picture of how all five MT systems compare, and further sampling confirms this (a resampling check is sketched below) • Further research: the same statistical analysis will be performed using texts from our new corpus and our chosen metrics
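The slides do not spell out the "further sampling" procedure; one plausible check, sketched below under that assumption with synthetic scores, draws repeated random 40-text subsets and counts how often they reproduce the full-corpus system ranking.

```python
# Sketch: checking ranking stability with random subsets of texts.
# Synthetic per-text scores; the subset size of 40 follows the slide.
import random

random.seed(1)
N_TEXTS, SUBSET, TRIALS = 100, 40, 1000
systems = {
    "sys_a": [random.gauss(3.2, 0.7) for _ in range(N_TEXTS)],
    "sys_b": [random.gauss(3.6, 0.7) for _ in range(N_TEXTS)],
}

def ranking(scores_by_system, indices):
    """Order systems by their mean score over the given text indices."""
    means = {s: sum(v[i] for i in indices) / len(indices)
             for s, v in scores_by_system.items()}
    return sorted(means, key=means.get, reverse=True)

full = ranking(systems, range(N_TEXTS))
stable = sum(
    ranking(systems, random.sample(range(N_TEXTS), SUBSET)) == full
    for _ in range(TRIALS)
)
print(f"{stable / TRIALS:.1%} of {SUBSET}-text samples reproduce the full ranking")
```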
Which language pairs should be included? (1) • A variety of language pairs allows for the testing of portability of new evaluation methods • The availability of MT systems for evaluation will influence our choice • Our survey of MT users (ongoing since January 2003) is also providing guidelines …
Language pairs translated by MT users (English as source language) [chart]
Language pairs translated by MT users (English as target language) [chart]
Which language pairs should be included? (2) • Phase One: French, German, Spanish and Italian, plus texts in typologically different languages (Chinese, Japanese), translated into English • Phase Two: consider additional source languages (e.g. Portuguese and Russian into English) • Phase Three: English translated into other languages
Which text types should be included? • MT systems are used in translation companies and international organisations to translate a number of different text types and topics • These text types and a variety of topics must be represented in our corpus • Our survey of MT users is providing guidelines on the kinds of texts and topics most frequently translated using MT systems …
Conclusions • We aim to provide a minimum of 14,000 words per language pair (further research to be conducted) • Text types will be based on responses to our survey, reflecting real MT use • The text types for each language pair will be the same, to give balance • Our corpus will be dynamic: updated to reflect changing trends in the MT user market • The key feature will be detailed scores from human evaluations, available for research (particularly in automated MT evaluation) • We plan to make our corpus and human evaluation results available online in 2004
Thank you
We welcome your questions