Human Evaluation of Machine Translation Systems

MODL5003 Principles and applications of machine translation
Lecture 13/03/2006
Bogdan Babych bogdan@comp.leeds.ac.uk
(Slides: Debbie Elliott, Tony Hartley)

Outline
• MT Evaluation – general perspective
• Purposes of MT evaluation
• Why evaluating translation quality is difficult
• A brief history of MT evaluation
• Examples of MT evaluation methods for users
• Where next?
MODL5003 Principles and applications of MT

MT Evaluation – a big space
• Requirements
  - Task: assimilation, dissemination, ...
  - Text: type, provenance, ...
  - User: translators, consumers, ...
• Quality attributes
  - Internal: architecture, resources, ...
  - External: readability, fidelity, well-formedness, ...

What is evaluated
• “It looks good to me” evaluation
• Test suites
• Syntactic coverage, degradation
• Corpus-based evaluation
• Real texts
• Don’t have to read them all!
• Coverage of typical problems
• Interaction between different levels
• General performance (bird's-eye view)
• Aspects more/less important for overall quality

Purposes of MT evaluation
International Standards for Language Engineering (ISLE) http://www.mpi.nl/ISLE/
Framework for the Evaluation of Machine Translation in ISLE (FEMTI) http://www.issco.unige.ch/projects/isle/femti/framed-glossary.html
Defines 7 types of MT evaluation:
• Feasibility testing
• Requirements elicitation
• Internal evaluation
• Diagnostic evaluation
• Declarative evaluation
• Operational evaluation
• Usability evaluation

Purposes of MT evaluation
1. Feasibility testing
“An evaluation of the possibility that a particular approach has any potential for success after further research and implementation.” (White 2000)
(e.g. sub-problems connected to a particular language pair)
Purpose: To decide whether to invest in further research into a particular approach
For: Researchers, sponsors of research

Purposes of MT evaluation
2. Requirements elicitation
Researchers and developers create prototypes designed to demonstrate particular functional capabilities that might be implemented
Purpose: To elicit reactions from potential investors before implementing new approaches
For: Researchers and developers, project managers, end-users

Purposes of MT evaluation
3. Internal evaluation
Researchers and developers test components of a prototype or pre-release system. This can involve the use of test suites to evaluate output quality during the course of system modifications.
Purpose:
• To measure how well each component performs its function
• To test linguistic coverage (that a new grammar rule works in all circumstances)
• Iterative testing: to check that particular modifications do not have adverse effects elsewhere
For: Researchers, developers, investors

Purposes of MT evaluation
4. Diagnostic evaluation
Researchers and developers of prototype systems evaluate functionality characteristics and analyse intermediate results produced by the system
Purpose: To discover why a system did not give the expected results
For: Researchers and developers

Purposes of MT evaluation
5. Declarative evaluation
Evaluators rate the quality of MT output
Purpose:
• To measure how well a system translates
• To measure fidelity (how much of the source text content is correctly conveyed in the target text)
• To measure the fluency of the target text
• To measure the usability of MT output for a particular purpose
• To evaluate a system’s improvability (to what extent can dictionary updates improve output quality?)
• To help decide which system to buy
• To indicate whether buying a system will be cost-effective (will post-editing MT output be cheaper than translating from scratch?)
For: End-users, researchers, developers, managers, investors, vendors

Purposes of MT evaluation
6. Operational evaluation
Managers calculate purchase and running costs and compare them with benefits
Purpose: To determine the cost-benefit of an MT system in a particular operational environment, and whether a system will serve its required purpose
For: Managers, investors, vendors

Purposes of MT evaluation
7. Usability evaluation
• Evaluators test how easy the application is to use
• Systems are evaluated using questionnaires on usability
• Evaluators may record how long it takes to complete particular tasks
Purpose:
• To measure how useful the product will be for the end-user in a specific context
• To evaluate user-friendliness
For: End-users, researchers, developers, managers, investors, vendors

Why evaluating translation quality is difficult
• No perfect standard exists for comparison
• Scoring is subjective (so several evaluators and texts are needed)
• The "training effect" can influence results
• Bilingual evaluators or human reference translations are usually required
• Different evaluations are needed depending on the use of MT output (e.g. filtering, gisting, information gathering, post-editing for internal or external use)

A brief history of MT evaluation: the 1950s and 1960s
• 1954: First public demonstration of MT (Georgetown University/IBM)
• Research in USA, Western Europe, Soviet Union and Japan
• 1966: ALPAC Report (funded by US Government sponsors of MT to advise on further R&D) …
  - advised against further investment in MT
  - concluded that MT was slower, less accurate and more expensive than human translation
  - recommended research into:
    - practical methods for evaluation of translations
    - evaluation of quality and cost of various sources of translations
    - evaluation of the relative speed and cost of various sorts of machine-aided translation

A brief history of MT evaluation: the 1970s and 1980s
• 1976: EC bought a version of Systran and began to develop its own system, Eurotra, in 1978
• EC needed recommendations for evaluation: the “Van Slype report” (Critical Methods for Evaluating the Quality of Machine Translation), published in 1979
Aims of the report:
  - to establish the state of MT evaluation
  - to advise the EC on evaluation methodology and research
  - to provide examples of evaluation methods and their applications
Available online: http://issco-www.unige.ch/projects/isle/van-slype.pdf
• 1980s: Greater need for MT evaluation: MT attracting commercial interest, tailor-made systems designed for large corporations
• 1987: First MT Summit (opportunity to publish research on evaluation)

A brief history of MT evaluation: the 1990s
• 1992: AMTA workshop: MT Evaluation: Basis for Future Directions
  - JEIDA Report presented (Japan Electronic Industry Development Association): Methodology and Criteria on Machine Translation Evaluation. This stressed the importance of judging systems according to context of use and user requirements
• 1993: Issue of the Machine Translation journal devoted to MT evaluation
• 1992-1994: DARPA (Defense Advanced Research Projects Agency) MT evaluations
• 1993-1999: EAGLES (Expert Advisory Group on Language Engineering) set up by the European Commission
  - One of its aims: To propose standards, guidelines and recommendations for good practice in the evaluation of language engineering products
  - The EAGLES 7-step recipe for evaluation: http://www.issco.unige.ch/projects/eagles/ewg99/7steps.html

The ISLE Project and FEMTI
International Standards for Language Engineering
Framework for the Evaluation of Machine Translation in ISLE
• ISLE Evaluation Working Group set up in response to EAGLES
• Funded by EC, National Science Foundation of the USA and Swiss Government
• Established a classification scheme of quality characteristics of MT systems and a set of measures to use when evaluating these characteristics
• Scheme designed to help developers, users and evaluators to select evaluation criteria according to their needs
• Workshops organised to involve hands-on evaluation exercises to test reliability of metrics
• Latest research involves the investigation of automated evaluation methods: quicker and cheaper

Evaluation methods: Carroll 1966
Source: Carroll, J. B. (1966). An experiment in evaluating the quality of translations. In Pierce, J. (Chair). (1966). Language and Machines: Computers in Translation and Linguistics. Report by the Automatic Language Processing Advisory Committee (ALPAC). Publication 1416. National Academy of Sciences National Research Council, pp. 67-75. http://www.nap.edu/books/ARC000005/html/
• Evaluation of scientific Russian texts translated into English
• 3 human translations and 3 machine translations of 4 texts evaluated
• Evaluators: 18 monolingual English speakers and 18 native English speakers with good understanding of scientific Russian
Intelligibility:
• Each sentence scored on a 9-point scale with no reference to source text
Informativeness (fidelity):
• Original Russian sentences rated for informativeness compared with the translation
• Monolinguals used human reference translations instead of source texts for comparison

Evaluation methods: Carroll 1966
Extracts from 9-point intelligibility scale
9. Perfectly clear and intelligible. Reads like ordinary text: has no stylistic infelicities
5. The general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands. Poor word choice, grotesque syntactic arrangement, untranslated words, and similar phenomena are present, but constitute mainly "noise" through which the main idea is still perceptible
1. Hopelessly unintelligible. It appears that no amount of study and reflection would reveal the thought of the sentence.

Evaluation methods: Carroll 1966
Rating original sentences
Extracts from 10-point informativeness scale
9. Extremely informative. Makes "all the difference in the world" in comprehending the meaning intended. (A rating of 9 should always be assigned when the original completely changes or reverses the meaning conveyed by the translation)
4. In contrast to 3, adds a certain amount of information about the sentence structure and syntactical relationships; it may also correct minor misapprehensions about the general meaning of the sentence or the meaning of individual words
0. The original contains, if anything, less information than the translation. The translator has added certain meanings, apparently to make the passage more understandable.

Evaluation methods: Crook & Bishop 1979
Source: Crook & Bishop (reported by T C Halliday). Measurement of readability by the cloze test. In Van Slype, G. (1979). Critical Methods for Evaluating the Quality of Machine Translation. Prepared for the European Commission Directorate General Scientific and Technical Information and Information Management. Report BR 19142. Bureau Marcel van Dijk, p. 65.
• Evaluators rate the readability of translations using a cloze test
• Human and machine translations produced
• Every eighth word of machine translation omitted
• Evaluators fill in the gaps
• The more intelligible the MT output, the easier the “test” is to complete
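
The cloze procedure above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the original study: the function names, the gap marker and the exact-match scoring rule are all assumptions, since the slide states only that every eighth word is omitted and that evaluators fill in the gaps.

```python
def make_cloze(text, n=8, gap="____"):
    """Blank out every n-th word of `text`.

    Returns the gapped text and the list of removed words.
    Words are split on whitespace, so punctuation stays attached
    to the word it follows (a simplification).
    """
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = gap
    return " ".join(words), answers


def cloze_score(answers, responses):
    """Proportion of gaps restored exactly (case-insensitive).

    Exact-match scoring is an assumption; a real study might also
    accept synonyms.
    """
    if not answers:
        return 0.0
    correct = sum(a.lower() == r.lower() for a, r in zip(answers, responses))
    return correct / len(answers)
```

A higher score on the gapped MT output would then be read as a sign of more readable output, in line with the last bullet above.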

Evaluation methods: Sinaiko 1979
Source: Sinaiko, H. W. Measurement of usefulness by performance test. In Van Slype, G. (1979). Critical Methods for Evaluating the Quality of Machine Translation. Prepared for the European Commission Directorate General Scientific and Technical Information and Information Management. Report BR 19142. Bureau Marcel van Dijk, p. 91.
• Aim: to evaluate the English-Vietnamese LOGOS system
• All source texts contained instructions
• Evaluators (native speakers of TL) use machine translated instructions to perform tasks
• Errors in performance were measured (weighting system used)

Evaluation methods: Nagao 1985
Source: Nagao, M., Tsujii, J. & Nakamura, J. (1985). The Japanese government project for machine translation. In Computational Linguistics 11, 91-109.
• Aim: to test the feasibility of using MT to translate abstracts of scientific papers
• 1,682 sentences from a Japanese scientific journal were machine translated into English
• Intelligibility:
  - 2 native speakers of English (with no knowledge of Japanese) scored each sentence using a 5-point scale
• Accuracy:
  - 4 Japanese-English translators evaluated how much of the meaning of the original text was conveyed in the MT output

Evaluation methods: Nagao 1985
Extracts from 5-point intelligibility scale
1. The meaning of the sentence is clear, and there are no questions. Grammar, word usage, and style are all appropriate, and no rewriting is needed.
3. The basic thrust of the sentence is clear, but the evaluator is not sure of some detailed parts because of grammar and word usage problems. The problems cannot be resolved by any set procedure; the evaluator needs the assistance of a Japanese evaluator to clarify the meaning of those parts in the Japanese original.
5. The sentence cannot be understood at all. No amount of effort will produce any meaning.

Evaluation methods: Nagao 1985
Extracts from 7-point accuracy scale
0. The content of the input sentence is faithfully conveyed to the output sentence. The translated sentence is clear to a native speaker and no rewriting is needed.
3. While the content of the input sentence is generally conveyed faithfully to the output sentence, there are some problems with things like relationships between phrases and expressions, and with tense, voice, plurals, and the positions of adverbs. There is some duplication of nouns in the sentence.
6. The content of the input sentence is not conveyed at all. The output is not a proper sentence; subjects and predicates are missing. In noun phrases, the main noun (the noun positioned last in the Japanese) is missing, or a clause or phrase acting as a verb and modifying a noun is missing.
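
The scales quoted in this lecture run in different directions: on Nagao's scales the lowest number is the best rating (1 for intelligibility, 0 for accuracy), whereas on Carroll's intelligibility scale and the DARPA adequacy scale the highest number is best. The sketch below shows one possible way to put such ratings on a common 0-to-1 footing before comparing them. The function name and the linear rescaling are assumptions for illustration; treating these ordinal scales as evenly spaced is a simplification that none of the cited studies is claimed to make.

```python
def normalise(score, best, worst):
    """Linearly rescale a rating so that `best` maps to 1.0 and `worst` to 0.0.

    Works whether the scale ascends (best > worst) or descends (best < worst).
    Assumes equal spacing between scale points, which is a simplification.
    """
    if best == worst:
        raise ValueError("best and worst must differ")
    low, high = min(best, worst), max(best, worst)
    if not low <= score <= high:
        raise ValueError("score lies outside the scale")
    return (score - worst) / (best - worst)


# Scale end points as quoted in the slides: (best, worst)
NAGAO_INTELLIGIBILITY = (1, 5)
NAGAO_ACCURACY = (0, 6)
CARROLL_INTELLIGIBILITY = (9, 1)
DARPA_ADEQUACY = (5, 1)

# The mid-point of Nagao's intelligibility scale
mid = normalise(3, *NAGAO_INTELLIGIBILITY)
```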

Evaluation methods: DARPA 1992-4
Adequacy, fluency and informativeness
Sources:
White, J., O'Connell, T., O’Mara, F.: The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proceedings of the 1994 Conference, Association for Machine Translation in the Americas, Columbia, Maryland (1994)
White, J. (Forthcoming). How to evaluate Machine Translation. In H. Somers (ed.) Machine translation: a handbook for translators. Benjamins, Amsterdam.
• Aim: to compare prototype systems funded by DARPA
• Evaluators: 100 monolingual native English speakers
• Largest evaluation resulted in:
  - corpus of 100 news articles (of c. 400 words) in each SL: French, Spanish and Japanese
  - 2 English human translations of each
  - English machine translations of each text by several systems
  - Detailed evaluation results

Evaluation methods: DARPA 1992-4
DARPA: Adequacy
Segments of MT output were compared with equivalent human reference translations and scored on a 5-point scale according to how much of the original content was preserved (regardless of imperfect English)
5 – All meaning expressed in the source fragment appears in the translation fragment
4 – Most of the source fragment meaning is expressed in the translation fragment
3 – Much of the source fragment meaning is expressed in the translation fragment
2 – Little of the source fragment meaning is expressed in the translation fragment
1 – None of the meaning expressed in the source fragment is expressed in the translation fragment
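
As an illustration of how segment-level ratings on this 5-point scale might be summarised per system, here is a minimal sketch. The slide does not say how DARPA aggregated the scores, so taking the arithmetic mean is an assumption, and the function name and the example system labels are hypothetical.

```python
# Short labels for the five scale points, taken from the slide wording
ADEQUACY_LABELS = {5: "All", 4: "Most", 3: "Much", 2: "Little", 1: "None"}


def mean_adequacy(ratings):
    """Average the 1-5 adequacy ratings collected for each system.

    `ratings` maps a system name to a list of integer scores, one per
    rated segment. Averaging is an assumed aggregation, not a documented
    part of the DARPA methodology.
    """
    result = {}
    for system, scores in ratings.items():
        if not scores:
            raise ValueError(f"no ratings for {system}")
        invalid = [s for s in scores if s not in ADEQUACY_LABELS]
        if invalid:
            raise ValueError(f"ratings outside the 1-5 scale: {invalid}")
        result[system] = sum(scores) / len(scores)
    return result


# Hypothetical ratings for two hypothetical systems
example = mean_adequacy({"system_a": [5, 4, 3], "system_b": [2, 2]})
```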

Evaluation methods: DARPA 1992-4
DARPA: Fluency
• Each sentence scored for intelligibility without reference to the source text or human reference translation
• Simple 5-point scale used
DARPA: Informativeness
• Designed to test whether enough information was conveyed in MT output to enable evaluators to answer questions on its content
• Each translation accompanied by 6 multiple-choice questions on content
• 6 choices for each question
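
The informativeness test lends itself to a simple score: the fraction of the multiple-choice questions an evaluator answers correctly after reading the MT output. The sketch below assumes that scoring rule; the function name and the example answer key are hypothetical. With 6 choices per question, an evaluator guessing at random would be expected to get about one question in six right, which gives a baseline against which to read the score.

```python
CHOICES_PER_QUESTION = 6
CHANCE_LEVEL = 1 / CHOICES_PER_QUESTION  # expected score from random guessing


def informativeness_score(answer_key, responses):
    """Fraction of comprehension questions answered correctly.

    `answer_key` and `responses` are equal-length sequences of choice
    labels, one per question.
    """
    if len(answer_key) != len(responses):
        raise ValueError("one response is needed per question")
    if not answer_key:
        raise ValueError("at least one question is needed")
    correct = sum(k == r for k, r in zip(answer_key, responses))
    return correct / len(answer_key)


# Hypothetical key and responses for the 6 questions on one translation
key = ["A", "C", "F", "B", "D", "E"]
given = ["A", "C", "F", "B", "D", "A"]
score = informativeness_score(key, given)
```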

DARPA-inspired evaluation methods…
Many subsequent evaluations have followed in the footsteps of DARPA…
Fluency and Adequacy using 5-point scales:
Source: Elliott, D., Atwell, E., Hartley, A.: Compiling and Using a Shareable Parallel Corpus for Machine Translation Evaluation. In: Proceedings of the Workshop on The Amazing Utility of Parallel and Comparable Corpora, Fourth International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal (2004)
Usability 5-point scale

Where next?
• Beyond similarity metrics
• FEMTI offers a rich palette of techniques
• Beyond adequacy and fluency
• Too generic / abstract for specific tasks?
• Consider MT output in its own right
• Beyond conventional uses of MT as surrogate human translation (emulation)
• MT as a component in a workflow

Restore a sense of purpose
Texts are meant to be used. There are no absolute standards of translation quality but only more or less appropriate translations for the purpose for which they are intended. (Sager 1989: 91)

Revisit MT proficiency (White 2000)
• View MT output as a genre
• Characterise inadequacy, disfluency, ill-formedness
• Embed MT and adapt (to) the environment
  - in IE, CLIR, CLQA, Speech2Speech
  - in pre- and post-editing

Any questions…?
