FF Zagreb – Informacijske znanosti. Evaluation of Free Online Machine Translations for Croatian-English and English-Croatian Language Pairs. Sanja Seljan, sseljan@ffzg.hr, University of Zagreb - Faculty of Humanities and Social Sciences, Department of Information Sciences, Croatia
FF Zagreb – Informacijske znanosti • Evaluation of Free Online Machine Translations for Croatian-English and English-Croatian Language Pairs • Sanja Seljan, sseljan@ffzg.hr, University of Zagreb - Faculty of Humanities and Social Sciences, Department of Information Sciences, Croatia • Marija Brkić, mbrkic@uniri.hr, University of Rijeka, Department of Informatics, Croatia • Vlasta Kučiš, asta.kucis@siol.net, University of Maribor, Department of Translation Studies, Slovenia
Aim • Evaluation of texts from four domains (city description, law, football, monitors) • Cro-En: translated by four free online translation services (Google Translate, Stars21, InterTran and Translation Guide) • En-Cro: translated by Google Translate • Measuring inter-rater agreement (Fleiss' kappa) • Influence of error types on the criteria of fluency and adequacy • Pearson's correlation
Introduction • MT evaluation • Experimental study • Translation tools • Test set description • Evaluation • Error analysis • Correlations • Conclusion
I INTRODUCTION • Increased use of online translation tools in recent years, even among less widely spoken languages • Desirable: moderate to good quality translations • Evaluation from the user's perspective • Tools and evaluation mainly for widely spoken languages • Possible use: gisting translations, information retrieval, e.g. question-answering systems • 1976: Systran - first MT system for the Commission of the European Communities + online tool + different versions • 1997: Babel Fish - first online translation tool, using Systran technology • Important: realistic expectations
Studies for popular languages • Considerable difference in translation quality depending on the language pair • 2010 - German-French (GT, ProMT, WorldLingo) • 2011 - three popular online tools • 2006 - Spanish-English (introductory textbook) • 2008 - 13 languages into English (6 tools: Babel Fish, Google Translate, ProMT, SDL free translator, Systran, WorldLingo)
MT evaluation – important in research and product design • Measure system performance • Identify weak points and adjust parameter settings • Language-independent algorithms (BLEU, NIST) • Better metric – closer to human evaluation • Need for qualitative evaluation of different linguistic phenomena
II EXPERIMENTAL STUDY • Evaluation of free online translation services (FTS) – from the user's perspective • Evaluators: undergraduate and graduate students of languages, linguistics and information sciences attending courses on language technologies at the University of Zagreb, Faculty of Humanities and Social Sciences
Test set description • Texts from 4 domains (city description, law, football, monitors) • Approx. 7-9 sentences per domain (17.8 words per sentence) • Cro-En, En-Cro
Evaluators • Cro-En: 48 students, final year of undergraduate and graduate levels • En-Cro: 50 students, native speakers • 75% of students attended language technology course(s)
Evaluation – before pilot study • [Chart: average grades for free language resources on the Internet; panels: Croatian tools/resources, tools/resources in general]
Evaluation • Manual evaluation • Fluency (how fluent the translation is in the target language) • Adequacy (how much of the information is adequately transmitted) • Evaluation enriched by translation error analysis: • morphological errors • untranslated words • lexical errors and word omissions • syntactic errors
Tools
Cro-En translations • Google Translate (GT) - http://translate.google.com • Stars21 (S21) - http://stars21.com/translator • InterTran (IT) - http://transdict.com/translators/intertran.html • Translation Guide (TG) - http://www.translation-guide.com
En-Cro translations • obtained from Google Translate
Google Translate • translation service provided by Google Inc. • statistical MT based on large corpora • supports 57 languages, Croatian since 2008
Stars21 (S21) service • powered by GT • translations not always the same as GT's
InterTran • powered by NeuroTran and WordTran • sentence-by-sentence and word-by-word translation
Translation Guide • powered by IT • translations differ from IT's
Results - Cro-En • Grades are either low (TG and IT) or high (S21 and GT) compared with the average value (3.04) • S21 (4.66) : GT (4.62) – city description, legal • GT – football, monitors • Best average result – legal domain, then monitors and football • Lowest – city description (the freest in style)
Results - En-Cro • Lower average results than in the reverse direction: football (3.75 : 4.84), law, monitors • Higher average grade in city description (shorter sentences, mostly nominative constructions, frequent terms) • Football domain - specific terms, non-nominative constructions
Error analysis
En-Cro • Translations offered by GT and S21 are very similar, although not identical • TG and IT – difference in the number of untranslated words • TG does not recognize words with diacritics
Cro-En • Highest number of errors: lexical errors, including errors in style (av. 2.44) • Untranslated words (1.83), morphological errors (1.75), syntactic errors (1.38) • Lowest score and highest number of errors - football domain (mostly lexical errors and untranslated words) • Best score – city description domain (lexical errors) • Lowest number of errors – legal domain (evenly distributed)
Morphological errors – mostly in the monitors domain, fewest in city description (dominant value 1) • Untranslated words - by far the most in the football domain • Translation grades - mostly influenced by untranslated words
Dominant values (modes) • Morphological errors: 1 in city description and monitors, 3 in legal and football • Lexical errors: 1 in city description, higher in the other domains • Untranslated words: 1 in all domains • Syntactic errors: 1 in all domains except football (2-3)
Pearson's correlation • A smaller number of errors raises the average grade • Correlation between error types and the criteria of fluency and adequacy • Fluency is more affected by an increase in lexical and syntactic errors • Adequacy is more affected by untranslated words
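As a minimal sketch of how such a correlation can be computed, the snippet below implements Pearson's r between an error count and a grade. The function name `pearson_r` and all numbers are illustrative assumptions, not data or code from the study; a strongly negative r would correspond to the pattern on the slide (more untranslated words, lower adequacy).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL per-sentence values, invented for illustration only.
untranslated_words = [0, 1, 1, 2, 3, 4]
adequacy_grades = [5, 5, 4, 3, 2, 1]

r = pearson_r(untranslated_words, adequacy_grades)
print(f"r = {r:.2f}")  # negative: grades fall as untranslated words increase
```

The same function can be applied to each error type against fluency and against adequacy to compare which error type is most strongly associated with which criterion.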
Fleiss' kappa • Assesses the reliability of agreement among raters when they assign ratings to the sentences • Indicates the extent to which the observed agreement among raters exceeds what would be expected if all raters made their ratings completely at random • Score: 1 means perfect agreement; 0 or below means no agreement beyond chance
Interpretation • 0.0-0.20 slight agreement • 0.21-0.40 fair agreement • 0.41-0.60 moderate agreement • 0.61-0.80 substantial agreement • 0.81-1.00 almost perfect agreement
Notation • N – total number of subjects • n – number of raters per subject • P_i – extent to which raters agree on the i-th subject • j – index over categories
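The definition above can be sketched in a few lines. This is an illustrative implementation of the standard Fleiss' kappa formula, not the authors' code; the names `fleiss_kappa` and `agreement_label` and the sample ratings are assumptions. Here `counts[i][j]` is the number of raters who gave sentence i the grade j, with N sentences and n raters per sentence.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for an N x k table; counts[i][j] = raters putting subject i in category j."""
    N = len(counts)
    if N == 0:
        raise ValueError("need at least one subject")
    n = sum(counts[0])  # raters per subject
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    k = len(counts[0])
    # p_j: proportion of all assignments that fell into category j.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # P_i: extent to which raters agree on the i-th subject.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N               # mean observed agreement
    P_e = sum(pj * pj for pj in p)   # agreement expected by chance
    if P_e == 1:
        raise ValueError("kappa is undefined when all ratings fall into one category")
    return (P_bar - P_e) / (1 - P_e)


def agreement_label(kappa):
    """Map kappa to the verbal scale listed on the slide."""
    if kappa < 0:
        return "poor"  # below-chance agreement; this band is not listed on the slide
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# HYPOTHETICAL ratings: 4 sentences, 5 raters each, grades 1-5 as categories.
ratings = [
    [0, 0, 0, 1, 4],
    [0, 0, 1, 3, 1],
    [2, 2, 1, 0, 0],
    [0, 0, 0, 0, 5],
]
kappa = fleiss_kappa(ratings)
print(f"kappa = {kappa:.2f} ({agreement_label(kappa)} agreement)")
```

Computing kappa separately per domain and per translation service, as the study does, only requires building one such table for each subset of sentences.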
Cro-En translations • Relatively high level of agreement among raters per domain and per system • moderate agreement, 0.41-0.60 (IT translation service) • substantial agreement, 0.61-0.80 (S21 and GT) • almost perfect agreement, 0.81-1.00 (TG – the worst tool)
En-Cro translations - inter-rater agreement per domain • lowest level of agreement in the domains of football and law (0.40-0.49, fair to moderate) – longer and more complex sentences • substantial agreement (0.61-0.80) in city description • the level of inter-rater agreement is lower for En-Cro translations in all domains
Conclusion • Evaluation study of MT in 4 domains • Cro-En – 4 free online translation services • En-Cro translations – by Google Translate
Evaluators' profile • High interest in the use of translation resources and tools • Critical evaluation
System evaluation • Almost perfect agreement in the ranking of TG as the worst translation service • Substantial agreement for the S21 and GT services • Moderate agreement for IT, which performed slightly better than TG
Cro-En translations • S21 and GT (4.63 to 4.84) - football, law and monitors • city description - Cro-En lower than En-Cro
En-Cro direction – by GT • lower grades than in the opposite direction (specific terms, non-nominative constructions, multi-word units) • except the city description domain, which contains mostly nominative constructions, frequent words and no specific terms
Error analysis • translation grades are mostly influenced by untranslated words (especially the criterion of adequacy) • morphological and syntactic errors affect grades to a smaller extent (fluency)
Google Translate service • used in both translation directions • harvests data from the Web; seems to be well trained and suitable for translating frequent expressions • does not perform well where language information is needed, e.g. for gender agreement and in multi-word (MW) expressions
Further research • Better quantitative analysis per domain • More detailed analysis of specific language phenomena