User-focused task-oriented MT evaluation for wikis: a case study

Federico Gaspari, Antonio Toral, and Sudip Kumar Naskar
School of Computing, Dublin City University, Dublin 9, Ireland
{fgaspari, atoral, snaskar}@computing.dcu.ie
Third Joint EM+ / CNGL Workshop, Luxembourg, 14 October 2011

Outline
• Introduction: the CoSyne project
• Related work
• Evaluation
  • framework, scenario, questionnaire
• Results and discussion
• Conclusions
• Future work

Introduction: CoSyne
• Aim: Synchronisation of multilingual wikis
• Consortium
  • 7 partners from Germany, Italy, the Netherlands and Ireland
  • 3 academic partners
    • University of Amsterdam (UvA)
    • Fondazione Bruno Kessler (FBK)
    • Dublin City University (DCU)
  • 1 research organization
    • Heidelberg Institute for Theoretical Studies (HITS)
  • 3 end-users
    • Deutsche Welle (DW)
    • Netherlands Institute for Sound and Vision (NISV)
    • Vereniging Wikimedia Nederland (VWN)

Introduction: CoSyne
• Techniques used by the CoSyne system:
  • MT
  • Textual entailment
  • Document structure modelling
  • Overlap synchronisation
  • Insertion point detection
• CoSyne MT system developed by UvA (Martzoukos and Monz, 2010)
• Language pairs covered in year 1: DE / IT / NL ↔ EN
• Focus of this user evaluation
  • CoSyne MT software to translate wiki entries DE→EN and NL→EN

Related work
• MT quality evaluation
  • fluency
  • adequacy
• Automatic MT evaluation metrics, esp. for SMT (Toral et al., 2011)
  • BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), etc.
  • no insight into the nature and severity of errors (e.g. for post-editing)
  • weak correlation with human judgement (Lin & Och, 2004)
• Usefulness of MT output and users' level of satisfaction
• Post-editing
  • effort (e.g. Allen, 2003; O'Brien, 2007; Specia & Farzindar, 2010)
  • gains vs. translating from scratch (e.g. O'Brien, 2005; Specia, 2011)

Evaluation framework
• User-focused task-oriented evaluation of MT in/for wikis
  • in close collaboration with end-users (DW, NISV)
• Accompanied by diagnostic evaluation
  • providing useful feedback to MT developers (UvA)
• Pilot study conducted just before month 18 of the 36-month project
  • full-scale final evaluation planned at the very end of the project

Evaluation scenario
• Protocol for evaluation agreed between DCU and end-users
• DW and NISV staff involved: editors, translators, project managers
  • German-English and Dutch-English as their working languages
  • final users of the CoSyne system for wiki content synchronization
• Evaluation conducted on typical wiki entries for end-users
• Users asked to focus only on linguistic quality and level of usefulness of MT (disregarding other components of the CoSyne system)

Evaluation scenario
Deutsche Welle (DW): KalenderBlatt / Today in History

Evaluation scenario
Netherlands Institute for Sound and Vision (NISV): wiki

Evaluation scenario
• Time-tracking system was implemented
• Post-editing changes performed by the participants were logged
• Before the evaluation
  • participants given presentation and demo of the CoSyne system
  • preliminary experimentation with the CoSyne system for 1-3 hours

Evaluation questionnaire
• Written questionnaire administered on paper
  • available at http://www.computing.dcu.ie/~atoral/cosyne/quest.pdf
• Questions grouped into 6 parts focusing on different aspects
• Approximately 50 items using different formats
  • Likert scale, multiple choice and open questions
• Part A: basic demographic information about the respondents
• Part B: previous use of MT
• Part C: users' evaluation of the CoSyne MT system
• Part D: post-editing work
• Part E: general comments and feedback
• (Part F: usability and interaction design of the overall CoSyne system)

Results: demographics
• 10 users: 6 from DW, 4 from NISV
• 6 men and 4 women across DW and NISV
• Variety of roles: editors, authors, translators and project managers
• Average age: 34 (youngest 20, oldest 46)
• Average work experience: just over 3 years (min. 3 months, max. 10 years)

Results: background
• All 4 NISV staff were native speakers of Dutch
• 5 DW users were native speakers of German, plus 1 native speaker of Romanian fluent in German
• 80% of the participants self-rated their knowledge of English as upper-intermediate, 20% defined it as intermediate or excellent
• None of the respondents considered themselves bilingual

Results: previous use of MT
• 80% had used MT before our experiment
  • 7 for personal reasons, 6 for work (commonly for both purposes)
  • all but one had used Google Translate, 1 had tried Babel Fish, 2 both
• Language combinations used
  • 4 from EN into other languages
  • 6 into EN from a range of source languages
  • 5 language combinations not involving English
• 75% used MT for assimilation purposes vs. 25% for dissemination
• 62.5% had post-edited raw MT to obtain high-quality translations
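The percentages on this slide use two different bases, which is easy to miss. A minimal sketch of the arithmetic follows, assuming that 80% refers to all 10 participants and that 75%, 25% and 62.5% refer to the 8 respondents who had used MT before (62.5% of 10 would not be a whole number of people). The function name `share_to_count` is hypothetical and only illustrative.

```python
from fractions import Fraction

TOTAL_PARTICIPANTS = 10  # from the demographics slide
PRIOR_MT_USERS = 8       # assumption: 80% of the 10 participants


def share_to_count(percent, base):
    """Convert a reported percentage of `base` respondents into a head count."""
    count = Fraction(str(percent)) * base / 100
    if count.denominator != 1:
        raise ValueError(f"{percent}% of {base} is not a whole number of people")
    return int(count)


counts = {
    "used MT before": share_to_count(80, TOTAL_PARTICIPANTS),
    "assimilation": share_to_count(75, PRIOR_MT_USERS),
    "dissemination": share_to_count(25, PRIOR_MT_USERS),
    "post-edited raw MT": share_to_count(62.5, PRIOR_MT_USERS),
}

for label, count in counts.items():
    print(f"{label}: {count}")
```

Under these assumptions the slide's figures correspond to 8, 6, 2 and 5 respondents respectively.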

Results: previous use of MT
• Materials translated with MT by the 8 respondents
  • for study purposes (academic papers and uni-related texts): 3
  • business correspondence, personal or professional emails: 2
  • contracts and technical documents: 2
  • online articles: 2
  • websites: 2 ("the translations of Dutch sites to English were hilarious!", but not using the CoSyne MT system!)
  • Wikipedia content: 1

Results: previous use of MT
Quality of previously used MT systems on a 5-point scale (1 = very poor to 5 = very good)
• Overall the 8 respondents had a predominantly negative-to-neutral impression of MT quality before taking part in the evaluation of the CoSyne MT system, based on a 5-point Likert scale (average 2.8 / 5)

Results: CoSyne MT system
Quality and usefulness of the CoSyne MT system on a 5-point scale (1 = very poor to 5 = very good; cf. 2.8 for previously used MT systems)
• Average quality is medium (3 / 5), better than previous experience (2.8)
• Usefulness slightly higher than medium (3.3 / 5)

Results: CoSyne MT system
Is CoSyne MT faster than translating wiki entries into English from scratch? Rated on a 7-point scale (1 = strongly disagree to 7 = strongly agree)
• Average value higher than mid-point of the scale (4.6 / 7)
• In line with e.g. Plitt & Masselot (2010) and Flournoy & Rueppel (2010)
• From DE almost twice as good as from NL (due to style of wiki texts?)

Results: CoSyne MT system
MT quality broken down into accuracy, correctness, comprehensibility, readability and style, on a 7-point scale (1 = poor to 7 = excellent)
• We did not explain to users the subtle differences involved
• Only accuracy is approx. average (3.6 / 7), other criteria lower
• None of the average values particularly poor (DE always better than NL)

Results: post-editing CoSyne
• Amount of work, in terms of time and effort, to post-edit the MT output, on a 7-point scale (1 = short/small to 7 = long/large)
• Need to refer to source language while post-editing (frequency), on a 7-point scale (1 = never to 7 = always)

Results: post-editing CoSyne
Post-editing operations considered: insertion, deletion, substitution, reordering
• Severity of errors over post-editing operations, on a 7-point scale (1 = irrelevant to 7 = very serious)
• Frequency of errors over post-editing operations, on a 7-point scale (1 = absent to 7 = frequent)

Results: final comments
• Positive aspects:
  • good to have draft translation to work upon
  • integration in the wiki environment
  • potential to speed up the translation task
• Weaknesses:
  • translation quality needs improving, due to
    • wrong translation of pronouns
    • verbs frequently dropped
    • incorrect word order
    • mistranslated compounds
    • limited lexical coverage (OOV items are an issue)
• Good potential of the CoSyne system based on first prototype

Conclusions
• User-focused task-oriented questionnaire-based evaluation for MT used in wikis, supported by post-editing
• Evaluation of the first Y1 prototype of the CoSyne MT system for DE→EN and NL→EN
• Quality of the CoSyne MT system perceived by the users higher than that of previously used MT systems
• Post-editing effort is considered high, but users found it less time-consuming than translating from scratch
• Translations from German rated better than those from Dutch
  • contrasts with earlier findings (Toral et al., 2011)
  • further investigation into this discrepancy (meta-evaluation)

Future work
• Extend analysis looking into the post-editing logs, considering actual post-editing time (to estimate costs)
• Involve more users after pilot stage
• Include a control group (translating manually or other MT s/w)
• Investigate correlation between the post-editing carried out by the users and the results provided by TER and TERp (ins, del…)
• Use our linguistically-aware diagnostic evaluation tool (DELiC4MT) to monitor performance of the MT system on specific issues flagged up by the users
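To make the planned comparison with edit-based metrics more concrete, here is a minimal sketch of how insertions, deletions and substitutions between raw MT output and its post-edited version could be counted from the logs. It is a plain word-level Levenshtein alignment, not TER or TERp: it has no shift (reordering) operation and none of TERp's matching extensions. The function names and the example sentences are hypothetical and are not taken from the CoSyne data.

```python
def edit_operations(mt_tokens, pe_tokens):
    """Count word-level edits needed to turn the MT output into its post-edit."""
    n, m = len(mt_tokens), len(pe_tokens)
    # dist[i][j] = minimum edits to turn mt_tokens[:i] into pe_tokens[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if mt_tokens[i - 1] == pe_tokens[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete an MT word
                dist[i][j - 1] + 1,         # insert a post-edited word
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back through the table to recover one optimal set of operations.
    ops = {"ins": 0, "del": 0, "sub": 0}
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and mt_tokens[i - 1] == pe_tokens[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            if not same:
                ops["sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["del"] += 1
            i -= 1
        else:
            ops["ins"] += 1
            j -= 1
    return ops


def edit_rate(mt_tokens, pe_tokens):
    """Total edits divided by the length of the post-edited version."""
    ops = edit_operations(mt_tokens, pe_tokens)
    return sum(ops.values()) / len(pe_tokens)


# Hypothetical example pair (illustrative only).
mt = "he go to school yesterday".split()
pe = "he went to the school yesterday".split()
ops = edit_operations(mt, pe)
rate = edit_rate(mt, pe)
print(ops, round(rate, 3))
```

Per-operation counts of this kind could then be set against the users' questionnaire ratings for each operation; reordering would need a shift-aware metric such as TER itself.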

Thank you for your attention! Questions?