
Exploding the Myth: the gerund in machine translation


Presentation Transcript


  1. Exploding the Myth: the gerund in machine translation Nora Aranberri

  2. Background • Nora Aranberri • PhD student at CTTS (Dublin City University) • Funded by Enterprise Ireland and Symantec (Innovation Partnerships Programme) • Symantec • Software publisher • Localisation requirements • Translation – Rule-based machine translation system (Systran) • Documentation authoring – Controlled language (CL checker: acrocheck™) • Project: CL checker rule refinement

  3. The Myth • The gerund is handled badly by MT systems and should be avoided • Sources: translators, post-editors, scholars • Considered a translation issue for MT due to its ambiguity • Bernth & McCord, 2000; Bernth & Gdaniec, 2001 • Addressed by CLs • Adriaens & Schreurs, 1992; Wells Akis, 2003; O’Brien, 2003; Roturier, 2004

  4. What is a gerund? • The -ing form can function as a gerund, a participle, or part of a continuous tense while keeping the same surface form • Examples • GERUND: Steps for auditing SQL Server instances. • PARTICIPLE: When the job completes, BACKINT saves a copy of the Backup Exec restore logs for auditing purposes. • CONTINUOUS TENSE: Server is auditing and logging. • Conclusion: gerunds and participles can be difficult for MT systems to differentiate.
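The ambiguity can be made concrete with a quick experiment that is not part of the original study: an off-the-shelf POS tagger labels the -ing word in each of the three example sentences, and taggers typically vacillate between VBG, NN and JJ regardless of the word's actual function. A minimal sketch, assuming NLTK is installed with its standard tokenizer and tagger data:

```python
# Tag the three example -ing uses and print the tag each -ing word receives.
# Setup assumption: nltk installed, with 'punkt' and
# 'averaged_perceptron_tagger' data downloaded.
import nltk

sentences = [
    "Steps for auditing SQL Server instances.",                        # gerund
    "BACKINT saves a copy of the restore logs for auditing purposes.", # participle
    "Server is auditing and logging.",                                 # continuous tense
]

for sent in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    # Print every -ing token with its assigned tag (VBG, NN or JJ in practice).
    print([(w, t) for w, t in tagged if w.lower().endswith("ing")])
```

This instability is precisely why the extraction pattern on the next slide accepts all three tags.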

  5. Methodology: creating the corpus • Initial corpus • Risk management component texts • 494,618 words • uncontrolled • Structure studied • Preposition or subordinate conjunction + -ing • Extraction of relevant segments • acrocheck™ (the CL checker) set to flag the pattern IN + VBG|NN|JJ “-ing” • 1,857 sentences isolated
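A rough approximation of that extraction step, assuming an NLTK-based tagger rather than the actual acrocheck™ rule, might look like this:

```python
# Flag sentences where a preposition/subordinating conjunction (tag IN) is
# immediately followed by an -ing word tagged VBG, NN or JJ, mirroring the
# pattern IN + VBG|NN|JJ "-ing" described above. Illustrative sketch only.
import nltk

def flag_in_plus_ing(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "IN" and t2 in {"VBG", "NN", "JJ"} and w2.lower().endswith("ing")
    ]

print(flag_in_plus_ing("The job completes before auditing the SQL Server instances."))
# expected along the lines of: [('before', 'auditing')]
```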

  6. Methodology: translation • Machine-translate the corpus into the target languages • MT system used: Systran Server 5.05 • Dictionaries • No project-specific dictionaries created • Systran’s built-in computer science dictionary applied • Languages • Source language: English • Target languages: Spanish, French, German and Japanese

  7. Methodology: evaluation (1) • Evaluators • only one evaluator per target language • native speakers of the target languages • translators / MA students with experience in MT • Evaluation format

  8. Methodology: evaluation (2) • Analysis of the relevant structure only • Questions: • Q1: is the structure correct? • Q2: is the error due to misinterpretation of the source, or is the target poorly generated? • Both are “yes/no” questions.
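One way to picture the evaluation bookkeeping is a small record per judged sentence; the data structures below are an assumption for illustration, not the study's actual tooling:

```python
# Hypothetical record of the two yes/no judgements, tallied per
# preposition/conjunction to yield the correctness rates charted later.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Judgement:
    prep: str            # triggering preposition/conjunction, e.g. "for"
    q1_correct: bool     # Q1: is the structure correct?
    source_error: bool   # Q2 (when Q1 is no): True = source misanalysis,
                         # False = poor target generation

def correctness_by_prep(judgements):
    total, correct = Counter(), Counter()
    for j in judgements:
        total[j.prep] += 1
        correct[j.prep] += j.q1_correct
    return {p: correct[p] / total[p] for p in total}

sample = [Judgement("for", True, False), Judgement("for", False, False),
          Judgement("when", False, True)]
print(correctness_by_prep(sample))   # {'for': 0.5, 'when': 0.0}
```

Per-preposition correctness rates computed this way are what the per-language result slides that follow report.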

  9. Results: prepositions / subordinate conjunctions

  10. Results: correctness for Spanish

  11. Results: correctness for French

  12. Results: correctness for German

  13. Results: correctness for Japanese

  14. Significant results

  15. Results: correlation of problematic structures • The most problematic structures seem to correlate strongly across languages • The top 6 prep/conj account for >65% of errors
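The coverage figure can be reproduced with a few lines; the error counts below are hypothetical placeholders, not the study's per-preposition figures:

```python
# Cumulative error share of the n most error-prone prepositions/conjunctions,
# as in "top 6 prep/conj account for >65% of errors". Counts are invented
# solely to make the sketch runnable.
def top_n_error_share(error_counts, n):
    ranked = sorted(error_counts.values(), reverse=True)
    return sum(ranked[:n]) / sum(ranked)

hypothetical = {"by": 120, "for": 95, "when": 80, "after": 60,
                "before": 55, "without": 40, "of": 20, "while": 15}
print(f"{top_n_error_share(hypothetical, 6):.0%}")   # 93% with these toy counts
```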

  16.-21. Analysis and generation errors

  22. Source and target error distribution • Target-generation errors appear to predominate across languages • The prep/conj with the highest error rates that are common to 3 or 4 target languages account for 43-54% of source errors and 48-59% of target errors

  23. Conclusions • Overall success rate between 70% and 80% for all languages • Target-language generation errors outnumber errors due to misinterpretation of the source • Great diversity of prepositions/subordinate conjunctions, with varying frequencies of occurrence • Strong correlation of results across languages

  24. Next steps • Further evaluations to consolidate results • 4 evaluators per language • Present sentences to the evaluators out of alphabetical order by preposition/conjunction • Note the results for the French “when” • Make these findings available to the writing teams • Address the most prominent issues • Source issues • controlled language or pre-processing • Formulate more specific rules in acrocheck™ to handle the most problematic structures/prepositions and reduce false positives (see the sketch below) • Standardise structures with low frequencies • Target issues • post-processing or MT improvements
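As a sketch of what a more targeted rule could look like: instead of flagging every preposition + -ing pattern, flag only the structures the evaluation showed to be problematic. The preposition list below is hypothetical, and a production rule would use POS tags as acrocheck™ does rather than a bare regex:

```python
# Flag only prep/conj + -ing combinations drawn from a list of known-bad
# triggers, cutting false positives for writers. List is illustrative only;
# in practice it would come from the evaluation results.
import re

PROBLEM_PREPS = {"when", "while", "before", "after", "by", "without"}

PATTERN = re.compile(
    r"\b(" + "|".join(PROBLEM_PREPS) + r")\s+\w+ing\b", re.IGNORECASE
)

def flag(sentence):
    return [m.group(0) for m in PATTERN.finditer(sentence)]

print(flag("Back up the database before upgrading, or data loss may occur when restarting."))
# ['before upgrading', 'when restarting']
```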

  25. References • Adriaens, G. and Schreurs, D. (1992) ‘From COGRAM to ALCOGRAM: Toward a Controlled English Grammar Checker’, 14th International Conference on Computational Linguistics, COLING-92, Nantes, France, 23-28 August 1992, 595-601. • Bernth, A. and Gdaniec, C. (2001) ‘MTranslatability’, Machine Translation 16: 175-218. • Bernth, A. and McCord, M. (2000) ‘The Effect of Source Analysis on Translation Confidence’, in White, J. S., ed., Envisioning Machine Translation in the Information Future: 4th Conference of the Association for Machine Translation in the Americas, AMTA 2000, Cuernavaca, Mexico, 10-14 October 2000, Springer: Berlin, 89-99. • O’Brien, S. (2003) ‘Controlling Controlled English: An Analysis of Several Controlled Language Rule Sets’, in Proceedings of the 4th Controlled Language Applications Workshop (CLAW 2003), Dublin, Ireland, 15-17 May 2003, 105-114. • Roturier, J. (2004) ‘Assessing a set of Controlled Language rules: Can they improve the performance of commercial Machine Translation systems?’, in ASLIB Conference Proceedings, Translating and the Computer 26, London, 18-19 November 2004, 1-14. • Wells Akis, J. and Sisson, R. (2003) ‘Authoring translation-ready documents: is software the answer?’, in Proceedings of the 21st Annual International Conference on Documentation, SIGDOC 2003, San Francisco, CA, USA, 12-15 October 2003, 38-44.

  26. Thank you! e-mail: nora.aranberrimonasterioATdcu.ie
