
Andrejs Vasiļjevs, Chairman of the Board, andrejs@tilde.com

Data is Core. Andrejs Vasiļjevs, Chairman of the Board, andrejs@tilde.com. LOCALIZATION WORLD, PARIS, JUNE 5, 2012. Language technology developer. Localization service provider. Leadership in smaller languages. Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania).





Presentation Transcript


  1. Data is Core. Andrejs Vasiļjevs, Chairman of the Board, andrejs@tilde.com. LOCALIZATION WORLD PARIS, JUNE 5, 2012

  2. Language technology developer • Localization service provider • Leadership in smaller languages • Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania) • 135 employees • Strong R&D team • 9 PhDs and candidates

  3. MT: machine translation

  4. disruptive INNOVATION

  5. MT paradigms. Rule-based MT: • High-quality translation in specialized domains • Requires highly qualified linguists, researchers and software developers • Time and resource consuming • Difficult to evolve. Statistical MT: • Translation and linguistic knowledge is derived from data • Relatively easy and quick to develop • Requires huge amounts of parallel and monolingual data • Translation quality is inconsistent and can differ dramatically from domain to domain
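The statistical paradigm above can be illustrated with the classic noisy-channel decision rule, argmax_e P(f|e)·P(e). A minimal sketch with invented toy probabilities; real systems such as Moses use log-linear models with many more features:

```python
# Toy phrase-table P(f|e) for one source phrase, and a toy language-model
# score P(e). All words and probabilities here are invented for illustration.
translation_model = {  # P(source | candidate)
    "house": 0.6,
    "building": 0.3,
    "home": 0.1,
}
language_model = {     # P(candidate) in the target context
    "house": 0.5,
    "building": 0.1,
    "home": 0.4,
}

def best_translation(candidates):
    """Pick argmax_e P(f|e) * P(e), the noisy-channel decision rule."""
    return max(candidates, key=lambda e: translation_model[e] * language_model[e])

print(best_translation(["house", "building", "home"]))  # "house": 0.6*0.5 = 0.30
```

The point of the sketch: the translation model alone would already pick "house", but in general the language model can overturn a strong translation score, which is why SMT quality tracks the quality of both data sources.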

  6. CHALLENGE

  7. 15 largest languages 50%

  8. domains

  9. one size fits all ?

  10. DATA

  11. JRC-Acquis: the total body of European Union law applicable in the EU Member States. http://langtech.jrc.it/JRC-Acquis.html

  12. DGT-TM: the DGT Multilingual Translation Memory of the Acquis Communautaire. http://langtech.jrc.it/DGT-TM.html

  13. Opus: parallel data collected from the Web by University of Uppsala. 90 languages, 3800 language pairs, 2.7B parallel units. http://opus.lingfil.uu.se

  14. Open European language resource infrastructure. http://www.meta-net.eu

  15. Data for SMT training

  16. PLATFORM

  17. Moses toolkit

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data

  18. build your own MT engine

  19. Tilde / Coordinator, LATVIA • University of Edinburgh, UK • Uppsala University, SWEDEN • Copenhagen University, DENMARK • University of Zagreb, CROATIA • Moravia, CZECH REPUBLIC • SemLab, NETHERLANDS

  20. Cloud-based self-service MT factory • Repository of parallel and monolingual corpora for MT generation • Automated training of SMT systems from specified collections of data • Users can specify particular training data collections and build customised MT engines from these collections • Users can also use the LetsMT! platform to tailor an MT system to their needs from their non-public data

  21. Resource Repository • Stores SMT training data • Supports different formats: TMX, XLIFF, PDF, DOC, plain text • Converts to a unified format • Performs format conversions and alignment
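The format handling described above can be sketched in a few lines: turning a TMX translation unit into (source, target) sentence pairs with Python's standard XML parser. The TMX fragment and language pair are invented for illustration; a real repository processes far larger files and more formats:

```python
import xml.etree.ElementTree as ET

# Minimal invented TMX fragment (TMX stores translation units <tu> with
# one <tuv> per language, each holding a <seg> with the text).
TMX = """<tmx version="1.4"><body>
  <tu>
    <tuv xml:lang="en"><seg>Click the Save button.</seg></tuv>
    <tuv xml:lang="lv"><seg>Noklikšķiniet uz pogas Saglabāt.</seg></tuv>
  </tu>
</body></tmx>"""

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_pairs(tmx_text, src="en", tgt="lv"):
    """Convert TMX translation units into (source, target) sentence pairs."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs

print(tmx_to_pairs(TMX))
```

The same pair list is what alignment and SMT training steps downstream would consume, regardless of whether the input arrived as TMX, XLIFF or plain text.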

  22. User-driven machine translation • Put users in control of their data • Fully public or fully private should not be the only choice • Data can be used for MT generation without exposing it • Empower users to create custom MT engines from their data

  23. Integration • Integration with CAT tools • Integration in web pages • Integration in web browsers • API-level integration
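A hedged sketch of what CAT-tool integration might look like in glue code: offer a fuzzy translation-memory match when one is close enough, otherwise fall back to an MT suggestion. `machine_translate`, the TM contents and the threshold are hypothetical placeholders, not the LetsMT! API:

```python
from difflib import SequenceMatcher

# Invented one-entry translation memory for illustration.
TM = {
    "Click the Save button.": "Noklikšķiniet uz pogas Saglabāt.",
}

def machine_translate(segment):
    """Stand-in for a real MT service call (hypothetical)."""
    return f"<MT for: {segment}>"

def suggest(segment, threshold=0.75):
    """Return the best fuzzy TM match above threshold, else an MT suggestion."""
    best, score = None, 0.0
    for src, tgt in TM.items():
        r = SequenceMatcher(None, segment, src).ratio()
        if r > score:
            best, score = tgt, r
    return best if score >= threshold else machine_translate(segment)

print(suggest("Click the Save button."))    # exact TM hit
print(suggest("Restart the application."))  # no close match, MT fallback
```

This mirrors the workflow in the evaluation slides: TM suggestions first, MT only for segments the TM cannot cover.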

  24. Integration of MT in SDL Trados

  25. Use case: FORTERA

  26. EVALUATION

  27. Previous Work • Keyboard monitoring of post-editing (O'Brien, 2005) • Productivity of MS Office localization (Schmidtke, 2008): 5-10% productivity gain for SP, FR, DE • Adobe (Flournoy and Duran, 2009): 22%-51% productivity increase for RU, SP, FR • Autodesk Moses SMT system (Plitt and Masselot, 2010): 74% average productivity increase for FR, IT, DE, SP

  28. Evaluation at Tilde • Latvian: about 1.6M native speakers; highly inflectional, ~22M possible word forms in total; official EU language • Tilde English-Latvian MT system • IT / software localization domain • Evaluation of translators' productivity

  29. English-Latvian data

  30. MT Integration into Localization Workflow: Evaluate original / assign translator and editor → Analyze against TMs → MT-translate new sentences → Translate using translation suggestions from TMs and MT → Evaluate translation quality / Edit → Fix errors → Ready translation

  31. Evaluation of Productivity • The key interest of the localization industry is to increase the productivity of the translation process while maintaining the required quality level • Productivity was measured as the translation output of an average translator in words per hour • 5 translators participated in the evaluation, including both experienced and new translators
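The productivity metric above (words per hour) reduces to simple arithmetic; a sketch with invented numbers, not the study's raw data:

```python
def productivity_gain(words_baseline, hours_baseline, words_mt, hours_mt):
    """Percent change in words/hour when MT suggestions are added."""
    base = words_baseline / hours_baseline   # baseline words per hour
    mt = words_mt / hours_mt                 # words per hour with MT assistance
    return (mt - base) / base * 100

# Toy figures: 500 words/h without MT vs 660 words/h with MT.
print(round(productivity_gain(1000, 2.0, 1320, 2.0), 1))  # 32.0
```

Measured this way, a figure like "32.9% productivity increase" means translators produced roughly a third more reviewed output per hour, which is why the quality evaluation alongside it matters: the gain only counts if the error score stays within the accepted grade.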

  32. Evaluation of Quality • Performed by human editors as part of their regular QA process • The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator • Comparison to a reference is not part of this evaluation • Tilde's standard QA assessment form was used, covering the following text quality areas: • Accuracy • Spelling and grammar • Style • Terminology

  33. QA Grades: Tilde Localization QA assessment applied in the evaluation

  34. Evaluation data • 54 documents in IT domain • 950-1050 adjusted words in each document • Each document was split in half: the first part was translated using suggestions from TM only; the second half was translated using suggestions from both TM and MT

  35. Latvian 32.9%* productivity % * Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium

  36. Evaluation at Moravia • IT localization domain • Systems trained on the LetsMT! platform • English-Czech translation: 25.1% productivity increase; error score increase from 19 to 27, still at the GOOD grade (<30) • English-Polish translation: 28.5% productivity increase; error score increase from 16.8 to 23.6, still at the GOOD grade (<30)

  37. Productivity increase: Polish 28.5%, Czech 25.1%, Slovak 25%*. * For Czech and Polish, formal evaluation was done by Moravia; for Slovak, the productivity increase was estimated by Fortera

  38. MORE DATA

  39. ACCURAT TOOLKIT • corpora collection tools • comparability metrics • named entity recognition tools • terminology extraction tools
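As a toy illustration of what a comparability metric computes, cosine similarity over bag-of-words vectors scores how alike two documents are. The real ACCURAT metrics work cross-lingually (e.g. via translation lexicons); this monolingual sketch only shows the shape of the computation:

```python
import math
from collections import Counter

def comparability(doc_a, doc_b):
    """Cosine similarity of bag-of-words vectors as a crude comparability score."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two invented snippets on the same topic score well above unrelated text.
print(comparability("engine oil change interval", "oil change for the engine"))
```

Scores like this let a crawler rank candidate document pairs before the more expensive parallel-sentence extraction step runs.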

  40. Use case: AUTOMOTIVE MANUFACTURER

  41. Very small translation memories (just 3500 sentences) • no in-domain corpora in target languages • no money for expensive developments • ?

  42. Data collection workflow • Terminology extraction • Web crawling (parallel, monolingual) • Parallel data extraction from comparable corpora
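A minimal sketch of the "parallel data extraction from comparable corpora" step, assuming a tiny invented bilingual lexicon: score candidate sentence pairs by length ratio and lexicon coverage, and keep those above a threshold. Real extraction tools handle morphology, word alignment and far larger dictionaries; every word, score and threshold below is a toy value:

```python
# Invented English-Latvian toy lexicon (uninflected forms for simplicity;
# real tools must handle Latvian's rich inflection).
LEXICON = {"engine": "dzinējs", "oil": "eļļa", "filter": "filtrs"}

def pair_score(src, tgt):
    """Length-ratio times lexicon-coverage heuristic for one sentence pair."""
    src_w, tgt_w = src.lower().split(), tgt.lower().split()
    if not src_w or not tgt_w:
        return 0.0
    ratio = min(len(src_w), len(tgt_w)) / max(len(src_w), len(tgt_w))
    hits = sum(1 for w in src_w if LEXICON.get(w) in tgt_w)
    return ratio * hits / len(src_w)

def mine_pairs(src_sents, tgt_sents, threshold=0.3):
    """Keep every cross-product pair whose score clears the threshold."""
    return [(s, t) for s in src_sents for t in tgt_sents
            if pair_score(s, t) >= threshold]

pairs = mine_pairs(["replace the engine oil filter"],
                   ["dzinējs eļļa filtrs maiņa", "pārbaudiet bremzes"])
print(pairs)
```

The surviving pairs feed directly into the "Resulting data" inventory on the next slide: parallel phrases extracted without any pre-existing parallel corpus.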

  43. Resulting data • TMs • Terminology glossary • Parallel phrases • Parallel named entities • Monolingual target-language corpus

  44. SMT Training • General-domain data as a basis • Domain-specific language model • Impose domain-specific terminology and named-entity translations • Add linguistic knowledge atop the statistical components
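The "domain-specific language model" idea is commonly realised by interpolating an in-domain model with a general one, p(w) = λ·p_domain(w) + (1−λ)·p_general(w). A unigram sketch with toy probabilities; λ and all the numbers are invented for illustration:

```python
# Toy unigram probabilities: "toolbar" is frequent in IT text but rare in
# general text; "the" looks similar in both.
P_DOMAIN = {"toolbar": 0.02, "the": 0.05}
P_GENERAL = {"toolbar": 0.0001, "the": 0.06}

def interpolated(word, lam=0.7):
    """Linear interpolation: lam * p_domain + (1 - lam) * p_general."""
    return lam * P_DOMAIN.get(word, 0.0) + (1 - lam) * P_GENERAL.get(word, 0.0)

print(interpolated("toolbar"))  # domain term gets a large boost over P_GENERAL
print(interpolated("the"))      # common words barely move
```

In a full SMT system the same weighting idea applies to n-gram models, and λ is typically tuned on held-out in-domain text rather than fixed by hand.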

  45. right data & right tools

  46. tilde.com technologies for smaller languages The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456
