
Machine Translation activities at WIPO

This presentation provides an overview of the history and integration of machine translation activities at WIPO, including cross-lingual search, integration of third-party MT engines, and the development of in-house MT engines. It also explains the reasons why WIPO is interested in machine translation and its importance for translating patent information.


Presentation Transcript


  1. Machine Translation activities at WIPO • Bruno Pouliquen, Christophe Mazenc • Patentscope workshop, June 2013

  2. Agenda • History of machine translation activities at WIPO • Cross-Lingual Search • Integration of third-party MT engines • Development of in-house MT engines tuned for specific tasks • Strategy

  3. History of MT activities at WIPO

  4. MT at WIPO: history Why is WIPO interested in Machine Translation? • The IB of the PCT is responsible for translating titles, abstracts, drawing legends, search reports, written opinions and IPRPs for the published PCT applications. (This represents xx million words translated per year) • WIPO disseminates multilingual patent information through its portal PATENTSCOPE. Multilingual functions are required to enable the largest number of users worldwide to search and browse patent applications in many different languages

  5. MT at WIPO: an overview • Mid 2007: International RFP to implement “cross lingual Search” functions in PATENTSCOPE • End of 2008: project failure due to the supplier’s inability to deliver a quality product • 2009: First Statistical Machine Translation experiments performed in-house. Development of a first engine to translate titles from English to French • 2009-2010: development of the PATENTSCOPE CLIR system in 5 languages (EN, FR, DE, ES, JA)

  6. MT at WIPO: an overview • Summer 2010: Integration of Google Translate in PATENTSCOPE to translate result lists, descriptions and claims • March 2011: development and deployment of WIPO’s first own MT system tuned for patents’ titles and abstracts (TAPTA) • April 2011: extension of CLIR to cover the Chinese, Korean, Russian and Portuguese languages • August 2011: release of PCT corpus: COPPA

  7. MT at WIPO: an overview • November 2011: Integration of KIPO’s machine translation system in PATENTSCOPE (for the KO-EN language pairs, until December 2012) • December 2011: Integration of Microsoft Translate into PATENTSCOPE • January 2012: extension of CLIR to cover the Dutch, Italian, and Swedish languages • April 2012: PATENTSCOPE CLIR functionality integrated into Minesoft’s PATBASE through a web service

  8. MT at WIPO: an overview • June 2012: MT knowledge transfer to the UN and ITU • October 2012: UN, ITU, WIPO Marks in production • November 2012: Extension of TAPTA to cover Japanese and German • February 2013: Evaluation results: TAPTA better than Microsoft and Google (title + abstract in all language pairs, similar results at the UN) • June 2013: Outsourcing contract using TAPTA for the EN=>FR language pair

  9. 2. CLIR (Cross-Lingual Information Retrieval)

  10. WIPO’s Cross-lingual search: principle • Free tool available at http://patentscope.wipo.int/search/clir/clir.jsp?interfaceLanguage=en • Enter a search query in any of EN, DE, ES, FR, JP, RU, ZH, PT, IT, DU, SE and it will be expanded into the other languages (keyword translation) • Automatic or supervised mode • Balance between precision and recall set by the user • Disambiguation by technical domain and by selection of appropriate synonyms • Built from bilingual dictionaries extracted statistically from patent corpora without supervision

  11. Interface: Cross-lingual (CLIR) - Automatic

  12. CLIR: automatically enriched query (EN_TI:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21) OR EN_AB:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21)) OR (DE_TI:("Hörgeräte" OR "Hörhilfegeräten") OR DE_AB:("Hörgeräte" OR "Hörhilfegeräten")) OR (ES_TI:("audífonos") OR ES_AB:("audífonos")) OR (FR_TI:("audioprothèses" OR "appareils de correction auditive" OR "production d'appareils auditifs") OR FR_AB:("audioprothèses" OR "appareils de correction auditive" OR "production d'appareils auditifs")) OR (JA_TI:("穴形補聴器") OR JA_AB:("穴形補聴器")) OR (KO_TI:("보청") OR KO_AB:("보청")) OR (PT_TI:("audiofone" OR "auxìlio de audição") OR PT_AB:("audiofone" OR "auxìlio de audição")) OR (RU_TI:("слуха протезно"~22 OR "прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR "слухоулучшающих спидом"~22) OR RU_AB:("слуха протезно"~22 OR "прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR "слухоулучшающих спидом"~22)) OR (ZH_TI:("助听器") OR ZH_AB:("助听器"))
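The overall shape of the enriched query above (each language's candidate terms OR-ed together, searched in both the title and abstract fields, then all languages OR-ed) can be sketched in a few lines of Python. This is only an illustration of the query structure shown on the slide, not WIPO's implementation: the function names and the input dictionary are hypothetical, the translations are assumed to come from the statistically extracted dictionaries, and the proximity suffixes (such as ~21) are deliberately left out.

```python
from typing import Dict, List

# Field suffixes seen in the slide's query: TI = title, AB = abstract.
FIELDS = ("TI", "AB")


def field_clause(lang: str, field: str, terms: List[str]) -> str:
    """Build one clause such as ES_TI:("audífonos")."""
    quoted = " OR ".join('"' + term + '"' for term in terms)
    return f"{lang}_{field}:({quoted})"


def expand_query(translations: Dict[str, List[str]]) -> str:
    """OR together the title/abstract clauses of every language.

    `translations` is a hypothetical input mapping a language code
    to the candidate terms proposed for that language.
    """
    groups = []
    for lang, terms in translations.items():
        if not terms:
            continue  # no candidate term for this language
        clauses = [field_clause(lang, field, terms) for field in FIELDS]
        groups.append("(" + " OR ".join(clauses) + ")")
    return " OR ".join(groups)


if __name__ == "__main__":
    print(expand_query({
        "DE": ["Hörgeräte", "Hörhilfegeräten"],
        "ES": ["audífonos"],
    }))
```

Because the result is a plain boolean query string, the user can read and edit it before running the search, which is the "no black box" property the next slide refers to.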

  13. Why use PATENTSCOPE CLIR? • A) Search full-text collections simultaneously in many foreign languages without knowing them (not English-centric) • B) Significantly increase the number of relevant results without significantly increasing the number of irrelevant results: 3356 results in English titles or abstracts for hearing AND aids, versus 3825 results obtained with CLIR searching in titles or abstracts in all languages • C) Have confidence in your searches. No black box: users have access to the CLIR-generated boolean queries (albeit complex) and have full control over them • D) Have a responsive system even for complex queries: the query in the previous slide executes in less than half a second in PATENTSCOPE

  14. What next? • Improve terminology coverage of already supported languages • Add other languages (Arabic)? Condition to add a language: • Having more than 200’000 (ideally 500’000) titles and if possible abstracts in the language available with associated high quality translations in English

  15. 3. Integrated third-party MT engines

  16. 9 interface languages: Deutsch | English | Español | Français | 日本語 | 한국어 | Português | Русский | 中文

  17. Integrated third-party MT: principles • Uses free MT services available on the internet (so far Google Translate and Microsoft Translate) • Translates from the source language(s) to the language set by the user in the graphical interface • Translates result lists, descriptions and claims only when requested by the user • 65 languages supported using Google Translate! • Quality of Google Translate improved for patent texts thanks to EPO sharing patent corpora with Google

  18. Search Results – machine translate

  19. Search Results – machine translate

  20. Search Results – machine translate

  21. Description – machine translate

  22. Description – machine translate

  23. Description – machine translate

  24. Description – machine translate

  25. 4. Development of in-house MT engines tuned for specific tasks

  26. In-house MT engines • Expertise in building MT systems developed in-house since 2009 • Corpus-based approach: started using the PCT corpus of titles and abstracts • Uses open-source Statistical Machine Translation software: Moses (WIPO is a committer with a specific branch) • First system developed: Translation Assistant for Patent Titles and Abstracts (TAPTA: publicly available at https://www3.wipo.int/patentscope/translate) • Same system (trained on different corpora) developed for the United Nations, for ITU and for translation of Madrid Trademarks goods and services

  27. TAPTA: hovering the mouse over a segment on the left highlights the corresponding segment on the right (and vice versa)

  28. How well does it work? • TAPTA better than Google and Microsoft for abstracts • English->French (15 abstracts*): TAPTA BLEU 46.9; Google 45.9; Google-EPO 45.8; Microsoft 36.7 • German->English (11 titles & abstracts*): TAPTA BLEU 38.3; Google 37.8; Microsoft 26.8 • Human evaluation: adequacy/fluency (TAPTA: 79%, Google 65%, Microsoft 67%) • English->Japanese (1000 segments, title & abstract*): TAPTA BLEU 25.4; Google BLEU 22.3 • English->Chinese (1000 segments, title & abstract*): TAPTA BLEU 22; Google BLEU 17.5 • (*) from recent patent applications (published in March 2013), compared to one reference
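For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score against a single reference per segment is commonly computed: clipped n-gram precisions for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty, reported on a 0 to 100 scale as in the figures above. The slides do not say which tool, tokenization or smoothing was used for the evaluation, so the whitespace tokenization and the function names here are assumptions, and this code will not reproduce the reported numbers.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU (0-100) with one reference per segment.

    Assumption: segments are tokenized on whitespace.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # n-grams produced by the system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection keeps the minimum count: clipped matches.
            matched[n - 1] += sum((h_counts & r_counts).values())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
    print(round(score, 1))
```

Since BLEU depends on tokenization and on the number of references, scores are only comparable when every system is scored on the same test set with the same settings, as the slide does for each language pair.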

  29. Also at the United Nations • BLEU scores • Aims at assisting UN translators when translating UN official documents from AR, ES, FR, RU, ZH into EN (both directions)

  30. Findings • Customized MT engines built on narrow language domains outperform state-of-the-art general-purpose MT engines • TAPTA automatic evaluations are better than Google Translate on patent titles and abstracts (BLEU scores) • Size of corpora matters, as well as quality of sentence-pair alignments • Building customized SMT engines is sustainable and does not require large human, IT and financial resources

  31. Bibliography • TAPTA: A user-driven translation system for patent documents based on domain-aware Statistical Machine Translation, B. Pouliquen, C. Mazenc, A. Iorio, in proceedings of the European Association for Machine Translation conference, May 2011, Leuven, Belgium • COPPA, CLIR and TAPTA: three tools to assist in overcoming the Patent language barrier at WIPO, B. Pouliquen, C. Mazenc, in proceedings of Machine Translation Summit 2011, September 2011, Xiamen, China • Statistical Machine Translation prototype using UN parallel documents, B. Pouliquen, C. Mazenc, C. Elizalde, J. Garcia-Verdugo, in proceedings of the 16th EAMT conference, 28-30 May 2012, Trento, Italy • (forthcoming) Large-scale multiple language translation accelerator at the United Nations, B. Pouliquen, C. Elizalde, M. Junczys-Dowmunt, C. Mazenc, J. Garcia-Verdugo, in proceedings of Machine Translation Summit 2013, Nice, France

  32. 5. Strategy

  33. WIPO’s MT strategy • Make best use of state-of-the-art technologies available in open source and further promote their development • Adapt these technologies to the patent domain (using patent corpora, patent classification,…) for practical use cases • Develop patent MT systems and make them available to the largest number of users to bridge the language barrier (notably in patent searching) • Cooperate with interested offices by sharing experience, corpora and software solutions • Adopt barrier-free dissemination of patent corpora when possible to foster research in MT for patent texts • Investigate cloud technologies to be able to ramp up to industrial internet solutions • TAPTA: Extend coverage (languages, claims, descriptions)

  34. Questions?
