Tapta4IPC: helping translation of IPC definitions
Translation assistant for patent titles and abstracts in PATENTSCOPE - potential use in translating IPC definitions collaboration
Bruno Pouliquen (Bruno.Pouliquen@wipo.int)
25 Feb 2013, IPC workshop
Introduction
• Statistical Machine Translation: bottom-up approach
  • No rules, no grammar, no dictionary, no terminology, only the parallel texts (bitexts)
• We use an open-source system: Moses
• Tapta: Translation of Patent Titles and Abstracts
  • Originally built to translate patent applications
  • Adapted to various applications
Tapta framework
[Figure: Tapta pipeline diagram. Labels: source language, target language, gather/convert data, bitexts, clean, post-filter, re-clean, prune, binarize, optimize, publish, train-model]
Our system prepares the data for Moses, applies some post-processing (filtering, pruning, binarization, optimization…) and offers a Web interface for translation.
Introduction: Tapta
• At WIPO, as part of PATENTSCOPE (English, French, German, Chinese, Japanese)
  • e.g. http://patentscope.wipo.int/translate/simpleTranslate.jsf?id=JP75694586&langpair=jaen
  • Automatic translation of a patent application only available in Japanese…
• At the United Nations (English from/into Arabic, French, Spanish, Russian and Chinese)
Technical workflow
[Figure: technical workflow diagram. Labels on the training side: source language, target language, filter wrong language, sentence-split, tokenization, sentence-align, score alignment, filter alignments, bitexts aligned at sentence level, Moses' training (phrase table, reordering model, language model). Labels on the translation side: translation client, translation server, Moses decoder (three instances).]
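The workflow mentions filtering the sentence alignments before Moses' training, but the slides do not say how alignments are scored. As an illustration only, here is a minimal sketch of one common heuristic, a token length-ratio filter. The function names `keep_pair` and `filter_bitext` and the thresholds `max_ratio` and `max_tokens` are assumptions made for this example, not part of Tapta.

```python
def keep_pair(source, target, max_ratio=3.0, max_tokens=80):
    """Return True if a sentence pair looks like a plausible translation pair.

    Assumed heuristic (not taken from the slides): drop empty sides,
    overly long sentences, and pairs whose token counts differ too much.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_tokens or len(tgt_tokens) > max_tokens:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio


def filter_bitext(pairs, **thresholds):
    """Keep only the (source, target) pairs accepted by keep_pair."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, **thresholds)]


if __name__ == "__main__":
    pairs = [
        ("Wheel guards", "Couvre-roues"),
        ("TYRE FOR VEHICLE WHEELS", "PNEUMATIQUE POUR ROUES DE VÉHICULE"),
        ("Wheel guards", ""),  # empty target side, dropped
    ]
    for source, target in filter_bitext(pairs):
        print(source, "|||", target)
```

Whitespace tokenization is a simplification; languages such as Chinese or Japanese would need a real tokenizer before any length comparison makes sense.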
IPC context
• Gather data:
  • Get existing definitions
  • Add the IPC schema (XML on the WIPO website)
  • Add a “few” texts from patents
• “Learn” the translation model
• Translate new texts
Get existing data, build parallel texts
• Existing definitions…
• Bitext: training material…
• IPC schema…
    <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
      <textBody> <title> <titlePart> <text>Couvre-roues</text> </titlePart> </title> </textBody>
    </ipcEntry>
    <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
      <textBody> <title> <titlePart> <text>Wheel guards</text> </titlePart> </title> </textBody>
    </ipcEntry>
• Patent texts…
    WO/2013/014517 (EN) TYRE FOR VEHICLE WHEELS (FR) PNEUMATIQUE POUR ROUES DE VÉHICULE
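To illustrate how IPC schema entries like the ones above could be turned into parallel text, here is a minimal sketch that pairs titles sharing the same `symbol` across two `lang` values. The wrapping `<ipcEntries>` root element, the function name `build_bitext`, and the choice to join several title parts with "; " are assumptions made for this example; the real schema file and the actual Tapta4IPC conversion step may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two entries come from the slide; the <ipcEntries> wrapper is a
# hypothetical root added only so the fragment parses as one document.
SAMPLE = """<ipcEntries>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
  <textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>
</ipcEntry>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
  <textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>
</ipcEntry>
</ipcEntries>"""


def build_bitext(xml_string, source_lang="EN", target_lang="FR"):
    """Return (symbol, source_title, target_title) for symbols present in both languages."""
    root = ET.fromstring(xml_string)
    titles = defaultdict(dict)  # symbol -> {lang: title}
    for entry in root.iter("ipcEntry"):
        parts = [t.text.strip() for t in entry.iter("text") if t.text and t.text.strip()]
        if parts:
            titles[entry.get("symbol")][entry.get("lang")] = "; ".join(parts)
    return [
        (symbol, by_lang[source_lang], by_lang[target_lang])
        for symbol, by_lang in sorted(titles.items())
        if source_lang in by_lang and target_lang in by_lang
    ]


if __name__ == "__main__":
    for symbol, source, target in build_bitext(SAMPLE):
        print(symbol, source, "|||", target)
```

Symbols that exist in only one language are skipped, since they contribute nothing to a bitext.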
How well does it work?
• Automatic evaluation: BLEU score
  • Principle: similarity of n-grams between the evaluated sentences and the reference sentences
• On IPC definitions, English-French: BLEU = 48% (without patent data: 44%)
• Good quality needs human post-editing
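The n-gram similarity principle behind BLEU can be sketched as follows. This is a simplified sentence-level version with a single reference, whitespace tokenization and no smoothing, written for illustration only. It is not the scorer that produced the figures above, and the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision gives a zero score
        log_precision += math.log(overlap / total) / max_n
    # Penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    reference = "pneumatique pour roues de véhicule"
    print(bleu("pneumatique pour roues de véhicule", reference))
    print(bleu("pneu pour roues de véhicule", reference))
```

Reported BLEU scores are normally computed over a whole test corpus rather than averaged per sentence, so this sketch only shows the mechanics of the metric.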
Tapta4IPC prototype (1) Live demo using: http://patentscope.wipo.int/translateUN/translateIPC.jsf
Tapta4IPC prototype (2) http://fulty3.wipo.int:8080/Wtapta/translateIPC.jsf
Conclusion / future work
• This is a prototype, but the quality already looks acceptable
• Human evaluation?
• Better integrate the tool
  • In PCA6TRANSDEF?
• Other languages?
Tapta4IPC in various languages
• Tapta4IPC should work reasonably well for the following languages (we have built some language-specific tools and we have patent corpora):
  • German
  • Japanese
  • Korean
  • Spanish
  • Dutch
  • Portuguese
  • Chinese
  • Russian
• More challenging:
  • Czech, Slovak, Polish (many word forms, training corpus?)
  • Estonian (even more word forms; would in theory require a larger training corpus)
• Other languages: Arabic, Italian, Danish, Swedish, etc.
Thank you for your attention • شكرا لكم على اهتمامكم • Merci pour votre attention! • 感谢您的关注 • Grazie per la vostra attenzione! • ¡ Gracias por su atención ! • Vielen Dank für Ihre Aufmerksamkeit! • Obrigado pela vossa atenção! • Dziękuję bardzo za Państwa uwagę! • Děkujeme za Vaši pozornost! • Ďakujem ti veľmi pekne za tvoju pozornosť • Tänan tähelepanu eest! • Благодарим за Вашето внимание! • Tak for Jeres opmærksomhed! • Thank you for your attention!