International Collaboration for the Research on Language Technologies Masao Utiyama @ NICT 8th Dec. 2017 ONA2017
Abstract • Many languages exist in the world. • No one can understand all languages. • This is why we need international collaboration in research on language technologies. • This talk presents the projects NICT has participated in. • I also introduce the collaboration between NIPTICT and NICT. • I show how they work together in developing Khmer language technologies. • I also introduce my research on developing machine translation (MT).
Outline • Asian Language Treebank (ALT) • Khmer ALT with NIPTICT • U-STAR • Khmer ASR (automatic speech recognition) with NIPTICT • My own research on developing parallel corpora for MT
Open Collaboration for Developing and Using Asian Language Treebank Project members: Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah (BPPT) Aw Ai Ti, Sharifah Mahani Aljunied (I2R) Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thái (IOIT/UET) Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng (NIPTICT) Khin Mar Soe, Khin Thandar Nwet (UCSY) Masao Utiyama, Chenchen Ding (NICT) Chai Wutiwiwatchai, Thepchai Supnithi, Pranchya Boonkwan (NECTEC) Ria A. Sagum, Michael B. dela Fuente (PUP)
Current Status of Asian NLP resources • No publicly available treebanks for most Asian languages • Development of Asian NLP is slow • Difficult to compare research results in Asian NLP
Objective of Asian Language Treebank • Provide Asian Language Treebank for free for research • Cover many under-resourced Asian languages • Facilitate the rapid development of Asian NLP • Provide common ground for comparison/evaluation of Asian NLP • We will release ALT under a Creative Commons Attribution-NonCommercial-ShareAlike license
What will be the Asian Language Treebank (ALT) • 20,000 English Wikinews sentences • Translated into: Indonesian, Japanese, Khmer, Malay, Myanmar, Vietnamese, Thai, Lao, Filipino • Annotated with: word segmentation, POS, syntax, word alignment
Samples (en, id, ja, km, ms, my, vi, th, lo, fil) • Italy have defeated Portugal 31-5 in Pool C of the 2007 Rugby World Cup at Parc des Princes, Paris, France. • Italia berhasil mengalahkan Portugal 31-5 di grup C dalam Piala Dunia Rugby 2007 di Parc des Princes, Paris, Perancis. • フランスのパリ、パルク・デ・プランスで行われた2007年ラグビーワールドカップのプールCで、イタリアは31対5でポルトガルを下した。 • អ៊ីតាលីបានឈ្នះលើព័រទុយហ្គាល់ 31-5 ក្នុងប៉ូលCនៃពីធីប្រកួតពានរង្វាន់ពិភពលោកនៃកីឡាបាល់ឱបឆ្នាំ2007ដែលប្រព្រឹត្តនៅប៉ាសឌេសប្រីន ក្រុងប៉ារីស បារាំង។ • Itali telah mengalahkan Portugal 31-5 dalam Pool C pada Piala Dunia Ragbi 2007 di Parc des Princes, Paris, Perancis. • ပြင်သစ်နိုင်ငံ ပါရီမြို့ ပါ့ဒက်စ် ပရင့်စက် ၌ ၂၀၀၇ခုနှစ် ရပ်ဘီ ကမ္ဘာ့ ဖလား တွင် အီတလီ သည် ပေါ်တူဂီ ကို ၃၁-၅ ဂိုး ဖြင့် ရေကူးကန် စီ တွင် ရှုံးနိမ့်သွားပါသည် ။ • Ý đã đánh bại Bồ Đào Nha với tỉ số 31-5 ở Bảng C Giải vô địch Rugby thế giới 2007 tại Parc des Princes, Pari, Pháp. • อิตาลีได้เอาชนะโปรตุเกสด้วยคะแนน31ต่อ5 ในกลุ่มc ของการแข่งขันรักบี้เวิลด์คัพปี2007 ที่สนามปาร์กเดแพร็งส์ ที่กรุงปารีส ประเทศฝรั่งเศส • ອິຕາລີໄດ້ເສຍໃຫ້ປ໊ອກຕຸຍການ 31 ຕໍ່ 5 ໃນພູລ C ຂອງ ການແຂ່ງຂັນຣັກບີ້ລະດັບໂລກປີ 2007 ທີ່ ປາກເດແພຣັງ ປາຣີ ປະເທດຝຣັ່ງ. • Natalo ng Italya ang Portugal sa puntos na 31-5 sa Grupong C noong 2007 sa Pandaigdigang laro ng Ragbi sa Parc des Princes, Paris, France.
Project Goal (Final outcome) • NICT will develop and release the parallel corpus for ALT • Each member institute shall develop and release ALT for each language • Each member institute shall decide the amount of ALT, which will be developed and released by that institute • ALT will be used for research and development on Asian NLP
Results so far indicating fruitful collaboration • First meeting was hosted by NIPTICT (Apr. 2016) • Second meeting was hosted by BPPT (Oct. 2016) • Third meeting was hosted by UCSY (Aug. 2017) • Each member institute started developing its own ALT • ALT resources are available at the project page http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/index.html • Cooperation with U-STAR: Khmer SMT is released to U-STAR; parallel corpora are used in U-STAR for machine translation
Progress of Khmer ALT at NICT with NIPTICT Chenchen Ding†, Hour Kaing‡, Masao Utiyama†, Vichet Chea‡, Eiichiro Sumita† †Advanced Translation Technology Laboratory, ASTREC, NICT, Japan ‡NIPTICT, Cambodia
Outline • Progress on the Khmer data in ALT • NICT with NIPTICT • Resources under checking • Tokenization and part-of-speech (POS) annotation guidelines for Khmer • Tokenized and POS-tagged data: 90% data checked, by NOVA • Issues in temporary Khmer data • Orthographic errors • Multi-form Khmerization • Final outcome
Annotation Guidelines for Khmer • Released on ALT home page • http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/Khmer-annotation-guideline.pdf • Updated along with data preparation • A temporary stable version • Manual checking and correcting • Around 90% of all 20,106 sentences have been checked • Will be finished within FY 2017 • Will be released in FY 2018 after further cleansing
Issues in Temporary Khmer Data • Orthographic errors • 5 general error cases detected and corrected • Other undetected cases need to be corrected manually • Multi-form words (Khmerization) • Names (country, person, etc.) are written in different forms.
Orthographic Errors • កណ្ដាល → ក ណ ្ ដ ា ល gone dal (correct pronunciation) • កណ្តាល → ក ណ ្ ត ា ល gone tal • បន្ដ → ប ន ្ ដ bone dor (correct pronunciation) • បន្ត → ប ន ្ ត bone tor • តន្ត្រី → ត ន ្ ត ្ រ ី (correct) • តន្រ្តី → ត ន ្ រ ្ ត ី • តន្រី្ត → ត ន ្ រ ី ្ ត 1. ្ ដ → ្ ត 2. ្ រ ្ (con.) → ្ (con.) ្ រ 3. (d. vowel) ្ (con.) → ្ (con.)(d. vowel)
Orthographic Errors • តាំង → ត ា ំ ង (preference) • តំាង → ត ំ ា ង • ញ៉ាំ → ញ ៉ ា ំ (preference) • ញុំា → ញ ុ ំ ា • ញាំុ → ញ ា ំ ុ • ញាុំ → ញ ា ុ ំ • ញុាំ → ញ ុ ា ំ • ញំាុ → ញ ំ ា ុ • ញំុា → ញ ំ ុ ា 4. ំ ា → ា ំ 5. ុ ំ ា → ៉ ា ំ ា ំ ុ → ៉ ា ំ ...
Orthographic Errors • 5 rules • ្ ដ => ្ ត • ្ រ ្ (con.) => ្ (con.) ្ រ • (d. vowel) ្ (con.) => ្ (con.) (d. vowel) • ំ ា => ា ំ • ុ ា ំ => ៉ ា ំ • con. = consonant • d. = dependent, always standing with a consonant
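The five rewrite rules can be sketched as a small normalizer. This is only an illustration under stated assumptions, not the tool used in the project: the function names are hypothetical, "con." is assumed to mean the Unicode range U+1780 to U+17A2, "d. vowel" is assumed to mean U+17B6 to U+17C5, rule 5 is assumed to cover every ordering of ុ, ា, ំ (as the variants on the previous slide suggest), and the rules are repeated until nothing changes so that one rule can feed another.

```python
import re

# Khmer code points used by the five rules on the slide.
COENG = "\u17d2"                     # subscript marker, shown as "្"
DA, TA, RO = "\u178a", "\u178f", "\u179a"
AA, U, NIKAHIT = "\u17b6", "\u17bb", "\u17c6"
MUUSIKATOAN = "\u17c9"
CONS = "[\u1780-\u17a2]"             # assumption: "con." = KA..QA
DVOWEL = "[\u17b6-\u17c5]"           # assumption: "d. vowel" = AA..AU


def _rule5(match):
    """Rewrite any ordering of U, AA, NIKAHIT as MUUSIKATOAN AA NIKAHIT."""
    run = match.group(0)
    if set(run) == {U, AA, NIKAHIT}:
        return MUUSIKATOAN + AA + NIKAHIT
    return run


def _one_pass(text):
    # Rule 1: subscript DA -> subscript TA
    text = text.replace(COENG + DA, COENG + TA)
    # Rule 2: subscript RO followed by another subscript -> RO goes last
    text = re.sub(COENG + RO + COENG + "(" + CONS + ")",
                  lambda m: COENG + m.group(1) + COENG + RO, text)
    # Rule 3: dependent vowel before a subscript -> subscript first
    text = re.sub("(" + DVOWEL + ")" + COENG + "(" + CONS + ")",
                  lambda m: COENG + m.group(2) + m.group(1), text)
    # Rule 5 (assumed to cover all orderings of the three signs)
    text = re.sub("[" + U + AA + NIKAHIT + "]{3}", _rule5, text)
    # Rule 4: NIKAHIT AA -> AA NIKAHIT
    text = text.replace(NIKAHIT + AA, AA + NIKAHIT)
    return text


def normalize_khmer(text):
    """Apply the rules repeatedly until the text stops changing."""
    while True:
        new_text = _one_pass(text)
        if new_text == text:
            return text
        text = new_text
```

Repeating the pass matters for inputs such as ត ន ្ រ ី ្ ត, where rule 3 must move the vowel before rule 2 can reorder the two subscripts.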
Multi-form Khmerization • New Zealand Nouvelle-Zélande ???
Final outcome up to FY 2018 • Conditions and reasons • We do not yet have the guideline for syntax annotation. • But we already have a temporary stable version of the word-segmented and POS-tagged corpus. • So our plan is to complete all the word segmentation and POS tagging first, then start the word alignment, and finally do the syntax annotation. • Final outcome • Word segmentation: 20,106 sentences • POS tagging: 20,106 sentences • Syntax annotation: 9,000 sentences
Universal Speech Translation Advanced Research “TO OVERCOME THE LANGUAGE BARRIERS AROUND THE WORLD” EST. SINCE 2010
HISTORY
[Timeline figure spanning 1991 to 2017 with three phases: C-STAR, A-STAR, U-STAR. Country groups shown: France, Portugal, Turkey, England, Germany, Hungary, Poland, Belgium, Ireland; Vietnam, Singapore; Japan, China, Korea, Indonesia, Thailand, India; Korea, Italy, France, China, UK, Switzerland, Sweden, India; Taiwan, Cambodia; Japan, US, Germany; Bhutan, Mongolia, Nepal, Pakistan, Philippines, Sri Lanka. Events shown: management transferred from NICT to I2R; ITU-T Recommendations F.745 and H.625 published; VoiceTra4U for iOS released; A-STAR network-based S2ST; U-STAR network-based S2ST; standardization activities transferred to ITU-T SG16 from APT SNLP-EG; VoiceTra4U for Android released.]
The Consortium for Speech Translation Advanced Research (C-STAR) started out over 20 years ago to develop multilingual speech translation systems. Numerous later activities and workshops derived from C-STAR, such as the International Workshop on Spoken Language Translation (IWSLT). The Asian Speech Translation Advanced Research (A-STAR) consortium was formed in the Asian region to develop a network-based speech-to-speech translation (S2ST) system. A-STAR took the initiative in standardizing international communication protocols, especially in the S2ST field, in association with the Asia-Pacific Telecommunity Standardization Program (ASTAP), and launched "the first Asian network-based speech-to-speech translation system" on July 29th, 2009. The system enabled real-time, location-free, multi-party communication between speakers of different Asian languages, and confirmed the feasibility of the network-based S2ST protocol. In 2010, the standardization procedures at ASTAP were transferred to the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) as A-STAR became U-STAR, changing not only its name but also its organization into a worldwide consortium with the aim of establishing a more global system.
U-STAR’s network-based S2ST is developed based on the ITU-T Recommendations F.745 and H.625, both published in October 2010.
MEMBERS (as of April 2017) • U-STAR Members: 33 institutes from 26 countries/regions signed the MOU (Memorandum of Understanding), which is valid until March 31st, 2019 • U-STAR covers 95% (orange areas) of the world’s official languages
Main Activities
• R&D of Network-based Speech-to-Speech Translation System. R&D tasks: - Collect speech and text data and dictionaries of local languages - Link local languages to multilingual communication systems - Construct standalone ASR/MT/TTS engines - Connect engines to multilingual systems using ITU-T standardization protocols via network - Run applications using the network-based S2ST - Utilize collected data from the application to improve performance - Extend developed technologies to commercial fields
• Network-based S2ST: - Each member builds and operates servers for speech recognition, machine translation, and speech synthesis. - Users select the set of languages to be translated within the iPhone application. - According to what the users have selected, the control server (operated by NICT) connects to the S2ST servers (operated by each member). * Communication protocols and interfaces are implemented based on the ITU-T Recommendations F.745 and H.625. [Figure: Japanese (JA) and Hindi (HI) clients communicating over the network through ASR, MT (JA→HI, HI→JA), and TTS engines]
• Workshops: - Held once or twice a year to accelerate research collaborations and share progress. - Often held along with other international workshops (e.g., Interspeech, ICASSP, O-COCOSDA) so that more people can participate. - Notable invited keynote speakers: Prof. Dr. Alexander Waibel (Carnegie Mellon University, USA / Karlsruhe Institute of Technology, Germany / Jibbigo, Chairman and Founder, USA & Germany / Advisory Board of the U-STAR Consortium); Mr. Simão Campos (Counsellor, ITU-T Study Group 16, ITU Telecommunication Standardization Bureau); Prof. Dekai Wu (Department of Computer Science and Engineering and the Human Language Technology Center at the Hong Kong University of Science and Technology) [Photos: London, 2012; Gurgaon, 2013; Lyon, 2013; Florence, 2014; Phuket, 2014]
ORGANIZATION and ROLES
• Advisory Board: - Dr. Alex Waibel (CMU, USA / KIT, Germany) - Prof. Satoshi Nakamura (NAIST, Japan) - Prof. Shyam Agrawal (KIIT, India) * Give informed guidance and suggestions * Participate in workshops
• Technical Support: - Mr. Li Zhongwei, I2R (Singapore) * Construct, manage, and connect engines and servers between members * Troubleshooting
• Coordinator: - Dr. Haizhou Li, I2R (Singapore) * Host and chair workshops * Recruit new members
• U-STAR Secretariat: - Ms. Ai Ti and Ms. Sharifah Mahani, I2R (Singapore) * Coordinate between all affiliated members * Prepare for workshops, demonstrations * Prepare slides, internal and external reports * Prepare and coordinate contracts: MOU, TLA, etc. * Accounting work for workshops, sponsors * Help desk and customer support for publicly released applications * Support communication between engineers of each member * Construct and manage internal and external websites * Handle public relations (flyers, posters, etc.) * Manage corpora, documents, and data
Khmer Speech Processing • From the summary report by Soky Kak of NIPTICT, written while working at NICT as an intern
Lexical Data Construction [Diagram labels: Compound word, Single word, 34K Keywords, Chuon Nath dict., BTEC, websites, Khmer words, Pali and Sanskrit, Loanwords] • 57 unique phones • 21 consonant phones • 36 vowel phones
Language Model • The BTEC data totals 96K sentences • A 3-gram language model is built with the SRILM toolkit.
Speech Data construction • 4K sentences are selected from BTEC, balanced by consonant-consonant (CC) coverage, to be used for voice recording. • Voice recording is conducted in 3 places.
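To make the idea of a 3-gram model concrete, here is a minimal sketch that counts trigrams and returns unsmoothed maximum-likelihood probabilities. It is not SRILM and does not reproduce its discounting or backoff; the function names and the `<s>`/`</s>` padding convention are assumptions made for the illustration, and input is assumed to be already word-segmented with spaces.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories over tokenized sentences."""
    tri, hist = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            tri[tuple(tokens[i - 2:i + 1])] += 1
            hist[tuple(tokens[i - 2:i])] += 1
    return tri, hist


def trigram_prob(tri, hist, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for unseen histories."""
    h = hist[(w1, w2)]
    return tri[(w1, w2, w3)] / h if h else 0.0


tri, hist = train_trigram(["a b c", "a b d"])
p = trigram_prob(tri, hist, "a", "b", "c")  # "a b" occurs twice, once followed by "c"
```

A real system needs smoothing, because an unsmoothed model assigns zero probability to any trigram missing from the training data.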
Experimentation Result (WER %) • GMM-BMMI: Gaussian mixture model trained with the boosted maximum mutual information criterion • DNN-CE: deep neural network model trained with the cross-entropy criterion • DNN-sMBR: deep neural network model trained with the state-level minimum Bayes risk criterion • Decoding by NICT SprinTra Decoder. The texts are from BTEC. No noise.
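For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation, not the scoring script used in these experiments; the function name is hypothetical, and it assumes both strings are already segmented into space-separated words, which for Khmer requires a word segmenter first.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


score = wer("a b c d", "a x c")  # one substitution and one deletion over four words
```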
Fundamental Structure of NMT • Thang Luong, Hieu Pham, Christopher D. Manning. (2015) Effective Approaches to Attention-based Neural Machine Translation. EMNLP
NMT is much better with many parallel texts [Figure: translation accuracy versus number of parallel sentences for NMT and SMT, with 1 million sentences marked on the horizontal axis]
Automatic Parallel Corpus Construction • Finds the best-matching sentences between the Japanese and English texts • Needs only parallel texts, not parallel sentences • Has produced over 500 million parallel sentences since Utiyama et al., 2003 [Figure: Japanese texts and English text linked by best-matching sentences] • Masao Utiyama and Hitoshi Isahara. (2003) Reliable Measures for Aligning Japanese-English News Articles and Sentences. ACL-2003, pp. 72-79.
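The notion of "best-matching sentences" can be illustrated with a toy example. The sketch below is not the alignment method of the cited paper: it simply scores each Japanese-English sentence pair by Dice overlap after a dictionary lookup and keeps the highest-scoring English sentence for each Japanese one. The lexicon, the threshold, and all function names are hypothetical.

```python
def similarity(ja_tokens, en_tokens, lexicon):
    """Dice overlap between dictionary translations of the Japanese words and the English words."""
    translated = {e for t in ja_tokens for e in lexicon.get(t, ())}
    english = set(en_tokens)
    if not translated or not english:
        return 0.0
    return 2 * len(translated & english) / (len(translated) + len(english))


def best_matches(ja_sentences, en_sentences, lexicon, threshold=0.3):
    """For each Japanese sentence, keep its best English match if the score reaches the threshold."""
    pairs = []
    for i, ja in enumerate(ja_sentences):
        scores = [similarity(ja, en, lexicon) for en in en_sentences]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy data: a hypothetical three-entry lexicon and pre-tokenized sentences.
toy_lexicon = {"猫": ["cat"], "犬": ["dog"], "走る": ["run", "runs"]}
ja_doc = [["猫", "走る"], ["犬"]]
en_doc = [["the", "dog"], ["the", "cat", "runs"]]
matches = best_matches(ja_doc, en_doc, toy_lexicon)
```

The threshold lets sentences with no counterpart be dropped, which is what makes it possible to start from texts that are parallel only at the document level.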
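Parallel texts are gathered through collaboration • Parallel corpora are scattered across organizations (Prefecture N, Company Y, Prefecture L, Company X, Metropolis M, Company Z, Company B, Company A) • The challenge is gathering them
Open collaboration for developing large parallel corpora is welcome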