Where do we stand? MT development, research, and deployment in Asia Key-Sun Choi (KAIST) AAMT http://www.asianlp.org/ http://www.afnlp.org/ http://korterm.org/
Contents • China • Japan • India • Malaysia • Thailand • Taiwan • Korea • UNL • Associations related to MT
MT in China – 1980-1990’s • To translate scientific documents • From Russian and Western countries’ languages • Supported by government • No private companies in the early stage • TRANS-STAR: • 30,000 words/hour on a 386 PC. • Basic dictionary includes 40,000 entries • 10 specialized technical dictionaries • including 350,000 entries. • Subject fields: computer, economics, telecommunication, ceramics, thermal power industry, printing machine industry, automobile/tractor industry, petroleum prospecting, geology, chemical industry.
MT in China – Present: English-to-Chinese • GAOLI: • developed jointly by Beijing GAOLI Computer Co. Ltd. & the Linguistics Institute of CASS. • Basic lexical dictionary: 60,000 entries, in which the usage and grammatical function of every word are described in detail. • Translation accuracy: 80% • Readability of translated text: 80%-90% • 863-IMT/EC: • by the Institute of Computer Technology, Academia Sinica. • commercialized, with very good economic returns.
MT in China – Present: Chinese-to-English • SINO-TRANS • by CS&S (China National Software & Technology Service Co.) in 1993. • Basic dictionary: 40,000 entries • Two special-subject technical dictionaries: naval ships and boats (9,312 entries), rocket-gun (33,773 entries) • Linguistic rules: 1,000 rules
MT in China – Present: English-to-Chinese + terminology • TONGYI system: • by the Tianjin DATONG computer software company • Windows platform • Special-subject dictionaries: a. commonly used scientific terms: 200,000 entries b. terms covering 22 different subjects (e.g. machine building, telecommunication, aviation, medicine, etc.): 3,000,000 entries • Good market strategy and service • Cooperation with enterprises
MT in China – Present: English-to-Chinese + Internet browsing + more user interface • YIWANG: • by SUNSHINE company of Shenzhen. • Highest translation speed: 100 sentences per second. • Internet browsing • YIBA: • by YAXINCHENG software technical company. • Three translation modes: online, automatic, interface. • Open to users: dictionary and rules can be revised • Rich special-subject dictionaries: 30 subjects (e.g. computer, telecommunication, medicine)
MT in China – Present: English-to-Japanese • E-to-J • by JEC company in Beijing. • Technique: transformation from phrase tree (P-tree) to dependency tree (D-tree). • Closely integrated with a word processor
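The P-tree to D-tree transformation mentioned for the E-to-J system can be illustrated with a minimal sketch. It assumes a head-rule approach: each phrase label names the child that carries its head, and every other child's head word becomes a dependent of that head. The head rules, the tree encoding, and the function name `to_dependencies` are hypothetical and chosen only for illustration; the slides do not describe how the JEC system actually performs the conversion.

```python
# Hypothetical head rules: for each phrase label, the child label that
# carries the head of the phrase.
HEAD_RULES = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}


def to_dependencies(tree, deps=None):
    """Convert a phrase tree to dependencies; return (head_word, deps).

    A leaf is (pos, word); an internal node is (label, [children]).
    `deps` collects (dependent_word, head_word) pairs.
    Simplification: words are used as node identifiers, so a sentence
    with a repeated word would need indices instead.
    """
    if deps is None:
        deps = []
    label, body = tree
    if isinstance(body, str):  # leaf: the word is its own head
        return body, deps
    # Head word of every child, computed recursively.
    heads = [to_dependencies(child, deps)[0] for child in body]
    # The first child whose label matches the head rule supplies the head.
    head_label = HEAD_RULES[label]
    head_index = next(i for i, child in enumerate(body) if child[0] == head_label)
    # Every non-head child attaches to the head of this phrase.
    for i, word in enumerate(heads):
        if i != head_index:
            deps.append((word, heads[head_index]))
    return heads[head_index], deps


P_TREE = (
    "S",
    [
        ("NP", [("D", "the"), ("N", "system")]),
        ("VP", [("V", "translates"), ("NP", [("N", "text")])]),
    ],
)

root, dependencies = to_dependencies(P_TREE)
```

Here `root` is the head word of the whole sentence and `dependencies` lists each remaining word with the word it depends on.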
MT in China – Present: Example-based MT: experimental systems • Japanese-Chinese EBMT: • computer department of Qinghua University, 1996. • corpus of aligned Japanese and Chinese sentences • The example unit is the sentence • Similarity is calculated at the word level • DAYA EBMT: • Harbin Polytechnic University. • machine-aided translation system; the human factor is very important • corpus is aligned at the sentence level
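The retrieval step described for the Japanese-Chinese EBMT system (the sentence as example unit, similarity computed over words) can be sketched as follows. The Dice coefficient, the function names, and the toy sentence pairs are assumptions for illustration; the slides do not say which similarity measure the system used.

```python
def dice_similarity(a_words, b_words):
    """Word-level Dice coefficient between two tokenized sentences."""
    a, b = set(a_words), set(b_words)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def best_example(query_words, examples):
    """Return the (source_words, target_text, score) closest to the query.

    `examples` is a list of (source_words, target_text) pairs, i.e. a
    sentence-aligned corpus in which each sentence is one example unit.
    """
    scored = [
        (src, tgt, dice_similarity(query_words, src)) for src, tgt in examples
    ]
    return max(scored, key=lambda item: item[2])


# Hypothetical toy corpus; a real system would hold aligned
# Japanese-Chinese sentence pairs instead of these placeholders.
EXAMPLES = [
    (["i", "read", "a", "book"], "TARGET_1"),
    (["she", "reads", "the", "newspaper"], "TARGET_2"),
    (["i", "buy", "a", "book"], "TARGET_3"),
]

src, tgt, score = best_example(["i", "read", "a", "newspaper"], EXAMPLES)
```

In an EBMT pipeline the retrieved target sentence would then be adapted to the query, for instance by replacing the words that differ; that step is outside this sketch.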
MT in China: Government Funding, 1990’s • Hi-Tech 863 funding: • 863-IMT/EC system (English-Chinese) • SUNSHINE YIWANG system. • 905 Chinese Language Processing Project: • completed in 1998.
MT in China: User’s English Level • Proportion of TONGYI MT software users by English level: • Higher level: 16.5% • Middle level: 49.5% • Lower level: 34.1% • So MT software must be oriented to ordinary users
MT in China: Potential Users • Proportion of TONGYI MT software enterprise users by size: • Small enterprises: 31.3% • Medium-scale & large-scale enterprises: 68.7% • So MT software must be oriented to • large-scale & medium-scale enterprises, • without ignoring the small enterprises, which also have translation demand.
MT in China: Regional Distribution • Regional distribution of MT software users: • translation demand is concentrated in the big cities and developing regions. • Beijing: 18.7% • Liaoning: 7.9%, Jiangsu: 7.5% • Zhejiang: 6.5%, Hubei: 6.5%, Shanghai: 6.1% • Sichuan: 4.7%, Guangdong: 4.7% • Henan: 3.3%, Heilongjiang: 3.3% • Hebei: 2.8%, Shanxi: 2.3%, Jilin: 2.3% • Yunnan: 1.9%, Neimeng: 1.5%, Gansu: 1.4% • Guizhou: 0.5%, Anhui: 0.5%
MT in China - Future and Strategies (1): Terminology Data Bank • MT software combined with terminology data banks • 1990: the sub-committee of computer-aided in terminology of China was set up. • This sub-committee is attached to the State Language Commission (SLC) of China • A series of national standards for terminology data banks • Terminology data bank creation • Chinese-English: since 1995, by ISTIC (Institute of Scientific and Technical Information of China) • Remarkable databanks…
MT in China - Future and Strategies (2): Language Corpus Processing • Corpus construction: • on the scale of 25 million Chinese characters (1999) • Automatic segmentation of written Chinese text in the corpus (97.68%, closed test) • Automatic phrase bracketing and syntactic annotation of the Chinese corpus
MT in China - Future and Strategies (3): Speech-to-speech translation • Chinese speech into Chinese text. • The "SIDA-863A" system can recognize • 398 basic Chinese syllables, • recognition rate can reach 93%, • response time is less than 0.1 second, • input speed can reach 80 Chinese characters per minute
MT in China - Future and Strategies (4): Combined with OCR and Internet • Internet MT: • SUNSHINE YIWANG, YAXIN YIBA, TONGYI, etc. • The advantages of MT software on the Internet are: • Higher translation speed, real-time translation • Low price • Large machine dictionary • Possibility of adding new words
MT in China: New National Project • 973 project: from 2001 • supported by the Chinese government. • For creative research in • natural language processing, including machine translation. • automatic speech-to-speech translation system (English-Chinese) • under development at the Institute of Automation of Academia Sinica.
MT in China – Survey Source • Prof. Feng, Zhiwei: • Secretary-general and the deputy chairman of • sub-committee of computer-aided in terminology of China • under the State Language Commission (SLC) of China. • Invited professor, KAIST (Sep/2001 – Aug/2002) • Dr. Liu, Qun • Institute of Computer Technology, Academia Sinica, Beijing
MT in Japan - 1 • More than 10 companies • For English, Chinese, Korean • Waiting for the new breakthrough • Internet • eLearning • Joint work with special-domain related companies • Technology transfer • Collaboration tools are ready for the market • For translators’ collaboration workbench through the network • User interface: well-organized.
MT in Japan - 2 • Leading Systems • Cross-lingual patent retrieval • Prime • NTT/ALT • Japanese-to-English • Japanese-to-Malay • Japanese-to-Chinese • Speech Translation • ATR: C-Star
UNL in UN University • Through Universal Networking Language • With Hindi, Japanese, Persian, Indonesia-Malay, Thai, Chinese, Mongolian, Korean in the Asian region • Other regions: major European languages and English • Possible users: • ITU mail translation
MT in Malaysia • No commercial product yet. • Work is under way in the academic sector • For application to • Internet • eLearning • eCommerce • Universiti Sains Malaysia • Computer Aided Translation Unit • Prof. Tang Enya Kong and Prof. Yusoff Zaharin
MT in India • 18 constitutional languages with 10 different scripts: • their script grammars and language grammars are quite similar • they share 40 to 80 percent of their vocabularies • fewer than 5 percent of people can work in English
MT in India: 1990-2001: Government effort for IT • TDIL (Technology Development of Indian Languages): • 1990-1991 • development of corpora, OCR, text-to-speech, machine translation; standards for keyboard and internal code for information interchange • 2000-2001 • seven major initiatives: • Knowledge Resources, Knowledge Tools, Translation Support Systems, Human Machine Interface Systems, Localisation, Standardization, and Language Technology Human Resource Development. • Thirteen Resource Centres for Indian Language Technology Solutions (RC-ILTS) • were supported, covering all 18 Indian languages.
MT in India: Future: Digital Unite and Knowledge for All • Indian Language Technology Vision 2010 has been prepared • with the vision statement “Digital Unite and Knowledge for All”. • Growing popularity of the Internet • content creation, localisation, on-line gisting and summarisation, e-learning, and Cross-Lingual Information Retrieval are being promoted to ensure information access in cyberspace in Indian languages • Source: Dr. Om Vikas • Senior Director and Head, Computer Development Division, Ministry of Information Technology
MT in Thailand: Government 1996 • IT-2000 • To build a national information infrastructure (NII) • To invest in people, concentrating on transferring IT knowledge to children. • To build a Government Information Network (GINET) • Internet users in Thailand (2000): 2.3M/66M • Internet users by age, frequency (percent): • <10: 18 (0.7) • 10-14: 124 (5) • 15-19: 261 (10.6) • 20-29: 1,238 (50.3) • 30-39: 572 (23.2) • 40-49: 187 (7.6) • 50-59: 32 (1.3) • 60-69: 27 (1.1) • 70+: 2 (0.1) • Total: 2,461 (100) • Most Thai Internet users know English and other Internet languages at a basic or low intermediate level
MT in Thailand: PARSIT • web-based Thai-English machine translation • since 1998, in cooperation with NEC (Japan). • very popular among Thai users • translates English to Thai with an accuracy of 60%. • 20 percent mistranslation may be due to differences in expressions, slang, and sentence structures • http://www.suparsit.com/ • 300,000 hits/month • 25,000 users/month
MT in Thailand: Dictionary • a web-based dictionary: Lexitron • Thai-English and English-Thai dictionary
MT in Thailand: Future • to develop the PARSIT translation system • Thai-to-English • and to other target languages. • Other language programs, such as OCR research, speech research, and language research • Thai full-text search engine
MT in Thailand: eASEAN • eASEAN Plan: • Multilingual Machine Translation Proposal • Thailand, Cambodia, Laos, Vietnam, Japan, Korea, English • source: • Dr. Virach Sornlertlamvanich [virach@nectec.or.th] • Dr. Prayong THITITHANANON (Rajabhat Institute Ubon Ratchathani, Thailand)
MT in Taiwan • Prof. Su, Keh-Ih • Machine translation • localization
MT in Korea: Commercial Product • English-to-Korean (Korean-to-English) • Enguide LNI Soft • E-Tran2001 NLP Lab (Seoul National University) • EZ Reader Language and Computer • ClickWorld ClickQ • Transmate IBM Korea • … • Japanese-to/from-Korean • Unisoft • Changmyung • … • Translation Memory • Localization companies develop for their own use: • ITI …
MT in Korea: Test suite for E-to-K • KAIST (http://korterm.kaist.ac.kr/ksurimal) • Supported by the Ministry of Science and Technology • Exhaustive evaluation • A variety of sentences (5,000 from high school textbooks, 10,000 from Internet e-business sites) • To identify the R&D direction
[Figure: Problematic Part of System A. Legend: serious, average. Categories shown: Part of Speech (Article, Noun, Pronoun, Adverb, Adjective, Verb, Mark, Preposition, Conjunction, Relatives); Structural Part (Partial Structure, Infinitive, Participle, Gerund, Tense, Idioms, Number, Sentence type, Special Construction, Comparative, Subjunctive mood, Sentence Structure, Ellipsis, Insertion, Speech, Inversion, Lists, Negation, Multiple part of speech, Relation and Scope of modification, Phrase); Semantic Part (V+N, V+Prep., N+V, N+N, Collocation, N+Prep., Adv.+N, Adv.+Prep., N, V, Etc., Ambiguous word, NP, VP, Idioms, PP, AP (adjective phrase), Sentence, Natural Expression, Different meaning between singular and plural).]
MT in Korea • Caption/EK and KE - ETRI • Real-time translation of caption in the TV news • CNN for English-Korean • KBS for Korean-English • Chinese-Korean MT • Pohang University of Science & Tech. • KAIST • ETRI (Korean-to-Chinese) • Companies: Konan tech. • Japanese-Korean MT (technology transfer) • Pohang University of Science & Tech.
Online language populations (2001 June) • English 45%, Japanese 9.8%, Chinese 8.4% • German 6.2%, Korean 4.7%, Spanish 4.5% • Italian 3.6%, French 3.4%, Portuguese 2.5% • Dutch 2%, Russian 1.9% • GlobalReach. Global Internet Statistics (by Language). • http://www.glreach.com/globstats/index.php3
Organizations in Asia • AAMT • AFNLP (Asia Federation of NLP Associations) • http://asianlp.org/ • http://afnlp.org/ • Eafterm (East Asia Terminology Forum) • http://eafterm.org/ • Language Resource Sharing and Management • Jan/2001 – workshop in Tokyo, invited by Japan • Prof. Tanaka, Hozumi (Chair; GSK) • Nov/2001 – workshop at NLPRS-2001, Tokyo • ISO TC37/SC4 (Language Resource Management), being organized
MT Status in Asia Thank you.