Semi-Automatic Thai Computational Lexicon Construction: KULEX Daoyos Noikongka Mukda Suktarachan Assoc.Prof. Dr. Asanee Kawtrakul NAiST : The Specialty Research Unit of Natural Language Processing and Intelligent Information System Technology, Department of Computer Engineering, Faculty of Engineering, Kasetsart University.
Outline • Introduction • Objective • System Overview • Main Problem • Methodology • Experiment & Evaluation • Conclusion & Future Work Specialty Research Unit in Natural Language Processing & Intelligent Information System Technology Department of Computer Engineering, Kasetsart University
Introduction • A computational lexicon is the fundamental repository of word information. • It is an essential resource for research in natural language processing. • English computational lexicons: WordNet [1], Roget's Thesaurus [2]. • Thai computational lexicons: TCL’s Computational Lexicon [3], Lexitron [4] and Lexibase [5]. • Printed dictionaries: Klang Kam, Royal Institute Dictionary.
Introduction • Heterogeneous word information is useful for many NLP applications. • Pronunciation: used for speech processing. • Register: used for style checking. • Classifier: used to support machine translation. • The existing Thai computational lexicons still lack some word information. • Constructing a computational lexicon is time-consuming and labor-intensive.
Objective • To design a framework for semi-automatically constructing a Thai computational lexicon that holds heterogeneous word information drawn from multiple resources.
System Overview • Input: printed books (Klang Kam, Royal Institute Dictionary, ...). • Stages: Preprocessing, Word Information Parsing, Lexicon Information Integration, Validation & Manipulation. • Parsed word information: hierarchical concept, word, definition (Def.), part of speech (POS), example (Ex.). • Output: the KULEX lexicon.
Main Resources • Klang Kam • 24,950 word entries and 17,199 senses. • 5 major concepts, 573 medium concepts and 2,380 minor concepts. • Word information: hierarchical concept, definition, classifier, word usage method and word usage example. • Royal Institute Dictionary • 32,367 word entries and 33,165 senses. • Word information: part of speech, source of word, pronunciation, definition.
Preprocessing • Optical Character Recognition • Converts image documents into text documents. • ArnThai 2.5 (NECTEC) [6] is applied. • Its accuracy is 90%-95%. • Validation • The OCR output is corrected manually.
Word Information Parsing • Royal Institute Dictionary • Entry 1: ข้าว (Word) น. (POS) ชื่อไม้ล้มลุกหลายชนิด หลายสกุล ในวงศ์ Gramineae โดยเฉพาะชนิด Oryza sativa Linn. ซึ่งใช้เมล็ดเป็นอาหารหลัก มีหลายพันธุ์ เช่น ข้าวเจ้า ข้าวเหนียว (Definition) • Entry 2: ปลูก (Word) ก. (POS) [ปลูก] (Pronunciation) เอาต้นไม้หรือเมล็ด หน่อ หัว เป็นต้น ใส่ลงไปในดินหรือสิ่งอื่นเพื่อให้งอกหรือให้เจริญเติบโต, ทำให้เจริญเติบโต, ทำให้งอกงาม (Definition) เช่น ปลูกไมตรี (Word usage example)
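The entry layout shown on this slide (word, POS abbreviation, optional bracketed pronunciation, then the definition) can be sketched as a single regular expression. This is only an illustration, not the parser used for KULEX: the function name `parse_rid_entry`, the field names, and the assumption that every POS abbreviation ends in a period are all hypothetical, and the sketch leaves any usage example inside the definition text.

```python
import re

# Assumed layout: WORD POS. [PRONUNCIATION] DEFINITION
# The bracketed pronunciation is optional, as in the first entry on the slide.
RID_ENTRY = re.compile(
    r"^(?P<word>\S+)\s+"           # head word (first whitespace-delimited token)
    r"(?P<pos>\S+\.)\s+"           # POS abbreviation ending in a period, e.g. "น."
    r"(?:\[(?P<pron>[^\]]+)\]\s+)?"  # optional pronunciation in square brackets
    r"(?P<definition>.+)$"         # everything else is treated as the definition
)


def parse_rid_entry(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    match = RID_ENTRY.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_rid_entry("ปลูก ก. [ปลูก] เอาต้นไม้หรือเมล็ด หน่อ หัว เป็นต้น ใส่ลงไปในดิน"))
    print(parse_rid_entry("ข้าว น. ชื่อไม้ล้มลุกหลายชนิด หลายสกุล"))
```

Lines that do not fit the pattern return `None`, which is one way to route them to the manual validation step.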
Word Information Parsing • Klang Kam • Entry 1 • Hierarchical concept: ท-ม สรรพสิ่ง > น1-น346 โลกตามธรรมชาติและตามจินตนาการ > น140-น328 สิ่งมีชีวิตนอกจากมนุษย์ > น250-น327 พืช > น260-น280 พืชที่ใช้เป็นอาหาร > น 260 ข้าว • ข้าว (Word) พืชที่ใช้เป็นอาหารสำคัญ มีหลายชนิด หลายพันธุ์ (Definition) [ล. ว่าเม็ด, เมล็ด; เรียกตามภาชนะที่บรรจุ เช่นถุง, จาน] (Classifier) • Entry 2 • Hierarchical concept: ย-ล มนุษย์กับสรรพสิ่ง > ร1-ร137 มนุษย์กับสิ่งต่างๆทั่วไป : การสร้างสรรค์และเปลี่ยนแปลง > ร1-ร12 การทำให้มีขึ้น คงอยู่ และหมดไป > ร1 การทำให้มีขึ้น • ปลูก (Word) ทำให้เกิดพรรณไม้ (Definition 1) เช่น ปลูกผัก (Word usage example); โดยปริยายหมายถึง ทำให้เกิดที่อยู่อาศัย (Definition 2) เช่น ปลูกบ้าน, ปลูกพลับพลา (Word usage example)
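A Klang Kam entry carries a concept path plus a word, a definition and an optional bracketed classifier note. The sketch below shows one way such a record might be held and filled; it is not the KULEX implementation. The class `KlangKamEntry`, the function `parse_klang_kam`, and the assumption that a classifier note is always a bracket beginning with "ล." are hypothetical, and the concept path is passed in already split rather than parsed from the page.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KlangKamEntry:
    concept_path: List[str]            # major > medium > minor concept labels
    word: str
    definition: str
    classifier_note: Optional[str] = None


# Assumption: the classifier note is a bracket that starts with "ล."
CLASSIFIER_NOTE = re.compile(r"\[ล\.\s*(?P<note>[^\]]*)\]")


def parse_klang_kam(concept_path, text):
    """Split 'WORD DEFINITION [ล. NOTE]' into a KlangKamEntry."""
    word, _, rest = text.strip().partition(" ")
    match = CLASSIFIER_NOTE.search(rest)
    note = match.group("note").strip() if match else None
    definition = CLASSIFIER_NOTE.sub("", rest).strip()
    return KlangKamEntry(list(concept_path), word, definition, note)


if __name__ == "__main__":
    entry = parse_klang_kam(
        ["พืช", "พืชที่ใช้เป็นอาหาร", "ข้าว"],
        "ข้าว พืชที่ใช้เป็นอาหารสำคัญ มีหลายชนิด หลายพันธุ์ [ล. ว่าเม็ด, เมล็ด]",
    )
    print(entry)
```

Entries with several definitions and usage examples, like the second one on the slide, would need further splitting that this sketch does not attempt.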
Main Problem • Which concept category should a word and its information be placed in? • Example: the word หมู (moo). • Word information from Klang Kam: the word appears under three hierarchical concepts, Pig, Difficulty & Ease (Def: ง่าย...) and Drug (Def: ใบพลู...). • Word information from the Royal Institute Dictionary: several senses of the same word, with definitions such as "animal…", "a leaf…" and "easy…", parts of speech such as noun (นาม) and adj. (วิเศษณ์), and the register label ปาก (colloquial) on some senses. • Each Royal Institute Dictionary sense has to be attached to the right hierarchical concept.
Lexicon Information Integration • Word classification based on head word matching • Used when a word has the same surface form in both dictionaries (Royal Institute Dictionary and Klang Kam). • Word classification based on definition-concept category matching • Used when the surface form of a word from the Royal Institute Dictionary does not match the surface form of any word in Klang Kam.
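The choice between the two classification methods can be written as a simple lookup. The sketch below is a minimal illustration: the function name `choose_method` and the returned labels are hypothetical, and "same surface form" is taken to mean exact string equality.

```python
def choose_method(rid_word, klang_kam_words):
    """Pick the classification method for one Royal Institute Dictionary word.

    klang_kam_words is the set of head words found in Klang Kam.
    """
    if rid_word in klang_kam_words:
        return "head word matching"
    return "definition-concept category matching"


if __name__ == "__main__":
    klang_kam_words = {"หมู", "ข้าว", "ปลูก"}
    for word in ["หมู", "ขวาน"]:
        print(word, "->", choose_method(word, klang_kam_words))
```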
Word classification based on head word matching • Words that share a surface form and a word sense across the two dictionaries should have similar definitions. • Royal Institute Dictionary • หมู (moo) น. ชื่อสัตว์เลี้ยงลูกด้วยนมหลายชนิดในวงศ์ Suidae เป็นสัตว์กีบคู่ ตัวอ้วน จมูกและปากยื่นยาว มีทั้งที่เป็น สัตว์เลี้ยงและที่เป็นสัตว์ป่า หาอาหารโดยใช้จมูกดุด (a pig) • หมู ว. ง่าย, สะดวก (easy, convenient) • หมู น. ใบพลูสดหั่นผสมฝิ่นแล้วนำมาสูบ (shredded betel leaf mixed with opium for smoking) • หมู น. ชื่อเรือขุดชนิดหนึ่งใช้ในแม่น้ำลำคลอง (a kind of dugout boat) • หมู น. ชื่อขวานชนิดหนึ่ง ด้ามสั้น สันหนา มีบ้องยาวตามสัน ใช้ตัด ถาก และฟัน (a kind of axe) • Klang Kam • ใส, หมู, หวาน, หวานหมู: ง่าย (concept: Difficulty & Ease) • หมู; สุกร; วราห์, วราหะ: สัตว์สี่เท้าซึ่งเท้ามีกีบ ใช้จมูกดุนหาอาหารมนุษย์เลี้ยงไว้เป็นอาหาร (concept: Pig) • หมู: ใบพลูแห้งหั่นเป็นฝอยคลุกน้ำฝิ่นปั้นก้อนใช้สูบด้วยบ้องกัญชา (concept: Drug)
[Figure: The example]
Lexicon Information Integration • Head word matching compares definitions of the same word from the two dictionaries, using: • the set of words w1, w2, ..., wn contained in a definition from the Royal Institute Dictionary; • the set of words w1, w2, ..., wn contained in a definition from Klang Kam; • the definition of the word from Klang Kam; • the definition of the word from the Royal Institute Dictionary; • all definitions of the word from the Royal Institute Dictionary. • The result is the word sense pair that has the maximum score.
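Selecting the word sense pair with the maximum score can be sketched as follows. The slide's scoring formula is not available in the text, so this sketch substitutes a plain Jaccard overlap between the two word sets; that choice, the function names, and the hand-segmented tokens in the example are all assumptions made for illustration only.

```python
def jaccard(words_a, words_b):
    """Overlap of two word sets (an assumed stand-in for the slide's score)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_sense_pair(kk_definition, rid_definitions):
    """Return (index, score) of the RID definition closest to a Klang Kam one.

    kk_definition: list of words in one Klang Kam definition.
    rid_definitions: list of word lists, one per RID sense of the same word.
    """
    scored = [(jaccard(kk_definition, rid), i) for i, rid in enumerate(rid_definitions)]
    score, index = max(scored, key=lambda pair: pair[0])
    return index, score


if __name__ == "__main__":
    # Hypothetical, hand-segmented tokens for three RID senses of หมู
    rid = [
        ["ชื่อ", "สัตว์", "เลี้ยง", "ลูก", "ด้วย", "นม"],
        ["ง่าย", "สะดวก"],
        ["ใบพลู", "สด", "หั่น", "ผสม", "ฝิ่น", "สูบ"],
    ]
    kk_drug = ["ใบพลู", "แห้ง", "หั่น", "คลุก", "ฝิ่น", "สูบ"]
    print(best_sense_pair(kk_drug, rid))
```

Thai text has no spaces between words, so a real pipeline would need a word segmenter before this step; the example assumes the definitions are already tokenized.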
Word classification based on definition-concept category matching • Example: Word: หมู (moo); Definition: เรือ ขุด ชนิด หนึ่ง ... (a kind of boat). • [Diagram: each concept category c1, c2, ..., cm is represented by the words found in its definitions (for example น้ำ, เรือ, โคน, ปาก, กราบ, แบน in one category; อดทน, รับ, ลาก, ยาก, กลาง, คู่ in another; ลำบาก, ปลา, ติด, คอ, ฟาก, ใจ in another). Each word of the input definition (เรือ, ขุด, ..., wn) is weighted against every category: weight(เรือ, c1), weight(ขุด, c1), ..., weight(wn, c1); weight(เรือ, c2), weight(ขุด, c2), ..., weight(wn, c2); ...; weight(เรือ, cm), weight(ขุด, cm), ..., weight(wn, cm).] • The score of each concept is computed from these weights.
Lexicon Information Integration • The weight of a word w in a concept category c, and the score of a concept, are defined in terms of: • the frequency of word w in concept category c; • the number of concept categories that contain word w; • Nc, the total number of concept categories; • n, the number of words in the definitions of all words in the concept; • Score, the score of a concept; • c, any concept category; • w, a word contained in the definition; • the set of all words contained in the definition.
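The quantities listed here (word frequency in a category, number of categories containing the word, Nc, and n) resemble a TF-IDF weighting. The sketch below shows one plausible reading, not the formula from the slides, which is not available in the text: it assumes weight(w, c) = (frequency of w in c / n) × log(Nc / number of categories containing w) and Score(c) = sum of weights over the words of the definition. The function name `rank_concepts`, the `top_k` parameter and the toy categories are hypothetical.

```python
import math
from collections import Counter


def rank_concepts(definition, concept_words, top_k=10):
    """Rank concept categories for one definition (assumed TF-IDF-style score).

    definition: list of words in the definition to classify.
    concept_words: dict mapping a concept name to the list of words that
        occur in the definitions of all words under that concept.
    """
    n_concepts = len(concept_words)                       # Nc
    counts = {c: Counter(ws) for c, ws in concept_words.items()}
    categories_with = Counter()                           # categories containing w
    for counter in counts.values():
        categories_with.update(counter.keys())

    scores = {}
    for concept, counter in counts.items():
        n = sum(counter.values())                         # words in this concept
        score = 0.0
        for w in set(definition):
            if counter[w] == 0:
                continue                                  # w absent from this concept
            score += (counter[w] / n) * math.log(n_concepts / categories_with[w])
        scores[concept] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy categories with hand-picked words, for illustration only
    concepts = {
        "Boat": ["เรือ", "ขุด", "น้ำ", "เรือ"],
        "Hardship": ["ลำบาก", "ยาก", "อดทน"],
        "Fish": ["ปลา", "น้ำ", "ติด"],
    }
    print(rank_concepts(["เรือ", "ขุด", "ชนิด", "หนึ่ง"], concepts))
```

Returning a ranked list rather than a single category leaves the final choice among the top candidates to a human reviewer.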
[Figure: The example of our result]
Validation & Manipulation • For word classification based on head word matching: words and their information are manually moved into the proper concept where the automatic result is wrong. • For word classification based on definition-concept category matching: the lexicographer chooses the proper concept from the ranked classification results.
Experiment & Evaluation • Word classification based on head word matching • 138 randomly selected words (200 senses) from the Royal Institute Dictionary. • The accuracy of this technique is 91.50%.
Experiment & Evaluation • Word classification based on definition-concept category matching • 120 randomly selected words from the Royal Institute Dictionary. • The output of this technique is the ten top-ranked concepts for each word.
Conclusion & Future Work • We propose a framework for semi-automatic Thai computational lexicon construction from two resources: • Klang Kam • Royal Institute Dictionary • The resulting lexicon, KULEX, holds heterogeneous word information. • This approach greatly reduces manual labor and construction time. • In the future, we will improve the word classification algorithm, add other word information such as case frames, and apply our own part-of-speech set instead.
Reference [1] WordNet. Available at http://wordnet.princeton.edu/ [2] Roget's Thesaurus. Available at http://thesaurus.reference.com/ [3] TCL’s Computational Lexicon. Available at http://www.tcllab.org/tcllex/ [4] Lexitron. Available at http://lexitron.nectec.or.th/ [5] Asanee Kawtrakul, Supapas Kumtanode, Thitima Jamjanya, and Chanvit Jewriyavech. “A Lexibase Model for Writing Production Assistant System.” The 2nd Symposium on Natural Language Processing, August 2-4, 1996, Copenhagen, Denmark. [6] ArnThai 2.5 (Lite Version). Available at http://arnthai.links.nectec.or.th
Discussion • Causes of error • Definition pairs that have the same sense but use different surface forms, such as “ผสม (pa-som)” and “คลุก (kloog)”, which both mean “to mix”. This leads to incorrect word matching and weight computation. • Definitions that are very short or that simply give a synonym. For example, the word “นม (nom)” (milk) is defined as “น้ำนม (nam-nom)” (milk). This also affects word matching and weight computation.
Comparison of Word Classification • A linguist classified 20 randomly selected words. • Without any tool: average time is 2.03 sec/word. • With our methodology: average time is 0.50 sec/word.
Experiment & Evaluation of Word Information Parsing • Klang Kam • 205 randomly selected entries. • Parsing accuracy is 94.63%. • Royal Institute Dictionary • 205 randomly selected entries. • Parsing accuracy is 91.21%.
Meaning of Computational Lexicon • “The computational lexicon is the fundamental repository of information about the primary component of language, i.e. words, and therefore critical for systems which aim to handle some aspect of natural language” (Cornelia, 1997)