Centre for Excellence in Computational Engineering and Networking (CEN), Amrita Vishwa Vidyapeetham

A Novel Approach to Morphological Analysis for Tamil Language
Presented by M. Anand Kumar and V. Dhanalakshmi, CEN, Amrita.
Guided by Dr. K. P. Soman, Head, CEN, Amrita University, and Dr. S. Rajendran, Head, Dept. of Linguistics, Tamil University.

Overview
• Introduction
• Tamil Morphological Analyzer
• Machine Learning (SVM)
• Formulation (data creation)
• GUI
• Conclusion

Introduction
• The grammar of any language can be broadly divided into morphology and syntax.
• Morphology deals with words and their construction; syntax deals with how words are put together in order to make meaningful sentences.

Introduction cont...
• Morphological analysis is the process of segmenting words into morphemes and analyzing how each word is formed.
• A morpheme is the smallest meaning-bearing unit of a language.

Introduction cont...
• A morphological analyzer is a basic tool for building word processing tools such as spell checkers and grammar checkers.
• It is also the first step towards NLP systems that perform deep linguistic processing, such as natural language interfaces, machine translation, search engines, text summarization, information extraction and story understanding systems.

Tamil Morphological Analyzer
• Tamil is a morphologically rich, agglutinative language.
• Each root word can be affixed with several morphemes.
• Separate analyzers are built for nouns and verbs.

Tamil Noun
• Noun Root + Case Marker, with an optional Plural Marker and Euphonic Increment between the root and the case marker

Tamil Noun Examples
• வண்டுகளுக்கு = வண்டு+கள்+உக்கு
• ஆற்றில் = ஆறு+ற்+இல்
• மரத்தை = மரம்+த்+ஐ

Tamil Verb
• Verb Root + Tense Marker + PNG (person, number, gender) Marker

Tamil Verb Examples
• படித்தான் = படி+த்த்+ஆன்
• போனாள் = போ+இன்+ஆள்
• வருகின்றேன் = வா+கின்ற்+ஏன்

Machine Learning
• Machine learning is a branch of artificial intelligence concerned with the design of algorithms that learn from examples.
• A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
• Computers are made to learn from data.

Machine Learning
• In machine learning, all the rules, including complex spelling rules, are handled through classification.
• Machine learning approaches do not require any hand-coded morphological rules.
• Machine learning approaches can be applied directly to a wide range of natural language processing tasks.

Machine Learning
• For any machine learning approach, data creation plays the key role.
• It needs only corpora annotated with linguistic information. The morphological or linguistic rules are extracted automatically from the annotated corpora.

Support Vector Machine
• Support vector approaches have been around since the mid-1990s.
• The Support Vector Machine is an approach to supervised pattern classification that has been successfully applied to a wide range of classification problems.
• The morphological analysis problem is converted into a classification problem.

SVMTool
• SVMTool is an open-source generator of sequential taggers based on Support Vector Machines.
• SVMTool was developed mainly for POS tagging, but here it is used for the classification task in the morphological analyzer.
• The SVMTool software package consists of three main components: the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval).

Drawbacks in Rule-Based Systems
• Rule-based approaches rely on a set of rules and dictionaries, including morpheme dictionaries.
• In rule-based approaches each rule works on the output of the previous rule, so if one rule fails, it affects all the rules that follow.

Formulation-Data Creation
• Classification for verbs and nouns.
• 32 paradigms for verbs, 25 for nouns.
• 563 inflections for verbs, 313 for nouns.

Formulation-Data Creation
• Classification for verbs and nouns.
        Paradigms   Inflections
VERB    32          563
NOUN    25          313

Models
• Two models are used.
• Model-I: segmentation of morphemes.
• Model-II: grammatical tagging of morphemes, which also handles the morphotactics.

Data Creation for Model-I
• Romanization
• Grapheme segmentation
• Syllable splitting
• Consonant-vowel representation
• Segmentation
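The consonant-vowel representation step can be illustrated with a minimal sketch. It assumes a romanization in which vowels are written with the letters a, e, i, o, u (upper case marking a long vowel, as in padiththAn) and labels every romanized character on its own. The vowel inventory and the name `cv_pattern` are illustrative assumptions, not the scheme used to build the original data.

```python
from typing import List, Tuple

# Hypothetical vowel inventory for the romanized text (an assumption):
# lower case = short vowel, upper case = long vowel.
VOWELS = set("aeiouAEIOU")


def cv_pattern(word: str) -> List[Tuple[str, str]]:
    """Pair every romanized character with 'V' (vowel) or 'C' (consonant)."""
    return [(ch, "V" if ch in VOWELS else "C") for ch in word]


if __name__ == "__main__":
    # One character and its C/V label per line.
    for ch, label in cv_pattern("padiththAn"):
        print(ch, label)
```

Labelling character by character keeps the sketch short; a scheme that treats a digraph such as "th" as a single consonant would need a grapheme segmentation pass first.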

Input Data Formulation

Sample Data
• Model-I training data
• Input: characters with consonant-vowel representation.
• Output: characters with morpheme boundaries.

Sample Data
• Model-I (Segmentation of Morphemes) training data.

Model-I
Input: padiththAn
Output: padi*thth*An*
Model-I segments the morphemes of the given word.
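One way to turn a segmented word such as padi*thth*An* into character-level training pairs is sketched below. The label names `E` (a character followed by a morpheme boundary) and `I` (any other character), and both function names, are assumptions made for illustration; the labels used in the actual training data are not given here.

```python
from typing import List, Tuple


def boundary_labels(segmented: str, marker: str = "*") -> List[Tuple[str, str]]:
    """Convert 'padi*thth*An*' into (character, label) pairs.

    'E' marks a character followed by a morpheme boundary, 'I' any other
    character. Both label names are hypothetical.
    """
    pairs: List[Tuple[str, str]] = []
    for ch in segmented:
        if ch == marker:
            if pairs:  # relabel the character just before the marker
                pairs[-1] = (pairs[-1][0], "E")
        else:
            pairs.append((ch, "I"))
    return pairs


def restore(pairs: List[Tuple[str, str]], marker: str = "*") -> str:
    """Rebuild the segmented string from (character, label) pairs."""
    return "".join(ch + (marker if label == "E" else "") for ch, label in pairs)


if __name__ == "__main__":
    for ch, label in boundary_labels("padi*thth*An*"):
        print(ch, label)
```

Framed this way, segmentation becomes a per-character classification problem: a sequence tagger predicts one label per character, and `restore` turns the predicted labels back into a segmented word.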

Model-II (Sample Data)
• Model-II training data
• Input: morphemes.
• Output: grammatical category of each morpheme, in morphotactic order.

EXAMPLE
• Model-II (Morpheme Tagging) training data.

Model-II
Input: padi*thth*An*
Output: padi <Verb> thth <Past> An <3sm>
Model-II identifies the grammatical category of each morpheme in morphotactic order.
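The Model-II input and output shown above can be read into simple data structures with a short sketch. The regular expression and the names `parse_tagged` and `split_morphemes` are illustrative assumptions about the plain-text layout on the slide; they are not part of SVMTool.

```python
import re
from typing import List, Tuple

# A morpheme (no spaces or angle brackets) followed by its tag in <...>.
TAGGED = re.compile(r"([^\s<>]+)\s*<([^<>]+)>")


def split_morphemes(segmented: str, marker: str = "*") -> List[str]:
    """Split Model-I output such as 'padi*thth*An*' into morphemes."""
    return [m for m in segmented.split(marker) if m]


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Read 'padi <Verb> thth <Past> An <3sm>' into (morpheme, tag) pairs."""
    return TAGGED.findall(line)


if __name__ == "__main__":
    morphemes = split_morphemes("padi*thth*An*")
    analysis = parse_tagged("padi <Verb> thth <Past> An <3sm>")
    # The tagged morphemes should line up with the segmented input.
    print(morphemes == [m for m, _ in analysis])
    print(analysis)
```

Keeping the morphemes in their original order matters here, because the tag sequence (root, tense, person-number-gender) reflects the morphotactics.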

Implementation
• Morphological analysis is redefined as a classification task.
• Generally there are three steps in a morphological analyzer: pre-processing, segmentation of morphemes, and identification of morphemes.

Conclusion
• This novel machine-learning approach to morphological analysis gives better results.
• The corpus created for this purpose can be used to develop a Tamil spell checker and other text processing tools.
• We are currently applying the same methodology to other Dravidian languages.

References
• Anandan, P., Ranjani Parthasarathy, and Geetha, T. V. 2002. Morphological Analyzer for Tamil. ICON 2002, RCILTS-Tamil, Anna University, India.
• Daelemans, Walter. 2004. In G. Booij, Ch. Lehmann, and J. Mugdan (eds.), Morphology: A Handbook on Inflection and Word Formation. Berlin and New York: Walter de Gruyter, 1893-1900.
• Dhanalakshmi, V., Anandkumar, M., Vijaya, M. S., Loganathan, R., Soman, K. P., and Rajendran, S. 2008. Tamil Part-of-Speech Tagger Based on SVMTool. Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP), Chiang Mai, Thailand, 59-64.
• Giménez, Jesús, and Màrquez, Lluís. 2006. SVMTool: Technical Manual v1.3. August 2006.
• Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, 27(2):153–198.
• Rajendran, S., Arulmozi, S., Ramesh Kumar, and Viswanathan, S. 2001. Computational Morphology of Verbal Complex. Paper read at a conference at Dravidian University, Kuppam, December 26-29, 2001.

Thank you
