Develop a program for unsupervised learning of morphological rules in Hindi text to enhance vocabulary analysis and comprehension.
Unsupervised Morphological Analysis of Hindi • Aditi Krishn, Rabi Shanker Guha • Advisor: Dr. Amitabh Mukherjee • Department of Computer Science and Engineering, IIT Kanpur • Course Project for CS365: Introduction to Artificial Intelligence
GOAL • To develop a program that learns the structure of words (the 'morphological rules') of the Hindi language from raw text. • INPUT: Raw, untagged corpus with spaces separating symbols (no text preparation). • OUTPUT: • (Stem) X (Signature) • Similar words further collapsed into a single stem: (Stem) X (rule1: signature) (rule2: signature) • A reduced Minimum Description Length (MDL) of the grammar. • Unsupervised learning: no prior knowledge.
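The (Stem) X (Signature) output can be illustrated with a minimal sketch. It is not the project's program or Linguistica's bootstrap heuristic: it is a naive split-and-count procedure, and the names `find_signatures` and `min_stem_len` are hypothetical. Every word is split at each position, stems seen with at least two suffixes are kept, and each word is assigned to its longest such stem.

```python
from collections import defaultdict

NULL = "NULL"  # marker for the empty suffix, as written on the slides


def find_signatures(words, min_stem_len=2):
    """Map each chosen stem to its signature (a sorted tuple of suffixes).

    Naive illustration only; strings are split by Unicode code point.
    """
    vocab = set(words)

    # Every split of every word into stem + suffix is a candidate.
    candidates = defaultdict(set)
    for w in vocab:
        for i in range(min_stem_len, len(w) + 1):
            candidates[w[:i]].add(w[i:] or NULL)

    # Keep only stems that occur with at least two different suffixes.
    productive = {s for s, sufs in candidates.items() if len(sufs) >= 2}

    # Assign each word to its longest productive stem.
    signatures = defaultdict(set)
    for w in vocab:
        for i in range(len(w), min_stem_len - 1, -1):
            if w[:i] in productive:
                signatures[w[:i]].add(w[i:] or NULL)
                break

    return {s: tuple(sorted(sufs)) for s, sufs in signatures.items()}


if __name__ == "__main__":
    for stem, sig in find_signatures(["उठ", "उठता", "उठी", "आर्य", "आर्यों"]).items():
        print(stem, sig)
```

On this toy word list the sketch yields two stems, उठ and आर्य, each with a signature that contains NULL. A real corpus needs frequency thresholds and the MDL check to reject accidental splits.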
Motivation • Better reading and comprehension of text. • Combining prefixes, suffixes and roots leads to an overall decrease in the Minimum Description Length of the vocabulary. • Helps in deciphering unfamiliar vocabulary.
Morphology • Morpheme: • Derivational: happy - happiness • Inflectional: dog - dogs • Example of morphology in Hindi: • Other Morphological Forms: (Dissimilar Spelling)*
Previous Work • LINGUISTICA: John Goldsmith • Linguistica is a program designed to explore the unsupervised learning of natural language. • It explores possible morpheme combinations for a set of words using no internal knowledge of the language. • PRIORS IN BAYESIAN LEARNING OF PHONOLOGICAL RULES: Sharon Goldwater and Mark Johnson • Describes a method for unsupervised learning of phonological rules from an unlabelled corpus. • Uses a set of phonological rules (insertion, deletion and substitution) that further collapse the grammar.
Linguistica: Algorithm • Corpus -> Bootstrap heuristic -> Morphology • Morphology -> incremental heuristics -> modified morphology • Is the modification an improvement? Ask MDL!
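The "Ask MDL" step can be sketched with a crude character-count proxy for description length. This is an assumption made for illustration: Goldsmith's MDL is measured in bits from probabilities, not raw character counts, and the functions `word_list_cost`, `morphology_cost` and `accept_if_shorter` are hypothetical names.

```python
NULL = "NULL"  # the empty suffix


def word_list_cost(words):
    """Cost of spelling out every distinct word (+1 per word boundary)."""
    return sum(len(w) + 1 for w in set(words))


def morphology_cost(signatures):
    """Cost of a stem list, a suffix list, and the pointers linking them.

    `signatures` maps stem -> tuple of suffixes.
    """
    stems = sum(len(s) + 1 for s in signatures)
    distinct_suffixes = {x for sig in signatures.values() for x in sig}
    suffixes = sum((0 if x == NULL else len(x)) + 1 for x in distinct_suffixes)
    distinct_sigs = set(signatures.values())
    # One pointer per suffix inside each distinct signature,
    # plus one pointer from every stem to its signature.
    pointers = sum(len(sig) for sig in distinct_sigs) + len(signatures)
    return stems + suffixes + pointers


def accept_if_shorter(current_cost, proposed_cost):
    """Accept a modified morphology only if it shortens the description."""
    return proposed_cost < current_cost


if __name__ == "__main__":
    words = ["dog", "dogs", "cat", "cats"]
    analysis = {"dog": (NULL, "s"), "cat": (NULL, "s")}
    print(word_list_cost(words), morphology_cost(analysis))  # 18 15
    print(accept_if_shorter(word_list_cost(words), morphology_cost(analysis)))  # True
```

Under this proxy a signature pays off only when several stems share it: with dog/dogs alone the analysis costs 10 against 9 for the plain word list and would be rejected.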
Application on Hindi • Collapsed stems (stem, then one signature per rule): • आर्य: NULL, ों • उठ: [--] NULL, ता, ी • [-ा] NULL, एं, ने • उत्सवों-महोत्सव: ,ों • मर: [ा-] NULL, आ, ना • [--] NULL, आ, ना, वाया • Stems with their signatures: • आर्य: NULL, ों • उठ: NULL, ता, ी • उठा: NULL, एं, ने • उत्सवों-महोत्सव: ,ों • मार: NULL, आ, ना • मर: NULL, आ, ना, वाया • छिड़काव: NULL, ों • जा: NULL, एं, ता, ने • जी: NULL, ने • जात: ,ियों, ी
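The grouping of उठ with उठा and of मर with मार can be sketched as follows. This is not the project's algorithm: it assumes a purely orthographic test (one stem equals another plus a single inserted character), and the names `single_insertion` and `collapse_stems` are hypothetical. Deletion and substitution rules are not covered.

```python
def single_insertion(short, long):
    """Return (index, char) if `long` is `short` plus one inserted character."""
    if len(long) != len(short) + 1:
        return None
    for i in range(len(long)):
        if long[:i] + long[i + 1:] == short:
            return i, long[i]
    return None


def collapse_stems(signatures):
    """Group stems that differ by one inserted character under the shorter stem.

    Returns stem -> list of (rule, signature); rule is None for the
    unchanged stem, or (index, char) for an insertion.
    """
    collapsed = {}
    absorbed = set()
    stems = sorted(signatures, key=len)  # shorter stems become the base
    for base in stems:
        if base in absorbed:
            continue
        rules = [(None, signatures[base])]
        for other in stems:
            if other == base or other in absorbed:
                continue
            hit = single_insertion(base, other)
            if hit is not None:
                rules.append((hit, signatures[other]))
                absorbed.add(other)
        collapsed[base] = rules
    return collapsed


if __name__ == "__main__":
    sigs = {
        "उठ": ("NULL", "ता", "ी"),
        "उठा": ("NULL", "एं", "ने"),
        "मर": ("NULL", "आ", "ना", "वाया"),
        "मार": ("NULL", "आ", "ना"),
        "जी": ("NULL", "ने"),
    }
    for stem, rules in collapse_stems(sigs).items():
        print(stem, rules)
```

Because the test looks only at spelling, it can also merge stems that happen to differ by one character without being related, so its groupings need to be checked.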
Result Analysis • The results were obtained by running Linguistica on the IIT-B CFILT corpus. • Out of 249 stems, 6 were false positives (2.4% incorrect). • False negatives: stems that do not occur often enough in the corpus are not displayed. • Complex structural morphemes, e.g. चल रहा, do not appear because they are written as separate words. • Running our algorithm on the Linguistica output gives 4 correct groupings out of 7.
References • Goldsmith, John. 2000. Linguistica: An Automatic Morphological Analyzer. • Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. • Goldsmith, John. 2004 (ms.). An Algorithm for the Unsupervised Learning of Morphology. • Goldwater, Sharon and Mark Johnson. Priors in Bayesian Learning of Phonological Rules. • Singh, Rajendra and R.K. Agnihotri. Hindi Morphology. • http://linguistica.uchicago.edu/ • http://www.uknow.gse.harvard.edu/teaching/TC102-407.html