Indo WordNet: A WordNet for Hindi. Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, Madhu Prasad Sharma. Centre for Technology Development for Indian Languages, Computer Science and Engineering Department, IIT Bombay.
Introduction • WordNet – A lexical database • Searching the dictionary conceptually • Different organizing principles for different syntactic categories • Synsets (synonym sets) are the basic building blocks • A lexical knowledge base is the heart of any intelligent information processing system
WordNet for Hindi • Hindi WordNet is an online lexical database for the Hindi language • Its design is inspired by the English WordNet • Unique features • Graded antonymy and meronymy relationships • Efficient underlying database design • Cross-part-of-speech linkage
Semantic relations in WordNet • Synonymy • Hypernymy / Hyponymy • Antonymy • Meronymy / Holonymy • Gradation • Entailment • Troponymy
Semantic Relations • Synonymy • True synonyms are rare • Synonymy is relative to a context • {घर, कमरा} (house, room) • {घर, आवास} (house, residence) • {घर, जन्मकुंडलीय स्थान} (house, position in a horoscope) • {घर, स्वदेश} (home, homeland)
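The context-dependence of synonymy can be sketched as a word-to-synsets index. This is a minimal illustration, not the authors' implementation: the synset IDs, the `SYNSETS` table, and the `senses` helper are hypothetical, and the entries are the four senses of घर (ghar, "house") from the slide.

```python
from collections import defaultdict

# Hypothetical synset IDs; each context-specific sense is its own synset.
SYNSETS = {
    1: ["घर", "कमरा"],               # house, room
    2: ["घर", "आवास"],               # house, residence
    3: ["घर", "जन्मकुंडलीय स्थान"],   # house, position in a horoscope
    4: ["घर", "स्वदेश"],              # home, homeland
}


def build_index(synsets):
    """Map every word form to the IDs of the synsets that contain it."""
    index = defaultdict(list)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].append(synset_id)
    return index


INDEX = build_index(SYNSETS)


def senses(word):
    """Return all synsets (one per sense) that a word form belongs to."""
    return [SYNSETS[synset_id] for synset_id in INDEX.get(word, [])]


if __name__ == "__main__":
    for synset in senses("घर"):
        print(synset)
```

A word such as घर belongs to several synsets, while each of its partners is a synonym only within one of them, which is why the synset rather than the word is the basic unit.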
Semantic Relations • Hypernymy and Hyponymy • Relation between word meanings (synsets) • X is a hyponym of Y if X is a kind of Y • Hyponymy is transitive and asymmetrical • Hypernymy is the inverse of hyponymy • lion → animal → living entity → entity • शेर → पशु → सजीव → अस्तित्व
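Transitivity and asymmetry can be illustrated with the slide's own chain. The sketch below assumes, for simplicity, that each node has at most one hypernym (real WordNets allow several); the `HYPERNYM` mapping and both function names are hypothetical.

```python
# Direct hypernym links taken from the slide's example chain.
HYPERNYM = {
    "lion": "animal",
    "animal": "living entity",
    "living entity": "entity",
}


def hypernym_chain(word):
    """Follow direct hypernym links upward until the root is reached."""
    chain = []
    seen = {word}
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word in seen:  # guard against malformed (cyclic) data
            raise ValueError("cycle in hypernym links")
        seen.add(word)
        chain.append(word)
    return chain


def is_hyponym_of(x, y):
    """True if x is a kind of y, directly or through intermediate links."""
    return y in hypernym_chain(x)


if __name__ == "__main__":
    print(hypernym_chain("lion"))
    print(is_hyponym_of("lion", "entity"))   # transitive: holds
    print(is_hyponym_of("entity", "lion"))   # asymmetrical: does not hold
```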
Semantic Relations • Antonymy • Oppositeness in meaning • Relation between word forms • Meronymy and Holonymy • Part-whole relation, e.g. a branch is a part of a tree • X is a meronym of Y if X is a part of Y • Meronymy is transitive and asymmetrical • Holonymy is the inverse relation of meronymy
Troponymy and Entailment • Entailment • {खर्राटा लेना – सोना} (snoring entails sleeping) • Troponymy • {लँगड़ाना, कदमताल करना – चलना} (limping and marching are manners of walking) • {फुसफुसाना – बोलना} (whispering is a manner of speaking)
Classification of verbs • Simple verbs (सरल क्रिया): सोना (to sleep), खाना (to eat) • Conjunct verbs (संयुक्त क्रिया) • Compound verbs (सामासिक क्रिया): खाना-पीना (to eat and drink) • Causative verbs (प्रेरणात्मक क्रिया): सुलवाना (to cause to sleep)
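WordNet Sub-Graph • [Figure: sub-graph centred on the synset {घर, गृह} (house)] • Gloss: मनुष्यों का छाया हुआ वह स्थान जो दीवारों से घेर कर बनाया जाता है ("a roofed place for people, built by enclosing it with walls") • Edges labelled Hypernymy, Hyponymy and Meronymy connect it to: संरचना (structure), {आवास, निवास} (residence), रसोईघर (kitchen), आँगन (courtyard), शयन कक्ष (bedroom), बरामदा (veranda), अध्ययन कक्ष (study room), अतिथि गृह (guest house), आश्रम (hermitage), झोपड़ी (hut)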
Design and Implementation • Basic relations (lexical links) hold between synonym sets • The lexical database is stored in MySQL • Sub-tasks identified • Database design • Data entry interface • Implementation of the Organizer Utility • Application programs to access and display the information in the lexical database
Data Entry Interface • GUI designed in Java/JFC • Separate data entry screen for each category • Automatic generation of synset IDs • Screen to view the entered data
Organizer Utility • Designed to preprocess the data • Reflexive pointers are generated • e.g. if A is a hypernym of B, then "B is a hyponym of A" is generated automatically • Each semantic relation is mapped to a separate (normalized) table • Font conversion • Roman → Hindi (DV-TTYogesh)
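The reflexive-pointer step can be sketched as follows. The slides state that the organizer utility is written in Perl; this Python version is only an illustration of the idea, and the `INVERSE` table, the triple layout, and the function name are assumptions rather than the authors' design.

```python
# Hypothetical inverse-relation table. A triple (A, "hypernym", B) is read
# as "A is a hypernym of B", following the wording on the slide.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
    "antonym": "antonym",  # antonymy is its own inverse
}


def add_reflexive_pointers(pointers):
    """Return the input pointers plus the inverse of each one.

    Relations without an entry in INVERSE are kept but not mirrored.
    """
    completed = set(pointers)
    for source, relation, target in pointers:
        inverse = INVERSE.get(relation)
        if inverse is not None:
            completed.add((target, inverse, source))
    return completed


if __name__ == "__main__":
    entered = {("animal", "hypernym", "lion"), ("branch", "meronym", "tree")}
    for pointer in sorted(add_reflexive_pointers(entered)):
        print(pointer)
```

Generating the inverse automatically means a lexicographer enters each link once, and the two directions cannot drift out of sync.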
Storage Structure • Relation between Synsets • tblNounHypernyms • Relation between Word-forms • tblNounAntonyms
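The split between synset-level and word-form-level relations can be sketched with two link tables. Only the names `tblNounHypernyms` and `tblNounAntonyms` come from the slides; the columns, the helper tables `tblNounSynsets` and `tblNounWords`, and the sample English data are assumptions, and SQLite stands in for MySQL so the example runs without a server.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tblNounSynsets (synset_id INTEGER PRIMARY KEY, gloss TEXT);
CREATE TABLE tblNounWords (
    word_id   INTEGER PRIMARY KEY,
    synset_id INTEGER NOT NULL REFERENCES tblNounSynsets(synset_id),
    word      TEXT NOT NULL
);
-- Relation between synsets: both columns are synset IDs.
CREATE TABLE tblNounHypernyms (
    synset_id   INTEGER NOT NULL REFERENCES tblNounSynsets(synset_id),
    hypernym_id INTEGER NOT NULL REFERENCES tblNounSynsets(synset_id),
    PRIMARY KEY (synset_id, hypernym_id)
);
-- Relation between word forms: both columns are word IDs.
CREATE TABLE tblNounAntonyms (
    word_id         INTEGER NOT NULL REFERENCES tblNounWords(word_id),
    antonym_word_id INTEGER NOT NULL REFERENCES tblNounWords(word_id),
    PRIMARY KEY (word_id, antonym_word_id)
);
"""


def make_db():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO tblNounSynsets VALUES (?, ?)",
                     [(1, "large cat"), (2, "living creature"),
                      (3, "daytime"), (4, "night-time")])
    conn.executemany("INSERT INTO tblNounWords VALUES (?, ?, ?)",
                     [(1, 1, "lion"), (2, 2, "animal"), (3, 2, "beast"),
                      (4, 3, "day"), (5, 4, "night")])
    conn.execute("INSERT INTO tblNounHypernyms VALUES (1, 2)")
    conn.executemany("INSERT INTO tblNounAntonyms VALUES (?, ?)",
                     [(4, 5), (5, 4)])
    return conn


def hypernym_words(conn, word):
    """Words in every synset that is a direct hypernym of the word's synset."""
    rows = conn.execute(
        """SELECT hw.word FROM tblNounWords AS w
           JOIN tblNounHypernyms AS h ON h.synset_id = w.synset_id
           JOIN tblNounWords AS hw ON hw.synset_id = h.hypernym_id
           WHERE w.word = ? ORDER BY hw.word""", (word,)).fetchall()
    return [row[0] for row in rows]


def antonyms(conn, word):
    """Antonyms attached to this exact word form, not to its whole synset."""
    rows = conn.execute(
        """SELECT a.word FROM tblNounWords AS w
           JOIN tblNounAntonyms AS t ON t.word_id = w.word_id
           JOIN tblNounWords AS a ON a.word_id = t.antonym_word_id
           WHERE w.word = ? ORDER BY a.word""", (word,)).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = make_db()
    print(hypernym_words(db, "lion"))
    print(antonyms(db, "day"))
```

Keying hypernymy on synset IDs lets every member of a synset inherit the link, whereas keying antonymy on word IDs keeps the opposition tied to one specific word form.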
System Statistics • Over 8500 synsets entered in the database • MySQL used as the back-end database server • Data entry interface designed in Java/JFC • Organizer utility written in Perl • Web-based data retrieval system developed in HTML and PHP • DV-TTYogesh font used to display Hindi text
Application of WordNet • Word Sense Disambiguation • Interface to Internet Search Engines • Text classification • Information Retrieval system • Document Similarity
Conclusion • The structure of the Hindi language has been studied, and new features have been introduced in the Hindi WordNet • Over 8500 synsets have been entered into the database so far • The MySQL database has been found to be quite efficient • The web interface for querying the lexical database is under continuous development