Tools and Interfaces for Wordnet construction, linking and maintenance Abhishek G. Nanda 03005031 Under the guidance of: Prof. Pushpak Bhattacharyya
Wordnet • Language - Means of communication using encoded information • Words - Units used for communicating information • Semantics - Meanings of words and word forms
Wordnet • Dictionary - List of alphabetically arranged words with meanings • Thesaurus - List of alphabetically arranged concepts with word forms What is Wordnet?
Wordnet • Lexical database of words • Arranged based on concepts • Grouped based on synonymy • Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase • Polysemy - Property of a word having different meanings in different contexts. Eg. bank as financial institution and as river bank
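The grouping described above can be sketched as a small in-memory lexicon: each synset holds the words that share one concept, and a word that appears in several synsets is polysemous. This is only an illustrative sketch; the class name `MiniLexicon`, its methods and the synset IDs are hypothetical and do not come from any real wordnet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal lexicon; names and IDs are illustrative assumptions.
final class MiniLexicon {
    // synset ID -> member words (synonyms that share one concept)
    private final Map<Integer, List<String>> synsets = new LinkedHashMap<>();
    // word -> IDs of every synset the word belongs to
    private final Map<String, List<Integer>> index = new LinkedHashMap<>();

    void addSynset(int id, String... words) {
        synsets.put(id, List.of(words));
        for (String w : words) {
            index.computeIfAbsent(w, k -> new ArrayList<>()).add(id);
        }
    }

    // Other members of the given synset, i.e. synonyms of the word in that sense.
    List<String> synonyms(String word, int synsetId) {
        List<String> out = new ArrayList<>(synsets.getOrDefault(synsetId, List.of()));
        out.remove(word);
        return out;
    }

    // All senses (synset IDs) recorded for a word, in insertion order.
    List<Integer> senses(String word) {
        return index.getOrDefault(word, List.of());
    }

    // A word is polysemous when it belongs to more than one synset.
    boolean isPolysemous(String word) {
        return senses(word).size() > 1;
    }
}
```

With synsets {buy, purchase}, {bank} (financial institution) and {bank, riverbank}, `synonyms("buy", 1)` yields purchase, while `isPolysemous("bank")` is true because bank is indexed under two synset IDs.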
Wordnet - Relations • Semantic Relations • Hypernymy and Hyponymy • Meronymy and Holonymy • Entailment • Troponymy • Coordinate terms • Lexical Relations • Antonymy • Gradation
Wordnet - Relations • Hypernymy and Hyponymy • is a kind of • leaf is the hypernym of neem leaf • neem leaf is the hyponym of leaf • Meronymy and Holonymy • part-whole • root is the meronym of tree • tree is the holonym of root
Wordnet - Relations • Entailment • implication • snore entails sleep • Troponymy • manner elaboration • roar is the troponym of speak • Coordinate terms • Common hypernym • wolf and dog are coordinate terms
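Hypernymy and coordinate terms can be sketched as a small "is a kind of" graph. The sketch below assumes, for simplicity, that every term has at most one hypernym, which real wordnets do not guarantee; the class `HypernymGraph` is hypothetical, and "canine" as the shared hypernym of wolf and dog is an assumption added for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: one hypernym per term (a simplifying assumption).
final class HypernymGraph {
    // hyponym -> hypernym ("neem leaf" is a kind of "leaf")
    private final Map<String, String> hypernymOf = new HashMap<>();

    void addIsA(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    // Returns the hypernym of a term, or null if none is recorded.
    String hypernym(String term) {
        return hypernymOf.get(term);
    }

    // All terms whose recorded hypernym is the given term, sorted alphabetically.
    Set<String> hyponyms(String term) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, String> e : hypernymOf.entrySet()) {
            if (e.getValue().equals(term)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // Two distinct terms are coordinate when they share the same hypernym.
    boolean areCoordinate(String a, String b) {
        String ha = hypernymOf.get(a);
        return !a.equals(b) && ha != null && ha.equals(hypernymOf.get(b));
    }
}
```

After `addIsA("wolf", "canine")` and `addIsA("dog", "canine")`, `areCoordinate("wolf", "dog")` holds, while wolf and neem leaf are not coordinate because their hypernyms differ.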
Wordnet - Relations • Antonymy • opposites • fat is the antonym of thin • Gradation • Intermediate concepts in antonymy • morning -> noon -> evening
Wordnet - Wordnets • PWN - Princeton WordNet for English language • EuroWordNet - Wordnet for European languages • HWN - Hindi Wordnet for Hindi language
Hindi Wordnet • Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc. • Defines 8 part-whole relationships • Defines 3 types of antonymy relations • Gradable antonym (गर्म-ठंडा) • Complementary antonym (जीवित-मृत) • Converse antonym (लेना-देना)
Hindi Wordnet • Gradation • Intermediate terms • Pre-Intermediate terms • Post-Intermediate terms • Eg. सूखा - शुष्क - नम - तर - गीला • 10 domains of interpretation. Eg. State, Size, Gender, etc.
Hindi Wordnet - Verbs • Simple Verb - One root. Eg. खाना • Compound Verb - Made up of another POS. Eg. मीठा लगना • Combination Verb - Made of two related verbs. Eg. पढ़ना-लिखना • Onomatopoeic Verb - Eg. खटखटाना from खटखट • Conjunct Verb - Hidden sense of action. Eg. ले जाना
Hindi Wordnet - Verbs • Causative verbs • First causative verb - Eg. सुलाना(to make somebody sleep) • Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)
Hindi Wordnet - Creation Principles for Wordnet creation • Minimality - Minimal set. Eg. {घर, कमरा, कक्ष} • Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष} • Replaceability - Mutual replaceability in a context. Eg. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा
Sanskrit Wordnet Concept-based Multilingual dictionary • Need • Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their Hindi counterparts अंधेरा and दुष्ट are not. • Number of lexicographers required - O(n²) for n languages
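One way to picture a concept-based multilingual dictionary is a table keyed by concept ID, with one entry per language, so that each language links to the shared concept rather than to every other language. The sketch below is an assumption-laden illustration, not the project's data model: `ConceptDictionary`, the concept IDs and the language codes are hypothetical, and the Hindi words are romanized to keep the source ASCII. The helper `pairwiseDictionaries` shows where quadratic growth comes from: linking n languages pair by pair needs n(n-1)/2 bilingual resources.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a concept-keyed multilingual dictionary.
final class ConceptDictionary {
    // concept ID -> (language code -> words expressing that concept)
    private final Map<Integer, Map<String, List<String>>> concepts = new HashMap<>();

    void put(int conceptId, String lang, String... words) {
        concepts.computeIfAbsent(conceptId, k -> new HashMap<>()).put(lang, List.of(words));
    }

    // Words for a concept in one language; empty if the concept is not yet linked there.
    List<String> lookup(int conceptId, String lang) {
        return concepts.getOrDefault(conceptId, Map.of()).getOrDefault(lang, List.of());
    }

    // Number of language pairs when n languages are linked pairwise: n(n-1)/2.
    static int pairwiseDictionaries(int n) {
        return n * (n - 1) / 2;
    }
}
```

Storing "dark" under both a "lacking light" concept and an "evil" concept keeps the English synonymy, while the Hindi side lists a different word for each concept. For 16 languages, `pairwiseDictionaries(16)` is 120, whereas a concept-keyed table needs only one column per language.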
Sanskrit Wordnet - Challenges Observed during construction of Marathi Wordnet: • Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना • Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिलामित्र • Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळत (less spicy) in Marathi
Sanskrit Wordnet - Challenges Observed during Indo Wordnet workshop at Coimbatore, June 2009: • Varied usage across regions and people. Eg. In Kashmiri, separate words for drinking water and water in the Muslim community but one word in the Hindu community. • Single-word and multi-word expressions in same language. Eg. In Nepali, मोह and मोह-माया both mean infatuation.
Sanskrit Wordnet - Sanskrit • Indo-Aryan language • Hinduism • Buddhism • Classical Sanskrit - Panini • Vedic Sanskrit - pre-Classical
Sanskrit Wordnet - Sanskrit Etymology • Etymology of Verbs • गण - Ten classes based on how stem is generated • इट् - Three groups based on position of tense marker • उपसर्ग - 22 prepositional particles that modify a root
Synset Marking • Grouping of synsets based on frequency of occurrence and usage in language • Universal concepts • who and what • honesty
SynsetMarker - Features • Display of synset fields • Browsing • Search • Word • ID • Marking - Universal, Common, Common in Hindi and Uncommon • Save/Exit • Shortcuts
SynsetMarker - API • records • DefineRecord • SynsetRecord • operations • SynsetOperator • RecordReader • RecordWriter • gui • Interface
SynsetMarker - Process • First round divided among 6 people • 31000 synsets marked • Universal and Common clubbed - 15234 synsets • Common in Hindi - 6771 synsets • Uncommon - 10987 synsets • Second round voting schema • Common - 13205 synsets
Core Synset Selection • Bharatiya Vyavahara Kosh • English and 15 Indian languages • 2000 concepts with domains • खेल (game), प्राणी (animal), फल (fruit) • Link synsets to words in Kosh • Polysemy • अनन्नास as pineapple fruit • अनन्नास as pineapple plant
DomainClassifier - Features • Display of synset fields • Browsing through records • Marking right synset for a word and a domain • Save/Export
DomainClassifier - API • records • DefineRecord • SynsetRecord • operations • SynsetOperator • RecordReader • RecordWriter • gui • Interface
DomainClassifier - Process • Groupings • Single IDs • Multiple IDs • No IDs • Rounds of marking • Common synsets • Common in Hindi synsets • Uncommon synsets
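The three groupings can be read as a split by how many candidate synset IDs match a Kosh word: exactly one, several (polysemy, needing a manual choice), or none. That reading is an assumption, and the sketch below is only one possible way to compute such a split; `LinkGrouper`, its enum and the sample IDs are hypothetical, and the words are romanized.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group words by the number of candidate synset IDs.
final class LinkGrouper {
    enum Group { NO_IDS, SINGLE_ID, MULTIPLE_IDS }

    // Classify one word from its list of candidate synset IDs.
    static Group classify(List<Integer> candidateIds) {
        if (candidateIds.isEmpty()) {
            return Group.NO_IDS;
        }
        return candidateIds.size() == 1 ? Group.SINGLE_ID : Group.MULTIPLE_IDS;
    }

    // Bucket every word; each group is present in the result, possibly empty.
    static Map<Group, List<String>> group(Map<String, List<Integer>> candidates) {
        Map<Group, List<String>> out = new EnumMap<>(Group.class);
        for (Group g : Group.values()) {
            out.put(g, new ArrayList<>());
        }
        for (Map.Entry<String, List<Integer>> e : candidates.entrySet()) {
            out.get(classify(e.getValue())).add(e.getKey());
        }
        return out;
    }
}
```

Under this reading, a word such as ananas with two candidate synsets (fruit and plant) lands in `MULTIPLE_IDS` and goes to a human marker, while single-ID words could be linked directly.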
DomainClassifier - Process • End of process • Core - 1969 synsets • Common - 11658 synsets
Online SynsetMarker - API Written in PHP • login.php - Interface to login as a user or as an admin or to register as a new user • process.php - To process login/register data and accordingly direct a user • logout.php - To logout a user • mainprocess.php - Processing of data to display unmarked synset • main.php - Display of synset with buttons to mark as Common or Uncommon • admin.php - Admin page with statistical data of number of marked synsets per user and number of users based on synset marks • adminpassword.php - Password interface to login as admin • adminuserprofile.php - Profile data of a particular user
Online SynsetMarker - Process • Threshold for dropping synset as Uncommon • Had to be set to 1 • Common - 10312 synsets
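A threshold rule of this kind might look like the sketch below, assuming that each synset collects Common/Uncommon marks from several users and is dropped once the number of Uncommon marks reaches the threshold. The slides do not spell out the exact rule, so this interpretation, the class `UncommonThreshold` and its method are all hypothetical.

```java
import java.util.List;

// Hypothetical sketch of a vote threshold; the exact rule is an assumption.
final class UncommonThreshold {
    enum Mark { COMMON, UNCOMMON }

    // A synset is dropped when its Uncommon marks reach the threshold.
    static boolean isDropped(List<Mark> marks, int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        long uncommon = marks.stream().filter(m -> m == Mark.UNCOMMON).count();
        return uncommon >= threshold;
    }
}
```

With a threshold of 1, a single Uncommon mark is enough to drop a synset, so only synsets that every marker considered Common remain.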
Sanskrit Wordnet Interface • Interface for creation of Sanskrit Wordnet • Based on idea of Concept-based Multilingual dictionary
User Interface - Panels • Help Panel: Buttons for Commenting, Synchronizing and References tool. • Search Panel: Search word or ID or perform advanced search. Font increase/decrease. • Synset Panels: Synset data fields and completion status. • Tool Panel: English synset, Link tool, Etymology tool. • Browse Panel: Browsing through records, saving and exiting.
User Interface - Features - Keyboard Shortcuts • Undo feature - Monitor keyboard actions and undo on Ctrl-Z • Saving feature - Monitor change in field values and save on Ctrl-S • Search - Ctrl-F for quick search access
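The undo and save behaviour can be sketched independently of the GUI as a field model that records previous values and tracks whether it differs from the last saved state. This is a minimal sketch under assumptions: `EditableField` and its methods are hypothetical, and wiring `undo()` to Ctrl-Z and `save()` to Ctrl-S in the actual interface is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of one editable synset field with undo and change tracking.
final class EditableField {
    private String value;
    private String savedValue;
    // Previous values, most recent on top.
    private final Deque<String> history = new ArrayDeque<>();

    EditableField(String initial) {
        this.value = initial;
        this.savedValue = initial;
    }

    // Record the old value only when the text actually changes.
    void set(String newValue) {
        if (!newValue.equals(value)) {
            history.push(value);
            value = newValue;
        }
    }

    // Handler a Ctrl-Z binding could call; returns false when nothing is left to undo.
    boolean undo() {
        if (history.isEmpty()) {
            return false;
        }
        value = history.pop();
        return true;
    }

    // True when the current value differs from the last saved value.
    boolean isDirty() {
        return !value.equals(savedValue);
    }

    // Handler a Ctrl-S binding could call; saves only when something changed.
    boolean save() {
        if (!isDirty()) {
            return false;
        }
        savedValue = value;
        return true;
    }

    String get() {
        return value;
    }
}
```

Keeping this logic outside the Swing components makes it easy to test without opening a window; the interface would only translate key events into `set`, `undo` and `save` calls.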
Interface API Problems and Requirements • Huge volumes of data (eg. 30,000 synsets) • Links between different data • Efficient and user-friendly GUI • Sufficient querying • Grouping • Review separation
Graphical User Interface
    // Lazily created Save button: built on first access, then reused.
    JButton saveButton = null;

    public JButton getSaveButton() {
        if (saveButton == null) {
            saveButton = new JButton();
        }
        return saveButton;
    }