
Tools and Interfaces for Wordnet construction, linking and maintenance

Abhishek G. Nanda (03005031), under the guidance of Prof. Pushpak Bhattacharyya.


Presentation Transcript


  1. Tools and Interfaces for Wordnet construction, linking and maintenance Abhishek G. Nanda 03005031 Under the guidance of: Prof. Pushpak Bhattacharyya

  2. Wordnet • Language - Means of communication using encoded information • Words - Units used for communicating information • Semantics - Meanings of words and word forms

  3. Wordnet • Dictionary - List of alphabetically arranged words with meanings • Thesaurus - List of alphabetically arranged concepts with word forms What is Wordnet?

  4. Wordnet • Lexical database of words • Arranged based on concepts • Grouped based on synonymy • Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase • Polysemy - Property of a word having different meanings in different contexts. Eg. bank as financial institution and as river bank

  5. Wordnet - Lexical Matrix
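The lexical matrix can be read as a table of meanings (rows) against word forms (columns): several word forms in one row express synonymy, and one word form appearing in several rows expresses polysemy. The sketch below is only an illustration of that idea using the examples from the previous slide; the class name `LexicalMatrix`, its methods and the meaning labels are hypothetical and not part of any Wordnet API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a lexical matrix: meanings (rows) mapped to word forms (columns).
class LexicalMatrix {
    // meaning label -> word forms that express it
    private final Map<String, List<String>> rows = new LinkedHashMap<>();

    void add(String meaning, String... wordForms) {
        rows.computeIfAbsent(meaning, k -> new ArrayList<>()).addAll(Arrays.asList(wordForms));
    }

    // Synonymy: all word forms that share one meaning.
    List<String> synonyms(String meaning) {
        return rows.getOrDefault(meaning, Collections.<String>emptyList());
    }

    // Polysemy: all meanings that a single word form can express.
    List<String> meaningsOf(String wordForm) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().contains(wordForm)) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    // Example rows; the meaning labels are illustrative glosses, not Wordnet entries.
    static LexicalMatrix example() {
        LexicalMatrix m = new LexicalMatrix();
        m.add("acquire by paying", "buy", "purchase");
        m.add("financial institution", "bank");
        m.add("sloping land beside a river", "bank");
        return m;
    }
}
```

With the example data, `synonyms("acquire by paying")` lists both buy and purchase, while `meaningsOf("bank")` returns two meanings.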

  6. Wordnet - Relations • Semantic Relations • Hypernymy and Hyponymy • Meronymy and Holonymy • Entailment • Troponymy • Coordinate terms • Lexical Relations • Antonymy • Gradation

  7. Wordnet - Relations • Hypernymy and Hyponymy • is a kind of • leaf is the hypernym of neem leaf • neem leaf is the hyponym of leaf • Meronymy and Holonymy • part-whole • root is the meronym of tree • tree is the holonym of root

  8. Wordnet - Relations • Entailment • implication • snore entails sleep • Troponymy • manner elaboration • roar is the troponym of speak • Coordinate terms • Common hypernym • wolf and dog are coordinate terms
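The hypernymy and coordinate-term relations above can be sketched as a simple parent lookup. This is a minimal illustration, assuming each concept has at most one direct hypernym and that the data contains no cycles (a real wordnet allows several hypernyms per synset). The class `RelationSketch` is hypothetical, and "canine" as the shared hypernym of wolf and dog is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: hypernymy stored as a child -> parent map.
class RelationSketch {
    // concept -> its direct hypernym ("is a kind of")
    private final Map<String, String> hypernymOf = new HashMap<>();

    void addHypernym(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    // True if candidateHypernym is reachable by walking up from word.
    // Assumes the hypernym chain is acyclic.
    boolean isHyponymOf(String word, String candidateHypernym) {
        String current = hypernymOf.get(word);
        while (current != null) {
            if (current.equals(candidateHypernym)) {
                return true;
            }
            current = hypernymOf.get(current);
        }
        return false;
    }

    // Coordinate terms: two distinct concepts with the same direct hypernym.
    boolean areCoordinateTerms(String a, String b) {
        String parent = hypernymOf.get(a);
        return !a.equals(b) && parent != null && parent.equals(hypernymOf.get(b));
    }

    static RelationSketch example() {
        RelationSketch r = new RelationSketch();
        r.addHypernym("neem leaf", "leaf");
        r.addHypernym("wolf", "canine"); // assumed common hypernym
        r.addHypernym("dog", "canine");
        return r;
    }
}
```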

  9. Wordnet - Relations • Antonymy • opposites • fat is the antonym of thin • Gradation • Intermediate concepts in antonymy • morning -> noon -> evening

  10. Wordnet - Wordnets • PWN - Princeton WordNet for English language • EuroWordNet - Wordnet for European languages • HWN - Hindi Wordnet for Hindi language

  11. Hindi Wordnet • Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc. • Defines 8 part-whole relationships • Defines 3 types of antonymy relations • Gradable antonym (गर्म-ठंडा) • Complementary antonym (जीवित-मृत) • Converse antonym (लेना-देना)

  12. Hindi Wordnet • Gradation • Intermediate terms • Pre-Intermediate terms • Post-Intermediate terms • Eg. सूखा - शुष्क - नम - तर - गीला • 10 domains of interpretation. Eg. State, Size, Gender, etc.

  13. Hindi Wordnet - Verbs • Simple Verb - One root. Eg. खाना • Compound Verb - Made up of another POS. Eg. मीठा लगना • Combination Verb - Made of related two verbs. Eg. पढ़ना-लिखना • Onomatopoeic Verb - Eg. खटखटाना from खटखट • Conjunct Verb - Hidden sense of action. Eg. ले जाना

  14. Hindi Wordnet - Verbs • Causative verbs • First causative verb - Eg. सुलाना(to make somebody sleep) • Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)

  15. Hindi Wordnet - Creation Principles for Wordnet creation • Minimality - Minimal set. Eg. {घर, कमरा, कक्ष} • Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष} • Replaceability - Mutual replaceability in a context. Eg. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा

  16. Sanskrit Wordnet Concept-based Multilingual dictionary • Need • Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their counterparts अंधेरा and दुष्ट are not. • Number of lexicographers required - O(n²)
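One way to read the quadratic estimate, assuming one lexicographer (or team) is needed for every pair of languages that must be linked directly: the number of pairs among n languages grows quadratically. If every language is instead linked once to a shared concept inventory, the count grows only linearly, which is presumably the motivation for the concept-based design. The value n = 16 below is purely illustrative.

```latex
\[
  \text{direct pairwise linking: } \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2),
  \qquad
  \text{linking through shared concepts: } n = O(n)
\]
\[
  n = 16:\quad \frac{16 \cdot 15}{2} = 120 \text{ language pairs, versus } 16 \text{ links to the shared concepts}
\]
```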

  17. Sanskrit Wordnet - Concept based Multilingual dictionary

  18. Sanskrit Wordnet - Challenges Observed during construction of Marathi Wordnet: • Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना • Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिलामित्र • Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळत (less spicy) in Marathi

  19. Sanskrit Wordnet - Challenges Observed during Indo Wordnet workshop at Coimbatore, June 2009: • Varied usage across regions and people. Eg. In Kashmiri, the Muslim community uses separate words for drinking water and water, but the Hindu community uses one word. • Single-word and multi-word expressions in same language. Eg. In Nepali, मोह and मोह-माया both mean infatuation.

  20. Sanskrit Wordnet - Sanskrit • Indo-Aryan language • Hinduism • Buddhism • Classical Sanskrit - Panini • Vedic Sanskrit - pre-Classical

  21. Sanskrit Wordnet - Sanskrit Etymology • Etymology of Verbs • गण - Ten classes based on how stem is generated • इट् - Three groups based on position of tense marker • उपसर्ग - 22 prepositional particles that modify a root

  22. Synset Marking • Grouping of synsets based on frequency of occurrence and usage in language • Universal concepts • who and what • honesty

  23. SynsetMarker - Interface

  24. SynsetMarker - Features • Display of synset fields • Browsing • Search • Word • ID • Marking - Universal, Common, Common in Hindi and Uncommon • Save/Exit • Shortcuts

  25. SynsetMarker - API • records • DefineRecord • SynsetRecord • operations • SynsetOperator • RecordReader • RecordWriter • gui • Interface

  26. SynsetMarker - Process • First round divided among 6 people • 31000 synsets marked • Universal and Common clubbed - 15234 synsets • Common in Hindi - 6771 synsets • Uncommon - 10987 synsets • Second round voting schema • Common - 13205 synsets
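The slides do not spell out the second-round voting schema. As a sketch only, assuming a simple majority among the annotators' marks, the decision for one synset could look like the following. The `Mark` values mirror the four marking categories listed for SynsetMarker; the class `VotingSketch` and the majority rule itself are assumptions.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a majority vote over annotator marks for one synset.
class VotingSketch {
    enum Mark { UNIVERSAL, COMMON, COMMON_IN_HINDI, UNCOMMON }

    // Returns the mark chosen by strictly more than half of the voters,
    // or null when no mark reaches a majority (including an empty vote list).
    static Mark majority(List<Mark> votes) {
        Map<Mark, Integer> counts = new EnumMap<>(Mark.class);
        for (Mark vote : votes) {
            counts.merge(vote, 1, Integer::sum);
        }
        for (Map.Entry<Mark, Integer> entry : counts.entrySet()) {
            if (entry.getValue() * 2 > votes.size()) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Returning null for ties leaves room for a further review step, which is a design choice of this sketch rather than something the slides describe.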

  27. Core Synset Selection • Bharatiya Vyavahara Kosh • English and 15 Indian languages • 2000 concepts with domains • खेल (game), प्राणी (animal), फल (fruit) • Link synsets to words in Kosh • Polysemy • अनन्नास as pineapple fruit • अनन्नास as pineapple plant

  28. DomainClassifier - Interface

  29. DomainClassifier - Features • Display of synset fields • Browsing through records • Marking right synset for a word and a domain • Save/Export

  30. DomainClassifier - API • records • DefineRecord • SynsetRecord • operations • SynsetOperator • RecordReader • RecordWriter • gui • Interface

  31. DomainClassifier - Process • Groupings • Single IDs • Multiple IDs • No IDs • Rounds of marking • Common synsets • Common in Hindi synsets • Uncommon synsets
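The three groupings can be read as a split of Kosh words by how many candidate synset IDs they match: exactly one, several (polysemy, as with अनन्नास), or none. The sketch below illustrates that reading; it is an interpretation, and the class `GroupingSketch`, the transliterated words and the synset IDs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group words by the number of candidate synset IDs they match.
class GroupingSketch {
    enum Group { NO_IDS, SINGLE_ID, MULTIPLE_IDS }

    static Group groupOf(List<Integer> candidateIds) {
        if (candidateIds == null || candidateIds.isEmpty()) {
            return Group.NO_IDS;
        }
        return candidateIds.size() == 1 ? Group.SINGLE_ID : Group.MULTIPLE_IDS;
    }

    // candidates: word -> synset IDs whose synsets contain that word
    static Map<Group, List<String>> groupWords(Map<String, List<Integer>> candidates) {
        Map<Group, List<String>> groups = new EnumMap<>(Group.class);
        for (Group g : Group.values()) {
            groups.put(g, new ArrayList<>());
        }
        for (Map.Entry<String, List<Integer>> entry : candidates.entrySet()) {
            groups.get(groupOf(entry.getValue())).add(entry.getKey());
        }
        return groups;
    }
}
```

Under this reading, words in the single-ID group could be linked automatically, while the multiple-ID group is where a human picks the right synset for the word and domain.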

  32. DomainClassifier - Process • End of process • Core - 1969 synsets • Common - 11658 synsets

  33. Online SynsetMarker - Interface

  34. Online SynsetMarker - Interface

  35. Online SynsetMarker - API Written in PHP • login.php - Interface to login as a user or as an admin or to register as a new user • process.php - To process login/register data and accordingly direct a user • logout.php - To logout a user • mainprocess.php - Processing of data to display unmarked synset • main.php - Display of synset with buttons to mark as Common or Uncommon • admin.php - Admin page with statistical data of number of marked synsets per user and number of users based on synset marks • adminpassword.php - Password interface to login as admin • adminuserprofile.php - Profile data of a particular user

  36. Online SynsetMarker - Process • Threshold for dropping synset as Uncommon • Had to be set to 1 • Common - 10312 synsets
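A threshold of 1 can be read as: a synset is dropped as Uncommon as soon as a single user marks it Uncommon. The sketch below assumes that reading (dropped when the number of Uncommon marks is at least the threshold); the class `ThresholdSketch` and the synset IDs in the usage are hypothetical, and the original tool is written in PHP rather than Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of threshold-based dropping of synsets marked Uncommon.
class ThresholdSketch {
    // Assumed rule: drop when the Uncommon marks reach the threshold.
    static boolean isDropped(int uncommonVotes, int threshold) {
        return uncommonVotes >= threshold;
    }

    // Returns the IDs of synsets that survive, in ascending ID order.
    static List<Integer> keepCommon(Map<Integer, Integer> uncommonVotesById, int threshold) {
        List<Integer> kept = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : new TreeMap<>(uncommonVotesById).entrySet()) {
            if (!isDropped(entry.getValue(), threshold)) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

Raising the threshold keeps more synsets, so a threshold of 1 is the strictest non-trivial setting under this rule.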

  37. Sanskrit Wordnet Interface • Interface for creation of Sanskrit Wordnet • Based on idea of Concept-based Multilingual dictionary

  38. User Interface - Configure

  39. User Interface - Main

  40. User Interface - Panels • Help Panel: Buttons for Commenting, Synchronizing and References tool. • Search Panel: Search word or ID or perform advanced search. Font increase/decrease. • Synset Panels: Synset data fields and completion status. • Tool Panel: English synset, Link tool, Etymology tool. • Browse Panel: Browsing through records, saving and exiting.

  41. User Interface - Features - Reference tool

  42. User Interface - Features - Synchronize tool

  43. User Interface - Features - Advanced Search

  44. User Interface - Features - English synsets tool

  45. User Interface - Features - Link tool

  46. User Interface - Features - Etymology tool

  47. User Interface - Features - Keyboard Shortcuts • Undo feature - Monitor keyboard actions and undo on Ctrl-Z • Saving feature - Monitor change in field values and save on Ctrl-S • Search - Ctrl-F for quick search access
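The state tracking behind the undo and save shortcuts can be sketched independently of the GUI. The class below is a hypothetical model for a single text field: every change is pushed onto an undo stack, and the field counts as modified while its value differs from the last saved value. How the actual interface wires this to Ctrl-Z and Ctrl-S is not shown in the slides.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of one editable field with undo history and a "modified" flag.
class FieldHistory {
    private final Deque<String> undoStack = new ArrayDeque<>();
    private String value;
    private String savedValue;

    FieldHistory(String initialValue) {
        this.value = initialValue;
        this.savedValue = initialValue;
    }

    // Record a change; the previous value becomes available for undo.
    void edit(String newValue) {
        if (!newValue.equals(value)) {
            undoStack.push(value);
            value = newValue;
        }
    }

    // What a Ctrl-Z handler could call: restore the previous value, if any.
    boolean undo() {
        if (undoStack.isEmpty()) {
            return false;
        }
        value = undoStack.pop();
        return true;
    }

    // True while the current value differs from the last saved one.
    boolean isModified() {
        return !value.equals(savedValue);
    }

    // What a Ctrl-S handler could call after writing the record.
    void markSaved() {
        savedValue = value;
    }

    String value() {
        return value;
    }
}
```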

  48. Interface API Problems and Requirements • Huge volumes of data (eg. 30,000 synsets) • Links between different data • Efficient and user-friendly GUI • Sufficient querying • Grouping • Review separation

  49. Interface API

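  50. Graphical User Interface

      import javax.swing.JButton;

      private JButton saveButton = null;

      // Create the button on first access and reuse the same instance afterwards.
      public JButton getSaveButton() {
          if (saveButton == null) {
              saveButton = new JButton();
          }
          return saveButton;
      }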
