WordNet

WordNet, WSD

WordNet
• What is WordNet?
• Miller 95: “WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets.”

WordNet
• Go to the main WordNet site: http://wordnet.princeton.edu/
• Open the wordnet folder on pongo: ~/dropbox/570/wordnet/dict

WordNet Vocabulary
• See glossary at: http://wordnet.princeton.edu/gloss
• synset: a synonym set; a set of words that are interchangeable in some context
• lemma: lower case ASCII text of a word as found in the WordNet database index files
• lexical pointer: indicates a relation between words in synsets

Navigating WordNet files
• data.* files: the actual network files (synsets)
• index.* files: contain lower case instances of all words in WordNet, with pointers to the synset entries in the network
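As a rough illustration of how the two file types fit together: according to the wndb documentation, the 8-digit number that starts each data.* line is that line's byte offset within the file, so an offset taken from an index.* entry can be handed directly to a seek. The sketch below uses a small in-memory stand-in rather than a real data.noun; the helper name `read_synset_line`, the header line, and the toy entry are all made up for the example.

```python
import io


def read_synset_line(data_file, offset):
    """Jump to a byte offset in an open data.* file and return that line."""
    data_file.seek(offset)
    return data_file.readline().decode("ascii").rstrip("\n")


# Toy stand-in for data.noun (hypothetical content). Real files begin with
# license header lines, so the first synset does not sit at offset 0.
header = b"  1 toy header line standing in for the license text\n"
toy_offset = len(header)
entry = "{:08d} 04 n 01 performance 3 000 | any recognized accomplishment\n".format(toy_offset)
toy_data_file = io.BytesIO(header + entry.encode("ascii"))

line = read_synset_line(toy_data_file, toy_offset)
print(line)
```

With a real installation the same helper should work on a file opened in binary mode, for example `open("data.noun", "rb")` inside the dict folder, since the offsets count bytes.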

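WordNet data file
00045430 04 n 01 performance 3 003 @ 00033580 n 0000 ~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; "they admired his performance under stress"
00045680 04 n 01 overachievement 0 003 @ 00045430 n 0000 + 02537922 v 0101 ! 00045874 n 0101 | better than expected performance (better than might have been predicted from intelligence tests)
Fields in the first entry, left to right:
• Synset file offset: 00045430
• File number: 04
• Synset type: n
• # words in synset: 01
• word: performance (followed by a one-digit lexical ID, here 3)
• # pointers to other synsets: 003
• Each pointer: type of pointer (@), pointer (00033580, the file offset of the target synset), POS (n), then a four-digit source/target field (0000)
• The text after | is the gloss
• See: wndb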
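The field layout can be checked with a few lines of code. This is a minimal sketch that assumes the layout given in wndb (hexadecimal word count, decimal pointer count, four tokens per pointer); the function name `parse_data_line` and the dictionary keys are illustrative choices, and verb frames, which can follow the pointers in data.verb, are ignored.

```python
def parse_data_line(line):
    """Split one data.* line into its fields (layout as described in wndb)."""
    fields, _, gloss = line.partition("|")
    tok = fields.split()
    synset = {
        "offset": tok[0],
        "lex_filenum": tok[1],
        "ss_type": tok[2],
        "words": [],
        "pointers": [],
        "gloss": gloss.strip(),
    }
    w_cnt = int(tok[3], 16)  # the word count is written in hexadecimal
    i = 4
    for _ in range(w_cnt):
        # each word is followed by its one-digit hexadecimal lexical ID
        synset["words"].append((tok[i], int(tok[i + 1], 16)))
        i += 2
    p_cnt = int(tok[i])  # the pointer count is decimal
    i += 1
    for _ in range(p_cnt):
        symbol, target_offset, target_pos, source_target = tok[i:i + 4]
        synset["pointers"].append((symbol, target_offset, target_pos, source_target))
        i += 4
    return synset


example = (
    '00045430 04 n 01 performance 3 003 @ 00033580 n 0000 '
    '~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; '
    '"they admired his performance under stress"'
)
parsed = parse_data_line(example)
print(parsed["words"], parsed["pointers"])
```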

Pointer symbols
• For nouns:
  !   Antonym
  @   Hypernym
  ~   Hyponym
  #m  Member holonym
  #s  Substance holonym
  #p  Part holonym
  %m  Member meronym
  %s  Substance meronym
  %p  Part meronym
  =   Attribute
  +   Derivationally related form
• See: wninput
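For reading pointer fields by eye, a lookup table is often enough. The sketch below covers only the noun symbols listed on this slide (wninput documents more); the names `NOUN_POINTER_NAMES` and `describe_pointer` are invented for the example.

```python
# Noun pointer symbols as listed on the slide; not the full wninput inventory.
NOUN_POINTER_NAMES = {
    "!": "Antonym",
    "@": "Hypernym",
    "~": "Hyponym",
    "#m": "Member holonym",
    "#s": "Substance holonym",
    "#p": "Part holonym",
    "%m": "Member meronym",
    "%s": "Substance meronym",
    "%p": "Part meronym",
    "=": "Attribute",
    "+": "Derivationally related form",
}


def describe_pointer(symbol):
    """Return the relation name for a noun pointer symbol."""
    return NOUN_POINTER_NAMES.get(symbol, "unknown (" + symbol + ")")


# The three pointer symbols on the "overachievement" entry in data.noun.
described = [describe_pointer(s) for s in ["@", "+", "!"]]
print(described)
```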

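WordNet index file
abomination n 3 2 @ + 3 0 09613960 07401317 00734041
Fields, left to right:
• lemma (word): abomination
• POS: n
• # synsets: 3
• # pointers (distinct pointer types): 2
• pointers: @ +
• # synsets repeated (3), then the number of senses ranked by frequency in sense-tagged text (0); see wndb
• synset file offset, one per synset: 09613960 07401317 00734041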
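The same entry can be taken apart programmatically. This sketch assumes the index layout documented in wndb (lemma, POS, synset count, pointer count, pointer symbols, sense count, tagged-sense count, offsets); the function name `parse_index_line` and the dictionary keys are illustrative.

```python
def parse_index_line(line):
    """Split one index.* line into named fields (layout as described in wndb)."""
    tok = line.split()
    synset_cnt = int(tok[2])
    p_cnt = int(tok[3])
    after_ptrs = 4 + p_cnt  # position of the first token after the pointer symbols
    return {
        "lemma": tok[0],
        "pos": tok[1],
        "synset_cnt": synset_cnt,
        "ptr_symbols": tok[4:after_ptrs],
        "sense_cnt": int(tok[after_ptrs]),
        "tagsense_cnt": int(tok[after_ptrs + 1]),
        "synset_offsets": tok[after_ptrs + 2:after_ptrs + 2 + synset_cnt],
    }


index_entry = parse_index_line(
    "abomination n 3 2 @ + 3 0 09613960 07401317 00734041"
)
print(index_entry["synset_offsets"])
```

Each offset in `synset_offsets` is the position of one sense of the lemma in the matching data.* file.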

WordNet tools
• Many, many tools
• General documentation: http://wordnet.princeton.edu/doc
• Online query and lookup: http://wordnet.princeton.edu/perl/webwn
• APIs and tools: http://wordnet.princeton.edu/links
• WordNet::Similarity: http://wn-similarity.sourceforge.net/
• WordNet::Similarity web interface: http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

WordNet and WSD
• Mihalcea 2002 describes a system to sense-tag text using WordNet (and related tools and resources)

Mihalcea 2002
• Some tools and resources described:
  • Senseval
    • http://www.senseval.org/
    • Evaluation exercises for Word Sense Disambiguation
    • Senseval-1 through Senseval-3 held in the last several years, with workshops at ACL
    • Senseval-4 coming up
    • Data and materials from Senseval-3 can be downloaded
    • Some useful materials for multiple languages
      • Materials and test data for English, Italian, Basque, Catalan, Chinese, Romanian, and Spanish

Mihalcea 2002
• Some tools and resources described:
  • SemCor
    • Sense-tagged Brown corpus
    • Created at Princeton
    • Used for training WSD systems
    • Can be downloaded from Mihalcea’s web site: http://www.cs.unt.edu/~rada/downloads.html
    • We’re also planning on installing it on Pongo

McCarthy et al 2004
• Task: find the predominant word senses in untagged text
• Unlike Mihalcea 2002, did not rely on a supervised method using SemCor
• Built a thesaurus from raw text and WordNet
• Intuition: a word's predominant sense is affected by genre, domain, or text type, so it is better determined from context in an untagged corpus
• This avoids relying on SemCor’s 250,000 words, where coverage of word senses is rather limited

McCarthy et al
• Thesaurus development relies on dependencies between “neighbors”
• Look at distributional similarities between a word and its neighbors

McCarthy et al
• Experimented with several similarity measures available in WordNet::Similarity
• First experiment used SemCor to see how well the unsupervised system worked
• 2595 polysemous nouns in SemCor
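The slides do not spell out how the thesaurus neighbors and the WordNet similarity measure are combined. As commonly summarized, the paper ranks each sense by summing, over the word's neighbors, the distributional similarity to the neighbor weighted by that sense's share of the WordNet similarity to the neighbor. The sketch below follows that reading and should be treated as an approximation, not the authors' implementation; the function name `prevalence_scores`, the sense labels, and every number are invented toy values.

```python
def prevalence_scores(neighbor_dss, wn_sim):
    """Score each sense of a word from its distributional neighbors.

    neighbor_dss: {neighbor: distributional similarity to the target word}
    wn_sim: {sense: {neighbor: WordNet similarity between sense and neighbor}}
    """
    senses = list(wn_sim)
    scores = {sense: 0.0 for sense in senses}
    for neighbor, dss in neighbor_dss.items():
        total = sum(wn_sim[s].get(neighbor, 0.0) for s in senses)
        if total == 0:
            continue  # this neighbor says nothing about any sense
        for s in senses:
            # each neighbor hands out its weight in proportion to sense similarity
            scores[s] += dss * wn_sim[s].get(neighbor, 0.0) / total
    return scores


# Toy values (hypothetical) for the noun "star".
neighbor_dss = {"actor": 0.30, "planet": 0.20, "celebrity": 0.25}
wn_sim = {
    "star/celestial": {"actor": 0.1, "planet": 0.8, "celebrity": 0.1},
    "star/performer": {"actor": 0.9, "planet": 0.1, "celebrity": 0.8},
}
scores = prevalence_scores(neighbor_dss, wn_sim)
predominant = max(scores, key=scores.get)
print(predominant, scores)
```

Because each neighbor's weight is split across the senses, the scores add up to the total distributional similarity of the neighbors that were used, which makes the ranking easy to sanity-check.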

McCarthy et al
• Experiment #2 against SENSEVAL-2 English All Words Data
• Comparison between the precision and recall for SemCor vs. their automatic data (and the SENSEVAL ceiling)
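For reference, precision and recall in this kind of evaluation are usually computed as sketched below, assuming the common Senseval-style convention that precision is measured over the instances the system attempted and recall over all instances. The instance ids and sense labels are made up, and `precision_recall` is an illustrative helper, not part of any Senseval scorer.

```python
def precision_recall(gold, predicted):
    """gold: {instance_id: sense}; predicted: {instance_id: sense or None}."""
    attempted = [i for i in gold if predicted.get(i) is not None]
    correct = sum(1 for i in attempted if predicted[i] == gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): four test instances, the system skips one.
gold = {"d1.w1": "bank%1", "d1.w2": "star%2", "d1.w3": "plant%1", "d1.w4": "pipe%3"}
predicted = {"d1.w1": "bank%1", "d1.w2": "star%1", "d1.w3": "plant%1", "d1.w4": None}
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

A system that always assigns the predominant sense attempts every instance it has a ranking for, so the gap between its precision and recall reflects words that were left unranked.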

McCarthy et al
• Some experiments with domain-specific corpora gave these results: [results table not included]
