WordNet WordNet, WSD
WordNet • What is WordNet? • Miller 95: “WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets.”
WordNet • Go to the main WordNet site: http://wordnet.princeton.edu/ • Open the wordnet folder on pongo: ~/dropbox/570/wordnet/dict
WordNet Vocabulary • See glossary at: http://wordnet.princeton.edu/gloss • synset: A synonym set; a set of words that are interchangeable in some context • lemma: lower case ASCII text of word as found in the WordNet database index files • lexical pointer: A lexical pointer indicates a relation between words in synsets
Navigating WordNet files • data.* files – the actual network files (synsets) • index.* files – contain lower case instances of all words in WordNet, with pointers to the synset entries in the network
WordNet data file
00045430 04 n 01 performance 3 003 @ 00033580 n 0000 ~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; "they admired his performance under stress"
00045680 04 n 01 overachievement 0 003 @ 00045430 n 0000 + 02537922 v 0101 ! 00045874 n 0101 | better than expected performance (better than might have been predicted from intelligence tests)
Fields, in order (values from the first record): • Synset file offset: 00045430 • File number: 04 • Synset type: n • # words in synset: 01 • word: performance • # pointers to other synsets: 003 • Type of pointer: @ • Pointer (target synset offset): 00033580 • POS: n • See: wndb
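The field layout above can be read mechanically. The following is a minimal Python sketch, assuming the noun layout documented in wndb (hexadecimal word count, decimal pointer count, four fields per pointer). The names `Pointer`, `Synset`, and `parse_data_line` are hypothetical and not part of any WordNet tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pointer:
    symbol: str         # pointer type, e.g. "@" for hypernym
    offset: str         # synset file offset of the target synset
    pos: str            # part of speech of the target synset
    source_target: str  # four hex digits; "0000" marks a synset-level pointer


@dataclass
class Synset:
    offset: str
    lex_filenum: str
    ss_type: str
    words: List[str]
    pointers: List[Pointer]
    gloss: str


def parse_data_line(line: str) -> Synset:
    """Parse one synset record (noun layout; verb frames are not handled)."""
    fields_part, _, gloss = line.partition("|")
    f = fields_part.split()
    offset, lex_filenum, ss_type = f[0], f[1], f[2]
    w_cnt = int(f[3], 16)  # word count is written in hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append(f[i])  # f[i + 1] is the lex_id, skipped in this sketch
        i += 2
    p_cnt = int(f[i])  # pointer count is written in decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        pointers.append(Pointer(*f[i:i + 4]))
        i += 4
    return Synset(offset, lex_filenum, ss_type, words, pointers, gloss.strip())


line = ('00045430 04 n 01 performance 3 003 @ 00033580 n 0000 '
        '~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment')
synset = parse_data_line(line)
print(synset.words, [(p.symbol, p.offset) for p in synset.pointers])
```

Offsets are kept as strings here so that the zero padding survives and they can be compared directly with the offsets listed in the index files.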
Pointer symbols • For nouns: • ! Antonym • @ Hypernym • ~ Hyponym • #m Member holonym • #s Substance holonym • #p Part holonym • %m Member meronym • %s Substance meronym • %p Part meronym • = Attribute • + Derivationally related form • See: wninput
WordNet index file
abomination n 3 2 @ + 3 0 09613960 07401317 00734041
Fields, in order: • lemma (word): abomination • POS: n • # synsets: 3 • # pointers: 2 • pointers: @ + • synset file offsets: 09613960 07401317 00734041
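An index entry can be split the same way, and each synset file offset is a byte position in the matching data.* file, so a record can be fetched with a seek. The sketch below assumes the index layout documented in wndb, where the two numbers after the pointer symbols are the sense count and the tagged-sense count. The function names and the in-memory stand-in for data.noun are hypothetical.

```python
import io
from typing import BinaryIO, Dict, List, Union


def parse_index_line(line: str) -> Dict[str, Union[str, int, List[str]]]:
    """Split one index.* entry into named fields."""
    f = line.split()
    synset_cnt = int(f[2])
    p_cnt = int(f[3])
    after_ptrs = 4 + p_cnt  # first field after the pointer symbols
    return {
        "lemma": f[0],
        "pos": f[1],
        "synset_cnt": synset_cnt,
        "ptr_symbols": f[4:after_ptrs],
        "sense_cnt": int(f[after_ptrs]),
        "tagsense_cnt": int(f[after_ptrs + 1]),
        "offsets": f[after_ptrs + 2:after_ptrs + 2 + synset_cnt],
    }


def read_synset_line(data_file: BinaryIO, offset: str) -> str:
    """Jump to a synset file offset (a byte position) and read one record."""
    data_file.seek(int(offset))
    return data_file.readline().decode("utf-8").rstrip("\n")


entry = parse_index_line("abomination n 3 2 @ + 3 0 09613960 07401317 00734041")
print(entry["lemma"], entry["ptr_symbols"], entry["offsets"])

# Toy stand-in for an open data.noun file: one filler line, then one record.
filler = b"filler line standing in for the file header\n"
toy_file = io.BytesIO(filler + b"toy synset record\n")
print(read_synset_line(toy_file, f"{len(filler):08d}"))
```

With the real files, `data_file` would be something like `open("data.noun", "rb")`. Binary mode matters because the offsets count bytes, not characters.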
WordNet tools • Many, many tools • General documentation: http://wordnet.princeton.edu/doc • Online query and lookup: http://wordnet.princeton.edu/perl/webwn • APIs and tools: http://wordnet.princeton.edu/links • WordNet::Similarity: http://wn-similarity.sourceforge.net/ • WordNet::Similarity web interface: http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
WordNet and WSD • Mihalcea 2002 describes a system for sense-tagging text using WordNet (and related tools and resources)
Mihalcea 2002 • Some tools and resources described: • Senseval • http://www.senseval.org/ • Evaluation exercises for Word Sense Disambiguation • Senseval-1 through 3, held in the last several years, workshops at ACL • Senseval-4 coming up • Data and materials from Senseval-3 can be downloaded • Some useful materials for multiple languages • Materials and test data for English, Italian, Basque, Catalan, Chinese, Romanian, and Spanish
Mihalcea 2002 • Some tools and resources described: • SemCor • Sense-tagged Brown corpus • Created at Princeton • Used for training WSD systems • Can be downloaded from Mihalcea’s web site: http://www.cs.unt.edu/~rada/downloads.html • We’re also planning on installing it on Pongo
McCarthy et al 2004 • Task: find the predominant word senses in untagged text • Unlike Mihalcea 2002, did not rely on a supervised method using SemCor • Built a thesaurus from raw text and WordNet • Intuition: a word's predominant sense is affected by genre, domain, or text type, so it is more likely to be determined correctly from context in an untagged corpus • Rather than relying on SemCor’s 250,000 words, where the word senses are rather limited
McCarthy et al • Thesaurus development relies on dependencies between “neighbors” • Look at distributional similarities between a word and its neighbors
McCarthy et al • Experimented with several similarity measures available in WordNet::Similarity • First experiment used SemCor to see how well the unsupervised system worked • 2595 polysemous nouns in SemCor
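One way to picture how the thesaurus neighbors and a WordNet similarity measure combine into a sense ranking is sketched below. It assumes each neighbor contributes its distributional similarity score, split across the target word's senses in proportion to each sense's best WordNet similarity to that neighbor. This is an illustration of the idea, not the authors' code. The function `rank_senses`, the sense labels, and all numbers are hypothetical toy values.

```python
from typing import Callable, Dict, List


def rank_senses(
    senses: List[str],
    neighbours: Dict[str, float],
    neighbour_senses: Dict[str, List[str]],
    wn_sim: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each sense of a target word from its distributional neighbours."""
    scores = {s: 0.0 for s in senses}
    for n, dss in neighbours.items():  # dss: distributional similarity to n
        # Best WordNet similarity between each target sense and any sense of n.
        best = {s: max(wn_sim(s, ns) for ns in neighbour_senses[n]) for s in senses}
        total = sum(best.values())
        if total == 0:
            continue  # this neighbour says nothing about any sense
        for s in senses:
            scores[s] += dss * best[s] / total
    return scores


# Toy similarity table (hypothetical values, not WordNet::Similarity output).
toy_sim = {
    ("pipe#tube", "tube#1"): 0.9, ("pipe#tobacco", "tube#1"): 0.1,
    ("pipe#tube", "cigar#1"): 0.2, ("pipe#tobacco", "cigar#1"): 0.8,
}
scores = rank_senses(
    senses=["pipe#tube", "pipe#tobacco"],
    neighbours={"tube": 0.4, "cigar": 0.1},
    neighbour_senses={"tube": ["tube#1"], "cigar": ["cigar#1"]},
    wn_sim=lambda a, b: toy_sim.get((a, b), 0.0),
)
print(max(scores, key=scores.get), scores)
```

Because the weights are normalized per neighbor, swapping in a different WordNet similarity measure only changes how each neighbor's score is divided among the senses, not how much weight the neighbor carries overall.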
McCarthy et al • Experiment #2 against SENSEVAL-2 English All Words Data • Comparison between the precision and recall for SemCor vs. their automatic data (and the SENSEVAL ceiling)
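For reference, precision and recall in this kind of evaluation are usually computed as correct answers over attempted instances and correct answers over all instances, respectively. The sketch below assumes that convention. The function name, instance ids, and sense labels are hypothetical toy data, not SENSEVAL-2 results.

```python
from typing import Dict, Optional, Tuple


def precision_recall(
    gold: Dict[str, str], predicted: Dict[str, Optional[str]]
) -> Tuple[float, float]:
    """Precision = correct / attempted; recall = correct / all gold instances."""
    attempted = [k for k in gold if predicted.get(k) is not None]
    correct = sum(1 for k in attempted if predicted[k] == gold[k])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy gold standard and system output; the system skips instance d4.
gold = {"d1": "bank#1", "d2": "bank#2", "d3": "pipe#1", "d4": "pipe#2"}
predicted = {"d1": "bank#1", "d2": "bank#1", "d3": "pipe#1", "d4": None}
p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

A system that answers every instance has equal precision and recall. The two diverge only when some instances are left untagged.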
McCarthy et al • Some experiments with domain specific corpora gave these results: