The WordNet Lexical Database Bernardo Magnini ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica Trento - Italy
Outline • WordNet: introduction • Extending WordNet • Languages other than English • New information • WordNet as a (linguistic) ontology • Using WordNet • Word sense disambiguation • Information Retrieval/ Question Answering • Semantic Web
WordNet • Electronic lexical database for the English language, developed at Princeton University by George Miller’s team • Based on psycholinguistic theories • Several releases: from version 1.0 in 1991 to version 1.7.1 in 2001 • WordNet 2 (??) • WordNet is a public domain resource: http://www.cogsci.princeton.edu/~wn/ • Reference: Fellbaum C. (Ed.): WordNet, an Electronic Lexical Database, MIT Press, 1998 • Global WordNet Association (GWA) • Conference, workshops
Lexical Matrix
Word meanings (rows) against word forms (columns):
        F1      F2      F3      …       Fn
M1      E1,1    E1,2
M2              E2,2
M3                      E3,3
…
Mm                                      Em,n
• Mappings between word forms and meanings are many:many • F1 and F2 are synonyms • F2 is polysemous
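The lexical matrix can be sketched as a set of (meaning, form) entries. The following is a minimal illustration, assuming toy identifiers (M1, F1, …) rather than real WordNet data; the function names are hypothetical. Synonymy is read along a row and polysemy down a column.

```python
from collections import defaultdict

# Toy lexical matrix: each entry (meaning, form) says that the word form
# can express the meaning. The identifiers are illustrative, not WordNet data.
ENTRIES = [
    ("M1", "F1"),
    ("M1", "F2"),
    ("M2", "F2"),
    ("M3", "F3"),
]


def build_matrix(entries):
    """Index the matrix both by meaning (rows) and by form (columns)."""
    forms_by_meaning = defaultdict(set)
    meanings_by_form = defaultdict(set)
    for meaning, form in entries:
        forms_by_meaning[meaning].add(form)
        meanings_by_form[form].add(meaning)
    return forms_by_meaning, meanings_by_form


def synonyms(form, forms_by_meaning, meanings_by_form):
    """Forms that share at least one meaning with `form`."""
    result = set()
    for meaning in meanings_by_form.get(form, ()):
        result |= forms_by_meaning[meaning]
    result.discard(form)
    return result


def is_polysemous(form, meanings_by_form):
    """A form is polysemous when it expresses more than one meaning."""
    return len(meanings_by_form.get(form, ())) > 1


forms_by_meaning, meanings_by_form = build_matrix(ENTRIES)
```

With these entries, F1 and F2 come out as synonyms because both express M1, and F2 is polysemous because it also expresses M2.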
Basic Primitives • Word forms: lexical items in a language (i.e. no artificial concepts), including collocations • Senses: a sense is one meaning of a word form • Synsets: a synset is a set of synonymous senses • Relations: • Lexical: among senses • Semantic: among synsets
Lexical Relations • Synonymy • Two expressions are synonymous if the substitution of one for the other does not alter the truth value of the sentence (Leibniz) • => substitution is only possible within the same syntactic category, hence the need to partition WordNet into nouns, verbs, adjectives, and adverbs • Antonymy, e.g. [rich/poor] [rise/fall] • The antonym of a word x is sometimes not-x, but not always: not rich ≠> poor • Main organization principle for the adjectives
Semantic relations (1) • Hyponymy/Hyperonymy (the ISA relation) A synset {x1, x2, … } is a hyponym of the synset {y1, y2, …} if native speakers accept sentences such as "An x is a (kind of) y" • Transitive and asymmetrical • The noun hierarchy is a graph rather than a tree, even if synsets normally have a single hyperonym • Main organization principle of nouns
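Transitivity of the ISA relation can be illustrated by walking the hyperonym links upward. This is a minimal sketch on a toy graph; the synset names and the helper functions are assumptions made for illustration and do not reproduce WordNet's actual hierarchy.

```python
# Toy hypernym (IS-A) graph: synset -> direct hypernyms.
# Synset names are illustrative and do not reproduce WordNet's hierarchy.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],  # a synset with two hypernyms
    "canine": ["carnivore"],
    "carnivore": ["mammal"],
    "domestic_animal": ["animal"],
    "mammal": ["animal"],
    "animal": [],
}


def all_hypernyms(synset, graph):
    """Return every synset reachable by following IS-A links upward."""
    seen = set()
    stack = list(graph.get(synset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def is_a(hyponym, hypernym, graph):
    """True if `hyponym` IS-A `hypernym` through any chain of links."""
    return hypernym in all_hypernyms(hyponym, graph)
```

Because the structure is a graph, the traversal keeps a `seen` set so that a node reachable along two paths (here `animal`) is visited only once.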
Semantic relations (2) • Meronymy/Holonymy (the Part-Of relation) A synset {x1, x2, … } is a meronym of the synset {y1, y2, …} if native speakers accept sentences such as "An x is a part of y" or "A y has an x (as a part)" • Meronymy is transitive and asymmetrical and can be used to construct a part hierarchy
Semantic relations (3) • Semantic relations specific to the verb hierarchy • Troponym: a verb expressing a specific manner elaboration of another verb (e.g. walk → move). X is a troponym of Y if to X is to Y in some manner, i.e. X-ing is a particular way of Y-ing • Entailment: a verb X entails Y if X cannot be done unless Y is or has been done (e.g. snore → sleep)
SemCor • English, part of the Brown Corpus • 700,000 running words, annotated with Part of Speech • 200,000 words annotated with WordNet senses (and lemmas)
WordNet Extensions • Computational needs: • WordNets for languages other than English • New semantic relations • WordNet as an Ontology • Domain specific wordnets • Automatic acquisition of information • Interchange formats
Languages other than English • EuroWordNet project: monolingual wordnets are connected through an Interlingual Index (ILI) – Distributed by ELDA/ELRA • Italian, Spanish, Catalan, Basque, French, Estonian, Portuguese, Swedish, Dutch, German • BalkaNet Project: Bulgarian, Greek, Romanian, Slovenian • Danish, Hebrew • Chinese, some Indian languages • Lexical gaps
New Relations (1) • Derivation relations (Princeton – WordNet-2) • invent → inventor (need of disambiguation) • Gloss disambiguation (Extended WordNet – Moldovan 2000) • Glosses are parsed, disambiguated and converted into a logical form • WordNet Domains (Magnini, Cavaglia, 2000) (ITC-irst) • Synsets are labeled with domains, such as Medicine, Architecture, Sport, …
WordNet Domains • Integrate taxonomic and domain oriented information • Cross hierarchy relations • doctor#2 [Medicine] --> person#1 • hospital#1 [Medicine] --> location#1 • Cross category relations: operate#3 [Medicine] • Cross language information
New Relations (2) • Classes versus Instances: • Bush <belong-to-class> person • Role relations for verbs: • singer <role-agent> song • Implicit knowledge (Peters, 2002) • Discover regular polysemy relations in WordNet: bank#1 (an institution) and bank#2 (a building)
Automatic Acquisition • MEANING project (IST-2001-34460) • Topic Signatures (Agirre, 2001) • Synset-related words automatically extracted from the Web • Automatic collection of sense examples (Leacock et al. 98, Mihalcea and Moldovan 99) • Synset Selectional Preferences (Carroll, 2001) • From the BNC corpus • WordNet annotated corpora • Open Mind Word Expert (Mihalcea, 2002)
WordNet as an Ontology • Some relations contradict ontological principles • OntoClean approach (Guarino, 2002): • Confusion between concepts and individuals (e.g. Palestine and Trust_Territories at the same level) • Role/Type: a role cannot subsume a type (e.g. Person <isa> Causal_agent)
Domain Specific WordNets • Extension of WordNet hierarchies using domain-specific document collections (Vossen, 2001) (Buitelaar, 2001) (Velardi, 2001) • Tuning of WordNet synsets (Turcato, 2000) • Merging generic and specialized wordnets (Magnini et al. 2002): • Overlaps and inconsistencies among synsets • Precedence rules for inheritance
Interchange Formats • XML: • Implementation independent • Easily extensible to new relations • There are at least three different versions; none of them is widely used yet • Mappings among different wordnet versions: • 1.5 → 1.6 • 1.6 → 1.7 • May contain errors
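Chaining two version mappings (1.5 → 1.6, then 1.6 → 1.7) can be sketched as a simple composition. This is an illustration only: the synset identifiers below are hypothetical, real mapping files use their own formats, and the `compose` helper is an assumption, not part of any WordNet distribution.

```python
# Hypothetical synset identifiers; real mapping files use their own formats.
MAP_15_TO_16 = {"n#0001": "n#0101", "n#0002": "n#0102", "n#0003": "n#0103"}
MAP_16_TO_17 = {"n#0101": "n#0201", "n#0102": "n#0202"}


def compose(first, second):
    """Chain two mappings, keeping only synsets mapped by both steps.

    Synsets that get lost in the second step are returned separately so
    that they can be checked by hand, since mappings may contain errors.
    """
    composed = {}
    unmapped = []
    for source, middle in first.items():
        if middle in second:
            composed[source] = second[middle]
        else:
            unmapped.append(source)
    return composed, unmapped


MAP_15_TO_17, LOST = compose(MAP_15_TO_16, MAP_16_TO_17)
```

Reporting the unmapped synsets instead of silently dropping them makes gaps in the chained mapping visible.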
Using WordNet • Widely used within the Natural Language Processing community • Suitable for open-domain, content-based tasks where interpretation based on lexical semantics is required • Algorithms: take advantage of the wordnet semantic relations • Issues: fine-grained sense distinctions • Application areas: Query expansion in IR, Word Sense Disambiguation, Question Answering
Distance/Similarity Algorithms • Conceptual distance (Agirre-Rigau, 1995) • Considers the density of the taxonomy • Semantic similarity (Resnik, 1995) • Based on the node with the highest information content among those connecting two nodes: Sim(c1, c2) = max [-log p(c)], where c is a node on an is-a path connecting c1 and c2 (i.e. c subsumes both), and p(c) is a probability estimated from the occurrences of c in a corpus.
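The similarity measure Sim(c1, c2) = max [-log p(c)] can be sketched on a toy taxonomy. The taxonomy and the corpus counts below are invented for illustration. The sketch assumes that p(c) counts occurrences of c and of every concept below it, which is how Resnik's information content is usually estimated; the function names are hypothetical.

```python
import math

# Toy IS-A taxonomy (child -> parent) and corpus counts; both are invented
# for illustration and are not taken from WordNet or any real corpus.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "bird": "animal",
    "animal": None,
}
COUNTS = {"dog": 30, "cat": 20, "mammal": 10, "bird": 25, "animal": 15}


def ancestors(concept):
    """The concept itself plus every node above it on the IS-A path."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def probability(concept):
    """p(c): occurrences of c or of any concept it subsumes, over the total."""
    total = sum(COUNTS.values())
    subsumed = sum(n for word, n in COUNTS.items() if concept in ancestors(word))
    return subsumed / total


def information_content(concept):
    """-log p(c): rarer (more specific) concepts carry more information."""
    return -math.log(probability(concept))


def resnik_similarity(c1, c2):
    """Highest information content among the concepts subsuming c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)
```

In this toy data, dog and cat share the subsumer mammal with p = 0.6, so their similarity is -log 0.6, while dog and bird only share the root, whose probability is 1 and whose information content is therefore zero.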
Sense Distinctions • WordNet contains sense distinctions that are difficult to understand • Many applications would benefit from polysemy reduction • Sense clustering methodologies: • Based on domain information • Based on aligned corpora in different languages
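Sense clustering based on domain information can be sketched by grouping the senses of a word that carry the same domain label. The senses and labels below are hypothetical, written in the style of WordNet Domains; they are not the resource's actual annotations, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical senses of "bank" with invented domain labels, in the style
# of WordNet Domains; these are not the resource's actual annotations.
SENSES = {
    "bank#1": "Economy",    # financial institution
    "bank#2": "Economy",    # building of a financial institution
    "bank#3": "Geography",  # sloping land beside water
}


def cluster_by_domain(senses):
    """Group senses sharing a domain label into one coarse sense."""
    clusters = defaultdict(list)
    for sense, domain in senses.items():
        clusters[domain].append(sense)
    return dict(clusters)


CLUSTERS = cluster_by_domain(SENSES)
```

Here three fine-grained senses collapse into two coarse ones, which is the kind of polysemy reduction an application may prefer.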
WordNet and Word Sense Disambiguation • As a sense repository • For the SENSEVAL competition • Manually annotated data are required for training systems based on machine learning algorithms • As an information source for knowledge-based algorithms
IR: Query Expansion • Open debate: • Semantic information is not useful (Voorhees, 1994) • WSD with accuracy below 90% degrades IR results (Sanderson, 1994); current WSD systems perform below 80% • Semantic information significantly increases IR performance (up to 30%) (Gonzalo, 1998) • Recent experiments (de Loupy, 2002) show that using synonyms and WSD (72% accuracy) in query expansion slightly (2-3%) improves performance
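Query expansion with synonyms after a WSD step can be sketched as follows. The synsets are toy data invented for illustration, the sense choice is passed in by hand instead of coming from a real WSD system, and `expand_query` is a hypothetical helper rather than the method of any of the cited experiments.

```python
# Toy synonym sets keyed by sense; invented for illustration, not WordNet data.
SYNSETS = {
    "car#1": {"car", "auto", "automobile"},
    "car#2": {"car", "railcar", "railway car"},
}


def expand_query(terms, chosen_senses):
    """Add the synonyms of each disambiguated term to the query.

    `chosen_senses` maps a term to the sense picked by a WSD step;
    terms without a chosen sense are kept unexpanded.
    """
    expanded = []
    for term in terms:
        expanded.append(term)
        sense = chosen_senses.get(term)
        if sense is not None:
            # Sorted only to make the output order deterministic.
            for synonym in sorted(SYNSETS[sense] - {term}):
                expanded.append(synonym)
    return expanded


QUERY = expand_query(["car", "rental"], {"car": "car#1"})
```

A wrong sense choice would add the synonyms of the wrong synset (here the railway sense), which is one way to see why WSD accuracy matters in the debate above.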
WordNet in Question Answering • Answer type identification (Harabagiu, 2001: top score at TREC-QA-2000) • Answer types defined on the WordNet taxonomy • Answer extraction • Named entity recognition based on WordNet • Question/answer relation discovery in passage retrieval (Pasca, 2001)
Semantic Web • Interpreting semi-structured knowledge sources • Directories, file systems, catalogues • Implicit knowledge • Linguistic analysis of labels based on WordNet
Conclusions • WordNet as a linguistic ontology • Using WordNet, as it is, in application tasks is not easy: “The art of using WordNet” • Extensions, such as domains, multilingual wordnets, etc., are required • Results in IR, QA, and WSD are still preliminary • Good news: a growing community is using WordNet