Building Wordnets Piek Vossen, Irion Technologies Piek.Vossen@irion.nl
Overview • Starting points • Semantic framework • Process overview • Methodologies in other projects • Multilinguality
Starting points • Purpose of the wordnet database: • education, science, applications • formal ontology or linguistic ontology • making inferences or lexical substitution • conceptual density or large coverage • Distributed development • Reproducibility • Available resources • Language-specific features • (Cross-language) compatibility • Exploit community resources by projecting conceptual relations on a target wordnet
Differences in wordnet structures [Figure: two hierarchies side by side. Wordnet1.5 nodes: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); instrumentality; block; body; box; spoon; bag; device; implement; container; tool; instrument. Dutch Wordnet nodes: voorwerp {object}; blok {block}; lichaam {body}; werktuig {tool}; bak {box}; lepel {spoon}; tas {bag}.] • Artificial classes versus lexicalized classes: instrumentality; natural object • Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch
Linguistic versus conceptual ontologies • Conceptual ontology: • A particular level or structuring may be required to achieve a better control or performance, or a more compact and coherent structure. • Introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool). • Neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise). • What properties can we infer for spoons? • spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking • Linguistic ontology: • Exactly reflects the relations between all the lexicalized words and expressions in a language. • Valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language. • What words can be used to name spoons? • spoon -> object, tableware, silverware, merchandise, cutlery, ...
Wordnets as Linguistic Ontologies Main purpose is to predict what words can be used as substitutes in language, considering all the lexicalized words in a language. • Classical Substitution Principle: • Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms: • horse -> stallion, mare, pony, mammal, animal, being. • It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms: • horse -/-> cat, dog, camel, fish, plant, person, object. • Conceptual Distance Measurement: • Number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
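The substitution principle and the distance measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular wordnet: the toy hierarchy is invented, words stand in for synsets (so synonyms are not modelled), and the distance simply counts edges through the lowest shared hyperonym without weighting for level or local density.

```python
# Toy hyperonym links (child -> parent). Illustrative data only,
# not taken from any real wordnet.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being",
}


def ancestors(word):
    """Return the chain of hyperonyms from the word up to the root."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain


def can_substitute(word, candidate):
    """True if candidate is the word itself, a hyperonym or a hyponym of it."""
    return (candidate == word
            or candidate in ancestors(word)
            or word in ancestors(candidate))


def distance(a, b):
    """Count the edges between a and b through their lowest shared hyperonym.

    Returns None when the two words share no node in the hierarchy.
    """
    path_a = [a] + ancestors(a)
    path_b = [b] + ancestors(b)
    for steps_up, node in enumerate(path_a):
        if node in path_b:
            return steps_up + path_b.index(node)
    return None
```

With this data, can_substitute("horse", "animal") holds, while can_substitute("horse", "cat") does not, because cat is a co-hyponym of horse.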
Define a semantic framework • Definition of relations • Diagnostic frames (Cruse 1986) • Examples and corpus data • Top-level ontology • Constraints on relations • Type consistency • Large scale validation
Techniques • Manual encoding and verification • Automatic extraction: • definitions • synonyms • distribution and similarity patterns in corpora • defining contexts, e.g. “cats and other pets” • parallel corpora, e.g. bible translations • morphological structure • bilingual dictionaries • Encode source and status of data: • who, when, based on what algorithm, validated, final
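One way to encode the source and status of data is to attach a small provenance record to every relation. The sketch below is an assumption about how such a record could look: the class names, field names and the extra "proposed" status are hypothetical and do not come from any wordnet tool.

```python
from dataclasses import dataclass
from datetime import date

# "validated" and "final" come from the slide; "proposed" is an assumed
# initial status for data that nobody has checked yet.
STATUSES = ("proposed", "validated", "final")


@dataclass
class Provenance:
    who: str          # editor or program that produced the data
    when: date        # date of encoding
    method: str       # e.g. "manual" or the name of an extraction algorithm
    status: str = "proposed"

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


@dataclass
class Relation:
    source: str
    relation: str
    target: str
    provenance: Provenance


def needs_review(relations):
    """Return the relations that have not been validated yet."""
    return [r for r in relations if r.provenance.status == "proposed"]
```

Keeping the record on each relation, rather than on the whole file, makes it possible to validate automatically extracted data separately from manually encoded data.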
Encoding cycle • 1. Collecting data • Vocabulary: what is the list of words of a language? • Concepts: what is the list of concepts related to the vocabulary? • 2. Encoding data: • Defining synsets • Defining language internal relations: hyponymy, meronymy roles, causal relations • Defining equivalence relations to English • Defining other relations, e.g. ontology types, domains • 3. Validation • 4. Go to 1.
Where to start? • How to get a first selection: • Words (alphabetic, frequency) -> concepts -> relations • Concept (hyperonym, domain, semantic feature) -> words -> concepts -> relations • How to get a complete overview of words and expressions that belong to a segment of a wordnet? • Up to 20 hyperonyms for instrumentality: instrument, instrumentality, means, tool, device, machine, apparatus, .... • Iterative process: collect, structure, collect, restructure... • using multiple sources of evidence • comparing results, e.g. tricycle is a toy or a vehicle
Synonymy as a basis? • Synsets are the core unit of a wordnet database • Synonymy is only vaguely defined: substitution in a context. • Synonyms are very hard to detect • Other relations (role relations, causal relations): • easier to detect and encode • easier to validate within a formal framework • easier to validate in a corpus • A rich set of relations per concept helps alignment with other resources
Diagnostic frames and examples Agent Involvement (A/an) X is the one/that who/which does the Y, typically intentionally. Conditions: - X is a noun - Y is a verb in the gerundive form Example: A teacher is the one who does the teaching intentionally Effect: {to teach} (Y) INVOLVED_AGENT {teacher} (X) Patient Involvement (A/an) X is the one/that who/which undergoes the Y Conditions: - X is a noun - Y is a verb in the gerundive form Example: A learner is the one who undergoes the learning Effect: {to learn} (Y) INVOLVED_PATIENT {learner} (X)
Diagnostic frames and examples Result Involvement (A/an) X comes into existence as a result of Y, where X is a noun and Y is a verb in the gerundive form and a hyponym of “make”, “produce”, “generate”. Examples: A crystal comes into existence as a result of crystalizing; A crystal is the result of crystalizing; A crystal is created by crystalizing. Effect: {to crystalize} (Y) INVOLVED_RESULT {crystal} (X) Comments: • Special kind of patient relation. The entity is not just changed or affected but it comes into existence as a result of the event. • Only applies to concrete entities (1stOrder) or mental objects such as ideas (3rdOrder). • Situations that result from other situations are related by the CAUSE relation.
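The agent, patient and result frames can be treated as sentence templates: fill in a noun and a gerund, judge the resulting sentence, and encode the relation if it sounds right. The sketch below is a simplified illustration; the function names are hypothetical and the article is fixed to "A" instead of choosing between "A" and "An".

```python
# Sentence templates for the three frames; x is the noun, y the gerund.
# Simplification (an assumption of this sketch): the article is always "A".
FRAMES = {
    "INVOLVED_AGENT": "A {x} is the one who does the {y}, typically intentionally.",
    "INVOLVED_PATIENT": "A {x} is the one who undergoes the {y}.",
    "INVOLVED_RESULT": "A {x} comes into existence as a result of {y}.",
}


def diagnostic_sentence(relation, noun, gerund):
    """Build the test sentence that a native speaker is asked to judge."""
    return FRAMES[relation].format(x=noun, y=gerund)


def encode(relation, verb, noun):
    """Return the relation triple to store once the sentence is accepted."""
    return ("{to " + verb + "}", relation, "{" + noun + "}")


if __name__ == "__main__":
    print(diagnostic_sentence("INVOLVED_AGENT", "teacher", "teaching"))
    print(encode("INVOLVED_AGENT", "teach", "teacher"))
```

The judgement itself stays manual; the templates only make sure every encoder tests the same wording.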
Hyponymy overloading (Guarino 1998, Vossen and Bloksma 1998). • The vocabulary does not clearly differentiate between orthogonal roles and disjoint types: • role: passenger, teacher, student • type: dog; cat • ?: • knife ->weapon, cutlery; spoon -> container, cutlery • food • material <- building material <-?- stone; <-?-water; <- brick; • Disjunctive and conjunctive hyperonyms: • albino -> animal or plant • spoon -> cutlery & container
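The difference between disjunctive and conjunctive hyperonyms can be made explicit by storing a mode next to the hyperonym list. The following is a minimal sketch under that assumption; the class name HyperonymSet and the "and"/"or" labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperonymSet:
    targets: tuple  # hyperonym labels
    mode: str       # "and" = conjunctive, "or" = disjunctive

    def holds_for(self, classes):
        """Check the set of classes of one referent against the hyperonyms."""
        if self.mode == "and":
            return all(t in classes for t in self.targets)
        if self.mode == "or":
            return any(t in classes for t in self.targets)
        raise ValueError(f"unknown mode: {self.mode}")


# The two examples from the slide.
SPOON = HyperonymSet(("cutlery", "container"), "and")
ALBINO = HyperonymSet(("animal", "plant"), "or")
```

A spoon has to be both cutlery and a container, so SPOON.holds_for({"cutlery"}) fails, whereas an albino only needs to be one of animal or plant.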
Hyponymy restructuring • ziekte (disease) • ingewandsziekte (bowel disease) • dierenziekte (animal disease) • infectieziekte (infectious disease) • haringwormziekte (anisakiasis: bowel disease of herrings) • kolder (staggers: brain disease of cattle) • veeziekte (cattle disease) • vuilbroed (infectious disease of bees)
Methodologies in a number of projects • Princeton Wordnet • EuroWordNet: • English, Dutch, German, French, Spanish, Italian, Czech, Estonian • 10,000 up to 50,000 synsets • BalkaNet: • Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian • 10,000 synsets
Main strategies for building wordnets • Expand approach: translate WordNet synsets to another language and take over the structure • easier and more efficient method • compatible structure with WordNet • vocabulary and structure are close to WordNet but also biased • can exploit many resources linked to WordNet: SUMO, WordNet Domains, selection restrictions from BNC, etc. • Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations • more complex and labor intensive • different structure from WordNet • language specific patterns can be maintained, i.e. very precise substitution patterns
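The expand approach can be sketched as a dictionary lookup followed by copying the hyperonym links. The code below is an illustration under stated assumptions, not the procedure of any project named here: the synset identifiers and the tiny lexicon are invented, and re-attaching a synset to its nearest translated hyperonym when an intermediate concept has no translation is one possible design choice.

```python
def expand(english_synsets, english_hyperonyms, lexicon):
    """Project English synsets and hyperonym links onto a target language.

    english_synsets: synset id -> list of English words
    english_hyperonyms: synset id -> id of its hyperonym synset
    lexicon: English word -> list of target-language translations
    """
    target = {}
    for sid, words in english_synsets.items():
        translations = []
        for word in words:
            for t in lexicon.get(word, []):
                if t not in translations:
                    translations.append(t)
        if translations:
            target[sid] = translations

    # Take over the structure, skipping hyperonyms without a translation.
    relations = {}
    for sid in target:
        parent = english_hyperonyms.get(sid)
        while parent is not None and parent not in target:
            parent = english_hyperonyms.get(parent)
        if parent is not None:
            relations[sid] = parent

    gaps = [sid for sid in english_synsets if sid not in target]
    return target, relations, gaps


# Invented sample data: "container" has no Dutch translation in this lexicon.
SYNSETS = {"object-n": ["object"], "container-n": ["container"],
           "spoon-n": ["spoon"], "bag-n": ["bag"]}
HYPERONYMS = {"container-n": "object-n", "spoon-n": "container-n",
              "bag-n": "container-n"}
LEXICON = {"object": ["voorwerp"], "spoon": ["lepel"], "bag": ["tas"]}
```

Running expand(SYNSETS, HYPERONYMS, LEXICON) attaches lepel and tas directly to voorwerp and reports container-n as a gap, which is the kind of case that needs manual attention.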
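Aligning wordnets [Figure: English wordnet nodes (object, artifact, natural object, instrument, musical instrument, organ, organ, hammond organ) set against Dutch wordnet nodes (object, muziekinstrument, orgel, orgel, orgaan, hammond orgel); question marks appear next to the organ, orgel and orgaan nodes.]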
General criteria for approach: • Maximize the overlap with wordnets for other languages • Maximize semantic consistency within and across wordnets • Maximally focus the manual effort where needed • Maximally exploit automatic techniques
Top-down methodology • Develop a core wordnet (5,000 synsets): • all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school • provide a formal and explicit semantics • Validate the core wordnet: • does it include the most frequent words? • are semantic constraints violated? • Extend the core wordnet (5,000 synsets or more): • automatic techniques for more specific concepts with high-confidence results • add other levels of hyponymy • add specific domains • add ‘easy’ derivational words • add ‘easy’ translation equivalence • Validate the complete wordnet
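The two validation questions for the core wordnet lend themselves to simple automatic checks. The sketch below assumes a frequency-ranked word list and a mapping from words to top-level ontology types; both inputs, the function names and the type labels are hypothetical, and the type check shown is only one example of a semantic constraint.

```python
def missing_frequent_words(wordnet_words, frequency_ranked, top_n):
    """Return the top_n most frequent words that the wordnet does not cover."""
    covered = set(wordnet_words)
    return [w for w in frequency_ranked[:top_n] if w not in covered]


def type_violations(hyperonym, top_type):
    """Return hyponymy links whose two ends carry different top-level types.

    hyperonym: word -> its hyperonym
    top_type: word -> top-level ontology type (only typed words are checked)
    """
    return [(child, parent) for child, parent in hyperonym.items()
            if child in top_type and parent in top_type
            and top_type[child] != top_type[parent]]


# Invented sample data with one deliberate error: "teaching" under "building".
LINKS = {"house": "building", "church": "building", "teaching": "building"}
TYPES = {"building": "Object", "house": "Object", "church": "Object",
         "teaching": "Event"}
```

Here type_violations(LINKS, TYPES) flags only the link from teaching to building, so manual effort can go to the flagged cases instead of the whole hierarchy.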
Developing a core wordnet • Define a set of concepts (so-called Base Concepts) that play an important role in wordnets: • high position in the hierarchy & high connectivity • represented as English WordNet synsets • Common base concepts: shared by various wordnets in different languages • Local base concepts: not shared • EuroWordNet: 1024 synsets, shared by 2 or more languages • BalkaNet: 5000 synsets (including 1024) • Common semantic framework for all Base Concepts, in the form of a Top-Ontology • Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (was applied for 13 wordnets) • Manually build and verify the hypernym relations for the Base Concepts • All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
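The split between common and local base concepts can be computed once every language has selected its own candidates. The sketch below assumes each selection is already expressed as English WordNet synset labels; the function name, the language codes and the sample selections are invented for illustration, and the threshold of two languages follows the EuroWordNet criterion stated above.

```python
from collections import Counter


def split_base_concepts(local_selections, min_languages=2):
    """Split per-language selections into common and local base concepts.

    local_selections: language code -> iterable of concept labels
    Returns (common, local), where local maps each language to the
    concepts it selected that are not shared widely enough.
    """
    counts = Counter()
    for concepts in local_selections.values():
        counts.update(set(concepts))  # count each language at most once
    common = {c for c, n in counts.items() if n >= min_languages}
    local = {lang: set(concepts) - common
             for lang, concepts in local_selections.items()}
    return common, local


# Invented sample selections for three languages.
SELECTIONS = {
    "nl": ["object", "tool", "disease"],
    "es": ["object", "tool"],
    "it": ["object", "food"],
}
```

With this data, object and tool become common base concepts, while disease and food stay local to Dutch and Italian respectively.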
Top-down methodology [Figure: Top-Ontology (63 TCs) above the 1024 CBCs; for each language wordnet: hyperonyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms and remaining hyponyms; the wordnets connect through the Inter-Lingual-Index, which also holds the remaining WordNet1.5 synsets.]
Advantages of the approach • Well-defined semantics that can be inherited down to more specific concepts • Apply consistency checks • Automatic techniques can use semantic basis • Most frequent concepts and words are covered • High overlap and compatibility with other wordnets • Manual effort is focussed on the most difficult concepts and words
EWN Interlingual Relations • EQ_SYNONYM: there is a direct match between a synset and an ILI-record. • EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously. • HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record. • HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records. • other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE
Complex equivalence relations • eq_near_synonym • 1. Multiple Targets • One sense for Dutch schoonmaken (to clean) which simultaneously matches with at least 4 senses of clean in WordNet1.5: • {make clean by removing dirt, filth, or unwanted substances from} • {remove unwanted substances from, such as feathers or pits, as of chickens or fruit} • {remove in making clean; "Clean the spots off the rug"} • {remove unwanted substances from - (as in chemistry)} • The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean. • 2. Multiple Source meanings • Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation: • Dutch wordnet: toestel near_synonym apparaat • ILI-records: {machine}; {device}; {apparatus}; {tool}
Complex equivalence relations • has_eq_hyperonym • Typically used for gaps in WordNet1.5 or in English: • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin, • pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop only refers to animal head, English uses head for both. • has_eq_hyponym • Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.
Overview of equivalence relations to the ILI
Relation | POS | Sources : Targets | Example
eq_synonym | same | 1 : 1 | auto : voiture car
eq_near_synonym | any | many : many | apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym | same | many : 1 (usually) | citroenjenever : gin
eq_hyponym | same | (usually) 1 : many | dedo : toe, finger
eq_metonymy | same | many/1 : 1 | universiteit, universiteitsgebouw : university
eq_diathesis | same | many/1 : 1 | raken (cause), raken : hit
eq_generalization | same | many/1 : 1 | schoonmaken : clean
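The choice between the four main equivalence relations follows from which kinds of ILI-records are available for a synset. The function below is a minimal sketch of that decision, assuming the candidate records have already been sorted by hand into exact, broader and narrower matches; the function name and argument layout are hypothetical and the remaining relations (metonymy, diathesis, generalization) are left out.

```python
def equivalence_links(exact=(), broader=(), narrower=()):
    """Pick equivalence relations for one synset from its candidate ILI-records.

    exact: records judged to match the synset directly
    broader: more general records, used when no direct match exists
    narrower: more specific records, used when nothing else is available
    """
    if len(exact) == 1:
        return [("eq_synonym", exact[0])]
    if len(exact) > 1:
        return [("eq_near_synonym", record) for record in exact]
    if broader:
        return [("has_eq_hyperonym", record) for record in broader]
    return [("has_eq_hyponym", record) for record in narrower]


if __name__ == "__main__":
    # Examples taken from the slides: citroenjenever and dedo.
    print(equivalence_links(broader=["gin"]))
    print(equivalence_links(narrower=["finger", "toe"]))
```

Preferring exact matches first and narrower records last mirrors the order in which the relations are introduced above, but the priority order itself is an assumption of this sketch.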
Filling gaps in the ILI Types of GAPS • genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin, • Non-productive • Non-compositional • pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier) • Productive • Compositional • Universality of gaps: Concepts occurring in at least 2 languages
Productive and Predictable Lexicalizations exhaustively linked to the ILI [Figure: language-specific synsets linked to ILI-records. Verbs: {doodslaan V} NL and {totschlagen V} DE, {doodstampen V} NL and {tottrampeln V} DE, and {doodschoppen V} NL, connected by hypernym links to the records beat, kill, stamp and kick. Nouns: {cajera N} ES and {casière} NL with hypernym cashier and in_state female; {alevín N} ES with hypernym fish and in_state young.]
Top-down methodology [Figure: building the Arabic Wordnet from the English Wordnet. Core wordnet of 5000 synsets: 1000 synsets from the CBC, SBC and ABC sets (EuroWordNet and BalkaNet Base Concepts) plus hypernyms, selected with the Sumo Ontology, Arabic word frequency and an English Arabic Lexicon (teach - darrasa), e.g. WordNet synset 1045678-v {teach} mapped to {darrasa}. Extensions: next level hyponyms, Arabic roots & derivation rules, WordNet Domains, more hyponyms, domain “chemics”, named entities, easy translations.]
ziekte (disease) • infectieziekte (infectious disease) • ingewandsziekte (bowel disease) • dierenziekte (animal disease) • haringwormziekte (anisakiasis: bowel disease of herrings) • kolder (staggers: brain disease of cattle) • veeziekte (cattle disease) • vuilbroed (infectious disease of bees)
ziekte (disease) • ingewandsziekte (bowel disease) • dierenziekte (animal disease) • infectieziekte (infectious disease) • haringwormziekte (anisakiasis: bowel disease of herrings) • veeziekte (cattle disease) • vuilbroed (infectious disease of bees) • kolder (staggers: brain disease of cattle)
Resources • Monolingual dictionaries: • definitions • synonym relations • other relations • Bilingual dictionaries: L-English, English-L • Ontologies • Thesauri • Corpora: • monolingual • parallel