The Global Wordnet Grid: anchoring languages to universal meaning Piek Vossen Irion Technologies/Free University of Amsterdam and Christiane Fellbaum Princeton University
Overview • Wordnet, EuroWordNet background • Architecture of the Global Wordnet Grid • Mapping wordnets to the Grid • Kyoto: an implementation of the Grid
WordNet1.5 • Developed at Princeton by George Miller and his team as a model of the mental lexicon. • Semantic network in which concepts are defined in terms of relations to other concepts. • Structure: • organized around the notion of synsets (sets of synonymous words) • basic semantic relations between these synsets • http://www.cogsci.princeton.edu/~wn/w3wn.html
EuroWordNet • The development of a multilingual database with wordnets for several European languages • Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328 • March 1996 - September 1999 • 2.5 Million EURO. • http://www.hum.uva.nl/~ewn • http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html
EuroWordnet architecture [Figure: wordnets for English, German, Dutch, Estonian, Spanish, French, Italian and Czech, each with a small hierarchy of its own words (English: vehicle > car, train; German: Fahrzeug > Auto, Zug; Dutch: voertuig > auto, trein; Estonian: liiklusvahend > auto, killavoor; Spanish: vehículo > auto, tren; French: véhicule > voiture, train; Italian: veicolo > auto, treno; Czech: dopravníprostředník > auto, vlak). Every wordnet is linked to a central Inter-Lingual-Index with English-based records (Car, Train, Vehicle), which is in turn linked to a Top Ontology (Object, Device, TransportDevice) and to Domains (Transport: Road, Water, Air).]
EuroWordNet • Wordnets are unique language-specific structures: • different lexicalizations • differences in synonymy and homonymy • different relations between synsets • same organizational principles: synset structure and same set of semantic relations. • Language-independent knowledge is assigned to the ILI and can thus be shared by all languages linked to the ILI: both an ontology and a domain hierarchy
Autonomous & Language-Specific [Figure: two hierarchies side by side. Wordnet1.5: object; artifact, artefact (a man-made object); natural object (an object occurring naturally); block; body; instrumentality; implement; device; container; tool; instrument; box; spoon; bag. Dutch Wordnet: voorwerp {object}; blok {block}; lichaam {body}; werktuig {tool}; bak {box}; lepel {spoon}; tas {bag}.]
Linguistic versus Artificial Ontologies • Artificial ontology: • better control or performance, or a more compact and coherent structure. • introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool), • neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise). • What properties can we infer for spoons? • spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
Linguistic versus Artificial Ontologies • Linguistic ontology: • Exactly reflects the relations between all the lexicalized words and expressions in a language. • Captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language. • What words can be used to name spoons? • spoon -> object, tableware, silverware, merchandise, cutlery
Wordnets versus ontologies • Wordnets: • autonomous language-specific lexicalization patterns in a relational network. • Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense-disambiguation. • Ontologies: • data structure with formally defined concepts. • Usage: making semantic inferences.
The Multilingual Design • Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages; • Index-records are mainly based on WordNet synsets and consist of synonyms, glosses and source references; • Various types of complex equivalence relations are distinguished; • Equivalence relations from synsets to index records: not on a word-to-word basis; • Indirect matching of synsets linked to the same index items;
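The indirect matching described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record names such as "ili:vehicle" are made up for the example and are not real ILI identifiers, and the tiny wordnets reuse the vehicle/car/train words from the architecture figure.

```python
from collections import defaultdict

# Hypothetical miniature data: per language, each synset points to the
# ILI records it is linked to. Record names are illustrative only.
LINKS = {
    "nl": {"voertuig": {"ili:vehicle"}, "auto": {"ili:car"}, "trein": {"ili:train"}},
    "de": {"Fahrzeug": {"ili:vehicle"}, "Auto": {"ili:car"}, "Zug": {"ili:train"}},
    "es": {"vehículo": {"ili:vehicle"}, "auto": {"ili:car"}, "tren": {"ili:train"}},
}


def build_index(links):
    """Invert synset -> ILI links into ILI record -> set of (language, synset)."""
    index = defaultdict(set)
    for lang, synsets in links.items():
        for synset, records in synsets.items():
            for record in records:
                index[record].add((lang, synset))
    return index


def equivalents(links, lang, synset, target_lang):
    """Synsets of target_lang that share at least one ILI record with (lang, synset)."""
    index = build_index(links)
    found = set()
    for record in links[lang][synset]:
        for other_lang, other_synset in index[record]:
            if other_lang == target_lang:
                found.add(other_synset)
    return sorted(found)


print(equivalents(LINKS, "nl", "trein", "de"))  # ['Zug']
```

No Dutch-German link is stored anywhere: the match falls out of both synsets pointing at the same index record, which is why the ILI itself can stay an unstructured list.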
Equivalent Near Synonym • 1. Multiple Targets (1:many) • Dutch wordnet: schoonmaken (to clean) matches with 4 senses of clean in WordNet1.5: • make clean by removing dirt, filth, or unwanted substances from • remove unwanted substances from, such as feathers or pits, as of chickens or fruit • remove in making clean; "Clean the spots off the rug" • remove unwanted substances from - (as in chemistry) • 2. Multiple Sources (many:1) • Dutch wordnet: versiersel near_synonym versiering ILI-Record: decoration. • 3. Multiple Targets and Sources (many:many) • Dutch wordnet: toestel near_synonym apparaat ILI-records: machine; device; apparatus; tool
Equivalent Hyperonymy Typically used for gaps in English WordNet: • genuine, cultural gaps for things not known in English culture: • Dutch: klunen, to walk on skates over land from one frozen body of water to the next • Dutch: citroenjenever, which is a kind of gin made out of lemon skin, • pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English: • Dutch: kunstproduct = artifact substance <=> artifact object • Dutch: hoofd = human head and Dutch: kop = animal head, English uses head for both.
From EuroWordNet to Global WordNet • Currently, wordnets exist for more than 40 languages, including: • Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu... • Many languages are genetically and typologically unrelated • http://www.globalwordnet.org
Some downsides • Construction is not done uniformly • Coverage differs • Not all wordnets can communicate with one another • Proprietary rights restrict free access and usage • A lot of semantics is duplicated • Complex and obscure equivalence relations due to linguistic differences between English and other languages
Next step: Global WordNet Grid [Figure: the language-specific wordnets (English: vehicle > car, train; German: Fahrzeug > Auto, Zug; Dutch: voertuig > auto, trein; Estonian: liiklusvahend > auto, killavoor; Spanish: vehículo > auto, tren; French: véhicule > voiture, train; Italian: veicolo > auto, treno; Czech: dopravníprostředník > auto, vlak) are all linked to a central Inter-Lingual Ontology (Object, Device, TransportDevice).]
GWNG: Main Features • Construct separate wordnets for each Grid language • Contributors from each language encode the same core set of concepts plus culture/language-specific ones • Synsets (concepts) can be mapped crosslinguistically via an ontology • No license constraints, freely available
The Ontology: Main Features • Formal, artificial ontology serves as universal index of concepts • List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations • Concepts are related in a type hierarchy • Concepts are defined with axioms
The Ontology: Main Features • In addition to high-level (“primitive”) concepts, the ontology needs to express low-level concepts lexicalized in the Grid languages • Additional concepts can be defined with expressions in Knowledge Interchange Format (KIF), based on first-order predicate calculus and atomic elements
The Ontology: Main Features • Minimal set of concepts (Reductionist view): • to express equivalence across languages • to support inferencing • Ontology must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages
The Ontology: Main Features • Ontology need not and cannot provide a linguistic encoding for all concepts found in the Grid languages • Lexicalization in a language is not sufficient to warrant inclusion in the ontology • Lexicalization in all or many languages may be sufficient • Ontological observations will be used to define the concepts in the ontology
Ontological observations • Identity criteria as used in OntoClean (Guarino & Welty 2002): • rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while. • essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of. • unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.
Type-role distinction • Current WordNet treatment: (1) a husky is a kind of dog (type); (2) a husky is a kind of working dog (role) • What’s wrong? (2) is defeasible, (1) is not: • *This husky is not a dog. • This husky is not a working dog. • Other roles: watchdog, sheepdog, herding dog, lapdog, etc.
Ontology and lexicon • Hierarchy of disjunct types: Canine: PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky • Lexicon: • NAMES for TYPES: {poodle}EN, {poedel}NL, {pudoru}JP • (instance x Poodle) • LABELS for ROLES: {watchdog}EN, {waakhond}NL, {banken}JP • ((instance x Canine) and (role x GuardingProcess))
Ontology and lexicon • Hierarchy of disjunct types: River; Clay; etc… • Lexicon: • NAMES for TYPES: {river}EN, {rivier, stroom}NL • (instance x River) • LABELS for dependent concepts: • {rivierwater}NL (water from a river => water is not Unit) ((instance x water) and (instance y River) and (portion x y)) • {kleibrok}NL (irregularly shaped piece of clay => Non-essential) ((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular))
Rigidity • The “primitive” concepts represented in the ontology are rigid types • Entities with non-rigid properties will be represented with KIF statements • But: ontology may include some universal, core concepts referring to roles like father, mother
Properties of the Ontology • Minimal: terms are distinguished by essential properties only • Comprehensive: includes all distinct concept types of all Grid languages • Allows definitions via KIF of all lexemes that express non-rigid, non-essential properties of types • Logically valid, allows inferencing
Mapping Grid Languages onto the Ontology • Explicit and precise equivalence relations among synsets in different languages, which is made easier because: • the type hierarchy is minimal • subtle differences can be encoded in KIF expressions • The Grid database contains wordnets with synsets that label • either “primitive” types in the hierarchies, • or words relating to these types in ways made explicit in KIF expressions • If two languages create the same KIF expression, this is a statement of equivalence!
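The last point, that identical expressions amount to an equivalence statement, can be illustrated with a rough Python sketch. It assumes, purely for illustration, that an expression is a conjunction of atoms stored as an order-insensitive set and that all contributors use the same variable names; a real system would need proper normalization of KIF.

```python
from collections import defaultdict


def expr(*atoms):
    """A conjunction of atoms, stored so that the order of conjuncts does not matter.
    Assumption: variables are named consistently across languages."""
    return frozenset(atoms)


GUARD = expr(("instance", "x", "Canine"), ("role", "x", "GuardingProcess"))

# Hypothetical lexicon fragment: (language, word) -> expression.
LEXICON = {
    ("en", "watchdog"): GUARD,
    # Same conjuncts written in a different order by another contributor.
    ("nl", "waakhond"): expr(("role", "x", "GuardingProcess"), ("instance", "x", "Canine")),
    ("jp", "banken"): GUARD,
    ("nl", "straathond"): expr(("instance", "x", "Canine"), ("habitat", "x", "Street")),
}


def equivalence_classes(lexicon):
    """Group lexicon entries whose expressions are identical."""
    groups = defaultdict(list)
    for key, expression in lexicon.items():
        groups[expression].append(key)
    return [sorted(members) for members in groups.values()]


for members in equivalence_classes(LEXICON):
    print(members)
```

Here watchdog, waakhond and banken end up in one class without anyone stating a pairwise link, while straathond stays on its own.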
How to construct the GWNG • Take an existing ontology as starting point; • Use English WordNet to maximize the number of disjunct types in the ontology; • Link English WordNet synsets as names to the disjunct types; • Provide KIF expressions for all other English words and synsets
How to construct the GWNG • Copy the relations between the English WordNet and the ontology to other languages, including the KIF statements built for English • Revise KIF statements to make the mapping more precise • Map all words and synsets that are not and cannot be mapped to English WordNet onto the ontology: • propose extensions to the type hierarchy • create KIF expressions for all non-rigid concepts
Initial Ontology: SUMO (Niles and Pease) SUMO = Suggested Upper Merged Ontology --consistent with good ontological practice --fully mapped to WordNet(s): 1000 equivalence mappings, the rest through subsumption --freely and publicly available --allows data interoperability --allows NLP --allows reasoning/inferencing
Mapping Grid languages onto the Ontology • Check existing SUMO mappings to Princeton WordNet -> extend the ontology with rigid types for specific concepts • Extend it to many other WordNet synsets • Observe OntoClean principles! (Synsets referring to non-rigid, non-essential, non-unicitous concepts must be expressed in KIF)
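As a rough illustration of the OntoClean rule on this slide, the decision can be written as a small Python function. The three boolean flags are a simplification introduced for this sketch (OntoClean itself is a methodology, not a three-flag test), and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str
    rigid: bool      # true of its instances in all worlds (human vs. student)
    essential: bool  # distinguished by essential properties (statue vs. clay)
    whole: bool      # denotes a whole, the slides' "unicity" (ocean vs. water)


def placement(concept):
    """Return 'type' if the concept may enter the type hierarchy, else 'kif'."""
    if concept.rigid and concept.essential and concept.whole:
        return "type"
    return "kif"


examples = [
    Concept("Canine", True, True, True),
    Concept("watchdog", False, True, True),     # a role, so non-rigid
    Concept("rivierwater", True, True, False),  # water is not a unit
    Concept("kleibrok", True, False, True),     # irregular shape is non-essential
]
for c in examples:
    print(c.name, "->", placement(c))
```

The slides allow exceptions to this rule: a near-universal role such as father may still be added to the ontology even though it is not rigid.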
Lexicalizations not mapped to WordNet • Not added to the type hierarchy: {straathond}NL (a dog that lives in the streets) • ((instance x Canine) and (habitat x Street)) • Added to the type hierarchy: {klunen}NL (to walk on skates over land from one frozen body of water to the next) KluunProcess => WalkProcess Axioms: (and (instance x Human) (instance y Walk) (instance z Skates) (wear x z) (instance s1 Skate) (instance s2 Skate) (before s1 y) (before y s2) etc… • National dishes, customs, games,....
Most mismatching concepts are not new types • They refer to sets of types in specific circumstances or to concepts that are dependent on these types. Next to {rivierwater}NL there are many others: {theewater}NL (water used for making tea) {koffiewater}NL (water used for making coffee) {bluswater}NL (water used for extinguishing fire) • They relate to linguistic phenomena: • gender, perspective, aspect, diminutives, politeness, pejoratives, part-of-speech constraints
KIF expression for gender marking • {teacher}EN ((instance x Human) and (agent x TeachingProcess)) • {Lehrer}DE ((instance x Man) and (agent x TeachingProcess)) • {Lehrerin}DE ((instance x Woman) and (agent x TeachingProcess))
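One consequence of these three expressions is that Lehrer and Lehrerin come out as more specific than teacher without any direct link between the words. A minimal Python sketch, assuming that Man and Woman are subtypes of Human in the type hierarchy and reducing each entry to a (type, process) pair for illustration:

```python
# Assumed fragment of the type hierarchy: child type -> parent type.
PARENT = {"Man": "Human", "Woman": "Human"}

# Each entry reduced to (type of x, process that x is the agent of).
ENTRIES = {
    "teacher@EN": ("Human", "TeachingProcess"),
    "Lehrer@DE": ("Man", "TeachingProcess"),
    "Lehrerin@DE": ("Woman", "TeachingProcess"),
}


def is_subtype(candidate, ancestor):
    """True if candidate equals ancestor or lies below it in PARENT."""
    while candidate is not None:
        if candidate == ancestor:
            return True
        candidate = PARENT.get(candidate)
    return False


def subsumes(general, specific):
    """True if every referent of `specific` also satisfies `general`."""
    g_type, g_process = ENTRIES[general]
    s_type, s_process = ENTRIES[specific]
    return g_process == s_process and is_subtype(s_type, g_type)


print(subsumes("teacher@EN", "Lehrer@DE"))   # True
print(subsumes("Lehrer@DE", "Lehrerin@DE"))  # False
```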
KIF expression for perspective sell: subj(x), direct obj(z), indirect obj(y) versus buy: subj(y), direct obj(z), indirect obj(x) (and (instance x Human) (instance y Human) (instance z Entity) (instance e FinancialTransaction) (source x e) (destination y e) (patient z e)) The same process but a different perspective through subject and object realization: Russian has two verbs for marry, and French apprendre can mean both teach and learn
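The sell/buy contrast can be sketched as one shared event frame plus a per-verb table saying which role surfaces as which grammatical function. The Python below is only an illustration: the dictionary layout and the example participants are assumptions, while the role assignment (x as source, y as destination, z as the thing transferred) follows the slide.

```python
# Per verb: grammatical function -> role in the shared transaction event.
REALIZATION = {
    "sell": {"subj": "source", "direct_obj": "patient", "indirect_obj": "destination"},
    "buy": {"subj": "destination", "direct_obj": "patient", "indirect_obj": "source"},
}


def to_event(verb, subj, direct_obj, indirect_obj):
    """Map a clause to the perspective-neutral event it describes."""
    arguments = {"subj": subj, "direct_obj": direct_obj, "indirect_obj": indirect_obj}
    event = {"type": "FinancialTransaction"}
    for function, role in REALIZATION[verb].items():
        event[role] = arguments[function]
    return event


sold = to_event("sell", "Ann", "bike", "Bob")    # Ann sells the bike to Bob
bought = to_event("buy", "Bob", "bike", "Ann")   # Bob buys the bike from Ann
print(sold == bought)  # True
```

Both clauses reduce to the same event, so only the realization table is language- and verb-specific.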
Part-of-speech mismatches • {bankdrukken-V}NL vs.{bench press-N}EN • {gehuil-N}NL vs. {cry-V}EN • {afsluiting-N}NL vs. {close-V}EN • Process in the ontology is neutral with respect to POS!
Parallel Noun and Verb hierarchy • event / to happen • act / to act • deed / to do • sale / to sell • promise / to promise • change / to change • movement / to move • change of location / to move position • Encoded once as a Process in the ontology!
Mixed Noun and Adjective hierarchy • Colour: red, blue, green, etc. • Height: high, low • Size: big, small • Emotion: sad, angry, happy, anxious • etc. Encoded once as attributes in the ontology!
Aspectual variants • Slavic languages: two members of a verb pair for an ongoing event and a completed event. • English: can mark perfectivity with particles, as in the phrasal verbs eat up and read through. • Romance languages: mark aspect by verb conjugations on the same verb. • Dutch: verbs with marked aspect can be created by prefixing a verb with door: doorademen, dooreten, doorfietsen, doorlezen, doorpraten (continue to breathe/eat/bike/read/talk). • These verbs are restrictions on phases of the same process • This does NOT warrant the extension of the ontology with separate processes for each aspectual variant
Aspectual lexicalization • Regular compositional verb structures: doorademen: (lit. through+breathe, continue to breathe) doorbetalen: (lit. through+pay, continue to pay) doorlopen: (lit. through+walk, continue to walk) doorfietsen: (lit. through+cycle, continue to cycle) doorrijden: (lit. through+drive, continue to drive) (and (instance x BreathProcess) (instance y Time) (instance z Time) (end x z) (expected (end x y)) (after z y))
Lexicalization of Resultatives • MORE GENERAL VERBS: openmaken: (lit. open+make, to cause to be open); dichtmaken: (lit. closed+make, to cause to be closed); • MORE SPECIFIC VERBS: openknijpen: (lit. open+squeeze, to open by squeezing) has_hyperonym knijpen (to squeeze) & openmaken (to open) opendraaien: (lit. open+turn, to open by turning) has_hyperonym draaien (to turn) & openmaken (to open) dichtknijpen: (lit. closed+squeeze, to close by squeezing) has_hyperonym knijpen (to squeeze) & dichtmaken (to close) dichtdraaien: (lit. closed+turn, to close by turning) has_hyperonym draaien (to turn) & dichtmaken (to close)
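Because each specific verb has two hyperonyms, one for the manner and one for the result, the relation forms a graph rather than a tree. A small Python sketch of that structure, using only the verbs listed on this slide (the dictionary representation is an assumption made for illustration):

```python
# has_hyperonym links from the slide: each specific verb has two parents.
HYPERONYMS = {
    "openknijpen": ["knijpen", "openmaken"],
    "opendraaien": ["draaien", "openmaken"],
    "dichtknijpen": ["knijpen", "dichtmaken"],
    "dichtdraaien": ["draaien", "dichtmaken"],
}


def ancestors(verb):
    """All hyperonyms reachable from verb, following every parent link."""
    seen = set()
    stack = [verb]
    while stack:
        for parent in HYPERONYMS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Same manner, opposite result: only the manner verb is shared.
print(ancestors("openknijpen") & ancestors("dichtknijpen"))  # {'knijpen'}
# Same result, different manner: only the result verb is shared.
print(ancestors("openknijpen") & ancestors("opendraaien"))   # {'openmaken'}
```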
Kinship relations in Arabic • عَم(Eam~) father's brother, paternal uncle. • خَال (xaAl) mother's brother, maternal uncle. • عَمَّة (Eam~ap) father's sister, paternal aunt. • خَالَة (xaAlap) mother's sister, maternal aunt
Kinship relations in Arabic • ......... • شَقِيقَة ($aqiyqap) full sister, sister on the paternal and maternal side (as distinct from أُخْت (>uxot): 'sister' which may refer to a 'sister' from paternal or maternal side, or both sides). • ثَكْلان (vakolAna) father bereaved of a child (as opposed to يَتِيم (yatiym) or يَتِيمَة (yatiymap) for feminine: 'orphan' a person whose father or mother died or both father and mother died). • ثَكْلَى (vakolaYa) mother bereaved of a child (as opposed to يَتِيم or يَتِيمَة for feminine: 'orphan' a person whose father or mother died or both father and mother died).
Complex Kinship concepts father's brother, paternal uncle WORDNET paternal uncle => uncle => brother of ....???? ONTOLOGY (=> (paternalUncle ?P ?UNC) (exists (?F) (and (father ?P ?F) (brother ?F ?UNC))))
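To show what the axiom buys over the wordnet hyperonym chain, here is a minimal Python sketch that finds paternal uncles from plain father and brother facts. Several things are assumptions of the sketch, not of the slide: the argument reading (("father", p, f) means f is the father of p; ("brother", f, u) means u is a brother of f), the invented names, and the use of the right-hand side of the axiom as a definition, which is the converse direction of the implication as written.

```python
# Hypothetical facts as (relation, arg1, arg2) triples.
FACTS = {
    ("father", "Omar", "Hassan"),   # Hassan is Omar's father
    ("brother", "Hassan", "Karim"), # Karim is a brother of Hassan
    ("mother", "Omar", "Layla"),    # Layla is Omar's mother
    ("brother", "Layla", "Sami"),   # Sami is a brother of Layla
}


def paternal_uncles(person, facts):
    """People u such that some f is the father of person and u is a brother of f."""
    fathers = {f for (rel, p, f) in facts if rel == "father" and p == person}
    return sorted(u for (rel, f, u) in facts if rel == "brother" and f in fathers)


print(paternal_uncles("Omar", FACTS))  # ['Karim']
```

Sami is left out because he is reached through the mother, which is exactly the distinction that the Arabic words عَم and خَال lexicalize.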
Fine-tune equivalence relations • {rivier}NL (and (instance x River) (instance y RiverMouth) (instance z Country) (part y x) (location y z)) • {stroom}NL (and (instance x River) (instance y RiverMouth) (instance p RiverPart) (not (equal p y)) (instance z Country) (location p z) (not (location y z)))
Universality as evidence • If lexicalization of the specific processes is more universal, this can be seen as evidence that the specific processes should be listed in the ontology and not the generic verb: • The English verb cut abstracts from the precise process, but there are troponyms that implicate the manner: snip and clip imply scissors, chop and hack a large knife or an axe • In Dutch there is no general verb but only specific verbs: knippen (“clip, snip, cut with scissors or a scissor-like tool”), snijden (“cut with a knife or knife-like tool”), hakken (“chop, hack, cut with an axe or similar tool”). • If Father is lexicalized in most languages we add it to the ontology even when it is NOT Rigid!
Universality as evidence • Artifact substance is lexicalized in Dutch and other languages => ArtifactObject in SUMO needs to be generalized to Artifact so that it can be applied to both substances and objects
Open Questions/Challenges • What is a word, i.e., a lexical unit? • What is the status of complex lexemes like English lightning rod, word of mouth, find out, kick the bucket? • What is the status of compounds in Germanic languages and Chinese? • "hottentottententententoonstelling" (exposition of tents of the "hottentotten" (African tribe)) • What is a semantic unit, i.e. a concept?