Piek Vossen VU University Amsterdam

From WordNet, to EuroWordNet, to the Global Wordnet Grid: anchoring languages to universal meaning

Presentation Transcript


  1. From WordNet, to EuroWordNet, to the Global Wordnet Grid: anchoring languages to universal meaning Piek Vossen VU University Amsterdam Guest lecture, Language Engineering Applications, February, 26th 2009, Leuven

  2. Overview
     • WordNet
     • EuroWordNet
     • Global Wordnet Grid
     • Stevin project Cornetto
     • 7th Framework project KYOTO

  3. WordNet
     • http://wordnet.princeton.edu/
     • Lexical semantic database for English
     • Developed by George Miller and his team at Princeton University, as the implementation of a mental model of the lexicon
     • Organized around the notion of a synset: a set of synonyms in a language that represent a single concept
     • Semantic relations hold between concepts (synsets), not between words
     • Currently covers over 117,000 concepts (synsets) and over 150,000 English words

  4. Relational model of meaning
     [Diagram: networks of related words: animal, cat, dog, kitten, puppy; man, woman, boy, girl]

  5. Wordnet: a network of semantically related words
     [Diagram: synsets linked around {car; auto; automobile; machine; motorcar}]
     • hyper(o)nyms shown: {conveyance; transport}, {vehicle}, {motor vehicle; automotive vehicle}
     • hyponyms shown: {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}
     • meronyms shown: {armrest}, {car mirror}, {car door}, {doorlock}, {bumper}, {hinge; flexible joint}, {car window}
     • Hyponymy and meronymy relations are:
       • transitive
       • directed
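
The "transitive" and "directed" properties of hyponymy can be illustrated with a minimal Python sketch. The graph below is a toy example: it uses one representative word per synset from the slide, and the direct links between them are an assumption made for illustration, not a dump of the real WordNet database. The function names are hypothetical.

```python
# Toy hypernym graph: each key points to its direct hypernym.
# The direct links are an illustrative assumption, not real WordNet data.
HYPERNYM = {
    "cab": "car",
    "police car": "car",
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
}


def hypernym_chain(synset, hypernym=HYPERNYM):
    """Follow the directed hypernym links upward and collect every ancestor."""
    chain = []
    seen = {synset}
    while synset in hypernym:
        synset = hypernym[synset]
        if synset in seen:  # guard against accidental cycles in toy data
            break
        seen.add(synset)
        chain.append(synset)
    return chain


def is_a(specific, general, hypernym=HYPERNYM):
    """Transitivity: 'specific' is a kind of 'general' if it is any ancestor."""
    return general in hypernym_chain(specific, hypernym)


print(hypernym_chain("cab"))
print(is_a("cab", "vehicle"))  # holds by transitivity
print(is_a("vehicle", "cab"))  # does not hold: the relation is directed
```

Because the relation is directed, the lookup only ever walks upward; a property attached to "vehicle" can be inherited by "cab", but not the other way around.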

  6. Wordnet Semantic Relations
     • WN 1.5 starting point
     • The ‘synset’ as a weak notion of synonymy: “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)
     • Relations between synsets, with examples:
       • HYPONYMY: noun-to-noun (car / vehicle); verb-to-verb (walk / move)
       • MERONYMY: noun-to-noun (head / nose)
       • ANTONYMY: adjective-to-adjective (good / bad); verb-to-verb (open / close)
       • ENTAILMENT: verb-to-verb (buy / pay)
       • CAUSE: verb-to-verb (kill / die)

  7. Wordnet Data Model
     [Diagram: the vocabulary of a language (words) linked to concepts (records), with relations between the concepts]
     • rec: 12345 - financial institute: bank (sense 1)
     • rec: 54321 - side of a river: bank (sense 2)
     • rec: 9876 - small string instrument: fiddle, violin
     • rec: 65438 - musician playing violin: fiddler, violist
     • rec: 42654 - musician
     • rec: 35576 - string of instrument: string (sense 1)
     • rec: 29551 - underwear: string (sense 2)
     • rec: 25876 - string instrument
     • Labels in the diagram: polysemy (bank, string), polysemy & synonymy (fiddle, violin), type-of and part-of relations between records
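
The word-to-concept mapping in this data model can be sketched in a few lines of Python. The record numbers and glosses are taken from the slide; the dictionary layout and the function names (`build_word_index`, `senses`, `synonyms`) are assumptions made for illustration and do not reflect how the WordNet database is actually stored.

```python
from collections import defaultdict

# Records from the slide: record id -> (gloss, words that name the concept).
CONCEPTS = {
    12345: ("financial institute", ["bank"]),
    54321: ("side of a river", ["bank"]),
    9876: ("small string instrument", ["fiddle", "violin"]),
    35576: ("string of instrument", ["string"]),
    29551: ("underwear", ["string"]),
}


def build_word_index(concepts):
    """Invert the concept table: word -> list of record ids (its senses)."""
    index = defaultdict(list)
    for rec_id, (_gloss, words) in concepts.items():
        for word in words:
            index[word].append(rec_id)
    return dict(index)


WORD_INDEX = build_word_index(CONCEPTS)


def senses(word):
    """Polysemy: one word may point to several concepts."""
    return WORD_INDEX.get(word, [])


def synonyms(word):
    """Synonymy: other words that share at least one concept with 'word'."""
    result = set()
    for rec_id in senses(word):
        result.update(CONCEPTS[rec_id][1])
    result.discard(word)
    return result


print(senses("bank"))      # two records, so "bank" is polysemous
print(synonyms("fiddle"))  # words sharing record 9876
```

Keeping relations on the record level rather than the word level is what lets a single type-of or part-of link apply to every synonym of a concept at once.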

  8. Some observations on Wordnet
     • synsets are more compact representations for concepts than word meanings in traditional lexicons
     • synonyms and hypernyms are substitutional variants:
       • begin – commence
       • I once had a canary. The bird got sick. The poor animal died.
     • hyponymy and meronymy chains are important transitive relations for predicting properties and explaining textual properties: object -> artifact -> vehicle -> 4-wheeled vehicle -> car
     • strict separation of parts of speech, although concepts are closely related (bed – sleep) and are similar (dead – death)
     • lexicalization patterns reveal important mental structures

  9. Lexicalization patterns
     [Diagram: hierarchy starting at "entity" (25 unique beginners), with the nodes object, organism, garbage, threat, artifact, animal, plant, waste, building, bird, tree, flower, church, canary, dog, crocodile, rose, common canary, abbey]
     • basic level concepts:
       • balance of two principles:
         • predict most features
         • apply to most subclasses
       • where most concepts are created
       • amalgamate most parts
       • most abstract level at which a picture can be drawn

  10. Wordnet top level

  11. Meronymy & pictures
      [Picture with part labels: tail, leg, beak]

  12. Meronymy & pictures

  13. Wordnet 3.0 statistics

  14. Wordnet 3.0 statistics

  15. http://www.visuwords.com

  17. Usage of Wordnet
      • Most widely used database in language technology
      • Enormous impact on language technology development
      • Large
      • Free and downloadable
      • English

  18. Usage of Wordnet
      • Improve recall of text-based analysis:
        • Query -> Index
        • Synonyms: commence – begin
        • Hypernyms: taxi -> car
        • Hyponyms: car -> taxi
        • Meronyms: trunk -> elephant
        • Lexical entailments: gun -> shoot
      • Inferencing:
        • what things can burn?
      • Expression in language generation and translation:
        • alternative words and paraphrases
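
The Query -> Index step above can be sketched as simple query expansion. This is a minimal illustration under stated assumptions: the tiny `RELATED` lexicon holds only the word pairs named on the slide, the documents are invented, and `expand_query` and `search` are hypothetical helper names rather than part of any WordNet API.

```python
# Tiny hand-made lexicon using the word pairs from the slide.
RELATED = {
    "begin": {"synonyms": ["commence"]},
    "commence": {"synonyms": ["begin"]},
    "taxi": {"hypernyms": ["car"]},
    "car": {"hyponyms": ["taxi"]},
}


def expand_query(terms, relations=("synonyms",), lexicon=RELATED):
    """Add related words to the query terms, keeping order and dropping duplicates."""
    expanded = []
    for term in terms:
        candidates = [term]
        for relation in relations:
            candidates.extend(lexicon.get(term, {}).get(relation, []))
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(terms, documents):
    """Return the documents that contain at least one of the terms."""
    wanted = set(terms)
    return [doc for doc in documents if wanted & set(doc.lower().split())]


# Invented example documents.
DOCS = ["The taxi stopped", "They commence at noon", "A bird sang"]

query = ["car", "begin"]
print(search(query, DOCS))  # literal matching finds nothing
print(search(expand_query(query, ("synonyms", "hyponyms")), DOCS))
```

Expansion raises recall because documents that use "taxi" or "commence" are now matched; the price is a possible loss of precision when an added word belongs to a different sense.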

  19. Improve recall
      • Information retrieval:
        • effective on small databases without redundancy, e.g. image captions, video text
      • Text classification:
        • expand small training sets
        • reduce training effort
      • Question & Answer systems:
        • question classification: who, where, what, when
        • match answers to question types

  20. Improve recall
      • Anaphora resolution:
        • The girl fell off the table. She....
        • The glass fell off the table. It...
      • Coreference resolution:
        • When he moved the furniture, the antique table got damaged.
      • Information extraction (unstructured text to structured databases):
        • generic forms or patterns "vehicle" -> text with specific cases "car"

  21. Improve recall
      • Summarizers:
        • Sentence selection based on word counts -> concept counts
        • Avoid repetition in summary -> language generation, pick out another synonym or hypernym
      • Limited inferencing: detect locations, people, organisations, etc.

  22. Enabling technologies
      • Semantic similarity: what sentences or expressions are semantically similar?
      • Semantic relatedness and textual entailment: smoke entails fire, fire entails damage
      • Word-sense disambiguation
      • Erwin Marsi, University of Tilburg, http://daeso.uvt.nl/demos/index.html

  25. Recall & Precision
      [Diagram: overlapping sets "found" and "relevant" with their intersection, for the query “cell”; terms shown: “jail”, “nerve cell”, “police cell”, “cell phone”, “mobile phones”, “neuron”]
      • recall = intersection / relevant
      • precision = intersection / found
      • Recall < 20% for basic search engines! (Blair & Maron 1985)
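
The two ratios on this slide can be computed directly from sets of document identifiers. The sketch below is a minimal illustration; the document ids are invented, and the convention of returning 0.0 for an empty set is an assumption rather than something stated in the lecture.

```python
def precision_recall(found, relevant):
    """Compute precision and recall from two collections of document ids.

    precision = |found ∩ relevant| / |found|
    recall    = |found ∩ relevant| / |relevant|
    """
    found, relevant = set(found), set(relevant)
    hits = found & relevant
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 4 documents returned, 6 documents actually relevant.
found = {"d1", "d2", "d3", "d4"}
relevant = {"d3", "d4", "d5", "d6", "d7", "d8"}

precision, recall = precision_recall(found, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this invented case half of what was found is relevant, but only a third of the relevant documents were found, which is the kind of recall gap the slide attributes to basic search engines.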

  26. Many others
      • Data sparseness for machine learning: hapaxes can be replaced by semantic classes that match classes from the training set
      • Use redundancy for more robustness: spelling correction and speech recognition can build semantic expectations using Wordnet and make better choices
      • Sentiment and opinion mining
      • Natural language learning

  27. EuroWordNet
      • The development of a multilingual database with wordnets for several European languages
      • Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
      • March 1996 - September 1999
      • 2.5 million euro
      • http://www.hum.uva.nl/~ewn
      • http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html

  28. EuroWordNet
      • Languages covered:
        • EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
        • EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian
      • Size of vocabulary:
        • EuroWordNet-1: 30,000 concepts - 50,000 word meanings
        • EuroWordNet-2: 15,000 concepts - 25,000 word meanings
      • Type of vocabulary:
        • the most frequent words of the languages
        • all concepts needed to relate more specific concepts

  29. EuroWordNet Model
      [Diagram: language-specific wordnets, each with a Lexical Items Table, linked through the Inter-Lingual-Index (example ILI-record: {drive}) to a Domains hierarchy (Traffic: Air, Road) and an Ontology (2OrderEntity: Location, Dynamic)]
      • Words shown per language:
        • Dutch: bewegen, gaan, rijden, berijden
        • English: move, go, ride, drive
        • Italian: guidare, cavalcare, andare, muoversi
        • Spanish: conducir, cabalgar, jinetear, mover, transitar
      • I = Language Independent link
      • II = Link from Language Specific to Inter-Lingual-Index
      • III = Language Dependent Link

  30. Differences in relations between EuroWordNet and WordNet
      • Added features to relations
      • Cross-Part-Of-Speech relations
      • New relations to differentiate shallow hierarchies
      • New interpretations of relations

  31. EWN Relationship Labels
      • {airplane}
        • HAS_MERO_PART: conj1 {door}
        • HAS_MERO_PART: conj2 disj1 {jet engine}
        • HAS_MERO_PART: conj2 disj2 {propeller}
      • {door}
        • HAS_HOLO_PART: disj1 {car}
        • HAS_HOLO_PART: disj2 {room}
        • HAS_HOLO_PART: disj3 {entrance}
      • Default interpretation: non-exclusive disjunction

  32. Overview of the Language Internal relations in EuroWordNet
      • Same Part of Speech relations:
        • HYPERONYMY/HYPONYMY: car - vehicle
        • ANTONYMY: open - close
        • HOLONYMY/MERONYMY: head – nose
        • NEAR_SYNONYMY: apparatus - machine
      • Cross-Part-of-Speech relations:
        • XPOS_NEAR_SYNONYMY: dead - death; to adorn - adornment
        • XPOS_HYPERONYMY/HYPONYMY: to love - emotion
        • XPOS_ANTONYMY: to live - dead
        • CAUSE: die - death
        • SUBEVENT: buy - pay; sleep - snore
        • ROLE/INVOLVED: write - pencil; hammer - hammer
        • STATE: the poor - poor
        • MANNER: to slurp - noisily
        • BELONG_TO_CLASS: Rome - city

  33. Co_Role relations
      • criminal CO_AGENT_PATIENT victim
      • novel writer / poet CO_AGENT_RESULT novel / poem
      • dough CO_PATIENT_RESULT pastry / bread
      • photographic camera CO_INSTRUMENT_RESULT photo
      • guitar player
        • HAS_HYPERONYM player
        • CO_AGENT_INSTRUMENT guitar
      • player
        • HAS_HYPERONYM person
        • ROLE_AGENT to play music
        • CO_AGENT_INSTRUMENT musical instrument
      • to play music
        • HAS_HYPERONYM to make
        • ROLE_INSTRUMENT musical instrument
      • guitar
        • HAS_HYPERONYM musical instrument
        • CO_INSTRUMENT_AGENT guitar player

  34. Horizontal & vertical semantic relations
      [Diagram: vertical HYPONYM links combined with horizontal role links around the verbs treat and cure]
      • Terms shown: patient; chronic patient; mental patient; doctor; child doctor; child; disease; disorder; stomach disease; kidney disorder; physiotherapy; medicine, etc.; hospital, etc.
      • Relation labels shown: HYPONYM, STATE, ρ-AGENT, ρ-PATIENT, ρ-CAUSE, ρ-LOCATION, ρ-PROCEDURE, co-ρ-AGENT-PATIENT

  35. The Multilingual Design
      • Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages
      • Index records are mainly based on WordNet synsets and consist of synonyms, glosses and source references
      • Various types of complex equivalence relations are distinguished
      • Equivalence relations run from synsets to index records, not on a word-to-word basis
      • Indirect matching of synsets linked to the same index items
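
Indirect matching through the Inter-Lingual-Index can be sketched as a lookup via shared index records. The words below appear in the EuroWordNet model diagram, but the specific links to the ILI records "drive" and "ride" are assumptions made for this illustration, as are the names `EQ_LINKS`, `build_ili_index` and `equivalents`.

```python
from collections import defaultdict

# (language, synset) -> set of ILI records it is linked to.
# These links are illustrative assumptions, not EuroWordNet data.
EQ_LINKS = {
    ("nl", "rijden"): {"drive"},
    ("it", "guidare"): {"drive"},
    ("es", "conducir"): {"drive"},
    ("it", "cavalcare"): {"ride"},
    ("es", "cabalgar"): {"ride"},
}


def build_ili_index(links):
    """Invert the links: ILI record -> set of (language, synset) pairs."""
    index = defaultdict(set)
    for synset, records in links.items():
        for record in records:
            index[record].add(synset)
    return index


def equivalents(language, synset, target_language, links=EQ_LINKS):
    """Find target-language synsets that share an ILI record with 'synset'."""
    index = build_ili_index(links)
    matches = set()
    for record in links.get((language, synset), set()):
        for lang, other in index[record]:
            if lang == target_language:
                matches.add(other)
    return sorted(matches)


print(equivalents("nl", "rijden", "es"))
print(equivalents("it", "cavalcare", "es"))
```

No Dutch-Spanish link is stored anywhere: the match is derived from both synsets pointing at the same index record, which is why adding a new language only requires links to the index and not to every other wordnet.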

  36. Equivalent Near Synonym
      • 1. Multiple Targets (1:many)
        • Dutch wordnet: schoonmaken (to clean) matches with 4 senses of clean in WordNet1.5:
          • make clean by removing dirt, filth, or unwanted substances from
          • remove unwanted substances from, such as feathers or pits, as of chickens or fruit
          • remove in making clean; "Clean the spots off the rug"
          • remove unwanted substances from - (as in chemistry)
      • 2. Multiple Sources (many:1)
        • Dutch wordnet: versiersel near_synonym versiering; ILI-record: decoration
      • 3. Multiple Targets and Sources (many:many)
        • Dutch wordnet: toestel near_synonym apparaat; ILI-records: machine; device; apparatus; tool

  37. Equivalent Hyperonymy
      Typically used for gaps in English WordNet:
      • genuine, cultural gaps for things not known in English culture:
        • Dutch: klunen, to walk on skates over land from one frozen water to the other
      • pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English:
        • Dutch: kunststof = artifact substance <=> artifact object

  38. EuroWordNet statistics

  39. Wordnets as semantic structures
      • Wordnets are unique language-specific structures:
        • same organizational principles: synset structure and same set of semantic relations
        • different lexicalizations
        • differences in synonymy and homonymy:
          • "decoration" in English versus "versiersel/versiering" in Dutch
          • "bank" in English (money/river) versus "bank" in Dutch (money/furniture)
      • BUT also different relations for similar synsets

  40. Autonomous & Language-Specific
      [Diagram: two hierarchies side by side]
      • Wordnet1.5: object, with artifact, artefact (a man-made object) and natural object (an object occurring naturally); further nodes: block, instrumentality, body, box, spoon, bag, device, implement, container, tool, instrument
      • Dutch Wordnet: voorwerp {object}, with blok {block}, lichaam {body}, werktuig {tool}, bak {box}, lepel {spoon}, tas {bag}

  41. Linguistic versus Artificial Ontologies
      • Artificial ontology:
        • better control or performance, or a more compact and coherent structure
        • introduces artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool)
        • neglects levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise)
      • What properties can we infer for spoons?
        • spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking

  42. Linguistic versus Artificial Ontologies
      • Linguistic ontology:
        • Exactly reflects the relations between all the lexicalized words and expressions in a language
        • Captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language
      • What words can be used to name spoons?
        • spoon -> object, tableware, silverware, merchandise, cutlery

  43. Wordnets versus ontologies
      • Wordnets:
        • autonomous language-specific lexicalization patterns in a relational network
        • Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation
      • Ontologies:
        • data structure with formally defined concepts
        • Usage: making semantic inferences

  44. From EuroWordNet to Global WordNet
      • EuroWordNet ended in 1999
      • Global Wordnet Association was founded in 2000 to maintain the framework: http://www.globalwordnet.org
      • Currently, wordnets exist for more than 50 languages, including:
        • Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu...
      • Many languages are genetically and typologically unrelated

  45. Some downsides of the EuroWordNet model
      • Construction is not done uniformly
      • Coverage differs
      • Not all wordnets can communicate with one another, because they are linked to different versions of the English wordnet
      • Proprietary rights restrict free access and usage
      • A lot of semantics is duplicated
      • Complex and obscure equivalence relations due to linguistic differences between English and other languages

  46. Next step: Global WordNet Grid
      [Diagram: the word lists of eight languages, each linked to a shared Inter-Lingual Ontology containing Object, Device and TransportDevice]
      • German Words: Fahrzeug, Auto, Zug
      • English Words: vehicle, car, train
      • Spanish Words: vehículo, auto, tren
      • Italian Words: veicolo, auto, treno
      • Dutch Words: voertuig, auto, trein
      • Estonian Words: liiklusvahend, auto, killavoor
      • French Words: véhicule, voiture, train
      • Czech Words: dopravní prostředník, auto, vlak

  47. GWNG: Main Features
      • Construct separate wordnets for each Grid language
      • Contributors from each language encode the same core set of concepts plus culture/language-specific ones
      • Synsets (concepts) are mapped crosslinguistically via an ontology instead of just the English Wordnet

  48. The Ontology: Main Features
      • List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations
      • Ontology contains only upper and mid-level concepts
      • Concepts are related in a type hierarchy
      • Concepts are defined with axioms

  49. The Ontology: Main Features
      • Minimal set of concepts (reductionist view):
        • to express equivalence across languages
        • to support inferencing
      • Ontology need not and cannot provide a concept for all concepts found in the Grid languages
        • Lexicalization in a language is not sufficient to warrant inclusion in the ontology
        • Lexicalization in all or many languages may be sufficient
      • Ontological observations will be used to define the concepts in the ontology
      • Ontological framework still must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages
      • Additional lexicalized concepts are related to the ontology through complex relations

  50. Ontological observations
      • Identity criteria as used in OntoClean (Guarino & Welty 2002):
        • rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while.
        • essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of.
        • unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.
