WordNet, EuroWordNet, Balkanet
Karel Pala, Faculty of Informatics MU, pala@fi.muni.cz
Overview
• Starting points
• What is WordNet?
• EuroWordNet 1, 2
• Balkanet
• Czech WordNet
• Tools
Starting points I
• G. A. Miller, founder of psycholexicology
• Model of human lexical memory, associations
• Hierarchical organization of nouns in human memory
• A canary can sing. vs. A canary can fly. vs. A canary has skin.
• Canary: can sing, verified in time t1 (answer: true)
• Bird: can fly, verified in time t2 (answer: true)
• Animal: has skin, verified in time t3, where t1 < t2 < t3
• Generic information is not stored redundantly: it is stored once at the most general node and inherited by the nodes below it
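The canary example can be sketched as a lookup that climbs the hierarchy. This is only a toy illustration: the dictionaries and the lookup function are hypothetical names, and the number of levels climbed stands in for the reaction times t1 < t2 < t3.

```python
# Toy hierarchy: each concept stores only its own properties
# and a link to its hypernym (hypothetical data, for illustration only).
HYPERNYM = {"canary": "bird", "bird": "animal", "animal": None}
PROPERTIES = {
    "canary": {"can sing"},
    "bird": {"can fly"},
    "animal": {"has skin"},
}

def lookup(concept, prop):
    """Return (found, levels_climbed) for a property, inheriting from hypernyms."""
    levels = 0
    node = concept
    while node is not None:
        if prop in PROPERTIES.get(node, set()):
            return True, levels
        node = HYPERNYM.get(node)  # move one level up the hierarchy
        levels += 1
    return False, levels

for prop in ("can sing", "can fly", "has skin"):
    found, levels = lookup("canary", prop)
    print(f"canary / {prop}: found={found}, levels climbed={levels}")
```

The more generic the property, the more levels must be climbed before it is found, which mirrors the ordering of the verification times on the slide.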
Starting points II
• Humans easily process anaphoric expressions
• He has a rifle, but this weapon has never been used.
• Alphabetic vs. hierarchical ordering of entries in dictionaries, capturing semantic relations
• Genus proximum (hypero/hyponyms), siblings (tree vs. pine, oak, beech, fir, spruce, lime tree)
• Machine-readable dictionaries: problems with data organization; alphabetic ordering separates pieces of information that naturally belong together: dog, coyote, hyena
• Lexical databases (WordNet) and thesauri (Roget) vs. classical or standard dictionaries
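The difference between the two orderings can be shown on a handful of words from the slide. The hypernym labels below ("carnivore", "tree") are assumptions chosen for the illustration and are not taken from WordNet.

```python
from collections import defaultdict

# Toy word -> hypernym mapping (hypothetical labels, for illustration only).
HYPERNYM_OF = {
    "dog": "carnivore", "coyote": "carnivore", "hyena": "carnivore",
    "pine": "tree", "oak": "tree", "beech": "tree", "fir": "tree",
}

# Alphabetic ordering interleaves the animals with the trees.
alphabetic = sorted(HYPERNYM_OF)

# Hierarchical ordering keeps siblings under their common hypernym.
grouped = defaultdict(list)
for word in alphabetic:
    grouped[HYPERNYM_OF[word]].append(word)

print(alphabetic)
print(dict(grouped))
```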
Princeton WordNet 1.5-2.0
• A network of words, with English as the first language: WN v. 1.5, 1.7, 2.0, approx. 100 thousand synsets
• Nouns (60 thousand), verbs (11 thousand), adjectives, adverbs, function expressions (synsemantic expressions)
• Synsets: [(list of synonyms), (POS), (gloss), (semantic relations), ID], e.g. {driver:1 (n), the operator of a motor vehicle, H/H, ID: ENG20-09277009-n}
• Semantic relations between synsets: synonymy, hypero/hyponymy (H/H), antonymy, holo/meronymy (a lexical system with inheritance), forming a large network
• 25 top hyperonyms; later the Top Ontology (63 concepts) and the Base Concepts (BCs: 1053)
• Nodes in H/H trees can be understood as semantic features; up to 13 levels for nouns, about 6 for verbs
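The synset record described above can be sketched as a small data structure. The class and function names are hypothetical, the reading of the ID as prefix, offset and POS is an assumption based only on its visible shape, and the two parent synsets with TOY- identifiers are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    synset_id: str
    pos: str
    literals: list            # (word, sense number) pairs
    gloss: str
    relations: dict = field(default_factory=dict)  # relation name -> target IDs

def parse_id(synset_id):
    """Split an ID such as ENG20-09277009-n into its three visible parts."""
    prefix, offset, pos = synset_id.split("-")
    return {"prefix": prefix, "offset": offset, "pos": pos}

def hypernym_chain(synset_id, index):
    """Follow the first hypernym link upwards until a top node is reached."""
    chain = []
    current = index[synset_id]
    while current.relations.get("hypernym"):
        parent_id = current.relations["hypernym"][0]
        chain.append(parent_id)
        current = index[parent_id]
    return chain

# The driver synset is from the slide; both parents are invented toy nodes.
driver = Synset("ENG20-09277009-n", "n", [("driver", 1)],
                "the operator of a motor vehicle",
                {"hypernym": ["TOY-00000002-n"]})
operator = Synset("TOY-00000002-n", "n", [("operator", 1)],
                  "toy parent node", {"hypernym": ["TOY-00000001-n"]})
person = Synset("TOY-00000001-n", "n", [("person", 1)], "toy top node")

index = {s.synset_id: s for s in (driver, operator, person)}
print(parse_id(driver.synset_id))
print(hypernym_chain(driver.synset_id, index))
```

The length of the hypernym chain corresponds to the depth of a node in the H/H tree.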
Princeton WordNet II
• PWN has been developed by G. A. Miller and his group at Princeton
• It is free, exists for all platforms, and can be downloaded from clarity.princeton.edu
• A simple browser that can export selected data for further processing can be downloaded as well
• Standard database format; it is now possible to convert it into XML format and use VisDic (see the web pages of FI MU)
• PWN as such is not based on any corpus data; sense discrimination was done introspectively, which affects it negatively
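A conversion of one synset into XML might look roughly as follows. This is a sketch only: the element and attribute names (SYNSET, ID, POS, SYNONYM, LITERAL, DEF, ILR) are assumptions made for the illustration and should not be read as the exact VisDic schema, and the hypernym target is an invented placeholder ID.

```python
import xml.etree.ElementTree as ET

def synset_to_xml(synset_id, pos, literals, gloss, relations):
    """Build an XML element for one synset (element names are illustrative)."""
    root = ET.Element("SYNSET")
    ET.SubElement(root, "ID").text = synset_id
    ET.SubElement(root, "POS").text = pos
    synonym = ET.SubElement(root, "SYNONYM")
    for word, sense in literals:
        literal = ET.SubElement(synonym, "LITERAL", {"sense": str(sense)})
        literal.text = word
    ET.SubElement(root, "DEF").text = gloss
    for rel_type, target in relations:
        relation = ET.SubElement(root, "ILR", {"type": rel_type})
        relation.text = target
    return root

element = synset_to_xml(
    "ENG20-09277009-n", "n", [("driver", 1)],
    "the operator of a motor vehicle",
    [("hypernym", "TOY-00000001-n")],  # placeholder target, not a real ID
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)

# Round trip: the exported text can be parsed back for further processing.
parsed = ET.fromstring(xml_text)
```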
EuroWordNet 1 and 2
• New features in comparison with PWN:
  - multilinguality: 8 languages (English, Italian, Dutch, Spanish, German, French, Estonian, Czech), linked through the Interlingual Index (ILI)
  - Top Ontology (TO, 63 beginners), the set of Base Concepts (1053)
  - internal language relations (ILR), semantic roles Ag, Pat, Instr, Loc, …, attached to synsets
  - browser and editor: Polaris 1.5 (licensed); free: browser Periscope, ELDA/ELRA CD
  - example: Top Ontology scheme
EuroWordNet II
• Building the individual WordNets using the set of Base Concepts (BCs)
• Translation equivalents and lexical gaps: Interlingual Index (ILI)
• Problems with too fine-grained sense discrimination in PWN:
  - e.g. the verb to get has about 35 senses in PWN
  - in NODE (New Oxford Dictionary of English) only 8 basic senses (+ subsenses)
• Problems with typological differences between languages: verb aspect, diminutives, prefixation, virtual (empty) nodes
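The role of the Interlingual Index can be illustrated with a toy table in which each ILI record points to the synsets of the individual languages. The record keys, the language codes and the second entry are hypothetical; a missing language entry stands for a lexical gap.

```python
# Toy Interlingual Index: ILI record -> language -> literals of the linked synset.
# Keys and the second record are invented for illustration.
ILI = {
    "ili-driver": {"en": ["driver"], "cs": ["řidič"], "de": ["Fahrer"]},
    "ili-toy-gap": {"en": ["example_word"]},  # no Czech or German synset linked
}

def translate(word, source, target):
    """Return (ILI record, target literals) pairs; None marks a lexical gap."""
    results = []
    for record, by_language in ILI.items():
        if word in by_language.get(source, []):
            results.append((record, by_language.get(target)))
    return results

print(translate("driver", "en", "cs"))
print(translate("example_word", "en", "cs"))
```

Because every language links to the index rather than to every other language, adding a new WordNet requires only one set of links.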
Balkanet
• Continuation of EWN, 2001-04; 5 more languages are being added: Greek, Turkish, Romanian, Bulgarian, Serbo-Croatian (Czech continues)
• New features in comparison with EWN: a larger set of BCs (up to 8000 synsets), more stress on capturing the differences between languages
• New data representation using XML format, a serious approach to standardization
• New tool: the editor and browser VisDic (by FI MU)
• More attention has been paid to data validation; in particular, the multilingual corpus of Orwell's 1984 has been used for this purpose
Czech WordNet
• Synsets so far do not contain Czech glosses (definitions); the English ones are used for now and Czech glosses will be added
• Verb synsets are supplemented with verb frames that are associated with the individual senses
• Verb frames contain the surface (case) valencies and also deep valencies with general semantic roles such as AG, PAT, ADR, together with selectional constraints that use literals taken directly from PWN 2.0, for example: [{obléci, obléct, obléknout} kdo1*AG(person:1)=co4*ART(garment:1)] (Czech for "to put on (clothes)"; kdo1 = "who" in the nominative, co4 = "what" in the accusative)
• There is an attempt to exploit the verb semantic classes introduced by Levin (1993); for each verb the number of its semantic class is given
• At the moment it includes approx. 3500 verbs (Czech, English)
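The frame notation in the example above can be taken apart mechanically. The following is a sketch under assumptions: the regular expression is derived only from the single frame shown on the slide, the "=" sign is treated simply as a separator between slots, and the function name is hypothetical.

```python
import re

# One slot looks like kdo1*AG(person:1):
# surface word + case number, "*", semantic role, (PWN literal:sense number).
SLOT = re.compile(
    r"(?P<surface>\w+?)(?P<case>\d)\*(?P<role>[A-Z]+)"
    r"\((?P<literal>[^:()]+):(?P<sense>\d+)\)"
)

def parse_frame(frame):
    """Parse a frame string into a list of slot dictionaries."""
    slots = []
    for part in frame.split("="):
        match = SLOT.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unrecognized slot: {part!r}")
        slot = match.groupdict()
        slot["case"] = int(slot["case"])    # 1 = nominative, 4 = accusative
        slot["sense"] = int(slot["sense"])
        slots.append(slot)
    return slots

slots = parse_frame("kdo1*AG(person:1)=co4*ART(garment:1)")
for slot in slots:
    print(slot)
```

Each slot then pairs a surface valency (pronoun and case) with a deep valency (semantic role) and a selectional constraint (a PWN literal with its sense number).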
Tools
• VisDic: a local tool with journaling, XML format; the basic unit is the synset, which consists of literals
  - main functions: browsing, editing, export, projection
• New tool DEB: client/server architecture, XML formats
  - the basic unit is the literal, capturing relations between literals; integration with the morphological analyzer Ajka, the corpus manager Bonito, and other modules or resources
• Morphological module (for Czech): Ajka
• Interface SAFT: integrates Czech WordNet with the morphological analyzer Ajka, making it possible to process free (corpus) text
• Work is in progress on the integration with the partial parser DIS/VADIS; it will be possible to exploit lexical information and semantic features from WordNet during syntactic analysis
Applications of WNs
• Machine translation: WNs can be used as a new type of dictionary thanks to synsets (synonymy relation and H/H relation)
• IE (information extraction): makes it possible to follow semantic relations in text and to exploit multilinguality
• Useful with web browsers: queries can be extended with synonyms and H/H relations; experiments show an improvement from approx. 13 % without WN to approx. 60 % after query extension (experiments for English only)
• Word sense discrimination: as a data resource for sense recognition
• Knowledge representation, inference relying on word meanings, relations to the Semantic Web
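Query extension with synonyms and H/H relations can be sketched as follows. The two small dictionaries are toy data standing in for a real WordNet, and the function name is hypothetical; the sketch does not reproduce the experiments mentioned above.

```python
# Toy lexical data standing in for WordNet synsets and hyponym links.
SYNONYMS = {"car": ["automobile"], "weapon": ["arm"]}
HYPONYMS = {"weapon": ["rifle"], "tree": ["pine", "oak", "beech"]}

def expand_query(terms, use_hyponyms=True):
    """Add synonyms (and optionally hyponyms) to each query term, without duplicates."""
    expanded = []
    for term in terms:
        candidates = [term] + SYNONYMS.get(term, [])
        if use_hyponyms:
            candidates += HYPONYMS.get(term, [])
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["weapon", "tree"]))
print(expand_query(["weapon"], use_hyponyms=False))
```

A query for weapon would then also retrieve documents that mention only a rifle.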
Literature
• G. A. Miller et al., Five Papers on WordNet, 1993, revised version, clarity.princeton.edu
• EuroWordNet, final report, CD-ROM with data, 1999, EWN web pages, distributed by ELDA/ELRA (Paris)
• P. Vossen et al., EuroWordNet, book published by Kluwer
• Web pages of the Global WordNet Association (GWA; P. Vossen, Ch. Fellbaum)
• Web pages of the Balkanet Project, final report, 2004
• Web pages of the Second Global WordNet Conference, Brno, 20-23 January 2004
• Web pages of the NLP Lab, FI MU in Brno, VisDic page
Top hyperonyms in WordNet 1.5
• act, action, activity (činnost, aktivita)
• animal, fauna (zvíře, fauna)
• artefact (výtvor, výrobek)
• attribute, property (atribut, vlastnost)
• body, corpus (tělo, těleso)
• cognition, knowledge (znalost, poznání)
• communication (komunikace, sdělování)
• event, happening (událost)
• feeling, emotion (pocit, emoce)
• food (potrava, jídlo)
• group, collection (skupina, soubor)
• location, place (umístění, místo)
• motive (motiv)
• natural object (fyzický objekt)
• natural phenomenon (přírodní jev)
• person, human being (osoba, lidská bytost)
• plant, flora (rostlina, flora)
• possession (vlastnictví)
• process (proces)
• quantity, amount (kvantita, množství)
• relation (vztah)
• shape (podoba, tvar)
• state, condition (stav)
• substance (substance, látka)
• time (čas)