Building Methodology © Arabic WordNet
Methodologies developed in a number of projects
• EuroWordNet:
  • English, Dutch, German, French, Spanish, Italian, Czech, Estonian
  • 10,000 up to 50,000 synsets
• BalkaNet:
  • Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
  • 10,000 synsets
Main strategies for building wordnets
• Expand approach: translate WordNet synsets to another language and take over the structure
  • easier and more efficient method
  • compatible structure with WordNet
  • vocabulary and structure are close to WordNet but also biased by it
• Merge approach: create an independent wordnet in another language and align it with WordNet by generating the appropriate translations
  • more complex and labor-intensive
  • different structure from WordNet
  • language-specific patterns can be maintained
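The expand approach can be illustrated with a minimal sketch. It assumes a toy source wordnet keyed by synset ID and a bilingual lexicon given as a plain dictionary; the `Synset` class, the `expand` function, and the `toy-0001-v` entry are hypothetical names made up for this illustration, not part of any wordnet toolkit. Only the pair teach / darrasa and the ID 1045678-v come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    synset_id: str
    lemmas: list
    hypernyms: list = field(default_factory=list)


def expand(source_wordnet, lexicon):
    """Translate each source synset; keep its ID and hypernym links unchanged."""
    target = {}
    for synset_id, synset in source_wordnet.items():
        translations = []
        for lemma in synset.lemmas:
            for candidate in lexicon.get(lemma, []):
                if candidate not in translations:
                    translations.append(candidate)
        if translations:  # synsets with no translation are left for manual work
            target[synset_id] = Synset(synset_id, translations, list(synset.hypernyms))
    return target


# Toy data: the second synset is a hypothetical placeholder hypernym.
english = {
    "1045678-v": Synset("1045678-v", ["teach"], ["toy-0001-v"]),
    "toy-0001-v": Synset("toy-0001-v", ["toy_hypernym"]),
}
lexicon = {"teach": ["darrasa"]}  # assumed English-Arabic lexicon entry

arabic = expand(english, lexicon)
print(arabic["1045678-v"])
```

Because the target synset keeps the source ID and hypernym links, the structure stays compatible with WordNet, which is also where the bias noted above comes from. A merge approach would instead build the target hierarchy independently and add the alignment afterwards.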
General criteria for choosing an approach:
• The purpose of the resource: machine translation, cross-lingual information retrieval, deep semantic analysis, domain applications
• Available resources for the specific language
• Properties of the language
• Maximize the overlap with wordnets for other languages
• Maximize semantic consistency within and across wordnets
• Focus the manual effort where it is most needed
• Exploit automatic techniques as much as possible
Top-down methodology
• Develop a core wordnet (5,000 synsets):
  • all the semantic building blocks or foundation to define the relations for all other more specific synsets, e.g. building -> house, church, school
  • provide a formal and explicit semantics
• Validate the core wordnet:
  • does it include the most frequent words?
  • are semantic constraints violated?
• Extend the core wordnet (5,000 synsets or more):
  • automatic techniques for more specific concepts with high-confidence results
  • add other levels of hyponymy
  • add specific domains
  • add ‘easy’ derivational words
  • add ‘easy’ translation equivalences
• Validate the complete wordnet
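The two validation questions for the core wordnet can be sketched as simple checks. This is only an illustration under stated assumptions: the word frequencies are invented toy counts, the function names are hypothetical, and "no cycles in the hypernym hierarchy" is just one example of a semantic constraint, chosen here because it is easy to test.

```python
def missing_frequent_words(core_lemmas, frequency, top_n):
    """Return the top_n most frequent words that the core wordnet does not cover."""
    ranked = sorted(frequency, key=lambda word: (-frequency[word], word))
    return [word for word in ranked[:top_n] if word not in core_lemmas]


def has_hypernym_cycle(hypernyms):
    """Detect a cycle in a mapping from each synset to its hypernym synsets."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node that is still on the current path
        visiting.add(node)
        found = any(visit(parent) for parent in hypernyms.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in hypernyms)


# Toy data based on the example building -> house, church, school.
core_lemmas = {"building", "house", "church"}
frequency = {"house": 50, "school": 40, "chapel": 5}  # invented counts
hierarchy = {"house": ["building"], "church": ["building"],
             "school": ["building"], "building": []}
broken = {"house": ["building"], "building": ["house"]}

print(missing_frequent_words(core_lemmas, frequency, top_n=2))
print(has_hypernym_cycle(hierarchy), has_hypernym_cycle(broken))
```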
Developing a core wordnet
• Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  • high position in the hierarchy
  • high degree of connectivity
  • represented as English WordNet synsets
• Common base concepts: shared by various wordnets in different languages
• Local base concepts: not shared
• EuroWordNet: 1024 synsets, shared by 2 or more languages
• BalkaNet: 5000 synsets (including the 1024)
• Common semantic framework for all Base Concepts, in the form of a Top-Ontology
• Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (applied for 13 wordnets)
• Manually build and verify the hypernym relations for the Base Concepts
• All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
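The split into common and local base concepts can be sketched as a counting step. The sketch assumes each language team submits its own selection of base concepts as a set of English WordNet synset identifiers; the identifiers, language codes, and function names below are toy placeholders, and the threshold of 2 follows the "shared by 2 or more languages" criterion.

```python
from collections import Counter


def common_base_concepts(selections, min_languages=2):
    """Concepts selected by at least min_languages of the language teams."""
    counts = Counter(
        synset_id
        for chosen in selections.values()
        for synset_id in set(chosen)
    )
    return {synset_id for synset_id, n in counts.items() if n >= min_languages}


def local_base_concepts(selections, language, min_languages=2):
    """Concepts a language selected that did not become common base concepts."""
    return set(selections[language]) - common_base_concepts(selections, min_languages)


# Toy selections; real ones would be WordNet synset IDs.
selections = {
    "nl": {"building", "teach", "gin"},
    "es": {"building", "teach"},
    "et": {"building", "fish"},
}

print(sorted(common_base_concepts(selections)))
print(sorted(local_base_concepts(selections, "nl")))
```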
Top-down methodology
[Diagram. Labels: Top-Ontology (63 TCs); 1024 CBCs; Inter-Lingual-Index; remaining WordNet 1.5 synsets; and, for each of two wordnets: hyperonyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]
Global Wordnet Association
http://www.globalwordnet.org
• EuroWordNet: English, German, Spanish, French, Italian, Dutch, Czech, Estonian
• BalkaNet: Romanian, Bulgarian, Turkish, Slovenian, Greek, Serbian
• Further languages: Arabic, Polish, Welsh, Chinese, 20 Indian Languages, Brazilian Portuguese, Hebrew, Latvian, Persian, Kurdish, Avestan, Baluchi, Hungarian, Danish, Swedish, Portuguese, Korean, Russian, Basque, Catalan, Thai
Top-down methodology
[Diagram. Labels: core wordnet (5000 synsets); 1000 synsets; EuroWordNet and BalkaNet Base Concepts (CBC, SBC, ABC); hypernyms; SUMO ontology; Arabic word frequency; English-Arabic lexicon (teach - darrasa); WordNet synset 1045678-v {teach} in the English Wordnet paired with 1045678-v {darrasa} in the Arabic Wordnet; next-level hyponyms; Arabic roots and derivation rules; more hyponyms; WordNet Domains (domain “chemics”); named entities; easy translations.]
Advantages of the approach
• Well-defined semantics that can be inherited down to more specific concepts
• Apply consistency checks
• Automatic techniques can use semantic basis
• Most frequent concepts and words are covered
• High overlap and compatibility with other wordnets
• Manual effort is focussed on the most difficult concepts and words
Overview of equivalence relations to the ILI
Relation            POS    Sources : Targets    Example
eq_synonym          same   1 : 1                auto : voiture car
eq_near_synonym     any    many : many          apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same   many : 1 (usually)   citroenjenever : gin
eq_hyponym          same   (usually) 1 : many   dedo : toe, finger
eq_metonymy         same   many/1 : 1           universiteit, universiteitsgebouw : university
eq_diathesis        same   many/1 : 1           raken (cause), raken : hit
eq_generalization   same   many/1 : 1           schoonmaken : clean
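The POS column of this overview can be turned into a small consistency check. The sketch below is an assumption about how such links might be stored: the `EquivalenceLink` record, the `check_link` function, and the one-letter POS tags are hypothetical, and only the relation names and their "same" or "any" POS requirement come from the overview.

```python
from dataclasses import dataclass

# Relations that require source and target to share a part of speech.
SAME_POS_RELATIONS = {
    "eq_synonym", "eq_hyperonym", "eq_hyponym",
    "eq_metonymy", "eq_diathesis", "eq_generalization",
}
ALL_RELATIONS = SAME_POS_RELATIONS | {"eq_near_synonym"}  # any POS allowed


@dataclass(frozen=True)
class EquivalenceLink:
    relation: str
    source: str      # local word or synset
    source_pos: str  # assumed tag such as "n" or "v"
    target: str      # ILI record
    target_pos: str


def check_link(link):
    """Return a list of problems found for one equivalence link."""
    problems = []
    if link.relation not in ALL_RELATIONS:
        problems.append("unknown relation")
    elif link.relation in SAME_POS_RELATIONS and link.source_pos != link.target_pos:
        problems.append("POS mismatch")
    return problems


good = EquivalenceLink("eq_hyperonym", "citroenjenever", "n", "gin", "n")
bad = EquivalenceLink("eq_synonym", "auto", "n", "car", "v")  # deliberately wrong POS
print(check_link(good), check_link(bad))
```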
Filling gaps in the ILI
Types of gaps:
• genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin
  • Non-productive
  • Non-compositional
• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g. container, borrower, cajera (female cashier)
  • Productive
  • Compositional
• Universality of gaps: concepts occurring in at least 2 languages
Productive and predictable lexicalizations exhaustively linked to the ILI
[Diagram: local synsets linked to several ILI records at once.]
• {doodslaan V} NL, {totschlagen V} DE: hypernym beat, plus a link to kill
• {doodstampen V} NL, {tottrampeln V} DE: hypernym stamp, plus a link to kill
• {doodschoppen V} NL: hypernym kick, plus a link to kill
• {cajera N} ES, {casière} NL: hypernym cashier, in_state female
• {alevín N} ES: hypernym fish, in_state young
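A compositional link of this kind can be sketched as a set of (relation, ILI record) pairs per word. The sketch also applies the "at least 2 languages" criterion for universal gaps by treating words with an identical link set as the same concept, which is an assumption made for this illustration; the function name and the data layout are hypothetical.

```python
from collections import defaultdict

# Each entry: (language, word, set of (relation, ILI record) links).
entries = [
    ("es", "cajera", {("hypernym", "cashier"), ("in_state", "female")}),
    ("nl", "casière", {("hypernym", "cashier"), ("in_state", "female")}),
    ("es", "alevín", {("hypernym", "fish"), ("in_state", "young")}),
]


def shared_gaps(entries, min_languages=2):
    """Link combinations that are lexicalized in at least min_languages languages."""
    languages_by_links = defaultdict(set)
    for language, _word, links in entries:
        languages_by_links[frozenset(links)].add(language)
    return {
        links
        for links, languages in languages_by_links.items()
        if len(languages) >= min_languages
    }


for links in shared_gaps(entries):
    print(sorted(links))
```

With this toy data only the "female cashier" combination qualifies, because alevín appears in a single language.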
Top-down methodology
[Diagram. Labels: 1000 synsets; EuroWordNet and BalkaNet Base Concepts (SBC, CBC, ABC); hypernyms; SUMO ontology; Arabic word frequency; English-Arabic lexicon; 5000 synsets; next-level hyponyms; Arabic roots and derivation rules; more hyponyms; WordNet synsets; WordNet Domains (domain “chemics”); named entities; easy translations; Arabic Wordnet; English Wordnet.]