Bulgarian WordNet
Svetla Koeva
Institute for Bulgarian Language, Bulgarian Academy of Sciences
Bulgarian WordNet
• The Bulgarian WordNet (BulNet) has been under development for two years within the framework of the BalkaNet project.
• The BalkaNet project (Multilingual Semantic Network for the Balkan Languages) aims to develop a multilingual resource representing semantic relationships in five Balkan languages (Bulgarian, Greek, Serbian, Romanian and Turkish).
• Each set of synonymous words (synset) in a given language is linked to the closest synset in the Princeton WordNet 2.0 via its ID number.
Second Wordnet Conference, Brno
BulNet – DCMB team
• The Bulgarian partners are the Bulgarian Academy of Sciences and Plovdiv University.
• The Bulgarian WordNet is being developed by the Department of Computer Modeling of Bulgarian Language (DCMB) within the Institute for Bulgarian Language, Bulgarian Academy of Sciences. http://ibl.bas.bg/departments_en6.htm
• The DCMB BulNet team consists of a small group of researchers: linguists, computational linguists, logicians and mathematicians.
BulNet – current state
• The Bulgarian WordNet models nouns, verbs and adjectives, and already contains 17 291 word senses (synsets) as of 20.01.2003, comprising 31 164 literals (a ratio of about 1.8 literals per synset).
• Distribution of synsets by part of speech:
  • Nouns – 12 223 synsets
  • Verbs – 3 408 synsets
  • Adjectives – 1 656 synsets
  • Adverbs – 4 synsets
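The figures on this slide can be cross-checked with a few lines of arithmetic. The snippet below is only a sanity check on the numbers quoted above; the variable names are illustrative.

```python
# Synset counts per part of speech, copied from the slide.
synsets_by_pos = {
    "nouns": 12223,
    "verbs": 3408,
    "adjectives": 1656,
    "adverbs": 4,
}
literals = 31164  # total literals quoted on the slide

total_synsets = sum(synsets_by_pos.values())
ratio = literals / total_synsets

print(total_synsets)    # 17291
print(round(ratio, 1))  # 1.8
```

The per-category counts add up to the quoted total of 17 291, and the literal-to-synset ratio rounds to the quoted 1.8.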
Completeness
• All members of the Base Concept sets selected so far within the BalkaNet project are present:
  • Base Concepts 1 (1218 members)
  • BC2 (3471 members)
  • BC3 (4855 members)
• No "dangling relations"
• No "gaps"
• Every synset has an appropriate interpretative definition
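Two of the completeness criteria above (Base Concept coverage and the presence of a definition) reduce to simple set and string checks. The following is a minimal sketch, not the project's actual tooling: the function names, the dictionary layout and the example IDs are all assumptions made for illustration.

```python
def missing_base_concepts(base_concept_ids, wordnet_ids):
    """Return Base Concept IDs that have no synset in the wordnet."""
    return sorted(set(base_concept_ids) - set(wordnet_ids))


def synsets_without_definition(definitions):
    """Return IDs of synsets whose definition is missing or blank."""
    return sorted(sid for sid, text in definitions.items()
                  if not (text or "").strip())


# Toy data; the IDs are invented for illustration.
bc1 = ["ENG20-00000001-n", "ENG20-00000002-n", "ENG20-00000003-n"]
definitions = {
    "ENG20-00000001-n": "a living thing",
    "ENG20-00000002-n": "   ",  # blank definition
}

print(missing_base_concepts(bc1, definitions))  # ['ENG20-00000003-n']
print(synsets_without_definition(definitions))  # ['ENG20-00000002-n']
```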
Consistency
• There are no duplicated literals in a given synset.
• There are no identical or almost identical glosses of different synsets.
• There are no literals that coincide with their glosses.
• There are no duplicated relations between two synsets.
• Every difference in relations with respect to EuroWordNet (EWN) is language specific and linguistically grounded.
• There are no hypernym cycles or other relation loops inside BulNet.
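The "no hypernym cycles" criterion is a standard cycle check on a directed graph. Below is a minimal sketch using depth-first search. It assumes the hypernym relation is available as a dictionary from synset ID to a list of hypernym IDs; that layout and the function name are illustrative, not taken from BulNet.

```python
def find_hypernym_cycle(hypernyms):
    """Return one cycle as a list of IDs (first == last), or None."""
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in hypernyms.get(node, ()):
            if state.get(parent) == IN_PROGRESS:
                # parent is already on the current path: a cycle
                return path[path.index(parent):] + [parent]
            if parent not in state:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in hypernyms:
        if node not in state:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


ok = {"dog": ["canine"], "canine": ["animal"], "animal": []}
bad = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(find_hypernym_cycle(ok))   # None
print(find_hypernym_cycle(bad))  # ['a', 'b', 'c', 'a']
```

The recursion depth equals the length of the longest hypernym chain, so very long chains would call for an iterative variant.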
Main achievements
• Theoretical linguistic work
  • Validation tests
  • Dependencies between relations
  • Combination of Bulgarian language resources
  • Descriptive logic
• Design and development of tools
  • WordNet Explorer
  • WordNet Validator
Validation tests
• Our approach to the validation of WordNets comprises three separate levels:
  • Checking the syntax of the XML files
  • Checking the completeness of the WordNets
  • Checking the consistency of the semantic relations and glosses
• Each level has:
  • A different degree of complexity and significance
  • Different possibilities for automatic data correction
Validation tests
• The lowest level, which is also the easiest to process and correct, is the syntax of the XML files.
• In the following cases both automatic checking and automatic data correction are possible:
  • Optional empty tags
  • Duplicated literals in a synset
  • Sense numbers
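Two of these automatic corrections (dropping empty optional tags and removing duplicated literals) can be sketched with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the tag names (`SYNSET`, `SYNONYM`, `LITERAL`, `SENSE`, `USAGE`, `SNOTE`, `DEF`), the set of tags treated as optional, and the rule that two literals are duplicates when their text matches are all illustrative, not confirmed by the slides.

```python
import xml.etree.ElementTree as ET

# Assumption: which tags may be dropped when empty.
OPTIONAL_TAGS = {"USAGE", "SNOTE"}


def clean_synset(synset):
    """Remove empty optional tags and duplicated literals, in place."""
    for child in list(synset):
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if child.tag in OPTIONAL_TAGS and is_empty:
            synset.remove(child)

    synonym = synset.find("SYNONYM")
    if synonym is not None:
        seen = set()
        for literal in list(synonym.findall("LITERAL")):
            word = (literal.text or "").strip()
            if word in seen:
                synonym.remove(literal)  # keep the first occurrence only
            else:
                seen.add(word)
    return synset


xml_text = (
    "<SYNSET><ID>X-1-n</ID><SYNONYM>"
    "<LITERAL>dog<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>dog<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>hound<SENSE>2</SENSE></LITERAL>"
    "</SYNONYM><USAGE></USAGE><DEF>a domestic animal</DEF></SYNSET>"
)
cleaned = clean_synset(ET.fromstring(xml_text))
print([lit.text for lit in cleaned.iter("LITERAL")])  # ['dog', 'hound']
print(cleaned.find("USAGE"))                          # None
```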
Validation tests
• In other cases automatic correction is possible, but the replacements must be confirmed manually:
  • Conformance to the accepted ID standard
  • Missing values of obligatory tags
  • Correspondence of BCS tags
  • At least one literal in a synset
Validation tests
• In some cases only validation is possible:
  • No duplicated <ID> numbers
  • No duplicated relations between two synsets
  • No “gaps”
  • No “dangling relations”
  • No loops
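Several of these validation-only checks amount to counting and membership tests. The sketch below covers duplicated IDs, duplicated relations and dangling relations. It assumes each synset is a dictionary with an `id` and a list of `(relation_type, target_id)` tuples, and it reads a "dangling relation" as one whose target ID is not present in the wordnet; both the layout and that reading are assumptions.

```python
from collections import Counter


def duplicated_ids(synsets):
    """IDs that occur on more than one synset."""
    counts = Counter(s["id"] for s in synsets)
    return sorted(i for i, n in counts.items() if n > 1)


def duplicated_relations(synsets):
    """(source, type, target) triples stated more than once."""
    found = []
    for s in synsets:
        counts = Counter(s["relations"])
        found.extend((s["id"], rel, target)
                     for (rel, target), n in counts.items() if n > 1)
    return found


def dangling_relations(synsets):
    """(source, type, target) triples whose target is not in the wordnet."""
    known = {s["id"] for s in synsets}
    return [(s["id"], rel, target)
            for s in synsets
            for rel, target in s["relations"]
            if target not in known]


# Toy wordnet; IDs are invented for illustration.
wn = [
    {"id": "n1", "relations": [("hypernym", "n2"), ("hypernym", "n2")]},
    {"id": "n2", "relations": [("hypernym", "n9")]},
    {"id": "n2", "relations": []},
]
print(duplicated_ids(wn))        # ['n2']
print(duplicated_relations(wn))  # [('n1', 'hypernym', 'n2')]
print(dangling_relations(wn))    # [('n2', 'hypernym', 'n9')]
```

These functions only report problems; in line with the slide, choosing the correct fix is left to a human.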
Relations’ dependencies
• Description of the dependencies between the relations:
  • Hyponyms of two antonyms (nouns) should also be antonyms (woman – man; female actor – actor).
  • Antonyms (nouns) should have equivalent holo_parts (woman – arm, head; man – arm, head).
  • A hyponym should have the same mero_parts (for concrete nouns) as its hypernym (man – head, arm, …; woman – head, arm, …).
  • Collective nouns that are holo/mero_members should share the same hypernym, not necessarily the immediate one (a football team is an organization, as is a football league).
  • Nouns that are holo/mero_portions should share the same hypernym, not necessarily the immediate one (coffee – substance; caffeine – substance).
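The "shared hypernym, not necessarily the immediate one" rules can be tested by comparing the transitive hypernym sets of the two synsets. The following is a minimal sketch over a toy hierarchy. The dictionary layout, the function names and the intermediate nodes (`team`, `league`, `association`, `beverage`) are invented for illustration.

```python
def ancestors(synset, hypernyms):
    """All direct and indirect hypernyms of a synset."""
    seen = set()
    stack = list(hypernyms.get(synset, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hypernyms.get(node, ()))
    return seen


def share_hypernym(a, b, hypernyms, ignore=frozenset()):
    """True if a and b have a common hypernym outside the ignore set."""
    common = ancestors(a, hypernyms) & ancestors(b, hypernyms)
    return bool(common - set(ignore))


# Toy hierarchy built around the slide's examples.
toy = {
    "football team": ["team"],
    "team": ["organization"],
    "football league": ["league"],
    "league": ["association"],
    "association": ["organization"],
    "coffee": ["beverage"],
    "beverage": ["substance"],
    "caffeine": ["substance"],
}
print(share_hypernym("football team", "football league", toy))  # True
print(share_hypernym("coffee", "caffeine", toy))                # True
print(share_hypernym("football team", "coffee", toy))           # False
```

The `ignore` parameter is a design choice: in a hierarchy with very general top nodes almost any two nouns share a hypernym, so those nodes may need to be excluded for the check to be informative.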
Combining language resources
• Three large Bulgarian resources:
  • BulNet
  • Bulgarian Syntax Dictionary, encoding the arguments of the verbs and their semantic features
  • Bulgarian Grammatical Dictionary, encoding over 83 000 lemmas and their corresponding word forms
• Mutual supplementation
• Expansion of the resources
• Validation of the resources
• Uniform grammatical characteristics
WordNet logic
• The DCMB team developed a uniform, efficient and powerful utility system for querying and exploring WordNet: WordNet logic.
  • Tailored to the needs of WordNet developers
  • Powerful enough to express complex statements and queries
  • Fully decidable
• The formal background consists of WordNet Structure, WN Language, WN Semantics, WN Logic and WN Logic theorems.
Tinko Tinchev, Stoyan Mihov, Svetla Koeva, Angel Genov: Logic for WordNet, Annual Journal of Sofia University, 2003
WordNet Validator
• The WordNet Validator (WNV) is a Web-based system for validating (and correcting) the completeness and consistency of WordNets.
• Its main functions are: automatic correction of XML syntax, validation of WordNet completeness and consistency, search for a given synset, and visualization of semantic trees.
• The WordNet Validator can be used in practical work on constructing monolingual WordNets of the Balkan languages, as well as for evaluating the completeness and consistency of other WordNets.
Future directions