Application of INTEX in refinement and validation of Serbian WordNet
Ivan Obradović, Ranka Stanković, Cvetana Krstev, Gordana Pavlović-Lažetić
University of Belgrade
WordNet (WN)
• a semantic network of concepts represented by synsets: sets of synonymous words (nouns, verbs, adjectives and adverbs)
• contains explicitly coded descriptions of semantic relations
• inspired by research in the field of psycholinguistics
• initially developed at Princeton for the English language
Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database, The MIT Press
Multilingual WordNets
• featuring the InterLingual Index (ILI)
• EuroWordNet (EWN): Dutch, Italian, Spanish, German, French, Czech and Estonian
• BalkaNet (BWN): five Balkan languages (Greek, Turkish, Bulgarian, Romanian and Serbian), as well as Czech
Vossen, P. (ed.) (1998) EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, Dordrecht
Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufis, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M. (2002) BALKANET: A Multilingual Semantic Network for Balkan Languages, 1st International Wordnet Conference, Mysore, India, January 2002 (http://www.ceid.upatras.gr/Balkanet/files/balkanet-elsnet-ko-accept.pdf)
The WN semantic network
• based on a grouping of synonyms into synsets, which represent the network nodes
• nodes are interconnected by arcs which describe particular semantic relations (hyperonymy, hyponymy, antonymy etc.)
• in general, every synset is accompanied by a definition (gloss) and examples of usage that specify the meaning of the concept represented by the synset
• the semantic network itself is an XML document with a precisely established set of entities
The Serbian version of WN
• developed starting from the base concepts of the English WN, using existing English/Serbian dictionaries in paper form
• synset elements are represented in the same form as entries in the DELAS or DELAC dictionaries, without any additional morphosyntactic information
• lexical meanings in Serbian are coded with reference to the dictionary of Matica Srpska
XML representation of a synset in Serbian WN (demonstrate, establish, prove, show)
<SYNSET>
  <ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL> dokazati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> dokazivati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> pokazati <SENSE> 3 </SENSE> </LITERAL>
    <LITERAL> pokazivati <SENSE> 3 </SENSE> </LITERAL>
  </SYNONYM>
  <DEF> Utvrditi valxanost necyega, primerom, objasxnxenxem ili eksperimentom. (Establish the validity of something by example, explanation or experiment)</DEF>
  <USAGE> Anketa je pokazala da u tako nesxto veruje mali broj ispitanih. (The poll showed that few people believe in this)</USAGE>
  <POS>v</POS>
  <ILR>ENG171-00529622-v <TYPE>hypernym</TYPE></ILR>
  <BCS>1</BCS>
  <STAMP>Dusko 2003/04/21</STAMP>
</SYNSET>
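As an illustration of how such a synset could be read programmatically, the following is a minimal Python sketch using the standard `xml.etree.ElementTree` module. It is not part of the tools described in this presentation; the function name `parse_synset` and the shape of the returned dictionary are assumptions made for this example, and the sample XML is an abridged copy of the synset above.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the synset shown above (DEF, USAGE, BCS and STAMP omitted).
SYNSET_XML = """<SYNSET><ID>ENG171-00528591-v</ID>
<SYNONYM>
<LITERAL> dokazati <SENSE> 1 </SENSE> </LITERAL>
<LITERAL> dokazivati <SENSE> 1 </SENSE> </LITERAL>
<LITERAL> pokazati <SENSE> 3 </SENSE> </LITERAL>
<LITERAL> pokazivati <SENSE> 3 </SENSE> </LITERAL>
</SYNONYM>
<POS>v</POS>
<ILR>ENG171-00529622-v <TYPE>hypernym</TYPE></ILR>
</SYNSET>"""


def parse_synset(xml_text):
    """Return the ID, POS, literals and relations of one SYNSET element.

    Hypothetical helper: the name and return format are assumptions.
    """
    root = ET.fromstring(xml_text)

    literals = []
    for lit in root.find("SYNONYM").findall("LITERAL"):
        # The literal itself is the text that precedes the SENSE child.
        word = (lit.text or "").strip()
        sense = int(lit.findtext("SENSE").strip())
        literals.append((word, sense))

    relations = []
    for ilr in root.findall("ILR"):
        # The target synset ID is the text that precedes the TYPE child.
        target = (ilr.text or "").strip()
        relations.append((target, ilr.findtext("TYPE").strip()))

    return {
        "id": root.findtext("ID").strip(),
        "pos": root.findtext("POS").strip(),
        "literals": literals,
        "relations": relations,
    }


synset = parse_synset(SYNSET_XML)
```

Each literal is returned together with its sense number, which is what distinguishes, for instance, sense 3 of pokazati from its other senses.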
Problems in Serbian WN that might be solved using INTEX
• lack of morphological and syntactic information related to lexemes
• absence of precise criteria for the selection of lexemes for a particular synset
• lack of information on the relative relevance of each lexeme in a synset in terms of its lexical frequency
Incorporation of morphosyntactic information into synsets using INTEX
The DictWNSrp program
• matches literals in WN with entries in selected DELAS dictionaries and extracts morphosyntactic information from the dictionaries
• assigns morphosyntactic information to WN literals in cases of a 1-1 match
• offers the user the option to confirm or alter the assigned information and to resolve cases of homography (i.e. multiple matches)
• transfers confirmed morphosyntactic information into the WN using the LNOTE element
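The matching step can be pictured with a short Python sketch. It is not the DictWNSrp implementation, only an illustration under stated assumptions: dictionary entries are taken to be lines of the form `lemma,CODE+feature+...`, the function names `load_delas` and `match_literals` are invented for this example, and the two codes given for the homograph kosa are placeholders, not real dictionary codes.

```python
from collections import defaultdict


def load_delas(lines):
    """Map each lemma to the list of codes found for it.

    Assumes one 'lemma,CODE+features' entry per line (an assumption
    about the input format made for this sketch).
    """
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or "," not in line:
            continue
        lemma, code = line.split(",", 1)
        entries[lemma].append(code)
    return entries


def match_literals(literals, delas):
    """Classify each WN literal as assigned, ambiguous or unmatched."""
    result = {}
    for literal in literals:
        codes = delas.get(literal, [])
        if len(codes) == 1:
            # 1-1 match: the code can be proposed automatically.
            result[literal] = ("assigned", codes)
        elif codes:
            # Homography: several entries, the user has to choose.
            result[literal] = ("ambiguous", codes)
        else:
            result[literal] = ("unmatched", [])
    return result


delas = load_delas([
    "dokazati,V122+Perf+Tr+Iref+Ref",
    "dokazivati,V18+Imperf+Tr+Iref",
    "kosa,PLACEHOLDER_A",  # placeholder code, not from a real dictionary
    "kosa,PLACEHOLDER_B",  # placeholder code, not from a real dictionary
])
matches = match_literals(["dokazati", "kosa", "bivstvo"], delas)
```

In this sketch only the "assigned" cases would be offered for direct confirmation, while the "ambiguous" ones correspond to the homography cases that the user resolves by hand.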
XML representation of a synset with assigned morphosyntactic information
<SYNONYM>
  <LITERAL>dokazati <SENSE>1</SENSE> <LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE></LITERAL>
  <LITERAL>dokazivati <SENSE>1</SENSE> <LNOTE>V18+Imperf+Tr+Iref</LNOTE></LITERAL>
  <LITERAL>pokazati <SENSE>3</SENSE> <LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE></LITERAL>
  <LITERAL>pokazivati <SENSE>3</SENSE> <LNOTE>V18+Imperf+Tr+Iref</LNOTE></LITERAL>
</SYNONYM>
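A minimal sketch of how confirmed codes might be written into LNOTE elements is given below, again in Python with `xml.etree.ElementTree`. The function `add_lnotes` and its `confirmed` mapping are hypothetical names introduced for this example; the actual program may store the information differently.

```python
import xml.etree.ElementTree as ET


def add_lnotes(synonym_xml, confirmed):
    """Attach an LNOTE child to every LITERAL with a confirmed code.

    `confirmed` maps a literal to its code, e.g.
    {"dokazati": "V122+Perf+Tr+Iref+Ref"} (hypothetical input format).
    """
    root = ET.fromstring(synonym_xml)
    for lit in root.findall("LITERAL"):
        word = (lit.text or "").strip()
        code = confirmed.get(word)
        if code is None:
            continue  # nothing confirmed for this literal
        lnote = lit.find("LNOTE")
        if lnote is None:
            # Appended as the last child, i.e. after SENSE.
            lnote = ET.SubElement(lit, "LNOTE")
        lnote.text = code
    return ET.tostring(root, encoding="unicode")


plain = (
    "<SYNONYM>"
    "<LITERAL>dokazati<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>pokazati<SENSE>3</SENSE></LITERAL>"
    "</SYNONYM>"
)
annotated = add_lnotes(plain, {"dokazati": "V122+Perf+Tr+Iref+Ref"})
```

Literals without a confirmed code are left untouched, so the function can be applied repeatedly as the user confirms further cases.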
Validation of lexemes from a synset on a corpus
Phase One: The IntexWN program
• selects and displays all synsets from WN for a given lexeme
• constructs INTEX graphs for all lexemes from the selected synsets
Phase Two: INTEX
• produces concordances from a chosen corpus for the graphs constructed by IntexWN
Phase Three: User
• checks the validity of synonymous relations of lexemes on the concordances
• decides on removing lexemes from the synset or adding new ones
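The concordancing idea of Phase Two can be approximated in a few lines of Python. This is only a stand-in and not how INTEX works: INTEX applies graphs and electronic dictionaries, so it can recognize all inflected forms of a lexeme, whereas this sketch matches a list of word forms supplied by hand with a regular expression. The function names and the `width` parameter are assumptions made for the example.

```python
import re


def build_pattern(forms):
    """Compile one pattern that matches any of the given word forms."""
    # Longer forms first, so that a short form never shadows a longer one.
    ordered = sorted(forms, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def concordance(text, forms, width=30):
    """Return (left context, match, right context) for every hit."""
    pattern = build_pattern(forms)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


# The usage example of the synset shown earlier serves as a one-line corpus;
# the inflected forms are listed by hand for this illustration.
corpus = "Anketa je pokazala da u tako nesxto veruje mali broj ispitanih."
rows = concordance(corpus, ["pokazala", "dokazala"], width=10)
```

Each returned row corresponds to one concordance line that the user would inspect in Phase Three.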
Constructing a graph for all lexemes from a synset with the IntexWN program
Validation results for synset ENG171-11771798 (being, beingness, existence)
Comments:
• the lexemes in the synset have been used to denote the given concept in 24% of concordances
• the lexeme most frequently used to denote the given concept is postojanxe
• although zxivot is the most frequent lexeme in the synset, it has been used to denote the given concept in only 10% of cases
• bivstvo does not occur in the corpus, and its exclusion from the synset could be considered if a similar result is obtained on a wider corpus
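Figures of this kind can be derived from the user's judgments on individual concordance lines. The Python sketch below shows one possible way to do so; the function name, the input format (pairs of lexeme and a yes/no judgment) and the toy judgments are all invented for illustration and do not reproduce the results reported above.

```python
from collections import Counter


def validation_summary(judgments, synset_lexemes):
    """Summarize per-line judgments for the lexemes of one synset.

    `judgments` is a list of (lexeme, denotes_concept) pairs, one per
    concordance line (hypothetical format).
    """
    total = Counter()
    relevant = Counter()
    for lexeme, denotes_concept in judgments:
        total[lexeme] += 1
        if denotes_concept:
            relevant[lexeme] += 1

    n_lines = sum(total.values())
    per_lexeme = {}
    for lexeme in synset_lexemes:
        count = total[lexeme]
        per_lexeme[lexeme] = {
            "occurrences": count,
            "relevant": relevant[lexeme],
            # Share of this lexeme's lines in which it denotes the concept.
            "share": relevant[lexeme] / count if count else None,
        }

    return {
        "overall_share": sum(relevant.values()) / n_lines if n_lines else 0.0,
        "per_lexeme": per_lexeme,
        "absent": [lex for lex in synset_lexemes if total[lex] == 0],
    }


# Invented toy judgments, NOT the data behind the slide.
toy = [
    ("zxivot", True), ("zxivot", False), ("zxivot", False), ("zxivot", False),
    ("postojanxe", True), ("postojanxe", True),
]
summary = validation_summary(toy, ["postojanxe", "zxivot", "bivstvo"])
```

The "absent" list flags lexemes, such as bivstvo in the slide above, that never occur in the corpus and are therefore candidates for closer inspection on a wider corpus.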
Further developments
• definition of more precise criteria for the validation of lexemes in a synset, based on their occurrence in corpora
• investigation of possibilities for introducing relevance information into synsets
• further development of the IntexWN program to include semantic relations, such as hyponymy/hyperonymy etc.
• introduction of near-synonym information into the Serbian WN using INTEX dictionaries (e.g. augmentatives/diminutives)
• investigation of possibilities for introducing multilingual features into INTEX using the WN (to be used for parallel corpora)