Using a Lemmatizer to Support the Development and Validation of the Greek WordNet • Harry Kornilakis¹, Maria Grigoriadou¹, Eleni Galiotou¹,², Evangelos Papakitsos¹ • ¹ Department of Informatics and Telecommunications, University of Athens, Greece • ² Department of Informatics, Technological Educational Institute of Athens, Athens, Greece • {harryk, gregor, egali}@di.uoa.gr, papakitsev@vip.gr
Greek Wordnet Development • Part of the BalkaNet Project • BalkaNet is a multilingual lexical database with semantic relations for each of the following languages: Bulgarian, Czech, Greek, Romanian, Serbian and Turkish. • The deployment of computational tools and resources has proven to be of major importance for the development of the monolingual Greek Wordnet (Galiotou et al.).
Use of Lemmatizer in Greek Wordnet Development • A lemmatizer for the Greek language has been used as the basis of a number of tools supporting the extraction and processing of linguistic information from dictionaries and corpora. • Most existing lemmatizers for Greek are either tools that support specific applications or parts of systems for full morphological processing that require a large number of lexical resources. In our case it was not possible to use resources such as morphological dictionaries or annotated corpora. • Our design goals • A lemmatizer useful for a number of different tools • Requires as few lexical resources as possible • Computationally efficient.
Modern Greek Language Overview (1/2) • The lemmatizer must take into account the peculiarities of the Greek language • Greek is a highly inflected language • Nouns decline for number and case. • Adjectives decline for number, case, gender and degree. Each adjective has about 70 distinct forms. • Verbs conjugate for voice, mood, tense, aspect, number and person. Each verb has about 60 distinct forms.
Modern Greek Language Overview (2/2) • Word Stress • Each word of two or more syllables has one stressed syllable, which is pronounced the loudest; in writing it is denoted by a stress mark (') over the nuclear vowel of that syllable. • Word stress in Greek is distinctive, e.g. νόμος ('nomos - law) is a different word from νομός (no'mos - administrative region). • Word stress is mobile, i.e. the stress may change its position within the inflectional paradigm of the same word. E.g. the word θάλασσα ('θalasa - sea) becomes θαλασσών (θala'son - of the seas) in the genitive plural.
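The two stress properties above can be made concrete with a short sketch. It is not part of the original tool; the function names `strip_stress` and `stress_position` are hypothetical, and it assumes the text is Unicode, where the Greek stress mark decomposes to the combining acute accent (U+0301) under NFD normalization.

```python
import unicodedata

TONOS = "\u0301"  # combining acute accent: the stress mark after NFD decomposition

def strip_stress(word):
    """Return the word with its stress mark removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(TONOS, ""))

def stress_position(word):
    """Return the index of the stressed letter, or None if the word is unstressed."""
    index = -1
    for ch in unicodedata.normalize("NFD", word):
        if ch == TONOS:
            return index          # the mark follows the letter it belongs to
        if not unicodedata.combining(ch):
            index += 1            # count base letters only
    return None

# Distinctive stress: same letters, different stress position, different word.
print(strip_stress("νόμος") == strip_stress("νομός"))
print(stress_position("νόμος"), stress_position("νομός"))

# Mobile stress: the position changes within one paradigm.
print(stress_position("θάλασσα"), stress_position("θαλασσών"))
```

A lemmatizer therefore cannot simply discard stress marks (that would merge νόμος and νομός), nor assume that the stress stays in place when an ending is replaced.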
Lemmatizer for the Greek Language • Given a word in Greek as input, the lemmatizer analyzes the word and finds its dictionary citation form. • Lexical Information Required by the Lemmatizer • A list of the citation forms of words. Our list was compiled from an electronic dictionary and automatically extended with some productive derivations (e.g. diminutives). It contains around 52,000 words. • A list containing information about how words are inflected in Greek. Each entry contains information about possible inflectional endings and about stress movement. • A list of irregular forms of words. So far this list has about 400 such words.
A Lemmatizer for the Greek Language • (Short) description of the algorithm of the Lemmatizer • First we try to find the input word in the list of citation forms. • Then we try to find the input word in the list of irregular forms. • Then we try to match the ending of the word against the inflectional endings in the list of inflectional information. If a matching ending is found, it is removed to obtain the stem of the word. The stem is then used to form a number of candidate citation forms of the input word. Finally, we search for these candidates in the list of citation forms, and each candidate that is found is considered a possible citation form of the input word.
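The three lookup steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three toy lists, the rule format `(inflected ending, citation ending, stress may move)`, and the names `lemmatize` and `stress_variants` are all hypothetical, and the sketch assumes the search stops at the first step that succeeds.

```python
import unicodedata

TONOS = "\u0301"      # combining acute accent (the stress mark after NFD)
VOWELS = "αεηιουω"

# Toy stand-ins for the three lexical resources (hypothetical data).
CITATION_FORMS = {"θάλασσα", "νόμος", "νομός", "γράφω"}
IRREGULAR_FORMS = {"είπα": "λέω"}
INFLECTION_RULES = [
    ("ών", "α", True),     # θαλασσών -> θάλασσα (stress moves)
    ("ες", "α", False),    # θάλασσες -> θάλασσα
    ("ου", "ος", False),   # νόμου -> νόμος
    ("ού", "ός", False),   # νομού -> νομός
    ("εις", "ω", False),   # γράφεις -> γράφω
]

def strip_stress(word):
    nfd = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", nfd.replace(TONOS, ""))

def stress_variants(word):
    """All ways of stressing one vowel of the word; over-generates on purpose,
    because the citation-form lookup filters out the invalid variants."""
    base = strip_stress(word)
    return [
        unicodedata.normalize("NFC", base[: i + 1] + TONOS + base[i + 1 :])
        for i, ch in enumerate(base)
        if ch in VOWELS
    ]

def lemmatize(word):
    """Return the list of possible citation forms of the input word."""
    word = unicodedata.normalize("NFC", word.lower())
    if word in CITATION_FORMS:            # step 1: already a citation form
        return [word]
    if word in IRREGULAR_FORMS:           # step 2: listed irregular form
        return [IRREGULAR_FORMS[word]]
    found = []                            # step 3: strip ending, rebuild, look up
    for ending, citation_ending, stress_moves in INFLECTION_RULES:
        if not word.endswith(ending) or len(word) == len(ending):
            continue
        candidate = word[: len(word) - len(ending)] + citation_ending
        candidates = stress_variants(candidate) if stress_moves else [candidate]
        for c in candidates:
            if c in CITATION_FORMS and c not in found:
                found.append(c)
    return found

for w in ["θαλασσών", "νόμου", "νομού", "γράφεις", "είπα"]:
    print(w, lemmatize(w))
```

Returning a list rather than a single word reflects the description above, where several candidates may survive the final lookup. Keeping the stress on the endings in the rules is one simple way to keep νόμου and νομού apart.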
Tools for Wordnet Development and Validation • Lemmatized Word-frequency Counter • Translator of Words from Greek to English • Part of Speech Tagger
Lemmatized Word-frequency Counter • This tool counts the occurrences of words in corpora, regardless of the inflected form in which they appear. • In Wordnet development, when determining base concepts, it is useful to know the frequency of words in corpora, so as to avoid choosing as base concepts words that are frequent in English but infrequent in Greek.
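A lemmatized counter of this kind might look like the sketch below. The `lemmatize` function here is a hypothetical dictionary-based stand-in for the real lemmatizer, and the sketch assumes each word form has exactly one citation form, which a real lemmatizer does not guarantee.

```python
import re
from collections import Counter

# Hypothetical stand-in for the lemmatizer: inflected form -> citation form.
TOY_LEMMAS = {
    "θάλασσες": "θάλασσα",
    "θαλασσών": "θάλασσα",
    "νόμου": "νόμος",
}

def lemmatize(word):
    # Unknown words are counted under their own surface form.
    return TOY_LEMMAS.get(word, word)

def lemma_frequencies(text):
    """Count occurrences per citation form rather than per surface form."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(lemmatize(token) for token in tokens)

counts = lemma_frequencies("Η θάλασσα και οι θάλασσες. Ο νόμος των θαλασσών.")
print(counts.most_common(2))
```

Without lemmatization, the three forms of θάλασσα in this sentence would be counted as three different words, each with frequency one.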
Translator of Words from Greek to English • Given a Greek word, this tool finds the English translation of that word based on a bilingual Greek-English dictionary. • Unlike English, Greek is a highly inflected language, so different forms of a word in Greek correspond to the same English word. • The tool first calls the lemmatizer to find the citation form of the word and then looks it up in a bilingual Greek to English dictionary to find its English translation. • In the framework of Wordnet development it is used to find the correspondence of words appearing in Greek corpora to their Inter-Lingual-Index (ILI) numbers or to directly find the equivalent in Princeton WordNet.
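The lemmatize-then-look-up pipeline can be sketched as follows. Both dictionaries are toy data, and `lemmatize` and `translate` are hypothetical names; the mapping to ILI numbers is left out because it depends on resources not shown here.

```python
# Hypothetical stand-in for the lemmatizer: returns possible citation forms.
TOY_LEMMAS = {
    "θαλασσών": ["θάλασσα"],
    "νόμου": ["νόμος"],
    "νομού": ["νομός"],
}

# Toy bilingual dictionary keyed by Greek citation form.
GREEK_TO_ENGLISH = {
    "θάλασσα": ["sea"],
    "νόμος": ["law"],
    "νομός": ["administrative region"],
}

def lemmatize(word):
    if word in GREEK_TO_ENGLISH:        # already a citation form
        return [word]
    return TOY_LEMMAS.get(word, [])

def translate(word):
    """Return the English translations of every possible citation form."""
    translations = []
    for lemma in lemmatize(word):
        for english in GREEK_TO_ENGLISH.get(lemma, []):
            if english not in translations:
                translations.append(english)
    return translations

print(translate("θαλασσών"))
print(translate("νόμου"), translate("νομού"))
```

Because the bilingual dictionary is keyed by citation form only, it needs one entry per lexeme rather than one per inflected form.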
Part of Speech Tagger • By adding information about the part of speech of words, we extended the lemmatizer into a part-of-speech tagger for Greek texts. Enhanced with local disambiguation, such a POS tagger can handle most tagging problems in the Greek language. • The part-of-speech tagger was used to annotate a Greek-language corpus: the text of George Orwell's 1984, which contains around 100,000 words. The annotated text will be used for producing comparative coverage statistics for the wordnets in BalkaNet. 1984 has already been aligned and annotated for the rest of the languages of BalkaNet (except Turkish) as part of the Multext-East project (Erjavec et al.).
Conclusions • A lemmatizer is very useful for processing a highly inflected language such as Greek. • We can create a cost-effective lemmatizer without the need for complicated and hard-to-find (or hard-to-build) resources. Such a lemmatizer can be used as part of a number of other computational tools for Wordnet development and validation. • We have presented three such tools and their application in the framework of the BalkaNet project.