UGTag: morphological analyzer and tagger for Ukrainian language

Natalia Kotsyba, Andriy Mykulyak, Igor V. Shevchenko

UGTag as a set of NLP tools
• developed within the Polish-Ukrainian parallel corpus project to provide grammatical annotation for its Ukrainian part
• inspired by TaKIPI, a functionally similar toolset for Polish
• unified output format for both language parts of the corpus
• suitable for search with programmes such as Poliqarp
Differences from TaKIPI:
• interactive annotation of texts with manual disambiguation
• modular design allows plugging in additional grammatical dictionaries as well as modifying the existing ones
• code for UGTag was written from scratch

Programme architecture

UGTag package
• enriches raw texts with grammatical information taken by default from UGD (Ukrainian Grammatical Dictionary)
• data in UGD are stored in a relational database
• 180 thousand lemmas
• 56 thousand endings
• more than 2000 paradigmatic classes
• the major part of the data was transformed into a set of XML files and adjusted to specific UGTag needs
• any compatible dictionary can be used instead of or alongside UGD
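The idea of using several compatible dictionaries side by side can be sketched as follows. This is a minimal illustration, not UGTag's actual code: the `ToyDictionary` class, the `lookup_all` function and the tag strings are all hypothetical, and the real UGD data live in a database and XML files rather than in Python dicts.

```python
from typing import Dict, List, Tuple

# (lemma, tag) pairs; the tag strings are placeholders, not real UGD tags.
Interpretation = Tuple[str, str]


class ToyDictionary:
    """Hypothetical in-memory dictionary mapping word forms to interpretations."""

    def __init__(self, entries: Dict[str, List[Interpretation]]) -> None:
        self.entries = entries

    def lookup(self, form: str) -> List[Interpretation]:
        return self.entries.get(form.lower(), [])


def lookup_all(form: str, dictionaries: List[ToyDictionary]) -> List[Interpretation]:
    """Collect interpretations from every active dictionary, dropping duplicates."""
    found: List[Interpretation] = []
    for dictionary in dictionaries:
        for interpretation in dictionary.lookup(form):
            if interpretation not in found:
                found.append(interpretation)
    return found


main_dict = ToyDictionary({"на": [("на", "prep"), ("на", "interj")]})
user_dict = ToyDictionary({"блог": [("блог", "noun")]})

print(lookup_all("на", [main_dict, user_dict]))
print(lookup_all("блог", [main_dict, user_dict]))
print(lookup_all("щось", [main_dict, user_dict]))  # not in any dictionary
```

Passing only `[user_dict]` would correspond to using a different dictionary instead of the default one.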

Stages of analysis
• pre-processing stage: tokenization and chunking
• morphological tagging
• disambiguation

Process of analysis

Premorphological analysis
Procedures that do not involve the use of the grammatical dictionary.
Reading phase and input formats
• input: plain, HTML or XML texts, including XML files structured according to the XCES standard
• strips all tags from input HTML or XML files and turns them into raw text
• user-defined file readers can take into account the logical mark-up of input XML files and incorporate it into the output XML format
• the file reader separates the external representation of texts from the unified internal representation fed to the tokenizer
• the reader extracts the text itself, possibly dividing it into chunks for further processing
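A tag-stripping reader of the kind described above might look roughly like this. It is a sketch built on Python's standard `html.parser` module. The names `TextExtractor` and `read_chunks` are hypothetical, and splitting into one chunk per non-empty line is an assumption made for the example, not a documented UGTag rule.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Keeps character data and ignores every tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def read_chunks(markup: str) -> List[str]:
    """Strip tags, then split the raw text into chunks.

    Assumption: one chunk per non-empty line of the stripped text.
    """
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    raw_text = "".join(parser.parts)
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


sample = "<html><body><p>Перший абзац.</p>\n<p>Другий <b>абзац</b>.</p></body></html>"
print(read_chunks(sample))
```

A user-defined reader would instead keep selected mark-up (for example paragraph boundaries) and pass it along to the writer.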

Tokenizer
• first divides chunks into blocks delimited by whitespace characters
• a block can consist of one or more tokens, e.g. a quotation mark and a word with no whitespace in between (”token)
• next divides blocks into tokens, which are the minimal structural units
• five categories of tokens: words, numbers, punctuation marks, whitespace characters and unrecognized tokens
• a word is a sequence of alphabetic characters with an optional hyphen
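The two-step tokenization described above (blocks first, then tokens in five categories) can be approximated with regular expressions. The sketch below is an illustration only: the function names and category labels are made up, a number is assumed to be a run of digits, and a single character is assumed to be punctuation when its Unicode general category starts with `P`.

```python
import re
import unicodedata
from typing import List, Tuple

Token = Tuple[str, str]  # (category, text)

# A word: alphabetic characters with an optional single hyphen inside.
WORD = r"[^\W\d_]+(?:-[^\W\d_]+)?"
NUMBER = r"\d+"  # assumption: a number is a plain run of digits
BLOCK_PATTERN = re.compile(
    rf"(?P<word>{WORD})|(?P<number>{NUMBER})|(?P<other>.)", re.DOTALL
)


def split_block(block: str) -> List[Token]:
    """Divide one whitespace-free block into tokens."""
    tokens: List[Token] = []
    for match in BLOCK_PATTERN.finditer(block):
        kind = match.lastgroup
        text = match.group()
        if kind == "other":
            is_punct = unicodedata.category(text).startswith("P")
            kind = "punctuation" if is_punct else "unrecognized"
        tokens.append((kind, text))
    return tokens


def tokenize(chunk: str) -> List[Token]:
    """Split a chunk into blocks at whitespace, then blocks into tokens."""
    tokens: List[Token] = []
    for piece in re.split(r"(\s+)", chunk):
        if not piece:
            continue
        if piece.isspace():
            tokens.append(("whitespace", piece))
        else:
            tokens.extend(split_block(piece))
    return tokens


for token in tokenize("”Хто-небудь має 3 $"):
    print(token)
```

Here the block `”Хто-небудь` yields two tokens, a punctuation mark and a hyphenated word, which mirrors the (”token) case from the slide.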

Grammatical dictionary
• the structure of grammatical information in UGD was rearranged and divided into finer categories to meet the requirements of the intended tagsets:
• a tagset compatible with MULTEXT-East, version 4
• the common tagset for Polish and Ukrainian [Kotsyba, Turska, Shypnivska 2008], slightly modified and simplified to achieve this compatibility
• the category of degree of comparison for adjectives and adverbs was reintroduced, and adjectives and adverbs were regrouped and relemmatized accordingly
• the category of predicatives was regrouped based on the conclusions in [Derzhanski, Kotsyba 2008]
• word splitting: original UGD collocations containing whitespace characters or hyphens are split into parts that are treated as individual units
• information about those combinations is preserved and can be used for syntactic analysis in the future

Morphological analysis
• users can watch the progress of tagging as it goes
• tagged tokens of different categories are displayed on screen, colour-coded:
• unrecognized tokens (red)
• words with only one available grammatical interpretation (green)
• words with multiple grammatical interpretations (blue)
• a panel in the top right corner displays the grammatical characteristics of the selected item
• manual disambiguation is possible for words with multiple available interpretations
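The colour coding above amounts to a simple rule on the number of interpretations found for a token. A minimal sketch, with a hypothetical function name and placeholder tag strings:

```python
from typing import Sequence


def display_colour(interpretations: Sequence[str]) -> str:
    """Map the number of available interpretations to a display colour."""
    if not interpretations:
        return "red"    # unrecognized token
    if len(interpretations) == 1:
        return "green"  # unambiguous word
    return "blue"       # ambiguous word, candidate for manual disambiguation


print(display_colour([]))
print(display_colour(["prep"]))
print(display_colour(["prep", "interj"]))
```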

Automatic disambiguation
• rudimentary automatic disambiguation based on statistical analysis for the small but frequently used word class of prepositions
• “до”: 15 grammatical interpretations, one as a preposition and 14 covering all possible grammatical characteristics of the indeclinable noun “до” (musical note)
• “на”: colloquial use as an interjection
• the further disambiguation policy foresees a combination of rules and statistical analysis of manually disambiguated data
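The slides do not say how the statistical analysis is carried out. One simple possibility, shown here purely as an assumption, is to count how often each interpretation was chosen in manually disambiguated data and to prefer the most frequent one. The function names, tag strings and sample counts below are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Optional, Tuple


def count_choices(samples: Iterable[Tuple[str, str]]) -> DefaultDict[str, Counter]:
    """Count manually chosen tags per word form."""
    stats: DefaultDict[str, Counter] = defaultdict(Counter)
    for form, tag in samples:
        stats[form.lower()][tag] += 1
    return stats


def pick_tag(
    form: str, candidates: List[str], stats: DefaultDict[str, Counter]
) -> Optional[str]:
    """Return the most frequently chosen candidate, or None if there is no evidence."""
    counts = stats.get(form.lower())
    if counts is None or not candidates:
        return None
    best = max(candidates, key=lambda tag: counts[tag])
    return best if counts[best] > 0 else None


# Invented sample data: the preposition reading of "до" dominates.
manual = [("до", "prep")] * 3 + [("до", "noun")] + [("на", "prep")] * 2
stats = count_choices(manual)

print(pick_tag("до", ["noun", "prep"], stats))
print(pick_tag("на", ["interj"], stats))   # no evidence for this reading
print(pick_tag("під", ["prep"], stats))    # form never seen
```

Returning `None` leaves the word ambiguous, so it can still be resolved manually or by later rules.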

Enriching the dictionary database
• during annotation UGTag automatically creates a list of words not found in the dictionary and displays it to the user, who can add them to one of the user dictionaries
• the list of words not recognized by the active built-in dictionary is displayed
• the user can select a word from this list and add it to the dictionary
• the programme gives hints as to the paradigm of the word
• the word forms can also be defined manually
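Collecting unknown words and suggesting a paradigm can be sketched as below. How UGTag actually derives its hints is not stated in the slides; matching the longest known ending is an assumption made here for illustration, and the function names, endings and paradigm labels are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set


def collect_unknown(words: Iterable[str], known: Set[str]) -> List[str]:
    """List each word missing from the dictionary once, in order of appearance."""
    unknown: List[str] = []
    for word in words:
        key = word.lower()
        if key not in known and key not in unknown:
            unknown.append(key)
    return unknown


def paradigm_hint(word: str, paradigm_by_ending: Dict[str, str]) -> Optional[str]:
    """Suggest the paradigm attached to the longest matching ending, if any."""
    for start in range(len(word)):  # longest suffix first
        ending = word[start:]
        if ending in paradigm_by_ending:
            return paradigm_by_ending[ending]
    return None


known_forms = {"нова", "читає"}
endings = {"а": "class-A", "ка": "class-KA"}  # invented labels

unknown = collect_unknown(["Нова", "блогерка", "читає", "блогерка"], known_forms)
print(unknown)
print(paradigm_hint("блогерка", endings))
print(paradigm_hint("блог", endings))
```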

Adding a new word

Sentencing
• sentence splitting is rule-based, and some of the rules require grammatical information
• the rules implemented so far are partially based on Rudolf’s work for Polish [Rudolf 2004]
• heuristics use common abbreviations and words starting with a capital letter, whose meaning is also taken into account
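The abbreviation and capital-letter heuristics mentioned above can be illustrated with a small rule-based splitter. This sketch is not UGTag's rule set and does not reproduce [Rudolf 2004]: the abbreviation list is a tiny invented sample, and the grammatical information that the real rules rely on is left out.

```python
import re
from typing import List, Set

# Hypothetical sample; a real list would be much longer.
ABBREVIATIONS: Set[str] = {"м.", "р.", "ім.", "проф."}


def split_sentences(text: str, abbreviations: Set[str] = ABBREVIATIONS) -> List[str]:
    """Split at sentence-final punctuation unless a heuristic blocks the break."""
    sentences: List[str] = []
    start = 0
    for match in re.finditer(r"[.!?…]+\s+", text):
        candidate = text[start:match.end()].strip()
        if not candidate:
            continue
        last_token = candidate.split()[-1].lower()
        next_char = text[match.end():match.end() + 1]
        if last_token in abbreviations:
            continue  # the full stop belongs to an abbreviation
        if next_char and not next_char.isupper():
            continue  # the next word is not capitalized
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences("Він живе у м. Київ. Це столиця України. А ви?"):
    print(sentence)
```

The full stop after “м.” is not treated as a boundary even though a capitalized word follows, which is exactly the kind of case where the meaning of the capitalized word matters.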

Writing phase and writing format
• two output tag formats for the resulting XML files
• the default format is based on TaKIPI 1.8 for Polish, extended with Ukrainian-specific features [Kotsyba, Turska, Shypnivska 2008]
• it retains the maximum grammatical information that the Polish and Ukrainian grammatical dictionaries can provide
• the alternative is a MULTEXT-East compatible tagset with a coarser granularity of grammatical information [Derzhanski, Kotsyba 2009]
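To give a feel for the writing phase, the sketch below serializes one token with its interpretations. The element and attribute names (`tok`, `orth`, `lex`, `base`, `ctag`, `disamb`) follow the layout commonly seen in TaKIPI-style XCES output, but they are an assumption here, as are the function name and the placeholder tag strings. The exact UGTag output format may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (lemma, tag, chosen) where "chosen" marks the disambiguated reading.
Reading = Tuple[str, str, bool]


def token_to_xml(orth: str, readings: List[Reading]) -> ET.Element:
    """Build one token element holding all of its interpretations."""
    tok = ET.Element("tok")
    ET.SubElement(tok, "orth").text = orth
    for base, ctag, chosen in readings:
        lex = ET.SubElement(tok, "lex")
        if chosen:
            lex.set("disamb", "1")
        ET.SubElement(lex, "base").text = base
        ET.SubElement(lex, "ctag").text = ctag
    return tok


element = token_to_xml("до", [("до", "prep", True), ("до", "noun", False)])
print(ET.tostring(element, encoding="unicode"))
```

Switching between the two tag formats would then only change the strings written into the tag element, not the surrounding structure.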

Plans for further development
• depend on the results of extensive experiments with real corpus texts
• enriching the dictionary database both manually and automatically
• enhancing the quality of automatic disambiguation
• preliminary syntactic parsing and word grouping: complex words like numerals (“двадцять три”, twenty-three, is currently recognized as two separate words, “двадцять” and “три”), complex passive structures, prepositional phrases, etc.
