
UGTag: morphological analyzer and tagger for Ukrainian language

Natalia Kotsyba, Andriy Mykulyak, Igor V. Shevchenko. UGTag is a set of NLP tools developed within the Polish-Ukrainian parallel corpus to provide grammatical annotation for its Ukrainian part.


Presentation Transcript


  1. UGTag: morphological analyzer and tagger for Ukrainian language. Natalia Kotsyba, Andriy Mykulyak, Igor V. Shevchenko

  2. UGTag as a set of NLP tools
     • developed within the Polish-Ukrainian parallel corpus to provide grammatical annotation for its Ukrainian part
     • inspired by TaKIPI, a functionally similar toolset for Polish
     • unified output format for both language parts of the corpus
     • suitable for search with programmes such as Poliqarp
     Differences from TaKIPI:
     • interactive annotation of texts with manual disambiguation
     • modular design allows plugging in additional grammatical dictionaries and modifying the existing ones
     • code for UGTag was written from scratch

  3. Programme architecture

  4. UGTag package
     • enriches raw texts with grammatical information taken by default from UGD (Ukrainian Grammatical Dictionary)
     • data in UGD are stored in a relational database:
       • 180 thousand lemmas
       • 56 thousand endings
       • more than 2000 paradigmatic classes
     • the major part of the data was transformed into a set of XML files and adjusted for specific UGTag needs
     • any compatible dictionary can be used instead of UGD or alongside it
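The relationship between lemmas, endings and paradigmatic classes can be illustrated with a minimal sketch. The data layout, the class name `n_f_a` and the tag strings below are assumptions made for illustration; they do not reflect the actual UGD schema or the UGTag XML files.

```python
# Hypothetical paradigm table: class name -> list of (ending, tag) pairs.
PARADIGMS = {
    "n_f_a": [("а", "nom.sg"), ("и", "gen.sg"), ("у", "acc.sg")],
}

# Hypothetical lexicon entry: lemma -> (stem, paradigmatic class).
LEXICON = {"книга": ("книг", "n_f_a")}


def build_form_index(lexicon, paradigms):
    """Expand every lemma into word forms and index them by surface form."""
    index = {}
    for lemma, (stem, cls) in lexicon.items():
        for ending, tag in paradigms[cls]:
            index.setdefault(stem + ending, []).append((lemma, tag))
    return index


FORM_INDEX = build_form_index(LEXICON, PARADIGMS)
```

Looking up a surface form in `FORM_INDEX` returns every (lemma, tag) pair that could have produced it, which is the kind of information a tagger attaches to a token.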

  5. Stages of analysis
     • pre-processing stage: tokenization and chunking
     • morphological tagging
     • disambiguation

  6. Process of analysis

  7. Premorphological analysis
     Procedures that do not involve the use of the grammatical dictionary.
     Reading phase and input formats
     • plain, HTML or XML texts; XML files structured according to the XCES standard
     • the reader strips all tags from input HTML or XML files and turns them into raw texts
     • user-defined file readers can take into account the logical mark-up of input XML files and incorporate it into the output XML format
     • the file reader separates the external representation of texts from their unified internal representation fed to the tokenizer
     • it extracts the text itself, possibly portioning it into chunks for further processing
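A minimal sketch of a tag-stripping reader, using only the Python standard library. The function names and the policy of chunking on blank lines are assumptions for illustration; the slides do not say how UGTag delimits chunks.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects character data and discards all mark-up."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def read_raw_text(markup):
    """Return the text content of an HTML or XML string with tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def chunk_paragraphs(text):
    """Split raw text into non-empty chunks on blank lines (assumed policy)."""
    return [c.strip() for c in text.split("\n\n") if c.strip()]
```

A user-defined reader in the sense of the slide would replace `read_raw_text` with something that keeps selected mark-up instead of discarding all of it.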

  8. Tokenizer
     • first divides chunks into blocks delimited by whitespace characters
     • a block can consist of one or more tokens, e.g. a quotation mark and a word with no whitespace in between (”token)
     • next divides blocks into tokens, which are the minimal structural units
     • five categories of tokens: words, numbers, punctuation marks, whitespace characters and unrecognized tokens
     • a word is a sequence of alphabetical characters with an optional hyphen
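The two-step procedure above can be sketched as follows. This is an illustration under assumptions, not UGTag's code: the punctuation set is a small hypothetical sample, and "optional hyphen" is read as at most one word-internal hyphen.

```python
import re

# Alternatives are tried in order; the last one catches unrecognized tokens.
TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+(?:-[^\W\d_]+)?)"   # letters, optionally one hyphen
    r"|(?P<number>\d+)"
    r"|(?P<space>\s+)"
    r"|(?P<punct>[.,;:!?()\"'«»“”„…-])"      # assumed punctuation sample
    r"|(?P<unknown>.)",
    re.DOTALL,
)


def split_blocks(chunk):
    """Step 1: split a chunk into whitespace runs and non-whitespace blocks."""
    return re.findall(r"\s+|\S+", chunk)


def tokenize(chunk):
    """Step 2: split every block into (category, text) tokens."""
    tokens = []
    for block in split_blocks(chunk):
        for match in TOKEN_RE.finditer(block):
            tokens.append((match.lastgroup, match.group()))
    return tokens
```

For a block such as a quotation mark immediately followed by a word, `tokenize` yields a punctuation token and then a word token, matching the example on the slide.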

  9. Grammatical dictionary
     • the structure of grammatical information in UGD was rearranged and divided into finer categories to meet the requirements of the intended tagsets:
       • compatible with MULTEXT-East, version 4
       • common tagset for Polish and Ukrainian [Kotsyba, Turska, Shypnivska 2008], slightly modified and simplified to achieve this compatibility
     • the category of degree of comparison for adjectives and adverbs was reintroduced, and adjectives and adverbs were regrouped and relemmatized accordingly
     • the category of predicatives was regrouped based on the conclusions in [Derzhanski, Kotsyba 2008]
     • word splitting: the components of original UGD collocations containing whitespace characters or hyphens are treated as individual units
     • information about those combinations is preserved and can be used for syntactic analysis in the future

  10. Morphological analysis
     • users can watch the progress of tagging as it goes
     • tagged tokens of different categories are displayed on screen colour-coded:
       • unrecognized tokens (red)
       • words with only one available grammatical interpretation (green)
       • words with multiple grammatical interpretations (blue)
     • a panel in the top right corner displays the grammatical characteristics of the selected item
     • manual disambiguation is possible for words with multiple available interpretations
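The colour coding amounts to a three-way classification on the number of interpretations found for a token. A minimal sketch, in which the function name and the use of plain colour strings are assumptions:

```python
def colour_for(interpretations):
    """Map the list of dictionary interpretations of a token to a colour."""
    if not interpretations:
        return "red"      # unrecognized token
    if len(interpretations) == 1:
        return "green"    # exactly one grammatical interpretation
    return "blue"         # several interpretations, can be disambiguated
```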

  11. Automatic disambiguation
     • rudimentary automatic disambiguation based on statistical analysis for the small but frequently used word class of prepositions
     • “до”: 15 grammatical interpretations, one as a preposition and 14 covering all possible grammatical characteristics of the invariable noun “до” (musical note)
     • “на”: colloquial use as an interjection
     • the further disambiguation policy foresees a combination of rules and statistical analysis of manually disambiguated data
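The slides do not specify which statistical method is used. As one possible reading, the sketch below implements a most-frequent-tag baseline trained on manually disambiguated data; the function names and the tag labels `prep`, `noun` and `interj` are hypothetical.

```python
from collections import Counter


def train(tagged_tokens):
    """Count how often each tag was chosen for each word form."""
    counts = {}
    for form, tag in tagged_tokens:
        counts.setdefault(form, Counter())[tag] += 1
    return counts


def disambiguate(form, candidates, counts):
    """Pick the candidate tag seen most often for this form, else None."""
    seen = counts.get(form)
    if not seen or not candidates:
        return None
    best = max(candidates, key=lambda tag: seen[tag])
    # A count of zero means none of the candidates was ever observed.
    return best if seen[best] > 0 else None
```

Returning `None` for unseen forms leaves the word ambiguous, so it can still be resolved manually.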

  12. Enriching the dictionary database
     • during annotation UGTag automatically creates a list of words not found in the dictionary and displays it to the user, who can add them to one of the user dictionaries
     • the list of words not recognized by the active built-in dictionary is displayed
     • the user can select a word from this list and add it to the dictionary
     • the programme gives hints as to the paradigm of the word
     • the word forms can also be defined manually
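A minimal sketch of the two steps described above: collecting unknown words and hinting at candidate paradigms. The token format, the lower-casing of words and the suffix-matching heuristic are all assumptions; the slides do not describe how UGTag derives its hints.

```python
def collect_unknown(tokens, known_forms):
    """Return word tokens absent from the dictionary, without duplicates."""
    unknown, seen = [], set()
    for kind, text in tokens:
        key = text.lower()
        if kind == "word" and key not in known_forms and key not in seen:
            seen.add(key)
            unknown.append(text)
    return unknown


def suggest_paradigms(word, paradigm_endings):
    """Suggest paradigm classes having a non-empty ending the word ends in."""
    return sorted(
        cls
        for cls, endings in paradigm_endings.items()
        if any(ending and word.endswith(ending) for ending in endings)
    )
```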

  13. Adding a new word

  14. Sentencing
     • sentence splitting is rule-based, and some of the rules require grammatical information
     • the rules implemented so far are partially based on Rudolf’s work for Polish [Rudolf 2004]
     • heuristics use popular abbreviations and words starting with a capital letter, whose meaning is also taken into account
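A simplified sketch of the abbreviation and capital-letter heuristics. It is not the rule set from [Rudolf 2004] and uses no grammatical information; the abbreviation list is a small hypothetical sample.

```python
import re

# Hypothetical sample of abbreviations that end in a full stop.
ABBREVIATIONS = {"м.", "вул.", "р.", "ім.", "проф.", "грн."}

# Candidate boundary: whitespace that follows sentence-final punctuation.
BOUNDARY_RE = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text):
    """Split text at candidate boundaries that pass both heuristics."""
    sentences, start = [], 0
    for match in BOUNDARY_RE.finditer(text):
        next_char = text[match.end():match.end() + 1]
        if not next_char.isupper():
            continue  # the next word does not start with a capital letter
        last_word = text[start:match.start()].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation
        sentences.append(text[start:match.start()])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

The sketch never splits after a listed abbreviation, so a sentence that genuinely ends in one is merged with the next. Cases like this are where rules that consult grammatical information, as mentioned on the slide, become necessary.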

  15. Writing phase and writing format
     • two output tag formats for the resulting XML files
     • the default format is based on TaKIPI 1.8 for Polish, extended for Ukrainian-specific features [Kotsyba, Turska, Shypnivska 2008]
     • it retains the maximum grammatical information that can be provided by the Polish and Ukrainian grammatical dictionaries
     • a MULTEXT-East compatible tagset with a coarser granularity of grammatical information [Derzhanski, Kotsyba 2009]
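A sketch of what writing one token could look like, assuming a TaKIPI-style layout with `tok`, `orth`, `lex`, `base` and `ctag` elements and a `disamb` attribute on the chosen reading. The exact element set used by UGTag and the tag strings shown are assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET


def token_to_xml(orth, interpretations, chosen=None):
    """Serialize a token and its (lemma, tag) interpretations as XML."""
    tok = ET.Element("tok")
    ET.SubElement(tok, "orth").text = orth
    for i, (base, ctag) in enumerate(interpretations):
        lex = ET.SubElement(tok, "lex")
        if i == chosen:
            lex.set("disamb", "1")  # marks the selected interpretation
        ET.SubElement(lex, "base").text = base
        ET.SubElement(lex, "ctag").text = ctag
    return ET.tostring(tok, encoding="unicode")
```

Switching between the two output tag formats would then only change the strings placed in `ctag`, not the surrounding structure.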

  16. Plans for further development
     • depend on the results of extensive experimenting with real corpus texts
     • enriching the dictionary database in both manual and automatic ways
     • enhancing the quality of automatic disambiguation
     • preliminary syntactic parsing and word grouping: complex words like numerals, e.g. “двадцять три” (twenty-three), currently recognized as separate words (“двадцять” and “три”), complex passive structures, prepositional phrases, etc.
