Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus)
Польсько-Український паралельний корпус / Polsko-Ukraiński Korpus Równoległy
http://corpus.domeczek.pl
Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences
Olga Shypnivska, ULIF, Ukrainian Academy of Sciences
Magdalena Turska, Warsaw University
Main objectives and expected applications
• at least 3 million tokens; representative
• sentence-level alignment
• morphological annotation with a common tagset
• public access; user-friendly
• linguistic material for
  • (independent) language learning
  • bilingual dictionaries
  • research on grammar and lexis
  • translation memory for humans and machines
Search (present)
• based on Perl regular expressions
• every search string must be enclosed in "/", e.g. /Холодна війна/ (Ukrainian for "Cold War")
• special characters:
  | alternative
  ( ) beginning and end of a subchain (group)
  [ ] beginning and end of a defined character class
  ? 0 or 1 occurrences
  * 0 or more occurrences
  + 1 or more occurrences
  \s any whitespace character
  \w any letter, digit, or underscore
  \b word boundary
  \ escape
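Examples of search formulae
• /jako/ matches „jako” anywhere, including inside a longer word
• /jako\s/ matches words ending in „jako”: „jako”, „niejako”, „dwojako”
• /\bjako/ matches words beginning with „jako”, e.g. „jakość”
• /norma\./ matches „norma” directly before a full stop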
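The formulae above can be tried out with a short sketch. It uses Python's `re` module as a stand-in for the Perl engine the corpus uses (the two agree on the constructs listed here); the helper names `strip_slashes` and `matching_words` and the sample sentence are assumptions made for illustration, not part of PolUKR.

```python
import re

# Invented sample text containing the words from the examples.
SAMPLE = "to jako niejako dwojako jakość norma. normalny koniec"


def strip_slashes(formula):
    """Remove the enclosing "/" that a search formula is written with."""
    if len(formula) >= 2 and formula.startswith("/") and formula.endswith("/"):
        return formula[1:-1]
    raise ValueError("a search formula must be enclosed in '/'")


def matching_words(formula, text):
    """Return the space-delimited words in which the formula matches."""
    pattern = re.compile(strip_slashes(formula))
    hits = []
    for match in pattern.finditer(text):
        # Expand the match to the whole word that contains its first character.
        start = text.rfind(" ", 0, match.start()) + 1
        end = text.find(" ", start)
        if end == -1:
            end = len(text)
        word = text[start:end]
        if word not in hits:
            hits.append(word)
    return hits


if __name__ == "__main__":
    for formula in ("/jako/", r"/jako\s/", r"/\bjako/", r"/norma\./"):
        print(formula, matching_words(formula, SAMPLE))
```

Note that /\bjako/ also finds the standalone word „jako”, since a word boundary precedes it as well.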
Sources of morphological information
• Polish: IPI PAN corpus + …
• Ukrainian:
  • grammatical dictionary by ULIF, UAS (Igor Shevchenko): lemma <> wordform
  • morphological analyzer (its information is slightly different; built for homonymy disambiguation)
  • no lemmatization (so far)
Types of tagsets
• SYMBOLS (English (BNC), Ukrainian): all grammatical characteristics of a wordform are encoded in one symbol
  • takes little machine memory but demands too much human memory
• CHAINS: contain codes corresponding to particular grammatical categories and/or their values; the morphological characteristics of a wordform are represented by a sequence of such codes
  • can be even more economical than symbols if a query concerns morphological categories shared by several lexico-grammatical classes
  • positional (Czech): every category (and its values) has a fixed position in the chain
  • flexemic (Polish, Russian): every category has its own subtagset
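A minimal sketch of the positional idea, and of why a chain can be economical for cross-class queries. The four-position layout, the value codes, and the tagged words below are invented for illustration; they are not the Czech, Polish, or PolUKR tagsets.

```python
# Hypothetical positional layout: one character per category, fixed order.
POSITIONS = ("pos", "gender", "number", "case")

# Invented tags for three Polish wordforms (N = noun, A = adjective,
# M = masculine, S = singular, 1 = nominative, 3 = dative).
TAGGED = {
    "kot": "NMS1",
    "kotu": "NMS3",
    "dobremu": "AMS3",
}


def decode(tag):
    """Turn a positional chain into a category -> value mapping."""
    if len(tag) != len(POSITIONS):
        raise ValueError("tag length does not match the positional layout")
    return dict(zip(POSITIONS, tag))


def has_value(tag, category, value):
    """Check one category by looking at its fixed position only."""
    return tag[POSITIONS.index(category)] == value


if __name__ == "__main__":
    # One condition finds dative forms of nouns and adjectives alike.
    dative = [word for word, tag in TAGGED.items() if has_value(tag, "case", "3")]
    print(dative)
    print(decode("AMS3"))
```

With one-symbol tags, the same query would need an explicit list of every symbol that happens to denote a dative form.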
Multext-East tagset for En Ro Sl Cz Bg Et Hu Hr Sr Re
• chain-like; criticised
• 14 PoS: N 10, V 15, A 12, P(ron) 17, Det 10, T(he) 6, adveRb 6, S (adposition) 4, C(onj) 7, nuMeral 12, Intjn 2, X (residual), Y (abbr) 5, Q (particle) 3
• only Bg and Hu do not have modal verbs and copulas
• En Ro have determiners, Ro Hu Re have articles, Bg has neither (analyticity, segmentation)
• Is a Bg noun formally indefinite if the article is attached to the adjective? (cf. agglutinativity of Pl być)
• negation as a morphological category
• Cz transgressive (adverbial participle)
Treatment of participles
• Polish (no aspectual characteristics) (Here and further cited from: Adam Przepiórkowski and Marcin Woliński, A Flexemic Tagset for Polish.)
• Ukrainian (aspect and tense)
  • verb, adverbial participle, perfective aspect, past tense, active voice: VW, e.g. прочитавши ("having read")
  • verb, adverbial participle, imperfective aspect, present tense, active voice: UQ, e.g. читаючи ("reading")
  (Here and further cited from: Широков В. А. et al., Корпусна лінгвістика.)
• PolUKR: participle I (doing/having done) characterised by aspect
Treatment of pronouns
• the notorious Slavonic pronoun problem: 296 unique tags for 309 pronouns
• Polish: division into 1st-2nd person, 3rd person, and siebie (ów, jak?)
• Ukrainian: pro-noun, pro-adjective
• Russian: also pro-predicative and pro-adverb
• Czech: many subcategories at the SubPoS level
• PolUKR: the Ukrainian approach plus the Polish division into 1st-2nd and 3rd person
Treatment of predicatives
• Polish: adverbs with modal semantics such as można "(it is) allowed / one can", trzeba "(it is) necessary", ?to
• Ukrainian (code X0): includes adverbs of state such as жарко "(it is) hot", шкода, жаль "(it is) a pity"
• PolUKR: the category is moved from the morphological level to the semantic one
Search engine for PolUKR
• choose the direction of the search (Ua>Pl or Pl>Ua)
• search conditions for both languages (RvonW)
• 3 levels of search:
  • exact form
  • (lemma) with a morphological choice
  • Poliqarp-like tag formulas (for advanced users)
• idea of subcategories (either a POS or a SUBPOS can be selected, but not both; similarly, one cannot select all subcategories of a POS), cf. aliases in the IPI PAN corpus
• alternatives are offered through checkboxes, so that one can choose EITHER „VERB finite past” OR „NOUN dative neuter” OR something else
• restrictions on choice within 1 of 10 POS
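The search levels and the checkbox alternatives might be sketched as follows. This is only an illustration under stated assumptions, not the PolUKR implementation: the `Token` and `Pair` structures, the feature names, the function names, and the two aligned sentence pairs are all invented. Ukrainian lemmas are left as `None` to mirror the absence of lemmatization noted earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str
    lemma: Optional[str]   # None where no lemmatization is available
    tag: frozenset         # invented feature names, e.g. {"verb", "past"}


@dataclass
class Pair:
    text: dict             # {"pl": sentence, "ua": sentence}
    tokens: dict           # {"pl": [Token, ...], "ua": [Token, ...]}


def tok(form, lemma, *features):
    return Token(form, lemma, frozenset(features))


# Two invented sentence pairs, aligned at sentence level.
CORPUS = [
    Pair(text={"pl": "Czytał książkę.", "ua": "Він читав книгу."},
         tokens={"pl": [tok("Czytał", "czytać", "verb", "finite", "past"),
                        tok("książkę", "książka", "noun", "accusative", "feminine")],
                 "ua": [tok("Він", None, "pronoun"),
                        tok("читав", None, "verb", "finite", "past"),
                        tok("книгу", None, "noun", "accusative", "feminine")]}),
    Pair(text={"pl": "Pomógł dziecku.", "ua": "Він допоміг дитині."},
         tokens={"pl": [tok("Pomógł", "pomóc", "verb", "finite", "past"),
                        tok("dziecku", "dziecko", "noun", "dative", "neuter")],
                 "ua": [tok("Він", None, "pronoun"),
                        tok("допоміг", None, "verb", "finite", "past"),
                        tok("дитині", None, "noun", "dative", "feminine")]}),
]


def token_matches(token, form=None, lemma=None, alternatives=()):
    """Exact form, lemma, and ticked tag alternatives (joined by OR)."""
    if form is not None and token.form != form:
        return False
    if lemma is not None and token.lemma != lemma:
        return False
    if alternatives and not any(set(alt) <= token.tag for alt in alternatives):
        return False
    return True


def search(corpus, source, form=None, lemma=None, alternatives=()):
    """Return (source sentence, aligned sentence) for every matching pair."""
    target = "ua" if source == "pl" else "pl"
    return [(pair.text[source], pair.text[target])
            for pair in corpus
            if any(token_matches(t, form, lemma, alternatives)
                   for t in pair.tokens[source])]


if __name__ == "__main__":
    print(search(CORPUS, "pl", lemma="czytać"))
    print(search(CORPUS, "pl", alternatives=[{"verb", "finite", "past"},
                                             {"noun", "dative", "neuter"}]))
```

Each ticked box becomes one feature set in `alternatives`; a token qualifies if it carries all features of at least one of them.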
Literature
• INTERA unified tagset project. www.elda.org/intera
• Tomaž Erjavec et al. Multext-East specifications for Slavic languages. Budapest, 2003.
• Jan Hajič. Positional Tags: Quick Reference (Czech „HM” Morphology). 2000.
• Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003. http://nlp.ipipan.waw.pl/~adamp/Papers/2003-eacl-ws12/ws12.pdf
• Elena Paskaleva. Balkan South-East Corpora Aligned to English. In: The Proceedings of the Workshop on Common Natural Language Processing Paradigm for Balkan Languages, EACL 2003.
• Широков В. А. et al. Корпусна лінгвістика. Київ: Довіра, 2005.