Comments on Czech Morphological Tagset

Karel Pala • Natural Language Processing Centre (Centrum zpracování přirozeného jazyka), Faculty of Informatics, Masaryk University, Brno

Overview • What annotation? • Morphological? • Syntactic? • Hybrid? • What can and should be changed?

Present morphological annotation • Presently used approach can be characterized as partial • Complex expressions are simply taken apart and put together again • Non-compositionality is ignored • Language reality is distorted • Holistic approach has to be considered • It includes mainly MWEs • They display high frequency

Examples • Karlovy Vary (k1) • vzhledem k (k7 k7) • a to (k8 k3) • jen (k9) • budu číst (k5 k5)

Further examples • adverbs (k6, CzTen 278,172,710, Desam 49 469) • prepositions (k7) • conjunctions (k8, CzTen 447,920,261, Desam 99 432) • particles (k9, CzTen 324,980,597, Desam 52 951) • verbs (k5, CzTen 694,012,081, Desam 126 067, analytic forms)

Conflicts with tagging a to in Desam • a to is annotated in the following way: • not tagged: 13 • k8 k3: 60 • k9 k3: 19 • k8 k9: 3 • k9 k9: 5 • total: 100
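
The spread of annotations on this slide can be summarized with a short tally. The sketch below is only an illustration: it copies the counts from the slide, reading the last row as "k9 k9" with count 5 (the reading under which the rows sum to the stated total of 100), and the variable names are made up for the example.

```python
from collections import Counter

# Counts for "a to" in Desam, copied from the slide.
# Assumption: the last row is "k9 k9" with count 5, so the rows sum to 100.
annotations = Counter({
    "not tagged": 13,
    "k8 k3": 60,
    "k9 k3": 19,
    "k8 k9": 3,
    "k9 k9": 5,
})

total = sum(annotations.values())
majority_tag, majority_count = annotations.most_common(1)[0]
# Occurrences whose annotation differs from the most frequent variant.
disagreeing = total - majority_count

for tag, count in annotations.most_common():
    print(f"{tag:<10} {count:>3} {count / total:.0%}")
print(f"majority: {majority_tag}; other annotations: {disagreeing} of {total}")
```

Under this reading, the most frequent variant (k8 k3) covers 60 of the 100 occurrences, so 40 occurrences carry some other annotation or none at all.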

Tagging errors • Tagging conflicts here would disappear • if we treated a to ("and this") and similar expressions • each as a single MWE unit • a to in this case should be tagged as a conjunction (k8)
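
Treating a to as one unit could look roughly like the following post-processing step over already tagged tokens. This is a minimal sketch under stated assumptions, not the procedure used for Desam: the `MWE_TAGS` table, the `merge_mwes` function and the (word, tag) token format are all hypothetical, and the only entry is the one named on the slide.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word form, tag)

# Hypothetical MWE list: lowercased word sequence -> one tag for the whole unit.
MWE_TAGS: Dict[Tuple[str, ...], str] = {
    ("a", "to"): "k8",  # the slide proposes tagging "a to" as a conjunction
}


def merge_mwes(tokens: List[Token],
               mwes: Dict[Tuple[str, ...], str] = MWE_TAGS) -> List[Token]:
    """Replace each listed word sequence by one token carrying one tag."""
    longest = max((len(key) for key in mwes), default=1)
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so that longer MWEs win over shorter ones.
        for size in range(min(longest, len(tokens) - i), 1, -1):
            words = tuple(word.lower() for word, _ in tokens[i:i + size])
            if words in mwes:
                surface = " ".join(word for word, _ in tokens[i:i + size])
                merged.append((surface, mwes[words]))
                i += size
                break
        else:
            # No MWE starts here: keep the token and its original tag.
            merged.append(tokens[i])
            i += 1
    return merged
```

With this table, `merge_mwes([("a", "k9"), ("to", "k3")])` returns `[("a to", "k8")]`, whichever of the conflicting tag pairs the two tokens carried before.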

A Comment on Verbs • They display the highest frequency among the mentioned parts of speech • Treating them as MWEs would require massive re-tagging • Fortunately, this can be avoided • Verbs do not cause many tagging conflicts • Thus a (dirty) compromise is acceptable

A solution? • It would be desirable to go through these ambiguities manually • and try to disambiguate them • however, this is obviously not realistic because of their high frequency • perhaps we can try to collect all these POSs • keep them as separate lists and make them a part of the database of the Majka analyzer • It is a great challenge to be considered
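
One way to shortlist candidates for such lists, instead of reading every occurrence by hand, is to collect the word pairs that the corpus annotates in more than one way. The sketch below only illustrates that idea: the function name and the (word, tag) sentence format are assumptions, it is limited to adjacent pairs, and it does not read or write the actual Majka database format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]   # (word form, tag)
Pair = Tuple[str, str]


def collect_conflicts(sentences: Iterable[List[Token]]) -> Dict[Pair, Set[Pair]]:
    """Map each adjacent word pair to the tag pairs it received,
    keeping only the pairs annotated in more than one way."""
    seen: Dict[Pair, Set[Pair]] = defaultdict(set)
    for sentence in sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            seen[(w1.lower(), w2.lower())].add((t1, t2))
    return {pair: tags for pair, tags in seen.items() if len(tags) > 1}


# Toy input built only from tag pairs mentioned on the slides, not real corpus data.
toy_corpus = [
    [("a", "k8"), ("to", "k3")],
    [("a", "k9"), ("to", "k3")],
    [("jen", "k9"), ("to", "k3")],
]
conflicts = collect_conflicts(toy_corpus)
```

Here `conflicts` contains only the pair ("a", "to"), with the two tag pairs it received. The resulting shortlist would still need a manual decision on which pairs are genuine MWEs and which single tag each should get.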

Conclusions • Such a revision would be painful and expensive • This will mean, in fact, a new project as well • This should lead to essential changes in tagging results, and not only for the Czech language • The question is whether the techniques exploiting neural networks and ML • will be able to deal with MWEs in a holistic and descriptively adequate way

Thanks for your attention
