Comments on Czech Morphological Tagset • Karel Pala • Centrum zpracování přirozeného jazyka, Fakulta informatiky, Masarykova univerzita Brno
Overview • What annotation? • Morphological? • Syntactic? • Hybrid? • What can and should be changed?
Present morphological annotation • The presently used approach can be characterized as partial • Complex expressions are simply taken apart and put together again • Non-compositionality is ignored • Language reality is distorted • A holistic approach has to be considered • It mainly concerns multiword expressions (MWEs) • They display high frequency
Examples • Karlovy Vary (k1) • vzhledem k, "with regard to" (k7 k7) • a to, "and this" (k8 k3) • jen, "only" (k9) • budu číst, "I will read" (k5 k5)
Further examples • adverbs (k6; CzTen 278,172,710, Desam 49,469) • prepositions (k7) • conjunctions (k8; CzTen 447,920,261, Desam 99,432) • particles (k9; CzTen 324,980,597, Desam 52,951) • verbs (k5; CzTen 694,012,081, Desam 126,067; analytic forms)
Conflicts with tagging a to in Desam • a to is annotated in the following ways: • not tagged: 13 • k8 k3: 60 • k9 k3: 19 • k8 k9: 3 • k9 k9: 5 • total: 100
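The spread of annotations on this slide can be summarised with a few lines of Python. This is only a sketch over the figures quoted here, assuming the last row reads "k9 k9" with 5 occurrences (the reading under which the rows sum to the stated 100); the variable names are illustrative.

```python
# Annotation variants of "a to" in Desam, as listed on the slide.
a_to_annotations = {
    "not tagged": 13,
    "k8 k3": 60,
    "k9 k3": 19,
    "k8 k9": 3,
    "k9 k9": 5,
}

total = sum(a_to_annotations.values())

# The most frequent annotation and how often it occurs.
majority_tag, majority_count = max(a_to_annotations.items(), key=lambda kv: kv[1])

# Occurrences annotated differently from the majority variant.
deviating = total - majority_count

for variant, count in a_to_annotations.items():
    print(f"{variant:<10} {count:>3}  {count / total:.0%}")
print(f"majority: {majority_tag}, deviating occurrences: {deviating} of {total}")
```

Under that reading, the most frequent variant covers 60 of the 100 occurrences, so 40 are annotated in some other way.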
Tagging errors • The tagging conflicts here would disappear if we treated a to ("and this") and similar expressions as single MWE units • In this case, a to should be tagged as a conjunction (k8)
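The idea of treating a to as one unit can be sketched as a post-processing step over a tagged token sequence. This is a minimal illustration, not the author's implementation: the lexicon structure, the function name `merge_mwes`, and the sample inputs are all assumptions; only the entry mapping a to to k8 comes from the slides.

```python
# Hypothetical MWE lexicon: tuple of lower-cased tokens -> one tag for the unit.
# Only the ("a", "to") -> "k8" entry is taken from the slides.
MWE_LEXICON = {("a", "to"): "k8"}


def merge_mwes(tagged, lexicon=MWE_LEXICON):
    """Replace every lexicon match in a list of (word, tag) pairs with one unit."""
    max_len = max((len(key) for key in lexicon), default=1)
    merged = []
    i = 0
    while i < len(tagged):
        # Try the longest candidate first so longer MWEs win over shorter ones.
        for n in range(min(max_len, len(tagged) - i), 1, -1):
            words = tuple(word.lower() for word, _ in tagged[i:i + n])
            if words in lexicon:
                surface = " ".join(word for word, _ in tagged[i:i + n])
                merged.append((surface, lexicon[words]))
                i += n
                break
        else:
            # No MWE starts here: keep the token and its original tag.
            merged.append(tagged[i])
            i += 1
    return merged


# Two invented inputs that differ only in how the parts of "a to" were tagged.
variant_1 = [("Přišel", "k5"), ("a", "k8"), ("to", "k3"), ("včas", "k6")]
variant_2 = [("Přišel", "k5"), ("a", "k9"), ("to", "k9"), ("včas", "k6")]
print(merge_mwes(variant_1))
print(merge_mwes(variant_2))
```

Both variants come out with the same single unit tagged k8, which is the sense in which the conflict disappears. The sketch merges every occurrence of the sequence; deciding whether a given a to really is an MWE in context is not addressed here.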
A Comment on Verbs • Verbs display the highest frequency among the parts of speech mentioned • Treating analytic verb forms as MWEs would require massive re-tagging • Fortunately, this can be avoided • Verbs do not cause many tagging conflicts • Thus a (dirty) compromise is acceptable
A solution? • It would be desirable to go through these ambiguities manually and try to disambiguate them • However, this is obviously not realistic because of their high frequency • Perhaps we can try to collect all these expressions by part of speech • keep them as separate lists and make them part of the database of the Majka analyzer • It is a great challenge to be considered
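Collecting the expressions into per-POS lists could look roughly like the following. This is a sketch under stated assumptions: it does not use Majka's actual database format, which the slides do not describe, and the function name `build_pos_lists` and the unit tags for Karlovy Vary and vzhledem k are illustrative choices rather than decisions taken in the source.

```python
from collections import defaultdict


def build_pos_lists(entries):
    """Group (expression, tag) pairs into one sorted, de-duplicated list per tag."""
    by_tag = defaultdict(set)
    for expression, tag in entries:
        by_tag[tag].add(expression)
    return {tag: sorted(expressions) for tag, expressions in sorted(by_tag.items())}


# Candidate MWEs with the tag assumed for the whole unit.
# Only "a to" -> k8 is stated in the slides; the other tags are assumptions.
candidates = [
    ("a to", "k8"),
    ("vzhledem k", "k7"),
    ("Karlovy Vary", "k1"),
    ("a to", "k8"),  # duplicates collapse into one entry
]

pos_lists = build_pos_lists(candidates)
for tag, expressions in pos_lists.items():
    print(tag, expressions)
```

Keeping one list per tag makes it straightforward to review each part of speech separately before anything is added to an analyzer's lexicon.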
Conclusions • Such a revision would be painful and expensive • In fact, it would amount to a new project • It should lead to essential changes in tagging results, and not only for Czech • The question is whether techniques based on neural networks and ML will be able to deal with MWEs in a holistic and descriptively adequate way