Comments on Czech Morphological Tagset

Comments on Czech Morphological Tagset. Karel Pala, Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno. Overview: What annotation? Morphological? Syntactic? Hybrid? What can and should be changed?


Presentation Transcript


  1. Comments on Czech Morphological Tagset • Karel Pala • Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno

  2. Overview • What annotation? • Morphological? • Syntactic? • Hybrid? • What can and should be changed?

  3. Present morphological annotation • The presently used approach can be characterized as partial • Complex expressions are simply taken apart and put together again • Non-compositionality is ignored • Language reality is distorted • A holistic approach has to be considered • It mainly concerns MWEs (multi-word expressions) • They display high frequency

  4. Examples • Karlovy Vary (k1) • vzhledem k (k7 k7) • a to (k8 k3) • jen (k9) • budu číst (k5 k5)

  5. Further examples • adverbs (k6, CzTen 278,172,710, Desam 49,469) • prepositions (k7) • conjunctions (k8, CzTen 447,920,261, Desam 99,432) • particles (k9, CzTen 324,980,597, Desam 52,951) • verbs (k5, CzTen 694,012,081, Desam 126,067, analytic forms)

  6. Conflicts with tagging "a to" in Desam • "a to" is annotated in the following ways (count per tag sequence): • not tagged: 13 • k8 k3: 60 • k9 k3: 19 • k8 k9: 3 • k9 k9: 5 • total: 100
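
A distribution like the one on this slide can be obtained by counting the tag sequences assigned to every occurrence of an expression. The following is a minimal sketch, assuming the corpus is available as a list of (word, tag) pairs with `None` for untagged tokens; the function name `tag_patterns` and the toy sentence are hypothetical, and the sample is not Desam data.

```python
from collections import Counter

def tag_patterns(tokens, expression):
    """Count the tag sequences assigned to each occurrence of `expression`.

    tokens: list of (word, tag) pairs; tag is None for an untagged token.
    expression: tuple of lowercase word forms, e.g. ("a", "to").
    """
    n = len(expression)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Compare word forms case-insensitively.
        if tuple(word.lower() for word, _ in window) != expression:
            continue
        tags = [tag for _, tag in window]
        if any(tag is None for tag in tags):
            counts["not tagged"] += 1
        else:
            counts[" ".join(tags)] += 1
    return counts

# Toy sample (assumption): a few tokens with inconsistent tags for "a to".
sample = [
    ("Přišel", "k5"), ("pozdě", "k6"),
    ("a", "k8"), ("to", "k3"), ("rychle", "k6"),
    ("a", "k9"), ("to", "k3"),
    ("A", "k8"), ("to", "k3"),
    ("a", None), ("to", None),
]

counts = tag_patterns(sample, ("a", "to"))
for pattern, count in counts.most_common():
    print(pattern, count)
```

More than one key in the result signals an annotation conflict of the kind listed above.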

  7. Tagging errors • The tagging conflicts here would disappear if we treated "a to" ("and this") and similar expressions as single MWE units • "a to" in this case should be tagged as a conjunction (k8)
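
Treating such expressions as single units can be sketched as a greedy longest-match merge over a token stream. This is only an illustration under stated assumptions: the `merge_mwes` function and the lexicon are hypothetical, the k8 tag for "a to" comes from the slide, and the k7 tag for "vzhledem k" as a unit is an assumption.

```python
def merge_mwes(tokens, lexicon):
    """Merge known multi-word expressions into single tagged units.

    tokens: list of (word, tag) pairs.
    lexicon: dict mapping a tuple of lowercase word forms to one unit tag.
    """
    max_len = max((len(key) for key in lexicon), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tokens[i:i + n]
            key = tuple(word.lower() for word, _ in window)
            if key in lexicon:
                surface = " ".join(word for word, _ in window)
                merged.append((surface, lexicon[key]))
                i += n
                break
        else:
            # No MWE starts here: keep the token and its tag unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical lexicon; only the "a to" -> k8 entry is taken from the slide.
lexicon = {("a", "to"): "k8", ("vzhledem", "k"): "k7"}

tokens = [
    ("Přišel", "k5"), ("pozdě", "k6"), ("a", "k8"), ("to", "k3"),
    ("vzhledem", "k7"), ("k", "k7"), ("počasí", "k1"),
]
merged = merge_mwes(tokens, lexicon)
print(merged)
```

The sketch merges every match unconditionally; a real tagger would still have to decide whether a given occurrence is the MWE or a free combination of the two words.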

  8. A Comment on Verbs • They display the highest frequency among the mentioned parts of speech • Treating them as MWEs would require massive re-tagging • Fortunately, this can be avoided • Verbs do not cause many tagging conflicts • Thus a (dirty) compromise is acceptable

  9. A solution? • It would be desirable to go through these ambiguities manually and try to disambiguate them • However, this is obviously not realistic because of their high frequency • Perhaps we can try to collect all these POSs • keep them as separate lists and make them a part of the database of the Majka analyzer • It is a great challenge to be considered
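
Keeping the collected expressions as separate per-POS lists could look roughly like the sketch below. The tab-separated output is a hypothetical interchange format chosen for illustration, not the format of the Majka database, and the function names and entries are assumptions.

```python
import csv
import io
from collections import defaultdict

def group_by_pos(entries):
    """Group (expression, tag) pairs into one sorted, deduplicated list per tag."""
    lists = defaultdict(list)
    for expression, tag in entries:
        lists[tag].append(expression)
    return {tag: sorted(set(exprs)) for tag, exprs in sorted(lists.items())}

def write_lists(lists, handle):
    """Write one 'expression<TAB>tag' row per entry (hypothetical format)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for tag, expressions in lists.items():
        for expression in expressions:
            writer.writerow([expression, tag])

# Hypothetical entries; unit tags other than k8 for "a to" are assumptions.
entries = [
    ("a to", "k8"),
    ("vzhledem k", "k7"),
    ("Karlovy Vary", "k1"),
    ("a to", "k8"),  # duplicates are collapsed
]

lists = group_by_pos(entries)
buffer = io.StringIO()
write_lists(lists, buffer)
print(buffer.getvalue(), end="")
```

A flat list per part of speech keeps manual review simple; how such lists would actually be loaded into the analyzer is left open here.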

  10. Conclusions • Such a revision would be painful and expensive • This will mean, in fact, a new project as well • This should lead to essential changes in tagging results, and not only for the Czech language • The question is whether the techniques exploiting neural networks and ML will be able to deal with MWEs in a holistic and descriptively adequate way

  11. Thanks for your attention
