Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Željko Agić, Marko Tadić and Zdravko Dovedan
University of Zagreb
{zagic, mtadic, zdovedan}@ffzg.hr

Introduction
• morphosyntactic tagging
  • assigning word categories and subcategories to words in sentence context
• issues
  • modelling sentence context
  • handling unknown words, dealing with sparse data
• common approaches
  • rule-based, stochastic, hybrid
  • data-driven models are predominant today
  • best-performing taggers are based on SVM, CRF, HMM

Introduction
• data-driven tagging modules
  • the tagger and the data
  • data implies a tagset encoding word (sub)categories
• a solved problem?
  • state-of-the-art accuracy on English is 97-98%
  • tagsets for English have at most 100 different tags
  • 1475 different morphosyntactic tags are used in the Croatian Morphological Lexicon
  • accuracy of state-of-the-art taggers drops by ca. 10 percentage points

Tagging Croatian texts
• CroTag tagger
  • inspired by TnT and HunPos
  • trained on a 118 kw corpus manually annotated with MTE v3 tags
  • accuracy identical to theirs (96-97% EN, 85-86% HR)
  • all are highly dependent on unknown word counts
• improvements
  • using the inflectional lexicon to handle unknown words
  • tagger voting, hybridization?

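One possible reading of the lexicon-based improvement is to let the lexicon narrow the candidate tags for words the tagger never saw in training. The following is a minimal sketch under that assumption. The names `candidate_tags`, `TRAIN_TAGS`, `LEXICON` and `ALL_TAGS`, and the toy entries, are hypothetical and do not reflect CroTag's actual interface or data.

```python
# Toy data for illustration only; not taken from the real corpus or lexicon.
TRAIN_TAGS = {"kuća": {"Ncfsn"}}                  # word -> tags seen in training
LEXICON = {"kuće": {"Ncfsg", "Ncfpn", "Ncfpa"}}   # word form -> tags licensed by the lexicon
ALL_TAGS = {"Ncfsn", "Ncfsg", "Ncfpn", "Ncfpa", "Ncmsn", "Vmip3s"}


def candidate_tags(word, train_tags=TRAIN_TAGS, lexicon=LEXICON, all_tags=ALL_TAGS):
    """Return the set of tags a decoder should consider for `word`."""
    key = word.lower()
    if key in train_tags:
        # Known word: use the tags observed in the training corpus.
        return set(train_tags[key])
    if key in lexicon:
        # Unknown to the tagger, but covered by the inflectional lexicon.
        return set(lexicon[key])
    # Truly unknown word: fall back to the full tagset.
    return set(all_tags)
```

With this lookup, a word missing from the training corpus but present in the lexicon is scored against a handful of tags rather than the whole tagset, which is where the lexicon would be expected to help most.
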
From another perspective...
• goals of tagging
  • reaching perfect accuracy on the full tagset, or
  • making large-scale NLP systems perform better?
• specific requirements
  • users and systems always have them
  • example: named entity normalization in Croatian
    Is it Ivo (m.) or Iva (f.) Sanader?
  • specific tasks may require specific tagset design
  • keeping speed and memory footprint
  • reducing tagset size means raising accuracy

Reducing the tagset
• MulText East version 3
  • positional tagset, letters encode categories
  • example: Ncmsn = noun, common, masculine, singular, nominative
• the subsets
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)

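Because the tagset is positional, a reduction amounts to masking letter positions. The sketch below illustrates this for noun tags only, assuming the noun attribute order type, gender, number, case after the part-of-speech letter (as in Ncmsn). The function name `reduce_noun_tag` is hypothetical, and collapsing every non-noun tag to its part-of-speech letter is a simplification, not the authors' procedure.

```python
# Noun attributes in positional order, following the part-of-speech letter.
NOUN_ATTRS = ("type", "gender", "number", "case")


def reduce_noun_tag(tag, keep):
    """Keep the POS letter plus the noun attributes named in `keep`.

    Dropped attributes are masked with '-', and trailing '-' are trimmed.
    Non-noun tags are collapsed to their POS letter (a simplification).
    """
    if not tag.startswith("N"):
        return tag[:1]
    chars = [tag[0]]
    for i, attr in enumerate(NOUN_ATTRS, start=1):
        value = tag[i] if i < len(tag) else "-"
        chars.append(value if attr in keep else "-")
    return "".join(chars).rstrip("-")


print(reduce_noun_tag("Ncmsn", {"type", "gender", "number", "case"}))  # Ncmsn
print(reduce_noun_tag("Ncmsn", {"type", "gender", "number"}))          # Ncms
print(reduce_noun_tag("Ncmsn", {"type"}))                              # Nc
print(reduce_noun_tag("Ncmsn", set()))                                 # N
```

For a noun, the four calls correspond roughly to subsets 3, 4, 5 and 6: each step discards one more layer of information, so fewer distinct tags remain.
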
More results
• adjectives, nouns and pronouns
  • the categories most difficult to tag for Croatian
  • combination of frequency and tags used
  • maybe these are the most important to tag accurately?
F1-measures on adjectives, nouns and pronouns

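A per-category F1-measure can be computed in several ways, and the slides do not say which one was used. The sketch below shows one plausible definition, stated here as an assumption: a token is a true positive for a category when the full predicted tag equals the gold tag, precision is taken over tokens predicted in that category, and recall over tokens whose gold tag is in that category. The function name `f1_by_category` and the toy tag sequences are hypothetical.

```python
from collections import Counter


def f1_by_category(gold, predicted, categories=("A", "N", "P")):
    """Per-category (precision, recall, F1), keyed by part-of-speech letter."""
    tp, n_gold, n_pred = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        n_gold[g[0]] += 1          # token belongs to the gold category
        n_pred[p[0]] += 1          # token was assigned to the predicted category
        if g == p:                 # full tag must match to count as correct
            tp[g[0]] += 1
    scores = {}
    for cat in categories:
        precision = tp[cat] / n_pred[cat] if n_pred[cat] else 0.0
        recall = tp[cat] / n_gold[cat] if n_gold[cat] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[cat] = (precision, recall, f1)
    return scores


# Toy sequences for illustration only; not real tagger output.
gold = ["Ncmsn", "Afpmsn", "Ncfsa", "Vmip3s"]
pred = ["Ncmsn", "Ncmsn", "Ncfsg", "Vmip3s"]
scores = f1_by_category(gold, pred)
```

In the toy run, nouns have one fully correct tag out of three predicted and two gold noun tokens, so both a wrong case (Ncfsg for Ncfsa) and a wrong category (Ncmsn for Afpmsn) lower the noun score.
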
Conclusions
• results are as expected
  • reducing tagset size raises tagging accuracy
  • sacrificing information for efficiency
• reductions are illustrative
  • careful tagset design is required with regard to requirements
• further work
  • as mentioned: reaching perfect accuracy on the full tagset, or making large-scale NLP systems perform better?

Your questions?
Computational Linguistic Models and Language Technologies for Croatian
rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr