Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words Gertrud Faaß faaszgd@ims.uni-stuttgart.de Ulrich Heid heid@ims.uni-stuttgart.de Elsabé Taljard elsabe.taljard@up.ac.za DJ Prinsloo danie.prinsloo@up.ac.za
This Talk • Prologue • Challenges for tagging Sotho texts • Objectives • Descriptive state of the art for tagging of Sotho texts • Tools • Tagsets • The ambiguity problem • Methodology • Results • Conclusions & future work
Nine Official Bantu Languages of SA • Sotho Group • Northern Sotho / Sepedi • Tswana • Southern Sotho • Nguni Group • Zulu • Swati • Xhosa • Ndebele • Venda and Tsonga
Concordial agreement – Northern Sotho (Taljard and Bosch 2005)
Challenges for tagging • Ambiguity, for example: • function words: -a- being 9-way ambiguous, -go- up to 30-way (11, 6, 5, …) ambiguous • Unknown words (N+V) • noun derivation: toropo (town) -> toropong (in/at/to town) • verb derivation: next slides
Challenges: unknown words • Agglutinating languages: extensive use of affixes • Example: rekišeditšwe ‘was / were sold for’ < rek- ‘buy’ (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)
Examples of suffixes and combinations for a single verb • ROOTetšane, ROOTetšanwa, ROOTetšanwe, ROOTiša, ROOTišitše, ROOTišwa, ROOTišitšwe, ROOTišana, ROOTišane, ROOTišanwa, ROOTišanwe, ROOTišega, ROOTišegile, ROOTišetša, ROOTišeditše, ROOTišetšwa, ROOTišeditšwe, ROOTišetšana, ROOTišetšane, ROOTišetšanwa, ROOTišetšanwe, ROOTišiša, ROOTišišitše, ROOTišišwa, ROOTišišitšwe, ROOTišišana, ROOTišišane, ROOTišišanwa, ROOTišišanwe, ROOToga, ROOTogile, ROOTogwa, ROOTogilwe, ROOTogana, ROOTogane, ROOToganwa, ROOToganwe, ROOTogela, ROOTogetše, ROOTogelwa, ROOTogetšwe, ROOTola, ROOTotše, ROOTolwa, ROOTotšwe, ROOTolana, ROOTolane, ROOTolanwa, ROOTolanwe, ROOTolega, ROOTolegile, ROOTolela, ROOToletše, ROOTolelwa, ROOToletšwe, ROOTolelana, ROOTolelane, ROOTolelanwa, ROOTolelanwe, ROOTolla, ROOTolotše, ROOTollwa, ROOTolotšwe, ROOTollana, ROOTollane, ROOTollanwa, ROOTollanwe, ROOTollega, ROOTollegile, ROOTollela, ROOTolletše, ROOTollelwa, ROOTolletšwe, ROOTollelana, ROOTollelane, ROOTollelanwa, ROOTollelanwe, ROOTolliša, ROOTollišitše, ROOTollišwa, ROOTollišitšwe, ROOTollišana, ROOTollišane, ROOTollišanwa, ROOTollišanwe, ROOTologa, ROOTologile, ROOTologana, ROOTologane, ROOTologanwa, ROOTologanwe, ROOTološa, ROOTološitše, ROOTološwa, ROOTološitšwe, ROOTološana, ROOTološane, ROOTološanwa, ROOTološanwe, ROOTološetša, ROOTološeditše, ROOTološetšwa, ROOTološeditšwe, ROOTološetšana, ROOTološetšane, ROOTološetšanwa, ROOTološetšanwe, ROOToša, ROOTošitše, ROOTošwa, ROOTošitšwe, ROOTošetša, ROOTošeditše, ROOTošetšwa, ROOTošeditšwe, ROOTošetšana, ROOTošetšane, ROOTošetšanwa, ROOTošetšanwe
Solution for unknown verbs and nouns • Verb guesser: detection of • longest match suffix combinations • occurrences in corpora • Noun guesser: matching of • singular/plural-forms • nominal suffixes • occurrences in corpora
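A minimal sketch of what such a longest-match suffix guesser could look like, assuming a hand-listed suffix inventory (drawn from the combinations on the previous slide) and a simple corpus-frequency check; the function, the suffix list and the threshold are illustrative assumptions, not the actual guesser used in this work:

```python
# Illustrative sketch of a longest-match suffix guesser for unknown verbs.
# The suffix inventory and the corpus check are simplified assumptions.

VERB_SUFFIXES = [
    "išeditšwe", "išetšwa", "išitšwe", "išetša", "išitše",
    "etšanwe", "etšanwa", "etšane", "išwa", "iša", "ela", "wa", "a", "e",
]
# Sort once so that the longest suffix is tried first (longest match).
VERB_SUFFIXES.sort(key=len, reverse=True)


def guess_verb(token, corpus_frequencies, min_freq=1):
    """Return (root, suffix) if the token looks like a derived verb form.

    corpus_frequencies: dict mapping word forms to corpus counts, used to
    require that the candidate form is actually attested in the corpora.
    """
    if corpus_frequencies.get(token, 0) < min_freq:
        return None
    for suffix in VERB_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            root = token[: -len(suffix)]
            return root, suffix
    return None


# Example: rekišeditšwe 'was/were sold for' -> root 'rek-' plus suffix cluster.
print(guess_verb("rekišeditšwe", {"rekišeditšwe": 3}))  # ('rek', 'išeditšwe')
```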
Objectives • Tagging with a detailed tagset including class numbers • Nouns, adjectives, pronouns, concords, demonstratives • Disambiguation • Motivation: tagging used as preprocessing for: • Chunking, parsing • Lexicography (tagging relatively large corpora, e.g. PSC) • Detailed linguistic research (e.g. grammar development) • Information extraction
State of the art for tagging: Sotho languages • Comparison of tagsets and tools is hardly possible • Different applications of tagged material (linguistic description, lexicography, parsing, etc.) • Different numbers of tags • Differences in granularity
Descriptive state of the art for tagging: Sotho languages • Tools: • Full: • De Schryver and de Pauw (2007): Northern Sotho tagger (statistical) • Partial: • Kotzé (several publications, e.g. 2008): verbal and nominal segments (finite state)
Descriptive state of the art for tagging: Sotho languages • Applications of tagsets: • De Schryver and de Pauw (2007): used for lexicography • Van Rooy and Pretorius (2003): linguistic description of Setswana • Taljard et al. (2008): morphosyntactic and general linguistic description
The ambiguity problem • -a-, -go-: see handout for possible readings • Local context may not identify the noun class of a subject concord: (Masogana) … A nwa bjalwa / gloss: CS06 drink beer / '(Young men) … They drink beer.'
The ambiguity problem: possible solutions • Dependent on objectives • Flat tagset ignoring irrelevant details (cf. handout for -go-) • Layered tagset: granularity
Tagset (cf. handout) • Level 1 • Noun (N) • Subject concord (CS), Object concord (CO) • Pronouns (PRO) • Level 2 • emphatic (only for pronouns): EMP • possessive (ditto): POSS • Level 3 • Classes -> N.01a, N.01, N.02, N.03, …, PERS, etc. • Example: noun of class 1 = N.01; possessive pronoun of class 6 = PRO.POSS.06
RF-tagger technology (cf. Schmid and Laws 2008) • Hidden Markov Model (HMM) tagger • Additional external lexicon • Large, fine-grained tagsets • Several levels of description, e.g. German articles: ART.Definiteness.Case.Number.Gender • Calculates joint (product) probabilities across the levels
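To illustrate the "joint (product) probabilities" idea, a toy sketch of how a layered tag such as PRO.POSS.06 could be scored as a product of per-level probabilities; the chain-rule factorisation and the numbers are illustrative assumptions, not the RF-tagger's actual estimation method:

```python
# Hedged sketch: scoring a layered tag such as "PRO.POSS.06" as a product of
# per-level (conditional) probabilities. The factorisation and the numbers
# below are illustrative only.

def layered_tag_probability(level_probs):
    """Multiply conditional per-level probabilities into one joint score."""
    joint = 1.0
    for p in level_probs:
        joint *= p
    return joint

# P(PRO) * P(POSS | PRO) * P(06 | PRO, POSS) -- toy numbers
print(layered_tag_probability([0.20, 0.35, 0.10]))  # roughly 0.007
```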
Training corpus • 45,000 manually annotated tokens (word forms) from two text types • Not balanced: 25,000 tokens from a novel, two sets of 10,000 tokens from dissertations
Comparing taggers on manually annotated data • Tree-Tagger (Schmid 1994) • TnT Tagger (Brants 2000) • MBT Tagger (Daelemans et al. 2007) • RF-Tagger (Schmid and Laws 2008)
Effects of the size of the training corpus • Adding further training data is no longer necessary
Effects of highly polysemous function words • Distribution problem • Probability estimates for scarce labels become unreliable • a: PART (45 occurrences) vs. CS.01 (1,182 occurrences) • 91% incorrect labeling of PART • Detailed discussion: see handout for -a-, pages 2 and 4
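The imbalance can be made concrete with the counts above. Assuming plain relative-frequency estimation of the lexical probabilities (a simplification; the taggers apply smoothing), the rare PART reading of a receives only a few percent of the probability mass:

```python
# Toy illustration of the sparse-label problem for the word form "a":
# with plain relative-frequency estimation (an assumption), the rare PART
# reading receives a very small lexical probability.
count_part, count_cs01 = 45, 1182
total = count_part + count_cs01
p_part = count_part / total
p_cs01 = count_cs01 / total
print(f"P(PART | 'a')  ~ {p_part:.3f}")   # ~0.037
print(f"P(CS.01 | 'a') ~ {p_cs01:.3f}")   # ~0.963
# Unless the context is very informative, CS.01 wins, which is consistent
# with the high error rate reported for PART on this form.
```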
Alternative proposal: hybrid taggers (Spoustová et al. 2007) • Combine rule-based tagging with statistical tagging • For Northern Sotho: contextual disambiguation works well with the RF-tagger if unambiguous indicators are available • Disambiguating macros (using the same indicators) hence have little effect; a schematic example is sketched below • Ambiguous contexts are hard to account for either way: need for parsing?
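As a schematic illustration of such a disambiguation macro, the following sketch forces one reading of the ambiguous form a when an unambiguous indicator precedes it; the indicator condition, the tag names and the example sentence are invented for illustration and do not reproduce the actual macros that were evaluated:

```python
# Hypothetical sketch of a pre-tagging disambiguation macro: when an
# unambiguous indicator appears in the local context, the ambiguous function
# word is forced to one reading before statistical tagging. The condition
# and tags below are invented for illustration only.

def apply_macro(tokens, pretags):
    """tokens: list of word forms; pretags: parallel list of tags or None."""
    for i, (tok, tag) in enumerate(zip(tokens, pretags)):
        if tok == "a" and tag is None:
            prev_tag = pretags[i - 1] if i > 0 else None
            # Invented condition: an unambiguous indicator to the left
            # licenses exactly one reading of "a".
            if prev_tag == "N.01":
                pretags[i] = "CS.01"
    return pretags


tokens = ["monna", "a", "nwa", "bjalwa"]
pretags = ["N.01", None, None, None]
print(apply_macro(tokens, pretags))  # ['N.01', 'CS.01', None, None]
```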
Results: 10-fold cross-validation • Without guessers (to simulate similar conditions for TnT and MBT): • RF-tagger: 91.00% • TnT tagger: 91.01% • MBT: 87.68% • With guessers (several thousand nouns and verbs added to the lexicon): • Tree-Tagger: 92.46% • RF-tagger: 94.16%
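For reference, a minimal sketch of the 10-fold cross-validation protocol behind figures like those above; train_tagger and tag_accuracy are hypothetical placeholders for whichever tagger is being evaluated, not real library calls:

```python
# Minimal sketch of 10-fold cross-validation over a tagged corpus.
# train_tagger() and tag_accuracy() are placeholders for the tagger under
# test (TnT, MBT, Tree-Tagger or RF-tagger).

def cross_validate(sentences, train_tagger, tag_accuracy, k=10):
    folds = [sentences[i::k] for i in range(k)]  # simple interleaved split
    scores = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tagger(train)
        scores.append(tag_accuracy(model, test))
    return sum(scores) / k  # mean accuracy over the 10 folds
```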
Conclusions • Different intended uses lead to different tagsets (granularity, number of tags) • Including noun class information is essential for general linguistic research, e.g. grammar development, and for applications such as chunking/parsing • The RF-tagger performs well for our layered tagset with the existing amount of training data (45,000 tokens): over 94% correct • Ambiguous contexts combined with the sparse data problem lead to a high error rate for statistical tagging; this is not likely to be solvable with macros • Chunking / parsing might lead to a more adequate solution for this problem
Future work • Apply the RF-tagger to the PSC corpus • Evaluate the results • Instead of preprocessing rules, partial postprocessing (e.g. chunking, parsing) may make sense