Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text


Presentation Transcript


  1. Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text Silke Scheible, Richard Jason Whitt, Martin Durrell, and Paul Bennett The GerManC project School of Languages, Linguistics, and Cultures University of Manchester (UK)

  2. Overview • Motivation • The GerManC corpus • POS-tagger and tagset • Challenges • Results

  3. Motivation • Goal: • POS-tagged version of GerManC corpus

  4. Motivation • Goal: • POS-tagged version of GerManC corpus • Problems: • No specialised tagger available for EMG • Limited funds: Manual annotation not feasible for whole corpus

  5. Motivation • Goal: • POS-tagged version of GerManC corpus • Problems: • No specialised tagger available for EMG • Limited funds: Manual annotation not feasible for whole corpus • Question: • How well does an ‘off-the-shelf’ tagger for modern German perform on Early Modern German data?

  6. Motivation • Tagger evaluation requires gold standard data

  7. Motivation • Tagger evaluation requires gold standard data • Idea: • Develop gold-standard subcorpus of GerManC • Use subcorpus to test and adapt modern NLP tools • Create historical text processing pipeline

  8. Motivation • Tagger evaluation requires gold standard data • Idea: • Develop gold-standard subcorpus of GerManC • Use subcorpus to test and adapt modern NLP tools • Create historical text processing pipeline • Results useful for other small humanities-based projects wishing to add POS annotations to EMG data

  9. The GerManC corpus

  10. The GerManC corpus • Purpose: Studies of development and standardisation of German language • Texts published between 1650 and 1800 • Sample corpus (2,000 words per text) • Total corpus size: ca. 1 million words • Aims to be “representative”

  11. The GerManC corpus • Eight genres

  12. The GerManC corpus • Three periods: 1650-1700, 1700-1750, 1750-1800

  13. The GerManC corpus • Five regions: North German, East Central German, West Central German, East Upper German, West Upper German

  14. The GerManC corpus • Three 2,000-word files per genre/period/region • Total size: ca. 1 million words

  15. Gold-standard subcorpus: GerManC-GS • One 2,000-word file per genre and period from North German region → 24 files • > 50,000 tokens • Annotated by two historical linguists • Gold standard POS tags, lemmas, and normalised word forms

  16. POS-tagger • TreeTagger (Schmid, 1994) • Statistical, decision tree-based POS tagger • Parameter file for modern German supplied with the tagger • Trained on German newspaper corpus • STTS tagset
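A minimal sketch of how the TreeTagger can be run from Python on pre-tokenised text, assuming a local installation; the paths and the example tokens are illustrative, not part of the GerManC pipeline.

```python
# Tag a small pre-tokenised sample with the TreeTagger command-line binary.
# The install paths below are assumptions about a local setup.
import subprocess

TREETAGGER_BIN = "/opt/treetagger/bin/tree-tagger"   # assumed install location
GERMAN_PAR = "/opt/treetagger/lib/german.par"        # modern German parameter file (STTS)

tokens = ["Comet", "über", "den", "Zaun", "."]        # one token per input line

result = subprocess.run(
    [TREETAGGER_BIN, "-token", "-lemma", "-quiet", GERMAN_PAR],
    input="\n".join(tokens),
    capture_output=True,
    text=True,
    check=True,
)

# Output format: token <TAB> STTS tag <TAB> lemma
for line in result.stdout.splitlines():
    print(line)
```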

  17. STTS-EMG • PIAT (merged with PIDAT): Indefinite determiner, as in ‘viele solche Bemerkungen’ (‘many such remarks’)

  18. STTS-EMG • NA: Adjectives used as nouns, as in ‘der Gesandte’ (‘the ambassador’)

  19. STTS-EMG • PAVREL: Pronominal adverb used as relative, as in ‘die Puppe, damit sie spielt’ (‘the doll with which she plays’) • PTKREL: Indeclinable relative particle, as in ‘die Fälle, so aus Schwachheit entstehen’ (‘the cases which arise from weakness’)

  20. STTS-EMG • PWAVREL: Interrogative adverb used as relative, as in ‘der Zaun, worüber sie springt’ (‘the fence over which she jumps’) • PWREL: Interrogative pronoun used as relative, as in ‘etwas, was er sieht’ (‘something which he sees’)
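The STTS-EMG modifications from the preceding slides can be collected into a simple lookup, for example to validate annotated files; the descriptions below paraphrase the slides, and the validation helper is a hypothetical illustration.

```python
# STTS-EMG: modifications to the standard STTS tagset described above.
STTS_EMG_EXTENSIONS = {
    "PIAT":    "indefinite determiner (PIDAT merged into PIAT), e.g. 'viele solche Bemerkungen'",
    "NA":      "adjective used as noun, e.g. 'der Gesandte'",
    "PAVREL":  "pronominal adverb used as relative, e.g. 'die Puppe, damit sie spielt'",
    "PTKREL":  "indeclinable relative particle, e.g. 'die Fälle, so aus Schwachheit entstehen'",
    "PWAVREL": "interrogative adverb used as relative, e.g. 'der Zaun, worüber sie springt'",
    "PWREL":   "interrogative pronoun used as relative, e.g. 'etwas, was er sieht'",
}

def is_valid_emg_tag(tag, stts_tags):
    """Accept a tag if it is standard STTS or one of the EMG additions."""
    return tag in stts_tags or tag in STTS_EMG_EXTENSIONS
```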

  21. POS-tagging in GerManC-GS • New categories account for 2% of all tokens • IAA on POS-tagging task: 91.6%

  22. Challenges: Tokenisation issues • Clitics: • hastu: hast du (‘have you’) - wirstu: wirst du (‘will you’)

  23. Challenges: Tokenisation issues • Clitics: • has|tu: hast du (‘have you’) - wirs|tu: wirst du (‘will you’)

  24. Challenges: Tokenisation issues • Clitics: • has|tu: hast du (‘have you’) - wirs|tu: wirst du (‘will you’) • Multi-word tokens: • obgleich vs. ob gleich (‘even though’)

  25. Challenges: Tokenisation issues • Clitics: • has|tu: hast du (‘have you’) - wirs|tu: wirst du (‘will you’) • Multi-word tokens: • obgleich/KOUS vs. ob/KOUS gleich/ADV (‘even though’)
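A toy illustration of the two tokenisation issues above, not the GerManC tokeniser itself: a small lookup splits fused clitic forms, and a second lookup shows one possible treatment of conjunctions written apart in EMG (the slides present both tagging options for such pairs).

```python
# Clitic forms that should be split into verb + pronoun (examples from the slide).
CLITIC_SPLITS = {
    "hastu": ["hast", "du"],
    "wirstu": ["wirst", "du"],
}

# One possible treatment of 'ob ... gleich'-type pairs: re-join them into a
# single token; the alternative is to keep two tokens tagged KOUS + ADV.
MULTIWORD = {("ob", "gleich"): "obgleich"}

def split_clitics(tokens):
    """Replace fused verb+pronoun forms with their two-token reading."""
    out = []
    for tok in tokens:
        out.extend(CLITIC_SPLITS.get(tok.lower(), [tok]))
    return out

def merge_multiword(tokens):
    """Merge known separated conjunction pairs into a single token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTIWORD:
            out.append(MULTIWORD[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(split_clitics(["hastu", "das", "gesehen", "?"]))
# ['hast', 'du', 'das', 'gesehen', '?']
```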

  26. Challenges: Spelling variation • Spelling not standardised: • Comet → Komet • auff → auf • nachdeme → nachdem • koͤmpt → kommt • Bothenbrodt → Botenbrot • differiret → differiert • beßer → besser • kehme → käme • trucken → trockenen • gepressett → gepreßt • büxen → Büchsen
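The variant pairs on this slide amount to a normalisation lookup; a hand-written table like the sketch below only covers listed forms, and a real normaliser would need rules or a trained model (cf. Jurish, 2010).

```python
# Normalisation lookup built from the spelling variants listed above.
NORMALISATION_TABLE = {
    "Comet": "Komet",
    "auff": "auf",
    "nachdeme": "nachdem",
    "koͤmpt": "kommt",
    "Bothenbrodt": "Botenbrot",
    "differiret": "differiert",
    "beßer": "besser",
    "kehme": "käme",
    "trucken": "trockenen",
    "gepressett": "gepreßt",
    "büxen": "Büchsen",
}

def normalise(token):
    """Return the modern form if the token is a known variant, else the token itself."""
    return NORMALISATION_TABLE.get(token, token)
```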

  27. Challenges: Spelling variation • All spelling variants in GerManC-GS normalised to a modern standard → Assess what effect spelling variation has on the performance of automatic tools → Help improve automated processing? • Important for: • Automatic tools (POS tagger!) • Accurate corpus search

  28. Challenges: Spelling variation Proportion of normalised word tokens plotted against time

  29. Questions • What is the “off-the-shelf” performance of the TreeTagger on historical data from the EMG period? • Can the results be improved by running the tool on normalised data?

  30. Results TreeTagger accuracy on original vs. normalised input
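A minimal sketch of the comparison behind this slide: per-token accuracy of the tagger output against the gold standard, computed once for the original spelling and once for the normalised input. The variable names are illustrative assumptions about how the aligned data is held.

```python
def accuracy(predicted_tags, gold_tags):
    """Fraction of tokens whose predicted tag matches the gold-standard tag."""
    assert len(predicted_tags) == len(gold_tags)
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)

# tags_original / tags_normalised: tagger output on the two input versions
# gold: gold-standard POS tags from GerManC-GS, aligned token by token
# print(f"original:   {accuracy(tags_original, gold):.3f}")
# print(f"normalised: {accuracy(tags_normalised, gold):.3f}")
```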

  31. Improvement through normalisation over time Tagger performance plotted against publication date

  32. Effects of spelling normalisation on POS tagger performance • For normalised tokens: effect of using original (O) vs. normalised (N) input on tagger accuracy • +: correctly tagged; -: incorrectly tagged
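The four-way breakdown on this slide can be reproduced as a simple contingency count over normalised tokens, assuming the two tagger runs and the gold tags are available as aligned lists (an assumption, as above).

```python
from collections import Counter

def on_contingency(tags_original, tags_normalised, gold):
    """Count O+/N+, O+/N-, O-/N+ and O-/N- outcomes per token."""
    counts = Counter()
    for o, n, g in zip(tags_original, tags_normalised, gold):
        key = ("O+" if o == g else "O-") + "/" + ("N+" if n == g else "N-")
        counts[key] += 1
    return counts
```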

  33. Comparison with “modern” results • Performance of TreeTagger on modern data: ca. 97% (Schmid, 1995) • Current results seem low

  34. Comparison with “modern” results • Performance of TreeTagger on modern data: ca. 97% (Schmid, 1995) • Current results seem low • But: • Modern accuracy figure: evaluation of tagger on the text type it was developed on (newspaper text)

  35. Comparison with “modern” results • Performance of TreeTagger on modern data: ca. 97% (Schmid, 1995) • Current results seem low • But: • Modern accuracy figure: evaluation of tagger on the text type it was developed on (newspaper text) • IAA higher for modern German (98.6%)

  36. Conclusion • Substantial amount of manual post-editing required • Normalisation layer can improve results by 10%, but so far only half of all annotations have a positive effect

  37. Future work • Adapt normalisation scheme to account for more cases • Automate normalisation (Jurish, 2010) • Retrain state-of-the-art POS taggers → Evaluation? • Provide detailed information about annotation quality to research community

  38. Thank you! Martin.Durrell@manchester.ac.uk Paul.Bennett@manchester.ac.uk Silke.Scheible@manchester.ac.uk Richard.Whitt@manchester.ac.uk http://tinyurl.com/germanc
