Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text Silke Scheible, Richard Jason Whitt, Martin Durrell, and Paul Bennett The GerManC project School of Languages, Linguistics, and Cultures University of Manchester (UK)
Overview • Motivation • The GerManC corpus • POS-tagger and tagset • Challenges • Results
Motivation • Goal: • POS-tagged version of GerManC corpus • Problems: • No specialised tagger available for EMG • Limited funds: Manual annotation not feasible for whole corpus • Question: • How well does an ‘off-the-shelf’ tagger for modern German perform on Early Modern German data?
Motivation • Tagger evaluation requires gold standard data • Idea: • Develop gold-standard subcorpus of GerManC • Use subcorpus to test and adapt modern NLP tools • Create historical text processing pipeline • Results useful for other small humanities-based projects wishing to add POS annotations to EMG data
The GerManC corpus • Purpose: Studies of development and standardisation of German language • Texts published between 1650 and 1800 • Sample corpus (2,000 words per text) • Total corpus size: ca. 1 million words • Aims to be “representative”
The GerManC corpus • Eight genres
The GerManC corpus • Three periods 1650-1700 1700-1750 1750-1800
The GerManC corpus • Five regions North German East Central German West Central German East Upper German West Upper German
The GerManC corpus • Three 2,000-word files per genre/period/region • Total size: ca. 1 million words
Gold-standard subcorpus: GerManC-GS • One 2,000-word file per genre and period from the North German region → 24 files • > 50,000 tokens • Annotated by two historical linguists • Gold standard POS tags, lemmas, and normalised word forms
POS-tagger • TreeTagger (Schmid, 1994) • Statistical, decision tree-based POS tagger • Parameter file for modern German supplied with the tagger • Trained on German newspaper corpus • STTS tagset
STTS-EMG • PIAT (merged with PIDAT): Indefinite determiner, as in ‘viele solche Bemerkungen’ (‘many such remarks’)
STTS-EMG • NA: Adjectives used as nouns, as in ‘der Gesandte’ (‘the ambassador’)
STTS-EMG • PAVREL: Pronominal adverb used as relative, as in ‘die Puppe, damit sie spielt’ (‘the doll with which she plays’) • PTKREL: Indeclinable relative particle, as in ‘die Fälle, so aus Schwachheit entstehen’ (‘the cases which arise from weakness’)
STTS-EMG • PWAVREL: Interrogative adverb used as relative, as in ‘der Zaun, worüber sie springt’ (‘the fence over which she jumps’) • PWREL: Interrogative pronoun used as relative, as in ‘etwas, was er sieht’ (‘something which he sees’)
POS-tagging in GerManC-GS • New categories account for 2% of all tokens • IAA on POS-tagging task: 91.6%
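The IAA figure can be read as the proportion of tokens on which both annotators assign the same POS tag (the slides do not say whether a chance-corrected measure such as kappa was used, so the sketch below shows raw percentage agreement; the tag sequences are invented toy data, not GerManC annotations):

```python
def percent_agreement(tags_a, tags_b):
    """Proportion of tokens on which two annotators assign the same POS tag."""
    assert len(tags_a) == len(tags_b)
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)

# Toy example (invented tags): annotators disagree on one token out of five.
ann1 = ["ART", "NA", "VVFIN", "PTKREL", "NN"]
ann2 = ["ART", "NN", "VVFIN", "PTKREL", "NN"]
print(percent_agreement(ann1, ann2))  # 0.8
```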
Challenges: Tokenisation issues • Clitics: • has|tu: hast du (‘have you’) - wirs|tu: wirst du (‘will you’) • Multi-word tokens: • obgleich/KOUS vs. ob/KOUS gleich/ADV (‘even though’)
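The clitic cases above amount to a lookup-driven rewrite step in the tokeniser; a minimal sketch with a hand-made split table (the table entries and function name are illustrative, not the project's actual rules, and a real EMG tokeniser would also need context to decide the multi-word cases like obgleich):

```python
# Illustrative clitic-splitting table; a real EMG tokeniser needs far
# larger resources and context-sensitive rules for multi-word tokens.
CLITIC_SPLITS = {"hastu": ["hast", "du"], "wirstu": ["wirst", "du"]}

def retokenise(tokens):
    """Split known clitic forms (e.g. has|tu -> hast du) into separate tokens."""
    out = []
    for tok in tokens:
        out.extend(CLITIC_SPLITS.get(tok.lower(), [tok]))
    return out

print(retokenise(["hastu", "es", "gesehen"]))  # ['hast', 'du', 'es', 'gesehen']
```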
Challenges: Spelling variation • Spelling not standardised: • Comet → Komet • auff → auf • nachdeme → nachdem • koͤmpt → kommt • Bothenbrodt → Botenbrot • differiret → differiert • beßer → besser • kehme → käme • trucken → trockenen • gepressett → gepreßt • büxen → Büchsen
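A normalisation layer for variants like these can be sketched as a lexicon lookup with pattern-based fallback rules for unseen forms. The lexicon entries below are taken from the examples above, but the regex rules are toy illustrations, not the project's actual normalisation scheme:

```python
import re

# Known variant -> modern form pairs (from the examples above).
LEXICON = {"auff": "auf", "Comet": "Komet", "nachdeme": "nachdem",
           "differiret": "differiert", "beßer": "besser"}

# Toy fallback rules for unseen variants; a real scheme needs many more.
RULES = [(re.compile(r"ff\b"), "f"),       # auff -> auf
         (re.compile(r"iret\b"), "iert")]  # differiret -> differiert

def normalise(token):
    """Return the modern form via lexicon lookup, else apply fallback rules."""
    if token in LEXICON:
        return LEXICON[token]
    for pattern, repl in RULES:
        token = pattern.sub(repl, token)
    return token

print(normalise("auff"), normalise("grieff"))  # auf grief
```

Lookup handles attested variants exactly; the rules only catch recurrent orthographic patterns and leave already-modern tokens untouched.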
Challenges: Spelling variation • All spelling variants in GerManC-GS normalised to a modern standard → assess what effect spelling variation has on the performance of automatic tools → help improve automated processing? • Important for: • Automatic tools (POS tagger!) • Accurate corpus search
Challenges: Spelling variation Proportion of normalised word tokens plotted against time
Questions • What is the “off-the-shelf” performance of the TreeTagger on historical data from the EMG period? • Can the results be improved by running the tool on normalised data?
Results TreeTagger accuracy on original vs. normalised input
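The comparison in the table above comes down to scoring the tagger's output against the gold-standard tags for both input variants; a self-contained sketch (the tag sequences are invented, not actual GerManC-GS results):

```python
def accuracy(predicted, gold):
    """Token-level POS tagging accuracy against a gold standard."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Invented gold tags and tagger outputs for original vs. normalised input.
gold      = ["ART", "NN", "VVFIN", "ADV", "NN"]
tags_orig = ["ART", "NE", "VVFIN", "ADJD", "NN"]  # tagger on raw EMG spelling
tags_norm = ["ART", "NN", "VVFIN", "ADJD", "NN"]  # tagger on normalised input
print(accuracy(tags_orig, gold), accuracy(tags_norm, gold))  # 0.6 0.8
```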
Improvement through normalisation over time Tagger performance plotted against publication date
Effects of spelling normalisation on POS tagger performance • For normalised tokens: effect of using original (O) vs. normalised (N) input on tagger accuracy (+: correctly tagged; -: incorrectly tagged)
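The O/N cross-classification above sorts each normalised token into four cells by whether the tagger is correct on original and on normalised input (e.g. "O- N+" means normalisation fixed the tag). Counting the cells is straightforward; the data below is invented for illustration:

```python
from collections import Counter

def effect_cells(gold, tags_orig, tags_norm):
    """Count tokens per (O, N) correctness cell; 'O- N+' = fixed by normalisation."""
    cells = Counter()
    for g, o, n in zip(gold, tags_orig, tags_norm):
        key = ("O+" if o == g else "O-") + " " + ("N+" if n == g else "N-")
        cells[key] += 1
    return cells

# Invented example: one token fixed, one broken, one wrong either way.
gold = ["NN", "VVFIN", "ADV", "NN"]
orig = ["NE", "VVFIN", "ADJD", "NN"]
norm = ["NN", "VVFIN", "ADJD", "NE"]
print(effect_cells(gold, orig, norm))
```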
Comparison with “modern” results • Performance of TreeTagger on modern data: ca. 97% (Schmid, 1995) • Current results seem low • But: • Modern accuracy figure: evaluation of tagger on the text type it was developed on (newspaper text) • IAA higher for modern German (98.6%)
Conclusion • Substantial amount of manual post-editing required • Normalisation layer can improve results by 10%, but so far only half of all normalisations have a positive effect
Future work • Adapt normalisation scheme to account for more cases • Automate normalisation (Jurish, 2010) • Retrain state-of-the-art POS taggers → evaluation? • Provide detailed information about annotation quality to research community
Thank you! Martin.Durrell@manchester.ac.uk Paul.Bennett@manchester.ac.uk Silke.Scheible@manchester.ac.uk Richard.Whitt@manchester.ac.uk http://tinyurl.com/germanc