Evaluating an ‘off-the-shelf’ POS-tagger on Early Modern German text Silke Scheible, Richard Jason Whitt, Martin Durrell, and Paul Bennett The GerManC project School of Languages, Linguistics, and Cultures University of Manchester (UK)
Overview • Motivation • The GerManC corpus • POS-tagger and tagset • Challenges • Results
Motivation • Goal: • POS-tagged version of GerManC corpus • Problems: • No specialised tagger available for EMG • Limited funds: Manual annotation not feasible for whole corpus • Question: • How well does an ‘off-the-shelf’ tagger for modern German perform on Early Modern German data?
Motivation • Tagger evaluation requires gold standard data • Idea: • Develop gold-standard subcorpus of GerManC • Use subcorpus to test and adapt modern NLP tools • Create historical text processing pipeline • Results useful for other small humanities-based projects wishing to add POS annotations to EMG data
The GerManC corpus • Purpose: Studies of development and standardisation of German language • Texts published between 1650 and 1800 • Sample corpus (2,000 words per text) • Total corpus size: ca. 1 million words • Aims to be “representative”
The GerManC corpus • Eight genres
The GerManC corpus • Three periods 1650-1700 1700-1750 1750-1800
The GerManC corpus • Five regions North German East Central German West Central German East Upper German West Upper German
The GerManC corpus • Three 2,000-word files per genre/period/region • Total size: ca. 1 million words
Gold-standard subcorpus: GerManC-GS • One 2,000-word file per genre and period from the North German region → 24 files • > 50,000 tokens • Annotated by two historical linguists • Gold standard POS tags, lemmas, and normalised word forms
POS-tagger • TreeTagger (Schmid, 1994) • Statistical, decision tree-based POS tagger • Parameter file for modern German supplied with the tagger • Trained on German newspaper corpus • STTS tagset
STTS-EMG • PIAT (merged with PIDAT): Indefinite determiner, as in ‘viele solche Bemerkungen’ (‘many such remarks’)
STTS-EMG • NA: Adjectives used as nouns, as in ‘der Gesandte’ (‘the ambassador’)
STTS-EMG • PAVREL: Pronominal adverb used as relative, as in ‘die Puppe, damit sie spielt’ (‘the doll with which she plays’) • PTKREL: Indeclinable relative particle, as in ‘die Fälle, so aus Schwachheit entstehen’ (‘the cases which arise from weakness’)
STTS-EMG • PWAVREL: Interrogative adverb used as relative, as in ‘der Zaun, worüber sie springt’ (‘the fence over which she jumps’) • PWREL: Interrogative pronoun used as relative, as in ‘etwas, was er sieht’ (‘something which he sees’)
POS-tagging in GerManC-GS • New categories account for 2% of all tokens • IAA on POS-tagging task: 91.6%
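The IAA figure can be read as the proportion of tokens on which both annotators assign the same POS tag (the slides do not say whether a chance-corrected measure such as kappa was used, so the sketch below shows raw percentage agreement; the tag sequences are invented toy data, not GerManC annotations):

```python
def percent_agreement(tags_a, tags_b):
    """Proportion of tokens on which two annotators assign the same POS tag."""
    assert len(tags_a) == len(tags_b)
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)

# Toy example (invented tags): annotators disagree on one token out of five.
ann1 = ["ART", "NA", "VVFIN", "PTKREL", "NN"]
ann2 = ["ART", "NN", "VVFIN", "PTKREL", "NN"]
print(percent_agreement(ann1, ann2))  # 0.8
```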
Challenges: Tokenisation issues • Clitics: • has|tu: hast du (‘have you’) - wirs|tu: wirst du (‘will you’) • Multi-word tokens: • obgleich/KOUS vs. ob/KOUS gleich/ADV (‘even though’)
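The clitic cases above amount to a lookup-driven rewrite step in the tokeniser; a minimal sketch with a hand-made split table (the table entries and function name are illustrative, not the project's actual rules, and a real EMG tokeniser would also need context to decide the multi-word cases like obgleich):

```python
# Illustrative clitic-splitting table; a real EMG tokeniser needs far
# larger resources and context-sensitive rules for multi-word tokens.
CLITIC_SPLITS = {"hastu": ["hast", "du"], "wirstu": ["wirst", "du"]}

def retokenise(tokens):
    """Split known clitic forms (e.g. has|tu -> hast du) into separate tokens."""
    out = []
    for tok in tokens:
        out.extend(CLITIC_SPLITS.get(tok.lower(), [tok]))
    return out

print(retokenise(["hastu", "es", "gesehen"]))  # ['hast', 'du', 'es', 'gesehen']
```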
Challenges: Spelling variation • Spelling not standardised: • Comet → Komet • auff → auf • nachdeme → nachdem • koͤmpt → kommt • Bothenbrodt → Botenbrot • differiret → differiert • beßer → besser • kehme → käme • trucken → trockenen • gepressett → gepreßt • büxen → Büchsen
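A normalisation layer for variants like these can be sketched as a lexicon lookup with pattern-based fallback rules for unseen forms. The lexicon entries below are taken from the examples above, but the regex rules are toy illustrations, not the project's actual normalisation scheme:

```python
import re

# Known variant -> modern form pairs (from the examples above).
LEXICON = {"auff": "auf", "Comet": "Komet", "nachdeme": "nachdem",
           "differiret": "differiert", "beßer": "besser"}

# Toy fallback rules for unseen variants; a real scheme needs many more.
RULES = [(re.compile(r"ff\b"), "f"),       # auff -> auf
         (re.compile(r"iret\b"), "iert")]  # differiret -> differiert

def normalise(token):
    """Return the modern form via lexicon lookup, else apply fallback rules."""
    if token in LEXICON:
        return LEXICON[token]
    for pattern, repl in RULES:
        token = pattern.sub(repl, token)
    return token

print(normalise("auff"), normalise("grieff"))  # auf grief
```

Lookup handles attested variants exactly; the rules only catch recurrent orthographic patterns and leave already-modern tokens untouched.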
Challenges: Spelling variation • All spelling variants in GerManC-GS normalised to a modern standard → assess what effect spelling variation has on the performance of automatic tools → help improve automated processing? • Important for: • Automatic tools (POS tagger!) • Accurate corpus search
Challenges: Spelling variation Proportion of normalised word tokens plotted against time
Questions • What is the “off-the-shelf” performance of the TreeTagger on historical data from the EMG period? • Can the results be improved by running the tool on normalised data?
Results TreeTagger accuracy on original vs. normalised input
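The comparison in the table above comes down to scoring the tagger's output against the gold-standard tags for both input variants; a self-contained sketch (the tag sequences are invented, not actual GerManC-GS results):

```python
def accuracy(predicted, gold):
    """Token-level POS tagging accuracy against a gold standard."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Invented gold tags and tagger outputs for original vs. normalised input.
gold      = ["ART", "NN", "VVFIN", "ADV", "NN"]
tags_orig = ["ART", "NE", "VVFIN", "ADJD", "NN"]  # tagger on raw EMG spelling
tags_norm = ["ART", "NN", "VVFIN", "ADJD", "NN"]  # tagger on normalised input
print(accuracy(tags_orig, gold), accuracy(tags_norm, gold))  # 0.6 0.8
```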
Improvement through normalisation over time Tagger performance plotted against publication date
Effects of spelling normalisation on POS tagger performance • For normalised tokens: effect of using original (O) vs. normalised (N) input on tagger accuracy (+: correctly tagged; -: incorrectly tagged)
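The O/N cross-classification above sorts each normalised token into four cells by whether the tagger is correct on original and on normalised input (e.g. "O- N+" means normalisation fixed the tag). Counting the cells is straightforward; the data below is invented for illustration:

```python
from collections import Counter

def effect_cells(gold, tags_orig, tags_norm):
    """Count tokens per (O, N) correctness cell; 'O- N+' = fixed by normalisation."""
    cells = Counter()
    for g, o, n in zip(gold, tags_orig, tags_norm):
        key = ("O+" if o == g else "O-") + " " + ("N+" if n == g else "N-")
        cells[key] += 1
    return cells

# Invented example: one token fixed, one broken, one wrong either way.
gold = ["NN", "VVFIN", "ADV", "NN"]
orig = ["NE", "VVFIN", "ADJD", "NN"]
norm = ["NN", "VVFIN", "ADJD", "NE"]
print(effect_cells(gold, orig, norm))
```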
Comparison with “modern” results • Performance of TreeTagger on modern data: ca. 97% (Schmid, 1995) • Current results seem low • But: • Modern accuracy figure: evaluation of tagger on the text type it was developed on (newspaper text) • IAA higher for modern German (98.6%)
Conclusion • Substantial amount of manual post-editing required • Normalisation layer can improve results by 10%, but so far only half of all normalisations have a positive effect
Future work • Adapt normalisation scheme to account for more cases • Automate normalisation (Jurish, 2010) • Retrain state-of-the-art POS taggers → evaluation? • Provide detailed information about annotation quality to research community
Thank you! Martin.Durrell@manchester.ac.uk Paul.Bennett@manchester.ac.uk Silke.Scheible@manchester.ac.uk Richard.Whitt@manchester.ac.uk http://tinyurl.com/germanc