Analyzing and improving tag distribution for German text parsing based on NEGRA Corpus data, achieving higher accuracy and reducing errors.
Parsing the NEGRA corpus
Greg Donaker
June 14, 2006
NEGRA Corpus
• German-language tagged corpus
• 20,602 sentences (355,096 tokens)
• Significantly smaller than the Penn Treebank
• Can be used similarly to the Penn Treebank
• Similar annotations, but much flatter trees [Dubey & Keller 2003]
Baseline error analysis
• Ran the corpus through the Stanford Parser using NEGRA-specific parameters
• Tagging accuracy: 91.75%
• PCFG F-score: 66.42
• Most frequently underproposed rule: NP -> ART NN (98 times)
• Most frequently underproposed category: NN (498 times, three times the next category)
• These error counts seem abnormally high given the structure of German: all German nouns are capitalized, so NN should be among the easiest tags to recover
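The PCFG F-score above is a labeled-bracket (PARSEVAL-style) measure. A minimal sketch of how such a score is computed, assuming gold and guessed parses are represented as sets of labeled spans (the actual evalb tool adds multiset counting and bracket-equivalence rules):

```python
def bracket_f1(gold, guess):
    """Labeled-bracket F1: harmonic mean of precision and recall over
    (label, start, end) constituent spans. Simplified sketch of the
    PARSEVAL metric, not the exact evalb implementation."""
    if not gold or not guess:
        return 0.0
    matched = len(gold & guess)          # spans with identical label and extent
    precision = matched / len(guess)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the guess gets the NP and S spans right but misplaces the VP.
gold = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}
guess = {("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)}
print(round(bracket_f1(gold, guess), 3))  # 2/3 precision, 2/3 recall -> 0.667
```

Underproposed rules and categories, as counted above, are constituents present in the gold set but missing from the guess set.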
Approach
• Bug: the baseline modeled the tag distribution of unknown words with a single baseline distribution
• Reworked the unknown-word model around specifics of the German language
• Models based on the first letter, capitalization of the first letter, and the word's ending substring
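The idea can be sketched as conditioning the unknown-word tag distribution on a word "signature" built from those surface features. A minimal sketch, assuming a simple maximum-likelihood estimate over (word, tag) training pairs; the signature and helper names are hypothetical, and the actual parser model smooths these counts:

```python
from collections import Counter, defaultdict

def signature(word):
    """Map a word to its unknown-word signature: capitalization of the
    first letter plus the last two characters (the best model above)."""
    return (word[0].isupper(), word[-2:].lower())

def train_unknown_model(tagged_words):
    """Estimate P(tag | signature) from (word, tag) training pairs.
    Hypothetical helper; the real model would smooth toward a prior."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[signature(word)][tag] += 1
    return {sig: {t: c / sum(tc.values()) for t, c in tc.items()}
            for sig, tc in counts.items()}

# Toy example with STTS tags: German nouns are capitalized, so
# capitalization plus suffix is a strong cue for NN.
train = [("Haus", "NN"), ("Maus", "NN"), ("laufen", "VVINF"), ("kaufen", "VVINF")]
model = train_unknown_model(train)
print(model[(True, "us")])   # tag distribution for capitalized words ending in "us"
```

At test time, an unknown word is mapped to its signature and tagged from that distribution instead of one shared distribution over all unknowns.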
Results
• The best-performing model (on both test and validation sets) matched intuition: capitalization of the first letter plus the last two characters of the word
• Improves tagging accuracy from 91.75% to 94.49%
• Improves PCFG F-score from 66.42 to 69.87
• Reduces underproposed NP -> ART NN from 98 to 48
• Reduces underproposed NN from 498 to 73