
Parsing the NEGRA corpus



Presentation Transcript


  1. Parsing the NEGRA corpus Greg Donaker June 14, 2006

  2. NEGRA Corpus • German-language tagged corpus • 20,602 sentences (355,096 tokens) • Significantly smaller than the Penn Treebank • Can be used similarly to the Penn Treebank • Similar annotations, much flatter trees [Dubey & Keller 2003]

  3. Baseline error analysis • Ran the corpus through the Stanford Parser using NEGRA-specific parameters • 91.75% tagging accuracy • PCFG F-score: 66.42 • Most frequently underproposed rule: • NP -> ART NN (98 times) • Most frequently underproposed category: • NN (498 times – three times the next category) • These error rates seem abnormally high given the structure of the German language.
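The PCFG F-score cited above is the standard PARSEVAL-style harmonic mean of labeled bracket precision and recall. A minimal sketch of that computation, with purely illustrative bracket counts (not from the slides):

```python
# Labeled-bracket F-score as used for PCFG parser evaluation.
# The counts below are invented for illustration only.

def f_score(correct: int, proposed: int, gold: int) -> float:
    """Harmonic mean of labeled precision and recall."""
    precision = correct / proposed  # fraction of proposed brackets that are right
    recall = correct / gold         # fraction of gold brackets that were found
    return 2 * precision * recall / (precision + recall)

# Example: 2000 correct brackets out of 3000 proposed and 3020 gold.
print(round(100 * f_score(2000, 3000, 3020), 2))  # → 66.45
```

A single correct/proposed/gold triple like this is how a score such as 66.42 is produced over a whole test set.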

  4. Approach • Bug: the tag distribution of unknown words was modeled as the baseline distribution • Reworked the unknown-word model around the specifics of the German language • Model based on the first letter, capitalization of the first letter, and the ending substring of the word
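The features above can be sketched as a word-signature function: an unseen token is mapped to a class combining first-letter capitalization with a short suffix, and tag probabilities are then estimated per class rather than from one global baseline. The function name and signature format here are illustrative assumptions, not the original implementation:

```python
# Hedged sketch of the unknown-word model described on the slide:
# map an unseen word to a signature built from the capitalization of
# its first letter and its last two characters.

def unknown_word_signature(word: str) -> str:
    cap = "CAP" if word[:1].isupper() else "low"
    suffix = word[-2:].lower()  # ending substring: last two characters
    return f"UNK-{cap}-{suffix}"

# German nouns are capitalized, so the capitalization feature is a
# strong cue for the NN tag on unknown words:
print(unknown_word_signature("Hausaufgabe"))  # → UNK-CAP-be
print(unknown_word_signature("verstehen"))    # → UNK-low-en
```

This matches the best-performing variant reported on the results slide (capitalization plus last two characters).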

  5. Results • The best-performing model (on both test and validation sets) matched intuition: • capitalization of the first letter plus the last two characters of the word • Improves tagging accuracy from 91.75% to 94.49% • Improves PCFG F-score from 66.42 to 69.87 • Reduces underproposed NP -> ART NN from 98 to 48 • Reduces underproposed NN from 498 to 73
