Maximizing Parser Performance Through Treebank Training Data

Explore key factors in parser development: robustness, disambiguation, accuracy, and efficiency. Learn about syntactic annotation balance for effective parsing, illustrated with dependency structures from treebanks. Discover the impact of grammar writing, data preprocessing, and treebank advantages.

Presentation Transcript


  1. Treebanks as Training Data for Parsers
     Joakim Nivre
     Växjö University and Uppsala University
     E-mail: nivre@msi.vxu.se

  2. Q1: What do you really care about when you’re building a parser?
     • For parsing unrestricted text, I care about the joint optimization of:
       • Robustness
       • Disambiguation
       • Accuracy
       • Efficiency
     • Requirement on syntactic annotation:
       • Balance between expressivity and complexity

  3. Example: Mildly Non-Projective Dependency Structures
     • Dependency structure in two treebanks:
       • Strictly projective (efficiently parsable):
         • PDT: 75%
         • DDT: 85%
       • Unrestricted non-projective (often intractable):
         • PDT: 100%
         • DDT: 100%
       • Well-nested, gap degree ≤ 1:
         • PDT: 99.5%
         • DDT: 99.7%
     • Design choice in treebank annotation?
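
The projectivity constraint on this slide can be checked mechanically. The following is a minimal sketch, not code from the presentation: it assumes each tree is given as a list of 1-based head indices with 0 for an artificial root, and that the input is a well-formed tree (no cycles). The function names `is_projective` and `projective_share` are illustrative. An arc counts as projective when every word strictly between the head and the dependent is dominated by the head.

```python
def is_projective(heads):
    """heads[i] is the 1-based head of word i+1; 0 marks the artificial root.

    Assumes a well-formed tree (every word reaches the root, no cycles).
    """
    n = len(heads)

    def dominated_by(node, ancestor):
        # Follow the head chain upward until we meet the ancestor or the root.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0  # the root dominates every word

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        # Every word strictly inside the arc's span must descend from the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


def projective_share(trees):
    """Fraction of trees in a collection that are strictly projective."""
    if not trees:
        return 0.0
    return sum(is_projective(t) for t in trees) / len(trees)


# "I saw her": both dependents attach to "saw".
simple = [2, 0, 2]
# "A hearing is scheduled on the issue today": "on" attaches to "hearing",
# crossing "is" and "scheduled", which do not descend from "hearing".
crossing = [2, 3, 0, 3, 2, 7, 5, 4]
print(is_projective(simple), is_projective(crossing))
```

Run over a whole treebank, `projective_share` would give a figure comparable in kind to the percentages quoted on the slide, although the exact counting unit used there is not stated in the transcript.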

  4. Q2: What works, what doesn’t?
     • Anything works?
     • Top systems in CoNLL 2006 shared task:
       • MSTParser: Global, exhaustive, graph-based
       • MaltParser: Local, greedy, stack-based
     • Features more important than parsers?
     • But not for all languages?
     • Results from CoNLL 2007 shared task:
       • Configurational languages ≈ 85% LAS (Catalan, Chinese, English, Italian)
       • Richly inflected languages ≈ 75% LAS (Arabic, Basque, Czech, Greek, Hungarian, Turkish)
     • Treebank problem or parser problem?
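
LAS here is the labeled attachment score: the share of tokens that receive both the correct head and the correct dependency label. As a rough sketch of how such a score is computed (the function name `attachment_scores` and the `(head, label)` token representation are assumptions for illustration, and official shared-task scorers may differ in details such as punctuation handling):

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, label) pairs.

    UAS counts tokens with the correct head; LAS additionally requires
    the correct dependency label.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / total, las_hits / total


# Three tokens; the parser gets every head right but one label wrong.
gold = [(2, "subj"), (0, "root"), (2, "obj")]
pred = [(2, "subj"), (0, "root"), (2, "adv")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case UAS is 1.0 while LAS is 2/3, which shows why LAS is the stricter of the two measures.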

  5. Q3: What information is useful, what is not?
     • Word level:
       • Morphological analysis (lemma, derivation, inflection)
       • Hierarchical parts-of-speech (incl. features)
     • Sentence level:
       • Complete structural annotation (phrases, heads)
       • Complete functional annotation (syntactic relations)
       • Deep/non-local dependencies
     • Integrated morpho-syntactic annotation:
       • The key to parsing richly inflected languages?

  6. Skipping a few questions …
     • Q4: How does grammar writing interact with treebanking?
       • No idea. Not my cup of tea.
     • Q5: What methodological lessons can be drawn for treebanking?
     • Q6: What are advantages and disadvantages of preprocessing the data to be treebanked with an automatic parser?
       • Don’t know. Never got funding to build a real treebank.

  7. Q7: Advantages of a phrase structure and/or a dependency treebank?
     • Obvious answer:
       • Phrase structure is good for phrase structure parsing.
       • Dependency is good for dependency parsing.
     • Methodological point:
       • Parsing lossy conversions can be questionable.
     • Remedy:
       • Make annotations (just) rich enough to support both.
     • Annotation scheme:
       • Minimal source annotation
       • Well-defined conversions to target annotations
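
To make the idea of a "well-defined conversion" concrete, here is a small sketch of one common kind of conversion, from phrase structure to dependencies via head rules. It is an illustration under stated assumptions, not the scheme proposed in the talk: the `HEAD_RULES` table, the tree encoding, and the function name `to_dependencies` are all hypothetical, and real head tables are far larger.

```python
# Hypothetical head rules: for each phrase label, child labels in priority order.
HEAD_RULES = {
    "S": ["VP", "NP"],
    "VP": ["VBD", "VP"],
    "NP": ["NN", "NNS", "PRP", "NP"],
}


def to_dependencies(tree):
    """Convert a phrase-structure tree to 1-based head indices (0 = root).

    A phrase is (label, [children]); a leaf is (tag, word).
    """
    heads = []

    def visit(node):
        label, rest = node
        if isinstance(rest, str):
            # Leaf: allocate the next word position, head not yet known.
            heads.append(0)
            return len(heads)
        child_heads = [visit(child) for child in rest]
        child_labels = [child[0] for child in rest]
        head_pos = child_heads[0]  # fallback: leftmost child is the head
        for wanted in HEAD_RULES.get(label, []):
            if wanted in child_labels:
                head_pos = child_heads[child_labels.index(wanted)]
                break
        # Lexical heads of the non-head children attach to the phrase's head.
        for pos in child_heads:
            if pos != head_pos:
                heads[pos - 1] = head_pos
        return head_pos

    visit(tree)
    return heads


tree = ("S", [
    ("NP", [("PRP", "I")]),
    ("VP", [("VBD", "saw"), ("NP", [("PRP", "her")])]),
])
print(to_dependencies(tree))
```

The output keeps only head indices: phrase labels and bracketing are discarded, so the original tree cannot be recovered from it. That is the sense in which such a conversion is lossy, and why training and evaluating a parser on converted data needs care.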
