Maximizing Parser Performance Through Treebank Training Data

Treebanks as Training Data for Parsers
Joakim Nivre
Växjö University and Uppsala University
E-mail: nivre@msi.vxu.se

Q1: What do you really care about when you’re building a parser?
• For parsing unrestricted text, I care about the joint optimization of:
  • Robustness
  • Disambiguation
  • Accuracy
  • Efficiency
• Requirement on syntactic annotation:
  • Balance between expressivity and complexity

Example: Mildly Non-Projective Dependency Structures
• Dependency structure in two treebanks:
  • Strictly projective (efficiently parsable):
    • PDT: 75%
    • DDT: 85%
  • Unrestricted non-projective (often intractable):
    • PDT: 100%
    • DDT: 100%
  • Well-nested, gap degree ≤ 1:
    • PDT: 99.5%
    • DDT: 99.7%
• Design choice in treebank annotation?
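The distinction between projective structures and structures with a bounded gap degree can be made concrete with a small check. The following is a minimal sketch, not the procedure used to obtain the treebank figures above. It assumes a dependency tree is encoded as a list `heads` in which `heads[d - 1]` is the head of token `d`, tokens are numbered from 1, and 0 stands for an artificial root. The function names are hypothetical, and well-nestedness is not checked here.

```python
def yields(heads):
    """Map each token to the set of tokens it dominates (itself included).

    heads[d - 1] is the head of token d; tokens are numbered from 1 and
    0 stands for an artificial root (an assumed encoding).
    """
    n = len(heads)
    dominated = {t: {t} for t in range(1, n + 1)}
    for d in range(1, n + 1):
        seen = set()
        h = heads[d - 1]
        # Walk from d up to the root, registering d under every ancestor.
        while h != 0:
            if h in seen or h == d:
                raise ValueError("head list contains a cycle")
            seen.add(h)
            dominated[h].add(d)
            h = heads[h - 1]
    return dominated


def is_projective(heads):
    """True if every token between a head and its dependent is dominated by that head."""
    dominated = yields(heads)
    for d in range(1, len(heads) + 1):
        h = heads[d - 1]
        lo, hi = min(h, d), max(h, d)
        for k in range(lo + 1, hi):
            # The artificial root dominates every token, so its arcs never fail.
            if h != 0 and k not in dominated[h]:
                return False
    return True


def gap_degree(heads):
    """Largest number of discontinuities in the yield of any token."""
    worst = 0
    for tokens in yields(heads).values():
        ordered = sorted(tokens)
        gaps = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > 1)
        worst = max(worst, gaps)
    return worst


projective_tree = [2, 0, 2]   # tokens 1 and 3 both depend on token 2
crossing_tree = [3, 0, 2]     # token 1 depends on token 3, across the root token 2
print(is_projective(projective_tree), gap_degree(projective_tree))
print(is_projective(crossing_tree), gap_degree(crossing_tree))
```

A projective tree always has gap degree 0, so the second function is only informative for non-projective structures, where it measures how far a tree departs from projectivity.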

Q2: What works, what doesn’t?
• Anything works?
  • Top systems in CoNLL 2006 shared task:
    • MSTParser: Global, exhaustive, graph-based
    • MaltParser: Local, greedy, stack-based
  • Features more important than parsers?
• But not for all languages?
  • Results from CoNLL 2007 shared task:
    • Configurational languages ≈ 85% LAS (Catalan, Chinese, English, Italian)
    • Richly inflected languages ≈ 75% LAS (Arabic, Basque, Czech, Greek, Hungarian, Turkish)
  • Treebank problem or parser problem?
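LAS in the results above is the labeled attachment score: the share of tokens that receive both the correct head and the correct dependency label. A minimal sketch of the computation follows, together with the unlabeled variant (UAS) for comparison. It assumes each token is given as a `(head, label)` pair and scores every token, so it does not reproduce any shared-task-specific treatment of punctuation. The function name and the example labels are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as fractions for two aligned lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    total = len(gold)
    # Unlabeled: only the head has to match.
    head_hits = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # Labeled: head and dependency label both have to match.
    full_hits = sum(1 for g, p in zip(gold, predicted) if tuple(g) == tuple(p))
    return head_hits / total, full_hits / total


# Illustrative three-token sentence; the label names are made up for the example.
gold = [(2, "subj"), (0, "root"), (2, "obj")]
predicted = [(2, "subj"), (0, "root"), (2, "adv")]
uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```

Because LAS can never exceed UAS, a large gap between the two points to labeling errors rather than attachment errors, which is one way to start separating a treebank problem from a parser problem.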

Q3: What information is useful, what is not?
• Word level:
  • Morphological analysis (lemma, derivation, inflection)
  • Hierarchical parts-of-speech (incl. features)
• Sentence level:
  • Complete structural annotation (phrases, heads)
  • Complete functional annotation (syntactic relations)
  • Deep/non-local dependencies
• Integrated morpho-syntactic annotation:
  • The key to parsing richly inflected languages?

Skipping a few questions …
• Q4: How does grammar writing interact with treebanking?
  • No idea. Not my cup of tea.
• Q5: What methodological lessons can be drawn for treebanking?
• Q6: What are advantages and disadvantages of preprocessing the data to be treebanked with an automatic parser?
  • Don’t know. Never got funding to build a real treebank.

Q7: Advantages of a phrase structure and/or a dependency treebank?
• Obvious answer:
  • Phrase structure is good for phrase structure parsing.
  • Dependency is good for dependency parsing.
• Methodological point:
  • Parsing lossy conversions can be questionable.
• Remedy:
  • Make annotations (just) rich enough to support both.
• Annotation scheme:
  • Minimal source annotation
  • Well-defined conversions to target annotations
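One way to picture a well-defined conversion is the step from a phrase structure tree with marked heads to an unlabeled dependency tree. The sketch below is an illustration under stated assumptions, not a conversion scheme proposed in the talk: it assumes every phrase records which daughter is its head, and the `Node` class, its fields, and the function names are all hypothetical. The output keeps no phrase labels, so this direction is itself lossy, which is the kind of conversion the methodological point above warns about.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str                      # phrase category or part of speech
    position: int = 0               # 1-based token position (tokens only)
    children: List["Node"] = field(default_factory=list)  # empty for a token
    head_child: int = 0             # index of the head daughter (phrases only)


def lexical_head(node: Node) -> int:
    """Follow head daughters down to a token and return its position."""
    while node.children:
        node = node.children[node.head_child]
    return node.position


def to_dependencies(root: Node) -> Dict[int, int]:
    """Return {dependent position: head position}, with 0 as the artificial root."""
    heads = {lexical_head(root): 0}
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue
        h = lexical_head(node)
        for i, child in enumerate(node.children):
            # Every non-head daughter attaches its lexical head to the phrase's head.
            if i != node.head_child:
                heads[lexical_head(child)] = h
            stack.append(child)
    return heads


# "the dog barks": [S [NP the dog] [VP barks]], with dog heading NP and VP heading S.
the = Node("DT", position=1)
dog = Node("NN", position=2)
barks = Node("VBZ", position=3)
np = Node("NP", children=[the, dog], head_child=1)
vp = Node("VP", children=[barks], head_child=0)
sentence = Node("S", children=[np, vp], head_child=1)
print(to_dependencies(sentence))
```

Going the other way would require information that the dependency tree does not carry, such as phrase categories and bracketing depth, which is why a source annotation that is just rich enough to derive both targets is attractive.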
