Explore key factors in parser development: robustness, disambiguation, accuracy, and efficiency. Learn about syntactic annotation balance for effective parsing, illustrated with dependency structures from treebanks. Discover the impact of grammar writing, data preprocessing, and treebank advantages.
Treebanks as Training Data for Parsers
Joakim Nivre
Växjö University and Uppsala University
E-mail: nivre@msi.vxu.se
Q1: What do you really care about when you’re building a parser?
• For parsing unrestricted text, I care about the joint optimization of:
  • Robustness
  • Disambiguation
  • Accuracy
  • Efficiency
• Requirement on syntactic annotation:
  • Balance between expressivity and complexity
Example: Mildly Non-Projective Dependency Structures
• Dependency structure in two treebanks:
  • Strictly projective (efficiently parsable):
    • PDT: 75%
    • DDT: 85%
  • Unrestricted non-projective (often intractable):
    • PDT: 100%
    • DDT: 100%
  • Well-nested, gap degree ≤ 1:
    • PDT: 99.5%
    • DDT: 99.7%
• Design choice in treebank annotation?
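The three classes on this slide can be made concrete with a small checker. The following is a minimal sketch, not the procedure used to obtain the treebank figures. It assumes a sentence is encoded as a list `heads`, where `heads[i-1]` is the head of word `i` and `0` marks a root, and it considers word nodes only (no artificial root node). It uses the usual definitions: a gap is a discontinuity in the yield of a node, and a structure is well-nested if no two disjoint subtrees interleave. All function names are illustrative.

```python
from itertools import combinations


def yields(heads):
    """Map each word (1-based) to the set of words it dominates, itself included."""
    n = len(heads)
    proj = {i: {i} for i in range(1, n + 1)}
    for i in range(1, n + 1):
        h, steps = heads[i - 1], 0
        while h != 0:  # walk up to the root, adding i to every ancestor's yield
            steps += 1
            if steps > n:
                raise ValueError("cycle in head list")
            proj[h].add(i)
            h = heads[h - 1]
    return proj


def gap_degree(heads):
    """Maximum number of discontinuities in the yield of any node."""
    degree = 0
    for nodes in yields(heads).values():
        s = sorted(nodes)
        gaps = sum(1 for a, b in zip(s, s[1:]) if b - a > 1)
        degree = max(degree, gaps)
    return degree


def is_projective(heads):
    """Projective means every yield is a contiguous interval (gap degree 0)."""
    return gap_degree(heads) == 0


def interleave(a, b):
    """True if some a1 < b1 < a2 < b2 (or the mirror pattern) exists."""
    runs = []
    for pos in sorted(a | b):
        in_a = pos in a
        if not runs or runs[-1] != in_a:
            runs.append(in_a)
    return len(runs) >= 4  # alternation such as A B A B or B A B A


def is_well_nested(heads):
    """No two disjoint subtrees may interleave."""
    proj = yields(heads)
    return not any(
        proj[u].isdisjoint(proj[v]) and interleave(proj[u], proj[v])
        for u, v in combinations(proj, 2)
    )


print(is_projective([2, 0, 2]))          # True: words 1 and 3 attach to word 2
print(gap_degree([3, 0, 2, 2]))          # 1: the yield of word 3 is {1, 3}
print(is_well_nested([3, 0, 2, 2]))      # True
print(is_well_nested([5, 5, 1, 2, 0]))   # False: yields {1, 3} and {2, 4} interleave
```

The pairwise well-nestedness test is quadratic in the number of nodes and is chosen here for readability, not speed.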
Q2: What works, what doesn’t?
• Anything works?
  • Top systems in CoNLL 2006 shared task:
    • MSTParser: Global, exhaustive, graph-based
    • MaltParser: Local, greedy, stack-based
  • Features more important than parsers?
• But not for all languages?
  • Results from CoNLL 2007 shared task:
    • Configurational languages ≈ 85% LAS (Catalan, Chinese, English, Italian)
    • Richly inflected languages ≈ 75% LAS (Arabic, Basque, Czech, Greek, Hungarian, Turkish)
  • Treebank problem or parser problem?
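For readers unfamiliar with the metric quoted above: LAS (labeled attachment score) is the percentage of tokens that receive both the correct head and the correct dependency label, and UAS is the same with the label ignored. A minimal sketch follows, assuming each token is given as a `(head, label)` pair. The labels in the example are made up for illustration, and the sketch scores every token, whereas evaluation scripts may differ in details such as the treatment of punctuation.

```python
def attachment_scores(gold, pred):
    """Return (LAS, UAS) in percent for two aligned lists of (head, label) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # UAS counts tokens whose head is correct; LAS also requires the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Hypothetical four-token sentence: one wrong label, one wrong head.
gold = [(2, "subj"), (0, "root"), (2, "obj"), (3, "mod")]
pred = [(2, "subj"), (0, "root"), (2, "iobj"), (2, "mod")]

las, uas = attachment_scores(gold, pred)
print(las, uas)  # 50.0 75.0
```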
Q3: What information is useful, what is not?
• Word level:
  • Morphological analysis (lemma, derivation, inflection)
  • Hierarchical parts-of-speech (incl. features)
• Sentence level:
  • Complete structural annotation (phrases, heads)
  • Complete functional annotation (syntactic relations)
  • Deep/non-local dependencies
• Integrated morpho-syntactic annotation:
  • The key to parsing richly inflected languages?
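One way to picture word-level and sentence-level information sitting side by side is the tab-separated token format used in the CoNLL shared tasks mentioned earlier, with the columns ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL. The sketch below reads one such line into a record. The `Token` class, the `parse_token` helper, and the example line are illustrative assumptions, not taken from any treebank, and the sketch assumes HEAD is filled in (gold data).

```python
from dataclasses import dataclass

FIELDS = ("id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel")


@dataclass
class Token:
    id: int
    form: str
    lemma: str      # word level: morphological analysis
    cpostag: str    # word level: coarse part of speech
    postag: str     # word level: fine part of speech
    feats: dict     # word level: morphological features
    head: int       # sentence level: syntactic head (0 = root)
    deprel: str     # sentence level: syntactic relation


def parse_token(line):
    """Parse one tab-separated token line with exactly ten columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
    row = dict(zip(FIELDS, cols))
    feats = {}
    if row["feats"] != "_":  # "_" marks an empty field
        for item in row["feats"].split("|"):
            key, _, value = item.partition("=")
            feats[key] = value  # a bare feature without "=" gets an empty value
    return Token(int(row["id"]), row["form"], row["lemma"], row["cpostag"],
                 row["postag"], feats, int(row["head"]), row["deprel"])


# Made-up example line, for illustration only.
line = "2\treads\tread\tV\tVBZ\tnum=sg|pers=3\t0\tROOT\t_\t_"
token = parse_token(line)
print(token.lemma, token.feats, token.head, token.deprel)
```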
Skipping a few questions …
• Q4: How does grammar writing interact with treebanking?
  • No idea. Not my cup of tea.
• Q5: What methodological lessons can be drawn for treebanking?
• Q6: What are advantages and disadvantages of preprocessing the data to be treebanked with an automatic parser?
  • Don’t know. Never got funding to build a real treebank.
Q7: Advantages of a phrase structure and/or a dependency treebank?
• Obvious answer:
  • Phrase structure is good for phrase structure parsing.
  • Dependency is good for dependency parsing.
• Methodological point:
  • Parsing lossy conversions can be questionable.
• Remedy:
  • Make annotations (just) rich enough to support both.
• Annotation scheme:
  • Minimal source annotation
  • Well-defined conversions to target annotations
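To illustrate what a lossy conversion looks like, here is a minimal sketch of a head-rule conversion from phrase structure to unlabeled dependencies. The tree encoding, the `HEAD_RULES` table, and the rightmost-child fallback are all assumptions made for this example, not an existing conversion scheme. A phrase is `(label, [children])` and a leaf is `(pos, word_index)`.

```python
# Hypothetical head rules: for each phrase label, child labels in priority order.
HEAD_RULES = {"S": ["VP", "V"], "VP": ["V"], "NP": ["N"]}


def head_position(label, children):
    """Index of the head child; falls back to the rightmost child (an assumption)."""
    for wanted in HEAD_RULES.get(label, []):
        for pos, child in enumerate(children):
            if child[0] == wanted:
                return pos
    return len(children) - 1


def convert(tree, heads):
    """Fill `heads` (dependent -> head) and return the lexical head of `tree`."""
    label, body = tree
    if isinstance(body, int):  # leaf: (pos, word index)
        return body
    lexical = [convert(child, heads) for child in body]
    head = lexical[head_position(label, body)]
    for dep in lexical:
        if dep != head:
            heads[dep] = head  # each non-head child attaches to the head child
    return head


def to_dependencies(tree):
    heads = {}
    root = convert(tree, heads)
    heads[root] = 0  # 0 marks the root
    return heads


# "She reads books" with and without a VP node.
nested = ("S", [("NP", [("N", 1)]),
                ("VP", [("V", 2), ("NP", [("N", 3)])])])
flat = ("S", [("NP", [("N", 1)]), ("V", 2), ("NP", [("N", 3)])])

print(to_dependencies(nested))
print(to_dependencies(flat))
```

Both trees map to the same dependency structure (words 1 and 3 attach to word 2), so the VP bracketing cannot be recovered from the output. That is the sense in which such a conversion is lossy.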