Parsing Unrestricted Text
Joakim Nivre
Two Notions of Parsing
• Grammar parsing:
  • Given a grammar G and an input string x ∈ Σ*, derive some or all of the analyses y assigned to x by G.
• Text parsing:
  • Given a text T = (x1, …, xn), derive the correct analysis yi for every sentence xi ∈ T.
Grammar Parsing
• Properties of grammar parsing:
  • Abstract problem: Mapping from (G, x) to y.
  • Parsing implies recognition; analyses are defined only if x ∈ L(G).
  • Correctness (consistency and completeness) can be proven without considering any input string x.
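To make "parsing implies recognition" concrete, here is a minimal sketch of a CYK recognizer. The toy grammar (in Chomsky normal form) and the names `BINARY`, `LEXICAL`, `cyk_chart` and `recognize` are illustrative assumptions, not part of the slides. The function only answers whether x ∈ L(G); analyses exist only for strings it accepts.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (not from the slides).
BINARY = {                      # rules of the form A -> B C, indexed by (B, C)
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICAL = {                     # rules of the form A -> word, indexed by word
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "saw": {"V"},
}

def cyk_chart(words):
    """Fill chart[i][j] with the nonterminals that derive words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(word, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return chart

def recognize(words, start="S"):
    """Return True iff the token list is in L(G) for the toy grammar."""
    if not words:
        return False
    return start in cyk_chart(words)[0][len(words)]
```

For example, `recognize("the dog saw the cat".split())` accepts the string, while `recognize("the dog".split())` rejects it: the chart contains an NP for the whole input but no S.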
Text Parsing
• Properties of text parsing:
  • Not a well-defined abstract problem (the text language is not a formal language).
  • Parsing does not imply recognition (recognition presupposes a formal language).
  • Empirical approximation problem.
  • Correctness can only be established with reference to empirical samples of the text language (statistical inference).
Two Methods for Text Parsing
• Grammar-driven text parsing:
  • Text parsing approximated by grammar parsing.
• Data-driven text parsing:
  • Text parsing approximated by statistical inference.
• Not mutually exclusive methods:
  • Grammars can be combined with statistical inference (e.g. PCFG).
Grammar-Driven Text Parsing
• Basic assumption:
  • The text language L can be approximated by L(G).
• Potential problems (evaluation criteria):
  • Robustness
  • Disambiguation
  • Accuracy
  • Efficiency
Robustness
• Basic issue:
  • What happens if x ∉ L(G)?
• Two cases:
  • x ∉ L(G), x ∈ L (coverage)
  • x ∉ L(G), x ∉ L (robustness)
• Techniques:
  • Constraint relaxation
  • Partial parsing
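As an illustration of partial parsing, the sketch below falls back to a sequence of analysed fragments when the grammar has no analysis for the whole input. The toy grammar, the function names and the greedy longest-span strategy are all assumptions made for this example; the slides do not prescribe a particular partial parsing method.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (not from the slides).
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cyk_chart(words):
    """Fill chart[i][j] with the nonterminals that derive words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(word, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return chart

def partial_parse(words):
    """Cover the input left to right with the longest analysable spans.

    Returns a list of (label, tokens) pairs; label is None for a token
    that the grammar cannot analyse at all.
    """
    n = len(words)
    chart = cyk_chart(words)
    chunks, i = [], 0
    while i < n:
        # Longest span starting at i that is derived by some nonterminal.
        end = next((j for j in range(n, i, -1) if chart[i][j]), None)
        if end is None:
            chunks.append((None, words[i:i + 1]))   # unknown token: pass it through
            i += 1
        else:
            label = sorted(chart[i][end])[0]        # arbitrary but deterministic choice
            chunks.append((label, words[i:end]))
            i = end
    return chunks
```

With this grammar, `partial_parse("dog the cat".split())` yields an N fragment followed by an NP fragment instead of failing outright, so every input receives some analysis.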
Disambiguation
• Basic issue:
  • What happens when G assigns more than one analysis y to a sentence x?
• Two cases:
  • String ambiguity (real) (disambiguation)
  • Grammar ambiguity (spurious) (leakage)
• Techniques:
  • Grammar specialization
  • Deterministic parsing
  • Eliminative parsing
  • Data-driven parsing (e.g. PCFG)
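To give a feel for how quickly the number of analyses can grow, the sketch below counts the parse trees that a deliberately ambiguous grammar assigns to a string of n tokens. The grammar S → S S | a and the function name `count_analyses` are assumptions chosen for illustration; they do not come from the slides.

```python
from functools import lru_cache

def count_analyses(n):
    """Count parse trees for a string of n 'a' tokens under S -> S S | 'a'."""
    @lru_cache(maxsize=None)
    def count(length):
        if length == 1:
            return 1                     # only S -> 'a' applies
        # S -> S S: sum over every split into a left and a right part
        return sum(count(k) * count(length - k) for k in range(1, length))
    return count(n)
```

The counts follow the Catalan numbers: `count_analyses(4)` is 5 and `count_analyses(10)` is 4862, which is why a parser needs some way of selecting among analyses rather than listing them all.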
Accuracy
• Basic issue:
  • How often can the parser deliver a single correct analysis?
• Grammar-driven techniques:
  • Linguistically adequate analyses?
  • Adequacy undermined by techniques to handle robustness and disambiguation.
Efficiency
• Theoretical complexity:
  • Many linguistically motivated formalisms have intractable parsing problems.
  • Even polynomially parsable formalisms often have high complexity.
• Practical efficiency is also affected by:
  • Grammar constants
  • Techniques for handling robustness and disambiguation
Data-Driven Text Parsing
• Basic assumption:
  • The text language L can be approximated by statistical inference from text samples.
• Components:
  • A formal model M defining permissible representations for sentences in L
  • A sample of text Tt = (x1, …, xn) from L, with or without the correct analyses At = (y1, …, yn)
  • An inductive inference scheme I defining actual analyses for the sentences of any text T = (x1, …, xn) in L, relative to M and Tt (and possibly At)
Robustness
• Basic issue:
  • Is M a grammar or not (cf. PCFG)?
• Radical constraint relaxation:
  • Ensure that every string has at least one analysis.
• Example (DOP3):
  • M permits any parse tree composed from subtrees in Tt, with free insertion of (even unseen) words from x.
  • Tt is annotated with context-free parse trees.
  • I defines the probability P(x, y) to be the sum of the probabilities of each derivation of y for x (for any x, y).
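The DOP3 probability described above can be written as a sum over derivations. The notation D(x, y) for the set of derivations of analysis y for string x is an assumption introduced here; the slides state the definition only in words.

```latex
P(x, y) = \sum_{d \in D(x, y)} P(d)
```

Because the same tree y can usually be assembled from subtrees in several ways, the sum ranges over all of those derivations rather than over a single one.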
Disambiguation
• Basic issue:
  • How to rank different analyses yi of x?
• Structure of I:
  • A parameterized stochastic model M, assigning a score S(x, yi) to each permissible analysis yi of x, relative to a set of parameters θ.
  • A parsing method, i.e. a method for computing the best yi according to S(x, yi) (given θ).
  • A learning method, i.e. a method for instantiating θ based on inductive inference from Tt.
• Example: PCFG
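The three parts of I can be sketched for the PCFG example as follows. This is a minimal illustration under stated assumptions: trees are nested tuples, the learning method is relative-frequency estimation over an annotated sample, and the "parsing method" simply picks the best of a hand-supplied candidate list (a real parser would search the space of analyses instead). The names `rules`, `learn`, `score` and `best` are hypothetical.

```python
from collections import Counter
from math import prod

# A tree is a nested tuple (label, child, ...); a word is a plain string.

def rules(tree):
    """Yield every rule (lhs, rhs) used in the tree, top-down."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def learn(treebank):
    """Learning method: estimate rule probabilities by relative frequency."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {r: count / lhs_counts[r[0]] for r, count in rule_counts.items()}

def score(tree, theta):
    """Score S(x, y): product of the probabilities of the rules used in y."""
    return prod(theta.get(r, 0.0) for r in rules(tree))   # unseen rule -> 0

def best(candidates, theta):
    """Parsing method (simplified): highest-scoring candidate analysis."""
    return max(candidates, key=lambda y: score(y, theta))

# Two analyses of "she saw stars with telescopes" (hypothetical sample data).
PP = ("PP", ("P", "with"), ("NP", "telescopes"))
Y_VP = ("S", ("NP", "she"),
        ("VP", ("VP", ("V", "saw"), ("NP", "stars")), PP))   # PP attached to VP
Y_NP = ("S", ("NP", "she"),
        ("VP", ("V", "saw"), ("NP", ("NP", "stars"), PP)))   # PP attached to NP

theta = learn([Y_VP, Y_VP, Y_NP])       # toy annotated sample (Tt with At)
chosen = best([Y_NP, Y_VP], theta)      # selects the VP-attachment analysis
```

Since the toy sample contains the VP attachment twice and the NP attachment once, the estimated parameters rank the VP attachment higher.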
Accuracy
• Basic issue:
  • How often can the parser deliver a single correct analysis?
• Data-driven techniques:
  • Empirically adequate ranking of alternatives?
  • Accuracy undermined by combinatorial explosion due to radical constraint relaxation.
Efficiency
• Theoretical complexity:
  • Many data-driven models have intractable inference problems.
  • Even polynomially parsable models often have high complexity.
• Practical efficiency is also affected by:
  • Model constants
  • Techniques for handling robustness and disambiguation
Converging Approaches?
• Text parsing:
  • Complex optimization problem
• Two optimization strategies:
  • Start with good accuracy, improve robustness and disambiguation (while controlling efficiency).
  • Start with good disambiguation (and robustness), improve accuracy (while controlling efficiency).
• Strategies converging on the same solution?
  • Constraint relaxation for robustness
  • Data-driven models for disambiguation
  • Heuristic search techniques for efficiency