Why NLP Needs Theoretical Syntax (It in Fact Already Uses It)
Owen Rambow
Center for Computational Learning Systems
Columbia University, New York City
rambow@ccls.columbia.edu
Key Issue: Representation
• Aravind Joshi to statisticians (adapted): “You know how to count, but we tell you what to count”
• Linguistic representations are not naturally occurring!
  • They are devised by linguists
• Example: the English Penn Treebank (PTB), illustrated below
  • Beatrice Santorini (thesis: historical syntax of Yiddish)
  • Lots of linguistic theory went into the PTB
  • The PTB annotation manual is a comprehensive descriptive grammar of English
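To see how much devising is involved, here is a minimal sketch in Python (using NLTK's Tree class; the sentence and its bracketing are made up for illustration, not taken from the PTB):

```python
# A minimal sketch using NLTK's Tree class. The bracketing below is
# merely PTB-II style (note the -SBJ function tag), not an actual
# treebank sentence.
from nltk import Tree

ptb_style = "(S (NP-SBJ (NNP Santorini)) (VP (VBD wrote) (NP (DT the) (NN manual))))"
tree = Tree.fromstring(ptb_style)

tree.pretty_print()   # draw the devised structure
print(tree.leaves())  # the words are the only naturally occurring part:
                      # ['Santorini', 'wrote', 'the', 'manual']
```

Everything above the leaves (the node labels, the attachments, the -SBJ tag) is theory put there by annotators.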
What Sort of Representations for Syntax?
• Syntax: links between text and meaning
• Text consists of words -> lexical models
  • Lexicalized formalisms
  • Note: bi- and monolexical versions of CFG
• Need to link to meaning (for example, PropBank)
  • Extended domain of locality to locate predicate-argument structure
  • Note: importance of dash tags etc. in PTB-II (see the sketch below)
• Tree Adjoining Grammar! (but CCG is also cool, and LFG has its own appeal)
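The point about dash tags can be made concrete. A hedged sketch: even a crude reading of the -SBJ function tag recovers a predicate-argument link. The tree and the extract_subjects helper are invented for illustration; PropBank-style labeling is far richer:

```python
# A hedged sketch: recover crude predicate-argument links from the
# PTB-II -SBJ function (dash) tag. Illustrative only; real
# PropBank-style role labeling is far richer.
from nltk import Tree

def extract_subjects(tree):
    """Return (verb, subject head) pairs using the -SBJ dash tag."""
    pairs = []
    for s in tree.subtrees(lambda t: t.label() == "S"):
        kids = [c for c in s if isinstance(c, Tree)]
        subj = next((c for c in kids if c.label().startswith("NP-SBJ")), None)
        vp = next((c for c in kids if c.label().startswith("VP")), None)
        if subj is not None and vp is not None:
            verb = next(t for t in vp.subtrees() if t.label().startswith("VB"))
            pairs.append((verb[0], subj.leaves()[-1]))
    return pairs

t = Tree.fromstring("(S (NP-SBJ (NNP Kim)) (VP (VBD slept)))")
print(extract_subjects(t))  # [('slept', 'Kim')]
```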
Why Isn’t Everyone Using TAG?
• The PTB is not annotated with a TAG
• Need to do linguistic interpretation on the PTB to extract a TAG (Chen 2001, Xia 2001)
• This is not surprising: all linguistic representations need to be interpreted (Rambow 2010)
• Extraction of a (P)CFG is simple and requires little interpretation
• Extraction of a bilexical (P)CFG is not: it requires head percolation, which is interpretation (sketched below)
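Head percolation is itself a small but theory-laden piece of code, which is exactly the point: every table entry encodes a linguistic decision. A minimal sketch, loosely in the style of the Magerman/Collins head rules (the HEAD_RULES table below has a few illustrative entries, not the full rule set):

```python
# A minimal sketch of head percolation, loosely in the style of the
# Magerman/Collins head rules. HEAD_RULES is illustrative, not the
# full rule set; every entry is a piece of linguistic interpretation.
from nltk import Tree

HEAD_RULES = {
    "S":  ("right", ["VP", "S"]),                 # clauses headed by VP
    "VP": ("left",  ["VBD", "VBZ", "VB", "VP"]),  # VPs headed by the verb
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),  # NPs headed by the noun
}

def lexical_head(tree):
    """Percolate lexical heads bottom-up; return the head word."""
    if isinstance(tree[0], str):          # preterminal, e.g. (NN dog)
        return tree[0]
    direction, priorities = HEAD_RULES.get(tree.label(), ("left", []))
    kids = list(tree) if direction == "left" else list(reversed(tree))
    for cat in priorities:                # try categories in priority order
        for child in kids:
            if child.label().startswith(cat):
                return lexical_head(child)
    return lexical_head(kids[0])          # fallback: edge-most child

t = Tree.fromstring("(S (NP (NNP Kim)) (VP (VBD saw) (NP (DT a) (NN dog))))")
print(lexical_head(t))  # -> 'saw'
```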
Why Isn’t Everyone Using TAG Parsers?
• Unclear how well they perform
  • Phrase-structure (Parseval) evaluation is irrelevant
• MICA parser (Bangalore et al. 2009):
  • High 80s on a linguistically motivated predicate-argument dependency structure (attachment-score evaluation is sketched below)
  • MALT does slightly better on the same representation
  • But MICA’s output comes fully interpreted; MALT’s does not
• Once we have a good syntactic pred-arg structure, tasks like semantic role labeling (PropBank) become easier
  • 95% on arguments given a gold pred-arg structure (Chen and Rambow 2002)
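For readers unused to dependency evaluation, here is what scores like those above measure, in sketch form: unlabeled attachment over head indices. The one-head-index-per-token format is an assumption of this sketch, and this is not the evaluation code actually used for MICA or MALT:

```python
# A minimal sketch of unlabeled attachment score (UAS): the fraction of
# tokens whose predicted head matches the gold head. Illustrative only;
# not the evaluation code used for MICA or MALT.

def attachment_score(gold_heads, pred_heads):
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# "Kim saw a dog": head index per token, 1-based, 0 = root.
gold = [2, 0, 4, 2]          # Kim->saw, saw->root, a->dog, dog->saw
pred = [2, 0, 2, 2]          # parser wrongly attached "a" to "saw"
print(attachment_score(gold, pred))  # -> 0.75
```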
What Have We Learned About TAG Parsing?
• A large TAG grammar is not easy to manage computationally (MICA: 5,000 trees, 1,200 used in parsing)
• Small TAG grammars lose too much information
• Need to investigate:
  • Dynamic creation of TAG grammars (trees created in response to need) (note: LTAG-spinal, Shen 2006)
  • “Bushes”: underspecified trees
  • Metagrammars (Kinyon 2003)
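For orientation, a toy sketch of the objects a TAG parser manipulates: elementary trees combined by substitution. Adjunction, the operation that gives TAG its extra expressive power (and its computational cost), is omitted, and all names here are invented:

```python
# A toy sketch of TAG elementary trees and substitution. Adjunction is
# omitted for brevity; all names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    subst: bool = False                  # substitution site, e.g. NP↓

def substitute(tree, label, initial):
    """Plug an initial tree into the first matching substitution site."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == label:
            tree.children[i] = initial
            return True
        if substitute(child, label, initial):
            return True
    return False

def leaves(n):
    return [n.label] if not n.children else [w for c in n.children for w in leaves(c)]

# Elementary tree for a transitive verb: (S NP↓ (VP (V saw) NP↓))
saw = Node("S", [Node("NP", subst=True),
                 Node("VP", [Node("V", [Node("saw")]),
                             Node("NP", subst=True)])])
substitute(saw, "NP", Node("NP", [Node("Kim")]))
substitute(saw, "NP", Node("NP", [Node("a"), Node("dog")]))
print(leaves(saw))  # -> ['Kim', 'saw', 'a', 'dog']
```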
What About All Those Other Languages?
• Can’t do treebanks for 3,000 languages
• Need to understand cross-linguistic variation and use that understanding in computational models
• Cross-linguistic variation: theoretical syntax
• Models: NLP
• Link: metagrammars for TAG (toy sketch below)
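The metagrammar idea fits in a few lines: instead of writing thousands of elementary trees by hand, generate them by crossing abstract dimensions. A toy sketch with two invented dimensions, subcategorization and a word-order parameter; real metagrammar systems such as Kinyon's are far richer:

```python
# A toy metagrammar sketch: generate TAG-style tree templates by
# crossing abstract dimensions rather than writing each tree by hand.
# Both dimensions and the templates are invented for illustration.
from itertools import product

SUBCAT = {"intransitive": ["NP↓"], "transitive": ["NP↓", "NP↓"]}
WORD_ORDER = ["SVO", "SOV"]              # a toy cross-linguistic parameter

def template(subcat, order):
    subj, *objs = SUBCAT[subcat]
    seq = [subj, "V"] + objs if order == "SVO" else [subj] + objs + ["V"]
    return f"(S {' '.join(seq)})"

for subcat, order in product(SUBCAT, WORD_ORDER):
    print(f"{subcat:12s} {order} -> {template(subcat, order)}")
# e.g. transitive SOV -> (S NP↓ NP↓ V)
```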
Summary
• Treebanks already encode insights from theoretical syntax
• They require interpretation for non-trivial models
• Applications other than Parseval require richer representations (and richer evaluations)
• But English is probably not the right language for arguing the need for richer syntactic knowledge
• The real coming bottleneck: NLP for 3,000 languages