Treebank-Based Wide Coverage Probabilistic LFG Resources. Josef van Genabith, Aoife Cahill, Grzegorz Chrupala, Jennifer Foster, Deirdre Hogan, Conor Cafferkey, Mick Burke, Ruth O’Donovan, Yvette Graham, Karolina Owczarzak, Yuqing Guo, Ines Rehbein, Natalie Schluter and Djame Sedah
National Centre for Language Technology (NCLT), School of Computing, Dublin City University
Overview
• Context/Motivation
• Treebank-Based Acquisition of Wide-Coverage LFG Resources (Penn-II)
  • LFG
  • Automatic F-Structure Annotation Algorithm
  • Acquisition of Lexical Resources
• Parsing
  • Parsing Architectures
  • LDD-Resolution
  • Comparison with Hand-Crafted (XLE, RASP) and Treebank-Based (CCG, HPSG) Resources
• Generation
  • Basic Generator
  • Generation Grammar Transforms
  • “History-Based” Generation
• MT Evaluation
Motivation
• What do grammars do?
  • Grammars define languages as sets of strings
  • Grammars define which strings are grammatical and which are not
  • Grammars tell us about the syntactic structure of (associated with) strings
• “Shallow” vs. “deep” grammars
  • Shallow grammars do all of the above
  • Deep grammars (in addition) relate text to information/meaning representations
  • Information: predicate-argument-adjunct structure, deep dependency relations, logical forms, …
• In natural languages, linguistic material is not always interpreted locally where you encounter it: long-distance dependencies (LDDs)
• Resolution of LDDs is crucial for constructing accurate and complete information/meaning representations
• Deep grammars := (text <-> meaning) + (LDD resolution)
Motivation
• Constraint-based grammar formalisms (FU, GPSG, PATR-II, …)
  • Lexical-Functional Grammar (LFG)
  • Head-Driven Phrase Structure Grammar (HPSG)
  • Combinatory Categorial Grammar (CCG)
  • Tree-Adjoining Grammar (TAG)
• Traditionally, deep constraint-based grammars are hand-crafted
  • LFG ParGram, HPSG LinGO ERG, Core Language Engine (CLE), Alvey Tools, RASP, ALPINO, …
• Wide-coverage, deep constraint-based grammar development is very time-consuming, knowledge-intensive and expensive!
• Very hard to scale hand-crafted grammars to unrestricted text!
  • English XLE (Riezler et al. 2002); German XLE (Forst and Rohrer 2006); Japanese XLE (Masuichi and Okuma 2003); RASP (Carroll and Briscoe 2002); ALPINO (Bouma, van Noord and Malouf, 2000)
Motivation
• Instance of the “knowledge acquisition bottleneck” familiar from classical “rationalist” rule/knowledge-based AI/NLP
• Alternative to classical “rationalist” rule/knowledge-based AI/NLP: the “empiricist”, data-driven research paradigm (AI/NLP)
  • Corpora, …, machine-learning-based and statistical approaches, …
  • Treebank-based grammar acquisition, probabilistic parsing
  • Advantage: grammars can be induced (learned) automatically
  • Very low development cost, wide-coverage, robust, but …
• Most treebank-based grammar induction/parsing technology produces “shallow” grammars
• Shallow grammars don’t resolve LDDs (but see (Johnson 2002); …) and do not map strings to information/meaning representations …
Motivation
• Poses a number of research questions:
  • Can we address the knowledge acquisition bottleneck for deep grammar development by combining insights from rationalist and empiricist research paradigms?
• Specifically:
  • Can we automatically acquire wide-coverage “deep”, probabilistic, constraint-based grammars from treebanks?
  • How do we use them in parsing?
  • Can we use them for generation?
  • Can we acquire resources for different languages and treebank encodings?
  • How do these resources compare with hand-crafted resources?
  • How do they fare in applications … ?
Context
• TAG (Xia, 2001)
• LFG (Cahill, McCarthy, van Genabith and Way, 2002)
• CCG (Hockenmaier & Steedman, 2002)
• HPSG (Miyao and Tsujii, 2003)
• LFG
  • (van Genabith, Sadler and Way, 1999)
  • (Frank, 2000)
  • (Sadler, van Genabith and Way, 2000)
  • (Frank, Sadler, van Genabith and Way, 2003)
Lexical-Functional Grammar (LFG) Parsing
LFG Acquisition for English - Overview
• Treebank-Based Acquisition of LFG Resources (Penn-II)
  • Lexical-Functional Grammar (LFG)
  • Penn-II Treebank & Preprocessing/Clean-Up
  • F-Structure Annotation Algorithm
  • Grammar and Lexicon Extraction
  • Parsing Architectures (LDD Resolution)
  • Comparison with best hand-crafted resources: XLE and RASP
  • Comparison with treebank-based CCG and HPSG resources
Lexical-Functional Grammar (LFG)
Lexical-Functional Grammar (LFG) (Bresnan & Kaplan 1981, Bresnan 2001, Dalrymple 2001) is a constraint-based theory of grammar. Two (basic) levels of representation:
• C-structure: represents surface grammatical configurations such as word order; encoded as annotated CFG rules/trees
• F-structure: represents abstract syntactic functions such as SUBJ(ject), OBJ(ect), OBL(ique), PRED(icate), COMP(lement), ADJ(unct) …; encoded as attribute-value matrices (AVMs)/feature structures
F-structure approximates basic predicate-argument structure, dependency representation, logical form (van Genabith and Crouch, 1996; 1997)
Lexical-Functional Grammar (LFG)
• Subcategorisation:
  • Semantic forms (subcat frames): see<SUBJ,OBJ>
  • Completeness: all GFs in the semantic form are present in the local f-structure
  • Coherence: only the GFs in the semantic form are present in the local f-structure
• Long Distance Dependencies (LDDs): resolved at f-structure with
  • Functional Uncertainty Equations (regular expressions specifying paths in f-structure), e.g. TOPICREL = COMP* OBJ
  • Subcat frames
  • Completeness/Coherence
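
The completeness and coherence conditions can be illustrated with a minimal Python sketch. It assumes f-structures are represented as nested dictionaries and that coherence is checked only over governable grammatical functions; the function names and the `GOVERNABLE` inventory are hypothetical choices for illustration, not part of the resources described here.

```python
# Assumed (illustrative) inventory of governable grammatical functions.
GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}

def is_complete(fstr, frame):
    """Completeness: every GF named in the semantic form occurs locally."""
    return all(gf in fstr for gf in frame)

def is_coherent(fstr, frame):
    """Coherence: no governable GF occurs locally unless the frame names it."""
    present = {attr for attr in fstr if attr in GOVERNABLE}
    return present <= set(frame)

# Semantic form see<SUBJ,OBJ> from the slide, as a tuple of GFs.
see_frame = ("SUBJ", "OBJ")

ok = {"PRED": "see", "SUBJ": {"PRED": "John"}, "OBJ": {"PRED": "Mary"},
      "ADJUNCT": [{"PRED": "yesterday"}]}        # adjuncts are not governable
missing_obj = {"PRED": "see", "SUBJ": {"PRED": "John"}}
extra_comp = {**ok, "COMP": {"PRED": "leave"}}

print(is_complete(ok, see_frame), is_coherent(ok, see_frame))
print(is_complete(missing_obj, see_frame))
print(is_coherent(extra_comp, see_frame))
```

Keeping the two checks separate mirrors the slide: an f-structure can fail one condition while satisfying the other.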
Introduction: Penn-II & LFG
• If we had an f-structure-annotated version of Penn-II, we could use (standard) machine learning methods to extract probabilistic, wide-coverage LFG resources
• How do we get an f-structure-annotated Penn-II?
  • Manually? No: ~50,000 trees …!
  • Automatically? Yes: f-structure annotation algorithm …!
• Penn-II is a 2nd-generation treebank: it contains many annotations that support the derivation of deep meaning representations:
  • trees, Penn-II “functional” tags (-SBJ, -TMP, -LOC), traces & coindexation
• The f-structure annotation algorithm exploits those.
Treebank Preprocessing/Clean-Up: Penn-II & LFG
• Penn-II treebank: often flat analyses (coordination, NPs …) and a certain amount of noise: inconsistent annotations, errors …
• No treebank preprocessing or clean-up in the LFG approach (unlike CCG- and HPSG-based approaches)
• Take the Penn-II treebank as is, but
  • Remove all trees with FRAG- or X-labelled constituents
  • FRAG = fragments, X = not known how to annotate
• Total of 48,424 trees as they are.
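
The only filtering step, dropping trees that contain FRAG or X constituents, might look roughly like the following sketch. It assumes trees are available as Penn-style bracketed strings; the function name `keep_tree` and the toy trees are invented for illustration.

```python
import re

# Matches a constituent labelled FRAG or X, optionally followed by a
# functional tag or index (e.g. "(FRAG-TPC", "(X "), but not "(NNP Xerox)".
FRAG_OR_X = re.compile(r"\((?:FRAG|X)(?=[\s\-=])")

def keep_tree(bracketed):
    """Return True if the bracketed tree has no FRAG or X constituent."""
    return FRAG_OR_X.search(bracketed) is None

trees = [
    "(S (NP-SBJ (NNP Xerox)) (VP (VBZ works)))",
    "(FRAG (NP (NN Wow)))",
    "(S (X (DT the)) (VP (VBZ is)))",
]
kept = [t for t in trees if keep_tree(t)]
print(len(kept))
```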
Treebank Annotation: Penn-II & LFG
• Annotation-based (rather than conversion-based)
• Automatic annotation of nodes in Penn-II treebank trees with f-structure equations
• Annotation algorithm exploits:
  • Head information
  • Categorial information
  • Configurational information
  • Penn-II functional tags
  • Trace information
Treebank Annotation: Penn-II & LFG
Architecture of a modular algorithm to assign LFG f-structure equations to trees in the Penn-II treebank:
• Modules: Head-Lexicalisation [Magerman, 1994]; Left-Right Context Annotation Principles; Coordination Annotation Principles; Catch-All and Clean-Up; Traces
• Outputs: Proto F-Structures; Proper F-Structures
Treebank Annotation: Penn-II & LFG
• Head Lexicalisation: modified rules based on (Magerman, 1994)
Treebank Annotation: Penn-II & LFG
Left-Right Context Annotation Principles:
• Head of NP likely to be rightmost noun …
• Mother → Left Context, Head, Right Context
Treebank Annotation: Penn-II & LFG
Left-Right Annotation Matrix, NP example: “a very politicized deal”
• Unannotated tree: (NP (DT a) (ADJP (RB very) (JJ politicized)) (NN deal))
• Annotated rule NP → DT ADJP NN:
  • DT (a): ↑spec:det=↓
  • ADJP (very politicized): ↓∈↑adjunct
  • NN (deal): ↑=↓
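
A minimal Python sketch of how a left-right annotation matrix could be applied to the daughters of a local tree. The matrix below contains only the three NP entries from the example; the data layout, the function names, and the fallback in `rightmost_noun` are assumptions made for illustration, not the actual implementation.

```python
# Toy annotation matrix for NP: equations for categories found to the left
# of the head, for the head itself, and to the right of the head.
NP_MATRIX = {
    "left": {"DT": "↑spec:det=↓", "ADJP": "↓∈↑adjunct"},
    "head": "↑=↓",
    "right": {},
}

def rightmost_noun(daughters):
    """Head heuristic from the slide: the head of NP is likely the rightmost noun."""
    for i in range(len(daughters) - 1, -1, -1):
        if daughters[i].startswith("NN"):
            return i
    return len(daughters) - 1  # assumed fallback: last daughter

def annotate(daughters, head_index, matrix):
    """Pair each daughter category with an equation (None if the matrix has no entry)."""
    annotated = []
    for i, cat in enumerate(daughters):
        if i == head_index:
            equation = matrix["head"]
        elif i < head_index:
            equation = matrix["left"].get(cat)
        else:
            equation = matrix["right"].get(cat)
        annotated.append((cat, equation))
    return annotated

rhs = ["DT", "ADJP", "NN"]            # NP → DT ADJP NN
result = annotate(rhs, rightmost_noun(rhs), NP_MATRIX)
for cat, equation in result:
    print(cat, equation)
```

Daughters left without an equation (`None`) would be the cases that later modules, such as the catch-all and clean-up step, have to deal with.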
Treebank Annotation: Penn-II & LFG
• Construct an annotation matrix for each of the monadic categories (without -Fun tags) in Penn-II
• Based on analysing the most frequent rule types for each category such that
  • the sum total of token frequencies of these rule types is greater than 85% of the total number of rule tokens for that category
• Number of rule types needed to cover 100% vs. 85% of rule tokens:
  Category   100%    85%
  NP         6595    102
  VP         10239   307
  S          2602    20
  ADVP       234     6
• Apply the annotation matrix to all (i.e. also unseen) rules/sub-trees, i.e. also those for NP-LOC, NP-TMP etc.
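
The 85% criterion can be sketched as a cumulative-frequency cut-off over rule types sorted by token frequency. The rule counts below are invented toy numbers, and `rule_types_to_cover` is a hypothetical helper name.

```python
from collections import Counter

def rule_types_to_cover(rule_counts, threshold=0.85):
    """Most frequent rule types whose summed token frequency exceeds
    `threshold` of all rule tokens for the category."""
    total = sum(rule_counts.values())
    selected, covered = [], 0
    for rule, count in rule_counts.most_common():
        if covered > threshold * total:
            break
        selected.append(rule)
        covered += count
    return selected

# Invented counts for illustration only (100 rule tokens in total).
np_rules = Counter({
    "NP -> DT NN": 50,
    "NP -> NNP": 30,
    "NP -> DT JJ NN": 10,
    "NP -> NP PP": 6,
    "NP -> DT ADJP NN": 4,
})
selected = rule_types_to_cover(np_rules)
print(selected)
```

On real data the long tail is much longer, which is why only a small fraction of rule types has to be inspected by hand.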
Treebank Annotation: Penn-II & LFG
• Traces Module:
  • Long Distance Dependencies (LDDs)
  • Topicalisation
  • Questions
  • Wh- and wh-less relative clauses
  • Passivisation
  • Control constructions
  • ICH (interpret constituent here)
  • RNR (right node raising)
  • …
• Translate Penn-II traces and coindexation into corresponding reentrancy in f-structure
Treebank Annotation: Control & Wh-Rel. LDD
Treebank Annotation: Penn-II & LFG
• Modules: Head-Lexicalisation [Magerman, 1995]; Left-Right Context Annotation Principles; Coordination Annotation Principles; Catch-All and Clean-Up; Traces; Constraint Solver
• Outputs: Proto F-Structures; Proper F-Structures
Treebank Annotation: Penn-II & LFG
• Collect f-structure equations
• Send to constraint solver
• Generates f-structures
• F-structure annotation algorithm in Java, constraint solver in Prolog
• ~3 min to annotate ~50,000 Penn-II trees
• ~5 min to produce ~50,000 f-structures
Treebank Annotation: Penn-II & LFG
Evaluation (Quantitative):
• Coverage: over 99.8% of Penn-II sentences (without X and FRAG constituents) receive a single covering and connected f-structure
Treebank Annotation: Penn-II & LFG
• F-structure quality evaluation against the DCU 105 Dependency Bank, a manually annotated dependency gold standard of 105 sentences randomly extracted from WSJ Section 23
• Triples are extracted from the gold standard
• Evaluation software from (Crouch et al. 2002) and (Riezler et al. 2002)
• Triple format: relation(predicate~0, argument~1)
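
Triple-based evaluation reduces to precision, recall and f-score over matching dependency triples. The sketch below assumes triples are compared as exact strings within a sentence; it is not the evaluation software cited above, and the example triples are invented.

```python
def precision_recall_f(gold_triples, test_triples):
    """Precision, recall and f-score over exactly matching triples."""
    gold, test = set(gold_triples), set(test_triples)
    matched = len(gold & test)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented triples in relation(predicate~i, argument~j) style.
gold = {"subj(sign~0,john~1)", "obj(sign~0,deal~2)",
        "det(deal~2,a~3)", "adjunct(deal~2,politicized~4)"}
test = {"subj(sign~0,john~1)", "obj(sign~0,deal~2)",
        "adjunct(sign~0,politicized~4)"}

p, r, f = precision_recall_f(gold, test)
print(round(p, 4), round(r, 4), round(f, 4))
```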
Treebank Annotation: Penn-II & LFG
• Following (Kaplan et al. 2004), evaluation against the PARC 700 Dependency Bank calculated for:
  • all annotations
  • PARC features
  • preds-only
• Mapping required (Burke 2004, 2006)
Grammar and Lexicon Extraction: Penn-II & LFG
Lexical Resources:
• Lexical information extremely important in modern lexicalised grammar formalisms
  • LFG, HPSG, CCG, TAG, …
• Lexicon development is time-consuming and extremely expensive
• Rarely if ever complete
• Familiar knowledge acquisition bottleneck …
• Treebank-based subcategorisation frame induction (LFG semantic forms) from Penn-II and Penn-III
• Parser-based induction from the British National Corpus (BNC)
• Evaluation against COMLEX, OALD, Korhonen’s data set
Grammar and Lexicon Extraction: Penn-II & LFG
• Lexicon Construction
  • Manual vs. Automated
• Our Approach:
  • Subcat Frames not Predefined
  • Functional and/or Categorial Information
  • Parameterised for Prepositions and Particles
  • Active and Passive
  • Long Distance Dependencies
  • Conditional Probabilities
Grammar and Lexicon Extraction: Penn-II & LFG
Example semantic forms: apply<SUBJ,OBL:for>, win<SUBJ,OBJ>
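
Associating each extracted frame with a conditional probability can be sketched as relative-frequency estimation of P(frame | lemma). The observations below are invented; only the two frame shapes come from the slide, and `frame_probabilities` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def frame_probabilities(observations):
    """Relative-frequency estimate of P(frame | lemma) from (lemma, frame) pairs."""
    counts = defaultdict(Counter)
    for lemma, frame in observations:
        counts[lemma][frame] += 1
    return {
        lemma: {frame: n / sum(frames.values()) for frame, n in frames.items()}
        for lemma, frames in counts.items()
    }

# Invented observations, one per verb occurrence in annotated f-structures.
observations = [
    ("win", ("SUBJ", "OBJ")),
    ("win", ("SUBJ", "OBJ")),
    ("win", ("SUBJ", "OBJ")),
    ("win", ("SUBJ",)),
    ("apply", ("SUBJ", "OBL:for")),
    ("apply", ("SUBJ", "OBL:for")),
]
probs = frame_probabilities(observations)
print(probs["win"][("SUBJ", "OBJ")])
```

Because frames are read off the data as tuples rather than drawn from a fixed list, nothing has to be predefined, and the preposition is kept as part of the function label (`OBL:for`).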
Grammar and Lexicon Extraction: Penn-II & LFG
Lexicon extracted from Penn-II (O’Donovan et al. 2005)
Grammar and Lexicon Extraction: Penn-II & LFG
Parsing-Based Subcat Frame Extraction (O’Donovan 2006):
• Treebank-based vs. parsing-based subcat frame extraction
• Parsed the British National Corpus (BNC, 100 million words) with our automatically induced LFGs
• 19 days on a single machine: ~5 million words per day
• Subcat frame extraction for ~10,000 verb lemmas
• Evaluation against COMLEX and OALD
• Evaluation against the Korhonen (2002) gold standard
• Our method is statistically significantly better than Korhonen (2002)
Parsing: Penn-II and LFG
• Overview of Parsing Architectures: Pipeline & Integrated
• Long-Distance Dependency (LDD) Resolution at F-Structure
• Evaluation & Comparison with Hand-Crafted Resources (XLE and RASP)
• Comparison against Treebank-Based CCG and HPSG Resources
Parsing: Penn-II and LFG
• Require:
  • subcategorisation frames (O’Donovan et al., 2004, 2005; O’Donovan 2006)
  • functional uncertainty equations
• Previous example:
  • claim([subj,comp]), deny([subj,obj])
  • topicrel = comp* obj (search along a path of 0 or more comps)
Parsing: Penn-II and LFG
Subcat frames: as above (O’Donovan et al. 2004, 2005)
Functional Uncertainty equations:
• Automatically acquire finite approximations of FU-equations
• Extract paths between co-indexed material in automatically generated f-structures from Sections 02-21 of Penn-II
• 26 TOPIC, 60 TOPICREL, 13 FOCUS path types
• 99.69% coverage of paths in WSJ Section 23
• Each path type associated with a probability
LDD resolution ranked by Path × Subcat probabilities (Cahill et al., 2004)
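
The ranking by path and subcat probabilities can be illustrated with a simplified Python sketch based on the claim/deny example. It is not the published algorithm: the probabilities are invented, f-structures are plain nested dictionaries, and a candidate is accepted only if the verb at the end of the path has a frame matching exactly the governable functions present after the missing function is added (an assumed stand-in for completeness and coherence).

```python
GOVERNABLE = {"subj", "obj", "comp"}  # assumed inventory for this toy example

def follow(fstr, prefix):
    """Walk a path prefix through nested f-structures; None if it does not exist."""
    for gf in prefix:
        fstr = fstr.get(gf) if isinstance(fstr, dict) else None
        if fstr is None:
            return None
    return fstr

def resolve_ldd(fstr, path_probs, frame_probs):
    """Rank candidate resolution sites by P(path) * P(frame | verb)."""
    ranked = []
    for path, p_path in path_probs.items():
        *prefix, gf = path
        local = follow(fstr, prefix)
        if not isinstance(local, dict) or gf in local:
            continue  # path does not exist, or the function is already filled
        frame = frozenset(k for k in local if k in GOVERNABLE) | {gf}
        p_frame = frame_probs.get(local.get("pred"), {}).get(frame, 0.0)
        if p_frame > 0.0:
            ranked.append((p_path * p_frame, path))
    return sorted(ranked, reverse=True)

# Finite approximation of topicrel = comp* obj, with invented probabilities.
path_probs = {("obj",): 0.6, ("comp", "obj"): 0.3}
frame_probs = {
    "claim": {frozenset({"subj", "comp"}): 0.8, frozenset({"subj", "obj"}): 0.2},
    "deny": {frozenset({"subj", "obj"}): 0.9, frozenset({"subj"}): 0.1},
}
# Relative clause body: claim has a subj and a comp; deny lacks its obj.
clause = {"pred": "claim", "subj": {"pred": "official"},
          "comp": {"pred": "deny", "subj": {"pred": "john"}}}

ranked = resolve_ldd(clause, path_probs, frame_probs)
print(ranked)
```

Here the shorter path `obj` is rejected because claim has no frame with subj, comp and obj together, so the displaced element is resolved as the obj of deny. When several candidates survive, the product of the two probabilities decides the order.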
Parsing: Penn-II and LFG
• How do treebank-based constraint grammars compare to deep hand-crafted grammars like XLE and RASP?
• XLE (Riezler et al. 2002, Kaplan et al. 2004)
  • hand-crafted, wide-coverage, deep, state-of-the-art English LFG and XLE parsing system with log-linear probability models for disambiguation
  • PARC 700 Dependency Bank gold standard (King et al. 2003), based on Penn-II Section 23
• RASP (Carroll and Briscoe 2002)
  • hand-crafted, wide-coverage, deep, state-of-the-art English probabilistic unification grammar and parsing system (RASP: Rapid Accurate Statistical Parsing)
  • CBS 500 Dependency Bank gold standard (Carroll, Briscoe and Sanfilippo 1999), based on Susanne
Parsing: Penn-II and LFG
• (Bikel 2002) retrained to retain Penn-II functional tags (-SBJ, -LOC, -TMP, -CLR, -LGS, etc.)
• Pipeline architecture:
  • tag text → Bikel retrained + f-structure annotation algorithm + LDD resolution → f-structures → automatic conversion → evaluation against XLE/RASP gold standards (PARC 700/CBS 500 Dependency Banks)
Parsing: Penn-II and LFG
• Systematic differences between f-structures and PARC 700 and CBS 500 dependency representations
• Automatic conversion of f-structures to PARC 700 / CBS 500-like structures (Burke et al. 2004, Burke 2006, Cahill et al. 2008)
• Evaluation software (Crouch et al. 2002) and (Carroll and Briscoe 2002)
• Approximate Randomisation Test (Noreen 1989) for statistical significance
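
An approximate randomisation test for the difference in f-score between two parsers can be sketched as follows. This is a generic textbook-style version under stated assumptions, not the evaluation code used for the reported results: each system is represented by per-sentence counts `(matched, test_total, gold_total)`, and the number of trials, the seed and the toy counts are arbitrary.

```python
import random

def fscore(stats):
    """Corpus-level f-score from per-sentence (matched, test_total, gold_total)."""
    matched = sum(m for m, _, _ in stats)
    test_total = sum(t for _, t, _ in stats)
    gold_total = sum(g for _, _, g in stats)
    if matched == 0:
        return 0.0
    precision, recall = matched / test_total, matched / gold_total
    return 2 * precision * recall / (precision + recall)

def approximate_randomisation(stats_a, stats_b, trials=1000, seed=0):
    """Estimate the p-value for the observed f-score difference by randomly
    swapping the two systems' outputs sentence by sentence."""
    rng = random.Random(seed)
    observed = abs(fscore(stats_a) - fscore(stats_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for sent_a, sent_b in zip(stats_a, stats_b):
            if rng.random() < 0.5:
                sent_a, sent_b = sent_b, sent_a
            shuffled_a.append(sent_a)
            shuffled_b.append(sent_b)
        if abs(fscore(shuffled_a) - fscore(shuffled_b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)

# Invented per-sentence counts for two systems on the same 20 sentences.
system_a = [(9, 10, 10)] * 20
system_b = [(5, 10, 10)] * 20
p_value = approximate_randomisation(system_a, system_b)
print(p_value < 0.05)
```

Swapping whole sentences keeps the pairing between the two systems intact, which is what makes the test suitable for comparing parsers on the same gold standard.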
Parsing: Penn-II and LFG
• Result dependency f-scores (CL 2008 paper):
  • PARC 700, XLE vs. DCU-LFG
    • 80.55% XLE
    • 82.73% DCU-LFG (+2.18%)
  • CBS 500, RASP vs. DCU-LFG
    • 76.57% RASP
    • 80.23% DCU-LFG (+3.66%)
• Results statistically significant at 95% level (Noreen 1989)
• Best result now against PARC 700: 84.00% (+3.45%) with Charniak + Reranker + Grzegorz’ Penn-II function-tag labeler
Parsing: Penn-II and LFG
PARC 700 Evaluation