Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer
Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland
Treebank Workshop, NAACL 2007
Lexical-Functional Grammar (LFG)
• “Shallow” grammar: defines a language (a set of strings)
• “Deep” grammar: as above, plus maps strings to a “meaning” representation (predicate-argument structure, dependencies, simple logical form, …); usually involves some form of long-distance dependency (LDD) resolution
• Deep grammars (HPSG, LFG, CCG, TAG, …) are usually hand-crafted
• Very difficult and expensive to scale to unrestricted text
• Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)
• LFG [Kaplan and Bresnan, 1982; Dalrymple, 2001; Bresnan, 2001]:
  • Constraint-based (“unification”), lexicalised
  • c(onstituent)-structure and f(unctional)-structure
  • c-str: surface configuration (CFG trees)
  • f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)
  • f-str: AVM (feature-structure) encoding of dependencies/pred-arg structure
Lexical-Functional Grammar (LFG)
• Treebank: trees
• How do we get from trees to f-structures?
• What’s missing is the equations!
• Automatic f-structure annotation algorithm:
  • Traverses the tree and assigns LFG equations
  • Principle-based c-str/f-str interface
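Once a tree carries equations, an f-structure can be read off it. The following is a simplified sketch, not the actual constraint solver: the `Node` class, the `gf` field, and the `build` function are hypothetical names, only two equation types are handled (head sharing and embedding under a grammatical function), and features are merged with a plain dict update rather than real unification, so clashes are not detected.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    cat: str                       # category label, e.g. "NP"
    gf: Optional[str] = None       # None encodes "up = down"; "SUBJ" encodes "(up SUBJ) = down"
    children: list = field(default_factory=list)
    lex: dict = field(default_factory=dict)  # features contributed by a lexical leaf


def build(node, fstr=None):
    """Return the f-structure that `node` maps to, filling `fstr` in place."""
    fstr = {} if fstr is None else fstr
    fstr.update(node.lex)          # simplification: no clash checking
    for child in node.children:
        if child.gf is None:
            build(child, fstr)     # head daughter shares the mother's f-structure
        else:
            build(child, fstr.setdefault(child.gf, {}))  # embed under the function
    return fstr


# Illustrative annotated tree for "John sees Mary".
tree = Node("S", children=[
    Node("NP", gf="SUBJ", children=[Node("NNP", lex={"PRED": "John"})]),
    Node("VP", children=[
        Node("VBZ", lex={"PRED": "see<SUBJ,OBJ>", "TENSE": "pres"}),
        Node("NP", gf="OBJ", children=[Node("NNP", lex={"PRED": "Mary"})]),
    ]),
])

print(build(tree))
```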
F-Structure Annotation Algorithm
• The algorithm exploits:
  • Categorial information (NP, VP, VBZ, …)
  • Configurational information:
    • Local head, left/right of head
    • Leftmost NP sister to the right of a V(erbal) head: (↑ OBJ)=↓
  • Morphological information:
    • him: (↑ OBJ)=↓
  • “Functional” tag information:
    • -LGS: (↑ PASSIVE)=+; -SBJ, -CLR, …
  • Trace/co-indexation information:
    • Translate traces + co-indexation into corresponding re-entrancies at f-str
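A toy version of the configurational part can be sketched as follows. Only the head rule and the "leftmost NP to the right of a verbal head is OBJ" rule are taken from the slide; the subject rule, the adjunct default, the tag set `VERBAL`, and the function name `annotate` are assumptions added so the example runs. Equations are written in ASCII with `^` for the up arrow and `!` for the down arrow, as in XLE grammar files.

```python
# Penn-II verbal tags treated as possible verbal heads (assumed set).
VERBAL = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}


def annotate(mother, daughters, head_index):
    """Assign one equation to each daughter of a local tree.

    `daughters` is a list of category labels; `head_index` marks the local head.
    """
    head_cat = daughters[head_index]
    obj_assigned = False
    annotations = []
    for i, cat in enumerate(daughters):
        if i == head_index:
            eq = "^=!"                       # head shares the mother's f-structure
        elif cat == "NP" and i > head_index and head_cat in VERBAL and not obj_assigned:
            eq = "(^ OBJ)=!"                 # rule from the slide
            obj_assigned = True
        elif cat == "NP" and i < head_index and mother == "S":
            eq = "(^ SUBJ)=!"                # assumed rule, for illustration
        else:
            eq = "! $ (^ ADJN)"              # assumed default: member of the adjunct set
        annotations.append((cat, eq))
    return annotations


print(annotate("S", ["NP", "VP"], head_index=1))
print(annotate("VP", ["VBZ", "NP", "PP"], head_index=0))
```

The real algorithm also consults morphological, functional-tag and trace information, which this sketch leaves out.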
F-Structure Annotation Algorithm
[Diagram; only its labels survive:]
• Lemmatization + Macros
• Lexical Entries
• Defaults – “Functional Tags”
• Head-Lexicalization [Magerman, 1994]
• Left-Right Context Annotation Principles
• Proto F-Structures
• Coordination Annotation Principles
• Proper F-Structures
• Catch-All and Clean-Up
• Traces
Treebank Annotation: Control & Wh-Rel. LDD
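Control and wh-relative LDDs both surface at f-structure as re-entrancies, that is, one f-structure reachable along two paths. A minimal sketch, assuming dict-encoded f-structures: the sentences, the `TRACE` placeholder, the `TOPICREL` attribute name, and the helper `link_traces` are all illustrative choices, not taken from the slide.

```python
# Control: "John wants to leave". The subject of "want" is also the subject
# of the XCOMP, so both paths point at the same Python object.
john = {"PRED": "John", "NUM": "sg"}
wants = {
    "PRED": "want<SUBJ,XCOMP>",
    "SUBJ": john,
    "XCOMP": {"PRED": "leave<SUBJ>", "SUBJ": john},
}
wants["SUBJ"]["CASE"] = "nom"      # visible along both paths


def link_traces(fstr, antecedents):
    """Replace each {'TRACE': i} placeholder with the antecedent indexed i."""
    for attr, value in list(fstr.items()):
        if isinstance(value, dict):
            if "TRACE" in value:
                fstr[attr] = antecedents[value["TRACE"]]   # create the re-entrancy
            else:
                link_traces(value, antecedents)
    return fstr


# Wh-relative: "(the man) who Mary saw". The object trace is co-indexed (index 1)
# with the relative pronoun.
who = {"PRED": "pro", "PRON_FORM": "who"}
rel = {
    "PRED": "see<SUBJ,OBJ>",
    "TOPICREL": who,
    "SUBJ": {"PRED": "Mary"},
    "OBJ": {"TRACE": 1},
}
link_traces(rel, {1: who})

print(wants["SUBJ"] is wants["XCOMP"]["SUBJ"], rel["OBJ"] is rel["TOPICREL"])
```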
Multilingual Treebank-Based LFG Resources
• English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB resources (QuestionBank), transfer
• Pilots/proof of concept for multilingual treebank-based LFG acquisition:
  • German: TIGER (Cahill et al. 2003, 2005)
  • Chinese: CTB (Burke et al. 2004)
  • Spanish: Cast3LB (O’Donovan et al. 2005), (Chrupala and van Genabith 2006)
• GramLab Project (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German
Multilingual Treebank-Based LFG Resources

Language   Treebank    Size              Coding/Data
English    Penn-II     50,000            CFG+traces+FT
Chinese    CTB 5.1     18,000            CFG+traces+FT
Japanese   KTC 4.0     38,000            Dep (+traces)
German     TIGER 2.0   50,000            Graphs+CFG+Dep
German     TüBa-D/Z    22,000            CFG+Dep+f-traces
Spanish    Cast3LB     3,500             CFG+Dep+f-traces
Arabic     ATB         300,000 (words)
French     P7T         20,000            CFG+Dep+f-traces
Total                  > 200,000
Q2
• What was missing in the TB resource?
  • F-structures, pred-argument structure, dependencies => f-structure annotation algorithm
  • Limited domain in Penn-II (and most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)
  • GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …
• What was done by hand?
  • F-structure annotation algorithm (principle-based c-/f-str interface)
  • No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG; but see P7T)
  • No manual additions (unlike CCG/HPSG/TAG)
• Future work …
Q3
• Methodological issues: quality assurance
  • Evaluation against hand-crafted/corrected gold-standard DepBanks:
    • PARC 700
    • CBS 500
    • PropBank
    • Own gold-standard DepBanks for English, Chinese, Japanese, German, Arabic, Spanish, French (200-500)
  • CCG-style evaluation against automatically annotated gold- (silver-) standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)
  • Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
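Evaluation against a dependency bank is typically reported as precision, recall and f-score over dependency triples. The sketch below shows that computation under simple assumptions: triples are compared as sets (so duplicate triples are not counted), and the function name `prf` and the example triples are invented for illustration, not drawn from PARC 700 or CBS 500.

```python
def prf(gold, test):
    """Precision, recall and f-score of `test` triples against `gold` triples."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative triples of the form (relation, head, dependent).
gold = [
    ("SUBJ", "see", "John"),
    ("OBJ", "see", "Mary"),
    ("TENSE", "see", "past"),
    ("NUM", "John", "sg"),
]
parsed = [
    ("SUBJ", "see", "John"),
    ("OBJ", "see", "Mary"),
    ("TENSE", "see", "pres"),   # wrong feature value
    ("NUM", "John", "sg"),
]

precision, recall, f_score = prf(gold, parsed)
print(precision, recall, f_score)
```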
Q4
• Phrase structure or dependencies?
• Both! Why?
  • Phrase structure is good for parsing and generation => tap into lots of mature, efficient and well-understood technology (but see dependency parsing)
  • Dependencies are close to f-structures/predicate-argument structures …
• Penn-II: CFG trees + traces/co-indexation + “functional” labels/tags
• TIGER: graphs + CFG categories + grammatical function labels + LDDs through crossing edges
• Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths
Q5 & Q6
• Pros/cons of a formalism-specific treebank?
  • Formalism-specific treebank? Bad! Limits usefulness/user group/…
  • Better to have a generic TB with CFG + dep labels + LDDs + other feature labels (as required), and then extract LFG/HPSG/CCG/TAG/dependency grammars
• Grammar first vs. treebank first?
  • Depends on what you want to do …
  • If you want high-quality, wide-coverage resources (that can parse unrestricted text), then it’s definitely better to do treebanking first (or use bootstrapping)
  • Problem: many traditionally trained linguists see treebanking as a menial task
  • But it is a highly qualified and interesting task: empirical linguistics, confronting rather than inventing data
  • Sociological task: how to make treebanking/bootstrapping sexy?
Some Resources
• ESSLLI 2006 course material: Treebank-Based Acquisition of LFG, HPSG and CCG Resources. J. van Genabith, Y. Miyao and J. Hockenmaier. http://www.computing.dcu.ie/~josef/Malaga06.ppt
• LFG parser demo: http://lfg-demo.computing.dcu.ie/lfgparser.html
• A. Cahill and J. van Genabith. Robust PCFG-Based Generation using Automatically Acquired LFG Approximations. COLING/ACL 2006, Sydney, Australia.
• J. Judge, A. Cahill and J. van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions. COLING/ACL 2006, Sydney, Australia.
• R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks. Computational Linguistics, 2005.
• A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Kluwer Academic Press, 2005.
• R. O'Donovan, A. Cahill, J. van Genabith and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005.
• M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way. Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. In Proceedings of the PACLIC-18 Conference, Waseda University, Tokyo, Japan, pp. 161-172, 2004.
• A. Cahill, M. Burke, R. O'Donovan, J. van Genabith and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of ACL-04, pp. 320-327, Barcelona, Spain, 2004.
• A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA, pp. 76-95, 2002.