
Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer

Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland. Treebank Workshop, NAACL 2007.


Presentation Transcript


  1. Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer
     Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland
     Treebank Workshop, NAACL 2007

  2. Lexical-Functional Grammar (LFG)
     • “Shallow” grammar: defines a language (a set of strings)
     • “Deep” grammar: as above, plus a mapping from strings to a “meaning” representation (predicate-argument structure, dependencies, simple logical form, …); usually involves some form of long-distance dependency (LDD) resolution
     • Deep grammars (HPSG, LFG, CCG, TAG, …) are usually hand-crafted
     • Very difficult and expensive to scale to unrestricted text
     • Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)
     • LFG: [Kaplan and Bresnan, 1982; Dalrymple, 2001; Bresnan, 2001]
       • Constraint-based (“unification”), lexicalised
       • c(onstituent)-structure and f(unctional)-structure
       • c-str: surface configuration (CFG trees)
       • f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)
       • f-str: AVM (feature-structure) encoding of dependencies/pred-arg structure
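
To make the c-structure/f-structure distinction concrete, here is a minimal Python sketch. The example sentence, the feature names (`TENSE`, `NUM`) and the helper functions `terminal_yield` and `pred_arg_triples` are illustrative assumptions, not taken from the slides. The c-structure is a nested tuple (a CFG tree); the f-structure is a nested dictionary, i.e. a small attribute-value matrix.

```python
# c-structure: a CFG tree as nested tuples (label, child, child, ...).
c_structure = (
    "S",
    ("NP", ("NNP", "John")),
    ("VP", ("VBZ", "sees"), ("NP", ("NNP", "Mary"))),
)

# f-structure: an attribute-value matrix as nested dicts (hypothetical analysis).
f_structure = {
    "PRED": "see<SUBJ,OBJ>",
    "TENSE": "pres",
    "SUBJ": {"PRED": "John", "NUM": "sg"},
    "OBJ": {"PRED": "Mary", "NUM": "sg"},
}


def terminal_yield(tree):
    """Return the words at the leaves of a c-structure, left to right."""
    _label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # pre-terminal node: (POS, word)
    words = []
    for child in children:
        words.extend(terminal_yield(child))
    return words


def pred_arg_triples(fs):
    """Flatten an f-structure into (head, function, dependent) triples."""
    head = fs["PRED"].split("<")[0]
    triples = []
    for attr, value in fs.items():
        if isinstance(value, dict):  # a grammatical function with its own f-structure
            triples.append((head, attr, value["PRED"].split("<")[0]))
            triples.extend(pred_arg_triples(value))
    return triples


print(terminal_yield(c_structure))
print(pred_arg_triples(f_structure))
```

The tree records surface order and constituency only; the dictionary abstracts away from word order and records who is the SUBJ and OBJ of what.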

  3. Lexical-Functional Grammar (LFG)

  4. Lexical-Functional Grammar (LFG)
     • Treebank: trees
     • How do we get from trees to f-structures?
     • What’s missing is the equations!
     • Automatic f-structure annotation algorithm
       • Traverses the tree and assigns LFG equations
       • Principle-based c-str/f-str interface
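
As a rough illustration of why the equations are the missing piece: once every tree node carries an equation, an f-structure can be read off by walking the tree. The sketch below is a toy, not the implementation described in the talk. The `node` helper, the ASCII spellings `up=down` and `up.GF=down` (standing in for the usual LFG up/down-arrow equations), the lexical features, and the example sentence are all assumptions made for illustration.

```python
def node(cat, eq, children=(), lex=None):
    """Hypothetical annotated tree node: category, equation, children, lexical features."""
    return {"cat": cat, "eq": eq, "children": list(children), "lex": lex or {}}


def add_features(fs, features):
    """Add atomic features to an f-structure, failing on a value clash."""
    for attr, value in features.items():
        if attr in fs and fs[attr] != value:
            raise ValueError(f"clash on {attr}: {fs[attr]!r} vs {value!r}")
        fs[attr] = value


def resolve(n, parent_fs):
    """Walk an annotated tree and build the f-structure it describes."""
    if n["eq"] == "up=down":
        fs = parent_fs  # node shares its mother's f-structure
    else:
        gf = n["eq"].split(".")[1].split("=")[0]  # "up.SUBJ=down" -> "SUBJ"
        fs = parent_fs.setdefault(gf, {})  # node fills a function of its mother
    add_features(fs, n["lex"])
    for child in n["children"]:
        resolve(child, fs)
    return fs


tree = node("S", "up=down", [
    node("NP", "up.SUBJ=down", [
        node("NNP", "up=down", lex={"PRED": "John", "NUM": "sg"}),
    ]),
    node("VP", "up=down", [
        node("VBZ", "up=down", lex={"PRED": "see<SUBJ,OBJ>", "TENSE": "pres"}),
        node("NP", "up.OBJ=down", [
            node("NNP", "up=down", lex={"PRED": "Mary", "NUM": "sg"}),
        ]),
    ]),
])

fstr = resolve(tree, {})
print(fstr)
```

Only two equation shapes are handled here; set-valued functions (adjuncts, coordination) and re-entrancies from traces would need a real constraint solver.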

  5. F-Structure Annotation Algorithm
     • Algorithm exploits:
       • Categorial information (NP, VP, VBZ, …)
       • Configurational information:
         • Local head, left/right of head
         • Leftmost NP sister to right of V(erbal) head: (↑ OBJ)=↓
       • Morphological information:
         • Him: (↑ OBJ)=↓
       • “Functional” tag information:
         • -LGS: (↑ PASSIVE)=+, -SBJ, -CLR, …
       • Trace/co-indexation information:
         • Translate traces + co-indexation to corresponding re-entrancies at f-str
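
The following Python sketch shows how a few of these information sources could be combined to annotate one local tree. It is a deliberately tiny rule set under stated assumptions, not the annotation principles used in the talk: the function name `annotate_local_tree`, the `VERBAL` tag set, the rule ordering, the ASCII equation spellings, and the fallback to a set-valued `ADJN` are all hypothetical. Labels are Penn-style strings such as `NP-SBJ`; empty elements like `-NONE-` are not handled.

```python
# Assumed set of verbal POS tags that can head a VP.
VERBAL = {"VB", "VBD", "VBZ", "VBP", "VBG", "VBN", "MD"}


def annotate_local_tree(daughters, head_index):
    """Assign one equation per daughter of a local tree.

    daughters:  Penn-style labels, e.g. ["NP-SBJ", "VP"]
    head_index: position of the local head (assumed given, e.g. by head rules)
    """
    head_cat = daughters[head_index].split("-")[0]
    equations = []
    obj_assigned = False
    for i, label in enumerate(daughters):
        cat, *tags = label.split("-")
        if i == head_index:
            eq = "up=down"  # head shares the mother's f-structure
        elif "SBJ" in tags:
            eq = "up.SUBJ=down"  # functional tag information
        elif (cat == "NP" and not tags and i > head_index
              and head_cat in VERBAL and not obj_assigned):
            eq = "up.OBJ=down"  # leftmost bare NP right of a verbal head
            obj_assigned = True
        else:
            eq = "down in up.ADJN"  # catch-all default
        equations.append(eq)
    return equations


print(annotate_local_tree(["NP-SBJ", "VP"], 1))
print(annotate_local_tree(["VBZ", "NP", "NP-TMP"], 0))
```

Checking the functional tag before the configurational rule keeps a tagged NP such as `NP-TMP` from being mistaken for an object.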

  6. F-Structure Annotation Algorithm (diagram labels, as transcribed)
     • Lemmatization + Macros
     • Lexical Entries
     • Defaults – “Functional Tags”
     • Head-Lexicalization [Magerman, 1994]
     • Left-Right Context Annotation Principles
     • Proto F-Structures
     • Coordination Annotation Principles
     • Proper F-Structures
     • Catch-All and Clean-Up
     • Traces

  7. Treebank Annotation: Control & Wh-Rel. LDD

  8. Multilingual Treebank-Based LFG Resources
     • English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB-resources (QuestionBank), transfer
     • Pilots/proof of concept: multilingual treebank-based LFG acquisition:
       • German: TIGER (Cahill et al 2003, 2005)
       • Chinese: CTB (Burke et al 2004)
       • Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)
     • GramLab Project (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German

  9. Multilingual Treebank-Based LFG Resources

     Language   Treebank    Size              Coding/Data
     English    Penn-II     50,000            CFG+traces+FT
     Chinese    CTB 5.1     18,000            CFG+traces+FT
     Japanese   KTC 4.0     38,000            Dep (+traces)
     German     TIGER 2.0   50,000            Graphs+CFG+Dep
     German     TüBa-D/Z    22,000            CFG+Dep+f-traces
     Spanish    Cast3LB     3,500             CFG+Dep+f-traces
     Arabic     ATB         300,000 (words)
     French     P7T         20,000            CFG+Dep+f-traces
     Total                  > 200,000

  10. Q2
      • What was missing in the TB resource?
        • F-structures, pred-argument structure, dependencies => f-structure annotation algorithm
        • Limited domain in Penn-II (most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)
        • GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …
      • What was done by hand?
        • F-structure annotation algorithm (principle-based c-/f-str interface)
        • No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG; but see P7T)
        • No manual additions (unlike CCG/HPSG/TAG)
      • Future work …

  11. Q3
      • Methodological Issues - Quality Assurance:
        • Evaluation against hand-crafted/corrected Gold Standard DepBanks:
          • PARC 700
          • CBS 500
          • PropBank
          • Own Gold Standard DepBanks for: English, Chinese, Japanese, German, Arabic, Spanish, French (200-500)
        • CCG-style evaluation against automatically annotated Gold (Silver-) Standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)
        • Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
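
Evaluation against a dependency bank is commonly reported as precision, recall and f-score over dependency triples. The slides do not spell out the scoring procedure, so the Python sketch below is an assumption about the general idea only: the function name `prf`, the triple format and the toy data are hypothetical, and it scores a single sentence with sets rather than aggregating counts over a whole test set.

```python
def prf(gold, test):
    """Precision, recall and f-score of test triples against gold triples."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: (head, relation, dependent-or-value) triples for one sentence.
gold_triples = [
    ("see", "SUBJ", "John"),
    ("see", "OBJ", "Mary"),
    ("see", "TENSE", "pres"),
]
parser_triples = [
    ("see", "SUBJ", "John"),
    ("see", "OBJ", "Bill"),  # wrong dependent
    ("see", "TENSE", "pres"),
]

p, r, f = prf(gold_triples, parser_triples)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Scoring triples rather than whole trees gives partial credit per dependency, which is why this style of evaluation can compare parsers built on different formalisms.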

  12. Q4
      • Phrase Structure or Dependencies?
      • Both! Why?
        • Phrase structure is good for parsing and generation => tap into lots of mature, efficient and well-understood technology (but see dependency parsing)
        • Dependencies are close to f-structure/predicate-argument structures …
      • Penn-II: CFG trees + traces/co-indexation + “functional” labels/tags
      • TIGER: graphs + CFG categories + grammatical function labels + LDDs through crossing edges
      • Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths

  13. Q5 & Q6
      • Pros/Cons of a Formalism-Specific Treebank?
        • Formalism-specific treebank? Bad! Limits usefulness/user group/…
        • Better to have a generic TB with CFG + Dep labels + LDDs + other feature labels (as required), and then extract LFG/HPSG/CCG/TAG/Dependency Grammars
      • Grammar First vs. Treebank First?
        • Depends on what you want to do …
        • If you want high-quality, wide-coverage resources (that can parse unrestricted text), then it’s definitely better to do treebanking first (or use bootstrapping)
        • Problem: many traditionally trained linguists see treebanking as a menial task
        • In fact a highly qualified and interesting task: empirical linguistics, confronting rather than inventing data
        • Sociological task: how to make treebanking/bootstrapping sexy?

  14. Some Resources
      • ESSLLI 2006 course material: J. van Genabith, Y. Miyao and J. Hockenmaier. Treebank-Based Acquisition of LFG, HPSG and CCG Resources. http://www.computing.dcu.ie/~josef/Malaga06.ppt
      • LFG parser demo: http://lfg-demo.computing.dcu.ie/lfgparser.html
      • A. Cahill and J. van Genabith. Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations. COLING/ACL 2006, Sydney, Australia.
      • J. Judge, A. Cahill and J. van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions. COLING/ACL 2006, Sydney, Australia.
      • R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks. Computational Linguistics, 2005.
      • A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Kluwer Academic Press, 2005.
      • R. O'Donovan, A. Cahill, J. van Genabith and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005.
      • M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way. Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. In Proceedings of the PACLING-18 Conference, Waseda University, Tokyo, Japan, pp. 161-172, 2004.
      • A. Cahill, M. Burke, R. O'Donovan, J. van Genabith and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of ACL-04, Barcelona, Spain, pp. 320-7, 2004.
      • A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway-King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA, pp. 76-95, 2002.
