Automatic Functor Assignment (AFA) in the Prague Dependency Treebank
• PDT:
  • a long-term research project
  • at the Institute of Formal and Applied Linguistics
  • aimed at a complex annotation of a part of the Czech National Corpus
  • annotation scheme - 3 levels
• Functors:
  • actants: ACT, PAT, ADDR, EFF, ORIG
  • free modifiers: TWHEN, LOC, DIR1, BEN, APP, CPR ...
[Diagram: AFA's position within the PDT. Raw text → Morphologically tagged text → Analytic tree structures (ATS) → Tectogrammatical tree structures (TGTS)]
Zdeněk Žabokrtský: Automatic Functor Assignment in the PDT
Problem analysis, Data preprocessing
• Motivation
  • to reduce the huge amount of human work involved in the development of the PDT
• Problem statement
  • to assign a functor to every node in a TGTS
• Initial situation
  • no AFA system with reasonable coverage existed
  • human annotators rely mostly on their knowledge of the language, not on "formal" rules
  • annotators take the whole-sentence context into account
  • a certain amount of manually annotated TGTSs is available
• What is the minimal amount of information that is sufficient to decide on the functor?
• Problem reformulation
  • AFA: to classify symbolic vectors into 53 classes
• Available material - 18 files (up to 50 sentences in each)
  • imperfect: incomplete, ambiguous
  • divided into two parts:
    • testing set - 15 files (6049 vectors)
    • training set - 3 files (1089 vectors)
[Diagram: feature selection + feature extraction; vectors with 12 symbolic attributes]
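The reformulation above can be pictured as plain labelled data: each node becomes a tuple of symbolic attribute values paired with a functor. The sketch below is only an illustration of that data shape. The attribute names, the four-attribute width (the slides mention 12 attributes without listing them), the example words, and the most-frequent-functor baseline are all assumptions, not part of the presented system.

```python
from collections import Counter

# Hypothetical attribute names; the slides do not list the 12 attributes,
# so this sketch uses only 4 made-up ones.
ATTRIBUTES = ("lemma", "pos", "case", "parent_pos")

# Illustrative labelled vectors: (attribute values, functor).
training = [
    (("Petr", "N", "1", "V"), "ACT"),
    (("dopis", "N", "4", "V"), "PAT"),
    (("Marie", "N", "1", "V"), "ACT"),
]


def most_frequent_functor(examples):
    """Trivial baseline: return the functor seen most often in the examples."""
    counts = Counter(functor for _, functor in examples)
    return counts.most_common(1)[0][0]


baseline = most_frequent_functor(training)
print(baseline)
```

Any real assigner has to improve on such a baseline by actually looking at the attribute values, which is what the methods on the following slide do.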
Components of the proposed AFA system
• Symbiosis of 4 different approaches:
  • 7 Rule-based Methods (RBMs)
  • 3 Dictionary-based Methods (DBMs)
  • Nearest vector (similarity)
  • Machine learning (Quinlan's C4.5, Sašo Džeroski)
• Implementation:
  • a set of small programs for preprocessing and format conversions, dictionary mining, functor assigning, and performance evaluation
  • Linux filters, Perl, SQL
  • assigners are applied in a strictly pipelined fashion
[Figure: data flow diagram]
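A minimal sketch of how assigners of different kinds might be chained is given below. It assumes that "strictly pipelined" means each assigner sees the output of the previous one and fills in only nodes that are still unassigned; the slides do not spell this out. The rule, the dictionary entries, the attribute names, and the match-count similarity are hypothetical stand-ins, and the original system was built from Linux filters, Perl, and SQL rather than Python.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, str]                      # attribute name -> symbolic value
Assigner = Callable[[Vector], Optional[str]]  # returns a functor or None


def rule_nominative_noun(v: Vector) -> Optional[str]:
    """Hypothetical rule-based method: nominative noun under a verb -> ACT."""
    if v.get("pos") == "N" and v.get("case") == "1" and v.get("parent_pos") == "V":
        return "ACT"
    return None


# Hypothetical dictionary mined from annotated data: adverb lemma -> functor.
ADVERB_DICT = {"včera": "TWHEN", "zde": "LOC"}


def dict_adverb(v: Vector) -> Optional[str]:
    """Hypothetical dictionary-based method: look the lemma up."""
    return ADVERB_DICT.get(v.get("lemma", ""))


def make_nearest_vector(training: List[Tuple[Vector, str]]) -> Assigner:
    """Nearest-vector method; similarity = number of matching attributes."""
    def assign(v: Vector) -> Optional[str]:
        def similarity(example: Tuple[Vector, str]) -> int:
            return sum(1 for k in v if example[0].get(k) == v[k])
        best = max(training, key=similarity, default=None)
        return best[1] if best is not None else None
    return assign


def run_pipeline(vectors: List[Vector], assigners: List[Assigner]) -> List[Optional[str]]:
    """Apply assigners in order; later ones only touch unassigned nodes."""
    result: List[Optional[str]] = [None] * len(vectors)
    for assign in assigners:
        for i, v in enumerate(vectors):
            if result[i] is None:
                result[i] = assign(v)
    return result


training = [({"lemma": "kniha", "pos": "N", "case": "4", "parent_pos": "V"}, "PAT")]
vectors = [
    {"lemma": "Petr", "pos": "N", "case": "1", "parent_pos": "V"},
    {"lemma": "včera", "pos": "D", "case": "-", "parent_pos": "V"},
    {"lemma": "dopis", "pos": "N", "case": "4", "parent_pos": "V"},
]
functors = run_pipeline(
    vectors, [rule_nominative_noun, dict_adverb, make_nearest_vector(training)]
)
print(functors)
```

Placing the narrower assigners first and the similarity-based fallback last is one design choice among several; reordering the list passed to `run_pipeline` gives a different sequence to compare.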
Performance evaluation
• Detailed evaluation of several quantities for each assigner in a sequence
• Several sequences of assigners were tested
  • e.g., a sequence of RBMs [table not preserved]
• Comparison of different sequences of assigners [table not preserved]
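The slides do not name the evaluated quantities, so the sketch below assumes a plausible set: cover (nodes that received a functor), relative cover, hits, errors, and precision (hits divided by cover). The function name and the toy data are hypothetical.

```python
from typing import Dict, List, Optional


def evaluate(predicted: List[Optional[str]], gold: List[str]) -> Dict[str, float]:
    """Compare assigned functors with manual annotation.

    None in `predicted` means the assigner left the node unassigned.
    """
    assigned = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    hits = sum(1 for p, g in assigned if p == g)
    cover = len(assigned)
    total = len(gold)
    return {
        "cover": cover,
        "relative_cover": cover / total if total else 0.0,
        "hits": hits,
        "errors": cover - hits,
        "precision": hits / cover if cover else 0.0,
    }


predicted = ["ACT", None, "PAT", "LOC"]
gold = ["ACT", "TWHEN", "PAT", "TWHEN"]
scores = evaluate(predicted, gold)
print(scores)
```

Reporting cover and precision separately matters for a pipeline: an assigner that labels few nodes with high precision can be worth placing early, while a broad but less precise one is better kept as a fallback.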
Further work
• Machine learning - searching for new regularities
• Improvement of dictionaries
• Tectogrammatical annotation of verb valency frames
• Categorial grammars
Talks & Publications (language, fuzzy sets)
• ZŽ: Fuzzy Controller as a Tool for Traffic Simulation. Mendel 1999
• ZŽ: Introduction to the PDT, Faculty of Arts, Ljubljana, 2000
• ZŽ: Constrained Fuzzy Arithmetic: Engineer's View. CMP Research Rep.
• ZŽ: AFA in the PDT, seminar at the IFAL, 2000
• ZŽ: AFA in the PDT, TSD 2000
• ZŽ: Comp. Problems of CFA, CMP seminar
• S. Džeroski, ZŽ: ML approach to AFA in the PDT, 5th TELRI seminar, 2000
• M. Navara, ZŽ: Comp. Problems of CFA, ISCI 2000 ?
• S. Džeroski, ZŽ: ML approach to AFA in the PDT, ACL, 2001
• M. Navara, ZŽ: How to make CFA efficient, Soft Computing 2001
• Straňáková, Skoumalová, Panevová, ZŽ: Tectogram. annotation of verb. val. frames, TSD 2001 ?
• M. de Cock, ZŽ: Representing Ling. Hedges by L-Fuzzy Modifiers, CIMCA 2001 ?