Annotation of Grammatemes in the Prague Dependency Treebank 2.0
Magda Razímová, Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic
{razimova,zabokrtsky}@ufal.mff.cuni.cz
Outline of the talk
• Introduction
• Prague Dependency Treebank 2.0
• Annotation of grammatemes
  • Motivation
  • Grammateme attributes
  • Two-level node hierarchy
  • Examples of grammateme value assignment
• Final remarks
Introduction
• grammatemes in the PDT 2.0
  • one type of attribute of the nodes of a deep-syntactic tree
  • capture morphological meanings that are semantically indispensable
  • number for nouns, degree of comparison for adjectives, tense for verbs, etc.
• annotation of grammatemes
  • the last task in the PDT 2.0 annotation procedure
  • can be assigned automatically, drawing on the annotation already available:
    • annotation of the same sentence at the lower layers
    • already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)
Historical background and development of the PDT project
• mid-1960s – Praguian Functional Generative Description (Petr Sgall et al.)
• 1994 – Czech National Corpus
• 1995 – PDT started
• 1998 – PDT 0.5 pre-release
• 2001 – PDT 1.0 released by LDC
  • manual annotation of morphology and surface syntax
• 2006 – PDT 2.0 to be released by LDC
  • interlinked morphological, surface-syntactic and complex deep-syntactic annotation
  • including annotation of grammatemes
Layers of annotation
• tectogrammatical layer
  • deep-syntactic dependency tree
• analytical layer
  • surface-syntactic dependency tree
• morphological layer
  • m-lemma and m-tag associated with each token
• word layer
  • original text, segmented on word boundaries
lit.: He-was would went toforest.
He would have gone to the forest.
Interlinking the layers
• every unit at every layer has an ID that is unique within the PDT
• neighboring layers are connected by top-down pointers
lit.: He-was would went toforest.
He would have gone to the forest.
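The interlinking can be pictured with a small sketch. It assumes a toy in-memory store in which every unit has a unique ID and points down to units of the neighboring lower layer; the IDs, the `UNITS` table and the `surface_tokens` helper are hypothetical and do not reproduce the PDT 2.0 file format.

```python
# Hypothetical unit IDs and a toy in-memory store; the real PDT 2.0
# identifiers and data format are not reproduced here.
UNITS = {
    # w-layer: original tokens
    "w-1": {"layer": "w", "token": "byl", "down": []},
    "w-2": {"layer": "w", "token": "by", "down": []},
    "w-3": {"layer": "w", "token": "šel", "down": []},
    # m-layer: each unit points down to the w-layer
    "m-1": {"layer": "m", "down": ["w-1"]},
    "m-2": {"layer": "m", "down": ["w-2"]},
    "m-3": {"layer": "m", "down": ["w-3"]},
    # a-layer: each unit points down to the m-layer
    "a-1": {"layer": "a", "down": ["m-1"]},
    "a-2": {"layer": "a", "down": ["m-2"]},
    "a-3": {"layer": "a", "down": ["m-3"]},
    # t-layer: one t-node covers the whole complex verb form
    "t-1": {"layer": "t", "down": ["a-1", "a-2", "a-3"]},
}


def surface_tokens(unit_id):
    """Follow the top-down pointers from any unit to its w-layer tokens."""
    unit = UNITS[unit_id]
    if unit["layer"] == "w":
        return [unit["token"]]
    tokens = []
    for lower_id in unit["down"]:
        tokens.extend(surface_tokens(lower_id))
    return tokens


print(surface_tokens("t-1"))  # ['byl', 'by', 'šel']
```

Because the pointers only go downward, anything computed at the t-layer can reach the a-layer, m-layer and w-layer information of the same words, which is what the grammateme assignment relies on.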
Size of the PDT 2.0 data (i)
• 7,129 manually annotated textual documents
• all documents annotated at the m-layer
  • 16,065 sentences with 1,960,657 tokens
• 75 % of the m-layer data annotated at the a-layer
  • 5,338 documents, 87,980 sentences, 1,504,847 tokens
• 44 % of the m-layer data annotated also at the t-layer
  • 3,168 documents, 49,442 sentences, 833,357 tokens
Size of the PDT 2.0 data (ii)
• training data (80 %)
• development test data (10 %)
• evaluation test data (10 %)
M-layer
• sentence represented as a sequence of tokens
• each token lemmatized and tagged (attributes m-lemma and m-tag)
• positional m-tag: 15 characters
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer.
Some contours of the problem seem to be clearer after the resurgence by Havel's speech.
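A minimal sketch of how a positional m-tag can be unpacked. The slide names only the first five positions; the names used for positions 6 to 15 follow the PDT tagset as recalled here and should be read as an assumption, as should the helper name `decode_mtag`.

```python
# Names of the 15 tag positions. Only the first five are given on the
# slide; the rest are an assumption based on the PDT positional tagset.
POSITIONS = [
    "pos", "subpos", "gender", "number", "case",
    "possgender", "possnumber", "person", "tense", "grade",
    "negation", "voice", "reserve1", "reserve2", "var",
]


def decode_mtag(tag):
    """Map a 15-character positional tag to {position name: value}.

    Positions filled with '-' carry no value and are left out.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(POSITIONS, tag) if ch != "-"}


decoded = decode_mtag("NNIS2-----A----")
print(decoded["number"], decoded["case"])  # S 2
```

Keeping the tag positional means a single character lookup (for example position 4 for number) is enough for most of the automatic grammateme rules.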
A-layer
• rooted ordered tree with labeled nodes and edges
• a-nodes
  • each m-layer token is represented by exactly one a-node
  • labeled with a-lemmas (identical to the word forms)
• a-edges
  • represent dependency relations (Sb, Obj, Adv, Atr)
  • represent non-dependency relations (Coord)
  • the analytical function is stored as an a-node attribute
Some contours of the problem seem to be clearer after the resurgence by Havel's speech.
T-layer
• rooted ordered tree with labeled nodes and edges
• t-nodes
  • complex typed feature structures
  • represent auto-semantic words
  • function words do not have nodes of their own
  • artificially added nodes
• t-edges
  • dependency relations (functor)
  • non-dependency relations (coordination constructions)
  • the functor is stored as a t-node attribute
Some contours of the problem seem to be clearer after the resurgence by Havel's speech.
Areas of annotation at the t-layer
Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
lit.: [To] all was handed over a certificate of successful graduation from the course.
They all received a certificate of successful graduation from this course.
• tree structure
• t-lemma attribute
• dependency relation (functor and subfunctor)
• topic-focus attributes
• co-reference attributes
• node typing attributes (nodetype and sempos)
• grammateme attributes
Grammatemes: Motivation
• grammatemes
  • t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
  • semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)
Grammateme attributes
• 15 grammatemes
  • indeftype
  • numertype
  • negation
  • degcmp
  • tense
  • aspect
  • verbmod
  • deontmod
  • dispmod
  • resultative
  • iterativeness
  • number
  • gender
  • person
  • politeness
Conditioned presence/absence of grammatemes
• obviously, not all grammatemes are relevant for all nodes
  • no tense for dog, no degree of comparison for (he) waits, etc.
• how to formally declare the presence/absence of a given grammateme attribute in a given node?
  • the need for node typing
• chosen solution: two-level typing
  • 1st level: 8 more general types of nodes
    • grammatemes are relevant for only one of them
  • 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech
Presence/absence of grammateme values: Two-level t-node hierarchy
• 1st level: attribute nodetype
• 2nd level: attribute sempos
First level of the hierarchy: attribute nodetype
• 8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
• fully automatic annotation, using
  • the tree structure → root
  • t-attributes
    • t-lemma → qcomplex | list
    • functor → atom | coap | dphr | fphr
  • else → complex
Levnější benzín na Východě, dražší na Západě
Cheaper gasoline in the East, more expensive one in the West
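The decision order on this slide can be sketched as a small cascade. The t-lemma and functor sets below are short illustrative subsets chosen for the example, not the complete PDT 2.0 lists, and the precedence of the tests (tree structure, then t-lemma, then functor) is an assumption read off the slide layout; `assign_nodetype` is a hypothetical name.

```python
# Illustrative, non-exhaustive value sets (assumptions for this sketch).
QCOMPLEX_LEMMAS = {"#Gen", "#Cor", "#Unsp"}
LIST_LEMMAS = {"#Idph", "#Forn"}
ATOM_FUNCTORS = {"RHEM", "PREC", "MOD"}
COAP_FUNCTORS = {"CONJ", "DISJ", "APPS"}


def assign_nodetype(is_root, t_lemma, functor):
    """Return one of the eight nodetype values for a t-node."""
    if is_root:                        # decided by the tree structure
        return "root"
    if t_lemma in QCOMPLEX_LEMMAS:     # decided by the t-lemma
        return "qcomplex"
    if t_lemma in LIST_LEMMAS:
        return "list"
    if functor in ATOM_FUNCTORS:       # decided by the functor
        return "atom"
    if functor in COAP_FUNCTORS:
        return "coap"
    if functor == "DPHR":
        return "dphr"
    if functor == "FPHR":
        return "fphr"
    return "complex"                   # everything else


print(assign_nodetype(False, "benzín", "ACT"))  # complex
```

Only nodes that fall through to `complex` go on to receive a sempos value and grammatemes.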
Second level of the hierarchy: attribute sempos
• only complex nodes are grouped into semantic parts of speech
• 19 values of the attribute sempos:
  • n. ... | adj. ... | adv. ... | v. ...
• fully automatic annotation, using
  • m-tag
  • t-lemma
  • other t-attributes
• the sempos value delimits the set of relevant grammatemes
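How the two levels together license grammatemes can be sketched as a lookup plus a consistency check. The table is an illustrative excerpt: the slides confirm only number for n.denot and verbmod for v; the other entries, the node layout and the name `unexpected_grammatemes` are assumptions made for the example.

```python
# Illustrative excerpt only. The slides confirm number for n.denot and
# verbmod for v; the remaining entries are assumptions for the sketch.
RELEVANT_GRAMMATEMES = {
    "n.denot": {"number", "gender"},
    "adj.denot": {"degcmp", "negation"},
    "v": {"tense", "aspect", "verbmod", "deontmod",
          "dispmod", "resultative", "iterativeness"},
}


def unexpected_grammatemes(node):
    """Return the grammatemes present on a node that its type does not license."""
    if node["nodetype"] != "complex":
        allowed = set()  # only complex nodes carry grammatemes
    else:
        allowed = RELEVANT_GRAMMATEMES.get(node["sempos"], set())
    return sorted(set(node["gram"]) - allowed)


noun = {"nodetype": "complex", "sempos": "n.denot",
        "gram": {"number": "sg", "tense": "ant"}}  # tense does not belong here
print(unexpected_grammatemes(noun))  # ['tense']
```

A check of this kind is one way to use the typing: the presence or absence of each grammateme attribute becomes a property that can be verified mechanically.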
Values of nodetype and sempos in the PDT 2.0 – an overview
[Figure: overview of the nodetype values and the sempos values]
Grammateme value assignment
• n-tred environment for processing the PDT data (http://ufal.mff.cuni.cz/~pajas)
• automatic annotation
  • 2000 lines of Perl code
  • crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
  • rules written in a special economical notation
    • 2000 lines written in a text file
  • lexical resources
    • special-purpose lists of adverbs / verbs
• manual annotation of special problems
  • two annotators working in parallel
  • simplified annotation environment: treebank positions extracted into simple HTML forms
Simple HTML-based environment for manual annotation
lit.: The difference [you] would have to pay yourself.
Automatic vs. manual assignment
• at the t-layer of the PDT 2.0:
  • 1,594,333 grammateme values assigned at 550,947 complex nodes
• manually assigned:
  • 17,520 grammateme values
  • inter-annotator agreement: 70–85 %
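To put these counts in proportion, a short calculation using only the three figures from the slide; it assumes that the manually assigned values are counted within the total.

```python
total_values = 1_594_333   # grammateme values at the t-layer
complex_nodes = 550_947    # complex nodes carrying them
manual_values = 17_520     # values assigned by annotators

manual_share = manual_values / total_values
values_per_node = total_values / complex_nodes

print(f"manually assigned: {manual_share:.1%}")           # 1.1%
print(f"values per complex node: {values_per_node:.2f}")  # 2.89
```

Under that assumption, roughly 99 % of the grammateme values come from the automatic procedure, and a complex node carries a little under three grammatemes on average.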
Grammateme assignment and m-tag
n.denot number=sg
• number grammateme: values sg | pl
• assigned automatically using the m-tag
  • e.g. les (forest)
  • m-layer: tag NNIS2-----A----; t-layer: number=sg
• manual assignment
  • nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
  • e.g. dveře (door/doors)
  • m-layer: always plural
  • t-layer: annotator's decision sg | pl
lit.: He-was would went toforest.
He would have gone to the forest.
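The rule above can be sketched as follows. The sketch assumes that the fourth tag position holds S for singular and P for plural, uses a one-item stand-in for the dictionary-derived list of plural-only nouns, and returns `None` when the decision is left to an annotator; the function name and the tag given for dveře are hypothetical.

```python
# One-item stand-in for the list extracted from the dictionary (assumption).
PLURAL_ONLY_NOUNS = {"dveře"}

# Assumed coding of the fourth tag position.
NUMBER_CODES = {"S": "sg", "P": "pl"}


def number_grammateme(m_lemma, m_tag):
    """Return 'sg' or 'pl', or None when the value must be decided manually."""
    if m_lemma in PLURAL_ONLY_NOUNS:
        return None  # the form is always plural, so the m-tag is uninformative
    return NUMBER_CODES.get(m_tag[3])  # None for any other code


print(number_grammateme("les", "NNIS2-----A----"))    # sg
print(number_grammateme("dveře", "NNFP1-----A----"))  # None (hypothetical tag)
```

Returning `None` rather than guessing keeps the automatic and manual parts of the procedure cleanly separated.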
Grammateme assignment and tree structure
v verbmod=cdn
• mood grammateme verbmod: values ind | imp | cdn
• assigned automatically
  • one-word verbal forms
    • e.g. jde (goes)
    • m-tag information
  • verbal forms consisting of several word forms (represented by a single node at the t-layer)
    • e.g. byl by šel (would have gone)
    • the corresponding a-layer subtree contains the node by
    • m-tag of the node by
lit.: He-was would went toforest.
He would have gone to the forest.
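A simplified sketch of this rule. It assumes that the conditional auxiliary by carries the detailed POS c (tag prefix Vc) and that imperative forms carry the detailed POS i (prefix Vi), as recalled from the PDT tagset; the tags in the example are illustrative, and only their first two positions matter here. The function name `verbmod` is a hypothetical helper.

```python
def verbmod(a_nodes):
    """Derive the verbmod value for one verbal t-node.

    a_nodes: (word form, m-tag) pairs of all a-nodes that the t-node
    represents, i.e. the verb together with its auxiliaries.
    """
    prefixes = {tag[:2] for _, tag in a_nodes}
    if "Vc" in prefixes:   # conditional auxiliary present in the subtree
        return "cdn"
    if "Vi" in prefixes:   # imperative form
        return "imp"
    return "ind"


one_word = [("jde", "VB-S---3P-AA---")]
multi_word = [("byl", "VpYS---XR-AA---"),
              ("by", "Vc-X---3-------"),
              ("šel", "VpYS---XR-AA---")]
print(verbmod(one_word), verbmod(multi_word))  # ind cdn
```

The multi-word case shows why the inter-layer links matter: the t-node alone has no auxiliary, so the rule has to look at the a-nodes it points to and at their m-tags.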
Grammateme assignment and co-reference
Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
lit.: From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America.
From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America.
• the grammatemes gender, number and person of relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and can thus be “inherited” from the antecedent)
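A minimal sketch of how an inher value could be resolved by a consumer of the data, assuming each node stores its grammatemes and an optional co-reference pointer to its antecedent. The node IDs, the dictionary layout and the `resolve` helper are hypothetical; the values for mléko (singular, neuter) follow from the example sentence.

```python
# Hypothetical node IDs and layout, covering two nodes of the example sentence.
NODES = {
    "mleko": {"t_lemma": "mléko",
              "gram": {"number": "sg", "gender": "neut"},
              "coref": None},
    "ktery": {"t_lemma": "který",
              "gram": {"number": "inher", "gender": "inher", "person": "inher"},
              "coref": "mleko"},
}


def resolve(node_id, grammateme):
    """Follow co-reference links until the grammateme has a concrete value.

    Returns None if the chain ends (or loops) without a concrete value.
    """
    seen = set()
    while node_id is not None and node_id not in seen:
        seen.add(node_id)
        node = NODES[node_id]
        value = node["gram"].get(grammateme)
        if value != "inher":
            return value
        node_id = node["coref"]
    return None


print(resolve("ktery", "number"), resolve("ktery", "gender"))  # sg neut
```

Storing inher instead of copying the antecedent's values keeps the annotation free of redundancy while the concrete value remains recoverable through the co-reference link.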
Final remarks
• achievements:
  • two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
  • automatic procedure for assigning the node classification and the grammateme attributes
  • verification of the procedure on large-scale data
• experience:
  • it was the existence of the lower annotation layers and of the inter-layer links that made it possible to make grammateme assignment more or less automatic
http://ufal.mff.cuni.cz/pdt2.0/