Prague Dependency Treebank (PDT) - Complex Annotation of Czech National Corpus

Zdeněk Žabokrtský • e-mail: zabokrtz@cs.felk.cvut.cz • Czech Technical University, Department of Computer Science • this presentation can be downloaded from http://obelix.ijs.si/ZdenekZabokrtsky/PDT/

The Prague Dependency Treebank (PDT) • a long-term project aimed at complex annotation of a part of the Czech National Corpus with a rich annotation scheme • Institute of Formal and Applied Linguistics • established in 1990 at the Faculty of Mathematics and Physics, Charles University, Prague • Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, … • http://ufal.ms.mff.cuni.cz

The Prague Dependency Treebank • inspiration: • the Penn Treebank (the most widely used syntactically annotated corpus of English) • motivation: • the treebank can be used for further linguistic research • training on annotated corpora (supervised training) gives more accurate results than training on texts in their raw form (unsupervised training)

Source of the text data • provided by the Institute of the Czech National Corpus (ICNC) • text sample for PDT • 456 705 tokens (words and punctuation marks) in 26 610 sentences, divided into 576 files, 50 sentences per file • 40 % - general newspaper articles (Lidové noviny, Mladá Fronta) • 20 % - economic news and analysis (Českomoravský profit) • 20 % - popular science magazine (Vesmír) • 20 % - information technology texts • divided into • a training set (19 126 sentences) • a development test set (3 697) • a cross-evaluation test data set (3 787)

Institute of the Czech National Corpus • founded in 1994 at the Faculty of Philosophy, Charles University • head of the institute: prof. František Čermák • 100 million words • freely accessible: http://ucnk.ff.cuni.cz • query language CQP (Corpus Query Processor, developed at the University of Stuttgart) • regular expressions • examples of queries: disku[s|z]e .+nést

CNC: query example • query: .+nosit • response:
tačí se trochu vybavit , <nanosit> kupu listí a sena - já ho
ie Každý mistr by se měl <honosit> nějakým rekordem či jedin
anční tísni by měly dítě <donosit> . Bezvýhradná povinnost p
í hladovění bude schopna <donosit> plod . Mimochodem i u sou
evítané těhotenství tzv. <donosit> a dítěte se vzdát ve pros
mž sedíme , nepostavil . <Vynosit> tuny kamení na zádech , t
byl v nebezpečí a naděje <donosit> dítě žádná . Jeden večer
6 - Živit mateř . mlékem <Nanosit> 57 - Ukončit létání 58 -
odstatně větší a může se <honosit> řadou úctyhodných přívlas
vy , v pokoji nekouřit , <nenosit> domů alkohol . Dodržovat
ve městě , které se mělo <honosit> jen svým " dělnickým hnut
ve městě , které se mělo <honosit> jen svým " dělnickým hnut
. . .
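The two example queries can be tried out with ordinary regular expressions. The sketch below uses Python's `re` module, not CQP itself, and assumes that a query must match a whole token; the function name and the token list are made up for illustration, with the matching word forms taken from the response above.

```python
import re

def match_tokens(pattern, tokens):
    """Return the tokens that the pattern matches in full (assumed CQP-like behaviour)."""
    rx = re.compile(pattern)
    return [t for t in tokens if rx.fullmatch(t)]

# Illustrative token list; the first five forms occur in the response above.
TOKENS = ["nanosit", "honosit", "donosit", "Vynosit", "nenosit",
          "nosit", "diskuse", "diskuze", "disk"]

# ".+" needs at least one character, so the bare verb "nosit" is not matched.
prefixed = match_tokens(r".+nosit", TOKENS)

# Inside square brackets "|" is a literal character in Python's re,
# so [s|z] accepts "s", "z" and also "|"; [sz] would be the stricter class.
spelling_variants = match_tokens(r"disku[s|z]e", TOKENS)

print(prefixed)
print(spelling_variants)
```

The first query therefore returns only prefixed forms of the verb, which is what the concordance lines show.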

Layered structure of PDT • input: raw text • morphological level • full morphological tagging (word forms, lemmas, morphological tags) • output: morphologically tagged text • analytical level • surface syntax • syntactic annotation using dependency syntax (captures analytical functions such as Subject, Object, ...) • output: analytic tree structures (ATS) • tectogrammatical level • level of linguistic meaning (tectogrammatical functions such as Actor, Patient, ...) • output: tectogrammatical tree structures (TGTS)

The Morphological Level • a tag and a lemma are assigned to each word form from the input text • 3030 tags (Czech is an inflectionally rich language) • 6 tag variables: number, case, gender, degree of comparison, person, negation • example: • VPS3A - verb (indicative, present tense, sing., 3rd person, affirmative)

Morphological Analysis • an automatic process: • input: word form • output: a set of possible lemmas, each lemma accompanied by a set of possible tags • currently covers 720000 Czech lemmas, based on 210000 stems • can recognize 20 million word forms • output ambiguity: • there may be 5 different lemmas for a given word form • 27 different tags for a given lemma • example: učení - NNS1A, NNS2A, NNS3A,...,NNP5A

The whole process of morphological tagging • input: raw text • automatic morphological analysis • manual disambiguation • 2 annotators • in the full text context • special software tool • automatic comparison • manual correction • output: unambiguously tagged text
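The automatic comparison step can be pictured as a token-by-token diff of the two annotators' choices. The following is only a minimal sketch: the function name and the annotators' choices are hypothetical, and the real comparison tool is not described in this presentation. The lemmas and tags are borrowed from the tagged-sentence example in this deck.

```python
def find_disagreements(first, second):
    """Return (position, choice_a, choice_b) for every token tagged differently."""
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    return [(i, a, b) for i, (a, b) in enumerate(zip(first, second)) if a != b]

# Hypothetical (lemma, tag) choices of two annotators for "takovou publicitu".
ANNOTATOR_A = [("takový", "AFS41A"), ("publicita", "NFS4A")]
ANNOTATOR_B = [("takový", "AFS71A"), ("publicita", "NFS4A")]

# Only the positions listed here would need manual correction.
for position, choice_a, choice_b in find_disagreements(ANNOTATOR_A, ANNOTATOR_B):
    print(position, choice_a, choice_b)
```

Tokens on which both annotators agree pass through unchanged, so manual correction is limited to the reported positions.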

Data Format • Standard Generalized Markup Language (SGML) • a sample of the DTD (Document Type Definition) related to the morphological level:
<!ELEMENT MMl - O (#PCDATA & R? & E? & e? & T* & MMt*) -- lemma (base form), description see the l tag; machine assigned (by a morphological analysis program), NOT disambiguated -->
<!ELEMENT MDl - O (#PCDATA & R? & E? & e? & T* & MDt*) -- lemma (base form), description see the l tag; machine assigned (by a tagger), disambiguated if more than 1: n-best -->
. . .
<!ELEMENT MMt - O (#PCDATA) -- morphological tag(s) as assigned by morphology, NOT disambiguated -->
<!ELEMENT MDt - O (#PCDATA) -- morphological tag(s) as assigned by machine, disambiguated, possibly also with weight/prob; if more than 1: n-best -->

Example of tagged sentence • Ty mají pak někdy takovou publicitu, že to dotyčnou kancelář prakticky zlikviduje.
<s id=cmpr9415:025-p19s2/bcc14zua.fs/#18>
<f cap>Ty<MMl>ty<MMt>PP2S1<MMt>PP2S5<MMl>ten<MMt>... ... PDFP1<MMt>PDFP4<MMt>PDIP1<MMt>PDIP4<MMt>PDMP4<A>Sb<r>1<g>2
<f>mají<MMl>mít<MMt>VPP3A<A>Pred<r>2<g>0
<f>pak<MMl>pak<MMt>DB<A>Adv<r>3<g>2
<f>někdy<MMl>někdy<MMt>DB<A>Adv<r>4<g>2
<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6
<f>publicitu<MMl>publicita<MMt>NFS4A<A>Obj<r>6<g>2
<D>
<d>,<MMl>,<MMt>ZIP<A>AuxX<r>7<g>8
<f>že<MMl>že<MMt>JS<A>AuxC<r>8<g>6
<f>to<MMl>ten<MMt>PDNS1<MMt>PDNS4<A>Sb<r>9<g>13
<f>dotyčnou<MMl>dotyčný<MMt>AFS41A<MMt>AFS71A<A>Atr<r>10<g>11
<f>kancelář<MMl>kancelář<MMt>NFS1A<MMt>NFS4A<A>Obj<r>11<g>13
<f>prakticky<MMl>prakticky_^(*1ý)<MMt>DG1A<A>Adv<r>12<g>13
<f>zlikviduje<MMl>zlikvidovat_:W<MMt>VPS3A<A>Obj<r>13<g>8
<D>
<d>.<MMl>.<MMt>ZIP<A>AuxK<r>14<g>0
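A record like the ones above can be split into its fields with a single regular expression. This is a sketch, not the project's own tooling: the function name and the dictionary keys are invented here, and the reading of `<A>` as the analytical function, `<r>` as the word's position and `<g>` as the position of its governing node is an assumption inferred from the example values.

```python
import re

# One "<name ...>text" pair; attributes such as "cap" in "<f cap>" are skipped.
PAIR = re.compile(r"<(\w+)[^>]*>([^<]*)")

def parse_token(record):
    """Parse the markup of a single token into a dictionary (hypothetical layout)."""
    token = {"form": None, "lemmas": {}, "afun": None, "ord": None, "gov": None}
    current_tags = None
    for name, text in PAIR.findall(record):
        text = text.strip()
        if name in ("f", "d"):          # word form or punctuation mark
            token["form"] = text
        elif name == "MMl":             # a candidate lemma opens a new tag list
            current_tags = token["lemmas"].setdefault(text, [])
        elif name == "MMt" and current_tags is not None:
            current_tags.append(text)   # candidate tag of the latest lemma
        elif name == "A":
            token["afun"] = text
        elif name == "r":
            token["ord"] = int(text)
        elif name == "g":
            token["gov"] = int(text)
    return token

RECORD = "<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6"
parsed = parse_token(RECORD)
print(parsed)
```

The `lemmas` dictionary keeps every candidate tag, so the remaining morphological ambiguity of a word form (two tags for "takovou") stays visible after parsing.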

The Analytical Level • a dependency structure was chosen to represent the syntactic relations within the sentence • output of the analytical level: analytical tree structure (ATS) • directed acyclic graph with a single entry node (root) • every word form and punctuation mark is represented as a node • the nodes are annotated by attribute-value pairs • new attribute: analytical function • determines the relation between the dependent node and its governing node • values: Sb, Obj, Adv, Atr, ...
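The tree property can be checked directly on the position and governor values of the tagged sentence shown earlier. The sketch below assumes that `<r>` gives a node's position, `<g>` the position of its governing node, and that 0 stands for a technical root node; the function names are hypothetical.

```python
from collections import defaultdict

# (form, position, governor) copied from the <r> and <g> values of the example sentence.
NODES = [("Ty", 1, 2), ("mají", 2, 0), ("pak", 3, 2), ("někdy", 4, 2),
         ("takovou", 5, 6), ("publicitu", 6, 2), (",", 7, 8), ("že", 8, 6),
         ("to", 9, 13), ("dotyčnou", 10, 11), ("kancelář", 11, 13),
         ("prakticky", 12, 13), ("zlikviduje", 13, 8), (".", 14, 0)]

def children_of(nodes):
    """Map each governor position to the positions of its dependents."""
    children = defaultdict(list)
    for _, position, governor in nodes:
        children[governor].append(position)
    return children

def is_tree(nodes, root=0):
    """True if every node reaches the root by following governors, without a cycle."""
    governor_of = {position: governor for _, position, governor in nodes}
    for start in governor_of:
        seen, node = set(), start
        while node != root:
            if node in seen or node not in governor_of:
                return False      # cycle, or a governor that does not exist
            seen.add(node)
            node = governor_of[node]
    return True

tree = children_of(NODES)
print(is_tree(NODES), tree[0], tree[13])
```

Because each node stores exactly one governor, the single-entry-node and acyclicity requirements reduce to the check that every chain of governors ends at the root.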

Example of ATS • V návrzích na případné změny vycházejí ze svých většinou několikaletých podnikatelských zkušeností.

Selected attributes of ATS’s nodes

Selected values of the analytical function

Example of tagged sentence • ...ve sledovaném období žádný okres nezlepšil svoji pozici...
<f>ve<MMl>v<MMt>RV4<MMt>RV6<A>AuxP<r>4<g>9
<f>sledovaném<MMl>sledovaný_^(*2t)<MMt>AIS61A<MMt>AMS61A<MMt>ANS61A<A>Atr<r>5<g>6
<f>období<MMl>období<MMt>NNP1A<MMt>NNP2A<MMt>NNP4A<MMt>NNP5A<MMt>NNS1A<MMt>NNS2A<MMt>NNS3A<MMt>NNS4A<MMt>NNS5A<MMt>NNS6A<A>Adv<r>6<g>4
<f>žádný<MMl>žádný<MMt>PNFIS4<MMt>PNFYS1<MMt>PNFYS5<A>Atr<r>7<g>8
<f>okres<MMl>okres<MMt>NIS1A<MMt>NIS4A<A>Sb<r>8<g>9
<f>nezlepšil<MMl>zlepšit_:W<MMt>VRYSN<A>Pred_Co<r>9<g>11
<f>pozici<MMl>pozice<MMt>NFS3A<MMt>NFS4A<MMt>NFS6A<A>Obj<r>10<g>9

The Tectogrammatical Level • based on the framework of the Functional Generative Description as developed by Petr Sgall • in comparison to the ATSs, the tectogrammatical tree structures (TGTSs) have the following characteristics: • only autosemantic words have a node of their own; function words (conjunctions, prepositions) are attached as indices to the autosemantic words to which they belong • nodes are added in case of clearly specified deletions on the surface level • analytical functions are replaced by tectogrammatical functions (functors), such as Actor, Patient, Addressee, ...

Example of TGTS • Podle předběžných odhadů se totiž počítá, že do soukromého vlastnictví bude prodáno minimálně 10000 bytů

Selected attributes of a TGTS node

Functors • tectogrammatical counterparts of analytical functions • about 40 functors in 2 groups: • actants • Actor, Patient, Addressee, Origin, Effect • free modifiers • LOC, DIR1, RSTR, TWHEN, TTIL, ... • provide more detailed information about the relation to the governing node than the analytical function

Example of ATS ... • Kdo chce investovat dvě stě tisíc korun do nového automobilu, nelekne se, že benzín byl změnou zákona trochu zdražen.

... and the corresponding TGTS

Tectogrammatical tagging • 2 parallel streams starting from the ATS treebank: • a larger set of partially tagged TGTSs (only changes of tree structure, functor and TFA assignment) • a smaller set of fully tagged TGTSs

Problems of automatic functor assignment • za roh - DIR3 • za hodinu - TWHEN • za svobodu - OBJ • po otci • TWHEN (Přišel po otci.) • NORM (Jmenuje se po otci.) • HER (Zdědil dům po otci.) • . . .
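The examples above show why a simple lookup cannot assign functors. The sketch below groups them first by preposition and then by the whole prepositional phrase; the data come from this slide, while the function name and the data layout are illustrative assumptions.

```python
from collections import defaultdict

# (preposition, noun form, functor) triples taken from the slide.
EXAMPLES = [("za", "roh", "DIR3"), ("za", "hodinu", "TWHEN"), ("za", "svobodu", "OBJ"),
            ("po", "otci", "TWHEN"), ("po", "otci", "NORM"), ("po", "otci", "HER")]

def candidate_functors(examples, key):
    """Collect the sorted functors observed for each key built from (preposition, noun)."""
    grouped = defaultdict(set)
    for preposition, noun, functor in examples:
        grouped[key(preposition, noun)].add(functor)
    return {k: sorted(v) for k, v in grouped.items()}

by_preposition = candidate_functors(EXAMPLES, lambda prep, noun: prep)
by_phrase = candidate_functors(EXAMPLES, lambda prep, noun: (prep, noun))

print(by_preposition["za"])          # three candidates for one preposition
print(by_phrase[("po", "otci")])     # still three candidates for the full phrase
```

For "za" the noun is enough to pick the functor, but "po otci" stays three-way ambiguous, so in those examples the choice depends on the governing verb (přišel, jmenuje se, zdědil).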

Summary • the current state of the art: • there are several manually annotated files of TGTSs • methods for automatic transformation from ATS into TGTS form are in development • timeline (figure): Czech National Corpus, morphologically tagged corpus, ATS treebank, TGTS treebank; dates shown: March 2000, November 1996, September 1994
