Zdeněk Žabokrtský e-mail: zabokrtz@cs.felk.cvut.cz

Czech Technical University, Department of Computer Science • the following presentation can be downloaded from http://obelix.ijs.si/ZdenekZabokrtsky/PDT/

The Prague Dependency Treebank (PDT) • long-term project aimed at a complex annotation of a part of the Czech National Corpus with a rich annotation scheme • Institute of Formal and Applied Linguistics • established in 1990 at the Faculty of Mathematics and Physics, Charles University, Prague • Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, … • http://ufal.ms.mff.cuni.cz

The Prague Dependency Treebank • inspiration: • the Penn Treebank (the most widely used syntactically annotated corpus of English) • motivation: • the treebank can obviously be used for further linguistic research • more accurate results can be obtained when using annotated corpora than when using texts in their raw form (unsupervised training)

Source of the text data • provided by the Institute of the Czech National Corpus (ICNC) • text sample for PDT • 456 705 tokens (words and punctuation marks) in 26 610 sentences, divided into 576 files, 50 sentences per file • 40 % - general newspaper articles (Lidové noviny, Mladá Fronta) • 20 % - economic news and analysis (Českomoravský profit) • 20 % - popular science magazine (Vesmír) • 20 % - information technology texts • divided into • a training set (19 126 sentences) • a development test set (3 697 sentences) • a cross-evaluation test data set (3 787 sentences)
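The three subsets add up to the full sample, which is easy to check. The short sketch below recomputes the total and the share of each subset from the sentence counts given on this slide; the variable and function names are illustrative only.

```python
# Sentence counts as stated on the slide.
SPLIT = {
    "training": 19126,
    "development test": 3697,
    "cross-evaluation test": 3787,
}
TOTAL_SENTENCES = 26610


def split_shares(split):
    """Return the fraction of all sentences that falls into each subset."""
    total = sum(split.values())
    return {name: count / total for name, count in split.items()}


shares = split_shares(SPLIT)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The training set thus covers a little under three quarters of the sample, with the remainder split almost evenly between the two test sets.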

Institute of the Czech National Corpus • founded in 1994 at the Faculty of Philosophy, Charles University • head of the institute: prof. František Čermák • 100 million words • freely accessible: http://ucnk.ff.cuni.cz • query language CQP (Corpus Query Processor, developed at the University of Stuttgart) • regular expressions • examples of queries: disku[s|z]e .+nést

CNC: query example • query: .+nosit • response:
tačí se trochu vybavit , <nanosit> kupu listí a sena - já ho
ie Každý mistr by se měl <honosit> nějakým rekordem či jedin
anční tísni by měly dítě <donosit> . Bezvýhradná povinnost p
í hladovění bude schopna <donosit> plod . Mimochodem i u sou
evítané těhotenství tzv. <donosit> a dítěte se vzdát ve pros
mž sedíme , nepostavil . <Vynosit> tuny kamení na zádech , t
byl v nebezpečí a naděje <donosit> dítě žádná . Jeden večer
6 - Živit mateř . mlékem <Nanosit> 57 - Ukončit létání 58 -
odstatně větší a může se <honosit> řadou úctyhodných přívlas
vy , v pokoji nekouřit , <nenosit> domů alkohol . Dodržovat
ve městě , které se mělo <honosit> jen svým " dělnickým hnut
ve městě , které se mělo <honosit> jen svým " dělnickým hnut
. . .
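The behaviour of such queries can be approximated outside the corpus with ordinary regular expressions. The sketch below uses Python's `re` module rather than CQP itself, and assumes that a pattern has to match a whole token. The word list is a small hand-made sample, not corpus output.

```python
import re


def matching_words(pattern, words):
    """Return the words that the pattern matches as a whole token."""
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word)]


# Hand-made sample of tokens (an assumption, not real corpus data).
words = ["nanosit", "honosit", "donosit", "Vynosit", "nenosit", "nosit",
         "diskuse", "diskuze", "diskuje", "přinést", "nést"]

print(matching_words(r".+nosit", words))
print(matching_words(r"disku[sz]e", words))
print(matching_words(r".+nést", words))
```

Because `.+` needs at least one character, the bare verbs `nosit` and `nést` are not returned, only their prefixed forms. Note that in Python the character class `[s|z]` from the slide would also accept a literal `|`; `[sz]` expresses the intended choice between the two spellings.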

Layered structure of PDT • input: raw text • morphological level (output: morphologically tagged text) • full morphological tagging (word forms, lemmas, morphological tags) • analytical level (output: analytic tree structures, ATS) • surface syntax • syntactic annotation using dependency syntax (captures analytical functions such as Subject, Object, ...) • tectogrammatical level (output: tectogrammatical tree structures, TGTS) • level of linguistic meaning (tectogrammatical functions such as Actor, Patient, ...)

The Morphological Level • a tag and a lemma are assigned to each word form from the input text • 3030 tags (Czech is an inflectionally rich language) • 6 tag variables: number, case, gender, degree of comparison, person, negation • example: • VPS3A - verb (indicative, present tense, sing., 3rd person, affirmative)

Morphological Analysis • an automatic process: • input: word form • output: a set of possible lemmas, each lemma accompanied by a set of possible tags • currently covers 720 000 Czech lemmas, based on 210 000 stems • can recognize 20 million word forms • output ambiguity: • a word form may have up to 5 different lemmas • a lemma may have up to 27 different tags • example: učení - NNS1A, NNS2A, NNS3A, ..., NNP5A
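The input/output contract of the analysis can be pictured as a lookup from a word form to a set of lemmas, each with its own set of tags. The following is a minimal sketch of that data shape only, not the actual analyzer: the tiny lexicon is hypothetical and filled with a few form/lemma/tag combinations that appear in the tagged example sentences of this presentation.

```python
# Hypothetical toy lexicon: word form -> {lemma: [possible tags]}.
# The real analyzer covers hundreds of thousands of lemmas.
LEXICON = {
    "pak": {"pak": ["DB"]},
    "takovou": {"takový": ["AFS41A", "AFS71A"]},
    "kancelář": {"kancelář": ["NFS1A", "NFS4A"]},
    "období": {"období": ["NNP1A", "NNP2A", "NNP4A", "NNP5A", "NNS1A",
                          "NNS2A", "NNS3A", "NNS4A", "NNS5A", "NNS6A"]},
}


def analyze(form):
    """Return all lemma -> tags readings known for a word form."""
    return LEXICON.get(form, {})


def ambiguity(form):
    """Return (number of lemmas, total number of lemma/tag readings)."""
    readings = analyze(form)
    return len(readings), sum(len(tags) for tags in readings.values())


for form in ["pak", "takovou", "období"]:
    print(form, ambiguity(form))
```

Every reading counted here is one that manual disambiguation later has to accept or reject in context.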

The whole process of morphological tagging • input: raw text • automatic morphological analysis • manual disambiguation • 2 annotators • in the full text context • special software tool • automatic comparison • manual correction • output: unambiguously tagged text
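The automatic comparison step can be sketched as a token-by-token diff of the two annotators' choices, so that only the disagreements go on to manual correction. This is an illustrative sketch, not the project's tool; the function name and the two annotators' choices below are made up for the example.

```python
def find_disagreements(tokens, tags_a, tags_b):
    """Return (position, token, tag_a, tag_b) for every token where the
    two annotators chose different tags."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("all three sequences must have the same length")
    return [
        (position, token, tag_a, tag_b)
        for position, (token, tag_a, tag_b)
        in enumerate(zip(tokens, tags_a, tags_b))
        if tag_a != tag_b
    ]


# Hypothetical annotator output for two tokens.
tokens = ["takovou", "publicitu"]
annotator_a = ["AFS41A", "NFS4A"]
annotator_b = ["AFS71A", "NFS4A"]

for position, token, tag_a, tag_b in find_disagreements(tokens, annotator_a, annotator_b):
    print(f"token {position} ({token}): {tag_a} vs. {tag_b}")
```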

Data Format • Standard Generalized Markup Language (SGML) • a sample of the DTD (Document Type Definition) related to the morphological level:
<!ELEMENT MMl - O (#PCDATA & R? & E? & e? & T* & MMt*)
  -- lemma (base form), description see the l tag; machine assigned (by a morphological analysis program), NOT disambiguated -->
<!ELEMENT MDl - O (#PCDATA & R? & E? & e? & T* & MDt*)
  -- lemma (base form), description see the l tag; machine assigned (by a tagger), disambiguated if more than 1: n-best -->
. . .
<!ELEMENT MMt - O (#PCDATA)
  -- morphological tag(s) as assigned by morphology, NOT disambiguated -->
<!ELEMENT MDt - O (#PCDATA)
  -- morphological tag(s) as assigned by machine, disambiguated, possibly also with weight/prob; if more than 1: n-best -->

Example of tagged sentence • Ty mají pak někdy takovou publicitu, že to dotyčnou kancelář prakticky zlikviduje. (English: "These then sometimes get such publicity that it practically destroys the office concerned.")
<s id=cmpr9415:025-p19s2/bcc14zua.fs/#18>
<f cap>Ty<MMl>ty<MMt>PP2S1<MMt>PP2S5<MMl>ten<MMt>... ... PDFP1<MMt>PDFP4<MMt>PDIP1<MMt>PDIP4<MMt>PDMP4<A>Sb<r>1<g>2
<f>mají<MMl>mít<MMt>VPP3A<A>Pred<r>2<g>0
<f>pak<MMl>pak<MMt>DB<A>Adv<r>3<g>2
<f>někdy<MMl>někdy<MMt>DB<A>Adv<r>4<g>2
<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6
<f>publicitu<MMl>publicita<MMt>NFS4A<A>Obj<r>6<g>2
<D>
<d>,<MMl>,<MMt>ZIP<A>AuxX<r>7<g>8
<f>že<MMl>že<MMt>JS<A>AuxC<r>8<g>6
<f>to<MMl>ten<MMt>PDNS1<MMt>PDNS4<A>Sb<r>9<g>13
<f>dotyčnou<MMl>dotyčný<MMt>AFS41A<MMt>AFS71A<A>Atr<r>10<g>11
<f>kancelář<MMl>kancelář<MMt>NFS1A<MMt>NFS4A<A>Obj<r>11<g>13
<f>prakticky<MMl>prakticky_^(*1ý)<MMt>DG1A<A>Adv<r>12<g>13
<f>zlikviduje<MMl>zlikvidovat_:W<MMt>VPS3A<A>Obj<r>13<g>8
<D>
<d>.<MMl>.<MMt>ZIP<A>AuxK<r>14<g>0
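A token record in this format can be split into its parts with two regular expressions. The sketch below is an illustrative reader, not the project's software. It assumes, from the sample alone, that `<f>`/`<d>` open a word or punctuation token, that each `<MMl>` starts a new lemma whose `<MMt>` tags follow it, that `<A>` holds the analytical function, and that `<r>` and `<g>` hold the token's position and the position of its governing node.

```python
import re

# Opening <f> or <d> element (optionally with attributes such as "cap")
# followed by the word form itself.
TOKEN_RE = re.compile(r"<(?:f|d)(?:\s[^>]*)?>([^<]+)")
# The fields that follow the word form.
FIELD_RE = re.compile(r"<(MMl|MMt|A|r|g)>([^<]*)")


def parse_token(line):
    """Parse one token record; return None for lines that are not tokens."""
    line = line.strip()
    head = TOKEN_RE.match(line)
    if head is None:
        return None
    token = {"form": head.group(1).strip(), "analyses": [],
             "afun": None, "ord": None, "gov": None}
    for name, value in FIELD_RE.findall(line[head.end():]):
        value = value.strip()
        if name == "MMl":
            token["analyses"].append((value, []))   # new lemma reading
        elif name == "MMt":
            if not token["analyses"]:                # tag without a lemma
                token["analyses"].append((None, []))
            token["analyses"][-1][1].append(value)
        elif name == "A":
            token["afun"] = value
        elif name == "r":
            token["ord"] = int(value)
        elif name == "g":
            token["gov"] = int(value)
    return token


record = "<f>takovou<MMl>takový<MMt>AFS41A<MMt>AFS71A<A>Atr<r>5<g>6"
print(parse_token(record))
```

Lines such as `<D>` or the `<s id=...>` sentence header are not token records, so the function returns `None` for them and a caller can simply skip them.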

The Analytical Level • the dependency structure was chosen to represent the syntactic relations within the sentence • output of the analytical level: analytical tree structure (ATS) • oriented, acyclic graph with one entry node • every word form and punctuation mark is represented as a node • the nodes are annotated by attribute-value pairs • new attribute: analytical function • determines the relation between the dependent node and its governing node • values: Sb, Obj, Adv, Atr, ...
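The structural constraints on an ATS (oriented, acyclic, one entry node) can be checked mechanically once every node knows its governor. The sketch below assumes that the `<r>` and `<g>` values of the tagged example sentence "Ty mají pak někdy takovou publicitu, ..." give each node's position and its governor's position, with 0 standing for the single entry node; that reading is inferred from the sample data, and the function names are illustrative.

```python
# node position -> governor position, copied from the <r>/<g> values of the
# tagged example sentence; 0 is taken to be the entry (root) node.
GOVERNOR = {1: 2, 2: 0, 3: 2, 4: 2, 5: 6, 6: 2, 7: 8,
            8: 6, 9: 13, 10: 11, 11: 13, 12: 13, 13: 8, 14: 0}


def is_rooted_tree(governor, root=0):
    """True if every node reaches the root by following governors,
    i.e. there are no cycles and no dangling governor references."""
    for node in governor:
        seen = set()
        current = node
        while current != root:
            if current in seen or current not in governor:
                return False
            seen.add(current)
            current = governor[current]
    return True


def children(governor, parent):
    """Return the dependents of a node in surface order."""
    return sorted(node for node, gov in governor.items() if gov == parent)


print(is_rooted_tree(GOVERNOR))
print(children(GOVERNOR, 2))
```

Here node 2 (the predicate "mají") governs nodes 1, 3, 4 and 6, and both the predicate and the sentence-final punctuation hang directly on the entry node.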

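Example of ATS • V návrzích na případné změny vycházejí ze svých většinou několikaletých podnikatelských zkušeností. (English: "In their proposals for possible changes, they draw on their mostly several years of business experience.")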

Selected attributes of ATS’s nodes

Selected values of the analytical function

Example of tagged sentence • ...ve sledovaném období žádný okres nezlepšil svoji pozici... (English: "...in the monitored period no district improved its position...")
<f>ve<MMl>v<MMt>RV4<MMt>RV6<A>AuxP<r>4<g>9
<f>sledovaném<MMl>sledovaný_^(*2t)<MMt>AIS61A<MMt>AMS61A<MMt>ANS61A<A>Atr<r>5<g>6
<f>období<MMl>období<MMt>NNP1A<MMt>NNP2A<MMt>NNP4A<MMt>NNP5A<MMt>NNS1A<MMt>NNS2A<MMt>NNS3A<MMt>NNS4A<MMt>NNS5A<MMt>NNS6A<A>Adv<r>6<g>4
<f>žádný<MMl>žádný<MMt>PNFIS4<MMt>PNFYS1<MMt>PNFYS5<A>Atr<r>7<g>8
<f>okres<MMl>okres<MMt>NIS1A<MMt>NIS4A<A>Sb<r>8<g>9
<f>nezlepšil<MMl>zlepšit_:W<MMt>VRYSN<A>Pred_Co<r>9<g>11
<f>pozici<MMl>pozice<MMt>NFS3A<MMt>NFS4A<MMt>NFS6A<A>Obj<r>10<g>9

The Tectogrammatical Level • based on the framework of the Functional Generative Description as developed by Petr Sgall • in comparison to the ATSs, the tectogrammatical tree structures (TGTSs) have the following characteristics: • only autosemantic words have their own node; function words (conjunctions, prepositions) are attached as indices to the autosemantic words to which they belong • nodes are added in case of clearly specified deletions on the surface level • analytical functions are replaced by tectogrammatical functions (functors), such as Actor, Patient, Addressee, ...
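The first of these differences, function words losing their own node, can be illustrated on the tagged excerpt "...ve sledovaném období žádný okres nezlepšil svoji pozici...". The sketch below is only one possible way to picture the step and is not the project's transformation procedure: it assumes that prepositions and conjunctions are the nodes labelled `AuxP` and `AuxC`, removes them, re-attaches their dependents to the next governor up, and records the removed word form in a hypothetical `fw` attribute.

```python
# Assumption: these analytical functions mark prepositions and conjunctions.
FUNCTION_AFUNS = {"AuxP", "AuxC"}

# position -> node, copied from the tagged excerpt (positions 4 to 10).
NODES = {
    4: {"form": "ve", "afun": "AuxP", "gov": 9},
    5: {"form": "sledovaném", "afun": "Atr", "gov": 6},
    6: {"form": "období", "afun": "Adv", "gov": 4},
    7: {"form": "žádný", "afun": "Atr", "gov": 8},
    8: {"form": "okres", "afun": "Sb", "gov": 9},
    9: {"form": "nezlepšil", "afun": "Pred_Co", "gov": 11},
    10: {"form": "pozici", "afun": "Obj", "gov": 9},
}


def collapse_function_words(nodes):
    """Drop function-word nodes and hang their dependents on the nearest
    governor that is not a function word."""
    result = {}
    for position, node in nodes.items():
        if node["afun"] in FUNCTION_AFUNS:
            continue
        attached = []
        gov = node["gov"]
        # Climb past any chain of function words above this node.
        while gov in nodes and nodes[gov]["afun"] in FUNCTION_AFUNS:
            attached.append(nodes[gov]["form"])
            gov = nodes[gov]["gov"]
        result[position] = {"form": node["form"], "afun": node["afun"],
                            "gov": gov, "fw": attached}
    return result


collapsed = collapse_function_words(NODES)
print(collapsed[6])
```

After the step, "období" depends directly on the verb and carries the preposition "ve" as an attribute, while all other nodes keep their governors. Adding nodes for deletions and assigning functors are separate steps that this sketch does not attempt.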

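Example of TGTS • Podle předběžných odhadů se totiž počítá, že do soukromého vlastnictví bude prodáno minimálně 10000 bytů (English: "According to preliminary estimates, it is expected that at least 10000 flats will be sold into private ownership")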

Selected attributes of a TGTS’s node

Functors • tectogrammatical counterparts of analytical functions • about 40 functors in 2 groups: • actants • Actor, Patient, Addressee, Origin, Effect • free modifiers • LOC, DIR1, RSTR, TWHEN, TTIL, ... • provide more detailed information about the relation to the governing node than the analytical function

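Example of ATS ... • Kdo chce investovat dvě stě tisíc korun do nového automobilu, nelekne se, že benzín byl změnou zákona trochu zdražen. (English: "Whoever wants to invest two hundred thousand crowns in a new car will not be alarmed that petrol has become slightly more expensive due to a change in the law.")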

... and the corresponding TGTS

Tectogrammatical tagging • 2 parallel streams, both starting from the ATS treebank • a larger set of partially tagged TGTSs (only changes of tree structure, functor and TFA assignment) • a smaller set of fully tagged TGTSs

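Problems of automatic functor assignment • za roh ("around the corner") - DIR3 • za hodinu ("in an hour") - TWHEN • za svobodu ("for freedom") - OBJ • po otci • TWHEN (Přišel po otci. "He came after his father.") • NORM (Jmenuje se po otci. "He is named after his father.") • HER (Zdědil dům po otci. "He inherited the house from his father.") • . . .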
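The difficulty can be made concrete by counting how many functors remain possible at different levels of context. The sketch below uses only the examples listed on this slide; the table layout and the helper function are illustrative and do not describe any actual assignment method.

```python
from collections import defaultdict

# (preposition, noun form, functor) triples taken from the slide.
OBSERVED = [
    ("za", "roh", "DIR3"),
    ("za", "hodinu", "TWHEN"),
    ("za", "svobodu", "OBJ"),
    ("po", "otci", "TWHEN"),   # Přišel po otci.
    ("po", "otci", "NORM"),    # Jmenuje se po otci.
    ("po", "otci", "HER"),     # Zdědil dům po otci.
]


def candidates_by(key_fn, observed):
    """Group the observed functors under the key computed by key_fn."""
    table = defaultdict(set)
    for preposition, noun, functor in observed:
        table[key_fn(preposition, noun)].add(functor)
    return dict(table)


by_preposition = candidates_by(lambda prep, noun: prep, OBSERVED)
by_phrase = candidates_by(lambda prep, noun: (prep, noun), OBSERVED)

print(sorted(by_preposition["za"]))
print(sorted(by_phrase[("za", "roh")]))
print(sorted(by_phrase[("po", "otci")]))
```

The preposition "za" alone leaves three candidates, and adding the noun resolves them in these examples. For "po otci" even the full prepositional phrase leaves three candidates, so the governing verb has to be taken into account as well.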

Summary • the current state of the art: • there are several manually annotated files of TGTSs • methods for automatic transformation from ATS into TGTS form are in development • (timeline figure: Czech National Corpus, morphologically tagged corpus, ATS treebank, TGTS treebank; dates shown: September 1994, November 1996, March 2000)
