Beyond parse trees: The Prague Dependency Treebank. Jan Hajič.
The Prague Dependency Treebank Project (Czech Language Treebank) • 1996-2004 • 1998 PDT v. 0.5 released (JHU workshop) • 400k words annotated, unchecked • 2001 PDT 1.0 released (LDC): • 1.3MW annotated, morphology & surface syntax • 2004 PDT 2.0 release planned • 0.8MW annotated, underlying (deep) syntax: the “tectogrammatical layer” • ?2004 MT Resources CD: RD, PTB Cz, Tools CLSP Tuesday Seminar
Annotation Layers • Morphology • Tag (full morphology, 13 categories), lemma • Analytical layer (surface syntax) • Dependency, analytical function • Tectogrammatical layer (underlying syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order)
Morphological Annotation • 13 categories: [category table not preserved]
Layer 1: Morphology
Ex.: “(to) the most uninteresting”
• Tag: 13 categories
• Example: AAFP3----3N----
  1 A: Adjective; 2 A: Regular; 3 F: Feminine; 4 P: Plural; 5 3: Dative;
  6 -: no poss. gender; 7 -: no poss. number; 8 -: no person; 9 -: no tense; 10 3: superlative;
  11 N: negated; 12 -: no voice; 13 -: reserve1; 14 -: reserve2; 15 -: base var.
• Lemma: unique identifier
• Books/verb -> book-1, went -> go, to/prep. -> To-1
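The positional tag on this slide can be unpacked mechanically, one character per position. The sketch below is a minimal illustration, assuming the 15 positions follow the order in which the slide glosses them (13 categories plus two reserve slots) and that `-` marks a position that does not apply. The names in `POSITIONS` and the function `decode_tag` are hypothetical and not part of any PDT tool.

```python
# Position names are assumptions read off the slide's gloss of AAFP3----3N----.
POSITIONS = [
    "pos", "subpos", "gender", "number", "case",
    "poss_gender", "poss_number", "person", "tense", "grade",
    "negation", "voice", "reserve1", "reserve2", "variant",
]


def decode_tag(tag):
    """Map each character of a positional tag to its (assumed) category name."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    # '-' is treated as "not applicable" and stored as None.
    return {name: (None if ch == "-" else ch) for name, ch in zip(POSITIONS, tag)}


decoded = decode_tag("AAFP3----3N----")
print(decoded["gender"], decoded["number"], decoded["case"], decoded["grade"])
# prints: F P 3 3
```

Keeping the tag as a fixed-width string and decoding on demand makes it easy to filter on a single category (for example, all datives) without parsing anything else.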
Layer 2: Analytical syntax • Dependency + Analytical Function • Example: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated. • [Figure: dependency tree with edges labeled governor / dependent, not preserved]
Analytical functions • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom • AuxT, AuxR, AuxO, AuxZ, AuxY • AuxP, AuxC • AuxX, AuxS, AuxG, AuxK • AtrAdv, AdvAtr; AtrObj, ObjAtr, AtrAtr • ExD • Coord, Apos; ..._Co, ..._Ap; ..._Pa
Layer 3: Tectogrammatical • Underlying (deep) syntax • 4 sublayers: • dependency structure, (detailed) functors • topic/focus and deep word order • coreference (mostly grammatical only) • all the rest (grammatemes): • detailed functors • underlying gender, number, ...
Dependency structure • Similar to the surface (Analytical) layer... ...but: • certain nodes deleted • auxiliaries, non-autosemantic words, punctuation • some nodes added • based on word (mostly verb, noun) valency • some ellipsis resolution • detailed dependency relation labels (functors)
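The "certain nodes deleted" step can be sketched as a small tree rewrite. This is a minimal illustration, not the project's AR->TR procedure: it assumes a flat node list with head ids (0 for the root), treats a hand-picked set of `Aux*` analytical functions as deletable, and reattaches orphaned dependents to the nearest surviving governor. `Node`, `DROPPED`, and `collapse` are hypothetical names, the toy sentence is hand-annotated, and node addition, ellipsis resolution, and functor assignment are not attempted.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int
    form: str
    head: int  # id of the governor, 0 for the root
    afun: str  # analytical function


# Assumption: these analytical functions mark nodes that do not survive.
DROPPED = {"AuxV", "AuxP", "AuxC", "AuxX", "AuxG", "AuxK"}


def collapse(nodes):
    """Remove DROPPED nodes and reattach their dependents upward."""
    by_id = {n.id: n for n in nodes}

    def surviving_head(head):
        # Climb until the governor is the root or a node that is kept.
        while head != 0 and by_id[head].afun in DROPPED:
            head = by_id[head].head
        return head

    return [
        Node(n.id, n.form, surviving_head(n.head), n.afun)
        for n in nodes
        if n.afun not in DROPPED
    ]


# Hand-made toy input: the preposition governs the noun on the analytical layer.
analytical = [
    Node(1, "works", 0, "Pred"),
    Node(2, "in", 1, "AuxP"),
    Node(3, "Prague", 2, "Adv"),
    Node(4, ".", 0, "AuxK"),
]
for node in collapse(analytical):
    print(node.form, node.head)
# prints:
# works 0
# Prague 1
```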
Analytical vs. Tectogrammatical annotation (TR: sublayer 1 only shown) • [Figure not preserved; callouts: Underlying verb + tense; Deep function; Elided Actor in; Another ellipsis...; Prepositions out]
Tectogrammatical Functors • “Actants”: ACT, PAT, EFF, ADDR, ORIG • cannot repeat in a clause, usually compulsory • Free modifications (~ 50) • can repeat; optional, sometimes compulsory • Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RESTR, DESC; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR • Special • Coordination, Rhematizers, Foreign phrases, ...
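The "cannot repeat in a clause" constraint on actants lends itself to a simple consistency check. The sketch below assumes the functors of one governor's direct dependents are available as a flat list and ignores coordination, where several nodes may legitimately share a functor. `repeated_actants` is a hypothetical helper name.

```python
from collections import Counter

# The five actants listed on the slide.
ACTANTS = {"ACT", "PAT", "EFF", "ADDR", "ORIG"}


def repeated_actants(functors):
    """Return the actants that occur more than once among one governor's dependents."""
    counts = Counter(f for f in functors if f in ACTANTS)
    return sorted(f for f, c in counts.items() if c > 1)


print(repeated_actants(["ACT", "PAT", "LOC", "LOC"]))  # prints: []
print(repeated_actants(["ACT", "ACT", "TWHEN"]))       # prints: ['ACT']
```

Free modifications such as LOC are deliberately not counted, since the slide states that they can repeat.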
Tectogrammatical Example • Analytical verb form: • (he) allowed would-be to-be enrolled • směl by být zapsán • [Figure callouts: Collapsed; Additional attributes (grammatemes): conditional + “allow”]
Tectogrammatical Example • Passive construction (action) • (The) book has-been translated [by Mr. X] • Kniha byla přeložena • [Figure callouts: Disappeared; Added]
Tectogrammatical Example • Object • (he) gave him a-book • dal mu knihu • Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame
Tectogrammatical Example • Incomplete phrases • Peter works well, but Paul badly • Petr pracuje dobře, ale Pavel špatně • [Figure callout: Added]
The Valency Lexicon • Valency frames • each verb (+ some nouns, adjectives) • has “slots” for functor/form pairs: • give: ACT(Nom) PAT(Acc) ADDR(to+Dat) • Basic set prepared in advance, annotators add entries on-the-go, checking and approval process follows (consistency) • Compare: Levin’s Classes, Proposition Bank
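A valency frame of functor/form pairs can be held as plain data and used to label a verb's dependents. The following is a minimal sketch under stated assumptions: the frame for "give" is the one shown on the slide, while the dictionary layout, the string form labels, and the function `assign_functors` are hypothetical. Real frame matching has to deal with alternative forms and optional slots, which this ignores.

```python
# Frame for "give" as shown on the slide; the dictionary layout is an assumption.
VALENCY = {
    "give": [("ACT", "Nom"), ("PAT", "Acc"), ("ADDR", "to+Dat")],
}


def assign_functors(verb, dependents):
    """Label each (word, form) dependent with the functor whose slot has that form.

    Dependents whose form fills no slot get None (for example free
    modifications, which a frame lookup alone cannot label).
    """
    slot_by_form = {form: functor for functor, form in VALENCY[verb]}
    return [(word, slot_by_form.get(form)) for word, form in dependents]


deps = [("he", "Nom"), ("book", "Acc"), ("him", "to+Dat"), ("yesterday", "Adv")]
print(assign_functors("give", deps))
# prints: [('he', 'ACT'), ('book', 'PAT'), ('him', 'ADDR'), ('yesterday', None)]
```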
Deep word order, topic/focus • Deep word order: • from “old” information to the “new” one (left-to-right) at every level (head included) • projectivity by definition • i.e., partial level-based order -> total d.w.o. • Topic/focus/contrastive topic • attribute of every node • restricted by d.w.o. and other constraints
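The step from a partial, level-based order to a total deep word order can be sketched as a recursive expansion. The example assumes each node stores only the left-to-right order of itself and its direct dependents; expanding those local orders recursively yields a total order in which every subtree is contiguous, so the result is projective by construction. The function `linearize`, the data layout, and the toy tree are hypothetical, and words double as node ids for brevity.

```python
def linearize(node, local_order):
    """Expand per-node orders into one total order.

    local_order[node] lists the node itself and its direct dependents,
    left to right; leaves may be omitted from the mapping.
    """
    total = []
    for item in local_order.get(node, [node]):
        if item == node:
            total.append(node)
        else:
            total.extend(linearize(item, local_order))
    return total


# Hand-made toy tree rooted in "bakes".
local_order = {
    "bakes": ["baker", "bakes", "rolls"],
    "baker": ["remaining", "baker"],
    "rolls": ["rolls", "famous"],
}
print(linearize("bakes", local_order))
# prints: ['remaining', 'baker', 'bakes', 'rolls', 'famous']
```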
Deep word order, topic/focus • Example: • Baker bakes rolls. vs. Baker[IC] bakes rolls. • Analytical dep. tree: [figure not preserved]
TL: Current Status (Feb. 03) • Structure, functors, some grammatemes • 350,000 words • Coreference + topic-focus • started (~10,000 words) • Everything else • 300 sentences • Plan: 55,000 sentences ~ 800,000 words • English, German (automatically), Arabic
The Future • Lexical semantics - WSD • Czech EuroWordnet (Brno, FI MU) • 15000 nouns, 4000 verbs • currently being manually annotated (20kW) • Common representation • “language independent” • functors (ok), lemmas (??), grammatemes (?) • structure, TFA, coref: identical (?)
Tools • Morphological dictionary + Tagger(s) • Collins parser (-> analytical level) + Afun • PTB -> AR, deterministic rules • Deterministic transformation AR->TR • Czech & English; for Cze, FUNC labeling • Baseline MT system Eng<->Cze • incl. large dictionary
How can we use it?
Machine Translation • Source --> intermediate --> Target • Intermediate representation: [Interlingua] -> tectogrammatical -> surface synt. • less “work” in the transfer phase • more work in parsing and generation • ...but advantage in multilingual MT application
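The Basic Scheme • The additional three steps: [diagram not preserved; its labels read as a pipeline] • Analysis: source sentence -> morphology (tagging) -> morphological layer -> parsing -> analytical layer -> (tectogrammatical) parsing -> tectogrammatical layer • Transfer • Synthesis: tectogrammatical layer -> Generation -> analytical layer -> linearization (trivial) -> morphological layer -> morph. synthesis (easy) -> target sentence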
Types of Correspondence • Original Czech translation too far... • 50% 1:1 • 5% 1:2, 1:0, 0:1, 2:1 • each of the other types (~90 types!) occurs once or twice • Retranslated Czech • 90% 1:1, 1:0, 1:2, 2:1, 0:1 • rest is bad (~40 types)
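Correspondence types such as 1:1 or 1:2 can be tallied directly from a node alignment. The sketch below assumes the alignment is given as a list of (source nodes, target nodes) groups. The ids and counts are made up for illustration and do not reproduce the percentages on the slide; `correspondence_types` is a hypothetical helper name.

```python
from collections import Counter


def correspondence_types(alignment):
    """Count alignment groups by their 'm:n' type."""
    return Counter(f"{len(src)}:{len(tgt)}" for src, tgt in alignment)


# Made-up alignment groups between Czech (c*) and English (e*) nodes.
alignment = [
    (["c1"], ["e1"]),
    (["c2"], ["e2"]),
    (["c3"], ["e3", "e4"]),
    (["c4"], []),
    ([], ["e5"]),
]
counts = correspondence_types(alignment)
share_one_to_one = counts["1:1"] / sum(counts.values())
print(dict(counts), share_one_to_one)
```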
Comparing Czech and English • Original Czech: Do tohoto “mikrofonu” pak začal zpívat. • English: [...] this “mike” Les began to sing. • Retranslated Czech: Do tohoto “mikrofonu” začal Les zpívat.
Comparing Czech and Arabic • The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River. • ‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.
MT Results Czech - English
Question Answering • Question: / Answer: [example trees not preserved; the figure marks the answer subtree]
Question Answering • Subtree match • Except wh- words • Inclusion: • Question in answer • Answer: • Subtree corresponding to the wh- word • Yes/no questions • As above but no wh- word
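The subtree-match idea above can be sketched as a recursive embedding test. This is a minimal illustration under stated assumptions rather than the system described in the talk: trees carry only a lemma and a functor, a wh- word matches any answer node with the same functor, and question children are matched greedily without checking that each answer child is used at most once. `Tree`, `WH_WORDS`, `match_at`, and `answer` are hypothetical names, and the example trees are hand-made.

```python
from dataclasses import dataclass, field

WH_WORDS = {"who", "what", "when", "where"}  # illustrative, not exhaustive


@dataclass
class Tree:
    lemma: str
    functor: str
    children: list = field(default_factory=list)


def match_at(q, a):
    """Try to embed question tree q into answer tree a, roots aligned.

    Returns (matched, wh_subtree); wh_subtree is the answer subtree aligned
    with the question's wh- word, or None for yes/no questions.
    """
    if q.functor != a.functor:
        return False, None
    if q.lemma in WH_WORDS:
        return True, a  # a wh- word matches any subtree with the same functor
    if q.lemma != a.lemma:
        return False, None
    found = None
    for q_child in q.children:
        for a_child in a.children:
            ok, sub = match_at(q_child, a_child)
            if ok:
                if sub is not None:
                    found = sub
                break
        else:
            return False, None  # a question child has no counterpart
    return True, found


def answer(q, a):
    """Search the answer tree top-down for a node where q embeds."""
    ok, sub = match_at(q, a)
    if ok:
        return True, sub
    for child in a.children:
        ok, sub = answer(q, child)
        if ok:
            return True, sub
    return False, None


# "Who bakes rolls?" against "The remaining baker bakes rolls."
question = Tree("bake", "PRED", [Tree("who", "ACT"), Tree("roll", "PAT")])
sentence = Tree("bake", "PRED", [
    Tree("baker", "ACT", [Tree("remaining", "RESTR")]),
    Tree("roll", "PAT"),
])
ok, sub = answer(question, sentence)
print(ok, sub.lemma)  # prints: True baker
```

A yes/no question is the same call with no wh- node in the question tree: a successful match then returns `(True, None)`.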
Question Answering • Synonymy • Be ~ become ~ work ~ ... • Inferences • France got Louis XIV as its king in .... ~ Louis XIV was the king of France in ... • Partial answers • Nonempty intersection • More info • Coling ‘82 paper by Jirku
Some pointers • Current version of PDT: v1.0 • morphology + analytical level • 1.3M words (train/dev test/eval test) • http://ufal.mff.cuni.cz/pdt • Projects • http://www.ldc.upenn.edu • LDC2001T10 (PDT v1.0) • http://www.clsp.jhu.edu: Workshop 2002 • Using TL for MT Generation