Treebanks and MWEs (Part 1)
Jan Hajič, Pavel Straňák, Jiří Mírovský
Institute of Formal and Applied Linguistics & LINDAT/CLARIN
School of Computer Science, Faculty of Mathematics and Physics
Charles University in Prague, Czech Republic
Outline
• Treebanks
  • Phrase- (constituency-) based: The Penn Treebank
  • Dependency: The Prague Dependency Treebanks
• The Penn Treebank (basics)
• The Prague Dependency Treebank
  • Layers of Annotation
    • Morphology
    • Syntax
    • Semantics
  • Valency
PARSEME Training School Prague
The Penn Treebank
Phrase- vs. Dependency-Based Treebanks
• The original: The Penn Treebank
  • Phrase-based style; good for parsing by CFG grammars
• Followers
  • Almost all Penn-based treebanks
    • Chinese, Arabic, Korean, …
  • Negra (German), many others
• Now: dependency parsing prevails
  • Conversion from phrase-based treebanks
    • Might lose information, heads added “ad hoc”
  • “Native” dependency treebanks: annotated as such
    • Considered “better”
    • Hindi/Urdu, TIGER (sort of); both styles manually annotated
    • PDT (of course) and similar ones
    • PDT-style treebanks: Danish, Croatian, Slovene, Greek, Latin
The Penn Treebank
• Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)
• First the Wall Street Journal part (1 mil. words, 2312 documents)
• Added other text types
  • ATIS corpus (dialogs, travel reservations)
  • Brown corpus annotated for syntax
  • Switchboard (spoken language, tel. conversations)
Penn Treebank Format
Sentence: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

    ( (S
        (NP-SBJ
          (NP (NNP Pierre) (NNP Vinken) )
          (, ,)
          (ADJP
            (NP (CD 61) (NNS years) )
            (JJ old) )
          (, ,) )
        (VP (MD will)
          (VP (VB join)
            (NP (DT the) (NN board) )
            (PP-CLR (IN as)
              (NP (DT a) (JJ nonexecutive) (NN director) ))
            (NP-TMP (NNP Nov.) (CD 29) )))
        (. .) ))

Callouts:
• (NNS years): POS tag NNS (noun, plural); the node directly above a word is a “preterminal”
• (NP ...): phrase label NP (noun phrase)
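The bracketed format above can be read with a few lines of code. The following is a minimal sketch, not the reader shipped with any treebank toolkit: the function names `tokenize`, `parse` and `preterminals` are made up for this illustration, and the sketch assumes that no word or label itself contains a parenthesis.

```python
# Minimal reader for Penn Treebank bracketing (illustrative sketch only).
SAMPLE = """
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
"""


def tokenize(text):
    """Split the bracketing into '(', ')' and label/word tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Return a nested (label, children) tuple; words are plain strings."""
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = ""  # the outermost bracket pair carries no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def preterminals(tree):
    """Collect (word, POS tag) pairs in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(preterminals(child))
    return pairs


tree = parse(tokenize(SAMPLE))
pairs = preterminals(tree)
```

Here `pairs` holds one (word, tag) pair per token of the sentence, for example `("years", "NNS")`, and the phrase labels such as `NP-SBJ` stay available in the nested tuples.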
The Penn Treebank(s)
• Extensions
  • Annotation of named entities, co-reference (BBN)
    • cf. also previous slides
  • Function labels (SBJ, OBJ, TMP, ...)
• PropBank
  • Penn Treebank syntax + predicate-argument relations, added “frame files” (predicate dictionary)
    (S (NP-SBJ (PRP I Arg0) VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1) (NP-IOBJ (DET the) (NN book Arg2))) ... )
• NomBank
  • Like PropBank, but for nouns and their “arguments”
• Other languages (Chinese, Arabic, ...)
The Prague Dependency Treebank
The Prague Dependency Treebanks: the Basics
• Original treebank: PDT 1.0, 2001 (morph., dep. syntax)
• First full release: PDT 2.0
  • http://ufal.mff.cuni.cz/pdt2.0
  • LDC2006T01, see http://www.ldc.upenn.edu
• Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0
• Basic general features
  • Multilayered annotation, interlinked layers
  • Dependency-based syntax (both surface and deep)
  • Information structure of the sentence (topic/focus)
  • Grammatical and basic textual coreference
  • New: discourse relations, MWEs
• Languages: Czech, English (also parallel), Arabic
  • Student work on “samples”: Indonesian, Urdu, Russian, …
  • Spoken: work started on Czech and English (non-parallel, dialogs)
The Prague Dependency Treebank
• Three basic layers of annotation
  • Morphemic layer
  • Surface syntax (“analytical”) layer
  • “Tectogrammatical” layer: underlying syntax, semantic roles (valency), inf. structure, coreference
• Size
  • 830,000 words (tokens) = 50,000 sentences in 3165 full documents (texts)
• Format
  • Prague Markup Language (XML-based)
  • Now also: .treex format
    • For smooth use in the Treex platform
    • http://ufal.mff.cuni.cz/treex
PDT (Czech) Data
• 4 sources:
  • Lidové noviny (daily newspaper, incl. extra sections)
  • DNES (Mladá fronta Dnes) (daily newspaper)
  • Vesmír (popular science magazine, monthly)
  • Českomoravský Profit (economic journal, weekly)
• Full articles selected
  • article ~ DOCUMENT (basic corpus unit)
• Time period: 1990-1995
• 1.8 million tokens (~110,000 sentences total)
PDT Annotation Layers
Slide labels: PDT 1.0 (2001), PDT 2.0 (2006)
• L0 (w) Words (tokens)
  • automatic segmentation and markup only
• L1 (m) Morphology
  • Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)
  • Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)
  • Dependency, “functor”, grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon; PDT 3.0: mass, clauses, formemes, discourse, ...
PDT Annotation Layers
• L0 (w) Words (tokens)
  • automatic segmentation and markup only
• L1 (m) Morphology
  • Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)
  • Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)
  • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
Morphological Attributes
Ex.: nejnezajímavějším “(to) the most uninteresting”
Tag: 13 categories (15 positions, two of them reserved)
Example: AAFP3----3N----

    Pos.  Value  Meaning
     1    A      Adjective
     2    A      Regular
     3    F      Feminine
     4    P      Plural
     5    3      Dative
     6    -      no poss. Gender
     7    -      no poss. Number
     8    -      no person
     9    -      no tense
    10    3      superlative
    11    N      negated
    12    -      no voice
    13    -      reserve1
    14    -      reserve2
    15    -      base var.

Lemma: POS-unique identifier
Books/verb -> book-1, went -> go, to/prep. -> to-1
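Because the tag is positional, decoding it is a matter of pairing each character with its slot. The sketch below assumes the slot order listed on the slide; the English slot names and the helper `decode_tag` are chosen for this illustration and are not taken from any PDT tool.

```python
# Illustrative decoder for a 15-position PDT morphological tag.
# Slot names follow the order shown on the slide; the names themselves
# are labels chosen for this sketch.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a positional tag to {slot: value}, skipping unset ('-') slots."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


decoded = decode_tag("AAFP3----3N----")
```

For the example tag this leaves seven filled slots: POS, SubPOS, Gender, Number, Case, Grade and Negation, which matches the slide's gloss (adjective, regular, feminine, plural, dative, superlative, negated).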
Layer 2 (a-layer): Analytical Syntax
• Dependency + Analytical Function
• Example: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.
[Figure: analytical dependency tree of the example sentence, with one governor node and one dependent node labeled]
Analytical Syntax: Functions
• Main (for [main] semantic lexemes):
  • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
  • “Double” dependency: AtrAdv, AtrObj, AtrAtr
• Special (function words, punctuation, ...):
  • Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
  • Prepositions/Conjunctions: AuxP, AuxC
  • Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
• Structural
  • Ellipsis: ExD; Coordination etc.: Coord, Apos
Tectogrammatical Annotation
• Underlying (deep) syntax
• 5 sublayers (integrated and/or standoff annotation):
  • dependency structure, (detailed) functors
  • valency annotation
  • topic/focus and deep word order
  • coreference (mostly grammatical only)
  • discourse
  • all the rest (grammatemes):
    • detailed functors
    • underlying gender, number, mass nouns, ...
• Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)
Tectogrammatical vs. analytical syntax
Example: In practice, that procedure will require making of certified copies.
• AR (analytical representation): all words
• TR (tectogrammatical representation): no function words
[Figure: the two trees side by side; callouts on the TR tree: predicate verb, “Location”, re-inserted elided actor of “making”]
Dependency Structure
• Similar to the surface (Analytical) layer...
• ...but:
  • certain nodes deleted
    • auxiliaries, non-autosemantic words, punctuation
    • (some) multiword expressions -> 1 node
  • some nodes added
    • based on word (mostly verb, noun) valency
    • some ellipsis resolution
  • detailed dependency relation labels (functors)
Tectogrammatical Functors
• “Actants” (“syntactic”): ACT, PAT, EFF, ADDR, ORIG
  • modify: verbs, nouns, adjectives
  • cannot repeat in a clause, usually obligatory
• Free modifications (~ 50), semantically defined
  • can repeat; optional, sometimes obligatory
  • Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR, ...
  • DPHR, CPHR: MWEs
• Special
  • Coordination, Rhematizers, Foreign phrases (#Forn), ...
Deep Word Order, Topic/Focus
• Example:
  • Baker bakes rolls. vs. Baker(IC) bakes rolls.
[Figure: analytical dep. tree of the example, compared with its deep word order]
Deep Word Order, Topic/Focus
• Deep word order:
  • from “old” information to the “new” one (left-to-right) at every level (head included)
  • projectivity by definition (almost...)
    • i.e., partial level-based order -> total d.w.o.
• Topic/focus/contrastive topic
  • attribute of every node (t, f, c)
  • restricted by d.w.o. and other constraints
Coreference
• Grammatical (easy)
  • relative clauses
    • which, who
    • Peter and Paul, who ...
  • control
    • infinitival constructions
    • John promised to go home
  • reflexive pronouns
    • {him,her,them}self(-ves)
    • Mary saw herself in ...
[Figure: tectogrammatical tree of “John promised to go home”: promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]
Coreference
• Textual
  • Ex.: Peter moved to Iowa after he finished his PhD.
Grammatemes
• Detailed functors (“subfunctors”)
  • needed for some functors:
    • TWHEN: before/after
    • LOC: next-to, behind, in-front-of, ...
    • also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
• Lexical (underlying)
  • number (Sg/Pl), tense, modality, degree of comparison, mass-noun?; is_person_name, is_dsp_root, ...
[Slide callout: MWEs]
Valency in PDT
Prague Dependency Treebank & Valency
• Valency in the PDT
  • Valency lexicon for PDT
  • General valency lexicon
• Valency in deep vs. surface syntax
  • Links between the layers w.r.t. valency
• Valency and word sense
  • Sense-disambiguated occurrences:
    • Links from data to the lexicon
• Valency in translation, text generation
Definition of Valency
• Ability (“desire”) of words (verbs, nouns, adjectives) to combine with other units of meaning
• Properties of valency:
  • Specific for every word meaning (in general)
    • leave: sb left sth for sb vs. sb left from somewhere
    • similar to PropBank leave.02 vs. leave.01
  • Typically strongly correlates with surface form (Czech)
    • morphological case (~ ending), preposition+case, ...
  • Semantic constraints
Structure of Valency
• word (lemma): vyměnit (to replace)
  • word sense group 1: vyměnit1
    • valency frame:
      • slot1 slot2 slot3: ACT PAT EFF
      • surface expression: Nom. Acc. za+Acc.
  • word sense group 2: vyměnit2
    • ...
PDT-Vallex Entry
• dosáhnout: “to reach”, “to get [sb to do sth]”
• browser/user-formatted example:
MWEs in PDT-Vallex
• Types included:
  • Reflexive particle (se, si); annotated in the t_lemma, e.g. smát_se
    • smát se – to laugh
    • všimnout si – to notice
  • Idiomatic constructions; annotated as a DPHR argument
    • dosáhnout svého – to achieve one’s goals
    • běhá mi mráz po zádech – to give me the shivers
  • Light verb constructions (and similar); annotated as a CPHR argument
    • uzavřít dohodu – to agree [on sth], strike an agreement, ...
    • vzbuzovat pochybnosti – to doubt, to raise doubts
Corpus ↔ Valency Lexicon
• Corpus: Sentence 15345, Sentence 51042, Sentence 2035 (each occurrence links to a frame in the lexicon)
• Lexicon:
    ENTRY: uzavřít (to close)
      vf1: ACT(.1) CPHR({smlouva}.4)   ex.: u. dohodu (close a contract)
      vf2: ACT(.1) PAT(.4)             ex.: u. pokoj (close a room, house)
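To make the frame notation concrete, here is a minimal sketch that parses the two frames of the uzavřít entry into slots. It rests on assumptions: that `FUNCTOR(.N)` means a slot realized in morphological case N (Czech cases are conventionally numbered 1 to 7), and that `{lemma}` restricts the slot to the listed lemma. The function `parse_frame` and the dictionary layout are hypothetical and cover only the notation shown on this slide, not the full PDT-Vallex syntax.

```python
import re

# Conventional numbering of Czech morphological cases.
CASES = {
    "1": "nominative", "2": "genitive", "3": "dative", "4": "accusative",
    "5": "vocative", "6": "locative", "7": "instrumental",
}

SLOT_RE = re.compile(r"(\w+)\(([^)]*)\)")            # e.g. CPHR({smlouva}.4)
FORM_RE = re.compile(r"(?:\{(?P<lemmas>[^}]*)\})?\.(?P<case>[1-7])")


def parse_frame(frame):
    """Parse a frame such as 'ACT(.1) PAT(.4)' into a list of slot dicts."""
    slots = []
    for functor, form in SLOT_RE.findall(frame):
        match = FORM_RE.fullmatch(form)
        if match is None:
            raise ValueError(f"unsupported surface form: {form!r}")
        lemmas = match.group("lemmas").split(",") if match.group("lemmas") else []
        slots.append({
            "functor": functor,
            "lemmas": lemmas,                 # lexical restriction, if any
            "case": CASES[match.group("case")],
        })
    return slots


# Hypothetical in-memory copy of the lexicon entry from the slide.
LEXICON = {
    "uzavřít": {
        "vf1": parse_frame("ACT(.1) CPHR({smlouva}.4)"),
        "vf2": parse_frame("ACT(.1) PAT(.4)"),
    }
}
```

A corpus occurrence of uzavřít would then only need to store the pair (lemma, frame id), for example `("uzavřít", "vf1")`, to point at the sense used in that sentence.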
Valency & Text Generation
• Using valency for...
  • ...getting the correct (lemma, tag) of verb arguments
• Example:
  • VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
  • Tectogrammatical tree: starat_se PRED (“to take care of”) with dependents Martin ACT and tygr PAT (“tiger”)
  • Generated (lemma, tag) pairs:
      Martin  ....1..........
      se      ...............
      starat  V..............
      o       ...............
      tygr    ....4..........
  • Surface sentence: Martin se stará o tygry. (“Martin takes care of tigers.”)
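The generation step above can be sketched as a lookup from functor to required surface form. This is an illustration only: the dictionary `FRAME` is a hand-written stand-in for the entry `ACT(.1) PAT(o.[.4])`, read here as "ACT in case 1, PAT as preposition o plus case 4" (an interpretation based on the tags shown on the slide), and `tag_pattern` and `argument_requirements` are hypothetical helpers. The sketch produces tag patterns, not inflected word forms.

```python
TAG_LENGTH = 15      # positions in a PDT morphological tag
CASE_POSITION = 5    # 1-based position of the case slot


def tag_pattern(pos=None, case=None):
    """Build an underspecified tag: '.' means 'any value' at that position."""
    slots = ["."] * TAG_LENGTH
    if pos is not None:
        slots[0] = pos
    if case is not None:
        slots[CASE_POSITION - 1] = str(case)
    return "".join(slots)


# Hand-written stand-in for the entry "starat (se) ACT(.1) PAT(o.[.4])".
FRAME = {
    "ACT": {"preposition": None, "case": 1},
    "PAT": {"preposition": "o", "case": 4},
}


def argument_requirements(frame, arguments):
    """Return (lemma, tag pattern) pairs for the arguments of one verb.

    `arguments` maps functor -> lemma of the node filling that slot.
    """
    pairs = []
    for functor, lemma in arguments.items():
        slot = frame[functor]
        if slot["preposition"]:
            pairs.append((slot["preposition"], tag_pattern()))
        pairs.append((lemma, tag_pattern(case=slot["case"])))
    return pairs


pairs = argument_requirements(FRAME, {"ACT": "Martin", "PAT": "tygr"})
```

A morphological generator would then pick the word form matching each pattern, which is how tygr ends up as the accusative plural tygry in the slide's sentence once number is also known.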
Parallel Treebank Cz-EN
Parallel Czech-English Annotation
• English text → Czech text (human translation)
• Czech side (goal): all layers manual annotation
• English side (goal):
  • Morphology and surface syntax: technical conversion
    • Penn Treebank style -> PDT Analytic layer
  • Tectogrammatical annotation: manual annotation
    • (Slightly) different rules needed for English
• Alignment
  • Natural, sentence level only (now)
English Annotation: POS and Syntax
• Automatic conversion from Penn Treebank
  • PDT morphological layer
    • From POS tags
  • PDT analytic layer
    • From:
      • Penn Treebank syntactic structure
      • Non-terminal labels
      • Function tags (non-terminal “suffixes”)
    • 2-step process
      • Head determination rules
      • Conversion to dependency + analytic function
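The 2-step process can be illustrated on a toy tree. The sketch below is not the conversion actually used for the English data: the `HEAD_RULES` table is a hypothetical, heavily simplified rule set, the tree is a shortened version of the Pierre Vinken example, and only head determination and dependency attachment are shown (assigning analytic functions is left out).

```python
# Toy phrase-structure -> dependency conversion (illustrative sketch).
# A phrase is (label, [children]); a preterminal is (tag, word).
TREE = ("S", [
    ("NP-SBJ", [("NNP", "Pierre"), ("NNP", "Vinken")]),
    ("VP", [
        ("MD", "will"),
        ("VP", [("VB", "join"),
                ("NP", [("DT", "the"), ("NN", "board")])]),
    ]),
    (".", "."),
])

# Hypothetical head rules: for each phrase label, child labels to look for,
# in priority order. Function tags ("-SBJ") are stripped before lookup.
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VP", "VB", "VBD", "MD"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}


def pick_head(label, children):
    """Step 1: return the index of the head child of a phrase."""
    base = label.split("-")[0]
    child_labels = [child[0].split("-")[0] for child in children]
    for wanted in HEAD_RULES.get(base, []):
        for i in range(len(children) - 1, -1, -1):   # rightmost match wins
            if child_labels[i] == wanted:
                return i
    return len(children) - 1                          # fallback: last child


def to_dependencies(tree):
    """Step 2: attach every non-head child's head word to the head word.

    Returns (tokens, heads); heads[i] is the index of the governor of
    token i, or -1 for the root.
    """
    tokens, heads = [], []

    def visit(node):
        label, children = node
        if isinstance(children, str):                 # preterminal
            tokens.append(children)
            heads.append(-1)
            return len(tokens) - 1
        child_heads = [visit(child) for child in children]
        head = child_heads[pick_head(label, children)]
        for child_head in child_heads:
            if child_head != head:
                heads[child_head] = head
        return head

    visit(tree)
    return tokens, heads


tokens, heads = to_dependencies(TREE)
```

With these toy rules the main verb join becomes the root, and the auxiliary will, the subject head Vinken, the object head board and the final period all depend on it. Different head rules give different trees, which is why the rules are the critical part of such a conversion.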
Czech-English Example
• Czech: Dicku Darmane, zavolejte do své kanceláře!
• English: Dick Darman, call your office!
SUMMARY OF PART 1/1
PDT Treebanks at UFAL (written language)
• Czech
  • Prague Dependency Treebank
    • Complex annotation, all levels, additional annotation
  • Translation of Penn Treebank
    • Tectogrammatical layer only, no t/f
    • Analytical, morphology: automatic tool
• English
  • Re-annotation of Penn Treebank
• Other languages
  • Arabic (own annotation)
  • Other: by conversion (HamleDT – 30 treebanks)
Prague Dependency Treebanks
Now also in the Universal Dependencies format! https://github.com/UniversalDependencies
• Annotation:
  • 4 layers:
    • Words, lemmas/tags, surface dep. syntax, tectogrammatics
  • Tectogrammatical layer:
    • No function words, semantic relations
    • Valency/verb arguments (some MWE features)
      • Separate valency lexicon, fully linked from PDT nodes
    • Coreference, Topic/focus, Discourse
    • Links back to analytical layer (parsing!)
Pointers
• PDT 2.0 (the “Original”), newest version: PDT 3.0
  • http://ufal.mff.cuni.cz/pdt2.0
  • http://ufal.mff.cuni.cz/pdt3.0
• PCEDT
  • http://ufal.mff.cuni.cz/pcedt2.0/
• PEDT
  • English side of PCEDT, additional: NE, coreference
  • http://ufal.mff.cuni.cz/pedt2.0/
• PADT (Arabic, morphology + surface syntax)
  • http://ufal.mff.cuni.cz/padt
• Other corpora, PDT-Vallex, EngVallex:
  • Search at http://lindat.cz
• LDC catalog numbers:
  • LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)
• CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)
  • http://ufal.mff.cuni.cz/conll2009-st
• HamleDT 2.0 (30 treebanks in unified format)
  • http://ufal.mff.cuni.cz/hamledt