Treebanks and MWEs (Part 1)


  1. Treebanks and MWEs (Part 1) • Jan Hajič, Pavel Straňák, Jiří Mírovský • Institute of Formal and Applied Linguistics & LINDAT/CLARIN, School of Computer Science, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic

  2. Outline • Treebanks • Phrase-(Constituency-) based: The Penn Treebank • Dependency: The Prague Dependency Treebanks • The Penn Treebank (basics) • The Prague Dependency Treebank • Layers of Annotation • Morphology • Syntax • Semantics • Valency PARSEME Training School Prague

  3. The Penn Treebank

  4. Phrase- vs. Dependency-Based Treebanks • The original: The Penn Treebank • Phrase-based style; good for parsing with context-free grammars (CFGs) • Followers • Almost all Penn-based treebanks • Chinese, Arabic, Korean, … • Negra (German), many others • Now: dependency parsing prevails • Conversion from phrase-based treebanks • Might lose information, heads added “ad hoc” • “Native” dependency treebanks: annotated as such • Considered “better” • Hindi/Urdu, TIGER (sort of); both styles manually annotated • PDT (of course) and similar ones • PDT-style treebanks: Danish, Croatian, Slovene, Greek, Latin

  5. The Penn Treebank • Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu) • First the Wall Street Journal part (1 million words, 2312 documents) • Added other text types • ATIS corpus (dialogs, travel reservations) • Brown corpus annotated for syntax • Switchboard (spoken language, telephone conversations)

  6. Penn Treebank Format
     Sentence: Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

     ( (S
         (NP-SBJ
           (NP (NNP Pierre) (NNP Vinken) )
           (, ,)
           (ADJP
             (NP (CD 61) (NNS years) )
             (JJ old) )
           (, ,) )
         (VP (MD will)
           (VP (VB join)
             (NP (DT the) (NN board) )
             (PP-CLR (IN as)
               (NP (DT a) (JJ nonexecutive) (NN director) ))
             (NP-TMP (NNP Nov.) (CD 29) )))
         (. .) ))

     Callouts: phrase label (NP) = noun phrase; POS tag (NNS) = noun, plural (the “preterminal”)
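The bracketed format on this slide can be read with a very small recursive parser. The following is a minimal sketch, not the reader used by any particular toolkit; the function names `parse_bracketed` and `leaves` and the tuple layout `(label, children)` are assumptions made for illustration.

```python
SAMPLE = """
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
"""


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""  # the outermost wrapper "( (S ...) )" has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word under a preterminal
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse_node()


def leaves(node):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = node
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))
        else:
            pairs.extend(leaves(child))
    return pairs


root = parse_bracketed(SAMPLE)
print(" ".join(word for word, _tag in leaves(root)))
```

Phrase labels keep their function suffixes (for example `NP-SBJ`), so a caller can split on `-` to separate the phrase label from the function tag.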

  7. The Penn Treebank(s) • Extensions • Annotation of named entities, co-reference (BBN) • cf. also previous slides • Function labels (SBJ, OBJ, TMP, ...) • PropBank • Penn Treebank syntax + Predicate-argument relations, added “frame files” (predicate dictionary) (S (NP-SBJ (PRP I Arg0) VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1) (NP-IOBJ (DET the) (NN book Arg2))) ... ) • NomBank • Like PropBank, but for nouns and their “arguments” • Other languages (Chinese, Arabic, ...)

  8. The Prague Dependency Treebank

  9. The Prague Dependency Treebanks: the Basics • Original Treebank: PDT 1.0, 2001 (morphology, dependency syntax) • First full release: PDT 2.0 • http://ufal.mff.cuni.cz/pdt2.0 • LDC2006T01, see http://www.ldc.upenn.edu • Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0 • Basic general features • Multilayered annotation, interlinked layers • Dependency-based syntax (both surface and deep) • Information structure of the sentence (topic/focus) • Grammatical and basic textual coreference • New: discourse relations, MWEs • Languages: Czech, English (also parallel), Arabic • Student work on “samples”: Indonesian, Urdu, Russian, … • Spoken: work started on Czech and English (non-parallel, dialogs)

  10. The Prague Dependency Treebank • Three basic layers of annotation • Morphemic layer • Surface syntax (“analytical”) layer • “Tectogrammatical” layer: underlying syntax, semantic roles (valency), information structure, coreference • Size • 830,000 words (tokens) = 50,000 sentences in 3,165 full documents (texts) • Format • Prague Markup Language (XML-based) • Now also: .treex format • For smooth use in the Treex platform • http://ufal.mff.cuni.cz/treex

  11. PDT (Czech) Data • 4 sources: • Lidové noviny (daily newspaper, incl. extra sections) • DNES (Mladá fronta Dnes) (daily newspaper) • Vesmír (popular science magazine, monthly) • Českomoravský Profit (economic journal, weekly) • Full articles selected • article ~ DOCUMENT (basic corpus unit) • Time period: 1990-1995 • 1.8 million tokens (~110,000 sentences total)

  12. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, “functor”, grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon; PDT 3.0: mass, clauses, formemes, discourse, ... • [Slide callouts: PDT 1.0 (2001), PDT 2.0 (2006)]

  13. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

  14. Morphological Attributes
      Example: nejnezajímavějším “(to) the most uninteresting”
      Tag: 13 categories. Example: AAFP3----3N----

        Position  Value  Meaning
        1         A      Adjective
        2         A      Regular
        3         F      Feminine
        4         P      Plural
        5         3      Dative
        6         -      no poss. Gender
        7         -      no poss. Number
        8         -      no person
        9         -      no tense
        10        3      superlative
        11        N      negated
        12        -      no voice
        13        -      reserve1
        14        -      reserve2
        15        -      base var.

      Lemma: POS-unique identifier. Examples: Books/verb -> book-1, went -> go, to/prep. -> to-1
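Because the tag is positional, decoding it is a matter of pairing each character with its slot. The sketch below assumes the position order shown on the slide; the names in `POSITIONS` and the function `decode_tag` are illustrative labels, not an official API.

```python
# Slot names in tag order, following the slide's labels (naming is an assumption).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense",
    "Grade", "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a positional tag to {slot: value}, skipping unset ('-') slots."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode_tag("AAFP3----3N----"))
```

For the slide's example, only seven slots are set: part of speech and subtype, gender, number, case, grade and negation.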

  15. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

  16. Layer 2 (a-layer): Analytical Syntax • Dependency + Analytical Function • Example: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated. • [Figure: analytical dependency tree of the example, with “governor” and “dependent” callouts]

  17. Analytical Syntax: Functions • Main (for [main] semantic lexemes): • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom • “Double” dependency: AtrAdv, AtrObj, AtrAtr • Special (function words, punctuation, ...): • Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY • Prepositions/Conjunctions: AuxP, AuxC • Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK • Structural • Ellipsis: ExD; Coordination etc.: Coord, Apos

  18. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

  19. Tectogrammatical Annotation • Underlying (deep) syntax • 5 sublayers (integrated and/or standoff annotation): • dependency structure, (detailed) functors • valency annotation • topic/focus and deep word order • coreference (mostly grammatical only) • discourse • all the rest (grammatemes): • detailed functors • underlying gender, number, mass nouns, ... • Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)

  20. Tectogrammatical vs. Analytical Syntax • Example: In practice, that procedure will require making of certified copies. • AR: All words • TR: No function words • Re-inserted elided actor of “making” • [Figure callouts: Predicate verb, “Location”]

  21. Dependency Structure • Similar to the surface (Analytical) layer... ...but: • certain nodes deleted • auxiliaries, non-autosemantic words, punctuation • (some) multiword expressions -> 1 node • some nodes added • based on word (mostly verb, noun) valency • some ellipsis resolution • detailed dependency relation labels (functors)
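The "certain nodes deleted" step can be illustrated on a toy tree. The following is only a sketch of that one step: it hides nodes whose analytical function marks them as function words or punctuation and re-attaches their children to the nearest remaining ancestor. The set `FUNCTION_AFUNS`, the dictionary layout and the example sentence are assumptions for illustration; the real a-layer to t-layer transformation also adds nodes, merges multiword expressions and assigns functors.

```python
# Analytical functions treated as "to be hidden" in this toy example (an assumption).
FUNCTION_AFUNS = {"AuxV", "AuxP", "AuxC", "AuxX", "AuxG", "AuxK"}

# Hand-built toy analytical tree: id -> (form, afun, parent id); 0 is the technical root.
A_TREE = {
    1: ("He", "Sb", 4),
    2: ("has", "AuxV", 4),
    3: ("been", "AuxV", 4),
    4: ("waiting", "Pred", 0),
    5: ("for", "AuxP", 4),
    6: ("John", "Obj", 5),
    7: (".", "AuxK", 0),
}


def collapse_function_words(nodes):
    """Drop function-word nodes; attach survivors to their nearest kept ancestor."""

    def kept_ancestor(node_id):
        parent = nodes[node_id][2]
        while parent != 0 and nodes[parent][1] in FUNCTION_AFUNS:
            parent = nodes[parent][2]
        return parent

    return {
        node_id: (form, kept_ancestor(node_id))
        for node_id, (form, afun, _parent) in nodes.items()
        if afun not in FUNCTION_AFUNS
    }


print(collapse_function_words(A_TREE))
```

In the result, "John" hangs directly on "waiting" because the preposition node between them has been hidden.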

  22. Tectogrammatical Functors • “Actants”: ACT, PAT, EFF, ADDR, ORIG • modify: verbs, nouns, adjectives • cannot repeat in a clause, usually obligatory • Free modifications (~50), semantically defined • can repeat; optional, sometimes obligatory • Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR, ... • Special • Coordination, Rhematizers, Foreign phrases (#Forn), ... • [Slide callouts: semantic, “syntactic”, MWEs]

  23. Deep Word Order, Topic/Focus • Example: • Baker bakes rolls. vs. Baker_IC bakes rolls. • [Figure label: Analytical dep. tree]

  24. Deep Word Order, Topic/Focus • Deep word order: • from “old” information to the “new” one (left-to-right) at every level (head included) • projectivity by definition (almost...) • i.e., partial level-based order -> total d.w.o. • Topic/focus/contrastive topic • attribute of every node (t, f, c) • restricted by d.w.o. and other constraints

  25. Coreference • Grammatical (easy) • relative clauses • which, who • Peter and Paul, who ... • control • infinitival constructions • John promised to go home • reflexive pronouns • {him,her,them}self(-ves) • Mary saw herself in ... • [Figure: tectogrammatical tree for “John promised to go home”: promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]

  26. Coreference • Textual • Ex.: Peter moved to Iowa after he finished his PhD.

  27. Grammatemes • Detailed functors (“subfunctors”) • needed for some functors: • TWHEN: before/after • LOC: next-to, behind, in-front-of, ... • also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT • Lexical (underlying) • number (Sg/Pl), tense, modality, degree of comparison, mass-noun?; is_person_name, is_dsp_root, ... • [Slide callout: MWEs]

  28. Valency in PDT

  29. Prague Dependency Treebank & Valency • Valency in the PDT • Valency lexicon for PDT • General valency lexicon • Valency in deep vs. surface syntax • Links between the layers w.r.t. valency • Valency and word sense • Sense-disambiguated occurrences: • Links from data to the lexicon • Valency in translation, text generation

  30. Definition of Valency • Ability (“desire”) of words (verbs, nouns, adjectives) to combine with other units of meaning • Properties of valency: • Specific for every word meaning (in general) • leave: sb left sth for sb vs. sb left from somewhere • similar to PropBank leave.02 vs. leave.01 • Typically strongly correlates with surface form (Czech) • morphological case (~ ending), preposition+case, ... • Semantic constraints

  31. Structure of Valency • word (lemma), e.g. vyměnit (to replace) • word sense group 1, e.g. vyměnit1 • valency frame: slot1 slot2 slot3, e.g. ACT PAT EFF • surface expression, e.g. Nom. Acc. za+Acc. • word sense group 2, e.g. vyměnit2 • ...

  32. PDT-Vallex Entry • dosáhnout: “to reach”, “to get [sb to do sth]” • browser/user-formatted example:

  33. MWEs in PDT-Vallex • Types included: • Reflexive particle (se, si); annotated as: smát_se (t_lemma) • smát se – to laugh • všimnout si – to notice • Idiomatic constructions; annotated as: DPHR (argument) • dosáhnout svého – to achieve one’s goals • běhá mi mráz po zádech – to give me the shivers • Light verb constructions (and similar); annotated as: CPHR (argument) • uzavřít dohodu – to agree [on sth], strike an agreement, ... • vzbuzovat pochybnosti – to doubt, to raise doubts

  34. Corpus ↔ Valency Lexicon • Corpus: Sentence 15345, Sentence 51042, Sentence 2035 [sentence figures not preserved in the transcript] • Lexicon: ENTRY: uzavřít (to close) • vf1: ACT(.1) CPHR({smlouva}.4), ex.: u. dohodu (close a contract) • vf2: ACT(.1) PAT(.4), ex.: u. pokoj (close a room, house)
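The link between corpus nodes and lexicon frames can be pictured as a lookup by lemma and frame identifier. The sketch below copies the two frames of uzavřít from the slide; the dictionary layout, the node identifiers in `CORPUS_LINKS` and the helper names are hypothetical and do not reflect the actual PDT-Vallex file format.

```python
import re

# Frames copied from the slide; the dict layout is an assumption.
LEXICON = {
    "uzavřít": {
        "vf1": "ACT(.1) CPHR({smlouva}.4)",
        "vf2": "ACT(.1) PAT(.4)",
    }
}

# Hypothetical corpus links (node id -> lemma, frame id); not real PDT data.
CORPUS_LINKS = {
    "node-a": ("uzavřít", "vf1"),
    "node-b": ("uzavřít", "vf2"),
}

# One slot looks like FUNCTOR(surface-form), e.g. PAT(.4) or CPHR({smlouva}.4).
SLOT = re.compile(r"([A-Z]+[0-9]?)\(([^)]*)\)")


def parse_frame(frame):
    """Split a frame string into (functor, surface form) pairs."""
    return SLOT.findall(frame)


def frame_for(node_id):
    """Follow a corpus node's link into the lexicon and return its slots."""
    lemma, frame_id = CORPUS_LINKS[node_id]
    return parse_frame(LEXICON[lemma][frame_id])


print(frame_for("node-a"))
print(frame_for("node-b"))
```

Storing only the frame identifier on the corpus node keeps each occurrence sense-disambiguated while the frame itself lives in one place.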

  35. Valency & Text Generation • Using valency for getting the correct (lemma, tag) of verb arguments • Example: • VALLEX entry: starat (se) ACT(.1) PAT(o.[.4]) • Tectogrammatical tree: starat_se (PRED, “to take care of”) governs Martin (ACT) and tygr (PAT, “tiger”) • Generated (lemma, tag) pairs: starat V.............. • se ............... • o ............... • Martin ....1.......... • tygr ....4.......... • Surface sentence: Martin se stará o tygry. “Martin takes care of tigers.”
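The generation step on this slide can be sketched as a lookup from functor to required surface form. In the sketch below, `FRAME` is a hand-written rendering of the slide's entry ACT(.1) PAT(o.[.4]); the dictionary layout and the helper names `tag_mask` and `surface_requirements` are assumptions, and the mask follows the slide's 15-position tags with the case in the fifth position.

```python
# Hand-written frame for "starat se", mirroring ACT(.1) PAT(o.[.4]) on the slide.
# Layout (an assumption): functor -> (preposition or None, morphological case).
FRAME = {"ACT": (None, "1"), "PAT": ("o", "4")}


def tag_mask(case):
    """Return a 15-position tag mask with only the case (5th position) fixed."""
    return "...." + case + "." * 10


def surface_requirements(arguments, frame):
    """Turn (functor, lemma) arguments into (lemma, tag mask) requirements."""
    required = []
    for functor, lemma in arguments:
        preposition, case = frame[functor]
        if preposition is not None:
            required.append((preposition, None))  # the preposition itself is fixed
        required.append((lemma, tag_mask(case)))
    return required


print(surface_requirements([("ACT", "Martin"), ("PAT", "tygr")], FRAME))
```

A morphological generator would then turn each (lemma, mask) pair into a word form, for example tygr with accusative case into tygry once the number is also known.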

  36. Parallel Treebank Czech-English

  37. Parallel Czech-English Annotation • English text → Czech text (human translation) • Czech side (goal): all layers manual annotation • English side (goal): • Morphology and surface syntax: technical conversion • Penn Treebank style -> PDT Analytic layer • Tectogrammatical annotation: manual annotation • (Slightly) different rules needed for English • Alignment • Natural, sentence level only (now)

  38. English Annotation: POS and Syntax • Automatic conversion from Penn Treebank • PDT morphological layer • From POS tags • PDT analytic layer • From: • Penn Treebank Syntactic Structure • Non-terminal labels • Function tags (non-terminal “suffixes”) • 2-step process • Head determination rules • Conversion to dependency + analytic function
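The two-step process can be sketched on a toy tree: first pick a head child for every phrase using a rule table, then attach the heads of the remaining children to it. The rule table `HEAD_RULES`, the tree encoding and the function names below are assumptions for illustration and are not the rules used for the actual conversion; assigning analytic functions from the function tags is left out.

```python
# Toy head rules (an assumption): phrase label -> child labels in priority order.
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VP", "VB", "VBD", "VBZ", "MD"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}

# Phrases are (label, [children]); preterminals are (POS tag, word).
TREE = ("S", [
    ("NP-SBJ", [("DT", "the"), ("NN", "board")]),
    ("VP", [("VBD", "met"), ("NP", [("NNP", "Vinken")])]),
])


def base(label):
    """Strip function tags such as -SBJ from a phrase label."""
    return label.split("-")[0]


def convert(tree):
    """Return (words, root index, sorted (dependent, governor) index pairs)."""
    words, edges = [], []

    def head_of(node):
        label, rest = node
        if isinstance(rest, str):  # preterminal: register the word
            words.append(rest)
            return len(words) - 1
        heads = [head_of(child) for child in rest]
        child_labels = [base(child[0]) for child in rest]
        chosen = len(rest) - 1  # fallback: rightmost child is the head
        for wanted in HEAD_RULES.get(base(label), []):
            if wanted in child_labels:
                chosen = child_labels.index(wanted)
                break
        for i, head in enumerate(heads):
            if i != chosen:
                edges.append((head, heads[chosen]))
        return heads[chosen]

    root = head_of(tree)
    return words, root, sorted(edges)


print(convert(TREE))
```

The fallback to the rightmost child is exactly the kind of "ad hoc" head choice that makes converted treebanks differ from natively annotated dependency treebanks.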

  39. Czech-English Example • Dicku Darmane, zavolejte do své kanceláře! • Dick Darman, call your office!

  40. SUMMARY OF PART 1/1

  41. PDT Treebanks at UFAL (written language) • Czech • Prague Dependency Treebank • Complex annotation, all levels, additional annotation • Translation of Penn Treebank • Tectogrammatical layer only, no t/f • Analytical, morphology: automatic tool • English • Re-annotation of Penn Treebank • Other languages • Arabic (own annotation) • Other: by conversion (HamleDT – 30 treebanks)

  42. Prague Dependency Treebanks • Now also in the Universal Dependencies format: https://github.com/UniversalDependencies • Annotation: • 4 layers: • Words, lemmas/tags, surface dep. syntax, tectogrammatics • Tectogrammatical layer: • No function words, semantic relations • Valency/verb arguments (some MWE features) • Separate valency lexicon, fully linked from PDT nodes • Coreference, Topic/focus, Discourse • Links back to analytical layer (parsing!)

  43. Pointers • PDT 2.0 (the “Original”), newest version: PDT 3.0 • http://ufal.mff.cuni.cz/pdt2.0 • http://ufal.mff.cuni.cz/pdt3.0 • PCEDT • http://ufal.mff.cuni.cz/pcedt2.0/ • PEDT • English side of PCEDT, additional: NE, coreference • http://ufal.mff.cuni.cz/pedt2.0/ • PADT (Arabic, morphology + surface syntax) • http://ufal.mff.cuni.cz/padt • Other corpora, PDT-Vallex, EngVallex: • Search at http://lindat.cz • LDC catalog numbers: • LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0) • CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only) • http://ufal.mff.cuni.cz/conll2009-st • HamleDT 2.0 (30 treebanks in unified format) • http://ufal.mff.cuni.cz/hamledt