Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Presentation Transcript


  1. Prague Dependency Treebank(s) Workshop at LSA2011, Part I Jan Hajič, Zdeňka Urešová Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

  2. Part I - Text to Syntax • The Prague Dependency Treebank projects • The theory behind it • The Corpora • Morphology • Dictionaries, tools (incl. POS tagger) • Dependency surface syntax • Czech, Arabic, English • Parallel annotated corpus LSA2011 Prague Dependency Treebanks I

  3. The Prague Dependency Treebank(s) • The idea • Apply the “old” Prague theory to real-world texts • Provide enough data for ML experiments • “Old” Prague theory? • Prague structuralism (1930s) • Stratificational approach • Centered on “deep syntax” • Separated from “surface form” • Dependency based (how else?) LSA2011 Prague Dependency Treebanks I

  4. PDT: The Methodology • Manual annotation is PRIMARY • Some help from existing tools possible • “No information loss, no redundancy” • Much formalization, but… • … original form always retrievable • Dictionaries • In theory: “secondary”, side effect of annotation (generalization) • In reality: help consistency • Links: data → dictionary(-ies) • Result: used for Machine Learning • Ergonomics of annotation • Graphical (“linguistic”) presentation & editing LSA2011 Prague Dependency Treebanks I

  5. The Prague Dependency Treebank Project: Czech Treebank • 1995 (Dublin) 1996-2006-... • 1998 PDT v. 0.5 released (JHU workshop) • 400k words manually annotated, unchecked • 2001 PDT 1.0 released (LDC): • 1.3MW annotated, morphology & surface syntax • 2006 PDT 2.0 release • 0.8MW annotated (50k sentences) + PDT 1.0 corrected • the “tectogrammatical layer” • underlying (deep) syntax LSA2011 Prague Dependency Treebanks I

  6. Related Projects (Treebanks) • Prague Czech-English Dependency Treebank • WSJ portion of PTB, translated to Czech (1.2 mil. words) • Annotated for a basic set of attributes • Penn Treebank / WSJ • Pre-converted, manually annotated for basic sets of attributes • Named entity annotation, co-reference (BBN) merged in • Detailed breakdown of NP • Prague Arabic Dependency Treebank • apply the same representation to the annotation of Arabic • surface syntax so far • Both published (in first version) 2004 (by LDC) • PCEDT/PEDT version 2.0 being prepared (2011?) • Preliminary version available for browsing – see the workshop web LSA2011 Prague Dependency Treebanks I

  7. PDT (Czech) Data • 4 sources: • Lidové noviny (daily newspaper, incl. extra sections) • DNES (Mladá fronta Dnes) (daily newspaper) • Vesmír (popular science magazine, monthly) • Českomoravský Profit (economic journal, weekly) • Full articles selected • article ~ DOCUMENT (basic corpus unit) • Time period: 1990-1995 • 1.8 million tokens (~110,000 sentences) LSA2011 Prague Dependency Treebanks I

  8. PDT Annotation Layers (PDT 1.0, 2001; PDT 2.0, 2006) • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon LSA2011 Prague Dependency Treebanks I
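To make the layer stack concrete, here is a minimal Python sketch of one token carrying w-, m-, a- and t-layer attributes. The class and field names are illustrative only; they do not follow the PML schema of the released data, and the lemma/tag values are just an example in the spirit of the slides.

```python
# A minimal sketch of one token carrying attributes at the four PDT layers.
# Class and field names are illustrative, not the PML schema of the released
# treebank; the lemma/tag values are only an example.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WLayer:              # L0: the raw token
    token: str

@dataclass
class MLayer:              # L1: morphology
    lemma: str
    tag: str               # positional tag (13 categories + 2 reserve positions)

@dataclass
class ALayer:              # L2: surface (analytical) syntax
    head: int              # index of the governing node (0 = technical root)
    afun: str              # analytical function: Pred, Sb, Obj, Atr, ...

@dataclass
class TLayer:              # L3: tectogrammatical ("deep") syntax
    functor: str           # e.g. PRED, ACT, PAT
    grammatemes: dict = field(default_factory=dict)
    coref: Optional[int] = None       # coreference link, if any

@dataclass
class AnnotatedToken:
    w: WLayer
    m: MLayer
    a: ALayer
    t: Optional[TLayer] = None        # the t-layer was added in PDT 2.0

# Example: the verb form "peče" ("bakes") as the main predicate of a sentence
# (the tag has the same shape as the "má" example later in the slides).
peče = AnnotatedToken(
    w=WLayer("peče"),
    m=MLayer(lemma="péci", tag="VB-S---3P-AA---"),
    a=ALayer(head=0, afun="Pred"),
    t=TLayer(functor="PRED"),
)
```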

  9. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon LSA2011 Prague Dependency Treebanks I

  10. Morphological Attributes • Ex.: nejnezajímavějším “(to) the most uninteresting” • Tag: 13 categories • Example: AAFP3----3N----, read position by position: Adjective | regular | Feminine | Plural | Dative | no poss. gender | no poss. number | no person | no tense | superlative | negated | no voice | reserve1 | reserve2 | base var. • Lemma: POS-unique identifier: books/verb → book-1, went → go, to/prep. → to-1 LSA2011 Prague Dependency Treebanks I

  11. Morphological Tagset • 13 categories, 4452 plausible tags (combinations): LSA2011 Prague Dependency Treebanks I
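The positional tag can be unpacked mechanically. A small sketch, assuming the standard 15-position layout of the PDT tagset (the 13 categories above plus two reserve positions); the position names below are the conventional ones, not copied from the slide:

```python
# Decode a PDT positional tag into its categories, assuming the standard
# 15-position layout (13 linguistic categories + 2 reserve positions);
# '-' marks a category that does not apply.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]

def decode_tag(tag: str) -> dict:
    """Map each of the 15 tag positions to its one-character value."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character positional tag, got {tag!r}")
    return {name: value for name, value in zip(POSITIONS, tag)}

# Example from the slides: nejnezajímavějším "(to) the most uninteresting"
print(decode_tag("AAFP3----3N----"))
# {'POS': 'A', 'SubPOS': 'A', 'Gender': 'F', 'Number': 'P', 'Case': '3', ...,
#  'Grade': '3', 'Negation': 'N', ..., 'Variant': '-'}
```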

  12. Morphological Analysis • Formally: MA: A⁺ → Pow(L × T), MA(f) = { [l, t] }, where f ∈ A⁺ (the token), l ∈ L (lemma), t ∈ T (tag) • tokens taken in isolation • no attempt to solve e.g. auxiliaries vs. full verbs • Ex.: MA(“má”) = { [mít, VB-S---3P-AA---] (lit. “has”, from “to have”), [můj, PSFS1-S1------1] (lit. “my”), [můj, PSFS5-S1------1], [můj, PSNP1-S1------1], [můj, PSNP4-S1------1], [můj, PSNP5-S1------1] } LSA2011 Prague Dependency Treebanks I
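A minimal sketch of MA as a pure lookup, with the “má” entry taken directly from the slide; a real analyzer is dictionary- and rule-driven rather than a hard-coded table:

```python
# Morphological analysis as pure lookup: each token, taken in isolation,
# maps to a set of (lemma, tag) candidates, mirroring MA: A+ -> Pow(L x T).
# The tiny lexicon holds only the "má" example from the slide.
LEXICON: dict[str, set[tuple[str, str]]] = {
    "má": {
        ("mít", "VB-S---3P-AA---"),   # "has" (lit. "to have")
        ("můj", "PSFS1-S1------1"),   # "my" (several case/gender readings)
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
        ("můj", "PSNP4-S1------1"),
        ("můj", "PSNP5-S1------1"),
    },
}

def analyze(token: str) -> set[tuple[str, str]]:
    """Return all (lemma, tag) readings of a token, in isolation."""
    return LEXICON.get(token, set())

print(analyze("má"))
```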

  13. Morphological Disambiguation • Full morphological disambiguation • more complex than (e.g. English) POS tagging • Several full morphological taggers: • (Pure) HMM • Feature-based (MaxEnt, NB) • used in the PDT distribution • Voted Perceptron (M. Collins, EMNLP’02) • All: ~ 94-96% accuracy (perceptron is best) • rule-based & statistical combination: tiny improvement (Hajič et al., ACL 2001; Spoustová et al., 2007: > 96%) LSA2011 Prague Dependency Treebanks I
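The actual PDT taggers are HMM-, feature- and perceptron-based; the following is only a schematic HMM-style decoder showing how disambiguation chooses one (lemma, tag) reading per token from the analyzer's candidates. The probability functions are placeholders that a real tagger would estimate from the treebank:

```python
# A schematic HMM-style disambiguator: choose one (lemma, tag) reading per
# token from the analyzer's candidates by maximizing transition x emission
# scores with Viterbi search.  p_trans and p_emit are placeholders; the real
# PDT taggers estimate such scores from the annotated data and use richer
# feature sets.
import math

def viterbi(tokens, analyze, p_trans, p_emit):
    """tokens: word forms; analyze(tok) -> iterable of (lemma, tag) readings;
    p_trans(prev_tag, tag), p_emit(token, reading) -> probabilities > 0."""
    # best[i][reading] = (log-score of the best path ending here, back-pointer)
    best = [dict() for _ in tokens]
    for i, tok in enumerate(tokens):
        for cand in analyze(tok):
            tag = cand[1]
            if i == 0:
                best[0][cand] = (math.log(p_trans("<s>", tag) * p_emit(tok, cand)), None)
            else:
                score, back = max(
                    (s + math.log(p_trans(prev[1], tag) * p_emit(tok, cand)), prev)
                    for prev, (s, _) in best[i - 1].items()
                )
                best[i][cand] = (score, back)
    # Read off the best final reading and follow the back-pointers.
    cand = max(best[-1], key=lambda c: best[-1][c][0])
    path = [cand]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```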

  14. The Segmentation Problem: Arabic • Tokenization / segmentation not always trivial • Arabic, German, Chinese, Japanese LSA2011 Prague Dependency Treebanks I

  15. The Segmentation Problem: Solution for Arabic • Find max. no. of segments, concatenate up to max. • 4 (×10) for Arabic • Example: sa-+yu-hbir-u+-hum+0 ↔ F---------VIIA-3MS--S----3MP4----------- • Resulting annotation: F---------VIIA-3MS--S----3MP4----------- | P---------SD----MS---------------------- | P--------------------------------------- | N-------2R------------------------------ | N-------2D------------------------------ | A-----FS2D------------------------------ LSA2011 Prague Dependency Treebanks I

  16. Arabic Tagging Results • Maximum entropy, features ~ categories • Experiments on Penn Arabic Treebanks • POS: From 95.25-97.37% • Full morph. 88.17-89.31% • Segmentation: 98.60-99.37% • Prague Arabic Dependency Treebank • POS: 96.02% • Full morph.: 89.24% • Segmentation: 99.25% LSA2011 Prague Dependency Treebanks I

  17. PDT Annotation Layers • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon LSA2011 Prague Dependency Treebanks I

  18. Layer 2 (a-layer): Analytical Syntax • Dependency (governor → dependent) + Analytical Function • Example: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated. LSA2011 Prague Dependency Treebanks I

  19. Analytical Syntax: Functions • Main (for [main] semantic lexemes): • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom • “Double” dependency: AtrAdv, AtrObj, AtrAtr • Special (function words, punctuation, ...): • Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY • Prepositions/Conjunctions: AuxP, AuxC • Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK • Structural • Ellipsis: ExD; Coordination etc.: Coord, Apos LSA2011 Prague Dependency Treebanks I

  20. Surface Syntax Example • Complete sentence: Sb, Pred, Obj • The-baker bakes rolls. • Pekař peče housky. LSA2011 Prague Dependency Treebanks I
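A minimal sketch of an a-layer dependency node, built for this example sentence; the class is illustrative and is not the PDT data format:

```python
# An a-layer (analytical) dependency node for "Pekař peče housky."
# ("The baker bakes rolls.").  Illustrative only, not the PDT data format.
from dataclasses import dataclass, field

@dataclass
class ANode:
    form: str
    afun: str                         # analytical function: Pred, Sb, Obj, ...
    children: list = field(default_factory=list)

    def attach(self, child: "ANode") -> "ANode":
        self.children.append(child)   # self becomes the governor of child
        return child

# peče (Pred) governs Pekař (Sb) and housky (Obj)
pred = ANode("peče", "Pred")
pred.attach(ANode("Pekař", "Sb"))
pred.attach(ANode("housky", "Obj"))

def show(node: ANode, depth: int = 0) -> None:
    print("  " * depth + f"{node.form} ({node.afun})")
    for child in node.children:
        show(child, depth + 1)

show(pred)
```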

  21. Surface Syntax Example • Incomplete phrases • Peter works well , but Paul badly • Petr pracuje dobře, ale Pavel špatně LSA2011 Prague Dependency Treebanks I

  22. Surface Syntax Example • Variants (equal meaning) • (he) bought shoes for boy • koupil boty pro kluka LSA2011 Prague Dependency Treebanks I

  23. PDT-style Arabic Surface Syntax • Only a few differences • (Sometimes) Separate nodes for individual segments (cf. tagging/segmentation) • Copula treatment (Czech: rare → treated as ellipsis; Arabic: systematic solution needed): Pred • (Added) analytic functions: • AuxM (did-not) • Ante (what) • Work by Faculty of Arts, Charles University • Arabic language students LSA2011 Prague Dependency Treebanks I

  24. Arabic Surface Syntax Example • In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it. LSA2011 Prague Dependency Treebanks I

  25. English Analytic Layer • By conversion from PTB • Extended analytic functions • Head rules • Jason Eisner’s, added more for full conversion • Coordination, traces, etc. • Coordination handling • Same as in Czech/Arabic PDT LSA2011 Prague Dependency Treebanks I

  26. Penn Treebank • University of Pennsylvania, 1993 • Linguistic Data Consortium • Wall Street Journal texts, ca. 50,000 sentences • 1989-1991 • Financial (most), news, arts, sports • 2499 (2312) documents in 25 sections • Annotation • POS (Part-of-speech tags) • Syntactic “bracketing” + bracket (syntactic) labels • (Syntactic) Function tags, traces, co-indexing • + Propbanking LSA2011 Prague Dependency Treebanks I

  27. Penn Treebank Example • Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. • Bracketing (callouts in the figure: POS tag, e.g. NNS = noun, plural; phrase label, e.g. NP = noun phrase; “preterminal”): ( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) )) LSA2011 Prague Dependency Treebanks I
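If NLTK is available, the bracketing can be loaded directly for inspection; this is only a convenience for the reader, not part of the PDT conversion tools (the outer empty PTB bracket is dropped for simplicity):

```python
# Loading the slide's bracketing for inspection (assuming NLTK is installed).
from nltk import Tree

BRACKETING = """
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken)) (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
  (VP (MD will)
    (VP (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
"""

tree = Tree.fromstring(BRACKETING)
print(tree.label())       # 'S'
print(tree.pos()[:4])     # [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD')]
```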

  28. Penn Treebank Example: Sentence Tree • Phrase-based tree representation LSA2011 Prague Dependency Treebanks I

  29. Parallel Czech-English Annotation • English text -> Czech text (human translation) • Czech side (goal): all layers manual annotation • English side (goal): • Morphology and surface syntax: technical conversion • Penn Treebank style -> PDT Analytic layer • Tectogrammatical annotation: manual annotation • (Slightly) different rules needed for English • Alignment • Natural, sentence level only (now) LSA2011 Prague Dependency Treebanks I

  30. Human Translation of WSJ Texts • Hired translators / FCE level • Specific rules for translation • Sentence per sentence only • … to get simple 1:1 alignment • Fluent Czech at the target side • If a choice, prefer “literal” translation • The numbers: • English tokens: 1,173,766 • Translated to Czech: • Revised/PCEDT 1.0: 487,929 • Now finished (all 2312 documents) LSA2011 Prague Dependency Treebanks I

  31. English Annotation POS and Syntax • Automatic conversion from Penn Treebank • PDT morphological layer • From POS tags • PDT analytic layer • From: • Penn Treebank Syntactic Structure • Non-terminal labels • Function tags (non-terminal “suffixes”) • 2-step process • Head determination rules • Conversion to dependency + analytic function LSA2011 Prague Dependency Treebanks I

  32. Head Determination Rules • Exhaustive set of rules • By J. Eisner + M. Čmejrek / J. Cuřín • 4000 rules (non-terminal based) • Ex.: (S (NP-SBJ VP .)) → VP • Additional rules • Coordination, Apposition • Punctuation (end-of-sentence, internal) • Original idea (possibility of conversion) • J. Robinson (1960s) LSA2011 Prague Dependency Treebanks I

  33. Example: Head Determination Rules (J.E.) • Rules: (NP (DT NN)) → NN, (VP (VB NP)) → VB, (VP (MD VP)) → VP, (S (… VP …)) → VP • [Figure: the example tree with lexical heads propagated upward — the, board → board (NP); join (inner VP); will, join → join (outer VP); join (S)] LSA2011 Prague Dependency Treebanks I
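A sketch of how such head rules can be applied bottom-up; the tiny rule table below only covers the rules shown on the slide (the real table has about 4000 non-terminal-based rules), and the last-child fallback is an invented simplification:

```python
# Bottom-up head determination in the spirit of the rules above.
from nltk import Tree

def head_child(node: Tree) -> Tree:
    """Pick the child that supplies the lexical head of this constituent."""
    labels = [child.label() for child in node]
    def first(*wanted):
        for w in wanted:
            if w in labels:
                return node[labels.index(w)]
        return node[-1]                                       # crude fallback
    rules = {
        "NP": lambda: first("NN", "NNS", "NNP", "NP"),        # (NP (DT NN)) -> NN
        "VP": lambda: first("VB", "VBD", "VBZ", "VP", "MD"),  # (VP (MD VP)) -> VP
        "S":  lambda: first("VP"),                            # (S (... VP ...)) -> VP
    }
    bare = node.label().split("-")[0]                         # strip -SBJ, -TMP, ...
    return rules.get(bare, lambda: node[-1])()

def lexical_head(node) -> str:
    """Propagate lexical heads upward; a preterminal's head is its word."""
    if isinstance(node, str):
        return node
    if len(node) == 1 and isinstance(node[0], str):           # preterminal
        return node[0]
    return lexical_head(head_child(node))

t = Tree.fromstring(
    "(S (NP-SBJ (NNP Vinken)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .))")
print(lexical_head(t))    # 'join'
```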

  34. Conversion: Analytic Structure, Functions • Analytic Function assignment (conversion) • Rules based on functional tags: -SBJ → Sb, -PRD → Pnom, -BNF → Obj, -DTV → Obj, -LGS → Obj, -ADV → Adv, -DIR → Adv, -EXT → Adv, -LOC → Adv, -MNR → Adv, -PRP → Adv, -PUT → Adv, -TMP → Adv • Ad-hoc rules (if functional tags missing) • Lemmatization (years → year) LSA2011 Prague Dependency Treebanks I
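The functional-tag mapping can be expressed as a simple table lookup; the dictionary reproduces exactly the slide's mapping, and anything it does not cover is left to the ad-hoc rules mentioned above:

```python
# Assign PDT analytical functions from Penn Treebank function tags, using the
# mapping listed on the slide; labels without a covered function tag fall
# through to the ad-hoc rules.
from typing import Optional

FTAG_TO_AFUN = {
    "SBJ": "Sb",  "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}

def afun_from_label(label: str) -> Optional[str]:
    """'NP-SBJ' -> 'Sb'; 'PP-CLR' -> None (no rule on the slide for -CLR)."""
    for ftag in label.split("-")[1:]:      # function tags are label suffixes
        if ftag in FTAG_TO_AFUN:
            return FTAG_TO_AFUN[ftag]
    return None                            # left to the ad-hoc rules

print(afun_from_label("NP-SBJ"))   # Sb
print(afun_from_label("NP-TMP"))   # Adv
```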

  35. Example: Analytical Structure, Functions • [Figure: Penn Treebank structure (with heads added: join, will, board, the) → PDT-like Analytic Representation] LSA2011 Prague Dependency Treebanks I

  36. Annotation tools • TrEd • Visualization, annotation, processing, search • Perl, Perl-Tk • Customizable • Multiple data formats, native: Prague Markup Language (PML, XML-based) • Demonstration I • Surface-syntax annotation LSA2011 Prague Dependency Treebanks I

  37. Speech reconstruction • Spoken input • ASR does not do capitalization, punctuation • Disfluencies • Repetitions • Corrections • Fillers • Deviations from syntax • ... → ungrammatical / dissimilar to text LSA2011 Prague Dependency Treebanks I

  38. Objective • Create gold-standard data for • (Statistical) training • Testing • Use in machine learning of automatic • speech reconstruction • (eventually) language understanding • Go beyond state-of-the-art • ASR Post-Correction / disfluency removal • (cf. e.g. Fitzgerald, 2008, or López-Cózar & Callejas, 2008) LSA2011 Prague Dependency Treebanks I

  39. What is Speech Reconstruction … apart from disfluency removal? • original transcript → edited transcript, resembling an interview edited for print • Example — original transcript: “we ’re sitting on the step of and I think it ’s my aunt Molly ’s house” • after disfluency removal only: “we ’re sitting on the step of my aunt Molly ’s house” (?? and “I think”) • after speech reconstruction: “We’re sitting on the step of my aunt Molly’s house, …” — capitalized, punctuated, with “I think” accounted for LSA2011 Prague Dependency Treebanks I

  40. Word-for-word transcription • using Transcriber 1.5.1 • audio synchronization • spoken text • non-speech events (UH-HUH, laughter, etc.) • rough segmentation (defined acoustically) • speaker/turn identification LSA2011 Prague Dependency Treebanks I

  41. Annotation rules • sentence segmentation • orthography • capitalization • punctuation • morphosyntax • word order • partial ellipsis restoration • no discourse-irrelevant non-speech events • filler and fragment deletion • but: … meaning preservation ~ written-text standards LSA2011 Prague Dependency Treebanks I

  42. Segment splitting • (transcript) segment ~ non-silence span • (reconstruction) segment = sentence LSA2011 Prague Dependency Treebanks I

  43. Word order, deletions, insertions, ... punctuation, capitalization, ... LSA2011 Prague Dependency Treebanks I

  44. Function word insertion LSA2011 Prague Dependency Treebanks I

  45. Annotation layers • Multilayered standoff annotation LSA2011 Prague Dependency Treebanks I
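To illustrate what “standoff” means here, a small sketch using the example sentence from slide 39: the reconstruction layer does not copy the transcript but points to transcript token ids. The layer and field names are invented for illustration, not the project's actual file format:

```python
# Multilayered standoff annotation, sketched: the edited (reconstruction)
# layer references token ids in the base transcript layer rather than
# duplicating the text.  Names and structure are illustrative only.
transcript = {                       # base layer: word-for-word transcript
    "w1": "we", "w2": "'re", "w3": "sitting", "w4": "on", "w5": "the",
    "w6": "step", "w7": "of", "w8": "and", "w9": "I", "w10": "think",
    "w11": "it", "w12": "'s", "w13": "my", "w14": "aunt", "w15": "Molly",
    "w16": "'s", "w17": "house",
}

reconstruction = [                   # edited layer: stands off from the base
    {"form": "We",      "source": ["w1"]},      # capitalization changed
    {"form": "'re",     "source": ["w2"]},
    {"form": "sitting", "source": ["w3"]},
    {"form": "on",      "source": ["w4"]},
    {"form": "the",     "source": ["w5"]},
    {"form": "step",    "source": ["w6"]},
    {"form": "of",      "source": ["w7"]},
    {"form": "my",      "source": ["w13"]},
    {"form": "aunt",    "source": ["w14"]},
    {"form": "Molly",   "source": ["w15"]},
    {"form": "'s",      "source": ["w16"]},
    {"form": "house",   "source": ["w17"]},
    {"form": ",",       "source": []},          # inserted punctuation
    {"form": "I",       "source": ["w9"]},      # moved (one possible reconstruction)
    {"form": "think",   "source": ["w10"]},
    {"form": ".",       "source": []},          # inserted punctuation
    # w8 "and", w11 "it", w12 "'s" have no counterpart in the edited text
]

for tok in reconstruction:
    print(f"{tok['form']:8s} <- {','.join(tok['source']) or 'inserted'}")
```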

  46. Current status • English dialogues • 151,000 words (14.5 h) • 16,000 words double annotated • Audio / manual transcript / reconstruction • Auto tagging/parsing: syntax / semantics • Annotation manual • Baseline automatic SR systems LSA2011 Prague Dependency Treebanks I

  47. Conclusion • Language resources • Beyond post-ASR corrections: “Speech Reconstruction” • Integrated annotation of • Audio, ASR, manual transcription, edited speech reconstruction • [morphology, syntax, semantics] • The next step: machine learning LSA2011 Prague Dependency Treebanks I

  48. Czech Example • Original transcription: ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání • Recovering punctuation and capitalization: Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k ni k nim jako do zaměstnání • Recovering word forms and word order: Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání • Reference reconstructions: (a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání . (b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání . • English translation (approx.): “But the reason was also that the Škoda people would have been very glad if I had been able to take a job there with them.” LSA2011 Prague Dependency Treebanks I
