Prague Dependency Treebank(s) Workshop at LSA2011, Part I • Jan Hajič, Zdeňka Urešová • Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Part I - Text to Syntax • The Prague Dependency Treebank projects • The theory behind it • The Corpora • Morphology • Dictionaries, tools (incl. POS tagger) • Dependency surface syntax • Czech, Arabic, English • Parallel annotated corpus LSA2011 Prague Dependency Treebanks I
The Prague Dependency Treebank(s) • The idea • Apply the “old” Prague theory to real-world texts • Provide enough data for ML experiments • “Old” Prague theory? • Prague structuralism (1930s) • Stratificational approach • Centered on “deep syntax” • Separated from “surface form” • Dependency based (how else?)
PDT: The Methodology • Manual annotation is PRIMARY • Some help from existing tools possible • “No information loss, no redundancy” • Much formalization, but… • … original form always retrievable • Dictionaries • In theory: “secondary”, side effect of annotation (generalization) • In reality: help consistency • Links: data → dictionary(-ies) • Result: used for Machine Learning • Ergonomics of annotation • Graphical (“linguistic”) presentation & editing
The Prague Dependency Treebank Project: Czech Treebank • 1995 (Dublin) 1996-2006-... • 1998 PDT v. 0.5 released (JHU workshop) • 400k words manually annotated, unchecked • 2001 PDT 1.0 released (LDC): • 1.3MW annotated, morphology & surface syntax • 2006 PDT 2.0 release • 0.8MW annotated (50k sentences) + PDT 1.0 corrected • the “tectogrammatical layer” • underlying (deep) syntax
Related Projects (Treebanks) • Prague Czech-English Dependency Treebank • WSJ portion of PTB, translated to Czech (1.2 mil. words) • Annotated for a basic set of attributes • Penn Treebank / WSJ • Pre-converted, manually annotated for a basic set of attributes • Named entity annotation, co-reference (BBN) merged in • Detailed breakdown of NP • Prague Arabic Dependency Treebank • Apply the same representation to annotation of Arabic • Surface syntax so far • Both published (in first version) in 2004 (by LDC) • PCEDT/PEDT version 2.0 being prepared (2011?) • Preliminary version available for browsing: see the workshop website
PDT (Czech) Data • 4 sources: • Lidové noviny (daily newspaper, incl. extra sections) • DNES (Mladá fronta Dnes) (daily newspaper) • Vesmír (popular science magazine, monthly) • Českomoravský Profit (business journal, weekly) • Full articles selected • article ~ DOCUMENT (basic corpus unit) • Time period: 1990-1995 • 1.8 million tokens (~110,000 sentences)
PDT Annotation Layers (PDT 1.0, 2001: L0-L2; PDT 2.0, 2006: L0-L3) • L0 (w) Words (tokens) • automatic segmentation and markup only • L1 (m) Morphology • Tag (full morphology, 13 categories), lemma • L2 (a) Analytical layer (surface syntax) • Dependency, analytical dependency function • L3 (t) Tectogrammatical layer (“deep” syntax) • Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
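One way to picture the layers is as records that each point one level down, so the original form stays retrievable from any higher layer. A minimal Python sketch follows; the class names, field names, lemmas and noun tags are illustrative assumptions, not the treebank's actual schema or data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WNode:                 # L0 (w): token as segmented from the text
    id: str
    form: str


@dataclass
class MNode:                 # L1 (m): lemma and positional tag for one token
    id: str
    w_ref: str               # id of the w-node this analysis belongs to
    lemma: str
    tag: str


@dataclass
class ANode:                 # L2 (a): dependency and analytical function
    id: str
    m_ref: str               # id of the m-node this node is built on
    afun: str
    parent: Optional[str]    # id of the governing a-node, None for the top node


# Hypothetical data for "Pekař peče housky" (lemmas and tags are assumed).
W: Dict[str, WNode] = {n.id: n for n in [
    WNode("w1", "Pekař"), WNode("w2", "peče"), WNode("w3", "housky")]}
M: Dict[str, MNode] = {n.id: n for n in [
    MNode("m1", "w1", "pekař", "NNMS1-----A----"),
    MNode("m2", "w2", "péci", "VB-S---3P-AA---"),
    MNode("m3", "w3", "houska", "NNFP4-----A----")]}
A: Dict[str, ANode] = {n.id: n for n in [
    ANode("a1", "m1", "Sb", "a2"),
    ANode("a2", "m2", "Pred", None),
    ANode("a3", "m3", "Obj", "a2")]}


def surface_form(a_id: str) -> str:
    """Follow the links a -> m -> w to recover the original word form."""
    return W[M[A[a_id].m_ref].w_ref].form


print(surface_form("a3"))
```

Keeping each layer as a separate table with references mirrors the "no information loss, no redundancy" principle: a form is stored once, on the lowest layer.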
Morphological Attributes • Ex.: nejnezajímavějším “(to) the most uninteresting” • Tag: 13 categories • Example: AAFP3----3N---- • Positions 1-5: Adjective, Regular, Feminine, Plural, Dative • Positions 6-10: no poss. gender, no poss. number, no person, no tense, superlative • Positions 11-15: negated, no voice, reserve1, reserve2, base variant • Lemma: POS-unique identifier • books/verb -> book-1, went -> go, to/prep. -> to-1
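The positional tag can be read off mechanically, one character per position. The sketch below decodes the example tag into its non-empty categories; the position names in `POSITIONS` are labels chosen here for illustration, following the order given on the slide.

```python
# One name per tag position; "-" in the tag means "not applicable".
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag: str) -> dict:
    """Map a 15-character positional tag to {category: value}, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(POSITIONS, tag) if ch != "-"}


decoded = decode_tag("AAFP3----3N----")
for name, value in decoded.items():
    print(f"{name:10s} {value}")
```

For the example, only seven positions carry a value: adjective, regular, feminine, plural, case 3 (dative), grade 3 (superlative) and negated.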
Morphological Tagset • 13 categories, 4452 plausible tags (combinations)
Morphological Analysis • Formally: $\mathrm{MA}: A^{+} \to \mathrm{Pow}(L \times T)$ • $\mathrm{MA}(f) = \{ [l, t] \}$ • $f \in A^{+}$ (the token), • $l \in L$ (lemma), • $t \in T$ (tag) • tokens taken in isolation • no attempt to solve e.g. auxiliaries vs. full verbs • Ex.: MA(“má”), lit. “has” or “my” = { [mít, VB-S---3P-AA---] (mít: lit. “to have”), [můj, PSFS1-S1------1] (můj: lit. “my”), [můj, PSFS5-S1------1], [můj, PSNP1-S1------1], [můj, PSNP4-S1------1], [můj, PSNP5-S1------1] }
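The definition above says that MA maps a token, taken in isolation, to a set of (lemma, tag) pairs. A minimal sketch with a one-entry toy lexicon holding exactly the analyses from the example; the `LEXICON` table and function names are assumptions made here, not the actual analyzer.

```python
from typing import FrozenSet, Tuple

Analysis = Tuple[str, str]   # (lemma, tag)

# Toy lexicon containing only the "má" example from the slide.
LEXICON = {
    "má": frozenset({
        ("mít", "VB-S---3P-AA---"),
        ("můj", "PSFS1-S1------1"),
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
        ("můj", "PSNP4-S1------1"),
        ("můj", "PSNP5-S1------1"),
    }),
}


def morphological_analysis(form: str) -> FrozenSet[Analysis]:
    """MA(f): all (lemma, tag) readings of a token, ignoring its context."""
    return LEXICON.get(form, frozenset())


def lemmas(form: str) -> set:
    """Distinct lemmas among the readings of a token."""
    return {lemma for lemma, _ in morphological_analysis(form)}


print(len(morphological_analysis("má")), sorted(lemmas("má")))
```

Choosing one pair from this set for a token in context is the job of the disambiguation step, not of MA itself.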
Morphological Disambiguation • Full morphological disambiguation • more complex than (e.g. English) POS tagging • Several full morphological taggers: • (Pure) HMM • Feature-based (MaxEnt, NB) • used in the PDT distribution • Voted Perceptron (M. Collins, EMNLP’02) • All: ~ 94-96% accuracy (perceptron is best) • Rule-based and statistical combination: tiny improvement (Hajič et al., ACL 2001; Spoustová et al., 2007: > 96%)
The Segmentation Problem: Arabic • Tokenization / segmentation not always trivial • Arabic, German, Chinese, Japanese
The Segmentation Problem: Solution for Arabic • Find the max. no. of segments per token, concatenate segment tags up to that max. • 4 segments (x 10 tag positions each) for Arabic • Token tag: F---------VIIA-3MS--S----3MP4----------- • Segments: sa-+yu-hbir-u+-hum+0 • Resulting annotation: • F---------VIIA-3MS--S----3MP4----------- • P---------SD----MS---------------------- • P--------------------------------------- • N-------2R------------------------------ • N-------2D------------------------------ • A-----FS2D------------------------------
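The concatenation scheme can be sketched as padding a token's segment tags to a fixed number of slots. The code assumes the reading that "4 (x10)" means four segments of ten tag positions each; the function names are illustrative, and the per-segment tags are cut from the 40-character example under that assumption.

```python
SEGMENT_WIDTH = 10   # tag positions per segment (assumed reading of "x10")
MAX_SEGMENTS = 4     # maximum number of segments per token
EMPTY_SEGMENT = "-" * SEGMENT_WIDTH


def join_segment_tags(segment_tags):
    """Concatenate per-segment tags and pad with empty segments."""
    if len(segment_tags) > MAX_SEGMENTS:
        raise ValueError("too many segments")
    for tag in segment_tags:
        if len(tag) != SEGMENT_WIDTH:
            raise ValueError(f"bad segment tag length: {tag!r}")
    padding = [EMPTY_SEGMENT] * (MAX_SEGMENTS - len(segment_tags))
    return "".join(list(segment_tags) + padding)


def split_token_tag(token_tag):
    """Inverse operation: cut a full token tag back into segment tags."""
    if len(token_tag) != SEGMENT_WIDTH * MAX_SEGMENTS:
        raise ValueError("unexpected token tag length")
    return [token_tag[i:i + SEGMENT_WIDTH]
            for i in range(0, len(token_tag), SEGMENT_WIDTH)]


# sa- + yu-hbir-u + -hum + (empty fourth slot)
token_tag = join_segment_tags(["F---------", "VIIA-3MS--", "S----3MP4-"])
print(token_tag)
print(split_token_tag(token_tag))
```

With a fixed-width tag per token, segmentation and tagging reduce to one classification decision per token, which is what makes a single tagger applicable.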
Arabic Tagging Results • Maximum entropy, features ~ categories • Experiments on Penn Arabic Treebanks • POS: 95.25-97.37% • Full morph.: 88.17-89.31% • Segmentation: 98.60-99.37% • Prague Arabic Dependency Treebank • POS: 96.02% • Full morph.: 89.24% • Segmentation: 99.25%
Layer 2 (a-layer): Analytical Syntax • Dependency (governor, dependent) + Analytical Function • Example sentence: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.
Analytical Syntax: Functions • Main (for [main] semantic lexemes): • Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom • “Double” dependency: AtrAdv, AtrObj, AtrAtr • Special (function words, punctuation,...): • Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY • Prepositions/Conjunctions: AuxP, AuxC • Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK • Structural • Ellipsis: ExD; Coordination etc.: Coord, Apos
Surface Syntax Example • Complete sentence: Sb, Pred, Obj • The-baker bakes rolls. • Pekař peče housky.
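The example above can be written down as a small dependency structure: each word carries its analytical function and the index of its governor. This is a sketch only; the `Node` type is hypothetical, and the punctuation and any technical root node are left out for brevity.

```python
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    form: str
    afun: str             # analytical function (Sb, Pred, Obj, ...)
    head: Optional[int]   # index of the governor in the sentence, None at the top


# "Pekař peče housky" = "The-baker bakes rolls"
sentence: List[Node] = [
    Node("Pekař", "Sb", 1),
    Node("peče", "Pred", None),
    Node("housky", "Obj", 1),
]


def dependents(nodes: List[Node], head_index: int) -> List[Tuple[str, str]]:
    """Return (form, afun) for every node governed by nodes[head_index]."""
    return [(n.form, n.afun) for n in nodes if n.head == head_index]


print(dependents(sentence, 1))
```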
Surface Syntax Example • Incomplete phrases • Peter works well, but Paul badly • Petr pracuje dobře, ale Pavel špatně
Surface Syntax Example • Variants (equal meaning) • (he) bought shoes for boy • koupil boty pro kluka
PDT-style Arabic Surface Syntax • Only a few differences • (Sometimes) Separate nodes for individual segments (cf. tagging/segmentation) • Copula treatment (Czech: rare, treated as ellipsis; Arabic: systematic solution needed): Pred • (Added) analytic functions: • AuxM (did-not) • Ante (what) • Work by Faculty of Arts, Charles University • Arabic language students
Arabic Surface Syntax Example • In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
English Analytic Layer • By conversion from PTB • Extended analytic functions • Head rules • Jason Eisner’s, added more for full conversion • Coordination, traces, etc. • Coordination handling • Same as in Czech/Arabic PDT
Penn Treebank • University of Pennsylvania, 1993 • Linguistic Data Consortium • Wall Street Journal texts, ca. 50,000 sentences • 1989-1991 • Financial (most), news, arts, sports • 2499 (2312) documents in 25 sections • Annotation • POS (Part-of-speech tags) • Syntactic “bracketing” + bracket (syntactic) labels • (Syntactic) Function tags, traces, co-indexing • + Propbanking
Penn Treebank Example • Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
• Callouts: POS tag on a “preterminal” node, e.g. NNS (noun, plural) • Phrase label, e.g. NP (Noun Phrase)
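The bracketing above is a plain parenthesized format, so it can be read with a few lines of code. A minimal sketch, not a replacement for a full Penn Treebank reader: it parses one bracketed sentence into nested (label, children) pairs and lists the preterminals as (word, POS tag) pairs. The function names are chosen here for illustration.

```python
import re

PTB = """( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
  (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
  (VP (MD will) (VP (VB join) (NP (DT the) (NN board) )
    (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
    (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"""


def parse_ptb(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                          # consume "("
        label = ""                        # the outermost bracket has no label
        if tokens[pos] != "(":
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word under a preterminal
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return read()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


words = tagged_words(parse_ptb(PTB))
print(words[:5])
```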
Penn Treebank Example: Sentence Tree • Phrase-based tree representation
Parallel Czech-English Annotation • English text -> Czech text (human translation) • Czech side (goal): all layers manual annotation • English side (goal): • Morphology and surface syntax: technical conversion • Penn Treebank style -> PDT Analytic layer • Tectogrammatical annotation: manual annotation • (Slightly) different rules needed for English • Alignment • Natural, sentence level only (now)
Human Translation of WSJ Texts • Hired translators / FCE level • Specific rules for translation • Sentence per sentence only • …to get simple 1:1 alignment • Fluent Czech at the target side • If a choice, prefer “literal” translation • The numbers: • English tokens: 1,173,766 • Translated to Czech: • Revised/PCEDT 1.0: 487,929 • Now finished (all 2312 documents)
English Annotation: POS and Syntax • Automatic conversion from Penn Treebank • PDT morphological layer • From POS tags • PDT analytic layer • From: • Penn Treebank Syntactic Structure • Non-terminal labels • Function tags (non-terminal “suffixes”) • 2-step process • Head determination rules • Conversion to dependency + analytic function
Head Determination Rules • Exhaustive set of rules • By J. Eisner + M. Čmejrek / J. Cuřín • 4000 rules (non-terminal based) • Ex.: (S (NP-SBJ VP .)) → VP • Additional rules • Coordination, Apposition • Punctuation (end-of-sentence, internal) • Original idea (possibility of conversion) • J. Robinson (1960s)
Example: Head Determination Rules (J.E.) • Rules: • (NP (DT NN)) → NN • (VP (VB NP)) → VB • (VP (MD VP)) → VP • (S (… VP …)) → VP • [Figure: phrase-structure tree with each node labeled by its lexical head: (join), (will), (board), (the)]
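The four example rules are enough to convert the fragment "will join the board" into dependencies. The sketch below is a simplification made for illustration: it encodes the rules as a per-label priority list of head children (`HEAD_CHILD`), which is an assumption and not the form of the actual 4000-rule set, and it identifies words by their form, which only works when no word repeats.

```python
# For each phrase label: child labels that may be the head, in priority order.
HEAD_CHILD = {
    "S": ["VP"],          # (S (... VP ...)) -> VP
    "VP": ["VP", "VB"],   # (VP (MD VP)) -> VP, (VP (VB NP)) -> VB
    "NP": ["NN"],         # (NP (DT NN)) -> NN
}

# Preterminals are (tag, word); phrases are (label, [children]).
TREE = ("S", [("VP", [("MD", "will"),
                      ("VP", [("VB", "join"),
                              ("NP", [("DT", "the"), ("NN", "board")])])])])


def lexical_head(tree, deps):
    """Return the head word of `tree`, appending (dependent, governor) pairs."""
    label, content = tree
    if isinstance(content, str):            # preterminal: the word is its own head
        return content
    child_labels = [child[0] for child in content]
    child_heads = [lexical_head(child, deps) for child in content]
    for wanted in HEAD_CHILD[label]:
        if wanted in child_labels:
            head_index = child_labels.index(wanted)
            break
    else:
        raise ValueError(f"no head rule matches {label} -> {child_labels}")
    head = child_heads[head_index]
    for i, word in enumerate(child_heads):
        if i != head_index:
            deps.append((word, head))       # non-head children depend on the head
    return head


dependencies = []
root = lexical_head(TREE, dependencies)
print(root, dependencies)
```

Note that under the (VP (MD VP)) → VP rule the modal "will" ends up as a dependent of "join", as in the example.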
Conversion: Analytic Structure, Functions • Analytic Function assignment (conversion) • Rules based on functional tags: • -SBJ → Sb • -PRD → Pnom • -BNF → Obj • -DTV → Obj • -LGS → Obj • -ADV → Adv • -DIR → Adv • -EXT → Adv • -LOC → Adv • -MNR → Adv • -PRP → Adv • -PUT → Adv • -TMP → Adv • Ad-hoc rules (if functional tags missing) • Lemmatization (years → year)
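The tag-based part of the function assignment is a table lookup on the suffixes of a non-terminal label. A minimal sketch with the mapping listed above; the function name is hypothetical, and returning `None` stands in for the ad-hoc rules, which are not specified here.

```python
# Penn Treebank function tag -> PDT analytical function (from the list above).
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_from_label(label: str):
    """Return the analytical function for a label such as 'NP-SBJ', else None.

    None means no listed function tag was found, i.e. an ad-hoc rule
    (not modeled in this sketch) would have to decide.
    """
    for suffix in label.split("-")[1:]:     # skip the phrase category itself
        if suffix in FUNCTION_TAG_TO_AFUN:
            return FUNCTION_TAG_TO_AFUN[suffix]
    return None


for label in ["NP-SBJ", "NP-TMP", "PP-CLR", "NP"]:
    print(label, afun_from_label(label))
```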
Example: Analytical Structure, Functions • [Figure: Penn Treebank structure with heads added, (join), (will), (board), (the) → PDT-like Analytic Representation]
Annotation tools • TrEd • Visualization, annotation, processing, search • Perl, Perl-Tk • Customizable • Multiple data formats, native: Prague ML (XML) • Demonstration I • Surface-syntax annotation
Speech reconstruction • Spoken input • ASR does not do capitalization, punctuation • Disfluencies • Repetitions • Corrections • Fillers • Deviations from syntax • ... ungrammatical / dissimilar to text
Objective • Create gold-standard data for • (Statistical) training • Testing • Use in machine learning of automatic • speech reconstruction • (eventually) language understanding • Go beyond state-of-the-art • ASR Post-Correction / disfluency removal • (cf. e.g. Fitzgerald, 2008, or López-Cózar & Callejas, 2008)
What is Speech Reconstruction ... apart from disfluency removal? • original transcript → edited transcript • resembling an interview editing for print • Original transcript: we ‘re sitting on the step of and I think it ‘s my aunt Molly ‘s house • Disfluency removal: we ‘re sitting on the step of my aunt Molly ‘s house (“I think” ??) • Speech reconstruction: We’re sitting on the step of my aunt Molly’s house, I think.
Word-for-word transcription • using Transcriber 1.5.1 • audio synchronization • spoken text • non-speech events (UH-HUH, laughter, etc.) • rough segmentation (defined acoustically) • speaker/turn identification
Annotation rules • sentence segmentation • orthography • capitalization • punctuation • morphosyntax • word order • partial ellipsis restoration • no discourse-irrelevant non-speech events • filler and fragment deletion • but: ...meaning preservation ~ written-text standards
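To see why these rules call for manual gold-standard data, it helps to look at how little a purely mechanical pass can do. The sketch below is a deliberately naive baseline invented here for illustration, not one of the project's systems: it drops a small assumed filler list, collapses immediate word repetitions, capitalizes and adds a final period. Word order, ellipsis restoration and meaning preservation are out of its reach.

```python
# Assumed filler inventory; a real one would come from the annotation manual.
FILLERS = {"uh", "um", "uh-huh"}


def naive_reconstruct(transcript: str) -> str:
    """Filler deletion + repetition collapsing + capitalization + final period."""
    kept = []
    for word in transcript.split():
        if word.lower() in FILLERS:
            continue                                  # filler deletion
        if kept and kept[-1].lower() == word.lower():
            continue                                  # immediate repetition
        kept.append(word)
    if not kept:
        return ""
    sentence = " ".join(kept)
    return sentence[0].upper() + sentence[1:] + "."


print(naive_reconstruct("uh we we are sitting on the step"))
```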
Segment splitting • (reconstruction) segment = sentence • (transcript) segment ~ non-silence span
Word order, deletions, insertions, ... punctuation, capitalization, ...
Function word insertion
Annotation layers • Multilayered standoff annotation
Current status • English dialogues • 151,000 words (14.5 h) • 16,000 words double annotated • Audio / manual transcript / reconstruction • Auto tagging/parsing: syntax / semantics • Annotation manual • Baseline automatic SR systems
Conclusion • Language resources • Beyond post-ASR corrections: “Speech Reconstruction” • Integrated annotation of • Audio, ASR, manual transcription, edited speech reconstruction • [morphology, syntax, semantics] • The next step: machine learning
Czech Example
• Original transcription: ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání
• Recovering punctuation and capitalization: Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k ni k nim jako do zaměstnání
• Translation (into standard Czech): Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání
• Reference reconstructions:
(a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání .
(b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání .
• English gloss of (a): But the reason was also that the “Škoda people” would have been awfully glad if I could have taken a job there with them.