
PDT: The Tools

Jan Hajič, Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic


Presentation Transcript


  1. PDT: The Tools. Jan Hajič, Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic. Companions Semantic Representation and Dialog Interfacing Workshop - Morphology and Surface Syntax

  2. Tectogrammatical Annotation Tools
     • Manual annotation
        • Speech reconstruction: MEd
        • Morphology (linear structure annotation): LAW
        • Special graphical tool (TrEd)
           • Customizable graphical tree editor
     • Viewing and searching
        • TrEd, Netgraph (linear structure: also Bonito/Manatee)
     • Automatic annotation
        • (ASR, segmentation), morphology, tagging, parsing, deep parsing, co-reference, WSD, …
     • Generation
        • Jan Ptacek’s generation tools (rule-based, so far)

  3. Manual annotation
     • Speech reconstruction
        • MEd
        • z-layer, w-layer, m-layer
        • Audio – annotators can listen
     • Morphology
        • LAW – new version for fast morphological disambiguation
     • Syntax (analytical, tectogrammatical)
        • TrEd

  4. MEd: speech reconstruction viewer / annotation tool
     • m-layer (annotation)
     • w-layer
     • z-layer
     • audio

  5. The Morphological Annotation Tool (LAW)
     • Java-based
     • Dictionary access
     • XML-aware
     • PML: m-layer

  6. TrEd: Manual Annotation Tool
     • Perl/PerlTk based, platform-independent
        • Linux, Windows 95/98/2000, Solaris, ...
     • Perl as the “macro” language
        • “unlimited” online processing capability
     • Flexibility for interactive checking
        • split screen, graphical “diff” function
     • Customization, printing, “plugins”, ...
     • [Automatic processing: btred – no GUI]
     • [Fast search (parallel processing): btred/ntred]

  7. The “TrEd” Tree Editor
     • Graphical tool TrEd
     • Main screen (screenshot callouts):
        • Original sentence: [This year’s flu season is still quiet in Europe.]
        • Editing window customization
        • Run a macro
        • Multiwindow editing/compare

  8. TrEd

  9. Valency Lexicon in TrEd
     • to write sth (about sth)

  10. Searching the treebank
      • TrEd
         • (obviously)
         • Programming possible (Perl)
         • Fast search (parallelization)
      • Netgraph
         • Linguist-user-friendly
         • Easy to write queries
         • Not as flexible
         • Java
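
The contrast between programmable search and ready-made queries can be illustrated with a small sketch. The Python below is not TrEd's or Netgraph's API (TrEd macros are written in Perl); the `Node` class, the function names, and the toy tree are assumptions made purely for illustration. A query is an arbitrary predicate over a node and its tree, and each tree is searched independently, which is what makes parallel search straightforward.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    form: str               # word form
    afun: str               # analytical function label, e.g. "Sb", "Pred"
    parent: Optional[int]   # index of the parent in the same tree; None for the root


Tree = List[Node]
Query = Callable[[Tree, int], bool]


def search_tree(tree: Tree, query: Query) -> List[int]:
    """Return the indices of all nodes in one tree that satisfy the query."""
    return [i for i in range(len(tree)) if query(tree, i)]


def search_treebank(trees: List[Tree], query: Query, workers: int = 4) -> List[Tuple[int, int]]:
    """Search every tree independently and return (tree index, node index) hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_tree = list(pool.map(lambda t: search_tree(t, query), trees))
    return [(t_idx, n_idx) for t_idx, hits in enumerate(per_tree) for n_idx in hits]


def subject_of_predicate(tree: Tree, i: int) -> bool:
    """Hypothetical query: a subject node whose parent is the predicate."""
    node = tree[i]
    return (
        node.afun == "Sb"
        and node.parent is not None
        and tree[node.parent].afun == "Pred"
    )


# Toy data; the analysis is simplified and only illustrative.
TOY_TREEBANK: List[Tree] = [
    [Node("season", "Sb", 1), Node("is", "Pred", None), Node("quiet", "Pnom", 1)],
    [Node("Europe", "Adv", None)],
]

print(search_treebank(TOY_TREEBANK, subject_of_predicate))
```

Threads are used here only to show that the work partitions per tree; for CPU-bound queries in CPython a process pool would be needed for a real speed-up.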

  11. Netgraph
      • Query

  12. Netgraph
      • Search results

  13. Automatic annotation
      • Morphological analysis
      • Tagging
      • Parsing (surface)
      • Tectogrammatical (deep) parsing
         • Tectogrammatical structure
         • Co-reference
         • Grammatemes
      • Generation

  14. Morphological dictionary
      • Czech
         • UFAL-developed
         • C implementation
         • 800k lemmas
      • English
         • Open source
         • Amorph-generated from data
         • From WSJ

  15. Tagging
      • Czech
         • 10+ taggers
         • Best: “MORCE”
            • Averaged perceptron + unsupervised + rules, > 96%
         • Testing on spoken (ASR) input
      • English
         • Off-the-shelf (97%)
         • (…will retrain MORCE on WSJ/PTB)
         • NB: within parsers (mostly)
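
To make the "averaged perceptron" component concrete, here is a minimal sketch of an averaged perceptron tagger in Python. It is not MORCE and omits the unsupervised and rule-based parts mentioned on the slide; the feature set, the class and function names, and the three-tag toy data are all assumptions chosen for brevity. Averaging is done naively by summing the weight vector after every training step; practical implementations usually use lazy updates instead.

```python
from collections import defaultdict


def features(words, i, prev_tag):
    """Hypothetical, deliberately tiny feature set for token i."""
    w = words[i].lower()
    return [f"word={w}", f"suffix={w[-2:]}", f"prev_tag={prev_tag}", "bias"]


class AveragedPerceptron:
    def __init__(self, tags):
        self.tags = sorted(tags)
        self.w = {}                        # current weights, keyed by (feature, tag)
        self.totals = defaultdict(float)   # sum of each weight over all training steps
        self.steps = 0

    def score(self, feats, tag):
        return sum(self.w.get((f, tag), 0.0) for f in feats)

    def predict(self, feats):
        # Ties are broken by tag name so that the result is deterministic.
        return max(self.tags, key=lambda t: (self.score(feats, t), t))

    def update(self, feats, gold, guess):
        if gold != guess:
            for f in feats:
                self.w[(f, gold)] = self.w.get((f, gold), 0.0) + 1.0
                self.w[(f, guess)] = self.w.get((f, guess), 0.0) - 1.0
        # Naive averaging: accumulate the whole weight vector after every step.
        self.steps += 1
        for key, value in self.w.items():
            self.totals[key] += value

    def average(self):
        """Replace the weights by their mean over all training steps."""
        if self.steps:
            self.w = {k: total / self.steps for k, total in self.totals.items()}


def train(sentences, epochs=5):
    model = AveragedPerceptron({t for sent in sentences for _, t in sent})
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev)
                model.update(feats, gold, model.predict(feats))
                prev = gold  # the gold tag is used as history during training
    model.average()
    return model


def tag(model, words):
    prev, out = "<s>", []
    for i in range(len(words)):
        prev = model.predict(features(words, i, prev))
        out.append(prev)
    return out


# Toy training data with illustrative tags (not the PDT tagset).
TOY_DATA = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

model = train(TOY_DATA)
print(tag(model, ["the", "cat", "barks"]))
```

Averaging matters because the final weight vector of a plain perceptron depends heavily on the last few training examples; the mean over all steps is more stable.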

  16. Parsing
      • Czech
         • McDonald et al.
            • MST + MIRA, 85-86% dependency accuracy
         • Labeling (afun)
            • C5.0 or within parser, also ~85% accuracy
      • English
         • Collins / Charniak
         • NB: Phrase-based
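
The two accuracy figures quoted for Czech are per-token measures: the share of tokens with the correct head, and the share with the correct analytical function (afun) label. A minimal Python sketch of how such scores can be computed follows; the function name, the data layout, and the toy parse are assumptions made for illustration, not the evaluation script behind the reported numbers.

```python
from typing import List, Tuple

# One entry per token: (index of the head token, afun label); head 0 marks the root.
Parse = List[Tuple[int, str]]


def dependency_and_label_accuracy(gold: Parse, predicted: Parse) -> Tuple[float, float]:
    """Return (share of correct heads, share of correct afun labels)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty parse")
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_hits = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    return head_hits / n, label_hits / n


# Toy four-token example; the analysis is illustrative only.
GOLD = [(2, "Sb"), (0, "Pred"), (2, "Pnom"), (2, "Adv")]
PREDICTED = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (3, "Atr")]

print(dependency_and_label_accuracy(GOLD, PREDICTED))
```

In the toy example three of four heads and two of four labels are correct, so the function returns 0.75 and 0.5.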

  17. Tectogrammatical parsing
      • Czech
         • TrEd-implemented, 4-step process
         • Starts from analytic layer
      • English
         • Rule-based so far
            • Too little data annotated
            • Annotation currently underway
         • Starts from classical Collins/Charniak WSJ-type output

  18. Tectogrammatical parsing - accuracy
      • Newest results:
         • 4 phases
         • Transformation-based learning
            • fnTBL
            • Largely language independent
         • Coreference: >90%
      • m- and a-layer (manual vs. auto):

        Attribute        manual     auto
        structure        89.3 %     76.4 %
        functor          85.5 %     77.4 %
        val_frame.rf     92.3 %     90.9 %
        t_lemma          93.5 %     90.9 %
        nodetype         94.5 %     92.6 %
        gram/sempos      93.8 %     91.5 %
        a/lex.rf         96.5 %     95.1 %
        a/aux.rf         94.3 %     90.3 %
        is_member        94.3 %     89.5 %
        is_generated     96.6 %     95.2 %
        deepord          68.0 %     66.7 %

  19. Word sense disambiguation
      • For words with valency frames
         • All verbs
         • Some nouns, adjectives
      • Valency frame ~ meaning (sense)
      • Jiri Semecky’s work
         • Accuracy on PDT: 70%+
      • Portable to English
         • No results yet

  20. Generation
      • From TR to text
      • Jan Ptacek’s work (cf. review meeting)
         • Rule-based
         • Czech: completed
            • Integrated with TTS (UWB)
         • English: first version not yet completed
      • Results
         • No metrics yet, subjectively very good

  21. Some (more) pointers
      • http://ufal.mff.cuni.cz/pdt2.0
         • Current version of PDT, all three levels, 1.9/1.5/0.8 Mw
      • http://ufal.mff.cuni.cz/REST/CAC/CAC.html
         • The Czech Academic Corpus, v 1.0
      • http://www.ldc.upenn.edu
         • LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0)
      • http://www.clsp.jhu.edu: Workshop 2002
         • Using TL for MT Generation
