PDT: The Tools
Jan Hajič
Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Companions Semantic Representation and Dialog Interfacing Workshop - Morphology and Surface Syntax
Tectogrammatical Annotation Tools
• Manual annotation
  • Speech Reconstruction: MEd
  • Morphology (linear structure annotation): LAW
  • Special graphical tool (TrEd)
    • Customizable graphical tree editor
• Viewing and Searching
  • TrEd, Netgraph (linear structure: also Bonito/Manatee)
• Automatic annotation
  • (ASR, Segmentation), Morphology, Tagging, Parsing, Deep parsing, Co-reference, WSD, …
• Generation
  • Jan Ptacek’s generation tools (rule-based, so far)
Manual annotation
• Speech reconstruction
  • MEd
  • z-layer, w-layer, m-layer
  • Audio – annotators can listen
• Morphology
  • LAW – new version for fast morphological disambiguation
• Syntax (analytical, tectogrammatical)
  • TrEd
MEd: speech reconstruction viewer / annotation tool
• m-layer (annotation)
• w-layer
• z-layer
• audio
The Morphological Annotation Tool (LAW)
• Java-based
• Dictionary access
• XML-aware
• PML: m-layer
TrEd: Manual Annotation Tool
• Perl/PerlTk based, platform-independent
  • Linux, Windows 95/98/2000, Solaris, ...
• Perl as the “macro” language
  • “unlimited” online processing capability
• Flexibility for interactive checking
  • split screen, graphical “diff” function
• Customization, printing, “plugins”, ...
• [Automatic processing: btred – no GUI]
• [Fast search (parallel processing): btred/ntred]
The “TrEd” Tree Editor
• Graphical tool TrEd
• Main screen:
  • Original sentence: [This year’s flu season is still quiet in Europe.]
  • Editing window customization
  • Run a macro
  • Multiwindow editing/compare
TrEd
Valency Lexicon in TrEd
• to write sth (about sth)
Searching the treebank
• TrEd
  • (obviously)
  • Programming possible (Perl)
  • Fast search (parallelization)
• Netgraph
  • Linguist-user-friendly
  • Easy to write queries
  • Not as flexible
  • Java
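The kind of structural query these tools answer can be illustrated with a small sketch. This is not TrEd or Netgraph code: the `Node` class, the `find` helper, and the toy tree are hypothetical, and only the attribute names (`t_lemma`, `functor`) are borrowed from the annotation. It shows a query of the form "a node matching one pattern that has a child matching another".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a bag of attributes plus child nodes."""
    attrs: dict
    children: list = field(default_factory=list)


def matches(node, pattern):
    """True if every attribute in the pattern has the same value on the node."""
    return all(node.attrs.get(key) == value for key, value in pattern.items())


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root, parent_pattern, child_pattern):
    """Return nodes matching parent_pattern that have a direct child matching child_pattern."""
    return [
        node for node in walk(root)
        if matches(node, parent_pattern)
        and any(matches(child, child_pattern) for child in node.children)
    ]


# Toy tree for "to write sth": a predicate with an actor and a patient.
tree = Node({"t_lemma": "write", "functor": "PRED"}, [
    Node({"t_lemma": "author", "functor": "ACT"}),
    Node({"t_lemma": "letter", "functor": "PAT"}),
])

hits = find(tree, {"functor": "PRED"}, {"functor": "PAT"})
```

A real query tool adds much more (regular expressions, arbitrary depth, negation); the sketch only fixes the idea of matching a tree pattern against every node.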
Netgraph • Query
Netgraph • Search results
Automatic annotation
• Morphological analysis
• Tagging
• Parsing (surface)
• Tectogrammatical (deep) parsing
  • Tectogrammatical structure
  • Co-reference
  • Grammatemes
• Generation
Morphological dictionary
• Czech
  • UFAL-developed
  • C implementation
  • 800k lemmas
• English
  • Open source
  • Amorph-generated from data
  • From WSJ
Tagging
• Czech
  • 10+ taggers
  • Best: “MORCE”
    • Averaged perceptron + unsupervised + rules, > 96%
  • Testing on spoken (ASR) input
• English
  • Off-the-shelf (97%)
  • (…will retrain MORCE on WSJ/PTB)
  • NB: within parsers (mostly)
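For readers unfamiliar with the averaged perceptron named on this slide, here is a minimal generic sketch of the technique. It is not MORCE and makes no claim about MORCE's features or training setup: the class, the two-feature `features` function, and the six-word English toy corpus are all assumptions made for illustration. Each token is tagged independently, which a real tagger would not do.

```python
class AveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all training steps."""

    def __init__(self, tags):
        self.tags = sorted(tags)   # fixed order makes tie-breaking deterministic
        self.weights = {}          # (feature, tag) -> current weight
        self._totals = {}          # (feature, tag) -> weight summed over completed steps
        self._stamps = {}          # (feature, tag) -> step at which the weight last changed
        self.steps = 0             # number of completed training steps

    def predict(self, features):
        def score(tag):
            return sum(self.weights.get((f, tag), 0.0) for f in features)
        return max(self.tags, key=score)

    def _bump(self, key, delta):
        old = self.weights.get(key, 0.0)
        # Credit the old value for every step since it was last changed.
        unchanged = self.steps - self._stamps.get(key, 0)
        self._totals[key] = self._totals.get(key, 0.0) + unchanged * old
        self._stamps[key] = self.steps
        self.weights[key] = old + delta

    def update(self, features, gold):
        guess = self.predict(features)
        if guess != gold:
            # Standard perceptron update: reward the gold tag, penalize the guess.
            for f in features:
                self._bump((f, gold), 1.0)
                self._bump((f, guess), -1.0)
        self.steps += 1

    def average(self):
        """Replace each weight by its mean over all completed steps."""
        if self.steps == 0:
            return
        for key, w in list(self.weights.items()):
            total = self._totals.get(key, 0.0) + (self.steps - self._stamps.get(key, 0)) * w
            self.weights[key] = total / self.steps


def features(word):
    # Hypothetical feature set: word identity and last letter.
    return ["word=" + word, "suffix=" + word[-1:]]


# Hypothetical toy corpus, processed in a fixed order (no shuffling).
train = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
         ("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]

tagger = AveragedPerceptron({tag for _, tag in train})
for _ in range(5):
    for word, tag in train:
        tagger.update(features(word), tag)
tagger.average()
```

Averaging keeps late, noisy updates from dominating the final model. On this toy data, `tagger.predict(features("runs"))` returns `"VERB"`, because the only feature with any weight is the `s` suffix shared with the two training verbs.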
Parsing
• Czech
  • McDonald et al.
    • MST + MIRA, 85-86% dependency accuracy
  • Labeling (afun)
    • C 5.0 or within parser, also ~ 85% accuracy
• English
  • Collins / Charniak
  • NB: Phrase-based
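The accuracy figures on this slide are most naturally read as per-token scores. As a sketch under that assumption, the function below computes the share of tokens with the correct head (unlabeled) and the share with both correct head and correct label (labeled). The function name and the five-token example are hypothetical; the labels are ordinary analytical functions (afun) used only for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (unlabeled, labeled) attachment scores for one sentence.

    Each token is a (head_index, label) pair; head index 0 is the artificial root.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted analyses differ in length")
    if not gold:
        raise ValueError("empty sentence")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return head_ok / len(gold), both_ok / len(gold)


# Hypothetical 5-token sentence: one wrong label (token 3), one wrong head (token 4).
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (5, "Atr"), (3, "Atr")]
pred = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (3, "Atr"), (3, "Atr")]

uas, las = attachment_scores(gold, pred)
```

Here four of five heads are correct and three of five tokens have both head and label correct. A corpus-level score would sum the counts over all sentences before dividing, rather than averaging per-sentence scores.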
Tectogrammatical parsing
• Czech:
  • TrEd-implemented, 4-step process
  • Starts from analytical layer
• English
  • Rule-based so far
    • Too little data annotated
    • Annotation underway currently
  • Starts from classical Collins/Charniak WSJ-type output
Tectogrammatical parsing - accuracy
• Newest results:
  • 4 phases
  • Transformation-based learning
    • FnTBL
  • Largely language-independent
• Coreference: >90%
Accuracy by attribute (columns: m- and a-layer manual vs. auto):
Attribute       manual    auto
structure       89.3 %    76.4 %
functor         85.5 %    77.4 %
val_frame.rf    92.3 %    90.9 %
t_lemma         93.5 %    90.9 %
nodetype        94.5 %    92.6 %
gram/sempos     93.8 %    91.5 %
a/lex.rf        96.5 %    95.1 %
a/aux.rf        94.3 %    90.3 %
is_member       94.3 %    89.5 %
is_generated    96.6 %    95.2 %
deepord         68.0 %    66.7 %
Word sense disambiguation
• For words with valency frames
  • All verbs
  • Some nouns, adjectives
• Valency frame ~ meaning (sense)
• Jiri Semecky’s work
  • Accuracy on PDT: 70%+
• Portable to English
  • No results yet
Generation
• From TR to text
• Jan Ptacek’s work (cf. review meeting)
  • Rule-based
  • Czech: completed
    • Integrated with TTS (UWB)
  • English: first version not yet completed
• Results
  • No metrics yet, subjectively very good
Some (more) pointers
• http://ufal.mff.cuni.cz/pdt2.0
  • Current version of PDT, all three levels, 1.9/1.5/0.8 Mw
• http://ufal.mff.cuni.cz/REST/CAC/CAC.html
  • The Czech Academic Corpus, v 1.0
• http://www.ldc.upenn.edu
  • LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0)
• http://www.clsp.jhu.edu: Workshop 2002
  • Using TL for MT Generation