Prague Arabic Dependency Treebank
Introduction & Related Projects
MALACH Workshop in Prague, August 28, 2003
Otakar Smrž et al.
PADT Project at a Glance
• Dependency treebank of Modern Standard Arabic
• Morphology 58,148 tokens
• Analytical syntax 41,288 tokens
• Tectogrammatical description in preparation
• Experience of the Prague Dependency Treebank
• Guidelines and annotations by Charles University
• Since 2001 ~ five annotators ~ three researchers
• Cooperation with the Linguistic Data Consortium
• Source corpora, morphological analyzer, workshops
Prague Arabic Dependency Treebank: Introduction & Related Projects
Presentation Outline
• Introductory issues in Arabic
• Morphology and the writing system
• Elementary syntactic constructs
• LDC Arabic Treebank
• Reference to ConDep conversion
• Prague Arabic Dependency Treebank
• Progress in the project, applications
• Related projects and perspectives
• Exchange of tools and ideas
• Workshops and cooperation
Arabic Language and Script
• Semitic language, inner flexion and concatenation, consonantal roots, weak derivation patterns
• Phonemic script, non-vocalized script, word tying, other omissions
Example: readings of the unvocalized string فهم الافراد
• fahima al-'afra~du: the members understood
• fuhima al-'afra~du: the members were understood
• fahima al-'afra~da: he understood the members
• fahmu al-'ifra~di: understanding of the isolation
• fa-hum al-'afra~du: and they are the individuals
Morphology Issues
• Arabic strings are extremely ambiguous
• Short vowels, consonantal geminations, glottal-stop marks etc. normally omitted in the script
• Strings need not correspond to single words
• Morphonological changes increase the homonymy
• Tokenization of input surface strings
• Necessary pre-requisite to analytical annotation
• Requires morphological disambiguation
• Lexicon update, foreign names and terms
• Use those analyzers which are flexible in this respect
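The ambiguity described above can be made concrete with a minimal sketch. It assumes the readings of the first word from the previous slide are re-spelled in Buckwalter-style ASCII (an assumption; the slides use a different romanization), and it simply drops the marks that ordinary script leaves out. The glosses are simplified for illustration.

```python
# Buckwalter-style ASCII symbols for short vowels, nunation, gemination and sukun.
# Assumption: these are the marks "normally omitted in the script".
OPTIONAL_MARKS = set("aiuoFNK~")

def unvocalize(word: str) -> str:
    """Drop the marks that ordinary Arabic script normally omits."""
    return "".join(ch for ch in word if ch not in OPTIONAL_MARKS)

# Vocalized readings (Buckwalter-style spelling assumed) with simplified glosses.
readings = {
    "fahima": "he understood",
    "fuhima": "he was understood",
    "fahomu": "understanding (noun)",
    "fahum": "and they (fa- + hum, two tokens in one string)",
}

# All four readings collapse into one surface string.
surface_forms = {unvocalize(reading) for reading in readings}
print(surface_forms)
```

The last reading also shows why tokenization depends on disambiguation: only one of the four analyses splits the string into two tokens.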
Elementary Syntax Issues
• Mostly VSO in verbal sentences, but …
• … not so in clauses with non-verbal predication
• … neither if topicalizers are present
• Non-verbal predication of several types
• Verbal nature of some nominal formations
• Grammatical co-reference, accusative of the inner object
• Complex referencing, rich expressions
Dependency Formalism
Example tree nodes, given as form [analytical function] gloss:
• da~ma [Pred] lasted
• iqtira~Hu [Sb] proposal
• sa~Eatayni [Adv] two-hours[acc.]
• -hu [Atr] his
• al-Eamali~yata [Obj] the-operation[acc.]
• Eala~ [AuxP] on
• kabi~run [Pnom] a-big
• zumala~’i [Obj] colleagues
• al-baytu [Sb] the-house
• -hi [Atr] his
• la- [PredP] for
• -hu [Obj] him
• baytun [Sb] a-house[nom.]
• a~mili~na [???] hoping[acc.]
• qubu~la [Obj] accepting[acc.]
• -kum [Atr] your
• daEwata [Obj] invitation[acc.]
• -na~ [Atr] our
Constituency vs. Dependency
Constituency (Linguistic Data Consortium, University of Pennsylvania)
• Non-terminal nodes + Text tokens
• Constituent labeling on non-terminals
• Slots and traces
Dependency (CCL & IFAL & ICL, Charles University in Prague)
• Sentence root node + Text tokens
• Analytical function for every tree node
• Government and roles
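As a rough illustration of the dependency side of this comparison, the sketch below stores a tree as a technical root plus one node per text token, each carrying an analytical function. The `Node` class, the `Root` label, and the attachments are assumptions made for the example; only the token forms and the labels Pred, Sb and Adv come from the Dependency Formalism slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

@dataclass
class Node:
    form: str   # text token, or a placeholder for the technical root
    afun: str   # analytical function of this node
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach `child` as a dependent and return it for chaining."""
        self.children.append(child)
        return child

def edges(node: Node, parent: Optional[Node] = None) -> Iterator[Tuple[str, str, str]]:
    """Yield (dependent, afun, governor) triples in depth-first order."""
    if parent is not None:
        yield (node.form, node.afun, parent.form)
    for child in node.children:
        yield from edges(child, node)

# Hypothetical tree: attachments are assumed, not taken from the treebank.
root = Node("<root>", "Root")            # "Root" is a placeholder label
pred = root.add(Node("da~ma", "Pred"))
pred.add(Node("iqtira~Hu", "Sb"))
pred.add(Node("sa~Eatayni", "Adv"))

dependency_edges = list(edges(root))
for dependent, afun, governor in dependency_edges:
    print(f"{dependent} --{afun}--> {governor}")
```

Unlike a constituency tree, there are no non-terminal nodes here: every node except the root is a token, and the label sits on the node rather than on a phrase.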
Model Arabic Phrase I
• Trace of the antecedent subject
• Compound function of the head of the clause – outer and inner perspectives
• Free word-order compliant
Model Arabic Phrase II
• Sister-like co-ordination
• Conjunction of co-ordination
• Status constructus
LDC Arabic Treebanking
• Arabic Treebank: Part 1, version 2.0 (syntax)
• 160,275 words, 4,113 trees
• Arabic Treebank: Part 2, version 1.0 (morphology)
• 144,199 words, 2,591 paragraphs
• Arabic Treebank: Part 1, Arabic-English Parallel
• 10K-word parallel translation
• Arabic Gigaword
• Agence France Presse, Al Hayat, Al Nahar, Xinhua
• 391,619,000 words, 1,256,719 documents
PADT Annotation Progress
• AFP Data Exchange Experiment
• Dependency annotation of LDC’s ~10k words
• 12,936 nodes, 374 trees (34.6 nodes per tree)
• Additional Xerox morphological annotation
• UMMAH Corpus Annotation
• Morphology with the LDC tools, ~50k words
• 45,212 nodes, 1,039 trees (43.5 nodes per tree)
• Dependency annotation, ~30k words ready
• 28,352 nodes, 646 trees (43.9 nodes per tree)
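The nodes-per-tree averages quoted above follow directly from the node and tree counts. A short check, using only the figures on this slide (the dictionary keys are descriptive names chosen for the example):

```python
# (nodes, trees) pairs as reported on the slide.
counts = {
    "AFP dependency annotation": (12_936, 374),
    "UMMAH morphology": (45_212, 1_039),
    "UMMAH dependency annotation": (28_352, 646),
}

# Average tree size, rounded to one decimal place as on the slide.
averages = {
    name: round(nodes / trees, 1)
    for name, (nodes, trees) in counts.items()
}

for name, average in averages.items():
    print(f"{name}: {average} nodes per tree")
```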
Algorithm Progress
• Constituency-Dependency Transformation
• Based on the AFP Exchange Experiment
• EACL ’03 Research Note
• Arabic Dependency Parser & Analytical Function Assignment
• Incorporated into the annotation process
• Machine-learning methods involved
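One common way to turn a constituency tree into a dependency tree is head percolation: pick a head child in every phrase and attach the heads of the other children to it. The sketch below illustrates that general technique only; it is not the procedure from the EACL ’03 research note. The head rules, phrase labels and the toy bracketing are all assumptions.

```python
# Hypothetical head rules: for each phrase label, child labels to try as head, in order.
HEAD_RULES = {
    "S": ["VP", "NP"],
    "VP": ["VERB"],
    "NP": ["NOUN"],
    "PP": ["PREP"],
}

def to_dependencies(tree, edges):
    """Return the lexical head of `tree` and append (dependent, governor) pairs to `edges`.

    A tree is a tuple: (label, child, child, ...) for phrases, (pos, word) for tokens.
    Words double as node identifiers here, so they must be unique in this toy setting.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]  # preterminal: the word is its own head

    heads = [to_dependencies(child, edges) for child in children]
    child_labels = [child[0] for child in children]

    head_index = 0  # fallback when no rule applies: the first child
    for wanted in HEAD_RULES.get(label, []):
        if wanted in child_labels:
            head_index = child_labels.index(wanted)
            break

    # Every non-head child hangs on the head child's lexical head.
    for index, head in enumerate(heads):
        if index != head_index:
            edges.append((head, heads[head_index]))
    return heads[head_index]

# Toy bracketing of "fahima al-'afra~du" (the members understood); structure assumed.
tree = ("S", ("VP", ("VERB", "fahima"), ("NP", ("NOUN", "al-'afra~du"))))
edges = []
sentence_head = to_dependencies(tree, edges)
print(sentence_head, edges)
```

A real conversion also has to handle traces, co-ordination and the assignment of analytical functions, none of which this sketch attempts.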
Application Progress
• TrEd Tree Editor
• Highly powerful and reusable annotation tool
• NetGraph Tree Search
• Extra version for Arabic
• Server/Client system architecture
• Perl Modules
• AG2MorphoXML, MorphoMap, Encode::Arabic
TrEd Tree Editor
• Perl and Perl/Tk interactive application or batch processor
• General editor for trees and tree-like graphs
• Analytical dependency annotation
• Tectogrammatical dependency annotation
• Phrase-structure trees, MT solution forests …
• Comparison of parser/human results
• Language and platform independent
NetGraph Tree Search
• Java client, C server
• Interactive tree search, viewing, counting …
• Query in the form of a generalized subtree
• Server-side data search, client-side rendering
• Dependency trees, phrase-structure trees, trees
• Linguistic research, verifying of hypotheses
• Quick & easy system, language and platform independent
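The idea of a query "in the form of a generalized subtree" can be sketched as follows: a query node lists only the attributes it cares about, and each of its children must be matched by a distinct child of the data node. This is an illustration of the concept, not NetGraph's query language or implementation; the dictionary layout, function names and the example tree (with assumed attachments) are hypothetical.

```python
def matches(node, query):
    """True if `node` agrees with every attribute in `query` and each query child
    can be matched by a distinct child of `node`."""
    for key, value in query.items():
        if key != "children" and node.get(key) != value:
            return False
    return _assign(node.get("children", []), query.get("children", []))

def _assign(children, query_children):
    """Backtracking assignment of query children to distinct data children."""
    if not query_children:
        return True
    first, rest = query_children[0], query_children[1:]
    for index, child in enumerate(children):
        remaining = children[:index] + children[index + 1:]
        if matches(child, first) and _assign(remaining, rest):
            return True
    return False

def search(tree, query):
    """Yield every node of `tree` at which the query subtree matches."""
    if matches(tree, query):
        yield tree
    for child in tree.get("children", []):
        yield from search(child, query)

# Hypothetical data tree; attachments are assumed for the example.
tree = {
    "form": "da~ma", "afun": "Pred",
    "children": [
        {"form": "iqtira~Hu", "afun": "Sb"},
        {"form": "sa~Eatayni", "afun": "Adv"},
    ],
}

# Query: any predicate that governs a subject, whatever else it governs.
query = {"afun": "Pred", "children": [{"afun": "Sb"}]}
hits = [node["form"] for node in search(tree, query)]
print(hits)
```

Counting matches, as mentioned on the slide, then amounts to taking the length of the result list.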
Perl Modules
• AG2MorphoXML
• Token reconstruction from morpheme sequence
• Various readings/annotations, Prague XML
• MorphoMap
• Conversion from AraMorph multi-word POS tags to positional/bit-vector compact description
• Encode::Arabic
• Incorporation of Buckwalter and ArabTeX transliterations into the useful Encode module
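To show what a Buckwalter transliteration layer does, here is a minimal Python sketch of the character mapping. It is not the Encode::Arabic API (which is a Perl module); the function names are hypothetical, and the table covers only the core letters and the most common diacritics.

```python
# Buckwalter ASCII symbol -> Arabic character (core letters and common diacritics only).
BUCKWALTER = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624", "<": "\u0625",
    "}": "\u0626", "A": "\u0627", "b": "\u0628", "p": "\u0629", "t": "\u062A",
    "v": "\u062B", "j": "\u062C", "H": "\u062D", "x": "\u062E", "d": "\u062F",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638", "E": "\u0639",
    "g": "\u063A", "_": "\u0640", "f": "\u0641", "q": "\u0642", "k": "\u0643",
    "l": "\u0644", "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "Y": "\u0649", "y": "\u064A",
    # Diacritics: tanwin, short vowels, shadda, sukun.
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E", "u": "\u064F",
    "i": "\u0650", "~": "\u0651", "o": "\u0652",
}
TO_ASCII = {arabic: ascii_char for ascii_char, arabic in BUCKWALTER.items()}

def decode_buckwalter(text: str) -> str:
    """Map Buckwalter ASCII to Arabic script; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

def encode_buckwalter(text: str) -> str:
    """Map Arabic script to Buckwalter ASCII; unknown characters pass through."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)

arabic = decode_buckwalter("fhm AlAfrAd")
print(arabic)
print(encode_buckwalter(arabic))
```

Because the mapping is one-to-one over the covered characters, encoding and decoding round-trip without loss, which is what makes the scheme convenient as a character encoding.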
Related Projects Prague/Penn
• Tectogrammatical description guidelines
• Excellent PhD students joining the project
• Taggers, parsers, tree-node classifiers
• AraMorph re-implementation, spell-checkers
• Dictionaries, on-line or printed
• Projects in CR, USA and the Netherlands
• ACE named entity annotation
• Currently in LDC, “included” in tectogrammatics
• LDC’s CallHome & CallFriend for Arabic Dialects
Workshops Penn/Prague
• Philadelphia, July 2002
• Setting up, POS tool demo, intro to descriptions
• AFP data exchange experiment
• Prague, May 2003
• Reports, tutorials on applications and theories
• Morphology improvements, Arabic Gigaword
• Tool exchange and data revision plans
• Lisbon, April 2004
• Open workshop proposed for the LREC ’04
• Publication of the projects, the results & consequences