Prague Arabic Dependency Treebank
MorphoTrees of Arabic and Their Annotation in the TrEd Environment
Center for Computational Linguistics, Institute of Formal and Applied Linguistics, Charles University in Prague
Otakar Smrž, Petr Pajas
MorphoTrees … TrEd … ???
• MorphoTrees turn unorganized sets of complex morphological analyses into hierarchies
  • Intuitive, decision-efficient, multi-purpose, interesting
  • General: not limited to the language, the system of morphology, the levels, or the implementation
• TrEd is a fully programmable graphical editor for tree-like graphs and an excellent suite of tools for batch data processing (local/network)
  • Analytical and tectogrammatical dependency annotation
  • Viewing and converting Arabic phrase-structure trees
  • Evaluating and merging parser/tagger/human results
MorphoTrees in TrEd
• Files with two types of trees
• Criteria & restrictions
• Automatic decisions
• Hiding modes
• Viewing options
• Short-cut keys & mouse
• Consistency checks
• Processing & update macros
Arabic … the Questions
• Is there a syntactic difference between sawfa ′arā ′abā ′Aḥmada and sa′as′alu wālidahu? Is there a morphological difference?
  • The only difference is in the use of lexical units and morphs. The grammatical categories are unchanged, and morphology and syntax should clearly show this.
• How do we find syntactic units? How do we get back word-forms from the lexical units and tags?
• How much does an improper morphological reading disturb the subsequent syntactic representation?
  • Improper in tags, lemmas, diacritics, or in tokenization?
Reminder of the Terms
• Grapheme / Phoneme
  • The least units capable of distinguishing meanings
  • ~ 40 letters, context-dependent forms
  • 28 consonants, 6 vowels
• Morph
  • Composition of graphemes / phonemes
  • Abstract derivational forms
• Morpheme
  • The least unit representing some linguistic meaning
  • Function of morphs
  • Projection of grammatical categories
• Token
  • The least syntactic unit
  • Bearer of a uniform vector of grammatical categories
Tim Buckwalter’s Morphology
• PADT MorphoTrees are generated from the information provided by the Buckwalter Arabic Morphological Analyzer
  • + Updateable stem-based lexicon, finite-state model, implemented in Perl and published under the GNU GPL
  • – Morphs, mapping only to Quasi-Functional Morphology
  • The tokenization, clustering, modeling of conditionality, …
Sample analysis:
(wabijAnibihA) [jAnib_1]
wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS
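The sample analysis above can be split into its parts mechanically. The following is a minimal Python sketch, not the analyzer's own code (which is written in Perl). It assumes the one-line layout "(surface) [lemma] form/TAG + form/TAG + …" shown on the slide; the names Morph, Analysis and parse_analysis are hypothetical.

```python
import re
from typing import List, NamedTuple


class Morph(NamedTuple):
    form: str  # morph in Buckwalter transliteration
    tag: str   # tag attached to the morph, e.g. "PREP"


class Analysis(NamedTuple):
    surface: str         # vocalized word form, e.g. "wabijAnibihA"
    lemma: str           # lexicon identifier, e.g. "jAnib_1"
    morphs: List[Morph]  # morphs in surface order


# Assumed layout (taken from the slide): "(surface) [lemma] form/TAG + form/TAG ..."
_LINE = re.compile(r"^\((?P<surface>[^)]*)\)\s*\[(?P<lemma>[^\]]*)\]\s*(?P<morphs>.+)$")


def parse_analysis(line: str) -> Analysis:
    """Split one analysis line into surface form, lemma and morph/tag pairs."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected analysis format: {line!r}")
    morphs = []
    for chunk in re.split(r"\s*\+\s*", match.group("morphs")):
        form, _, tag = chunk.partition("/")
        morphs.append(Morph(form, tag))
    return Analysis(match.group("surface"), match.group("lemma"), morphs)


example = parse_analysis(
    "(wabijAnibihA) [jAnib_1] "
    "wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS"
)
```

Concatenating the morph forms of example gives back the surface form, which is a cheap sanity check on any parsed line.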
Xerox Morphological Analyzer
MorphoTrees Hierarchy
• MorphoTrees of Arabic propose these levels
  • Entity – the analyzed elements of the discourse
  • Partitioning into the standard forms of the tokens
  • Non-vocalized standard orthographical forms
  • Lemmas/identifiers of lexical units
  • Tokens – syntactic units including the form and the tag
• Independence of the language / implementation
  • More/different levels, inclusion of spelling variations, …
  • Annotation of various tagsets, other features of tokens
• Efficiency of decision-making
  • Distance between analyses becomes recognized
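The levels listed above can be pictured as nested groupings of a flat list of analyses. The sketch below is an assumption about one possible in-memory shape, not the PADT file format or the TrEd implementation. Token, build_morphotree, the lemma identifiers other than jAnib_1, and the second reading are all hypothetical and serve only to show how shared prefixes collapse into one branch.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    form: str       # non-vocalized standard form
    lemma: str      # identifier of the lexical unit
    vocalized: str  # vocalized token form
    tag: str        # tag of the token


def build_morphotree(entity: str, analyses: Sequence[Sequence[Token]]) -> dict:
    """Group flat analyses as partitioning -> form -> lemma -> token leaves.

    Simplification: tokens are keyed by form only, so a partitioning that
    repeats the same form twice would share one branch.
    """
    partitionings: Dict[Tuple[str, ...], Dict[str, Dict[str, List[Tuple[str, str]]]]] = {}
    for analysis in analyses:
        partition = tuple(tok.form for tok in analysis)
        forms = partitionings.setdefault(partition, {})
        for tok in analysis:
            leaves = forms.setdefault(tok.form, {}).setdefault(tok.lemma, [])
            leaf = (tok.vocalized, tok.tag)
            if leaf not in leaves:  # identical leaves are stored once
                leaves.append(leaf)
    return {"entity": entity, "partitionings": partitionings}


# Toy input. Only "jAnib_1" and the tags of the first reading come from the
# slide; the other lemma identifiers and the whole second reading are made up.
reading_1 = [
    Token("w", "wa_1", "wa", "CONJ"),
    Token("b", "bi_1", "bi", "PREP"),
    Token("jAnb", "jAnib_1", "jAnibi", "NOUN"),
    Token("hA", "hA_1", "hA", "POSS_PRON_3FS"),
]
reading_2 = reading_1[:2] + [
    Token("jAnb", "HYPOTHETICAL_2", "jAnibi", "HYPOTHETICAL_TAG"),
    reading_1[3],
]
tree = build_morphotree("wbjAnbhA", [reading_1, reading_2])
```

Both readings share the partitioning w + b + jAnb + hA, so the tree has a single partitioning node, and the two readings diverge only at the lemma level under jAnb. That is the sense in which the distance between analyses becomes visible.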
MorphoTrees Annotation
• Selecting the leaves that correspond to the proper reading of the tokens constituting the entity
  • Quick use of keyboard and/or mouse for annotations
• Restricting the tree according to the criteria/categories required by the context
  • Natural control over the inheritance of restrictions
• Employing automatic restrictions and annotation actions, both generic and linguistic
  • Learning about the discriminative categories and “human tagging”
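Restricting a tree by a criterion can be sketched as recursive pruning: a restriction applied at a node is inherited by everything below it, and branches left without leaves disappear. This is a hypothetical Python illustration, not a TrEd macro. The nested-dict tree shape, the function names restrict and leaves, and the placeholder labels in the toy tree are all assumptions.

```python
from typing import Callable, Iterator, Tuple, Union

Leaf = Tuple[str, str]          # (token form, tag)
Node = Union[dict, list]        # inner node: dict of children; bottom: list of leaves


def restrict(node: Node, keep: Callable[[Leaf], bool]) -> Node:
    """Return a pruned copy of node.

    Leaves failing keep() are dropped; the restriction is passed down to all
    descendants, and branches that end up empty are removed.
    """
    if isinstance(node, dict):
        pruned = {}
        for label, child in node.items():
            sub = restrict(child, keep)
            if sub:  # an empty dict or list means nothing survived below
                pruned[label] = sub
        return pruned
    return [leaf for leaf in node if keep(leaf)]


def leaves(node: Node) -> Iterator[Leaf]:
    """Yield every token leaf under node."""
    if isinstance(node, dict):
        for child in node.values():
            yield from leaves(child)
    else:
        yield from node


# Toy tree with placeholder labels: form -> lemma -> token leaves.
toy = {
    "form_A": {
        "lemma_1": [("token_1", "NOUN"), ("token_2", "VERB")],
        "lemma_2": [("token_3", "VERB")],
    }
}
nouns_only = restrict(toy, lambda leaf: leaf[1] == "NOUN")
remaining = list(leaves(nouns_only))
```

When a restriction leaves exactly one leaf, as in remaining here, an automatic annotation action could select it without further input from the annotator.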
Discussion and Conclusion
• MorphoTrees
  • Important in morphological annotation and in evaluation
  • PADT 1.0 provides 148 000 annotated tokens
• Functional Morphology
  • … more in Prague Arabic Dependency Treebank: Development in Data and Tools
  • Even its approximation is promising and welcome
• Feature-Based Tagger trained on Penn ATB 2
  • 3.6% error rate in major part-of-speech (15 values)
  • 10.8% in the full tagset (317 evidenced combinations)
  • 0.8–0.6% error rate in tokenization of the input