Explore Universal Dependencies, a framework for multilingual NLP research to ensure cross-linguistic consistency in grammatical annotation while supporting language-specific needs. Benefit from annotated treebanks, part-of-speech tags, morphological features, and dependency relationships across various languages. This open community project promotes parallelism, lexicalism, and recoverability, enabling the transparent mapping of input text to word segmentation. Join the effort to contribute and expand this resource for diverse linguistic studies.
Universal Dependencies Joakim Nivre Uppsala University
Universal Dependencies • Background: • Treebank annotation schemes vary across languages • Hard to compare results across languages [Nivre et al. 2007] • Hard to evaluate cross-lingual learning [McDonald et al. 2013] • Hard to build multilingual systems • Universal Dependencies (http://universaldependencies.github.io/docs/): • Stanford universal dependencies [de Marneffe et al. 2014] • Google universal part-of-speech tags [Petrov et al. 2012] • Interset morphological features [Zeman 2008] • First guidelines released Oct 1, 2014 • First 10 treebanks released Jan 15, 2015
Universal Dependencies • Syntactic words – explicit splitting of clitics and contractions • Universal part-of-speech tags + morphological features • Dependency tree + augmented dependencies (not shown)
Goals • Cross-linguistically consistent grammatical annotation • Support multilingual NLP and linguistic research • Build on common usage and existing de-facto standards • Complement – not replace – language-specific schemes • Open community effort – anyone can contribute
Guiding Principles • Maximize parallelism • Don't annotate the same thing in different ways • Don't make different things look the same • Don't annotate things that are not there • Languages select from a universal pool of categories • Allow language-specific extensions
Design Principles • Dependency • Widely used in practical NLP systems • Available in treebanks for many languages • Lexicalism • Basic annotation units are words – syntactic words • Words have morphological properties • Words enter into syntactic relations • Recoverability • Transparent mapping from input text to word segmentation
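As a minimal sketch of the recoverability principle, the snippet below keeps each surface token alongside the syntactic words it is split into, so the input text can be rebuilt from the annotation. The `Token` class, its field names, and the French example sentence are illustrative assumptions, not part of the slides or of any UD tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """A surface token plus its syntactic words (hypothetical structure)."""
    surface: str                                     # string as it appears in the input
    words: List[str] = field(default_factory=list)   # syntactic words it is split into
    space_after: bool = True                         # is the token followed by whitespace?

    def __post_init__(self):
        # A token that is not a contraction or clitic group is its own syntactic word.
        if not self.words:
            self.words = [self.surface]


def syntactic_words(tokens):
    """Flatten tokens into the sequence of syntactic words used for annotation."""
    return [w for t in tokens for w in t.words]


def recover_text(tokens):
    """Rebuild the original input text from the surface forms."""
    parts = []
    for t in tokens:
        parts.append(t.surface)
        if t.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


# French "du" is a contraction of "de" + "le" (example chosen for illustration).
sentence = [
    Token("Il"),
    Token("vient"),
    Token("du", words=["de", "le"]),
    Token("marché", space_after=False),
    Token("."),
]
```

Here `syntactic_words(sentence)` yields the units that carry tags and dependencies, while `recover_text(sentence)` returns the untouched input string, which is the property the recoverability bullet asks for.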
Morphological Annotation • The lemma represents the semantic content of a word • The part-of-speech tag represents its grammatical class • Features represent lexical and grammatical properties of the lemma or the particular word form
Syntactic Annotation • Content words are related by dependency relations • Function words attach to the content word they modify • Punctuation attaches to the head of the phrase or clause
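The attachment rules above can be sketched with a tiny hand-built tree. The `Word` tuple, the `NON_CONTENT_TAGS` set, and the example sentence are assumptions made for illustration; the tag set is a small subset, not the full inventory.

```python
from collections import namedtuple

# One row per syntactic word: head is the id of the governing word, 0 for the root.
Word = namedtuple("Word", "id form upos head deprel")

# Assumption: a small illustrative subset of function-word and punctuation tags.
NON_CONTENT_TAGS = {"DET", "AUX", "ADP", "PUNCT"}

# "The cat has slept ." with the content words "cat" and "slept" as heads.
tree = [
    Word(1, "The", "DET", 2, "det"),
    Word(2, "cat", "NOUN", 4, "nsubj"),
    Word(3, "has", "AUX", 4, "aux"),
    Word(4, "slept", "VERB", 0, "root"),
    Word(5, ".", "PUNCT", 4, "punct"),
]


def function_words_attach_to_content(tree):
    """True if every function word or punctuation mark has a content-word head."""
    by_id = {w.id: w for w in tree}
    for w in tree:
        if w.upos in NON_CONTENT_TAGS:
            head = by_id.get(w.head)
            if head is None or head.upos in NON_CONTENT_TAGS:
                return False
    return True


def roots(tree):
    """Forms of the words attached to the artificial root (head 0)."""
    return [w.form for w in tree if w.head == 0]
```

In this tree the auxiliary "has" hangs off the verb "slept" rather than heading the clause, which is what keeps the structure comparable with a language that marks the same tense morphologically.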
Dependency Structure • Keeping content words as heads promotes parallelism • Function words often correlate with morphology [Figure: English and Swedish examples]
Dependency Relations [de Marneffe et al. 2014] • Taxonomy of 42 universal grammatical relations, broadly attested across languages in linguistic typology • Language-specific subtypes can be added
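A minimal sketch of how a language-specific subtype can be separated from its universal relation, assuming the `universal:subtype` colon notation used in UD treebanks (for example `nmod:poss`). The function name `split_relation` is hypothetical.

```python
def split_relation(label):
    """Split a relation label into (universal relation, subtype or None)."""
    universal, _, subtype = label.partition(":")
    return universal, subtype or None


def to_universal(label):
    """Drop any language-specific subtype, keeping only the universal relation."""
    return split_relation(label)[0]
```

Mapping every label through `to_universal` before comparing parsers across languages keeps the evaluation on the shared inventory while leaving the finer distinctions available to language-specific work.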
Morphology: POS • Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset [Petrov et al. 2012]
Morphology: Universal Features • Standardized inventory of morphological features, based on the Interset system [Zeman 2008]
Morphology: Examples
la     Definite=Def|Gender=Fem|Number=Sing|PronType=Art
hanno  Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
fatto  Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part
casa   Gender=Fem|Number=Sing
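Feature strings like these are `Name=Value` pairs joined by `|`, so they map naturally onto a dictionary. The sketch below shows one way to parse and re-serialize them; the function names are hypothetical, and treating `_` as "no features" is an assumption borrowed from the CoNLL-U file convention.

```python
def parse_feats(feats):
    """Turn 'Gender=Fem|Number=Sing' into {'Gender': 'Fem', 'Number': 'Sing'}."""
    if feats in ("", "_"):  # assumption: '_' marks an empty feature set
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(features):
    """Serialize a feature dict back to a string, with names in sorted order."""
    if not features:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(features.items()))


hanno = parse_feats("Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin")
casa = parse_feats("Gender=Fem|Number=Sing")
```

With the features in a dictionary, a cross-lingual query such as "all third-person plural finite verbs" becomes a simple lookup on `Person`, `Number`, and `VerbForm`, independent of how each language spells the form.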