A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data
Olga Pustylnikov, Alexander Mehler
Bielefeld University

Motivation
• Exploring similarities among languages by means of syntactic treebanks
• We collected a database covering 11 languages
• Treebanks have been developed separately by different research projects
• Quantitative investigations on these treebanks -> the need for unification

Motivation
Demands on the unified format of treebanks:
(+) generic: able to represent as many treebanks as possible
(+) extensible to new treebanks
(+) complete: preserving all corpus-specific information
(+) transferable to other kinds of corpora
(–) complex: exhibiting minimal complexity
-> graph representations

Motivation
GXL (Holt et al., 2006)
• Graph eXchange Language (GXL) is a graph model representing corpora in terms of graphs
[Diagram labels: XML, Multimodal Data, GXL, eGXL, Tools, Wiki, Treebanks]
• GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008))

eGXL: 2-level data model

Types:
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
    ...
  </graph>

Sentences (each pos attribute is an IDREF to a node of the Types graph):
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
    ...
  </graph>

The eGXL Sentences-graph
[Figure: dependency tree over the tokens "vill", ".", "Detta", "jag", "bestämt", "bemöta"]

  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
    ...
  </graph>

Annotations:
• <node>: each token of a treebank
• form: word form
• pos: an IDREF to the POS-node of the Types-graph
• <rel>: a (syntactic) relation
• <relend direction="in">: from (e.g. a head verb)
• <relend direction="out">: to (e.g. a dependent argument)

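The two-level model described above can be read back with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the enclosing `<egxl>` root element, the name `PRON` for type `t151`, and the function name `read_sentences` are assumptions made for illustration. It resolves each token's `pos` IDREF against the Types graph and turns each `<rel>` into a (head, dependent) pair, reading `direction="in"` as the head end and `direction="out"` as the dependent end.

```python
import xml.etree.ElementTree as ET

# Hypothetical, self-contained document modelled on the slide's fragment.
# Assumptions: the <egxl> wrapper element and the name "PRON" for t151.
EGXL = """
<egxl>
  <graph id="Types">
    <node id="POS" />
    <node id="t151" name="PRON" />
    <node id="t245" name="VERB" />
  </graph>
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
    </graph>
  </graph>
</egxl>
"""


def read_sentences(xml_text):
    """Return {sentence_id: (tokens, edges)} for an eGXL-like document.

    tokens maps token id -> (word form, resolved POS name);
    edges is a list of (head id, dependent id) pairs.
    """
    root = ET.fromstring(xml_text)

    # Level 1: the Types graph maps type ids to tag names.
    types_graph = root.find("graph[@id='Types']")
    types = {n.get("id"): n.get("name") for n in types_graph.iter("node")}

    # Level 2: one nested graph per sentence.
    sentences = {}
    for g in root.find("graph[@id='Sentences']").findall("graph"):
        tokens = {
            n.get("id"): (n.get("form"), types.get(n.get("pos")))
            for n in g.findall("node")
        }
        edges = []
        for rel in g.findall("rel"):
            ends = {e.get("direction"): e.get("target") for e in rel.findall("relend")}
            edges.append((ends["in"], ends["out"]))
        sentences[g.get("id")] = (tokens, edges)
    return sentences


if __name__ == "__main__":
    for sent_id, (tokens, edges) in read_sentences(EGXL).items():
        print(sent_id, tokens, edges)
```

Keeping the tag names in a separate dictionary mirrors the slide's design: the sentence level stores only references, so a treebank-specific tagset can change without touching the tokens.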
11 Dependency Treebanks
• 7 different formats

Input vs. Output Formats
• Examples from Dutch, Swedish, Italian treebanks

Unification is possible…
… due to the separation of the core from the secondary parts

Diversity (Types graph):
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
    ...
  </graph>

Commonality (Sentences graph):
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
      ...
      <rel>
        <relend direction="in" target="s8_2" />
        <relend direction="out" target="s8_1" />
      </rel>
      ...
    </graph>
    ...
  </graph>

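To illustrate how a treebank-specific input could be mapped onto this split, here is a hedged sketch of a converter. The input shape (a list of `(form, pos_tag, head_index)` triples per sentence, with 1-based head indices and 0 for the root), the `<egxl>` wrapper element, the id scheme, and the function name `to_egxl` are all assumptions for illustration; the source does not specify the seven input formats or the actual conversion procedure.

```python
import xml.etree.ElementTree as ET


def to_egxl(sentences):
    """Build an eGXL-like tree from simple token triples.

    sentences: list of sentences, each a list of (form, pos_tag, head_index);
    head_index is 1-based within the sentence, 0 marks the root (assumed input shape).
    """
    root = ET.Element("egxl")  # hypothetical wrapper element
    types = ET.SubElement(root, "graph", id="Types")
    ET.SubElement(types, "node", id="POS")
    sents = ET.SubElement(root, "graph", id="Sentences")

    type_ids = {}  # treebank-specific tag -> id of its node in the Types graph
    for s_no, sentence in enumerate(sentences, start=1):
        g = ET.SubElement(sents, "graph", id=f"g{s_no}")

        # Tokens: the tagset (diversity) goes to Types, the token keeps an IDREF.
        for t_no, (form, pos, _head) in enumerate(sentence, start=1):
            if pos not in type_ids:
                type_ids[pos] = f"t{len(type_ids) + 1}"
                ET.SubElement(types, "node", id=type_ids[pos], name=pos)
            ET.SubElement(g, "node", id=f"s{s_no}_{t_no}", form=form, pos=type_ids[pos])

        # Dependencies: one <rel> per non-root token, head as "in", dependent as "out".
        for t_no, (_form, _pos, head) in enumerate(sentence, start=1):
            if head == 0:
                continue
            rel = ET.SubElement(g, "rel")
            ET.SubElement(rel, "relend", direction="in", target=f"s{s_no}_{head}")
            ET.SubElement(rel, "relend", direction="out", target=f"s{s_no}_{t_no}")
    return root


if __name__ == "__main__":
    # "PRON" is a placeholder tag, not taken from the source.
    doc = to_egxl([[("Detta", "PRON", 2), ("vill", "VERB", 0)]])
    print(ET.tostring(doc, encoding="unicode"))
```

Only the first loop touches tagset-specific information, which is the point of the separation: a new treebank would need a new reader for its input format, while the sentence-level output stays the same.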
The TreebankWiki
http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/

Complexity of eGXL
Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, a POS tag, etc.)
[Chart labels: node, rel, other; eGXL, other]

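One possible reading of this measure can be sketched in Python: count the XML elements by tag and divide the total by the number of tokens. The exact normalization used for the LSF is not given on the slide, so the per-token ratio below and the function names `element_counts` and `elements_per_token` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sentence fragment from the slides, with the elided parts left out.
FRAGMENT = """
<graph id="g8">
  <node id="s8_1" form="Detta" pos="t151" />
  <node id="s8_2" form="vill" pos="t245" />
  <rel>
    <relend direction="in" target="s8_2" />
    <relend direction="out" target="s8_1" />
  </rel>
</graph>
"""


def element_counts(xml_text):
    """Count XML elements by tag name, including the root element."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())


def elements_per_token(xml_text, token_tag="node"):
    """Assumed LSF-style ratio: total XML elements divided by token elements."""
    counts = element_counts(xml_text)
    tokens = counts[token_tag]
    if tokens == 0:
        raise ValueError("no token elements found")
    return sum(counts.values()) / tokens


if __name__ == "__main__":
    print(element_counts(FRAGMENT))
    print(elements_per_token(FRAGMENT))
```

The per-tag counts correspond to a breakdown by element type such as node, rel, and other, so the same counter could be run over the original and the converted version of a treebank to compare them.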
Conclusions
• A database covering 11 languages
• eGXL: a generic XML graph model adapted to syntactic treebanks
• Use of treebanks within a single application (Ariadne)

olga.pustylnikov@uni-bielefeld.de
alexander.mehler@uni-bielefeld.de
ruediger.gleim@uni-bielefeld.de
SFB 673
Thank you for your attention!