130 likes | 216 Views
Combining Word-Alignment Symmetrizations in Dependency Tree Projection. David Mare č ek marecek@ufal.mff.cuni.cz Charles University in Prague Institute of Formal and Applied Linguistics CICLING conference Tokyo, Japan, February 21, 2011. Motivation.
E N D
Combining Word-Alignment Symmetrizations in Dependency Tree Projection David Mareček marecek@ufal.mff.cuni.cz Charles University in Prague Institute of Formal and Applied Linguistics CICLING conference Tokyo, Japan, February 21, 2011
Motivation • Let’s have a text in a language which is not very common... • We would like to parse it, but we do not have any parser • no manually annotated treebank • But we do have a parallel corpus with another language • English
Our goal – To create a parser • Take the parallel corpus with English • Make a word-alignment on it • GIZA++ • Parse the English side of the corpus • MST dependency parser • Transfer the dependencies from English to the target language using the word-alignment • Train the parser on the resulting trees
Previous works • Rebecca Hwa (2002, 2005) • Simple algorithm for projecting trees from English to Spanish and Chinesse • Only one type of alignment used and not specified which one • K. Ganchev, J. Gillenwater, B. Taskar (2009) • Unsuprevised parser with posterior regularization, in which inferred dependencies should correspond to projected ones • English to Bulgarian
Our contribution • To show that utilization of various types of alignment improves the quality of dependency projection • GIZA++ [Och and Ney, 2003] • two uni-directonal asymmetric alignments • symmetrization methods • Simple algorithm for projecting dependencies using different types of alignment links • Training and evaluating MST parser
Word alignment • GIZA++ toolkit has asymmetric output • For each word in one language just one counterpart from the other language is found Coordination of fiscal policies indeed , can be counterproductive . Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein . ENGLISH-to-X Coordination of fiscal policies indeed , can be counterproductive . Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein . X-to-ENGLISH
Symmetrization methods Coordination of fiscal policies indeed , can be counterproductive . • Combinations of previous two unidirectional alignments Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein . INTERSECTION Coordination of fiscal policies indeed , can be counterproductive . Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein . GROW-DIAG-FINAL
Which alignment to use for the projection? • We have presented four different types of alignment • ENGLISH-to-X, X-to-ENGLISH, INTERSECTION, GROW-DIAG-FINAL • We prefer X-to-ENGLISH alignment • we need to find a parent for each token in the language X • we don’t mind English words that are not aligned • We recognize three types of links • A: links that appeared in INTERSECTION alignment (red) • B: links that appeared in GROW-DIAG-FINAL and also in X-to-ENGLISH alignment (orange) • C: links that appeared only in X-to-ENGLISH alignment (blue) Coordination of fiscal policies indeed , can be counterproductive . Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .
Algorithm - example Coordination of fiscal policies indeed , can be counterproductive . Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .
Results • The best results for each of the testing languages: • English parser trained on CoNLL-X data • The projection was made on first 100.000 sentence pairs from News-commentaries (or Acquis-communautaire) parallel corpus • We used McDonald’s maximum spaning tree parser • Why is the accuracy so low? • Treebanks in CoNLL differ in annotation guidelines • Different handling of coordination structures, auxiliary verbs, noun phrases, ...
Comparison with previous work • We have run our projection method on the same datasets as in the previous work by Ganchev et al. (2009) • Bulgarian, OpenSubtitles parallel corpus • English parser trained on PennTreebank • Tested on Bulgarian CoNLL-X train sentences up to 10 words • Our results are slightly better • we did NOT use any unsupervised inference of dependency edges • we utilized better the word aligment
Conclusions • We proved that using combination of different word-alignment improves dependency tree projection • We outperform the state-of-the art results • The problem of testing is in a different anotation guidelines for each treebank