ACCURAT: Metrics for the evaluation of comparability of multilingual corpora Andrejs Vasiljevs, Inguna Skadina (Tilde), Bogdan Babych, Serge Sharoff (CTS), Ahmet Aker, Robert Gaizauskas, David Guthrie (USFD) LREC 2010 Workshop on Methods for the automatic acquisition of Language Resources and their evaluation May 23, 2010
Comparable Corpora • Non-parallel bi- or multilingual text resources • Collection of documents that are: • gathered according to a set of criteria, e.g. the proportion of texts of the same genre, in the same domains, in the same period • in two or more languages • containing overlapping information • Examples: • multilingual news feeds, • multilingual websites, • Wikipedia articles, • etc.
Key objectives • To create comparability metrics - to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora • To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web • To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT • To measure improvements from applying acquired data against baseline results from SMT and RBMT systems • To evaluate and validate the ACCURAT project results in practical applications
ACCURAT Languages • Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian, Slovenian • Major translation directions, e.g. English-Lithuanian, English-Croatian, German-Romanian • Minor translation directions, e.g. Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian • Methods will be adjustable to new languages and domains, and language-independent where possible • Applicability of methods will be evaluated in usage scenarios
Project Partners • Tilde (Project Coordinator) - Latvia • University of Sheffield - UK • University of Leeds - UK • Athena Research and Innovation Center in Information Communication and Knowledge Technologies - Greece • University of Zagreb - Croatia • DFKI - Germany • Institute of Artificial Intelligence - Romania • Linguatec - Germany • Zemanta - Slovenia
Objectives for comparability metrics • To develop criteria and automated metrics to determine the kind and degree of comparability of comparable corpora and the parallelism of documents and individual sentences within documents • To evaluate metrics designed for determining similar documents in comparable corpora • To develop a methodology to assess comparable corpora collected from the Web and to choose an alignment strategy and lexical data extraction methods.
Criteria of Comparability and Parallelism • Lack of definite methods to determine the criteria of comparability • Some attempts to measure the degree of comparability according to the distribution of topics and publication dates of documents in comparable corpora, to estimate the global comparability of the corpora (Saralegi et al., 2008) • Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism • Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as similar number of sentences, sharing sufficiently many links (up to 30%), and monotony of links (up to 90% of links do not cross each other) (Munteanu, 2006) • Research on automated methods for assessing the composition of web corpora in terms of domains and genres (Sharoff, 2007)
Towards establishing metrics Metrics: intralingual and interlingual comparability for genres, domains and topics • intralingual: distance between corpora and documents within corpora in the same language • methods: distance in feature spaces and machine learning • interlingual: distance between corpora and documents in different languages • methods: dictionaries and existing MT to map feature spaces between languages Evaluation: validation of the scale by independent annotation
Criteria of Comparability and Parallelism • To investigate criteria for comparability between corpora concentrating on different sets of features: • Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora • Lexical sequence features: computing N-gram distances in terms of tokens • Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes
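To make the feature types above concrete, here is a minimal sketch of two of them: lexical overlap between frequency lists and an n-gram distance over part-of-speech sequences. The helper names and tokenisation are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of two feature types: lexical overlap of frequency lists and a
# POS n-gram distance (hypothetical helpers, not ACCURAT's code).
from collections import Counter


def top_frequency_list(tokens, n=500):
    """Return the n most frequent tokens of a corpus as a set."""
    return {w for w, _ in Counter(tokens).most_common(n)}


def lexical_overlap(tokens_a, tokens_b, n=500):
    """Jaccard overlap of the top-n frequency lists of two corpora."""
    a, b = top_frequency_list(tokens_a, n), top_frequency_list(tokens_b, n)
    return len(a & b) / len(a | b)


def pos_ngram_distance(pos_tags_a, pos_tags_b, n=3):
    """1 - cosine similarity of POS n-gram count vectors."""
    def ngram_counts(tags):
        return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    ca, cb = ngram_counts(pos_tags_a), ngram_counts(pos_tags_b)
    dot = sum(ca[g] * cb[g] for g in set(ca) | set(cb))
    norm = (sum(v * v for v in ca.values()) ** 0.5) * \
           (sum(v * v for v in cb.values()) ** 0.5)
    return 1.0 - dot / norm if norm else 1.0
```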
Features Initial set of features that may be used to assess the comparability between documents
Initial comparable corpora • Initial Comparable Corpora have been collected for the development of comparability metrics • 11 M words, 9 languages
Comparable Test Corpora • Collected for evaluation of the comparability metrics • 34% parallel texts, 33% strongly comparable texts and 33% weakly comparable texts • 9 languages, 247 000 running words
Benchmarking comparability • Problems with human labelling: • Labels are too coarse-grained • Symbolic labels pose problems in establishing correlation with the numeric scores produced by the metric • Labelling criteria and/or human annotation may be unsystematic • Proposal to benchmark the comparability metric against the quality of the resulting MT (e.g., the standard BLEU/NIST scores)
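The benchmarking proposal amounts to checking how well the metric's scores track downstream MT quality, for example via a correlation coefficient. The sketch below illustrates the idea with placeholder numbers (they are not project results); it assumes one comparability score and one BLEU score per corpus pair.

```python
# Hypothetical sketch: correlate comparability scores with the BLEU scores
# of MT systems built from the corresponding corpus pairs (toy numbers).
from scipy.stats import pearsonr

comparability_scores = [0.12, 0.35, 0.48, 0.71, 0.90]  # metric output per corpus pair
bleu_scores = [8.4, 11.2, 12.9, 15.5, 17.1]            # BLEU of the resulting MT system

r, p_value = pearsonr(comparability_scores, bleu_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```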
Initial experiment: Cross-lingual comparison of corpora • Mapping feature spaces using bilingual dictionaries or translation probability tables • Purpose: to see how large the difference is between the frequencies of words and their translations in another language • Set-up: 500 most frequent words; relative frequencies; bilingual translation probability tables (Europarl) • χ² score (cross-lingual intersection) • Pearson's correlation with the degree of comparability
Initial experiment • Comparability of corpora is measured in terms of lexical features (Greek—English and German—English language pairs) • The set-up is similar to (Kilgarriff, 2001): • For each corpus take the top 500 most frequent words • relative frequency is used (the absolute frequency, or word count, divided by the length of the corpus) • Dictionaries generated automatically by Giza++ from the parallel Europarl corpus • Corpora are compared pairwise using a standard Chi-Square distance measure: ChiSquare = Σ_{w ∈ {w1 … w500}} (FrqObserved(w) − FrqExpected(w))² / FrqExpected(w)
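A minimal sketch of this chi-square comparison follows. It assumes a simple one-best bilingual dictionary (e.g. extracted from Giza++ alignments) and illustrative helper names; it is not the project's implementation.

```python
# Chi-square distance between two frequency lists mapped via a bilingual
# dictionary (illustrative sketch).
from collections import Counter


def relative_frequencies(tokens, n=500):
    """Relative frequencies of the n most frequent words in a corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.most_common(n)}


def map_via_dictionary(freqs, dictionary):
    """Translate a frequency list word by word with a one-best dictionary."""
    mapped = {}
    for word, freq in freqs.items():
        target = dictionary.get(word)
        if target is not None:
            mapped[target] = mapped.get(target, 0.0) + freq
    return mapped


def chi_square_distance(freq_expected, freq_observed_mapped):
    """Sum of (observed - expected)^2 / expected over the 'expected' word list.

    Words missing from the other corpus contribute their full expected
    frequency, which is what makes the measure asymmetric.
    """
    score = 0.0
    for word, expected in freq_expected.items():
        observed = freq_observed_mapped.get(word, 0.0)
        score += (observed - expected) ** 2 / expected
    return score
```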
Initial experiment • Asymmetric method: relative frequencies in the corpus in language A are treated as "expected" values, and those mapped from the corpus in language B as "observed"; then corpora A and B are swapped and the calculation repeated • The asymmetry comes from words that are missing in one of the lists as compared to the other: missing words contribute different relative frequencies to the score, so the distance from A to B can differ from the distance from B to A • The minimum of these two distances is used as the final score for the pair of corpora
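The symmetrisation step described above can be sketched as a small wrapper that computes the distance in both directions and keeps the minimum; it reuses the illustrative helpers from the previous sketch and assumes one-best dictionaries in each direction.

```python
# Sketch of the min-of-both-directions score for a corpus pair,
# reusing relative_frequencies, map_via_dictionary and chi_square_distance.
def corpus_pair_score(tokens_a, tokens_b, dict_a_to_b, dict_b_to_a, n=500):
    freqs_a = relative_frequencies(tokens_a, n)
    freqs_b = relative_frequencies(tokens_b, n)
    # A as "expected", B mapped into A's language as "observed", and vice versa.
    d_ab = chi_square_distance(freqs_a, map_via_dictionary(freqs_b, dict_b_to_a))
    d_ba = chi_square_distance(freqs_b, map_via_dictionary(freqs_a, dict_a_to_b))
    return min(d_ab, d_ba)
```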
Collecting comparable texts from the Web • How should we collect comparable texts from the Web? • Crawling all documents on the Web and using comparability metrics to align them: • Inefficient • High computational effort to align • Instead: build a classifier/retrieval system that, given a document (or certain characteristics), can find other comparable documents • Combine crawling with document pair classification or searching
Classifier • Using the Initial Comparable Corpora for: • feature extraction • training a classifier/ranking system • Predict whether pairs of documents are: • parallel • strongly comparable • weakly comparable • not comparable
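A hypothetical sketch of this classifier idea follows, assuming each document pair has already been turned into a small numeric feature vector (e.g. lexical overlap, n-gram distances) and labelled with one of the four comparability levels; the feature values and model choice are placeholders, not the project's setup.

```python
# Toy sketch: train a multi-class classifier on labelled document-pair
# features and predict the comparability level of a new pair.
from sklearn.linear_model import LogisticRegression

LEVELS = ["parallel", "strongly comparable", "weakly comparable", "not comparable"]

# X: feature vectors for document pairs from the Initial Comparable Corpora,
# y: their manually assigned comparability levels (placeholder toy data).
X_train = [[0.95, 0.02], [0.60, 0.20], [0.30, 0.55], [0.05, 0.90]]
y_train = LEVELS

classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict the comparability level of a new (e.g. EN-EL) document pair.
print(classifier.predict([[0.55, 0.25]])[0])
```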
General Idea [Diagram: feature extraction from the Initial Comparable Corpora (EN-EL document pairs labelled parallel, strongly comparable, weakly comparable, not comparable) is used to train a classifier, which then predicts the comparability level of new EN-EL document pairs, e.g. "strongly comparable".]
Evaluation • Evaluation of comparability metrics against the manually annotated Comparable Test Corpora (precision, recall) • Evaluation of document-level alignment methods against the manually annotated Comparable Test Corpora (precision, recall) • Evaluation of sentence and phrase level alignment for corpora with different levels of comparability – aligned data are needed; the idea is to "spoil" a parallel corpus • Automated evaluation of applicability in MT against baseline systems (BLEU, NIST) • User evaluation in gisting and post-editing scenarios
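For the first two evaluation steps, the comparison against the annotated test corpora boils down to precision and recall over predicted comparability labels (or alignment decisions). The sketch below uses placeholder labels, not actual annotation data.

```python
# Minimal sketch: precision/recall of predicted comparability levels
# against gold annotations from the Comparable Test Corpora (toy labels).
from sklearn.metrics import precision_recall_fscore_support

gold = ["parallel", "strongly comparable", "weakly comparable", "not comparable"]
predicted = ["parallel", "weakly comparable", "weakly comparable", "not comparable"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, predicted, average="macro", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```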
Contact information: Andrejs Vasiljevs andrejs@tilde.lv Tilde, Vienibas gatve 75a, Riga LV1004, Latvia www.accurat-project.eu The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N°248347 Project duration: January 2010 – June 2012