On-line Compilation of Comparable Corpora and Their Evaluation
Radu ION, Dan TUFIŞ, Tiberiu BOROŞ, Alexandru CEAUŞU and Dan ŞTEFĂNESCU
Research Institute for Artificial Intelligence (RACAI)
FASSBL-7, Dubrovnik, Croatia, October 4-6, 2010
Introduction
Multilingual Comparable Corpora (MCC) are usually easier to find and gather than parallel corpora.
There are many types of MCC, distinguished by their degree of relatedness: strongly, weakly, very non-parallel, etc.
Our working definition of MCC (Munteanu & Marcu, 2006): a set of paired documents that, even though they are not translations of one another, are related and convey overlapping information.
For instance, news about your favorite local football team suffering a defeat last night.
Document pairing in MCC
To be able to use large MCC, we need to pair documents from the source and target languages.
Suppose that we gather some type of news corpora (sports, for instance) in two languages by streaming news sites in those languages.
Suppose that we do not keep the documents separate and instead join them into one large document per language.
If the source and target documents have 1M words each (a very optimistic scenario), we will need at least $10^6 \times 10^6 = 10^{12}$ operations to word-align them!
But if we had 1000 documents of 1000 words each (in each of the languages) and managed to align the documents first, we would need only $1000 \times 1000^2 = 10^9$ operations.
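The arithmetic above can be checked with a minimal sketch. It assumes that word alignment compares every source word with every target word of a paired document; the function name `word_alignment_cost` is illustrative and not from the slides.

```python
def word_alignment_cost(source_lengths, target_lengths):
    """Number of word-to-word comparisons for documents that are already paired.

    Assumption: aligning a pair costs (source words) x (target words).
    """
    return sum(s * t for s, t in zip(source_lengths, target_lengths))


# One merged 1M-word document per language.
unpaired = word_alignment_cost([1_000_000], [1_000_000])

# 1000 paired documents of 1000 words each, per language.
paired = word_alignment_cost([1000] * 1000, [1000] * 1000)

# Integer ratio between the two costs.
speedup = unpaired // paired
print(unpaired, paired, speedup)
```

Under this cost model, pairing the documents first reduces the work by a factor equal to the number of document pairs.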
Wikipedia as an MCC corpus
Wikipedia is an extremely valuable resource in that it is a free collection of (generally) good-quality articles that have versions in many languages.
Many Wikipedia articles are linked to their versions in other languages, a feature that makes Wikipedia an inherently large MCC.
English Wikipedia has 3,431,874 articles; Romanian Wikipedia has 150,797 articles.
We have employed two different strategies for building MCC from Wikipedia:
  using Romanian “quality articles” (very good articles that are complete, well written and approved by senior Wikipedia administrators);
  using Princeton English WordNet (to be explained…).
MCC from Wikipedia quality articles
Starting from a list of Romanian quality articles, we gathered 128 pairs of English-Romanian documents from Wikipedia (602K/502K words) using one of the following heuristics:
  Following the English link from the Romanian article gave us the English pair of the Romanian document.
  We took English articles that had exactly the same name as the Romanian articles (“Alicia Keys”, “Evanescence”, etc.).
  We automatically translated the title of the Romanian page into an English query by using translation lexicons (we considered the first 2 translations of every Romanian content word). We retrieved the first 10 results and manually found the pair of the Romanian document, but an automatic method is also available (to be described…).
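The title-translation heuristic can be sketched as follows. This is only an illustration: the toy lexicon, the function name `title_to_queries`, and the choice to treat any word missing from the lexicon as a non-content word are assumptions, and the slides do not say how the translations were combined into a query.

```python
from itertools import product


def title_to_queries(title, lexicon, max_translations=2):
    """Build English queries from a Romanian title.

    Keeps at most `max_translations` translations per word and skips words
    that are absent from the lexicon (assumed here to be non-content words).
    """
    options = []
    for word in title.lower().split():
        translations = lexicon.get(word, [])[:max_translations]
        if translations:
            options.append(translations)
    if not options:
        return []
    # One query per combination of the retained translations.
    return [" ".join(combo) for combo in product(*options)]


# Hypothetical toy lexicon, ordered from most to least preferred translation.
toy_lexicon = {
    "doilea": ["second"],
    "război": ["war", "battle", "fight"],
    "mondial": ["world", "global"],
}

queries = title_to_queries("Al Doilea Război Mondial", toy_lexicon)
print(queries)
```

Each generated query would then be sent to a search engine, and the first 10 results inspected for the matching English article.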
MCC from Wikipedia using WordNet
Using Princeton WordNet (wordnet.princeton.edu), extract a list of named entities (literals that are capitalized and usually in the “instance_of” relation with their parents).
Transform these literals into Wikipedia page names by replacing spaces with underscores (“_”) and adding the Wikipedia URL prefix en.wikipedia.org/wiki/.
Extract all the English pages we can find and, for each page, the Romanian and/or German versions, if they exist, by following the interlingual links.
Strip the HTML markup from the documents, retaining only the UTF-8 text, and store the categories of each document in order to be able to select different domain corpora.
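The literal-to-page-name step can be sketched as below. The `https://` scheme, the percent-encoding of non-ASCII characters, and the function names are assumptions added for illustration; the slides only specify the underscore replacement and the en.wikipedia.org/wiki/ prefix.

```python
from urllib.parse import quote

# Prefix from the slides, with an assumed https:// scheme.
WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"


def looks_like_named_entity(literal):
    """Rough capitalization check; the 'instance_of' test is not modeled here."""
    return literal[:1].isupper()


def literal_to_wikipedia_url(literal, prefix=WIKIPEDIA_PREFIX):
    """Replace spaces with underscores and prepend the Wikipedia URL prefix."""
    page_name = literal.strip().replace(" ", "_")
    # Percent-encode anything that is not safe in a URL path.
    return prefix + quote(page_name)


# Hypothetical input literals.
literals = ["New York", "Danube", "river"]
urls = [literal_to_wikipedia_url(l) for l in literals if looks_like_named_entity(l)]
print(urls)
```

Pages that do not exist on Wikipedia would simply be skipped when the URLs are fetched.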
Sizes of Collected MCC corpora
Using the WordNet named entities method, we were able to gather the following data (in thousands of words):
Document pairing in MCC
The problem is to automatically pair documents (1:1 mapping) from the source language set with those in the target language set.
To do this, we replaced each word in every document with its translation equivalents, imposing a limit of at most 3 translations per word and considering only those source words that have a low translation entropy score (at most 0.5).
If two candidate documents are represented as binary vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, in which a position is 1 if the corresponding term is found in the document …
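A minimal sketch of this representation follows. Several details are assumptions rather than facts from the slides: translation entropy is taken to be the Shannon entropy (base 2) of a word's translation probabilities, the lexicon is a toy dictionary, and the names `translation_entropy`, `translate_document` and `binary_vector` are illustrative.

```python
import math


def translation_entropy(probabilities):
    """Shannon entropy (base 2, an assumption) of a translation distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def translate_document(words, lexicon, max_translations=3, max_entropy=0.5):
    """Replace words with their most probable translations.

    Words without a lexicon entry, or with entropy above `max_entropy`, are dropped.
    """
    terms = set()
    for word in words:
        candidates = lexicon.get(word)
        if not candidates:
            continue
        if translation_entropy(candidates.values()) > max_entropy:
            continue
        best = sorted(candidates, key=candidates.get, reverse=True)
        terms.update(best[:max_translations])
    return terms


def binary_vector(terms, vocabulary):
    """1 where the vocabulary term occurs in the document, 0 elsewhere."""
    return [1 if term in terms else 0 for term in vocabulary]


# Hypothetical lexicon: source word -> {translation: probability}.
toy_lexicon = {
    "câine": {"dog": 0.95, "hound": 0.05},       # low entropy, kept
    "a": {"to": 0.4, "has": 0.3, "a": 0.3},      # high entropy, dropped
}

terms = translate_document(["câine", "a"], toy_lexicon)
vector = binary_vector(terms, ["cat", "dog", "hound"])
print(sorted(terms), vector)
```

The entropy threshold filters out ambiguous words whose translations would add noise to the document vectors.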
Percent disagreement d(x, y)
Percent disagreement is the measure that differentiates best between good pairs and bad ones (tested against the Euclidean, squared Euclidean and Manhattan distances).
We obtained 72% accuracy when aligning the 128-document test set (the quality articles) from Romanian Wikipedia.
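The formula on this slide did not survive extraction, so the sketch below assumes the usual definition of percent disagreement: the fraction of positions at which the two vectors differ. The greedy 1:1 pairing in `pair_documents` is likewise an illustrative assumption, not necessarily the procedure the authors used.

```python
def percent_disagreement(x, y):
    """Fraction of positions where x and y differ (assumed standard definition)."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)


def pair_documents(source_vectors, target_vectors):
    """Greedy 1:1 pairing: repeatedly take the closest still-unpaired pair."""
    candidates = sorted(
        (percent_disagreement(x, y), i, j)
        for i, x in enumerate(source_vectors)
        for j, y in enumerate(target_vectors)
    )
    used_sources, used_targets, pairs = set(), set(), {}
    for _, i, j in candidates:
        if i not in used_sources and j not in used_targets:
            pairs[i] = j
            used_sources.add(i)
            used_targets.add(j)
    return pairs


# Hypothetical binary document vectors over a shared 4-term vocabulary.
source = [[1, 1, 0, 0], [0, 0, 1, 1]]
target = [[0, 0, 1, 0], [1, 1, 0, 1]]

pairs = pair_documents(source, target)
print(pairs)
```

A lower value of d(x, y) means the two documents share more of their (translated) vocabulary, so they are more likely to be a comparable pair.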
Focused MCC crawling
Usually the task of collecting corpora from the web is undertaken once, and then all the related tools and resources are forgotten …
… until a new corpus needs to be built, in which case the whole suite of scripts is usually rewritten to cope with the new requirements.
To avoid this unnecessary duplication of work, we developed a graphical web crawler that, based on an input list of URLs, crawls the web, stores the documents in text form and, optionally, runs them through a suite of NLP tools of the user’s choice.
Conclusions
Comparable corpora are easier to obtain than parallel corpora, and in the ACCURAT project (http://www.accurat-project.eu/) we intend to exploit comparable corpora to obtain parallel data that will complement and improve existing translation models.
We have collected around 46M words of English-Romanian comparable corpora and around 26M words of Romanian-German comparable corpora from Wikipedia.
We have also developed a generic graphical web crawler that will collect even more comparable corpora from the web.