The Web as a Parallel Corpus

Slide 1: The Web as a Parallel Corpus. Philip Resnik (University of Maryland) and Noah A. Smith (Johns Hopkins University). Computational Linguistics, 29(3), pp. 349-380, MIT Press, 2004.

Slide 2: Abstract. The STRAND system for mining parallel text on the World Wide Web: a review of the original algorithm and results, followed by a set of significant enhancements: (1) use of supervised learning based on structural features of documents to improve classification performance; (2) a new content-based measure of translational equivalence; (3) adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. The techniques are applied to the construction of a significant parallel corpus for a low-density language pair.

Slide 3: Introduction. Parallel corpora (bitexts) are used for automatic lexical acquisition (Gale and Church 1991; Melamed 1997); they provide indispensable training data for statistical translation models (Brown et al. 1990; Melamed 2000; Och and Ney 2002); and they provide the connection between vocabularies in cross-language information retrieval (Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997).

Slide 4: Recent work at UM and JHU. Parallel corpora are exploited to develop monolingual resources and tools through a process of annotation, projection, and training: given a parallel corpus in English and a less resource-rich language, project English annotations across the parallel corpus to the second language, using word-level alignments as the bridge, and then use robust statistical techniques to learn from the resulting noisy annotations (Cabezas, Dorr, and Resnik 2001; Diab and Resnik 2002; Hwa et al. 2002; Lopez et al. 2002; Yarowsky, Ngai, and Wicentowski 2001; Yarowsky and Ngai 2001; Riloff, Schafer, and Yarowsky 2002).

Slide 5: Parallel corpora as a critical resource. They are not readily available in the necessary quantities. Earlier work relied heavily on French-English because the Canadian parliamentary proceedings (Hansards) in English and French were the only large bitext available. Other sources include United Nations proceedings (LDC), religious texts (Resnik, Olsen, and Diab 1999), and software manuals (Resnik and Melamed 1997; Menezes and Richardson 2001). The available corpora tend to be unbalanced, representing primarily governmental or newswire-style texts.

Slide 6: World Wide Web. People tend to see the Web as a reflection of their own way of viewing the world: a huge semantic network; an enormous historical archive; a grand social experiment; a great big body of text waiting to be mined; a huge fabric of linguistic data, often interwoven with parallel threads.

Slide 7: STRAND (Resnik 1998, 1999). Structural Translation Recognition Acquiring Natural Data: identifies pairs of Web pages that are mutual translations. New work incorporated here: content-based detection of translations (Smith 2001, 2002) and efficient exploitation of the Internet Archive.

Slide 9: Finding parallel text on the Web. Three steps: (1) location of pages that might have parallel translations; (2) generation of candidate pairs that might be translations; (3) structural filtering out of nontranslation candidate pairs.

Slide 10: Locating Pages. The AltaVista search engine's advanced search is used to search for two types of Web pages: parents and siblings. AltaVista is queried with expressions such as (anchor:"english" OR anchor:"anglais") (anchor:"french" OR anchor:"français"). There is also a spider component, but the results reported here do not make use of the spider.

Slide 12: Generating Candidate Pairs. Candidates are generated by URL matching: for example, starting with the URL http://mysite.com/english/home_en.html, one combination of substitutions might produce the URL http://mysite.com/big5/home_ch.html. Another possible criterion for matching is the use of document lengths: for text E in language 1 and text F in language 2, length(E) ≈ C * length(F), where C is a constant tuned for the language pair.
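The two matching criteria on this slide can be sketched roughly as follows. This is a minimal illustration, not STRAND's actual code: the substitution table, the function names `candidate_urls` and `plausible_length`, and the `tolerance` parameter are all assumptions introduced here.

```python
from itertools import product

# Hypothetical substitution table: English-side substring -> possible
# Chinese-side replacements. A real system would tune this per language pair.
SUBSTITUTIONS = {
    "english": ["big5", "chinese"],
    "_en": ["_ch", "_zh"],
}


def candidate_urls(url, substitutions=SUBSTITUTIONS):
    """Return the URLs reachable by substituting language-specific substrings."""
    present = [source for source in substitutions if source in url]
    # For each substring found, either keep it or swap in one replacement.
    options = [[source] + substitutions[source] for source in present]
    results = set()
    for choice in product(*options):
        candidate = url
        for source, target in zip(present, choice):
            candidate = candidate.replace(source, target)
        results.add(candidate)
    results.discard(url)  # the unchanged URL is not a candidate partner
    return results


def plausible_length(len_e, len_f, c=1.0, tolerance=0.3):
    """Check length(E) ~ c * length(F) within a relative tolerance (assumed)."""
    expected = c * len_f
    return abs(len_e - expected) <= tolerance * expected


if __name__ == "__main__":
    for candidate in sorted(candidate_urls("http://mysite.com/english/home_en.html")):
        print(candidate)
```

Each generated URL would then be checked against the set of pages actually found; only URLs that exist become candidate pairs.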

Slide 13: Structural Filtering. Linearize the HTML structure and ignore the actual linguistic content of the documents, then align the linearized sequences using a standard dynamic programming technique (Hunt and McIlroy 1975).
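A rough sketch of the linearize-then-align idea, under stated assumptions: the token names (`START:tag`, `END:tag`, `CHUNK`) and the function `structural_difference` are made up for illustration, and Python's `difflib.SequenceMatcher` stands in for the Hunt and McIlroy alignment (it is a different matching algorithm that serves the same purpose here).

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Turn an HTML document into a flat sequence of structural tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"START:{tag}")

    def handle_endtag(self, tag):
        self.tokens.append(f"END:{tag}")

    def handle_data(self, data):
        # Record only that a text chunk occurred; its words are ignored.
        if data.strip():
            self.tokens.append("CHUNK")


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def structural_difference(html_a, html_b):
    """Fraction of tokens left unaligned between the two linearized pages."""
    tokens_a, tokens_b = linearize(html_a), linearize(html_b)
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return 1.0 - matcher.ratio()


if __name__ == "__main__":
    english = "<html><body><h1>Hello</h1><p>Some text</p></body></html>"
    french = "<html><body><h1>Bonjour</h1><p>Du texte</p></body></html>"
    print(structural_difference(english, french))
```

A pair whose difference is small is structurally similar and so more likely to be a translation pair; the cutoff would have to be chosen empirically.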

Slide 16: STRAND Results (1). Recall in this setting is measured relative to the set of candidate pairs that was generated. For precision, the question asked was: "Was this pair of pages intended to provide the same content in the two different languages?" Asking the question in this way leads to high rates of interjudge agreement, as measured using Cohen's κ measure.
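For reference, Cohen's κ compares the observed agreement between two judges with the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it for two lists of labels; the function name and the example judgments are illustrative only and are not data from the evaluation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each judge's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges always used the same single label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge_1 = ["yes", "yes", "no", "no"]
    judge_2 = ["yes", "yes", "no", "yes"]
    print(cohens_kappa(judge_1, judge_2))
```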

Slide 17: STRAND Results (2). Using the manually set thresholds for dp and n, we have obtained 100% precision and 68.6% recall in an experiment using STRAND to find English-French Web pages (Resnik 1999). STRAND was also used to obtain English-Chinese pairs, and in a similar formal evaluation we found that the resulting set had 98% precision and 61% recall.

Slide 18: Assessing the STRAND Data. A translation lexicon automatically extracted from the French-English STRAND data could be combined productively with a bilingual French-English dictionary to improve retrieval results on a standard cross-language IR test collection (English queries against the CLEF-2000 French collection). Backing off from the dictionary to the STRAND translation lexicon accounted for over 8% of the lexicon matches (by token), reducing the number of untranslatable terms by a third and producing a statistically significant 12% relative improvement in mean average precision as compared to using the dictionary alone.

Slide 19: Three sets of items: (1) 30 human-translated sentence pairs from the FBIS (Release 1) English-Chinese parallel corpus, sampled at random; (2) 30 Chinese sentences from the FBIS corpus, sampled at random, paired with their English machine translation output from AltaVista's Babelfish (http://babelfish.altavista.com); (3) 30 paired items from Chinese-English Web data, sampled at random from "sentence-like" aligned chunks as identified using the HTML-based chunk alignment process of STRAND.

Slide 21: Optimizing Parameters Using Machine Learning. Using the English-French data, we constructed a ninefold cross-validation experiment using decision tree induction to predict the class assigned by the human judges. The decision tree software was the widely used C5.0.

Slide 22: Content-Based Matching. As Ma and Liberman (1999) point out, not all translators create translated pages that look like the original page. In addition, structure-based matching is applicable only in corpora that include markup, and other applications for translation detection exist. This motivates a generic score of translational similarity that is based upon any word-to-word translation lexicon: define a cross-language similarity score, tsim.

Slide 23: Quantifying Translational Similarity (Melamed's [2000] Method A). Let a link be a pair (x, y). Finding the best set of links is a maximum-weighted bipartite matching problem, which can be solved in O(max(|X|, |Y|)^3) time.
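The matching step can be illustrated with a small sketch. Two simplifying assumptions are made here and are not taken from the slide: every lexicon entry is given equal weight, so the maximum-weighted matching reduces to a maximum-cardinality matching; and tsim is taken to be the fraction of links in the best matching that join two words (words left unmatched are counted as links to NULL). The function names and the toy lexicon are hypothetical.

```python
def max_matching(words_x, words_y, lexicon):
    """Size of a maximum bipartite matching between two word lists.

    lexicon is a set of (x_word, y_word) translation pairs. Uses the
    simple augmenting-path method, adequate for short texts.
    """
    adjacency = [
        [j for j, y in enumerate(words_y) if (x, y) in lexicon]
        for x in words_x
    ]
    match_of_y = [None] * len(words_y)  # index of the x word linked to each y

    def try_augment(i, visited):
        for j in adjacency[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can move.
            if match_of_y[j] is None or try_augment(match_of_y[j], visited):
                match_of_y[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(words_x)))


def tsim(words_x, words_y, lexicon):
    """Assumed definition: two-word links divided by all links."""
    two_word_links = max_matching(words_x, words_y, lexicon)
    # Every unmatched word contributes one link to NULL.
    total_links = len(words_x) + len(words_y) - two_word_links
    return two_word_links / total_links if total_links else 0.0


if __name__ == "__main__":
    lexicon = {("the", "la"), ("house", "maison"), ("red", "rouge")}
    print(tsim(["the", "red", "house"], ["la", "grande", "maison", "rouge"], lexicon))
```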

Slide 25: Exploiting the Internet Archive. The Internet Archive (http://www.archive.org/web/researcher/) is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public. Size: 120 TB (+8 TB per month). The Wayback Machine is a temporal database, but the data are not stored in temporal order. Almost all data are stored in compressed plain-text files, so decompression is needed.

Slide 26: STRAND on the Archive. (1) Extracting URLs from index files using simple pattern matching; (2) combining the results from step 1 into a single huge list; (3) grouping URLs into buckets by handles; (4) generating candidate pairs from buckets.

Slide 27: URL similarity. We arrived at an algorithmically simple solution that avoids this problem but is still based on the idea of language-specific substrings (LSSs). The idea is to identify a set of language-specific URL substrings that pertain to the two languages of interest (e.g., based on language names, countries, character codeset labels, abbreviations, etc.).
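One way to picture the LSS idea together with bucketing by handles is the sketch below. It assumes that a handle is the URL with its language-specific substrings removed, and that candidate pairs are formed between URLs of different languages that share a handle. The LSS lists, the tokenization on punctuation, and the function names are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict
from itertools import product

# Hypothetical language-specific substrings (LSSs) for English and Arabic.
LSS = {
    "en": ["english", "en", "eng"],
    "ar": ["arabic", "ar", "ara"],
}
LANGUAGE_OF = {s: lang for lang, subs in LSS.items() for s in subs}
DELIMITERS = re.compile(r"([^A-Za-z0-9]+)")  # split on punctuation, keep it


def handle_for(url):
    """Return (handle, languages): the URL minus LSS tokens, and what they signal."""
    kept, languages = [], set()
    for token in DELIMITERS.split(url):
        language = LANGUAGE_OF.get(token.lower())
        if language:
            languages.add(language)
        else:
            kept.append(token)
    return "".join(kept), languages


def candidate_pairs(urls, lang_1="en", lang_2="ar"):
    """Bucket URLs by handle, then pair lang_1 URLs with lang_2 URLs per bucket."""
    buckets = defaultdict(lambda: defaultdict(list))
    for url in urls:
        handle, languages = handle_for(url)
        if len(languages) == 1:  # skip URLs with no LSS or mixed signals
            buckets[handle][next(iter(languages))].append(url)
    pairs = []
    for bucket in buckets.values():
        pairs.extend(product(bucket.get(lang_1, []), bucket.get(lang_2, [])))
    return pairs


if __name__ == "__main__":
    urls = [
        "http://example.com/english/index.html",
        "http://example.com/arabic/index.html",
        "http://example.com/about.html",
    ]
    print(candidate_pairs(urls))
```

Because each URL is processed once and pairs are formed only within a bucket, this avoids comparing every URL against every other URL in the list.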

Slide 29: Building an English-Arabic Corpus. Finding English-Arabic candidate pairs on the Internet Archive: 24 top-level national domains: Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw), ...; 21 .com domains: emiratesbank.com, checkpoint.com, ...

Slide 32: Future Work. Weights in the dictionary can be exploited. IR: IDF scores for discerning. Bootstrapping: seed to form high-precision initial classifiers.

Slide 33: Search Engine Estimate. According to www.searchengineshowdown.com (Dec. 31, 2002): Google 3,033 million; AltaVista 1,689 million; a ratio of 1:1.795. Estimates are based on exact counts obtained from AlltheWeb multiplied by the percentage of Relative Size Showdown as compared to the number found by AlltheWeb. Our estimate for Chinese page count: [Chinese text lost to encoding damage; surviving fragments: ASBC; ABC; 5915 terms; 7,844,439; 5919 terms; 2000; Google, AltaVista relative count 5021, 2181; Google, AltaVista 2.3]
