330 likes | 730 Views
adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a ... Finding English-Arabic Candidate Pairs on the Internet Archive ...
E N D
Slide 1:The Web as a Parallel Corpus Philip Resnik, Noah A. Smith, Computational Linguistics, 29, 3, pp. 349 380, MIT Press,2004.
University of Maryland, Johns Hopkins University
Slide 2:Abstract STRAND system for mining parallel text on the World Wide Web
reviewing the original algorithm and results
presenting a set of significant enhancements
use of supervised learning based on structural features of documents to improve classification performance
new content-based measure of translational equivalence
adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale
construction of a significant parallel corpus for a low-density language pair
Slide 3:Introduction Parallel corpora, bitexts;
for automatic lexical acquisition (Gale and Church 1991; Melamed 1997)
provide indispensable training data for statistical translation models (Brown et al. 1990; Melamed 2000; Och and Ney 2002)
provide the connection between vocabularies in cross-language information retrieval (Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997)
Slide 4:Recent works at UM&JHU exploit parallel corpora in order to develop monolingual resources and tools, using a process of annotation, projection, and training
given a parallel corpus in English and a less resource-rich language
project English annotations across the parallel corpus to the second language
using word-level alignments as the bridge, and then use robust statistical techniques in learning from the resulting noisy annotations
(Cabezas, Dorr, and Resnik 2001; Diab and Resnik 2002;Hwa et al. 2002; Lopez et al. 2002; Yarowsky, Ngai, and Wicentowski 2001; Yarowsky and Ngai 2001; Riloff, Schafer, and Yarowsky 2002).
Slide 5:parallel corpora as a critical resource not readily available in the necessary quantities
heavily on French-English
because the Canadian parliamentary proceedings (Hansards) in English and French were the only large bitext available
United Nations proceedings (LDC)
religious texts (Resnik, Olsen, and Diab 1999)
software manuals (Resnik and Melamed 1997; Menezes and Richardson 2001)
tend to be unbalanced, representing primarily governmental or newswire-style texts
Slide 6:World Wide Web People tend to see the Web as a reflection of their own way of viewing the world
a huge semantic network
an enormous historical archive
a grand social experiment
a great big body of text waiting to be mined
a huge fabric of linguistic data often interwoven with parallel threads
Slide 7:STRAND (Resnik 1998, 1999)
structural translation recognition acquiring natural data
identify pairs of Web pages that are mutual translations
Incorporating new work on content-based detection of translations (Smith 2001, 2002)
efficient exploitation of the Internet Archive
Slide 9:Finding parallel text on the Web Location of pages that might have parallel translations
Generation of candidate pairs that might be translations
Structural filtering out of nontranslation candidate pairs
Slide 10:Locating Pages AltaVista search engines advanced search to search for two types of Web pages: parents and siblings.
Ask AV with regular expressions
(anchor:"english" OR anchor:"anglais")
(anchor:"french" OR anchor:"francais").
spider component
The results reported here do not make use of the spider.
Slide 12:Generating Candidate Pairs with URL
http://mysite.com/english/home en.html, on which one combination of substitutions might produce the URLhttp://mysite.com/big5/home ch.html.
Another possible criterion for matching is the use of document lengths.
text E in language 1 and text F in language 2, length(E) = C * length(F), where C is a constant tuned for the language pair
Slide 13:Structural Filtering linearize the HTML structure and ignore the actual linguistic content of the documents
align the linearized sequences using a standard dynamic programming technique (Hunt and McIlroy 1975)
Slide 16:STRAND Results (1) Recall in this setting is measured relative to the set of candidate pairs that was generated
Precision
Was this pair of pages intended to provide the same content in the two different languages?
Asking the question in this way leads to high rates of interjudge agreement, as measured using Cohens measure
Slide 17:STRAND Results (2) Using the manually set thresholds for dp and n,we have obtained 100% precision and 68.6% recall in an experiment using STRAND to find English-French Web pages (Resnik 1999)
to obtain English-Chinese pairs and in a similar formal evaluation, we found that the resulting set had 98% precision and 61% recall
Slide 18:Assessing the STRAND Data a translation lexicon automatically extracted from the French-English STRAND data could be combined productively with a bilingual French-English dictionary in order to improve retrieval results using a standard cross-language IR test collection (English queries against the CLEF-2000 French collection)
backing off from the dictionary to the STRAND translation lexicon accounted for over 8% of the lexicon matches (by token)
reducing the number of untranslatable terms by a third and producing a statistically significant 12% relative improvement in mean average precision as compared to using the dictionary alone
Slide 19:30 human-translated sentence pairs from the FBIS (Release 1) English-Chinese parallel corpus, sampled at random.
30 Chinese sentences from the FBIS corpus, sampled at random, paired with their English machine translation output from AltaVistas Babelfishhttp://babel.sh.altavista.com
30 paired items from Chinese-English Web data, sampled at random from sentence-like aligned chunks as identified using the HTML-based chunk alignment process of STRAND
Slide 21:Optimizing Parameters Using Machine Learning Using the English-French data, we constructed a ninefold cross-validation experiment using decision tree induction to predict the class assigned by the human judges
The decision tree software was the widely used C5.0
Slide 22:Content-Based Matching Ma and Liberman (1999) point out, not all translators create translated pages that look like the original page
structure-based matching is applicable only in corpora that include markup
other applications for translation detection exist
a generic score of translational similarity that is based upon any word-to-word translation lexicon
Define a CL sim score, tsim
Slide 23:Quantifying Translational Similarity (Melameds [2000] Method A)
link be a pair (x, y)
maximum-weighted bipartitie matching problem O(max(|X|,|Y|)3)
Slide 25:Exploiting the Internet Archive The Internet Archive 120TB (+8TB per month) http://www.archive.org/web/researcher/ is a nonprofit organization attempting to archive the entire publicly available Web, preserving the content and providing free access to researchers, historians, scholars, and the general public.
Wayback Machine
Temporal database, but not stored in temporal order
Need decompression
Almost data are stored in compressed plain-text files
Slide 26:STRAND on the Archive Extracting URLs from index files using simple pattern matching
Combining the results from step 1 into a single huge list
Grouping URLs into buckets by handles
Generating candidate pairs from buckets
Slide 27:URLs similarity We arrived at an algorithmically simple solution that avoids this problem but is still based on the idea of language-specific substrings (LSSs).
The idea is to identify a set of language-specific URL substrings that pertain to the two languages of interest (e.g., based on language names, countries, character codeset labels, abbreviations, etc.)
Slide 29:Building an English-Arabic Corpus Finding English-Arabic Candidate Pairs on the Internet Archive
24 top-level national domains
Egypt (.eg), Saudi Arabia (.sa), Kuwait (.kw)
21 .com
emiratesbank.com, checkpoint.com
Slide 32:Future Work Weight in the dictionary can be exploited
IR IDF scores for discerning
Bootstrapping
Seed to form high-precision initial classifiers
Slide 33:Search Engine Estimate According to (Dec. 31, 2002) www.searchengineshowdown.com
Google 3,033 millions
Altavista 1,689 millions 1:1.795estimates are based on exact counts obtained from AlltheWeb multiplied by the percentage of Relative Size Showdown as compared to the number found by AlltheWeb
Our estimate for Chinese page count
ASBC?? ???? ABC ??? ??? ??????5915?term[???] (???7,844,439?)??5919?term???2000??google?altavista?relative count 5021? 2181?google?altavista?2.3?