Finding parallel texts on the web using cross-language information retrieval

Achim Ruopp • Joint work with Fei Xia • University of Washington

An early parallel text

Uses for Parallel Corpora • Parallel corpora are valuable resources for natural language processing (NLP) • Machine translation • Cross-lingual information retrieval (IR) • E.g. PanImages from the University of Washington • Cross-lingual image search system • http://www.panimages.org/ • Computer Aided Human Translation • Monolingual NLP via information projection • …

What is a Parallel Text or Parallel Corpus? • Translated text/documents in two languages • Ideally sentence-aligned (e.g. using method from Gale & Church 1993)

Examples for Parallel Corpora • EUROPARL - European parliament proceedings • 10 language pairs • About 44 million words/language • Canadian parliament proceedings (Hansard) • English – French • Software documentation in multiple languages • …

Motivation • Problem: Parallel corpora exist only for a limited set of language pairs • Problem: Available parallel corpora are often very domain-specific • Problem: Available parallel corpora are often small • Task: Finding parallel texts on the Web

Example & Walk-Through • Previous work: • Ma and Liberman (1999) • Chen and Nie (2000) • Resnik and Smith (2003)

Main Steps in Identifying Parallel Text on the Web (Resnik and Smith, 2003) • Locating pages that might have parallel translations • Generating candidate page pairs that might be translations • Filtering out non-translation candidate pairs

Our approach • Locating pages that might have parallel translations: • Sampling by sending queries • Generating candidate page pairs that might be translations: • Comparing URLs with different matching methods • Filtering out non-translation candidate pairs: • Combining structural and content-based filtering

System Overview

Outline • System description (1a) Sampling the source language L1 (1b) Checking pages in the target language L2 (2) Matching pages in L1 and L2 (3) Filtering page pairs • Experiments • Conclusion and future work

(1a) One-term Sample • Sample • Search engine query of one term • Limited to pages in source language • Optional parameter: inurl:<2-letter language ID> • Submitted to search engine API • Search engine does automatic stemming • 100 pages in result set

(1a) Choosing Terms • Dictionary • Built using Giza++ word alignment tool • Trained on years 2001-2003 of the Europarl corpus • Contains IBM Model 1 translation probabilities • Sampling term • Selected from source language vocabulary • Mid-frequency term • Selected at random using a normal distribution • Goal: Avoid domain-specificity
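One way to read "mid-frequency term selected at random using a normal distribution" is to draw a rank from a normal distribution centred on the middle of the frequency-sorted vocabulary. The sketch below follows that reading; the function name, the `spread` parameter and the choice of the middle rank as the mean are assumptions for illustration, not details given on the slide.

```python
import random

def pick_sampling_term(vocab_by_freq, rng, spread=0.15):
    """Pick a term near the middle of a frequency-sorted vocabulary.

    vocab_by_freq: terms ordered from most to least frequent.
    spread: standard deviation as a fraction of the vocabulary size
            (hypothetical parameter, not from the slides).
    """
    n = len(vocab_by_freq)
    while True:
        # Draw a rank centred on the middle of the list; redraw if out of range.
        rank = int(rng.gauss(n / 2, spread * n))
        if 0 <= rank < n:
            return vocab_by_freq[rank]

vocab = [f"w{i}" for i in range(1000)]   # toy vocabulary, most frequent first
rng = random.Random(0)
term = pick_sampling_term(vocab, rng)
```

Very frequent terms (function words) and very rare terms (often domain-specific) are both unlikely to be drawn, which matches the stated goal of avoiding domain-specificity.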

(1a) Source Language Expansion • From one-term to n-term queries • Common IR query expansion technique • Based on page summaries returned by the one-term sampling query • Summary terms ranked by frequency • Leads to semantically related terms because of relevancy ranking of search engine results • Examples: “shannon” → “information claude”; “inconveniences” → “security travelers” • Original term, expanded with one or more expansion terms, is re-submitted to the search engine
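The expansion step can be sketched as counting terms in the result summaries and appending the most frequent ones to the original term. This is a minimal illustration; the tokenization, the `stopwords` argument and the function name are assumptions, and the summaries are made-up strings rather than real search results.

```python
import re
from collections import Counter

def expand_query(term, summaries, n_extra=2, stopwords=frozenset()):
    """Append the n_extra most frequent summary terms to the original term."""
    counts = Counter()
    for summary in summaries:
        for tok in re.findall(r"\w+", summary.lower()):
            # Skip the sampling term itself and any stopwords.
            if tok != term.lower() and tok not in stopwords:
                counts[tok] += 1
    extra = [tok for tok, _ in counts.most_common(n_extra)]
    return " ".join([term] + extra)

# Made-up summaries for the one-term query "shannon".
summaries = [
    "Claude Shannon information theory",
    "information entropy Shannon",
    "Claude Shannon information",
]
expanded = expand_query("shannon", summaries)
```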

(1b) Checking Query • Sampling query terms translated using the Giza++ dictionary • Example: “inconvenience security travelers” → “unannehmlichkeit sicherheit” • m-best translations of n sampling terms lead to m^n checking queries • Optional parameter: inurl:<2-letter language ID>

(1b) Target language expansion • Alternative to translating a complete n-term sampling query • Only translate original one-term sample • Expand on target language side equivalently as on source language side • m checking queries instead of m^n • Efficiency vs. source language expansion evaluated in experiments
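Taking the m best translations of each of the n sampling terms and combining them in every possible way gives the full set of checking queries. The sketch below shows the combinatorics; the dictionary entries are made-up illustrations, not actual Giza++ output, and the function name is hypothetical.

```python
from itertools import product

def checking_queries(terms, dictionary, m=2):
    """Build one checking query per combination of m-best translations.

    dictionary maps a source term to its translations, best first.
    """
    options = [dictionary[t][:m] for t in terms]
    return [" ".join(combo) for combo in product(*options)]

# Illustrative entries only.
toy_dictionary = {
    "security": ["sicherheit", "schutz"],
    "travelers": ["reisende", "reisenden"],
}
queries = checking_queries(["security", "travelers"], toy_dictionary, m=2)
```

With m = 2 and n = 2 this yields four queries. The exponential growth in n is the reason the number of queries quickly becomes costly for longer sampling queries.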

(1b) “site:” Parameter • Optional: site parameter • Allows sites retrieved in checking query to be restricted to sites returned in sampling query • Search engine limits to sites of first 30 sampling query page results

(2) Matching URLs with Fixed Language List • URLs from corresponding sampling and checking result sets • Considered a match if they differ only in language IDs taken from a fixed list
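A possible implementation of fixed-list matching splits both URLs into alphanumeric parts and separators, then checks that every differing part is a known language ID on each side. The ID lists and the function name below are assumptions for illustration; the slides do not show the actual list.

```python
import re

# Hypothetical language ID lists; the real list is not given in the slides.
LANG_IDS = {
    "en": {"en", "eng", "english"},
    "de": {"de", "ger", "deu", "german", "deutsch"},
}

def matches_fixed_list(url1, url2, l1="en", l2="de"):
    """True if the URLs differ only in parts that are language IDs."""
    # The capturing group keeps separators, so positions stay aligned.
    parts1 = re.split(r"([^A-Za-z0-9]+)", url1)
    parts2 = re.split(r"([^A-Za-z0-9]+)", url2)
    if len(parts1) != len(parts2):
        return False
    differing = [(a, b) for a, b in zip(parts1, parts2) if a != b]
    return bool(differing) and all(
        a.lower() in LANG_IDS[l1] and b.lower() in LANG_IDS[l2]
        for a, b in differing
    )

en_url = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html"
de_url = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html"
is_match = matches_fixed_list(en_url, de_url)
```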

(2) Matching URLs with Levenshtein Distance • Levenshtein distance • Also known as “edit distance” • URLs from corresponding sampling and checking result sets • Considered a match when URLs have a Levenshtein distance less than or equal to 4, but greater than 0 • Example: http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html

(2) URL part substitution • Sampling L1 → source URLs • Replacing L1 names/ids in each source URL with L2 names/ids → target URLs • Checking whether the target URLs exist • Does not require checking queries!
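The matching rule can be illustrated with a standard dynamic-programming edit distance and the threshold stated on the slide. The function names are illustrative, and this is a textbook implementation rather than the code used in the experiments.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

def is_url_match(url1, url2, max_dist=4):
    """Match when the distance is greater than 0 and at most max_dist."""
    return 0 < levenshtein(url1, url2) <= max_dist

en_url = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html"
de_url = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_de.html"
distance = levenshtein(en_url, de_url)
```

For the two example URLs the distance is 2 ("en" becomes "de" with two substitutions), so they count as a match. The lower bound of 0 excludes identical URLs.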
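The substitution step can be sketched as a regular-expression replacement of stand-alone language IDs. The ID pairs and the function name are assumptions; checking whether each candidate URL actually exists would need an HTTP request and is left out here.

```python
import re

def substitute_language(url, l1_ids=("en", "english"), l2_ids=("de", "german")):
    """Generate candidate target URLs by swapping L1 IDs for L2 IDs.

    The ID pairs are illustrative defaults, not the list used in the talk.
    """
    candidates = []
    for src, tgt in zip(l1_ids, l2_ids):
        # Only replace the ID when it is not part of a longer alphanumeric token.
        pattern = re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(src) + r"(?![A-Za-z0-9])"
        )
        new_url = pattern.sub(tgt, url)
        if new_url != url:
            candidates.append(new_url)
    return candidates

source_url = "http://ec.europa.eu/education/policies/rec_qual/recognition/diploma_en.html"
targets = substitute_language(source_url)
```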

(3) Filtering page pairs • Structural filtering (Resnik and Smith, 2003) • Content translation metric (Ma and Liberman, 1999) • Linear combination

(3) Linearization • HTML file: <HTML> <HEAD> <META> … </META> <TITLE> … </TITLE> </HTML> • Linearized file: [START:HTML] [START:HEAD] [START:META] [Chunk:12] [END:META] [START:TITLE] [Chunk:25] [END:TITLE] [END:HTML]

(3) Alignment • Linearized source file: [START:HTML] [START:HEAD] [START:META] [END:META] [START:LINK] [END:LINK] [START:TITLE] [Chunk:68] [END:TITLE] [START:META] … • Linearized target file: [START:HTML] [START:HEAD] [START:META] [END:META] [START:TITLE] [Chunk:58] [END:TITLE] [START:META] …
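Linearization in this style can be approximated with Python's standard `html.parser` module: each start tag, end tag and non-empty text chunk becomes one token, with chunks represented only by their length in characters. The class and function names are illustrative, and details such as whitespace handling are assumptions.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Turn an HTML document into a flat sequence of markup and chunk tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"[START:{tag.upper()}]")

    def handle_endtag(self, tag):
        self.tokens.append(f"[END:{tag.upper()}]")

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.tokens.append(f"[Chunk:{len(text)}]")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

tokens = linearize("<html><body><p>Hello world</p></body></html>")
```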

(3) Structural Metrics • Difference percentage (dp) • Measures how different markup in linearized files is • Based on longest common subsequence algorithm (Hunt & McIlroy 1976) • Implemented using diff tool • Length correlation of aligned non-markup chunks (r) • Pearson correlation coefficient over all aligned chunks in a file pair • Length of content in characters
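Both structural metrics can be sketched in a few lines. The slides do not give the exact formula for dp, so the definition below (share of tokens outside the longest common subsequence) is an assumption, and a small dynamic-programming LCS stands in for the diff tool. The Pearson coefficient is the standard one.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def difference_percentage(a, b):
    """Assumed definition: fraction of tokens not shared by the two files."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * lcs_length(a, b) / total

def pearson(xs, ys):
    """Pearson correlation of aligned chunk lengths (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

src = ["[START:HTML]", "[START:HEAD]", "[START:LINK]", "[END:LINK]",
       "[END:HEAD]", "[END:HTML]"]
tgt = ["[START:HTML]", "[START:HEAD]", "[END:HEAD]", "[END:HTML]"]
dp = difference_percentage(src, tgt)       # 2 of 10 tokens are unshared
r = pearson([10, 20, 30], [12, 24, 36])    # toy chunk lengths
```

A low dp together with a high r suggests the two pages share the same layout and that their text chunks grow and shrink together, as expected for translations.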

(3) Content Translation Metric • Calculated on first 500 content words on page • Using the Giza++ translation dictionary
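The slide does not spell out how the content metric is computed. One plausible sketch, loosely in the spirit of a dictionary-based translation check, is the share of the first 500 source words that have at least one dictionary translation among the first 500 target words. The scoring rule, function name and dictionary entries below are all assumptions for illustration.

```python
def content_score(src_words, tgt_words, dictionary, limit=500):
    """Fraction of source words with a dictionary translation on the target page.

    Assumed scoring rule; the slides only state that the first 500 content
    words and the translation dictionary are used.
    """
    src = src_words[:limit]
    tgt = set(tgt_words[:limit])
    if not src:
        return 0.0
    hits = sum(
        1 for word in src
        if any(translation in tgt for translation in dictionary.get(word, ()))
    )
    return hits / len(src)

# Illustrative dictionary entries and page contents.
toy_dictionary = {
    "security": ["sicherheit"],
    "travelers": ["reisende"],
    "delay": ["verspätung"],
}
c = content_score(
    ["security", "travelers", "xyz", "delay"],
    ["sicherheit", "reisende"],
    toy_dictionary,
)
```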

(3) Combining Two Kinds of Metrics • Structural metrics: dp and r • Content-based metric: c • Linear combination:

Different settings for experiments (1a) Sampling the source language L1 • Source expansion • The “inurl:” parameter (1b) Checking pages in the target language L2 • Target expansion • The “inurl:” and “site:” parameters (2) Matching pages in L1 and L2 • Using a fixed list • Edit distance • URL part substitution

Experiment Results – Matches

Observations – Sampling and Checking • Query expansion increases the number of page pairs • Source and target query expansion lead to similar results • Difference between n=2 and n=3 is not significant • Possible explanation: Larger semantic divergence of queries on the source and target language sides • Using site: and inurl: search parameters increases the number of discovered page pairs • But: these parameters might miss candidate pairs that don’t follow the expected pattern

Observations – Matching • Number of page pair candidates • URL part substitution >> Levenshtein distance • Levenshtein distance > Fixed language list • Matching methods that use checking queries are heavily impacted by relevancy rankings • Levenshtein distance matching method • Allows learning of URL patterns used for parallel pages

Experiment Results – Filtering

Observations – Filtering • Combined filter • Evaluated in comparison to human judge • Precision “How many did we get right?” • 88.9% • Encouraging on noisy test set • Recall “How many did we miss?” • 36.4% • Low recall can be compensated for by submitting more queries
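As a reminder of how the two figures are defined, precision and recall can be computed from the set of page pairs the filter accepts and the set a human judge marks as true translations. The pair labels below are toy data chosen for illustration and are unrelated to the reported results.

```python
def precision_recall(accepted, gold):
    """Precision and recall of accepted page pairs against human judgments."""
    true_positives = len(accepted & gold)
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: pairs accepted by the filter vs. pairs judged parallel by a human.
accepted = {"pair1", "pair2", "pair3", "pair9"}
gold = {"pair1", "pair2", "pair3", "pair4", "pair5", "pair6"}
precision, recall = precision_recall(accepted, gold)
```

Here three of the four accepted pairs are correct (precision 0.75), while only three of the six true pairs are found (recall 0.5).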

Conclusions • It is possible to reliably gather parallel pages using commercial search engines • Even though there are no standard features identifying these pages • Despite the relevance ranking of commercial search engine results

Future work • To improve the precision and recall of the filtering step. • To address the relevancy ranking and the page limit problem • To study whether some queries are more productive than others • To test the usefulness of the collected page pairs on applications such as MT.

Additional Slides

Experiments & Results

Languages on the Web Source: http://www.glreach.com/globstats/index.php3

Questions Asked • How do we find parallel pages in the sea of mostly monolingual pages? • What is the share of parallel pages for a given language pair?

Estimating the Percentage of Parallel Pages for a Language Pair P(DE|E)=0.03% P(ED|D)=0.27%

References • IJCNLP 2008 paper and presentation • http://search.iiit.ac.in/CLIA2008/accepted_papers.php • Email • mailto:achim@digitalsilkroad.net • MSR internship information • http://research.microsoft.com/aboutmsr/jobs/internships/default.aspx
