Iterative Translation Disambiguation for Cross-Language Information Retrieval Advisor: Dr. Hsu Presenter: Yu-San Hsieh Author: Christof Monz and Bonnie J. Dorr SIGIR 2005, pp. 520-527
Outline • Motivation • Objective • Approach • Experiment Result • Introduction • Experiment • Conclusions
Motivation • Many words or phrases in one language can be translated into another language in a number of ways, so translation ambiguity is very common and hurts the effectiveness of information retrieval. • Example: Penalty (English) => Elfmeter (soccer) or Strafe (punishment)
Objective • Find a distribution of translation probabilities that resolves the translation ambiguity problem.
Approach • [Figure: translation graph for an example query, linking the English terms europe, trade, and union to German candidate translations such as europa, gewerbe, geschaeft, handel, union, and gewerkschaft] • Find a proper distribution of translation probabilities. • Computing term weights (sketched below) • Initialization step • Iteration step • Normalization step • Keep all term weights in a vector • Iteration stops when the weight vector no longer changes
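To make the term-weight computation concrete, here is a minimal Python sketch of the initialization, iteration, and normalization steps, assuming `translations` maps each source-language query term to its dictionary candidates and `assoc` is a hypothetical callable that scores the association between two target-language terms (e.g. one of the measures on the next slide). The update rule and the convergence threshold `eps` are my reading of the slide's steps, not the paper's exact formulas.

```python
def disambiguate(translations, assoc, max_iter=20, eps=1e-4):
    # Initialization step: uniform weight over each source term's candidates.
    weights = {
        s: {t: 1.0 / len(cands) for t in cands}
        for s, cands in translations.items()
    }

    for _ in range(max_iter):
        new_weights = {}
        for s, cands in translations.items():
            new_weights[s] = {}
            for t in cands:
                # Iteration step: reinforce t by its weighted association with
                # the candidate translations of the *other* query terms.
                support = sum(
                    weights[s2][t2] * assoc(t, t2)
                    for s2, cands2 in translations.items() if s2 != s
                    for t2 in cands2
                )
                new_weights[s][t] = weights[s][t] + support

            # Normalization step: each source term's weights sum to 1.
            total = sum(new_weights[s].values()) or 1.0
            for t in new_weights[s]:
                new_weights[s][t] /= total

        # Iteration stops when no entry of the weight vector moves by more than eps.
        delta = max(
            abs(new_weights[s][t] - weights[s][t])
            for s in weights for t in weights[s]
        )
        weights = new_weights
        if delta < eps:
            break

    return weights
```

For instance, `disambiguate({"penalty": ["elfmeter", "strafe"], "kick": ["tritt", "stoss"]}, assoc)` (a hypothetical dictionary) would return one weight per candidate translation of each query term.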
Approach • Measuring association strength (sketched below) • Pointwise mutual information • Dice coefficient • Log likelihood ratio
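A hedged sketch of the three association measures, computed from raw counts: `fxy` is the co-occurrence frequency of two target-language terms, `fx` and `fy` their individual frequencies, and `n` the number of counting units (e.g. sentences) in the corpus. The log likelihood ratio follows Dunning's standard 2x2 contingency-table form; the paper's exact parameterization may differ.

```python
import math

def pmi(fxy, fx, fy, n):
    """Pointwise mutual information; returns -inf for unseen pairs."""
    return math.log2((fxy * n) / (fx * fy)) if fxy > 0 else float("-inf")

def dice(fxy, fx, fy):
    """Dice coefficient."""
    return 2.0 * fxy / (fx + fy)

def llr(fxy, fx, fy, n):
    """Log likelihood ratio (G^2) from the 2x2 contingency table."""
    a, b = fxy, fx - fxy                  # y present / absent, given x present
    c, d = fy - fxy, n - fx - fy + fxy    # y present / absent, given x absent

    def term(k, expected):                # k * ln(k / expected), 0 when k == 0
        return k * math.log(k / expected) if k > 0 else 0.0

    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return 2.0 * (term(a, row1 * col1 / n) + term(b, row1 * col2 / n) +
                  term(c, row2 * col1 / n) + term(d, row2 * col2 / n))
```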
Experiment Result • [Figure: retrieval effectiveness of each association measure compared against the baseline, showing improvements, differences, and results for individual queries (topics)]
Introduction • Two techniques for cross-language retrieval • Translate the document collection into the target language and apply monolingual retrieval • Translate the query into the target language and retrieve with the translated query • Three approaches may be used to produce the translations • Machine translation system • Dictionary • Parallel corpus to estimate translation probabilities
Introduction • A word in one language can be translated into another language in a number of ways. • Penalty (English) => Elfmeter (soccer) or Strafe (punishment)
Introduction • An approach that can solve the word-selection problem is to use co-occurrences between terms. • Problem (with a larger number of terms): data sparseness • Possible remedies (see the counting sketch after this slide): • Use very large corpora for counting co-occurrence frequencies • Use internet search engines • Smoothing
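As an illustration of where the co-occurrence counts come from and how smoothing keeps unseen pairs from scoring zero, here is a minimal sketch over a tokenized target-language corpus; the window size and smoothing constant are assumptions, not values from the paper.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=5):
    # Count term and pair frequencies within a sliding window of each sentence.
    pair_counts, term_counts = Counter(), Counter()
    for tokens in sentences:
        term_counts.update(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if w != v:
                    pair_counts[frozenset((w, v))] += 1
    return pair_counts, term_counts

def smoothed_cofreq(pair_counts, w, v, alpha=0.5):
    # Additive smoothing: every pair receives alpha pseudo-counts.
    return pair_counts[frozenset((w, v))] + alpha
```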
Experiment • Test data • CLEF 2003 English-to-German bilingual data • 56 topics chosen (title, description, narrative) • Morphological normalization • Source-language words (topics) are normalized to match entries in the bilingual dictionary • De-compounding with character 5-grams: assign weights to 5-gram substrings (see the sketch after this slide)
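A hedged sketch of the character-5-gram idea for de-compounding: a German compound missing from the dictionary is split into overlapping 5-character substrings, and each substring found in the dictionary receives a weight. The equal-share weighting below is an illustrative assumption, not the paper's exact scheme.

```python
def five_grams(word, n=5):
    # All overlapping character n-grams of the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def weighted_substrings(word, dictionary, n=5):
    matched = [g for g in five_grams(word.lower(), n) if g in dictionary]
    # Each matched 5-gram receives an equal share of the compound's weight.
    return {g: 1.0 / len(matched) for g in matched} if matched else {}
```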
Experiment • Retrieval model • Lnu.ltc weighting scheme • Weighted document similarity • Statistical significance • Bootstrap method: draw bootstrap samples and apply one-tailed significance testing to compare two retrieval methods (see the sketch after this slide)
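A minimal sketch of a one-tailed bootstrap test for comparing two retrieval runs on per-topic scores (e.g. average precision), assuming the standard shift-to-the-null resampling procedure; the number of bootstrap samples is an arbitrary choice, not taken from the paper.

```python
import random

def bootstrap_test(scores_a, scores_b, samples=1000, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    shifted = [d - observed for d in diffs]      # enforce the null hypothesis
    hits = 0
    for _ in range(samples):
        resample = [rng.choice(shifted) for _ in diffs]
        if sum(resample) / len(resample) >= observed:
            hits += 1
    return hits / samples                        # one-tailed p-value
```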
Experiment • Problems found in the experiments • The individual average precision of the log likelihood ratio decreases for a number of queries. • Unknown words: when a source-language word is not found in the dictionary, the original word is included in the target-language query and assigned a weight of 1. • Example: topic "Women's Conference Beijing" • "Women" (a proper noun in this topic) is normalized, but neither "woman" nor "women" is found in the dictionary, so the original word is kept in the target-language query with weight 1. • Result: "women" dominates the document similarity, and most top-ranked documents contain "women" as the only matching term.
Conclusions • Our approach improves retrieval effectiveness compared to a baseline using bilingual dictionary lookup. • Experimental results show that the log likelihood ratio has the strongest positive impact.
My opinion • Advantage: • It only requires a bilingual dictionary and a monolingual corpus in the target language. • Disadvantage: • Unknown word • Apply