Simultaneous Multilingual Search for Translingual Information Retrieval
Kristen Parton¹, Kathleen McKeown¹, James Allan², Enrique Henestroza¹
Motivation: Cross-Lingual IR
• User needs to search documents in other languages
[Diagram: Query in User Language ("stereotypes of Arabs") → Documents → Search Results in Document Language(s) (الملكة رانيا العبد الله تناقش الصورة النمطية عن العرب)]
Task Redefinition: Translingual IR
• User needs to search documents in other languages and get back translated results
• For translingual applications, integrating CLIR and result translation can improve both relevance and translation quality
[Diagram: Query in User Language ("stereotypes of Arabs") → Documents → Search Results in User Language ("Queen Rania Al-Abdullah discusses stereotypes of Arabs")]
Outline
• Approaches to CLIR
• SMLIR for Translingual IR
• Query-Directed MT Post-Editing
• System Evaluation
• Conclusions and Future Work
Approaches to CLIR
• Map query and/or documents to common representation
  Query: Schwarzenegger
  Doc1: فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء
  Doc2: يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
  Doc3: ...الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
• Document translation (DT) + pre-translation query expansion
  Query: Schwarzenegger Schwarznegger Schwartzenegger ...
  Doc1: The failure of all proposals made by Schwarzenegger in a referendum
  Doc2: It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement […]
  Doc3: … besides the star and the governor of the state of California Arnold Schwarznegger .
• Query translation (QT) + post-translation query expansion
  Query: Schwarzenegger Schwarznegger Schwartzenegger ... شفارتزنيغر شوارزنجر شوارزنيجر شوارزينيجر
  Doc1, Doc2, Doc3: the source-language documents listed under the first bullet
Query Translation vs. Document Translation
• Trade-offs
  • Translation resources
    • Approximate DT [Oard 00], [Chen 04]
  • Translation quality
  • Handling synonymy
• Hybrid methods
  • [McCarley 99], [Chen & Gey 04]: Run QT and DT searches, merge results and rerank
  • [Wang & Oard 06]: Use bidirectional word alignments to capture information from QT and DT
Hybrid Merged Method
• Merge and re-rank results of two searches [McCarley 99]
  • DT: Query + indexed document translations
  • QT: Translated query + indexed source documents
• Problems
  • Different document lengths, query lengths
  • Raw IR scores not comparable across queries
  • Many ways to re-rank, merge searches
[Diagram: Merged Results, ranked Doc2, Doc3, Doc1]
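The merge step can be sketched as follows. This is only one of the "many ways" the slide mentions, not necessarily the method of [McCarley 99]: it assumes min-max normalization of each run followed by a weighted sum. The function names, the `dt_weight` parameter, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one run's raw scores to [0, 1] so two runs become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every retrieved document as a full match.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_runs(
    dt_run: Dict[str, float],
    qt_run: Dict[str, float],
    dt_weight: float = 0.5,  # assumed interpolation weight, not from the slides
) -> List[Tuple[str, float]]:
    """Merge a DT run and a QT run into a single re-ranked list."""
    dt_norm = min_max_normalize(dt_run)
    qt_norm = min_max_normalize(qt_run)
    merged = {
        doc: dt_weight * dt_norm.get(doc, 0.0)
        + (1.0 - dt_weight) * qt_norm.get(doc, 0.0)
        for doc in set(dt_norm) | set(qt_norm)
    }
    # Highest merged score first; ties broken by document id for determinism.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical raw scores from the two separate searches.
dt_scores = {"Doc1": -6.0, "Doc2": -4.1, "Doc3": -5.0}
qt_scores = {"Doc1": -7.9, "Doc2": -7.5, "Doc3": -7.6}
print([doc for doc, _ in merge_runs(dt_scores, qt_scores)])
# prints ['Doc2', 'Doc3', 'Doc1']
```

The normalization step is exactly the extra machinery that the slide lists as a problem: the raw scores of the two searches are on different scales, so something has to map them onto a common one before they can be combined.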
Simultaneous Multilingual IR (SMLIR)
• Indexed document: source + document translation
• Query: original query + query translations (+expansions)
  Query: Schwarzenegger Schwarznegger شفارتزنيغر شوارزنجر شوارزنيجر شوارزينيجر
  Doc1: The failure of all proposals made by Schwarzenegger in a referendum
        فشل كل الاقتراحات التي عرضها شوارزينغر في استفتاء
  Doc2: It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement […]
        يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة ...
  Doc3: … besides the star and the governor of the state of California Arnold Schwarznegger .
        ...الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
Simultaneous Multilingual IR (SMLIR)
• Multilingual (probabilistic) structured queries
  • Treat query term and its translations as synonyms
• SMLIR Hybrid vs. Merged Hybrid
  • No need for re-ranking or raw score normalization
  • Single index, one search
  • Query time comparable to Merged in practice
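A structured query of this kind could be assembled as in the sketch below. It assumes the Indri query-language operators `#combine`, `#syn` and `#wsyn`; the slides do not say which operators the authors used. The helper names and the weight of 1.00 on the original term are hypothetical.

```python
from typing import Dict, List, Tuple


def term_group(term: str, translations: Dict[str, float], weighted: bool = True) -> str:
    """Wrap a query term and its translations in one synonym operator."""
    if not translations:
        return term
    if weighted:
        # Probabilistic variant: each translation carries its probability.
        # The 1.00 weight on the original term is an assumption.
        parts = [f"1.00 {term}"] + [f"{p:.2f} {t}" for t, p in translations.items()]
        return "#wsyn( " + " ".join(parts) + " )"
    # Unweighted variant, e.g. for translations that have no probabilities.
    return "#syn( " + " ".join([term, *translations]) + " )"


def build_smlir_query(
    terms: List[Tuple[str, Dict[str, float]]], weighted: bool = True
) -> str:
    """Combine all synonym groups into a single multilingual query string."""
    groups = [term_group(term, trans, weighted) for term, trans in terms]
    return "#combine( " + " ".join(groups) + " )"


# Name variants taken from the slides; the probabilities are invented.
query = build_smlir_query(
    [("Schwarzenegger", {"شوارزنجر": 0.6, "شوارزنيجر": 0.4}), ("governor", {})]
)
print(query)
```

Because the index holds both the source text and its translation, this one query string matches a document on either side. That is why a single search suffices and no score normalization is needed.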
Relevance: Lost in Translation
• Statistical MT makes mistakes
• Bad translations of relevant documents may be perceived as irrelevant
• Detection: IR match in source language but not in document translation → Bad translation?
• Correction: Replace bad translation with query term
  Query: Sajida al-Rishawi (ساجدة الريشاوي)
  Source: وكانت العراقية ساجدة الريشاوي اوقفت...
  MT: It was the Iraqi sajidah Alry$Awy had stopped…
Query-Directed MT Post-Editing
• Use query translation + word alignments to rewrite incorrect machine translation (MT)
• Considerations: errors in query translation, incorrect word alignments
  Query: Sajida al-Rishawi (ساجدة الريشاوي)
  Source: وكانت العراقية ساجدة الريشاوي اوقفت...
  Translated document with word alignments: It was the Iraqi sajidah Alry$Awy had stopped…
  Edited translation: It was the Iraqi Sajida al-Rishawi had stopped…
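The rewrite step can be sketched as below. It is a minimal illustration, assuming token-level word alignments given as a source-index to target-indices map; the function names and the alignment in the example are hypothetical. To stay conservative it skips names with no aligned target words or with a non-contiguous aligned span.

```python
from typing import Dict, List, Optional, Sequence


def find_span(tokens: Sequence[str], phrase: Sequence[str]) -> Optional[int]:
    """Return the start index of the first occurrence of phrase in tokens."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(phrase):
            return i
    return None


def post_edit(
    source: Sequence[str],
    target: Sequence[str],
    alignment: Dict[int, List[int]],  # source token index -> target token indices
    query_src: Sequence[str],         # query name in the document language
    query_tgt: Sequence[str],         # query name in the user language
    synonyms: Sequence[str] = (),     # other acceptable renderings of the name
) -> List[str]:
    """Replace a mistranslated query name in the MT output with the query term."""
    start = find_span(source, query_src)
    if start is None:
        return list(target)  # no source-side match: nothing to detect
    tgt_idx = sorted(
        {j for i in range(start, start + len(query_src)) for j in alignment.get(i, [])}
    )
    if not tgt_idx:
        return list(target)  # name has no aligned output: not handled here
    lo, hi = tgt_idx[0], tgt_idx[-1]
    if tgt_idx != list(range(lo, hi + 1)):
        return list(target)  # non-contiguous span: too risky to rewrite
    current = " ".join(target[lo:hi + 1]).lower()
    accepted = {" ".join(query_tgt).lower()} | {s.lower() for s in synonyms}
    if current in accepted:
        return list(target)  # translation already matches the query
    return list(target[:lo]) + list(query_tgt) + list(target[hi + 1:])


source = ["وكانت", "العراقية", "ساجدة", "الريشاوي", "اوقفت"]
target = ["It", "was", "the", "Iraqi", "sajidah", "Alry$Awy", "had", "stopped"]
alignment = {0: [0, 1], 1: [2, 3], 2: [4], 3: [5], 4: [6, 7]}  # hypothetical
edited = post_edit(source, target, alignment,
                   ["ساجدة", "الريشاوي"], ["Sajida", "al-Rishawi"])
print(" ".join(edited))
# prints: It was the Iraqi Sajida al-Rishawi had stopped
```

Both considerations from the slide show up as failure modes here: a wrong `query_tgt` would be copied into the document, and a wrong alignment would overwrite the wrong words.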
Experiment Setup
• Part of DARPA GALE question-answering task
  • WHERE HAS [UN Secretary General Kofi Annan] BEEN AND WHEN?
  • Multilingual: English, Chinese, Arabic
  • Multimodal: speech, text; Multigenre: formal, informal
• Evaluation Corpus
  • 102,859 Chinese documents
  • Translated into English using RWTH statistical machine translation system
  • Searches run using Indri (Lemur) IR system
• Relevance judgments
  • 145 queries, 8,785 documents judged
  • A document is Relevant or Not Relevant for a query
  • Judgments based on Chinese text, by Chinese native speakers
Evaluation Points
• Query Translation Strategies
  • English query → Chinese query
  • Run SMLIR searches, evaluate results
• Cross-lingual IR Approaches
  • Using Chinese and/or English query, search over Chinese and/or translated documents
• Machine Translation Post-Editing
  • Detect errors in result translations
  • Rewrite translations
Query Translation for SMLIR
• GALE queries are name-centric
• Statistical machine translation (SMT) failed to translate many names in corpus
• Wikipedia for name translation [Ferrandez et al. 07]
  • Generated by humans, “edited” by humans
  • Contains slang, name variations, common misspellings
  • Noisy, some intentional spam
  • Large variation in quantity/quality by language
Query Translation Strategies for SMLIR
• MT dictionary: probabilistic translation dictionary derived from word alignments
• Wikipedia: for name translations; not probabilistic
• Combination did not help?
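One simple way to derive a probabilistic dictionary from word alignments is relative-frequency estimation over aligned word pairs, sketched below. The slides do not state how the probabilities were actually estimated, so the estimator, the function name, and the toy data are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_translation_dictionary(
    aligned_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate p(target | source) as the relative frequency of aligned pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    dictionary: Dict[str, Dict[str, float]] = {}
    for src, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        dictionary[src] = {tgt: c / total for tgt, c in tgt_counts.items()}
    return dictionary


# Toy (source word, target word) pairs as read off hypothetical word alignments.
pairs = [
    ("governor", "T1"), ("governor", "T1"), ("governor", "T2"),
    ("referendum", "T3"),
]
mt_dictionary = build_translation_dictionary(pairs)
print(mt_dictionary["governor"])
```

A Wikipedia-derived name list has no such counts behind it, which is why the slide describes it as not probabilistic: its entries can only be treated as unweighted synonyms.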
CLIR Evaluation
• SMLIR significantly outperforms all other approaches
• DT significantly better than QT
• Poor performance of QT degrades Merged
Results: Query-Directed SMT Post-Editing
• Post-Editing
  • Detect possible incorrect name translations
  • If translated name is not a synonym of query, rewrite name
  • Very conservative algorithm; does not handle deletions
• Experiment
  • 127 queries, top 10 documents
  • 28 queries triggered post-editing
  • 15% of name matches were rewritten
• Evaluation
  • 101 rewrites examined; 93% Acceptable, 6% Not Acceptable
Conclusions
• SMLIR: Novel and effective approach for integrating document and query translation in CLIR
• Query-directed SMT post-editing shows promise
  • More sophisticated editing possible, beyond just names
• Future work: evaluating whole system for end-to-end question answering
• Combining CLIR and machine translation can improve both search relevance and translation accuracy
Thank you!
• This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract number HR0011-06-C-0023, in part by an NSF Graduate Research Fellowship, and in part by the Center for Intelligent Information Retrieval at the University of Massachusetts.
• Thanks very much to Bob Armstrong for making the annotation happen. Thanks also to Mark Smucker and Giridhar Kumaran for help with INDRI interface and corpus issues, and Ben Carterette for help with estimated MAP. We would also like to thank the members of the NIGHTINGALE machine translation team for translation data, especially Nizar Habash and Mahmoud Ghoneim.
Questions? kristen@cs.columbia.edu