CLIR System Enhanced with Transliteration Generation and Mining
K Saravanan, Raghavendra Udupa & A Kumaran
Microsoft Research India
CLIR System
• Query: FIRE2010 Hindi Query no. 112: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव .... ("Gutkha owners' entanglement with the underworld")
• Query Translator, backed by a Dictionary: सम्बन्ध: relations, मालिकों: owners, प्रसिद्ध: famous, ….
• Document Ranker over Indexed Documents: Telegraph India 04-07 Articles
• Results
Baseline Retrieval System
• Language Model-Based Retrieval
• Probabilistic Translation Lexicon
• ~100K Hindi-English parallel sentences
• ~50K Tamil-English parallel sentences
• IBM Model 3 alignment, GIZA++
J. Jagarlamudi and A. Kumaran, Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.
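The slide names the ingredients (language-model retrieval plus a probabilistic translation lexicon) but not the scoring formula. The following is a minimal sketch of one common way to combine them, a translated query likelihood with linear smoothing. The function name `clir_score`, the smoothing weight `lam`, the floor probability, and the translation probabilities are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def clir_score(query_terms, doc_tokens, lexicon, collection_prob, lam=0.5):
    """Return (log-likelihood, oov_terms) for a source-language query
    scored against an English document.

    lexicon[f][e] is an assumed translation probability of English word e
    for source word f. Terms missing from the lexicon are reported as OOV
    and contribute nothing to the score.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score, oov = 0.0, []
    for f in query_terms:
        if f not in lexicon:
            oov.append(f)
            continue
        p = 0.0
        for e, p_trans in lexicon[f].items():
            p_doc = counts[e] / doc_len if doc_len else 0.0
            # Smooth the document model with a collection model (assumed floor 1e-6).
            p_e = lam * p_doc + (1 - lam) * collection_prob.get(e, 1e-6)
            p += p_trans * p_e
        score += math.log(p)
    return score, oov

# Toy data: the probabilities below are made up for illustration.
lexicon = {"मालिकों": {"owners": 0.9, "proprietors": 0.1}}
query = ["गुटखा", "मालिकों", "अन्डरवर्ल्ड"]
doc_a = "gutkha owners have links with the underworld".split()
doc_b = "cricket match report from kolkata".split()

score_a, oov_a = clir_score(query, doc_a, lexicon, {})
score_b, oov_b = clir_score(query, doc_b, lexicon, {})
```

In this toy run only मालिकों is translated, so the document containing "owners" scores higher, while गुटखा and अन्डरवर्ल्ड are returned as OOV terms and play no part in ranking.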
English Monolingual Results
T: Title, D: Description, N: Narration
• Monolingual performance is considered the upper limit for crosslingual performance
Basic Crosslingual Results
• Can we improve this?
CLIR System & OOVs
• Query: FIRE2010 Hindi Query no. 112: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव ....
• Query Translator, backed by a Dictionary: सम्बन्ध: relations, मालिकों: owners, प्रसिद्ध: famous, ….
• OOVs with no dictionary entry: अन्डरवर्ल्ड: ?, प्रलेख: ?, माणिकचन्द: ?, ….
• Document Ranker over Indexed Documents: Telegraph India 04-07 Articles
• Results
Out-of-Vocabulary (OOV) Query Terms
• Many OOV terms are named entities (NEs)
• NEs are often the focus of a query
• NEs form an open class of terms in all languages
• Hence, getting their transliterations right may help CLIR performance
• E.g. इस्राइली (israel), तसलिमा (taslima), विजयेंद्र (vijayendra), हिजबुल्लाह (hizbollah)
• Many OOV terms are borrowed terms
• E.g. टेन्डर (tender), अन्डरवर्ल्ड (underworld), कौसमेटिक्स (cosmetics), एनकाउंटर (encounter), गुटखा (gutkha)
OOV Terms …
• With the long query (TDN) setup:
• Hindi queries have 73 OOV terms; 31 of them are NEs or borrowed from English
• Tamil queries have 129 OOV terms; 61 of them are NEs or borrowed from English
• Nearly 50% of them (31/73 ≈ 42% for Hindi, 61/129 ≈ 47% for Tamil) may be transliterated (fact) and may improve CLIR performance (hypothesis)
Two Ways of Handling OOV Terms
• Transliteration Generation [Li et al., 2009; Khapra et al., 2010]
• Transliterations of OOV terms are generated using an automatic Machine Transliteration system
• Transliteration Mining [Udupa et al., 2009]
• Transliterations of OOV terms are mined from the top-retrieved documents
Transliteration Generation - Direct
• Based on Conditional Random Fields
• The feature set includes character-alignment data and source and target bigrams and trigrams
• Trained on 15K parallel single-word names in the source and target languages
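To make the feature description concrete, here is a small sketch of source-side character n-gram features of the kind a CRF sequence labeller could consume. The feature names, the `^` and `$` boundary markers, and the function `char_features` are assumptions for illustration. The alignment and target-side features mentioned on the slide are not shown, and no CRF is trained here.

```python
def char_features(word, i):
    """Context features for the character at position i of a source word.

    Boundary markers '^' and '$' are an assumed padding convention so that
    the first and last characters also get bigram and trigram context.
    """
    padded = "^" + word + "$"
    j = i + 1  # position of the character inside the padded string
    return {
        "char": padded[j],
        "prev_bigram": padded[j - 1:j + 1],
        "next_bigram": padded[j:j + 2],
        "trigram": padded[j - 1:j + 2],
    }

def word_features(word):
    """One feature dictionary per character (code point) of the word."""
    return [char_features(word, i) for i in range(len(word))]

latin_feats = word_features("tender")
hindi_feats = word_features("टेन्डर")
```

Note that Python iterates over Unicode code points, so Devanagari vowel signs and the virama each get their own feature dictionary. A real system would probably segment the source word into larger units first.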
Transliteration Generation - Transitive
• Serial combination of multiple direct transliteration systems
• Useful when sufficient parallel data between the source and target languages is not directly available
• [Hindi-English] = [Hindi-Kannada] + [Kannada-English]
• 15K parallel single-word names were used for training on each pair
• For a given input, the top 10 results of the first system are given to the second system
• The outputs of the second system are merged and finally re-ranked by their probability scores
M. Khapra, A. Kumaran and P. Bhattacharyya, Everybody loves a rich cousin: An empirical study of transliteration through bridge languages, NAACL 2010.
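The serial combination above can be sketched as follows. The slide does not say how scores are combined, so this sketch assumes each path is scored by the product of the two systems' probabilities and that duplicate outputs are merged by summing. The two toy "systems" are lookup tables with made-up probabilities, and `kn-1` and `kn-2` are placeholders standing in for Kannada candidate strings.

```python
def transliterate_transitive(word, first, second, k=10):
    """Chain two transliteration systems through a bridge language.

    `first` and `second` map a string to a list of (candidate, probability)
    pairs sorted by descending probability. Only the top k bridge candidates
    are passed on to the second system.
    """
    merged = {}
    for bridge, p1 in first(word)[:k]:
        for target, p2 in second(bridge):
            # Assumption: path score is a product, duplicates are summed.
            merged[target] = merged.get(target, 0.0) + p1 * p2
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-ins for trained Hindi-Kannada and Kannada-English systems.
HINDI_KANNADA = {"विजयेंद्र": [("kn-1", 0.6), ("kn-2", 0.4)]}
KANNADA_ENGLISH = {
    "kn-1": [("vijayendra", 0.7), ("vijayendar", 0.3)],
    "kn-2": [("vijayendra", 0.5), ("vijendra", 0.5)],
}

ranked = transliterate_transitive(
    "विजयेंद्र",
    lambda w: HINDI_KANNADA.get(w, []),
    lambda w: KANNADA_ENGLISH.get(w, []),
)
```

With these toy numbers "vijayendra" is reached through both bridge candidates, so merging lifts it above candidates that only one path supports.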
Transliteration Mining
• Hypothesis: the transliterations of many OOV query terms can be found in the top results of the CLIR system for that query.
• Basic idea:
• Pair the query with each of the top N results.
• Treat each pair as a comparable document pair.
• Mine transliteration equivalents from the comparable document pairs.
R. Udupa, K. Saravanan, A. Bakalov and A. Bhole, "They are out there, if you know where to look": Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval, ECIR 2009
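The mining loop can be illustrated with a rough sketch. The similarity model used in the cited paper is not described on the slide, so this example swaps in a plain string-similarity ratio from Python's `difflib` over a crude romanization. The `ROMANIZED` table, the `threshold` value, and the function names are hypothetical stand-ins, not the authors' method.

```python
from difflib import SequenceMatcher

def mine_transliterations(oov_terms, top_docs, romanize, threshold=0.55):
    """For each OOV term, pick the most similar word in the top-N documents.

    Each (query, document) pair is treated as a comparable pair. A term is
    kept only if its best candidate reaches the similarity threshold.
    """
    mined = {}
    for term in oov_terms:
        key = romanize(term)
        best_word, best_score = None, 0.0
        for doc in top_docs:
            for word in doc.lower().split():
                score = SequenceMatcher(None, key, word).ratio()
                if score > best_score:
                    best_word, best_score = word, score
        if best_word is not None and best_score >= threshold:
            mined[term] = best_word
    return mined

# Hypothetical romanizations standing in for a trained similarity model.
ROMANIZED = {"अन्डरवर्ल्ड": "andarvarld", "माणिकचन्द": "manikchand"}
top_docs = [
    "Police probe underworld links of gutkha barons",
    "Manikchand group denies charges",
]
mined = mine_transliterations(list(ROMANIZED), top_docs, ROMANIZED.get)
```

The design point is that candidates come only from documents already retrieved for the query, which keeps the search space small and topical.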
Hybrid Approach (Mining + Generation)
• Combination of transliteration mining and generation
• First, transliterations for the OOV terms were mined from the top results of the CLIR system
• Second, transliterations were generated for those OOV terms for which mining found nothing
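The two-step fallback can be written down directly. In this sketch `mine` and `generate` are assumed callables that return a transliteration or `None`, and the toy lookup tables stand in for the real mining and generation components.

```python
def hybrid_transliterate(oov_terms, mine, generate):
    """Mine first; generate only for terms that mining could not resolve.

    Returns a dict mapping each resolved term to (transliteration, source),
    where source is "mined" or "generated".
    """
    resolved = {}
    for term in oov_terms:
        candidate = mine(term)
        if candidate:
            resolved[term] = (candidate, "mined")
            continue
        candidate = generate(term)
        if candidate:
            resolved[term] = (candidate, "generated")
    return resolved

# Toy stand-ins: mining found one term, generation covers the other.
MINED = {"अन्डरवर्ल्ड": "underworld"}
GENERATED = {"अन्डरवर्ल्ड": "andarworld", "तसलिमा": "taslima"}

hybrid = hybrid_transliterate(["अन्डरवर्ल्ड", "तसलिमा"], MINED.get, GENERATED.get)
```

Here the mined form of अन्डरवर्ल्ड takes precedence over the generated one, and तसलिमा falls through to generation.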
Hindi-English Crosslingual Results: T
M: Transliteration Mining, GD: Transliteration Generation - Direct, GT: Transliteration Generation - Transitive
Hindi-English Crosslingual Results: TD
M: Transliteration Mining, GD: Transliteration Generation - Direct, GT: Transliteration Generation - Transitive
Hindi-English Crosslingual Results: TDN
M: Transliteration Mining, GD: Transliteration Generation - Direct, GT: Transliteration Generation - Transitive
Tamil-English Crosslingual Results
M: Transliteration Mining, GD: Transliteration Generation - Direct
Conclusion
• We presented a modular CLIR system that allows experimentation with different methodologies
• Our methodologies for handling OOV terms improved crosslingual retrieval performance significantly
• Our Hindi-English (TDN) crosslingual performance is 97% of the monolingual performance
Publications
• Jagarlamudi, J. and Kumaran, A. 2007. Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.
• Udupa, R., Jagarlamudi, J. and Saravanan, K. 2008. Microsoft Research India at FIRE2008: Hindi-English Cross-Language Information Retrieval. Working Notes for the Forum for Information Retrieval Evaluation (FIRE) 2008 Workshop.
• Udupa, R., Saravanan, K., Bakalov, A. and Bhole, A. 2009. "They Are Out There, If You Know Where to Look": Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval. In 31st European Conference on IR Research, ECIR 2009.
• Li, H., Kumaran, A., Pervouchine, V. and Zhang, M. 2009. Report of NEWS 2009 Machine Transliteration Shared Task. Proceedings of the ACL 2009 Workshop on Named Entities (NEWS 2009), Association for Computational Linguistics, August 2009.
• Khapra, M., Kumaran, A. and Bhattacharyya, P. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Proceedings of NAACL 2010.