CLIR System Enhanced with Transliteration Generation and Mining

K Saravanan, Raghavendra Udupa & A Kumaran, Microsoft Research India.


Presentation Transcript


  1. CLIR System Enhanced with Transliteration Generation and Mining K Saravanan, Raghavendra Udupa & A Kumaran Microsoft Research India

  2. CLIR System
     • Query: FIRE2010 Hindi Query no. 112: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव .... (roughly, "gutkha owners' entanglement with the underworld")
     • Query Translator, backed by a Dictionary: सम्बन्ध: relations, मालिकों: owners, प्रसिद्ध: famous, ….
     • Document Ranker, over Indexed Documents (Telegraph India 04-07 Articles)
     • Results

  3. Baseline Retrieval System
     • Language Model-Based Retrieval
     • Probabilistic Translation Lexicon
       • ~100K Hindi-English parallel sentences
       • ~50K Tamil-English parallel sentences
       • IBM Model 3 alignment, GIZA++
     J. Jagarlamudi and A. Kumaran, Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.
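
The slide names the two ingredients of the baseline (a language-model ranker and a probabilistic translation lexicon) without showing how they combine. One common way to combine them is a translation-based query likelihood; the sketch below illustrates that idea only and is not taken from the authors' system. The function name, the smoothing scheme, the `lam` and `floor` parameters, and all probabilities in the toy lexicon are assumptions.

```python
import math
from collections import Counter

def translated_query_log_likelihood(query, doc_tokens, lexicon, lam=0.7, floor=1e-6):
    """Log-likelihood of a source-language query given a target-language document.

    lexicon[f] maps a source term f to {target word e: t(f | e)}.
    All names and the smoothing scheme are illustrative assumptions.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_p = 0.0
    for f in query:
        p_trans = 0.0
        if doc_len > 0:
            # P(f | D) = sum over target words e of t(f | e) * P_ml(e | D)
            p_trans = sum(t * counts[e] / doc_len for e, t in lexicon.get(f, {}).items())
        # Mix with a small constant so untranslatable terms do not yield log(0)
        log_p += math.log(lam * p_trans + (1 - lam) * floor)
    return log_p

# Toy lexicon with made-up probabilities
lexicon = {
    "मालिकों": {"owners": 0.8, "proprietors": 0.2},
    "सम्बन्ध": {"relations": 0.7, "links": 0.3},
}
docs = {
    "d1": "gutkha owners and their relations with the underworld".split(),
    "d2": "cricket team owners meet".split(),
}
query = ["मालिकों", "सम्बन्ध"]
ranked = sorted(
    docs,
    key=lambda d: translated_query_log_likelihood(query, docs[d], lexicon),
    reverse=True,
)
```

In this toy setup `d1` ranks above `d2` because it contains translations of both query terms. A query term missing from the lexicon contributes only the floor value for every document, which is why OOV terms carry no ranking signal in such a baseline.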

  4. English Monolingual Results
     • T: Title, D: Description, N: Narration
     • Monolingual performance is considered the upper limit for crosslingual performance

  5. Basic Crosslingual Results Can we improve this?

  6. CLIR System & OOVs
     • Query: FIRE2010 Hindi Query no. 112: गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव ....
     • Query Translator, backed by a Dictionary: सम्बन्ध: relations, मालिकों: owners, प्रसिद्ध: famous, ….
     • OOVs with no dictionary entry: अन्डरवर्ल्ड: ?, प्रलेख: ?, माणिकचन्द: ?, ….
     • Document Ranker, over Indexed Documents (Telegraph India 04-07 Articles)
     • Results
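
How OOV terms arise in the query translator can be sketched in a few lines: any query term without a dictionary entry falls through untranslated. This is a minimal illustration, not the authors' translator. The function name and the small stopword set are assumptions, and the dictionary holds only the three entries shown on the slide, so more terms come out as OOV here than would with a full lexicon.

```python
def translate_query(query, dictionary, stopwords=frozenset()):
    """Split a query into dictionary translations and OOV terms.

    Returns (translated, oov): a dict of term -> translation and a list of
    terms that have no dictionary entry. Purely illustrative.
    """
    translated, oov = {}, []
    for term in query.split():
        if term in stopwords:
            continue  # function words are skipped, not treated as OOV
        if term in dictionary:
            translated[term] = dictionary[term]
        else:
            oov.append(term)
    return translated, oov

# The three dictionary entries shown on the slide
dictionary = {"सम्बन्ध": "relations", "मालिकों": "owners", "प्रसिद्ध": "famous"}
# Hypothetical stopword list covering the function words in this query
stopwords = frozenset({"का", "के", "साथ"})
query = "गुटखा मालिकों का अन्डरवर्ल्ड के साथ उलझाव"

translated, oov = translate_query(query, dictionary, stopwords)
```

With this toy dictionary only मालिकों is translated; गुटखा and अन्डरवर्ल्ड end up in `oov`, which is the list that the transliteration steps on the later slides try to resolve.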

  7. Out-of-Vocabulary (OOV) Query Terms
     • Many OOV terms are named entities (NEs)
       • NEs are often the focus of a query
       • NEs form an open class of terms in all languages
       • Hence, getting their transliterations right may help CLIR performance
       • E.g. इस्राइली (israel), तसलिमा (taslima), विजयेंद्र (vijayendra), हिजबुल्लाह (hizbollah)
     • Many OOV terms are borrowed terms
       • E.g. टेन्डर (tender), अन्डरवर्ल्ड (underworld), कौसमेटिक्स (cosmetics), एनकाउंटर (encounter), गुटखा (gutkha)

  8. OOV Terms …
     • With the long query (TDN) setup:
       • Hindi queries have 73 OOV terms
         • 31 of them are NEs or borrowed from English
       • Tamil queries have 129 OOV terms
         • 61 of them are NEs or borrowed from English
     • Nearly 50% of them may be transliterated (fact) and may improve CLIR performance (hypothesis)
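
The "nearly 50%" figure can be checked directly from the counts on this slide. The snippet below only redoes that arithmetic; the variable names are illustrative.

```python
# (transliterable OOV terms, all OOV terms) per language, from the slide
oov_counts = {"Hindi": (31, 73), "Tamil": (61, 129)}

# Percentage of OOV terms that are NEs or borrowed from English
share = {lang: round(100 * n / total, 1) for lang, (n, total) in oov_counts.items()}

# Same percentage over both languages pooled together
pooled = round(
    100 * sum(n for n, _ in oov_counts.values()) / sum(t for _, t in oov_counts.values()),
    1,
)
```

This gives about 42.5% for Hindi, 47.3% for Tamil, and 45.5% for the two query sets pooled.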

  9. Two Ways of Handling OOV Terms
     • Transliteration Generation [Li et al., 2009; Khapra et al., 2010]
       • Transliterations of OOV terms are generated using an automatic Machine Transliteration system
     • Transliteration Mining [Udupa et al., 2009-b]
       • Transliterations of OOV terms are mined from the top-retrieved documents

  10. Transliteration Generation - Direct
      • Based on Conditional Random Fields
      • The feature set includes character-alignment data and source and target bigrams and trigrams
      • Trained on 15K parallel single-word names in the source and target languages
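
To make the feature description concrete, here is a minimal sketch of source-side character n-gram features for one word, in the dictionary-per-position form that sequence labellers such as CRFs typically consume. It covers only the source bigrams and trigrams; the character-alignment data and target-side n-grams mentioned on the slide are not modelled. The function names, feature names, and boundary markers are assumptions, not the authors' feature templates.

```python
def char_features(source, i):
    """Features for position i of a source word: the character itself plus
    the bigrams and the trigram around it. Feature names are hypothetical."""
    padded = "^" + source + "$"  # word-boundary markers
    j = i + 1                    # index of source[i] inside `padded`
    return {
        "char": padded[j],
        "bigram_left": padded[j - 1:j + 1],
        "bigram_right": padded[j:j + 2],
        "trigram": padded[j - 1:j + 2],
    }

def word_features(source):
    """One feature dict per character (Unicode code point) of the source word."""
    return [char_features(source, i) for i in range(len(source))]

word = "टेन्डर"  # "tender", one of the borrowed terms listed earlier
features = word_features(word)
```

Each position then carries its local context, so a trained model can learn, for example, that a character behaves differently at the start of a word than in the middle.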

  11. Transliteration Generation - Transitive
      • Serial combination of multiple direct transliteration systems
      • Useful when sufficient parallel data between the source and target languages is not directly available
      • [Hindi-English] = [Hindi-Kannada] + [Kannada-English]
      • 15K parallel single-word names were used for training on each pair
      • For a given input, the top 10 results of the first system are given to the second system
      • The outputs of the second system are merged and finally re-ranked by their probability scores
      M. Khapra, A. Kumaran and P. Bhattacharyya, Everybody loves a rich cousin: An empirical study of transliteration through bridge languages, NAACL 2010.
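
The serial combination described above can be sketched as follows. The slide states only that the second system's outputs are merged and re-ranked by probability score; multiplying the two stage scores and summing over bridge candidates is an assumption made here for illustration. The two lookup-table "systems" and all candidates and scores are made up.

```python
from collections import defaultdict

def transitive_transliterate(word, first, second, top_k=10):
    """Chain two transliteration systems through a bridge language.

    `first` and `second` each take a string and return a list of
    (candidate, probability) pairs sorted by descending probability.
    """
    merged = defaultdict(float)
    for bridge, p1 in first(word)[:top_k]:       # top-k bridge-language candidates
        for target, p2 in second(bridge):
            merged[target] += p1 * p2            # assumed scoring: sum of products
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-ins for trained Hindi-Kannada and Kannada-English systems
bridge_a, bridge_b = "ವಿಜಯೇಂದ್ರ", "ವಿಜಯೆಂದ್ರ"
hindi_to_kannada = {"विजयेंद्र": [(bridge_a, 0.6), (bridge_b, 0.3)]}
kannada_to_english = {
    bridge_a: [("vijayendra", 0.7), ("vijayendr", 0.2)],
    bridge_b: [("vijayendra", 0.5), ("vijeyendra", 0.3)],
}

results = transitive_transliterate(
    "विजयेंद्र",
    lambda w: hindi_to_kannada.get(w, []),
    lambda w: kannada_to_english.get(w, []),
)
```

Merging matters because the same target string can be reached through several bridge candidates; in the toy data `vijayendra` collects score from both Kannada spellings and ends up ranked first.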

  12. Examples of Transliteration Generation

  13. Transliteration Mining
      • Hypothesis:
        • The transliterations of many OOV query terms can be found in the top results of the CLIR system for that query.
      • Basic Idea:
        • Pair the query with each of the top N results.
        • Treat each pair as a comparable document pair.
        • Mine transliteration equivalents from the comparable document pairs.
      R. Udupa, K. Saravanan, A. Bakalov and A. Bhole, “They are out there, if you know where to look”: Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval, ECIR 2009
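
The three steps of the basic idea can be sketched as a loop over the top results. The similarity function is the crux of real mining; the one used here (string similarity against a hand-written rough romanization) is only a stand-in and is not the model from the cited paper. The function name, the threshold, the romanization table, and the example documents are all assumptions.

```python
from difflib import SequenceMatcher

def mine_transliterations(oov_terms, top_docs, similarity, threshold=0.5):
    """For each OOV term, pick the most similar word in the top-N documents,
    keeping it only if the similarity reaches `threshold`."""
    mined = {}
    for term in oov_terms:
        best, best_score = None, 0.0
        for doc in top_docs:                      # each (query, doc) is a comparable pair
            for word in sorted(set(doc.lower().split())):
                score = similarity(term, word)
                if score > best_score:
                    best, best_score = word, score
        if best is not None and best_score >= threshold:
            mined[term] = best
    return mined

# Hypothetical rough romanizations standing in for a trained similarity model
rough_roman = {"अन्डरवर्ल्ड": "andarvarld", "माणिकचन्द": "manikchand"}

def toy_similarity(term, word):
    return SequenceMatcher(None, rough_roman[term], word).ratio()

top_docs = [
    "Police probe links between gutkha barons and the underworld",
    "Manikchand group denies underworld ties",
]
mined = mine_transliterations(list(rough_roman), top_docs, toy_similarity)
```

On these two made-up documents the loop recovers `underworld` and `manikchand`. Mining can only find spellings that actually occur in the retrieved documents, which is both its strength (the spelling matches the collection) and its limit (nothing is found if the first-pass retrieval misses).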

  14. Examples of Transliteration Mining

  15. Hybrid Approach (Mining + Generation)
      • Combination of transliteration mining and generation
      • First, transliterations for the OOV terms are mined from the top results of the CLIR system
      • Second, transliterations are generated for those OOV terms for which mining found none
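
The two-stage fallback can be written as a small wrapper around any mining and generation components. This is a sketch under assumptions: `mine` and `generate` are hypothetical callables, the toy stand-ins below return made-up outputs, and keeping only the top generated candidate is a choice made here rather than something stated on the slide.

```python
def hybrid_transliterate(oov_terms, mine, generate):
    """Mine first; generate only for the OOV terms that mining left unresolved.

    `mine` maps a list of terms to {term: transliteration};
    `generate` maps one term to a ranked list of candidates.
    """
    result = dict(mine(oov_terms))
    for term in oov_terms:
        if term not in result:
            candidates = generate(term)
            if candidates:
                result[term] = candidates[0]  # assumed: keep the top candidate only
    return result

# Toy stand-ins for the mining and generation components
underworld, gutkha, unresolved = "अन्डरवर्ल्ड", "गुटखा", "प्रलेख"
toy_mine = lambda terms: {t: "underworld" for t in terms if t == underworld}
toy_generate = lambda term: ["gutkha"] if term == gutkha else []

resolved = hybrid_transliterate([underworld, gutkha, unresolved], toy_mine, toy_generate)
```

Here `underworld` comes from mining, `gutkha` from generation, and the third term stays unresolved because neither component produces anything for it.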

  16. Experimental Results

  17. Hindi-English Crosslingual Results: T
      • M: Transliteration Mining, GD: Transliteration Generation - Direct, GT: Transliteration Generation - Transitive

  18. Hindi-English Crosslingual Results: TD
      • M: Transliteration Mining, GD: Transliteration Generation - Direct, GT: Transliteration Generation - Transitive

  19. Hindi-English Crosslingual Results: TDN
      • M: Transliteration Mining, GD: Transliteration Generation - Direct, GT: Transliteration Generation - Transitive

  20. Tamil-English Crosslingual Results
      • M: Transliteration Mining, GD: Transliteration Generation - Direct

  21. Conclusion
      • We presented a modular CLIR system that allows experimentation with different methodologies
      • Our methodologies for handling OOV terms improved crosslingual retrieval performance significantly
      • Our Hindi-English (TDN) crosslingual performance is 97% of the monolingual performance

  22. Publications
      • Jagarlamudi, J. and Kumaran, A. 2007. Cross-Lingual Information Retrieval System for Indian Languages. Working Notes for the CLEF 2007 Workshop.
      • Udupa, R., Jagarlamudi, J. and Saravanan, K. 2008. Microsoft Research India at FIRE2008: Hindi-English Cross-Language Information Retrieval. Working Notes for the Forum for Information Retrieval Evaluation (FIRE) 2008 Workshop.
      • Udupa, R., Saravanan, K., Bakalov, A. and Bhole, A. 2009. "They Are Out There, If You Know Where to Look": Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval. In 31st European Conference on IR Research, ECIR 2009.
      • Li, H., Kumaran, A., Pervouchine, V. and Zhang, M. 2009. Report of NEWS 2009 Machine Transliteration Shared Task. Proceedings of the ACL 2009 Workshop on Named Entities (NEWS 2009), Association for Computational Linguistics, August 2009.
      • Khapra, M., Kumaran, A. and Bhattacharyya, P. 2010. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. In Proceedings of NAACL 2010.

  23. Thank You

  24. Impact of M/G on Queries

  25. Queries With High Positive Impact

  26. Queries With High Negative Impact

  27. Hindi Query 112:
