IITB @ FIRE 2010: Discriminative Models for IR Vishal Vachhani, Shalini Gupta Joint work with Manoj Chinnakotla, Karthik Raman, Pushpak Bhattacharyya 20th February, 2010
Roadmap • Introduction • Ranking Models for IR • Discriminative Models for IR • Features for RankSVM • Training & PRF • Cross Lingual IR • Experiments & Results • Conclusions
The Central Problem of IR • The user has an information need • The need is expressed as a short, often ambiguous query • Objective of the IR system • Match the user query against relevant documents • Rank the documents in order of relevance
Ranking Models for IR • Vector Space Models • Probabilistic Models • Language Models
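As a point of reference for the ranking families listed above, here is a minimal vector-space scoring sketch (TF-IDF weights with cosine similarity). The function names and the exact weighting scheme are illustrative assumptions, not the system's actual code.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Log-scaled TF times IDF for one bag of tokens."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_tokens, docs_tokens, idf):
    """Return (doc index, score) pairs sorted by descending cosine similarity."""
    q = tfidf_vector(query_tokens, idf)
    scored = [(i, cosine(q, tfidf_vector(d, idf))) for i, d in enumerate(docs_tokens)]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```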
Why Discriminative Models? • Can incorporate arbitrary features • Can be optimized directly for specific evaluation metrics • Hence, discriminative models for IR.
Features for RankSVM • Statistical Features • Term Frequency • Normalized Term Frequency • Inverse Document Frequency
Features for RankSVM (..contd) • Normalized Cumulative Frequency • Normalized Term Frequency, weighted by IDF • Normalized Term Frequency, weighted by cumulative frequency
Features for RankSVM (..contd) • Co-ord factor • Ratio of the number of overlapping words to the total number of query words • Named Entity (NE) features • Capture the importance of NEs in the query • Pivoted length normalization • LSI similarity • Captures concept-based similarity (a feature-extraction sketch follows below)
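A hedged sketch of how such per-(query, document) features could be computed is given below. The exact normalizations, the sum/max aggregation over query terms, and the helper names are assumptions for illustration; the NE, pivoted-normalization and LSI features are omitted.

```python
import math
from collections import Counter

def extract_features(query_terms, doc_terms, df, num_docs):
    """Feature vector for one (query, document) pair.
    df: document frequency per term; num_docs: collection size."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms) or 1
    idf = {t: math.log(num_docs / (1.0 + df.get(t, 0))) for t in query_terms}
    feats = []
    for agg in (sum, max):  # aggregate per-term values over the query
        feats.append(agg(tf[t] for t in query_terms))                      # term frequency
        feats.append(agg(tf[t] / doc_len for t in query_terms))            # normalized TF
        feats.append(agg(idf[t] for t in query_terms))                     # IDF
        feats.append(agg(tf[t] / doc_len * idf[t] for t in query_terms))   # norm. TF weighted by IDF
    overlap = sum(1 for t in set(query_terms) if tf[t] > 0)
    feats.append(overlap / len(set(query_terms)))                          # co-ord factor
    return feats
```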
Training • Training data in the form of ranked documents • Top preference assigned to relevant documents from the relevance judgements of the training queries • Add k1 random negative documents from the relevance judgements • Add "close" irrelevant documents drawn from an initial ranking: • Get the initial ranking (top 1K) using simple query-likelihood ranking • Take the top k2 irrelevant documents of the initial ranking • Take the bottom k3 irrelevant documents of the initial ranking (a pairwise-training sketch follows below)
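The sketch below illustrates the standard pairwise reduction used by RankSVM [3] on training data assembled this way: relevant-minus-irrelevant feature-vector differences per query, fit with a linear classifier. The helper names and the use of scikit-learn's LinearSVC are assumptions; the original runs used an SVM rank learner rather than this exact code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def make_pairs(features_by_query):
    """features_by_query: list of (relevant_vectors, irrelevant_vectors), one entry per query."""
    X, y = [], []
    for rel, irrel in features_by_query:
        for r in rel:
            for n in irrel:
                X.append(np.asarray(r) - np.asarray(n)); y.append(+1)
                X.append(np.asarray(n) - np.asarray(r)); y.append(-1)
    return np.vstack(X), np.array(y)

# usage (feature vectors built per query as described on the slide above):
#   X, y = make_pairs(features_by_query)
#   ranker = LinearSVC(C=1.0).fit(X, y)
#   score = ranker.coef_[0].dot(feature_vector)   # higher score => ranked higher
```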
Pseudo Relevance Feedback (PRF) • We use the Zhai and Lafferty [11] PRF method in our experiments • Terms from the top 10 documents of the initial ranking were used as feedback terms • Updating the feature vector vs. expanding the feature vector • Updating the feature vector gives better accuracy
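For illustration only, a simplified stand-in for the feedback step is sketched below: it selects frequent terms from the top-ranked documents and reweights the existing query representation ("update") instead of adding new feature dimensions ("expand"). This is not the Zhai-Lafferty mixture-model estimation of [11]; the term-selection rule, names and weights are assumptions.

```python
from collections import Counter

def feedback_terms(top_docs_tokens, stopwords, num_terms=20):
    """Pick the most frequent non-stopword terms from the top-ranked documents."""
    counts = Counter(t for doc in top_docs_tokens for t in doc if t not in stopwords)
    return [term for term, _ in counts.most_common(num_terms)]

def update_query_weights(query_terms, expansion_terms, weight=0.5):
    """Reweight the existing query representation (update) rather than
    adding new feature dimensions (expand)."""
    weights = {t: 1.0 for t in query_terms}
    for t in expansion_terms:
        weights[t] = weights.get(t, 0.0) + weight
    return weights
```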
Cross-Lingual IR • Hindi-English and Marathi-English query-translation based CLIR • Transliteration • Segment-based transliteration approach from Devanagari to English • Accuracy: 71.8% at rank 5 • Resources • Hindi-English dictionary: 131,750 entries • Marathi-English dictionary: 31,845 entries
Query Disambiguation • Disambiguation algorithm • PageRank-style iterative disambiguation [Monz and Dorr, 5] • FIRE-2008: • Link weight: Dice coefficient • 1 candidate translation per query word • FIRE-2010: • Link weight: similarity in LSI space • LSI establishes associations between terms that occur in similar contexts • 2 candidate translations per query word (a simplified sketch follows below)
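A simplified sketch of PageRank-style iterative disambiguation in the spirit of Monz and Dorr [5] follows: each candidate translation is reinforced by link weights (Dice coefficient in FIRE-2008, LSI similarity in FIRE-2010) from the candidates of the other query terms, then the weights are renormalized per source term. The data layout, iteration count and top-2 cut-off implementation are assumptions.

```python
def disambiguate(candidates, link_weight, iterations=10):
    """candidates: {source_term: [candidate translations]}
       link_weight(t1, t2): association score between two candidate translations."""
    # start with uniform weights per source term
    w = {s: {c: 1.0 / len(cs) for c in cs} for s, cs in candidates.items()}
    for _ in range(iterations):
        new_w = {}
        for s, cs in candidates.items():
            scores = {}
            for c in cs:
                # support from the candidates of all *other* query terms
                scores[c] = w[s][c] + sum(link_weight(c, c2) * w[s2][c2]
                                          for s2, cs2 in candidates.items() if s2 != s
                                          for c2 in cs2)
            total = sum(scores.values()) or 1.0
            new_w[s] = {c: v / total for c, v in scores.items()}
        w = new_w
    # keep the two highest-weighted candidates per query word, as in the FIRE-2010 runs
    return {s: sorted(cs, key=lambda c: w[s][c], reverse=True)[:2]
            for s, cs in candidates.items()}
```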
Experiments • Monolingual: English, Hindi, and Marathi • CLIR: Hindi-English and Marathi-English • Training • 50 test queries of FIRE-2008 • SVM, RankSVM • Runs: • Monolingual and cross-lingual without feedback • Monolingual and cross-lingual with feedback • Runs with Title and Title + Desc
Conclusions • Improvement in cross-lingual IR due to improvements in transliteration and query disambiguation • About 88% of monolingual retrieval performance • Need further investigation of • Monolingual baselines – Hindi, Marathi and English • Performance of PRF in the discriminative IR model • Effect of various features such as NE, LSI-based similarity and document length normalization
Acknowledgements • The first author is supported by the Infosys Fellowship Award • Project linguists at CFILT, IIT Bombay
References [1] M. K. Chinnakotla and O. P. Damani. Character sequence modeling for transliteration. In ICON 2009, IITB, India, 2009. [2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990. [3] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133-142, New York, NY, USA, 2002. ACM. [4] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008. [5] C. Monz and B. J. Dorr. Iterative translation disambiguation for cross-language information retrieval. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 520-527, New York, NY, USA, 2005. ACM. [6] R. Nallapati. Discriminative models for information retrieval. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004. ACM.
References (Contd..) [7] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006. [8] N. Padariya, M. Chinnakotla, A. Nagesh, and O. P. Damani. Evaluation of Hindi to English, Marathi to English and English to Hindi CLIR at FIRE 2008. In FIRE 2008, IITB, India, 2008. [9] A. Singhal, C. Buckley, M. Mitra, and A. Mitra. Pivoted document length normalization. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 21-29, 1996. ACM. [10] D. Widdows and K. Ferraro. Semantic vectors: a scalable open source package and online technology management application. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/. [11] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, pages 403-410, New York, NY, USA, 2001. ACM.