IITB @ FIRE 2010: Discriminative Models for IR Vishal Vachhani, Shalini Gupta Joint work with Manoj Chinnakotla, Karthik Raman, Pushpak Bhattacharyya 20th February, 2010
Roadmap • Introduction • Ranking Models for IR • Discriminative Models for IR • Features for RankSVM • Training & PRF • Cross Lingual IR • Experiments & Results • Conclusions
The Central Problem of IR • The user has an information need • The need is expressed as a short, often ambiguous query • Objective of the IR system • Match the user query against relevant documents • Rank the documents in order of relevance
Ranking Models for IR • Vector Space Models • Probabilistic Models • Language Models
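As a point of reference for the ranking families listed above, here is a minimal vector-space scoring sketch (TF-IDF weights with cosine similarity). The function names and the exact weighting scheme are illustrative assumptions, not the system's actual code.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Log-scaled TF times IDF for one bag of tokens."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_tokens, docs_tokens, idf):
    """Return (doc index, score) pairs sorted by descending cosine similarity."""
    q = tfidf_vector(query_tokens, idf)
    scored = [(i, cosine(q, tfidf_vector(d, idf))) for i, d in enumerate(docs_tokens)]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```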
Why Discriminative Models? • Can incorporate arbitrary features • Can be optimized directly for specific evaluation metrics • Hence, discriminative models for IR.
Features for RankSVM • Statistical Features • Term Frequency • Normalized Term Frequency • Inverse Document Frequency
Features for RankSVM (..contd) • Normalized Cumulative Frequency • Normalized Term Frequency, weighted by IDF • Normalized Term Frequency, weighted by cumulative frequency
Features for RankSVM (..contd) • Co-ord factor • Ratio of the number of overlapping words to the total number of query words • Named Entity (NE) features • Capture the importance of NEs in the query • Pivoted length normalization • LSI similarity • Captures concept-based similarity (a feature-extraction sketch follows below)
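A hedged sketch of how such per-(query, document) features could be computed is given below. The exact normalizations, the sum/max aggregation over query terms, and the helper names are assumptions for illustration; the NE, pivoted-normalization and LSI features are omitted.

```python
import math
from collections import Counter

def extract_features(query_terms, doc_terms, df, num_docs):
    """Feature vector for one (query, document) pair.
    df: document frequency per term; num_docs: collection size."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms) or 1
    idf = {t: math.log(num_docs / (1.0 + df.get(t, 0))) for t in query_terms}
    feats = []
    for agg in (sum, max):  # aggregate per-term values over the query
        feats.append(agg(tf[t] for t in query_terms))                      # term frequency
        feats.append(agg(tf[t] / doc_len for t in query_terms))            # normalized TF
        feats.append(agg(idf[t] for t in query_terms))                     # IDF
        feats.append(agg(tf[t] / doc_len * idf[t] for t in query_terms))   # norm. TF weighted by IDF
    overlap = sum(1 for t in set(query_terms) if tf[t] > 0)
    feats.append(overlap / len(set(query_terms)))                          # co-ord factor
    return feats
```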
Training • Training data in the form of ranked documents • Top preference assigned to relevant documents from the relevance judgements of the training queries • Add k1 random negative documents from the relevance judgements • Add "close" irrelevant documents drawn from an initial ranking: • Get the initial ranking (top 1K) using simple query-likelihood ranking • Take the top k2 irrelevant documents of the initial ranking • Take the bottom k3 irrelevant documents of the initial ranking (a pairwise-training sketch follows below)
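The sketch below illustrates the standard pairwise reduction used by RankSVM [3] on training data assembled this way: relevant-minus-irrelevant feature-vector differences per query, fit with a linear classifier. The helper names and the use of scikit-learn's LinearSVC are assumptions; the original runs used an SVM rank learner rather than this exact code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def make_pairs(features_by_query):
    """features_by_query: list of (relevant_vectors, irrelevant_vectors), one entry per query."""
    X, y = [], []
    for rel, irrel in features_by_query:
        for r in rel:
            for n in irrel:
                X.append(np.asarray(r) - np.asarray(n)); y.append(+1)
                X.append(np.asarray(n) - np.asarray(r)); y.append(-1)
    return np.vstack(X), np.array(y)

# usage (feature vectors built per query as described on the slide above):
#   X, y = make_pairs(features_by_query)
#   ranker = LinearSVC(C=1.0).fit(X, y)
#   score = ranker.coef_[0].dot(feature_vector)   # higher score => ranked higher
```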
Pseudo Relevance Feedback (PRF) • We use the Zhai and Lafferty [11] PRF method in our experiments • Terms from the top 10 documents of the initial ranking were used as feedback terms • Updating the feature vector vs. expanding the feature vector • Updating the feature vector gives better accuracy
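For illustration only, a simplified stand-in for the feedback step is sketched below: it selects frequent terms from the top-ranked documents and reweights the existing query representation ("update") instead of adding new feature dimensions ("expand"). This is not the Zhai-Lafferty mixture-model estimation of [11]; the term-selection rule, names and weights are assumptions.

```python
from collections import Counter

def feedback_terms(top_docs_tokens, stopwords, num_terms=20):
    """Pick the most frequent non-stopword terms from the top-ranked documents."""
    counts = Counter(t for doc in top_docs_tokens for t in doc if t not in stopwords)
    return [term for term, _ in counts.most_common(num_terms)]

def update_query_weights(query_terms, expansion_terms, weight=0.5):
    """Reweight the existing query representation (update) rather than
    adding new feature dimensions (expand)."""
    weights = {t: 1.0 for t in query_terms}
    for t in expansion_terms:
        weights[t] = weights.get(t, 0.0) + weight
    return weights
```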
Cross-Lingual IR • Hindi-English and Marathi-English query-translation based CLIR • Transliteration • Segment-based transliteration approach from Devanagari to English • Accuracy: 71.8% at rank 5 • Resources • Hindi-English dictionary: 131,750 entries • Marathi-English dictionary: 31,845 entries
Query Disambiguation • Disambiguation algorithm • PageRank-style iterative disambiguation [Monz and Dorr, 5] • FIRE-2008: • Link weight: Dice coefficient • 1 candidate translation per query word • FIRE-2010: • Link weight: similarity in LSI space • LSI establishes associations between terms that occur in similar contexts • 2 candidate translations per query word (a simplified sketch follows below)
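A simplified sketch of PageRank-style iterative disambiguation in the spirit of Monz and Dorr [5] follows: each candidate translation is reinforced by link weights (Dice coefficient in FIRE-2008, LSI similarity in FIRE-2010) from the candidates of the other query terms, then the weights are renormalized per source term. The data layout, iteration count and top-2 cut-off implementation are assumptions.

```python
def disambiguate(candidates, link_weight, iterations=10):
    """candidates: {source_term: [candidate translations]}
       link_weight(t1, t2): association score between two candidate translations."""
    # start with uniform weights per source term
    w = {s: {c: 1.0 / len(cs) for c in cs} for s, cs in candidates.items()}
    for _ in range(iterations):
        new_w = {}
        for s, cs in candidates.items():
            scores = {}
            for c in cs:
                # support from the candidates of all *other* query terms
                scores[c] = w[s][c] + sum(link_weight(c, c2) * w[s2][c2]
                                          for s2, cs2 in candidates.items() if s2 != s
                                          for c2 in cs2)
            total = sum(scores.values()) or 1.0
            new_w[s] = {c: v / total for c, v in scores.items()}
        w = new_w
    # keep the two highest-weighted candidates per query word, as in the FIRE-2010 runs
    return {s: sorted(cs, key=lambda c: w[s][c], reverse=True)[:2]
            for s, cs in candidates.items()}
```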
Experiments • Monolingual: English, Hindi, and Marathi • CLIR: Hindi-English and Marathi-English • Training • 50 test queries of FIRE-2008 • SVM, RankSVM • Runs: • Monolingual and cross-lingual without feedback • Monolingual and cross-lingual with feedback • Runs with Title and Title + Desc
Conclusions • Improvement in cross-lingual IR due to improvements in transliteration and query disambiguation • About 88% of monolingual retrieval performance • Need further investigation of • Monolingual baselines – Hindi, Marathi and English • Performance of PRF in the discriminative IR model • Effect of various features such as NE, LSI-based similarity and document length normalization
Acknowledgements • The first author is supported by the Infosys Fellowship Award • Project linguists at CFILT, IIT Bombay
References [1] M. K. Chinnakotla and O. P. Damani. Character sequence modeling for transliteration. In ICON 2009, IITB, India, 2009. [2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990. [3] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133-142, New York, NY, USA, 2002. ACM. [4] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008. [5] C. Monz and B. J. Dorr. Iterative translation disambiguation for cross-language information retrieval. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 520-527, New York, NY, USA, 2005. ACM. [6] R. Nallapati. Discriminative models for information retrieval. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004. ACM.
References (Contd..) [7] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006. [8] N. Padariya, M. Chinnakotla, A. Nagesh, and O. P. Damani. Evaluation of Hindi to English, Marathi to English and English to Hindi CLIR at FIRE 2008. In FIRE 2008, IITB, India, 2008. [9] A. Singhal, C. Buckley, M. Mitra, and A. Mitra. Pivoted document length normalization. In SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 21-29, 1996. ACM. [10] D. Widdows and K. Ferraro. Semantic vectors: a scalable open source package and online technology management application. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/. [11] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, pages 403-410, New York, NY, USA, 2001. ACM.