Statistical Machine Translation Models for Personalized Search • Rohini U, AOL India R&D, Bangalore, India, Rohini.uppuluri@corp.aol.com • Vamshi Ambati, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA, vamshi@cs.cmu.edu • Vasudeva Varma, SIEL, LTRC, IIIT Hyderabad, India, vv@iiit.ac.in
Agenda • Introduction • Related Work • Background • User Profile as Translation Model • Personalized Search • Learning User Profile • Re-ranking • Experiments • Conclusions and Future Work
Introduction • Current Web search engines • Provide users with documents “relevant” to their information need • Issues • Information overload • Must cater to hundreds of millions of users • Terabytes of data • Poor description of the information need • Short queries that are difficult to interpret • Word ambiguities • Users only see the top few results • Relevance • Subjective: depends on the user • Does one size fit all?
Introduction (continued) • Search is not a solved problem! • Poorly described information need • Java – (Java island / Java programming language) • Jaguar – (cat / car) • Lemur – (animal / Lemur toolkit) • SBH – (State Bank of Hyderabad / Syracuse Behavioral Healthcare) • Given prior information • I am into biology – best guess for Jaguar? • Past queries - { information retrieval, language modeling } – best guess for lemur?
Review of Personalized Search • Approaches to personalized search • Query logs • Machine learning • Language modeling • Community based • Others
Statistical Language Modeling based Approaches: Introduction • Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language • Applied to a number of problems – Speech, Machine Translation, IR, Summarization
Statistical Language Modeling based Approaches: Background • User information need → Ideal Document → Query Formulation Model → Query (example query: Lemur) • Given a query, which document is most likely to be the Ideal Document? • P(Q/D) = P(q1 … qn/D) = Π_i P(qi/D), treating the query terms as independent given D • In spite of the progress, not much work has been done to capture, model and integrate user context!
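The query-likelihood formula P(Q/D) = Π P(qi/D) can be illustrated with a minimal Python sketch. The slide does not say how P(qi/D) is estimated, so the Jelinek-Mercer smoothing, the `lam` parameter and the two toy documents below are assumptions made only for illustration.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, coll_counts, coll_size, lam=0.5):
    """log P(Q/D) = sum_i log P(qi/D).

    P(qi/D) is a mixture of the document and collection estimates
    (Jelinek-Mercer smoothing, an assumption not stated on the slide).
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_doc = doc_counts[q] / doc_len if doc_len else 0.0
        p_coll = coll_counts.get(q, 0) / coll_size
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score

# Toy collection (hypothetical, loosely based on the Lemur example).
docs = {
    "D1": "lemur encyclopedia brief description physical traits animal".split(),
    "D4": "lemur toolkit language modeling information retrieval download".split(),
}
coll_counts = Counter(t for terms in docs.values() for t in terms)
coll_size = sum(coll_counts.values())

query = ["lemur", "retrieval"]
ranked = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], coll_counts, coll_size),
    reverse=True,
)
print(ranked)
```

Working in log space avoids underflow when many small probabilities are multiplied. Nothing in this score depends on the user, which is the gap the deck goes on to address.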
Noisy Channel based approach • Motivation • [Figure: Ideal Document → Query Generation Process (Noisy Channel) → Query; Retrieval works back from the query to the Ideal Document]
Similar to Statistical Machine Translation • Given an English sentence, translate it into French • Given a query, retrieve documents closer to the ideal document • Noisy Channel 1: English Sentence → French Sentence, modeled by P(e/f) • Noisy Channel 2: Ideal Document → Query, modeled by P(q/w)
Learning user profile • User profile: a Translation Model, stored as triples (qw, dw, p(qw/dw)) of a query word, a document word and the translation probability • Use Statistical Machine Translation methods • Learning a user profile = training a translation model • In SMT, a translation model is trained • From parallel texts • Using the EM algorithm
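As a rough illustration of training a translation model with EM, the sketch below runs IBM Model 1 style updates on (query, document) token pairs and returns p(qw/dw) values. The deck does not say which alignment model is used, so Model 1, the omission of a NULL word, the iteration count and the toy pairs are all assumptions.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t[(qw, dw)] = p(qw/dw) with EM (IBM Model 1 style, no NULL word).

    pairs: list of (query_tokens, document_tokens) tuples.
    """
    q_vocab = {q for q_tokens, _ in pairs for q in q_tokens}
    # Uniform initialisation over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(q_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (qw, dw) alignments
        total = defaultdict(float)  # expected count of dw being aligned
        # E-step: distribute each query word over the document words.
        for q_tokens, d_tokens in pairs:
            for q in q_tokens:
                norm = sum(t[(q, d)] for d in d_tokens)
                for d in d_tokens:
                    frac = t[(q, d)] / norm
                    count[(q, d)] += frac
                    total[d] += frac
        # M-step: renormalise so that sum over qw of p(qw/dw) is 1.
        for (q, d), c in count.items():
            t[(q, d)] = c / total[d]
    return dict(t)

# Hypothetical parallel data: queries paired with words from clicked results.
pairs = [
    (["lemur", "toolkit"], ["retrieval", "software"]),
    (["lemur"], ["retrieval"]),
]
profile = train_model1(pairs)
for (qw, dw), p in sorted(profile.items()):
    print(qw, dw, round(p, 3))
```

The second pair ties "lemur" to "retrieval", so EM shifts probability mass for "toolkit" onto "software". The resulting dictionary has the same shape as the (qw, dw, p(qw/dw)) triples described on the slide.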
Learning user profile (continued) • Extracting parallel texts • From queries and the corresponding snippets of clicked documents • Training a translation model • GIZA++: an open-source toolkit widely used for training translation models in Statistical Machine Translation research
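The parallel-text extraction step might look like the following sketch, which pairs each query with the snippet of every clicked result. The log record layout (`query`, `clicked_snippets`) and the tokenizer are hypothetical, and converting the pairs into the input files a particular toolkit expects is left out.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simple assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_parallel_pairs(click_log):
    """Turn click-log records into (query_tokens, snippet_tokens) pairs.

    click_log: iterable of dicts with hypothetical keys
    "query" (str) and "clicked_snippets" (list of str).
    """
    pairs = []
    for record in click_log:
        q_tokens = tokenize(record["query"])
        for snippet in record["clicked_snippets"]:
            s_tokens = tokenize(snippet)
            if q_tokens and s_tokens:  # skip empty sides
                pairs.append((q_tokens, s_tokens))
    return pairs

click_log = [
    {
        "query": "Lemur toolkit",
        "clicked_snippets": [
            "The Lemur toolkit for language modeling and information retrieval.",
        ],
    },
    {"query": "jaguar", "clicked_snippets": []},  # no clicks, so no pair
]
pairs = build_parallel_pairs(click_log)
print(pairs)
```

Each pair plays the role of one "sentence pair" in SMT training: the query is one side and the clicked snippet is the other.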
Reranking • Recall, in general LM for IR: P(Q/D) = Π P(qi/D) • Noisy Channel based approach: the user profile supplies translation probabilities such as P(lemur/retrieval) • Example query: lemur • D1 (Lemur encyclopedia … brief …): Lemur - Encyclopedia gives a brief description of the physical traits of this animal. • D4 (Lemur toolkit … information retrieval …): The Lemur toolkit for language modeling and information retrieval is documented and made available for download.
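One way to picture the reranking step is sketched below. The slide does not give the exact scoring formula, so the sketch assumes the common translation-model form, P(Q/D) = Π_i Σ_w p(qi/w) P(w/D), with P(w/D) taken as relative frequency in the document. The profile values, the `floor` constant and the document word lists are made-up illustrations.

```python
import math
from collections import Counter

def noisy_channel_score(query_terms, doc_terms, profile, floor=1e-6):
    """log P(Q/D), assuming P(qi/D) = sum_w p(qi/w) * P(w/D).

    profile: dict mapping (query_word, doc_word) -> p(query_word/doc_word).
    floor: small constant used when no document word translates to qi.
    """
    if not doc_terms:
        return float("-inf")
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p = sum(profile.get((q, w), 0.0) * c / doc_len
                for w, c in doc_counts.items())
        score += math.log(max(p, floor))
    return score

def rerank(query_terms, docs, profile):
    """Return document ids sorted by decreasing noisy-channel score."""
    return sorted(
        docs,
        key=lambda d: noisy_channel_score(query_terms, docs[d], profile),
        reverse=True,
    )

# Hypothetical profile of a user who searches for IR topics.
profile = {
    ("lemur", "retrieval"): 0.4,
    ("lemur", "toolkit"): 0.3,
    ("lemur", "lemur"): 0.2,
    ("lemur", "animal"): 0.01,
}
docs = {
    "D1": "lemur encyclopedia brief description physical traits animal".split(),
    "D4": "lemur toolkit language modeling information retrieval download".split(),
}
reranked = rerank(["lemur"], docs, profile)
print(reranked)
```

Both documents contain the word "lemur", so a contextless score cannot separate them on that term alone. The profile's weight on "retrieval" and "toolkit" is what moves D4 ahead for this user.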
Experiments • Performed evaluation on explicit feedback data collected from 7 users • Experiments • Comparison with Contextless Ranking • Comparison between different training models and contexts
Data and Set up • Data • Explicit feedback data collected from 7 users • For each query, each user examined the top 10 documents and identified the relevant ones • Collected the top 10 results for all queries: 3469 documents in total • Set up • Created a Lucene index over the 3469 documents • For reranking, first retrieve the results using Lucene and then rerank them using the noisy channel approach • We perform 10-fold cross-validation
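The 10-fold cross-validation can be sketched as follows. The slide does not say what unit is split into folds, so the sketch assumes the folds are formed over a user's judged queries; the function name and the placeholder query ids are hypothetical.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists, using each of the k folds once as test data."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical query ids for one user.
queries = [f"q{i}" for i in range(25)]
splits = list(k_fold_splits(queries, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

In each round, a user profile would be learned from the training part and the reranker evaluated on the held-out part, and the scores averaged over the ten rounds.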
Metrics • Precision@n • Number of relevant documents among the top n results / n
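Precision@n as defined above is straightforward to compute. The following is a minimal sketch; the document ids and relevance judgments are invented for the example.

```python
def precision_at_n(ranked_doc_ids, relevant_ids, n):
    """Fraction of the top-n ranked documents that are judged relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top_n = ranked_doc_ids[:n]
    hits = sum(1 for doc_id in top_n if doc_id in relevant_ids)
    return hits / n

# Hypothetical ranking and explicit relevance judgments from one user.
ranked = ["D4", "D1", "D7", "D2", "D9"]
relevant = {"D4", "D7", "D9"}

p_at_1 = precision_at_n(ranked, relevant, 1)
p_at_3 = precision_at_n(ranked, relevant, 3)
p_at_5 = precision_at_n(ranked, relevant, 5)
print(p_at_1, p_at_3, p_at_5)
```

The denominator is always n, so a ranking that returns fewer than n documents is penalised for the missing positions.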
Set up • Data is split into Train Data and Test Data • Train Data → User Profile Learner → User Profiles • Test Data + User Profiles → Reranker → Reranked Results
Results • I - Document Training and Document Testing • II - Document Training and Snippet Testing • III - Snippet Training and Document Testing • IV - Snippet Training and Snippet Testing
Conclusions and Future Work • Proposed a statistical MT based approach to user modeling • Captures richer context: relations between query words (q) and document words (w) • Future work • N-gram based method: trigrams, etc. • Noisy Channel based method: bigrams