Statistical Machine Translation Models for Personalized Search


Presentation Transcript

  1. Statistical Machine Translation Models for Personalized Search Rohini U AOL India R&D, Bangalore India Rohini.uppuluri@corp.aol.com Vamshi Ambati Language Technologies Institute Carnegie Mellon University Pittsburgh, USA vamshi@cs.cmu.edu Vasudeva Varma, SIEL, LTRC, IIIT Hyderabad, India vv@iiit.ac.in

  2. Agenda • Introduction • Related Work • Background • User Profile as Translation Model • Personalized Search • Learning User Profile • Re-ranking • Experiments • Conclusions and Future Work

  3. Introduction • Current Web search engines • Provide users with documents “relevant” to their information need • Issues • Information overload • Catering to hundreds of millions of users • Terabytes of data • Poor description of information need • Short queries - difficult to understand • Word ambiguities • Users only see the top few results • Relevance • Subjective – depends on the user • One size fits all?

  4. Continued.. • Search is not a solved problem! • Poorly described information need • Java – (Java island / Java programming language) • Jaguar – (cat / car) • Lemur – (animal / Lemur toolkit) • SBH – (State Bank of Hyderabad / Syracuse Behavioral Healthcare) • Given prior information • I am into biology – best guess for Jaguar? • Past queries - { information retrieval, language modeling } – best guess for Lemur?

  5. Review of Personalized Search • Query logs • Machine learning • Language modeling • Community based • Others

  6. Statistical Language Modeling based Approaches: Introduction • Statistical language modeling: the task of estimating a probability distribution that captures the statistical regularities of natural language • Applied to a number of problems – Speech, Machine Translation, IR, Summarization

  7. Statistical Language Modeling based Approaches: Background • Query formulation model: User information need → Ideal Document → Query (e.g. "Lemur") • Given a query, which document is most likely to be the Ideal Document? • P(Q/D) = P(q1 … qn/D) = Π P(qi/D) • In spite of the progress, not much work has been done to capture, model and integrate user context!
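
The product P(Q/D) = Π P(qi/D) on this slide can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the add-alpha smoothing, the function name `query_log_likelihood`, and the two toy documents (borrowed from the Lemur example later in the deck) are not from the talk.

```python
import math
from collections import Counter

def query_log_likelihood(query, document, vocab_size, alpha=1.0):
    """Return log P(Q/D) = sum_i log P(qi/D) under a unigram document model.

    Add-alpha smoothing is an assumption made here so that query words
    missing from the document do not drive the product to zero.
    """
    counts = Counter(document)
    total = len(document)
    log_p = 0.0
    for q in query:
        p = (counts[q] + alpha) / (total + alpha * vocab_size)
        log_p += math.log(p)
    return log_p

# Toy documents (hypothetical, bag-of-words form).
docs = {
    "D1": "lemur encyclopedia brief description physical traits animal".split(),
    "D4": "lemur toolkit language modeling information retrieval download".split(),
}
vocab = {w for d in docs.values() for w in d}
query = ["lemur", "toolkit"]

scores = {name: query_log_likelihood(query, d, len(vocab)) for name, d in docs.items()}
best = max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when queries grow longer. Note that this baseline uses no user context at all, which is the gap the slide points out.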

  8. Noisy Channel based approach • Motivation • Ideal Document → Query Generation Process (Noisy Channel) → Query • Retrieval: Query → Ideal Document

  9. Similar to Statistical Machine Translation • Given an English sentence, translate it into French • Given a query, retrieve documents closer to the ideal document • Noisy Channel 1: English Sentence → French Sentence, P(e/f) • Noisy Channel 2: Ideal Document → Query, P(q/w)

  10. Learning user profile • User profile: a translation model • Triples: (qw, dw, p(qw/dw)) • Use Statistical Machine Translation methods • Learning a user profile = training a translation model • In SMT: training a translation model • From parallel texts • Using the EM algorithm

  11. Learning User profile • Extracting Parallel Texts • From queries and the corresponding snippets of clicked documents • Training a Translation Model • GIZA++ - an open source toolkit widely used for training translation models in Statistical Machine Translation research.
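
To make the idea of "training a translation model with EM from parallel texts" concrete, here is a minimal Python sketch in the style of IBM Model 1. It is a stand-in for illustration, not the GIZA++ pipeline used in the talk: the function name `train_model1`, the toy query/snippet pairs, the number of iterations, and the omission of a NULL word are all assumptions.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate p(qw/dw) with EM from (query words, snippet words) pairs.

    Simplified IBM Model 1 style training (no NULL word, an assumption).
    Returns a dict mapping (qw, dw) -> p(qw/dw).
    """
    query_vocab = {qw for q_words, _ in pairs for qw in q_words}
    # Uniform initialisation over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(query_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for q_words, d_words in pairs:
            for qw in q_words:
                norm = sum(t[(qw, dw)] for dw in d_words)
                for dw in d_words:
                    frac = t[(qw, dw)] / norm
                    count[(qw, dw)] += frac
                    total[dw] += frac
        # M-step: renormalise so that sum over qw of p(qw/dw) is 1 for each dw.
        t = defaultdict(float, {(qw, dw): c / total[dw] for (qw, dw), c in count.items()})
    return dict(t)

# Hypothetical parallel text: each query paired with a clicked snippet.
pairs = [
    ("lemur".split(), "lemur toolkit language modeling retrieval".split()),
    ("language modeling".split(), "statistical language modeling retrieval".split()),
    ("lemur download".split(), "lemur toolkit download".split()),
]
profile = train_model1(pairs)

# The user profile as (qw, dw, p(qw/dw)) triples, most probable first.
triples = sorted(((qw, dw, p) for (qw, dw), p in profile.items()), key=lambda x: -x[2])
```

The resulting triples have the same shape as the user profile described on the previous slide; a real profile would be trained on a user's actual queries and clicked snippets.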

  12. Sample user profile

  13. Reranking • Recall, in general LM for IR: P(Q/D) = Π P(qi/D) • Noisy Channel based approach: a query word such as "lemur" is scored through translation probabilities, e.g. P(lemur/retrieval) • D1 (Lemur, encyclopedia, …, brief, …): "Lemur - Encyclopedia gives a brief description of the physical traits of this animal." • D4 (Lemur, toolkit, …, information, retrieval, …): "The Lemur toolkit for language modeling and information retrieval is documented and made available for download."
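
One way to combine the query-likelihood product with the learned translation probabilities is sketched below in Python. The scoring form P(Q/D) = Π_i Σ_w p(qi/w) P(w/D), with P(w/D) taken as relative frequency, is an assumption in the spirit of translation-based retrieval models; the slides do not spell out the exact formula. The profile values, the `floor` constant, and the function names are hypothetical.

```python
import math
from collections import Counter

def translation_score(query, doc_words, profile, floor=1e-6):
    """log of prod_i sum_w p(qi/w) * P(w/D), with P(w/D) = count(w, D) / |D|.

    `floor` is a small constant (an assumption) that keeps the log finite
    when no document word translates to a query word.
    """
    counts = Counter(doc_words)
    n = len(doc_words)
    score = 0.0
    for qw in query:
        p = sum(profile.get((qw, dw), 0.0) * c / n for dw, c in counts.items())
        score += math.log(p + floor)
    return score

def rerank(query, results, profile):
    """Order document ids by decreasing translation-model score."""
    return sorted(results, key=lambda name: translation_score(query, results[name], profile), reverse=True)

# Hypothetical profile of a user whose past queries concern IR tools.
profile = {
    ("lemur", "retrieval"): 0.4,
    ("lemur", "toolkit"): 0.3,
    ("lemur", "lemur"): 0.2,
    ("lemur", "animal"): 0.01,
}
results = {
    "D1": "lemur encyclopedia brief description physical traits animal".split(),
    "D4": "lemur toolkit language modeling information retrieval download".split(),
}
ranking = rerank(["lemur"], results, profile)
```

With this toy profile the toolkit page (D4) is ranked above the encyclopedia page (D1), because words such as "retrieval" and "toolkit" carry high probability of producing the query word "lemur" for this user.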

  14. Experiments • Performed evaluation on explicit feedback data collected from 7 users • Experiments • Comparison with Contextless Ranking • Comparison between different training models and contexts

  15. Data and Set up • Data • Explicit feedback data collected from 7 users • For each query, each user examined the top 10 documents and identified the relevant ones • Collected the top 10 results for all queries. Total: 3469 documents • Set up • Created a Lucene index over the 3469 documents • For reranking, first retrieve the results using Lucene and then rerank them using the noisy channel approach • We perform 10-fold cross validation
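
The 10-fold cross validation step can be sketched as follows in Python. The slide does not say what unit is split into folds; splitting a user's queries is an assumption here, as are the function name `k_fold_splits`, the fixed shuffle seed, and the placeholder query ids.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists for k-fold cross validation.

    Items are shuffled once with a fixed seed (an assumption, for
    reproducibility) and dealt round-robin into k folds.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical query ids for one user.
queries = [f"q{i}" for i in range(25)]
splits = list(k_fold_splits(queries, k=10))
```

In each round, the training queries would be used to learn the user profile and the held-out queries would be retrieved and reranked, so every query is tested exactly once.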

  16. Data

  17. Metrics • Precision@n • Number of relevant documents among the top n results / n
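
Precision@n as defined on this slide can be computed with a few lines of Python. The function name and the document ids below are hypothetical and serve only to illustrate the arithmetic.

```python
def precision_at_n(ranked_ids, relevant_ids, n):
    """Number of relevant documents among the top n results, divided by n."""
    top = ranked_ids[:n]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / n

# Hypothetical ranking and explicit relevance judgements for one query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

p5 = precision_at_n(ranked, relevant, 5)  # d1 and d2 are relevant: 2 / 5
p2 = precision_at_n(ranked, relevant, 2)  # only d1 is relevant: 1 / 2
```

Dividing by n rather than by the number of returned results means a ranking shorter than n is penalised, which is one common convention.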

  18. Set up • Data → Train Data → User Profile Learner → User Profiles • Test Data + User Profiles → Reranker → Reranked Results

  19. Results

  20. Results • I - Document Training and Document Testing • II - Document Training and Snippet Testing • III - Snippet Training and Document Testing • IV - Snippet Training and Snippet Testing

  21. Conclusions and Future Work • Proposed a statistical MT based approach for modeling the user profile • Captures richer context: relations between q and w • In future • N-gram based method: trigrams etc. • Noisy Channel based method: bigrams

