This talk discusses the problems with current search engines and presents a solution for personalized web search using clickthrough history. It covers various approaches such as statistical language modeling, machine learning, and ranking SVM to personalize search results. The talk also includes experiments with query log study and simulated feedback.
Personalized Web Search using Clickthrough History U. Rohini 200407019 rohini@research.iiit.ac.in Language Technologies Research Center (LTRC) International Institute of Information Technology (IIIT) Hyderabad, India
Outline of the talk • Introduction • Current Search Engines – Problems • Motivation • Background • Problem Description • Solution Outline • Contributions • Review of Personalized Search • I Search : A suite of approaches for Personalized Web Search • Personalized Search using user Relevance Feedback: Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel based method • Personalized Search using user Relevance Feedback: Machine Learning based approach • Ranking SVM based method • Personalization without Relevance Feedback: Simple Statistical Language modeling based method • Experiments • Query Log Study • Simulated Feedback • Conclusions and Future Directions
Introduction • Current Web Search engines • Provide users with documents “relevant” to their information need • Issues • Information overload • Catering to hundreds of millions of users • Terabytes of data • Poor description of information need • Short queries – difficult to understand • Word ambiguities • Users only see the top few results • Relevance • Subjective – depends on the user • One size fits all???
Motivation • Search is not a solved problem! • Poorly described information need • Java – (Java island / Java programming language) • Jaguar – (cat / car) • Lemur – (animal / Lemur toolkit) • SBH – (State Bank of Hyderabad / Syracuse Behavioral Health care) • Given prior information • I am into biology – best guess for Jaguar? • Past queries – {information retrieval, language modeling} – best guess for Lemur?
Background • Prior Information – user feedback
Problem Description • Personalized Search • Customize search results according to each individual user • Personalized Search - Issues • What to use to Personalize? • How to Personalize? • When not to Personalize? • How to know Personalization helped?
Problem Statement • Problem: How to Personalize? • Our Direction: • Use past search history • Long term learning • Broken down into 2 sub-problems • How to model and represent past search contexts • How to use them to improve search results
Solution Outline 1. How to model and represent past search contexts (User Profile Learning) • Past search history from user over a period of time – query logs • User contexts – triples: {user, query, {relevant documents}} • Apply appropriate method, learn from user contexts, build model – user profile 2. How to use it to improve search results (Reranking) • Get initial search results • Take top few documents, re-score using user profile and sort again
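The two steps above can be sketched as a small pipeline. This is only an illustration: the `UserContext` record and the `rerank` helper are hypothetical names, not taken from the talk, and the scoring function is left to the caller because each approach in the suite defines its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class UserContext:
    """One past search context: {user, query, {relevant documents}}."""
    user: str
    query: str
    relevant_docs: List[str]


def rerank(results: Sequence[str],
           score: Callable[[str], float],
           top_k: int = 10) -> List[str]:
    """Re-score only the top_k initial results and sort them again.

    Results below top_k keep their original order.
    """
    head = sorted(results[:top_k], key=score, reverse=True)
    return head + list(results[top_k:])
```

Restricting the re-scoring to the top few documents keeps the cost of personalization independent of the size of the full result list.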
Contributions • I Search : A suite of approaches for Personalized Web Search • Proposed Personalized search approaches • Baseline • Basic Retrieval methods • Automatic Evaluation • Analysis of Query Log • Creating Simulated Feedback
Review of Personalized Search • Personalized Search approaches: • Query logs • Machine learning • Language modeling • Community based • Others
I Search : A suite of approaches for Personalized Search • Suite of Approaches • Statistical Language modeling based approaches • Simple N-gram based methods • Noisy Channel Model based method • Machine learning based approach • Ranking SVM based method • Personalization without relevance feedback • Simple N-gram based method
Statistical Language Modeling based Approaches: Introduction • Statistical language modeling : task of estimating probability distribution that captures statistical regularities of natural language • Applied to a number of problems – Speech, Machine Translation, IR, Summarization
Statistical Language Modeling based Approaches: Background • Query formulation model: User information need, Ideal Document, Query (e.g. Lemur) • Given a query, which document is most likely to be the Ideal Document? • In spite of the progress, not much work has been done to capture, model and integrate user context!
Motivation for our approach Ideal document Encyclopedia gives a brief description of the physical traits of this animal. The Lemur toolkit for language modeling and information retrieval is documented and made available for download. Information retrieval User Past Search Contexts Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which
Statistical Language Modeling based Approaches : Overview • From user contexts, capture statistical properties of texts • Use the same to improve search results • Different Contexts • Unigram and Bigrams • Simple N-gram based approaches • Relationship between query and document words • Noisy Channel based approach
N-gram based Approaches: Motivation Ideal document Lemur - Encyclopedia gives a brief description of the physical traits of this animal. The Lemur toolkit for language modeling and information retrieval is documented and made available for download. Information retrieval Past Search Contexts Unigrams Information Retrieval Documents … Bigrams Information retrieval Searching documents Information documents … Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which
Learning user profile • Given past search history Hu = {(q1, rf1), (q2, rf2), …, (qn, rfn)} • rfall = concatenation of all rf • For each unigram wi • User profile
Reranking • Recall, in general LM for IR • Our Approach
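The profile and scoring equations from these slides did not survive in the transcript. The sketch below shows one common formulation, which is an assumption and not necessarily the authors' exact method: a unigram model estimated over rfall with add-one smoothing, with each candidate document scored by the average log-probability of its words under that model. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def learn_unigram_profile(history: List[Tuple[str, str]]) -> Tuple[Counter, int]:
    """Count unigrams in rf_all, the concatenation of all feedback texts."""
    rf_all = " ".join(rf for _query, rf in history)
    counts = Counter(rf_all.lower().split())
    return counts, sum(counts.values())


def profile_log_prob(doc: str, counts: Counter, total: int) -> float:
    """Average log P(w | user) over the words of doc, with add-one smoothing."""
    words = doc.lower().split()
    if not words:
        return float("-inf")
    vocab = len(counts) + 1  # one extra slot for unseen words (assumption)
    log_probs = (math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return sum(log_probs) / len(words)
```

With a history dominated by information retrieval texts, a snippet about the Lemur toolkit scores higher than one about the animal, which is the behaviour the motivation slide asks for.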
Noisy Channel based Approach • Documents and queries belong to different information spaces • Queries – short, concise • Documents – more descriptive • Most methods for retrieval or personalized web search do not model this • We capture the relationship between query and document words
Noisy Channel based approach Motivation Query Generation Process (Noisy Channel) Ideal Document Retrieval Query Generation Process (Noisy Channel)
Similar to Statistical Machine Translation • Given an English sentence, translate it into French • Given a query, retrieve documents closer to the ideal document • Noisy Channel 1: English Sentence, French Sentence, P(e/f) • Noisy Channel 2: Ideal Document, Query, P(q/w)
Learning user profile • User profile: Translation Model • Triples: (qw, dw, p(qw/dw)) • Use Statistical Machine Translation methods • Learning the user profile amounts to training a translation model • In SMT: training a translation model • From parallel texts • Using the EM algorithm
Learning User profile • Extracting Parallel Texts • From queries and corresponding snippets from clicked documents • Training a Translation Model • GIZA++ – an open source toolkit widely used for training translation models in Statistical Machine Translation research. U. Rohini, Vamshi Ambati, and Vasudeva Varma. Statistical machine translation models for personalized search. Technical report, International Institute of Information Technology, 2007.
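To make the training step concrete, here is a minimal EM sketch in the style of IBM Model 1, the simplest of the translation models that GIZA++ trains. It is a stand-in for illustration, not GIZA++ itself: it omits the NULL word and the higher models, and `train_model1` is a hypothetical name. Each training pair is (query words, snippet words), and the output is the table of triples (qw, dw, p(qw/dw)).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[List[str], List[str]]  # (query words, snippet words)


def train_model1(pairs: List[Pair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(qw | dw) with EM, IBM Model 1 style (no NULL word)."""
    query_vocab = {qw for q, _ in pairs for qw in q}
    # Uniform initialisation over the query vocabulary.
    t = defaultdict(lambda: 1.0 / len(query_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (qw, dw) alignments
        total = defaultdict(float)  # expected count of dw generating any qw
        # E-step: distribute each query word over the snippet words.
        for q, d in pairs:
            for qw in q:
                norm = sum(t[(qw, dw)] for dw in d)
                for dw in d:
                    frac = t[(qw, dw)] / norm
                    count[(qw, dw)] += frac
                    total[dw] += frac
        # M-step: renormalise so that the sum over qw of t(qw | dw) is 1.
        t = defaultdict(float, {k: c / total[k[1]] for k, c in count.items()})
    return dict(t)
```

The sketch assumes every snippet is non-empty. On toy data, co-occurrence across pairs is enough for EM to separate which snippet word best explains which query word.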
Reranking • Recall, in general LM for IR • Noisy Channel based approach lemur P(retrieval/lemur) Lemur encyclopedia … brief … Lemur toolkit … information retrieval … Lemur - Encyclopedia gives a brief description of the physical traits of this animal. The Lemur toolkit for language modeling and information retrieval is documented and made available for download. D1 : D4:
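The scoring equation for this slide is missing from the transcript. A standard way to use a translation table in retrieval, assumed here rather than confirmed by the talk, is the translation-based query likelihood: log P(Q|D) is the sum over query words qw of log Σ t(qw|dw) · P(dw|D), with the inner sum running over document words dw. The function name and the `floor` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def translation_score(query: List[str],
                      doc: List[str],
                      t: Dict[Tuple[str, str], float],
                      floor: float = 1e-9) -> float:
    """Translation-based query log-likelihood of doc for query."""
    tf = Counter(doc)
    n = len(doc)
    score = 0.0
    for qw in query:
        # P(dw | D) is the maximum-likelihood estimate c / n.
        p = sum(t.get((qw, dw), 0.0) * c / n for dw, c in tf.items())
        score += math.log(max(p, floor))  # floor avoids log(0) for unseen pairs
    return score
```

Because the table is learned from one user's queries and clicked snippets, a document about the toolkit can outrank one about the animal for the query "lemur" when that user's history associates the word with retrieval software.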
Machine Learning based Approaches: Introduction • Most machine learning for IR – binary classification problem – “relevant” and “non-relevant” • Clickthrough data • A click is not an absolute relevance judgment but a relative one • i.e., assuming clicked means relevant and unclicked means irrelevant is wrong • Clicks are biased • Partial relative relevance – clicked documents are more relevant than the unclicked documents
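The last bullet can be turned into training data by reading each result list as a set of pairwise preferences. The sketch below follows the slide literally (every clicked document is preferred over every unclicked one in the same list); `preference_pairs` is a hypothetical helper name.

```python
from typing import List, Set, Tuple


def preference_pairs(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs for one result list.

    Every clicked document is preferred over every unclicked document.
    """
    unclicked = [d for d in ranking if d not in clicked]
    return [(c, u) for c in ranking if c in clicked for u in unclicked]
```

A stricter variant common in the clickthrough literature pairs a clicked document only with unclicked documents ranked above it, which reduces the effect of position bias.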
Background • Ranking SVM • A variation of SVM • Learns from Partial Relevance Data • Learning similar to classification SVM
Ranking SVMs based method • Use Ranking SVMs for learning user profile • Experimented • Different features • Unigram, bigram • Different Feature weights • Boolean, Term Frequency, Normalized Term Frequency
Learning user profile • User profile: a weight vector • Learning: training an SVM model • Steps • Extracting Features • Computing Feature Weights • Training SVM 1. Uppuluri R, Ambati V. Improving web search results using collaborative filtering. In Proceedings of 3rd International Workshop on Web Personalization (ITWP), held in conjunction with AAAI 2006, 2006. 2. U. Rohini and Vasudeva Varma. A novel approach for re-ranking of search results using collaborative filtering. In Proceedings of International Conference on Computing: Theory and Applications (ICCTA’07), pages 491–495, Kolkata, India, March 2007.
Extracting Features • Features: unigram, bigram • Given past search history Hu = {(q1, rf1), (q2, rf2), …, (qn, rfn)} • rfall = concatenation of all rf • Remove stop words from rfall • Extract all unigrams (or bigrams) from rfall
Computing Feature Weights • In each Relevant Document (di), compute weights of features: • Boolean Weighting • 1 or 0 • Term Frequency Weighting • tfw – Number of times it occurs in di • Normalized Term Frequency Weighting • tfw/ |di| |Q|
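The three weighting schemes can be sketched as follows for unigram features. The normalization term on the slide is partly garbled, so this sketch assumes normalized term frequency means tfw divided by the document length |di|. The function and scheme names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def feature_weights(doc_tokens: List[str],
                    features: List[str],
                    scheme: str = "boolean") -> Dict[str, float]:
    """Weight each feature that occurs in the document d_i.

    scheme: "boolean" (1 if present), "tf" (raw count),
            "ntf" (count divided by document length, an assumption).
    """
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    weights: Dict[str, float] = {}
    for f in features:
        if tf[f] == 0:
            continue  # absent features are left out (weight 0)
        if scheme == "boolean":
            weights[f] = 1.0
        elif scheme == "tf":
            weights[f] = float(tf[f])
        elif scheme == "ntf":
            weights[f] = tf[f] / n
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return weights
```

Storing only the non-zero features keeps the representation sparse, which matches the sparse input that SVM toolkits typically expect.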
Training SVM • Each relevant document – represent as a string of features and corresponding weights • We used SVMlight for training
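For reference, SVMlight in ranking mode reads one example per line in the form `<target> qid:<qid> <feature>:<value> ...`, with integer feature ids in increasing order; targets are only compared within the same qid. The helper below is a hypothetical sketch of producing such lines from the feature weights; the mapping from feature strings to integer ids is an assumption about how the data might be prepared.

```python
from typing import Dict


def to_svmlight_line(target: int,
                     qid: int,
                     weights: Dict[str, float],
                     feature_ids: Dict[str, int]) -> str:
    """Format one document as an SVMlight ranking example."""
    # SVMlight expects feature ids in increasing order.
    pairs = sorted((feature_ids[f], w) for f, w in weights.items() if f in feature_ids)
    feats = " ".join(f"{i}:{w:g}" for i, w in pairs)
    return f"{target} qid:{qid} {feats}".rstrip()
```

A file of such lines can then be passed to `svm_learn -z p`, where `-z p` selects preference ranking. Giving clicked documents a higher target than unclicked ones for the same query is one way to encode the relative relevance described earlier.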
Sample Training Data • Sample User Profile (figures)
Reranking • Sim(Q,D) = W · Ф(Q,D) • W – weight vector / user profile • Ф(Q,D) – vector of terms and their weights • Measure of similarity between Q and D • Each term – a term in the query • Term weight – product of its weights in the query and the document (boolean, term frequency, normalized term frequency)
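The scoring rule Sim(Q,D) = W · Ф(Q,D) can be sketched with sparse dictionaries. The names `phi` and `sim` are hypothetical, and the profile here is a plain term-to-weight mapping standing in for the weight vector learned by the SVM.

```python
from typing import Dict


def phi(query_w: Dict[str, float], doc_w: Dict[str, float]) -> Dict[str, float]:
    """Ф(Q, D): one entry per query term, weighted by the product of
    the term's weight in the query and its weight in the document."""
    return {term: qw * doc_w.get(term, 0.0) for term, qw in query_w.items()}


def sim(profile: Dict[str, float],
        query_w: Dict[str, float],
        doc_w: Dict[str, float]) -> float:
    """Sim(Q, D) = W . Ф(Q, D), with W given as a sparse user profile."""
    return sum(profile.get(term, 0.0) * v for term, v in phi(query_w, doc_w).items())
```

Documents are then sorted by decreasing `sim`, so terms the profile weights highly pull matching documents towards the top.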
Personalized Search without Relevance Feedback: Introduction • Can personalization be done without relevance feedback about which documents are relevant? • How informative are the queries posed by users? • Is the information contained in the queries enough to personalize?
Approach • Past queries of the user available • Make effective use of past queries • Simple N-gram based approach
Learning user profile • Given past search history Hu = {q1, q2, …, qn} • qconcat: concatenation of all queries • For each unigram wi • User profile
Reranking • In general LM for IR • Our Approach U. Rohini, Vamshi Ambati, and Vasudeva Varma. Personalized search without relevance feedback. Technical report, International Institute of Information Technology, 2007
Experiments: Introduction, Problems • Aim: to see how the approaches perform by comparing them with a baseline • Problems • No standard evaluation framework • Data • Lack of standardization • Comparison with previous work difficult • Difficult to repeat previously conducted experiments • Difficult to share results and observations • Repeated effort to collect data over and over • Identified as a problem and a need for standardization (Allan et al. 2003) • Lack of standard personalized search baselines • In our work, used a variation of the Rocchio Algorithm • Metrics
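For context on the baseline, the textbook Rocchio update moves the query vector towards the centroid of relevant documents and away from the centroid of non-relevant ones: q' = α·q + β·centroid(relevant) − γ·centroid(non-relevant). The sketch below implements that textbook form only; the specific variation used in the talk is not described in this transcript, and the default coefficients are commonly quoted values, not the authors'.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def rocchio(query: Vector,
            relevant: List[Vector],
            non_relevant: List[Vector],
            alpha: float = 1.0,
            beta: float = 0.75,
            gamma: float = 0.15) -> Vector:
    """Textbook Rocchio relevance feedback on sparse term vectors."""
    new_q = defaultdict(float)
    for term, w in query.items():
        new_q[term] += alpha * w
    # Add the relevant centroid, subtract the non-relevant centroid.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, w in doc.items():
                new_q[term] += coeff * w / len(docs)
    # Non-positive weights are conventionally dropped.
    return {term: w for term, w in new_q.items() if w > 0}
```

In a personalization setting, the relevant set would come from the user's past clicked documents, so the expanded query carries terms from the user's history.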