Searching Web Better
Dr Wilfred Ng
Department of Computer Science
The Hong Kong University of Science and Technology
Outline
• Introduction
• Main Techniques (RSCF)
• Clickthrough Data
• Ranking Support Vector Machine Algorithm
• Ranking SVM in Co-training Framework
• The RSCF-based Metasearch Engine
• Search Engine Components
• Feature Extraction
• Experiments
• Current Development
Search Engine Adaptation
[Figure: labels include Social Science, Computer Science, Finance, Product, CS terms, News; search engines shown: Google, MSNsearch, Wisenut, Overture, …]
• Adapt the search engine by learning from implicit feedback: clickthrough data
Clickthrough Data
• Clickthrough data: data that indicates which links in the returned ranking results have been clicked by the users
• Formally, a triplet (q, r, c)
  • q: the input query
  • r: the ranking result presented to the user
  • c: the set of links the user clicked on
• Benefits:
  • Can be obtained in a timely manner
  • Does not interfere with the user's search activity
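The triplet (q, r, c) maps naturally onto a small record type. The following is a minimal sketch only: the class name `Clickthrough`, its field names, and the example clicks are assumptions made for illustration, not part of the original system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clickthrough:
    """One clickthrough record (q, r, c). Names are illustrative."""
    query: str                # q: the input query
    ranking: tuple[str, ...]  # r: links in the order presented to the user
    clicked: frozenset[str]   # c: the set of links the user clicked on

    def clicked_positions(self) -> list[int]:
        """1-based ranks of the clicked links, in presentation order."""
        return [i for i, link in enumerate(self.ranking, start=1)
                if link in self.clicked]


record = Clickthrough(
    query="radix sort",
    ranking=tuple(f"l{i}" for i in range(1, 11)),  # l1 .. l10
    clicked=frozenset({"l1", "l7", "l10"}),        # hypothetical clicks
)
print(record.clicked_positions())  # [1, 7, 10]
```

Because such a record only stores what the engine already shows and what the user already does, it can be logged without changing the search activity.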
An Example of Clickthrough Data
[Figure: the result list returned for the user's input query, with the links clicked by the user marked]
An Example of Clickthrough Data
[Figure: the same result list for the user's input query, with the links clicked by the user marked and the list divided into a labelled data set and an unlabelled data set]
Target Ranking (Preference Pairs Set)
• Labelled data set: l1, l2, …, l10
• Unlabelled data set: l11, l12, …
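The slide does not spell out how the preference pairs are derived from the clicks. One widely used rule, due to Joachims, is that a clicked link is preferred over every unclicked link presented above it. The sketch below assumes that rule; the function name `preference_pairs` and the clicks on l1, l7, and l10 are assumptions for illustration.

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, less_preferred) pairs: each clicked link is
    preferred over every unclicked link presented above it (assumed rule)."""
    pairs = []
    for i, link in enumerate(ranking):
        if link in clicked:
            pairs.extend((link, above) for above in ranking[:i]
                         if above not in clicked)
    return pairs


ranking = [f"l{i}" for i in range(1, 11)]  # labelled data set l1 .. l10
clicked = {"l1", "l7", "l10"}              # hypothetical clicks
pairs = preference_pairs(ranking, clicked)
print(len(pairs))  # 12: five pairs from l7, seven from l10, none from l1
```

Under this rule the top-ranked clicked link l1 yields no pairs at all, while lower clicked links are always on the preferred side.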
The Ranking SVM Algorithm
• Three links, each described by a feature vector
• Target ranking: l1 <r’ l2 <r’ l3
• Weight vector: the ranker
• Margin: the distance between the two closest projected links
[Figure: links l1, l2, l3 projected onto candidate weight vectors as l1’, l2’, l3’]
• Cons: it needs a large set of labelled data
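For reference, the optimization problem usually associated with Ranking SVM (Joachims' formulation) can be written as below. The slide itself does not state it, so the feature map Φ, the trade-off constant C, and the slack variables ξ are assumptions taken from that standard formulation; l_i <r’ l_j is read as "l_i should be ranked ahead of l_j".

```latex
\begin{aligned}
\min_{\mathbf{w},\,\xi}\quad
  & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} + C \sum_{(i,j)} \xi_{ij} \\
\text{s.t.}\quad
  & \mathbf{w}\cdot\Phi(q, l_i) \;\ge\; \mathbf{w}\cdot\Phi(q, l_j) + 1 - \xi_{ij}
    \quad \text{for every target pair } l_i <_{r'} l_j, \\
  & \xi_{ij} \ge 0 .
\end{aligned}
```

Minimizing the norm of w corresponds to maximizing the distance between the two closest projected links.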
The Ranking SVM in Co-training Framework
• Divide the feature vector into two subvectors
• Two rankers are built over these two feature subvectors
• Each ranker chooses several unlabelled preference pairs and adds them to the labelled data set
• Rebuild each ranker from the augmented labelled data set
[Figure: the labelled preference feedback pairs P_l train Ranker a_A and Ranker a_B; each ranker selects confident pairs from the unlabelled preference pairs P_u, and the augmented pairs are added back to the training data]
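The loop above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: a pairwise perceptron stands in for the Ranking SVM, confidence is taken to be the absolute margin, the loop stops after a fixed number of rounds rather than on a prediction-difference criterion, and all names and data are hypothetical.

```python
def train_ranker(pairs, dim, epochs=50, lr=0.1):
    """Pairwise perceptron (stand-in for RSVM): seek w with w.(a - b) > 0
    for every pair (a, b) in which a is preferred over b."""
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b in pairs:
            diff = [x - y for x, y in zip(a, b)]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def margin(w, a, b):
    """Signed score difference between a and b under weight vector w."""
    return sum(wi * (x - y) for wi, x, y in zip(w, a, b))


def view_of(vec, idx):
    """Project a feature vector onto one feature subvector."""
    return [vec[i] for i in idx]


def co_train(labelled, unlabelled, view_a, view_b, rounds=4, per_round=2):
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        for view in (view_a, view_b):
            if not pool:
                break
            # Rebuild this view's ranker from the current labelled pairs.
            w = train_ranker([(view_of(a, view), view_of(b, view))
                              for a, b in labelled], len(view))

            def confidence(pair):
                return abs(margin(w, view_of(pair[0], view),
                                  view_of(pair[1], view)))

            # Move the most confident unlabelled pairs to the labelled set,
            # oriented the way this ranker predicts.
            pool.sort(key=confidence, reverse=True)
            chosen, pool = pool[:per_round], pool[per_round:]
            for x, y in chosen:
                x_first = margin(w, view_of(x, view), view_of(y, view)) >= 0
                labelled.append((x, y) if x_first else (y, x))
    # Final ranker: trained on the whole feature vector and augmented set.
    return train_ranker(labelled, len(labelled[0][0])), labelled


labelled = [((1.0, 0.9, 1.0, 0.8), (0.1, 0.2, 0.0, 0.1)),
            ((0.8, 1.0, 0.9, 1.0), (0.2, 0.0, 0.1, 0.3))]
unlabelled = [((0.0, 0.1, 0.2, 0.1), (0.9, 0.8, 1.0, 0.9)),
              ((0.7, 0.9, 0.8, 1.0), (0.3, 0.1, 0.0, 0.2)),
              ((0.2, 0.3, 0.1, 0.0), (1.0, 0.7, 0.9, 0.8))]
final, augmented = co_train(labelled, unlabelled, view_a=[0, 1], view_b=[2, 3])
print(len(augmented))  # 5: two original pairs plus three newly labelled ones
```

Each ranker sees only its own subvector when choosing pairs, which is why each subvector must carry enough information to rank on its own.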
Some Issues
• Guideline for partitioning the feature vector
  • After the partition, each subvector must be sufficient for the later ranking
• Number of rankers
  • Depends on the number of features
• When to terminate the procedure?
  • Prediction difference: indicates the ranking difference between the two rankers
• After termination, get a final ranker on the augmented labelled data set
Metasearch Engine
• Receives a query from the user
• Sends the query to multiple search engines
• Combines the retrieved results from the underlying search engines
• Presents a unified ranking result to the user
[Figure: the user query goes to the metasearch engine, which forwards it to Search Engine 1, Search Engine 2, …, Search Engine n; Retrieved Results 1, 2, …, n are combined into the unified ranking result]
Search Engine Components
• Powered by Inktomi, relatively mature
• One of the most powerful search engines nowadays
• A new but growing search engine
• Ranks links based on the prices paid by the sponsors on the links
Feature Extraction
• Ranking Features (12 binary features)
  • Rank(E, T), where E ∈ {M, W, O} and T ∈ {1, 3, 5, 10} (M: MSNsearch, W: Wisenut, O: Overture)
  • Indicate the ranking of the links in each underlying search engine
• Similarity Features (4 features)
  • Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  • URL, Title, Abstract Cover, Abstract Group
  • Indicate the similarity between the query and the link
Experiments
• Experiment data: within the same domain (Computer Science)
• Objectives:
  • Offline experiments: compared with RSVM
  • Online experiments: compared with Google
Prediction Error
• Prediction error: the difference between the ranker's ranking and the target ranking
• Target ranking: l1 <r’ l2, l1 <r’ l3, l2 <r’ l3
• Projected ranking: l2 <r’ l1, l1 <r’ l3, l2 <r’ l3
• Prediction error = 1/3 ≈ 33% (one of the three target pairs is reversed)
[Figure: links l1, l2, l3 and their projections l1’, l2’, l3’ onto the weight vector]
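The 33% figure can be reproduced by counting reversed pairs. In this minimal sketch the function name and the scores are hypothetical; a higher score means the link is ranked earlier, and a tie is counted as an error.

```python
def prediction_error(target_pairs, scores):
    """Fraction of target pairs (a, b), meaning a should rank ahead of b,
    that the ranker's scores fail to order that way (ties count as errors)."""
    wrong = sum(1 for a, b in target_pairs if scores[a] <= scores[b])
    return wrong / len(target_pairs)


target = [("l1", "l2"), ("l1", "l3"), ("l2", "l3")]
scores = {"l2": 3.0, "l1": 2.0, "l3": 1.0}  # hypothetical projection: l2, l1, l3
print(f"{prediction_error(target, scores):.0%}")  # 33%
```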
Offline Experiment (Compared with RSVM)
[Figure: three panels (10 queries, 30 queries, 60 queries), each comparing the ranker trained by the RSVM algorithm on the whole feature vector, the ranker trained by the RSCF algorithm on one feature subvector, and the ranker trained by the RSCF algorithm on the other feature subvector]
• The prediction error rises again!
• The number of iterations in the RSCF algorithm is about four to five
Offline Experiment (Compared with RSVM)
• Overall comparison
[Figure: the ranker trained by the RSVM algorithm versus the final ranker trained by the RSCF algorithm]
Online Experiment (Compared with Google)
• Experiment data: CS terms
  • e.g. radix sort, TREC collection, …
• Experiment setup
  • Combine the results returned by RSCF and those returned by Google into one shuffled list
  • Present the list to the users in a unified way
  • Record the users' clicks
Conclusion on RSCF
• Search engine adaptation
• The RSCF algorithm
  • Train on clickthrough data
  • Apply RSVM in the co-training framework
• The RSCF-based metasearch engine
  • Offline experiments: better than RSVM
  • Online experiments: better than Google
Current Development
• Feature extraction and division
• Apply in different domains
• Search engine personalization
• SpyNoby Project: personalized search engine with clickthrough analysis
Modified Target Ranking for Metasearch Engines
• If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
• Advantages:
  • Alleviates the penalty on high-ranked links
  • Gives more credit to the ranking ability of the underlying search engines
Modified Target Ranking
• Labelled data set: l1, l2, …, l10
• Unlabelled data set: l11, l12, …
RSCF-based Metasearch Engine - MEA
[Figure: MEA sends the user query q to three underlying search engines; each returns a ranked list of 30 links, and the lists are combined into the unified ranking result]
RSCF-based Metasearch Engine - MEB
[Figure: MEB sends the user query q to four underlying search engines; each returns a ranked list of 30 links, and the lists are combined into the unified ranking result]
Generating Clickthrough Data
• Probability of being clicked on: $\Pr(k) = \dfrac{1}{k^{\alpha}\, H_{n,\alpha}}$
  • k: the ranking of the link in the metasearch engine
  • n: the number of all the links in the metasearch engine
  • $\alpha$: the skewness parameter in Zipf's law
  • Harmonic number: $H_{n,\alpha} = \sum_{i=1}^{n} \dfrac{1}{i^{\alpha}}$
• Judge the link's relevance manually
  • If the link is irrelevant, it is not clicked on
  • If the link is relevant, it is clicked on with probability Pr(k)
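The simulation described here can be sketched as follows, assuming the standard Zipf form Pr(k) = 1 / (k^α · H), where H is the generalized harmonic number Σ 1/i^α for i = 1..n. The function names, the values n = 30 and α = 1.0, and the set of relevant ranks are assumptions chosen for illustration.

```python
import random


def click_probability(k, n, alpha):
    """Zipf-style probability that the link at rank k (1-based) of n is clicked."""
    harmonic = sum(1.0 / i ** alpha for i in range(1, n + 1))
    return 1.0 / (k ** alpha * harmonic)


def simulate_clicks(relevant, n, alpha, rng):
    """relevant: set of 1-based ranks judged relevant by hand.
    Irrelevant links are never clicked; a relevant link at rank k is
    clicked with probability Pr(k)."""
    return [k for k in range(1, n + 1)
            if k in relevant and rng.random() < click_probability(k, n, alpha)]


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
probs = [click_probability(k, 30, 1.0) for k in range(1, 31)]
clicks = simulate_clicks({1, 4, 9}, n=30, alpha=1.0, rng=rng)
print(round(sum(probs), 6))  # 1.0: the probabilities over all ranks sum to one
```

A larger α concentrates the clicks more strongly on the top-ranked links.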
Feature Extraction
• Ranking Features (binary features)
  • Rank(E, T): whether the link is ranked within S_T in E, where E ∈ {G, M, W, O} and T ∈ {1, 3, 5, 10, 15, 20, 25, 30}
  • S_1 = {1}, S_3 = {2, 3}, S_5 = {4, 5}, S_10 = {6, 7, 8, 9, 10}, ……
  • (G: Google, M: MSNsearch, W: Wisenut, O: Overture)
  • Indicate the ranking of the links in each underlying search engine
• Similarity Features (4 features)
  • Sim_U(q, l), Sim_T(q, t), Sim_C(q, a), Sim_G(q, a)
  • Measure the similarity between the query and the link
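The binary ranking features could be computed along these lines. This is a sketch under assumptions: the slide defines S_1 to S_10 explicitly, and the buckets for T = 15, 20, 25, 30 are assumed here to continue the same pattern; the function name and input format are hypothetical.

```python
ENGINES = ("G", "M", "W", "O")  # Google, MSNsearch, Wisenut, Overture

# Rank buckets S_T. S_1 .. S_10 are given on the slide; the remaining four
# are an assumption that the same pattern continues.
BUCKETS = {1: range(1, 2), 3: range(2, 4), 5: range(4, 6), 10: range(6, 11),
           15: range(11, 16), 20: range(16, 21), 25: range(21, 26),
           30: range(26, 31)}


def rank_features(ranks):
    """ranks maps an engine code to the link's 1-based rank in that engine;
    engines that did not return the link are simply absent.
    Returns Rank(E, T) as 0/1 values keyed by (E, T)."""
    return {(e, t): int(ranks.get(e) in bucket)
            for e in ENGINES for t, bucket in BUCKETS.items()}


# Hypothetical link: 2nd in Google, 7th in MSNsearch, 1st in Overture.
feats = rank_features({"G": 2, "M": 7, "O": 1})
print(len(feats), sum(feats.values()))  # 32 3
```

With disjoint buckets, a link switches on at most one ranking feature per engine.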
Experiments
• Experiment data: three different domains
  • CS terms
  • News
  • E-shopping
• Objectives:
  • Prediction error: better than RSVM
  • Top-k precision: adaptation ability
Top-k Precision
• Advantages:
  • Precision is easier to obtain than recall
  • Users care only about the top-k links (k = 10)
• Evaluation data: 30 queries in each domain
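The slide's own definition is not preserved, so the sketch below uses the usual reading of top-k precision: the fraction of the first k returned links that are relevant. The function name and the relevance judgements are hypothetical.

```python
def precision_at_k(ranked_links, relevant, k=10):
    """Fraction of the first k returned links that are judged relevant.
    Divides by k even if fewer than k links were returned."""
    top = ranked_links[:k]
    return sum(1 for link in top if link in relevant) / k


ranked = [f"l{i}" for i in range(1, 31)]  # 30 returned links
relevant = {"l1", "l3", "l4", "l12"}      # hypothetical manual judgements
print(precision_at_k(ranked, relevant))   # 0.3
```

Only the top k links need to be judged, which is what makes precision cheaper to obtain than recall.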
Comparison of Top-k Precision
[Figure: top-k precision in three domains: News, CS terms, E-shopping]
Statistical Analysis
• Hypothesis testing (two-sample hypothesis testing about means): used to analyze whether there is a statistically significant difference between the means of two samples
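The slide does not say which two-sample test was used. As one possible illustration, the sketch below computes Welch's t statistic, which does not assume equal variances; the choice of test, the function name, and the sample values are all assumptions.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical per-query top-10 precisions for two engines.
engine_a = [0.6, 0.7, 0.5, 0.8, 0.7]
engine_b = [0.4, 0.5, 0.5, 0.6, 0.4]
t = welch_t(engine_a, engine_b)
print(round(t, 2))
```

The statistic is then compared with a critical value from the t distribution at the chosen significance level to decide whether the difference in means is significant.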
Comparison Results
• MEA can produce better search quality than Google
• Google does not excel in every query category
• MEA and MEB are able to adapt to bring out the strengths of each underlying search engine
• MEA and MEB are better than, or comparable to, all of their underlying search engine components in every query category
• The RSCF-based metasearch engine
  • Comparison of prediction error: better than RSVM
  • Comparison of top-k precision: adaptation ability
Spy Naïve Bayes – Motivation
• Problems with Joachims' method
  • Strong assumptions
  • Excessively penalizes high-ranked links: l1, l2, l3 are apt to appear on the right of a preference pair, while l7, l10 appear on the left
• New interpretation of clickthrough data
  • Clicked: positive (P)
  • Unclicked: unlabeled (U), containing both positive and negative samples
• Goal: identify Reliable Negatives (RN) from U
• Preference pairs: lp <r ln
Spy Naïve Bayes: Ideas
• Standard naïve Bayes: classifies positive and negative samples
• One-step spy naïve Bayes: spying out RN from U
  • Put a small number of positive samples into U to act as "spies" (to scout the behavior of real positive samples in U)
  • Take U as negative samples to train a naïve Bayes classifier
  • Samples with lower probabilities of being positive are assigned to RN
• Voting procedure: makes spying more robust
  • Run one-step SpyNB n times and get n sets RN_i
  • A sample that appears in at least m (m ≲ n) of the sets RN_i appears in the final RN
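The voting step on its own is easy to sketch. The function name and the three example RN_i sets are hypothetical; the one-step SpyNB runs that would produce them are not shown.

```python
from collections import Counter


def vote_reliable_negatives(rn_sets, m):
    """Keep the samples that occur in at least m of the given RN_i sets."""
    counts = Counter(sample for rn in rn_sets for sample in set(rn))
    return {sample for sample, count in counts.items() if count >= m}


# Hypothetical outputs of n = 3 one-step SpyNB runs.
runs = [{"l2", "l3", "l5"}, {"l2", "l5", "l8"}, {"l2", "l3", "l5"}]
final_rn = vote_reliable_negatives(runs, m=3)
print(sorted(final_rn))  # ['l2', 'l5']
```

Requiring agreement across runs reduces the influence of any single random choice of spies.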
My publications
• Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. An International Journal of Information Processing & Management, 43(1), pages 290-292, (2007).
• Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear: ACM Transactions on Internet Technology, (2006).
• Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal - Special Issue on Web Dynamics, (2005).
• Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language For Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pages 567-581, (2004).
• Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG and Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
• Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pages 519-532, (2004).
• Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pages 236-241, (2003).
• Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A Book Chapter in "Semantic Issues in E-Commerce Systems", Edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pages 155-170, (2003).