Slide 1:Searching Web Better
Dr Wilfred Ng, Department of Computer Science, The Hong Kong University of Science and Technology
Note: Add the logo of HKUST.
Slide 2:Outline
Introduction
Main Techniques (RSCF)
- Clickthrough Data
- Ranking Support Vector Machine Algorithm
- Ranking SVM in Co-training Framework
The RSCF-based Metasearch Engine
- Search Engine Components
- Feature Extraction
Experiments
Current Development
Note: Use red colour to highlight the important keywords (once), so that the main idea comes across with the words: SVM, RSCF, Co-training, Metasearch Engine.
Slide 3:Search Engine Adaptation
Underlying search engines: Google, MSNsearch, Wisenut, Overture
Domains: Computer Science, Finance, Social Science (CS terms, Product, News)
Adapt the search engine by learning from implicit feedback: clickthrough data
Note: I changed the arrow directions. Here you may write down the engines we used: Google, MSN, Overture. Highlight clickthrough data.
Slide 4:Clickthrough Data
Clickthrough data: data that indicates which links in the returned ranking results have been clicked by the users
Formally, a triplet (q, r, c):
q: the input query
r: the ranking result presented to the user
c: the set of links the user clicked on
Benefits: it can be obtained in a timely manner, with no intervention in the search activity
Note: q, r, c should be italic.
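The triplet (q, r, c) can be sketched as a small data structure. This is an illustrative sketch only; the class and field names are mine, not the paper's.

```python
from dataclasses import dataclass, field

@dataclass
class ClickthroughRecord:
    """One clickthrough record (q, r, c), as defined on the slide."""
    q: str                                # the input query
    r: list                               # ranking presented to the user (link ids)
    c: set = field(default_factory=set)   # links the user clicked on

# Hypothetical example matching the slides: l1, l7, l10 are clicked.
record = ClickthroughRecord(
    q="support vector machine",
    r=[f"l{i}" for i in range(1, 11)],
    c={"l1", "l7", "l10"},
)

# Every clicked link must come from the presented ranking.
assert record.c <= set(record.r)
```

Because the record is collected from ordinary server logs, it satisfies both benefits above: timeliness and no intervention in the user's search activity.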
Slide 5:An Example of Clickthrough Data
User's input query; the returned links l1, l2, ..., some of them clicked by the user
Note: Overlay with l1, l7, l10 in the animation. Overlay with labelled and unlabelled regions (with colours, for example).
Slide 6:Target Ranking (Preference Pairs Set)
Slide 7:An Example of Clickthrough Data
User's input query; the returned links, some clicked by the user; labelled data set vs. unlabelled data set
Labelled data set: l1, l2, ..., l10. Unlabelled data set: l11, l12, ...
Slide 8:Target Ranking (Preference Pairs Set)
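The target ranking is derived from the clicks as a set of preference pairs. A hedged sketch of the standard derivation (a clicked link is preferred over every unclicked link ranked above it); the function name is mine:

```python
def preference_pairs(ranking, clicks):
    """Derive preference pairs from clickthrough data: a clicked link l_i
    is preferred over every unclicked link ranked above it.
    A pair (a, b) encodes a <r b, i.e. a should rank before b."""
    pairs = []
    for i, li in enumerate(ranking):
        if li in clicks:
            for lj in ranking[:i]:          # links ranked above l_i
                if lj not in clicks:        # ...that were skipped by the user
                    pairs.append((li, lj))
    return pairs

# Example from the slides: l1, l7, l10 clicked out of l1..l10.
ranking = [f"l{i}" for i in range(1, 11)]
pairs = preference_pairs(ranking, {"l1", "l7", "l10"})
```

Here l1 yields no pairs (nothing is ranked above it), l7 is preferred over l2..l6, and l10 over l2..l6, l8, l9.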
Slide 9:The Ranking SVM Algorithm
Three links, each described by a feature vector
Target ranking: l1 <r l2 <r l3
Weight vector w: the ranker; the margin is the distance between the two closest projected links
Cons: it needs a large set of labelled data
Note: I can't see δ in the slide! Fill in the animation of the projected points on w1 and w2. It seems w and α are confused. How large is the set of labelled data?
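Ranking SVM reduces ranking to classification on feature-vector differences: for each pair "a preferred over b", the difference x_a - x_b should score positive. As a minimal sketch (not the paper's implementation), a simple perceptron update stands in for the SVM solver, just to show the reduction:

```python
def train_ranker(pairs, features, epochs=50, lr=0.1):
    """pairs: list of (preferred, other) link ids.
    features: dict of link id -> feature vector (list of floats).
    Learns w so that the preferred link of each pair scores higher."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b in pairs:
            diff = [fa - fb for fa, fb in zip(features[a], features[b])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair violated: nudge w toward the difference
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data for the target ranking l1 <r l2 <r l3 on the slide.
features = {"l1": [1.0, 0.0], "l2": [0.5, 0.5], "l3": [0.0, 1.0]}
w = train_ranker([("l1", "l2"), ("l2", "l3")], features)
```

A real Ranking SVM additionally maximizes the margin δ between the closest projected links; the perceptron here only finds some separating w.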
Slide 10:The Ranking SVM in Co-training Framework
Divide the feature vector into two subvectors
Two rankers are built over these two feature subvectors
Each ranker chooses several unlabelled preference pairs and adds them to the labelled data set
Rebuild each ranker from the augmented labelled data set
Diagram: labelled preference feedback pairs P_l and unlabelled preference pairs P_u; training produces rankers a_A and a_B; selecting confident pairs yields augmented pairs
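The four steps above can be sketched as a generic co-training loop. This is a hedged illustration with placeholder `train` and `confidence` functions (any pairwise learner fits); the function and parameter names are mine:

```python
def co_train(labelled, unlabelled, feats_a, feats_b, train, confidence,
             k=2, rounds=3):
    """feats_a / feats_b: the two feature subvectors (dicts link -> vector).
    train(pairs, feats) -> ranker; confidence(ranker, pair, feats) -> float.
    Each round, each ranker moves its k most confident unlabelled pairs
    into the labelled set, then both rankers are rebuilt."""
    for _ in range(rounds):
        ranker_a = train(labelled, feats_a)
        ranker_b = train(labelled, feats_b)
        for ranker, feats in ((ranker_a, feats_a), (ranker_b, feats_b)):
            if not unlabelled:
                break
            ranked = sorted(unlabelled,
                            key=lambda p: -confidence(ranker, p, feats))
            chosen = ranked[:k]
            labelled.extend(chosen)                 # augment labelled set
            unlabelled = [p for p in unlabelled if p not in chosen]
    # final rankers trained on the augmented labelled data set
    return train(labelled, feats_a), train(labelled, feats_b)

# Dummy stand-ins just to exercise the control flow.
labelled = [("l1", "l2")]
unlabelled = [("l3", "l4"), ("l5", "l6"), ("l7", "l8")]
dummy_train = lambda pairs, feats: list(pairs)
dummy_conf = lambda ranker, pair, feats: 0.0
ra, rb = co_train(labelled, unlabelled, {}, {}, dummy_train, dummy_conf,
                  k=1, rounds=2)
```

The design point is that the two rankers teach each other: a pair one ranker is confident about becomes labelled training data for the other.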
Slide 11:Some Issues
Guideline for partitioning the feature vector: after the partition, each subvector must remain sufficient for the later ranking
Number of rankers: depends on the number of features
When to terminate the procedure? Prediction difference: indicates the ranking difference between the two rankers
After termination, train a final ranker on the augmented labelled data set
Slide 12:Metasearch Engine
Receives a query from the user
Sends the query to multiple search engines
Combines the retrieved results from the underlying search engines
Presents a unified ranking result to the user
Diagram: User -> Metasearch Engine -> Search Engines 1..n -> Retrieved Results 1..n -> Unified Ranking Result
Note: The diagram is too boring; use more colourful boxes.
Slide 13:Search Engine Components
Powered by Inktomi, relatively mature
One of the most powerful search engines nowadays
A new but growing search engine
Ranks links based on the prices paid by the sponsors of the links
Slide 14:Feature Extraction
Ranking Features (12 binary features)
Rank(E,T), where E ∈ {M,W,O} and T ∈ {1,3,5,10} (M: MSNsearch, W: Wisenut, O: Overture)
Indicate the ranking of the links in each underlying search engine
Similarity Features (4 features)
Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a): URL, Title, Abstract Cover, Abstract Group
Indicate the similarity between the query and the link
Note: You should mention why these features are selected, as a bullet.
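As an illustration of the similarity features (not the paper's exact definition, which is unspecified here), a simple term-overlap score between the query and a text field of the link could look like:

```python
def sim(query, text):
    """Illustrative similarity feature: fraction of query terms that
    appear in the given text field (URL, title, or abstract)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

# Both query terms appear in the title -> similarity 1.0
assert sim("radix sort", "Radix sort - Wikipedia") == 1.0
# Only "sort" matches -> similarity 0.5
assert sim("radix sort", "Bubble sort tutorial") == 0.5
```

Such features let the learned ranker prefer links whose title or abstract actually covers the query terms, independently of the underlying engines' ranks.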
Slide 15:Experiments
Experiment data: within the same domain (computer science)
Objectives: offline experiments compared with RSVM; online experiments compared with Google
Note: You should elaborate more on prediction error by giving an example. The first two points are minor or can be mentioned verbally during the presentation. You should also mention the objectives of the online and offline experiments.
Slide 16:Prediction Error
Prediction error: the difference between the ranker's ranking and the target ranking
Target ranking: l1 <r l2, l1 <r l3, l2 <r l3
Projected ranking: l2 <r l1, l1 <r l3, l2 <r l3
Prediction error = 33%
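The slide's example can be computed directly: the error is the fraction of target preference pairs that the ranker's projected ranking gets wrong. A minimal sketch (function name is mine):

```python
def prediction_error(target_pairs, predicted_pairs):
    """Fraction of target preference pairs absent from the prediction.
    Each pair (a, b) encodes a <r b."""
    wrong = sum(1 for p in target_pairs if p not in predicted_pairs)
    return wrong / len(target_pairs)

# The slide's example: the ranker flips l1 <r l2 into l2 <r l1.
target    = [("l1", "l2"), ("l1", "l3"), ("l2", "l3")]
predicted = [("l2", "l1"), ("l1", "l3"), ("l2", "l3")]
err = prediction_error(target, predicted)  # one of three pairs disagrees
```

One disagreement out of three pairs gives 1/3, the slide's 33%.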
Slide 17:Offline Experiment (Compared with RSVM)
10 queries, 30 queries, 60 queries
The ranker trained by the RSVM algorithm on the whole feature vector; the rankers trained by the RSCF algorithm on each of the two feature subvectors
Prediction error rises again! The number of iterations in the RSCF algorithm is about four to five.
Slide 18:Offline Experiment (Compared with RSVM)
The ranker trained by the RSVM algorithm The final ranker trained by the RSCF algorithm Overall comparison
Slide 19:Online Experiment (Compared with Google)
Experiment data: CS terms, e.g. radix sort, TREC collection, ...
Experiment setup: combine the results returned by RSCF and those by Google into one shuffled list; present it to the users in a unified way; record the users' clicks
Note: You may show some example queries in one bullet.
Slide 20:Experimental Analysis
Slide 21:Experimental Analysis
Slide 22:Experimental Analysis
Slide 23:Conclusion on RSCF
Search engine adaptation
The RSCF algorithm: trains on clickthrough data; applies RSVM in the co-training framework
The RSCF-based metasearch engine: offline experiments better than RSVM; online experiments better than Google
Note: I changed some grammatical errors.
Slide 24:Current Development
Feature extraction and division
Apply in different domains
Search engine personalization
SpyNoby Project: a personalized search engine with clickthrough analysis
Slide 25:Modified Target Ranking for Metasearch Engines
If l1 and l7 are from the same underlying search engine, the preference pairs set arising from l1 should be l1 <r l2, l1 <r l3, l1 <r l4, l1 <r l5, l1 <r l6
Advantages: alleviates the penalty on high-ranked links; gives more credit to the ranking ability of the underlying search engines
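A simplified sketch of the modified pair generation described above: a clicked link is preferred over the unclicked links that follow it, stopping at the next clicked link. This is my reading of the slide's example (it treats all links as coming from one engine, ignoring the same-engine condition), so names and details are illustrative:

```python
def modified_pairs(ranking, clicks):
    """Modified target ranking: each clicked link is preferred over the
    unclicked links below it, up to (not including) the next clicked link."""
    pairs = []
    for i, li in enumerate(ranking):
        if li in clicks:
            for lj in ranking[i + 1:]:
                if lj in clicks:
                    break                 # stop at the next clicked link
                pairs.append((li, lj))    # li <r lj
    return pairs

ranking = [f"l{i}" for i in range(1, 11)]
pairs = modified_pairs(ranking, {"l1", "l7", "l10"})
```

For the slide's example this yields exactly l1 <r l2 through l1 <r l6 from l1, so the high-ranked clicked link l1 is no longer penalized by the later clicks on l7 and l10.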
Labelled data set: l1, l2, ..., l10. Unlabelled data set: l11, l12, ...
Slide 26:Modified Target Ranking
Slide 27:RSCF-based Metasearch Engine - MEA
Diagram: the user sends query q to MEA, which forwards it to its underlying search engines (top 30 results each) and presents a unified ranking result
Slide 28:RSCF-based Metasearch Engine - MEB
Diagram: the user sends query q to MEB, which forwards it to its underlying search engines (top 30 results each) and presents a unified ranking result
Slide 29:Generating Clickthrough Data
Probability of being clicked on: Pr(k) = (1 / k^α) / H(n,α), where
k: the ranking of the link in the metasearch engine
n: the number of all the links in the metasearch engine
α: the skewness parameter in Zipf's law
Harmonic number: H(n,α) = Σ_{i=1}^{n} 1/i^α
Judge the links' relevance manually:
If the link is irrelevant -> it is not clicked on
If the link is relevant -> it has probability Pr(k) of being clicked on
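The Zipf click model above can be sketched directly; the formula is my reconstruction of the garbled slide (the standard Zipf distribution, normalized by the generalized harmonic number):

```python
def harmonic(n, alpha):
    """Generalized harmonic number H(n, alpha) = sum_{i=1}^{n} 1/i^alpha."""
    return sum(1.0 / i ** alpha for i in range((1), n + 1))

def click_probability(k, n, alpha=1.0):
    """Pr(k): probability that the link at rank k is clicked, under Zipf's
    law with skewness alpha, among n links in the metasearch engine."""
    return (1.0 / k ** alpha) / harmonic(n, alpha)

# Click probabilities for a 30-link result list.
probs = [click_probability(k, 30) for k in range(1, 31)]
```

By construction the probabilities sum to one and decay with rank, matching the intuition that users click high-ranked links far more often.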
Slide 30:Feature Extraction
Ranking Features (binary features)
Rank(E,T): whether the link is ranked within S_T in E, where E ∈ {G,M,W,O} and T ∈ {1,3,5,10,15,20,25,30}
S1={1}, S3={2,3}, S5={4,5}, S10={6,7,8,9,10} (G: Google, M: MSNsearch, W: Wisenut, O: Overture)
Indicate the ranking of the links in each underlying search engine
Similarity Features (4 features)
Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a)
Measure the similarity between the query and the link
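The binary ranking features can be sketched as bucket membership tests. The slide only lists S1..S10; the buckets for T = 15..30 below follow the same pattern by assumption:

```python
# Rank buckets S_T: the slide gives S1, S3, S5, S10; S15..S30 are assumed
# to continue the pattern (each covering the next block of ranks).
S = {
    1: {1}, 3: {2, 3}, 5: {4, 5}, 10: set(range(6, 11)),
    15: set(range(11, 16)), 20: set(range(16, 21)),
    25: set(range(21, 26)), 30: set(range(26, 31)),
}

def rank_feature(rank, T):
    """Rank(E,T) for one engine E: 1 if the link's rank in E lies in S_T."""
    return 1 if rank in S[T] else 0

def rank_features(rank):
    """The full binary feature vector for one engine, over all T."""
    return [rank_feature(rank, T) for T in sorted(S)]
```

Exactly one feature fires per engine, encoding the link's rank bucket; concatenating the vectors for G, M, W, O gives the ranking part of the feature vector.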
Slide 31:Experiments
Experiment data: three different domains: CS terms, News, E-shopping
Objectives: prediction error (better than RSVM); top-k precision (adaptation ability)
Slide 32:Top-k Precision
Advantages: precision is easier to obtain than recall; users care only about the top-k links (k=10)
Evaluation data: 30 queries in each domain
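Top-k precision is simple to compute, which is the advantage claimed above: it needs only relevance judgments for the first k returned links, not the full set of relevant documents that recall would require. A minimal sketch (names are mine):

```python
def top_k_precision(returned, relevant, k=10):
    """Fraction of the top-k returned links that are judged relevant."""
    top = returned[:k]
    return sum(1 for link in top if link in relevant) / len(top)

# Hypothetical judgment: 4 of the top 10 links are relevant.
returned = [f"l{i}" for i in range(1, 31)]
relevant = {"l1", "l2", "l4", "l9", "l12"}
p_at_10 = top_k_precision(returned, relevant, k=10)
```

Note that l12 is relevant but outside the top 10, so it does not affect p@10; that is exactly the information recall would need and precision does not.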
Slide 33:Comparison of Top-k precision
CS terms News E-shopping
Slide 34:Statistical Analysis
Hypothesis testing (two-sample hypothesis testing about means): used to analyse whether there is a statistically significant difference between the means of two samples
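The two-sample test statistic can be computed directly. This is a standard pooled-variance t statistic, shown as an illustration of the method named on the slide; the precision numbers are made up:

```python
from statistics import mean, variance
from math import sqrt

def two_sample_t(xs, ys):
    """Two-sample t statistic about means, with pooled (equal) variance."""
    nx, ny = len(xs), len(ys)
    pooled = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / sqrt(pooled * (1 / nx + 1 / ny))

# Hypothetical per-query top-10 precisions for two engines.
engine_a = [0.8, 0.7, 0.9, 0.8, 0.75]
engine_b = [0.6, 0.65, 0.55, 0.7, 0.6]
t = two_sample_t(engine_a, engine_b)
```

A large positive t (compared against the t distribution with nx + ny - 2 degrees of freedom) supports the claim that engine A's mean precision is significantly higher.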
Slide 35:Comparison Results
MEA can produce better search quality than Google
Google does not excel in every query category
MEA and MEB are able to adapt to bring out the strengths of each underlying search engine
MEA and MEB are better than, or comparable to, all their underlying search engine components in every query category
The RSCF-based metasearch engine: comparison of prediction error (better than RSVM); comparison of top-k precision (adaptation ability)
Slide 36:Spy Naïve Bayes: Motivation
The problem with Joachims' method: strong assumptions; it excessively penalizes high-ranked links (l1, l2, l3 are apt to appear on the right, while l7, l10 appear on the left)
New interpretation of clickthrough data: clicked -> positive (P); unclicked -> unlabelled (U), containing both positive and negative samples
Goal: identify Reliable Negatives (RN) from U, giving pairs lp <r ln
Note: Strong assumptions may not be necessary. What are the symbols under the table?
Slide 37:Spy Naïve Bayes: Ideas
Standard naïve Bayes: classifies positive and negative samples
One-step Spy Naïve Bayes: spying out RN from U
Put a small number of positive samples into U to act as spies (to scout the behaviour of real positive samples in U)
Take U as negative samples to train a naïve Bayes classifier
Samples with lower probabilities of being positive are assigned to RN
Voting procedure: makes spying more robust
Run one-step SpyNB n times to get n sets RN_i
A sample appearing in at least m (m < n) of the sets RN_i appears in the final RN
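The spying and voting ideas can be sketched with the classifier abstracted away (here a plain probability function stands in for the trained naïve Bayes, and spies are chosen deterministically rather than sampled, purely for illustration):

```python
from collections import Counter

def one_step_spy(P, U, prob_positive, n_spies=2):
    """One-step spying: plant spies from P into U, then take unlabelled
    samples that look even less positive than the least confident spy
    as reliable negatives. prob_positive(s) -> P(positive | s)."""
    spies = P[:n_spies]  # simplification: SpyNB samples spies randomly
    threshold = min(prob_positive(s) for s in spies)
    return [u for u in U if prob_positive(u) < threshold]

def vote(rn_sets, m):
    """Voting: keep a sample in the final RN only if it appears in at
    least m of the n candidate RN sets."""
    counts = Counter(u for rn in rn_sets for u in rn)
    return {u for u, c in counts.items() if c >= m}

# Hypothetical classifier outputs for positives p1..p5 and unlabelled u1..u4.
P = ["p1", "p2", "p3", "p4", "p5"]
U = ["u1", "u2", "u3", "u4"]
scores = {"p1": 0.9, "p2": 0.8, "p3": 0.85, "p4": 0.7, "p5": 0.95,
          "u1": 0.6, "u2": 0.3, "u3": 0.85, "u4": 0.2}
RN = one_step_spy(P, U, scores.get)
final_rn = vote([["u1", "u2"], ["u2", "u4"], ["u2"]], m=2)
```

Here u3 scores like a spy, so it is spared; the clearly low-scoring u1, u2, u4 become reliable negatives, and voting then keeps only samples that are negatives across most runs.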
Slide 38:http://dleecpu1.cs.ust.hk:8080/SpyNoby/
Slide 39:My publications
Wilfred NG. Book Review: An Introduction to Search Engines and Web Navigation. An International Journal of Information Processing & Management, 43(1), pages 290-292, (2007).
Wilfred NG, Lin DENG and Dik-Lun LEE. Spying Out Real User Preferences in Web Searching. Accepted and to appear: ACM Transactions on Internet Technology, (2006).
Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal - Special Issue on Web Dynamics, (2005).
Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language For Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes in Computer Science Vol. 3288, Shanghai, China, pages 567-581, (2004).
Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG, Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004).
Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes in Computer Science Vol. 2973, Jeju Island, Korea, pages 519-532, (2004).
Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pages 236-241, (2003).
Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A Book Chapter in "Semantic Issues in E-Commerce Systems", Edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pages 155-170, (2003).