Relevance Propagation for Web Search Dr. Tie-Yan Liu Web Search and Mining Group Microsoft Research Asia Joint Work with Tao Qin, Tsinghua University.
Outline • Introduction • Generic framework for relevance propagation • Evaluations • Effectiveness analysis • Complexity analysis • Conclusions DCWC 2006
Introduction • Web Search ≠ Information Retrieval • Besides content relevance, various kinds of structural information also play an important role in Web search • Hyperlink graph • Local sitemap • Webpage layout
Introduction • Three ways of utilizing the structure information for Web search • Linear combination of content relevance and importance scores computed from hyperlink graph • β∙Relevance + (1-β)∙PageRank • Enhance link analysis with the help of content relevance • Query-dependent link graph in HITS • Topic-sensitive PageRank • Propagate content relevance along the Web structure • The use of anchor text in Search Engines • Hyperlink-based relevance score propagation (TREC 2003) • Sitemap-based feature propagation (TREC 2004)
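The linear combination β∙Relevance + (1-β)∙PageRank from the first bullet can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `combine_scores`, the default `beta`, and the assumption that both score sets are already normalized to comparable ranges are all hypothetical.

```python
def combine_scores(relevance, pagerank, beta=0.8):
    """Return beta * relevance + (1 - beta) * pagerank for every scored page.

    Assumption: both inputs map page ids to scores on comparable scales.
    Pages without a PageRank entry are treated as having importance 0.0.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return {
        page: beta * rel + (1.0 - beta) * pagerank.get(page, 0.0)
        for page, rel in relevance.items()
    }


if __name__ == "__main__":
    relevance = {"a": 1.0, "b": 0.5}   # hypothetical content relevance scores
    pagerank = {"a": 0.0, "b": 1.0}    # hypothetical importance scores
    print(combine_scores(relevance, pagerank, beta=0.5))
```

With β close to 1 the ranking follows content relevance; with β close to 0 it follows link-based importance.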
Hyperlink-based Relevance Score Propagation (Zhai et al, TREC 2003) • Assumption • Hyperlinked pages have correlated content • Propagation model • Terms: original relevance score, propagation from the inlinks, propagation from the outlinks • Weighted inlink model • Weighted outlink model • Uniform outlink model
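The slide's formula did not survive, so the following is only a sketch of what an inlink-style score propagation could look like, under stated assumptions: each page keeps a share `alpha` of its original relevance score and receives the rest as the plain average of the current scores of pages linking to it. The name `propagate_inlink`, the parameters `alpha` and `iterations`, and the uniform averaging (standing in for the unspecified link weights) are all hypothetical.

```python
def propagate_inlink(scores, inlinks, alpha=0.5, iterations=3):
    """Iteratively mix each page's original score with its inlinkers' scores.

    scores:  page id -> original relevance score (the working set)
    inlinks: page id -> list of page ids that link to it
    Assumption: inlink weights are uniform; links from outside the
    working set are ignored.
    """
    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for page, original in scores.items():
            sources = [q for q in inlinks.get(page, []) if q in scores]
            if sources:
                received = sum(current[q] for q in sources) / len(sources)
            else:
                received = 0.0  # nothing to receive without inlinks
            updated[page] = alpha * original + (1.0 - alpha) * received
        current = updated
    return current


if __name__ == "__main__":
    scores = {"a": 1.0, "b": 0.0}
    inlinks = {"b": ["a"]}  # page a links to page b
    print(propagate_inlink(scores, inlinks, alpha=0.5, iterations=1))
```

An outlink variant would follow the same loop with a map from each page to the pages it links to.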
Sitemap-based Feature Propagation (Liu and Qin, TREC2004) • Assumption • Child pages are extensions of their parent page • One should consider the contribution of the child pages while computing the relevance of the parent page to a query. • Propagation model
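The propagation formula itself is missing from this transcript. As a hedged sketch of the idea that child pages contribute to their parent, the code below adds a fraction `alpha` of the children's average term frequency to the parent's own term frequency. The function name `propagate_features`, the weight `alpha`, and the choice of term frequency as the propagated feature are assumptions, not the formula from the slide.

```python
def propagate_features(term_freq, children, alpha=0.5):
    """Add a share of the children's mean term frequency to each parent.

    term_freq: page id -> {term: frequency}
    children:  page id -> list of child page ids in the sitemap
    Assumption: one propagation step from direct children only.
    """
    propagated = {}
    for page, freqs in term_freq.items():
        kids = [c for c in children.get(page, []) if c in term_freq]
        new_freqs = dict(freqs)
        if kids:
            # Collect every term seen on the parent or on any child.
            terms = set(freqs)
            for kid in kids:
                terms |= set(term_freq[kid])
            for term in terms:
                child_mean = sum(term_freq[k].get(term, 0.0) for k in kids) / len(kids)
                new_freqs[term] = freqs.get(term, 0.0) + alpha * child_mean
        propagated[page] = new_freqs
    return propagated


if __name__ == "__main__":
    term_freq = {"root": {"web": 2.0}, "c1": {"web": 4.0}, "c2": {"search": 2.0}}
    children = {"root": ["c1", "c2"]}
    print(propagate_features(term_freq, children)["root"])
```

Because the propagated features do not depend on the query, the ranking function can then be applied to them unchanged.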
Generic Relevance Propagation Framework • Modification of the sitemap-based feature propagation model • Reminder of the hyperlink-based propagation model • A generic framework to cover both hyperlink-based and sitemap-based propagations
More Derived Propagation Models • Hyperlink-based Feature Propagation Model • Weighted inlink model • Weighted outlink model • Uniform outlink model • Sitemap-based Score Propagation Model
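Summary: All Models Covered by the Generic Framework • Sitemap-based: score-level (SS) and feature-level (SF) • Hyperlink-based: score-level (HS) and feature-level (HF), each with weighted inlink (WI), weighted outlink (WO), and uniform outlink (UO) variants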
Benchmark Datasets • Corpora • .GOV: 1M pages; queries: TD 2003, 2004 • MSN: 2M pages; queries: 100 most popular queries from MSN query log • Base ranking function • BM2500
Experimental Results (1) TREC 2003
Experimental Results (2) TREC 2004
Experimental Results (3) MSN
Conclusions on Effectiveness • In general, relevance propagation can boost search performance with proper parameter settings. • The sitemap-based models are more effective than the hyperlink-based models. • Hyperlinks ≠ content correlation, whereas pages in the same sub-site usually discuss correlated topics. • Detailed comparisons • The two sitemap-based models have similar performance. • Among the hyperlink-based models, the HF-WI model performs best.
Online Complexity • w is the size of the working set, q is the number of query terms, l is the average number of inlinks or outlinks per page, and t is the number of iterations. • For the SS model, the complexity is O(w). • The SS model needs to propagate the relevance score of a page to its parent only once if the propagation proceeds bottom-up from the leaf nodes. • For the SF model, the complexity is O(qw). • For the HS models, the complexity is O(twl). • In each of the t iterations, the HS models propagate the relevance score of a page along its inlinks or outlinks in the subgraph of the working set. • For the HF models, the complexity is O(tqwl).
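The O(w) bound for the SS model rests on visiting each page once, children before parents. The sketch below illustrates that traversal under assumptions that are not from the slides: the sitemap is a forest given as a child-to-parent map, and a parent's score becomes its own score plus `alpha` times the mean of its children's already-propagated scores. The names `propagate_bottom_up` and `alpha` are hypothetical.

```python
def propagate_bottom_up(scores, parent, alpha=0.5):
    """Propagate relevance scores from leaves to roots in one pass.

    scores: page id -> relevance score (the working set, size w)
    parent: page id -> parent page id (roots are absent or map to None)
    Assumption: the parent map forms a forest (no cycles).
    """
    children = {page: [] for page in scores}
    roots = []
    for page in scores:
        par = parent.get(page)
        if par in children:
            children[par].append(page)
        else:
            roots.append(page)  # parent missing or outside the working set

    # Pre-order walk: every page is appended before its descendants.
    order = []
    stack = list(roots)
    while stack:
        page = stack.pop()
        order.append(page)
        stack.extend(children[page])

    # Reversed pre-order visits children before parents, each page once.
    result = {}
    for page in reversed(order):
        kids = children[page]
        child_mean = sum(result[k] for k in kids) / len(kids) if kids else 0.0
        result[page] = scores[page] + alpha * child_mean
    return result


if __name__ == "__main__":
    scores = {"root": 1.0, "a": 2.0, "b": 4.0, "leaf": 8.0}
    parent = {"a": "root", "b": "root", "leaf": "a"}
    print(propagate_bottom_up(scores, parent))
```

Each page is pushed, popped, and scored exactly once, and each parent-child edge is read once, so the work is linear in w. The hyperlink-based models lack this ordering, which is why they iterate t times over roughly l links per page.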
Online Complexity • The sitemap-based models are more efficient than the hyperlink-based models • The score-level propagation models are faster than feature-level models
Offline Complexity • Score-level propagation is very difficult to implement offline • The score can only be computed online with respect to the query. • For feature-level propagation: • The time complexity of the SF model for offline implementation is acceptable • 62.2 hours, or 2.6 days, to re-index 8 billion pages • The time complexity of the HF model is unacceptable • 1083 hours, or 45 days, to re-index 8 billion pages • The SF model is easy to parallelize, while parallelizing the HF model is non-trivial
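The hour-to-day conversions and the gap between the two models can be checked with a few lines of arithmetic. Only the two hour figures come from the slide; the ratio is derived here, and the names are illustrative.

```python
def hours_to_days(hours):
    """Convert a duration in hours to days."""
    return hours / 24.0


SF_HOURS = 62.2    # reported offline time for the SF model
HF_HOURS = 1083.0  # reported offline time for the HF model

sf_days = hours_to_days(SF_HOURS)
hf_days = hours_to_days(HF_HOURS)
slowdown = HF_HOURS / SF_HOURS  # how many times longer HF takes than SF

print(f"SF: {sf_days:.1f} days, HF: {hf_days:.1f} days, HF/SF: {slowdown:.1f}x")
```

On these figures the HF model takes roughly 17 times as long as the SF model to re-index the same collection.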
Conclusions of this Study • In general, relevance propagation can boost the performance of web information retrieval. • Sitemap-based propagation models outperform hyperlink-based propagation models in both effectiveness and efficiency. Notably, sitemap-based propagation can be implemented in parallel. • Score-level propagation and feature-level propagation have similar effectiveness. Although the former is more efficient online, it is not practical for real-world search engines because it cannot be implemented offline. • Overall, the sitemap-based feature propagation model is the best choice for real search engines.
Thanks! tyliu@microsoft.com http://research.microsoft.com/users/tyliu/