A Generalized Co-HITS Algorithm and Its Application to Bipartite Graphs. Hongbo Deng, Michael R. Lyu and Irwin King. Department of Computer Science and Engineering, The Chinese University of Hong Kong. July 1st, 2009.
Introduction • Many data can be modeled as bipartite graphs • Link Analysis (for Graph): HITS, PageRank, etc. • IR Models (for Content): VSM, Language Model, etc. • Content captures relevance; the graph captures semantic relations • Incorporate Content with Graph: Personalized PageRank (PPR), Linear Combination, etc.
An Illustration • Query suggestion for query “map” • [Figure: ranked suggestion lists labeled HITS, PPR, and “More reasonable”; suggestions shown include “mapquest”, “map quest”, “mapquest.com”, “google”, “google.com”, “weather”, “united states map”, “map of florida”, “us map”, and “world map”] • Issues: noisy link data; lack of relevance constraints
Outline • Introduction • Generalized Co-HITS • Preliminaries • Iterative Framework • Regularization Framework • Experiments • Conclusion
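Preliminaries • A bipartite graph whose two sides carry content X and Y • Explicit links: edges between the two sides • Hidden links: links among vertices on the same side, induced through shared neighbors on the other side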
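The formulas for the explicit and hidden links did not survive on this slide. As a minimal sketch, assuming that a hidden link between two vertices on the same side is weighted by the two-step transition probability through the other side (an assumption, not a formula taken from the slide), the hidden-link matrix Wuu could be computed as follows. The function name `hidden_links` and the list-of-lists layout are hypothetical.

```python
def hidden_links(w_uv, w_vu):
    """Two-step transition weights among the U-side vertices.

    Assumption: w_uv[i][k] is the transition probability from u_i to v_k,
    and w_vu[k][j] is the transition probability from v_k to u_j.
    The result is the matrix product W_uv * W_vu.
    """
    n_u = len(w_uv)
    n_v = len(w_vu)
    return [
        [sum(w_uv[i][k] * w_vu[k][j] for k in range(n_v)) for j in range(n_u)]
        for i in range(n_u)
    ]


# Two U-side vertices that both link to a single V-side vertex.
w_uv = [[1.0], [1.0]]
w_vu = [[0.5, 0.5]]
print(hidden_links(w_uv, w_vu))
```

Swapping the two arguments gives the corresponding matrix for the other side (Wvv) under the same assumption.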
Generalized Co-HITS • Basic idea • Incorporate the bipartite graph with the content information from both sides • Initialize the vertices with the relevance scores x0, y0 (initial scores) • Propagate the scores (score propagation by mutual reinforcement)
Generalized Co-HITS • Iterative framework
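The update equations for this slide were not preserved. The sketch below assumes a common form of the iterative framework: each score is a mix of its initial relevance score and the scores propagated from the other side, controlled by two parameters. The names `co_hits`, `lam_u`, and `lam_v`, and the synchronous update order, are assumptions for illustration.

```python
def co_hits(w_uv, w_vu, x0, y0, lam_u, lam_v, iters=100):
    """Iteratively mix initial scores with scores propagated across the graph.

    Assumed update rules (not taken verbatim from the slide):
      x_i = (1 - lam_u) * x0_i + lam_u * sum_k w_vu[k][i] * y_k
      y_k = (1 - lam_v) * y0_k + lam_v * sum_j w_uv[j][k] * x_j
    w_uv[j][k]: transition probability from u_j to v_k.
    w_vu[k][i]: transition probability from v_k to u_i.
    """
    x, y = list(x0), list(y0)
    for _ in range(iters):
        # Both sides are updated from the previous iteration's scores.
        new_x = [
            (1 - lam_u) * x0[i]
            + lam_u * sum(w_vu[k][i] * y[k] for k in range(len(y)))
            for i in range(len(x))
        ]
        new_y = [
            (1 - lam_v) * y0[k]
            + lam_v * sum(w_uv[j][k] * x[j] for j in range(len(x)))
            for k in range(len(y))
        ]
        x, y = new_x, new_y
    return x, y


# One vertex on each side, linked to each other.
x, y = co_hits([[1.0]], [[1.0]], x0=[1.0], y0=[0.0], lam_u=0.5, lam_v=0.5)
print(x, y)
```

With both parameters set to 0 the initial scores are returned unchanged; larger values give more weight to the scores propagated over the graph.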
Iterative Regularization Framework • Consider the vertices on one side • Cost function terms: smoothness; fit to the initial scores
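The cost function itself is missing from this slide. One standard way to combine a smoothness term with a fit-to-initial-scores term on a single side is the graph-regularization form below. It is offered as an assumption about the general shape, not as the slide's exact formula. Here $w^{uu}_{ij}$ is a symmetric hidden-link weight, $d_{ii} = \sum_j w^{uu}_{ij}$, $f^0_i$ is the initial score, and $\mu > 0$ is a hypothetical trade-off parameter.

```latex
R(f) \;=\;
\underbrace{\frac{1}{2}\sum_{i,j} w^{uu}_{ij}
  \left(\frac{f_i}{\sqrt{d_{ii}}} - \frac{f_j}{\sqrt{d_{jj}}}\right)^{2}}_{\text{smoothness}}
\;+\;
\underbrace{\mu \sum_{i} \left(f_i - f^{0}_{i}\right)^{2}}_{\text{fit to initial scores}}
```

The first term is small when strongly linked vertices have similar normalized scores; the second penalizes drifting away from the initial relevance scores.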
Generalized Co-HITS • Regularization Framework • [Figure: cost terms R1, R2, and R3, with the graphs Wuu and Wvv] • Intuition: highly connected vertices are most likely to have similar relevance scores.
Generalized Co-HITS • Regularization Framework • The cost function • Optimization problem • Solution
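The cost function, optimization problem, and solution shown on this slide were lost. As a simplified sketch covering one side only (not the double-sided cost the talk develops), assume the cost is a smoothness term f^T (I - S) f plus mu times the squared distance to the initial scores f0, where S = D^(-1/2) W D^(-1/2) for a symmetric weight matrix W. Setting the gradient to zero gives f = alpha * S * f + (1 - alpha) * f0 with alpha = 1 / (1 + mu), which can be solved by fixed-point iteration. The function name `regularize` and the parameter `mu` are hypothetical.

```python
import math


def regularize(w, f0, mu, iters=200):
    """Minimize f^T (I - S) f + mu * ||f - f0||^2 by fixed-point iteration.

    Assumptions: w is a symmetric, non-negative weight matrix (list of lists)
    and mu > 0. This is a single-sided simplification for illustration.
    """
    n = len(w)
    d = [sum(row) for row in w]
    # Symmetrically normalized weights; zero entries stay zero.
    s = [
        [w[i][j] / math.sqrt(d[i] * d[j]) if w[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    alpha = 1.0 / (1.0 + mu)
    f = list(f0)
    for _ in range(iters):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * f0[i]
            for i in range(n)
        ]
    return f


# Two linked vertices; only the first has a nonzero initial score.
print(regularize([[0.0, 1.0], [1.0, 0.0]], [1.0, 0.0], mu=1.0))
```

The iteration converges because alpha is below 1; the same fixed point could also be obtained in closed form by solving the linear system (I - alpha * S) f = (1 - alpha) * f0.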
Application to Query-URL Bipartite Graphs • Bipartite graph construction • Edge weighted by the click frequency • Normalize to obtain the transition matrix • Overall Algorithm
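The construction steps above can be sketched as follows, assuming the click log has already been aggregated into (query, URL) click counts. The function name `build_transitions`, the dictionary layout, and the sample counts are hypothetical and only illustrate the normalization step.

```python
from collections import defaultdict


def build_transitions(clicks):
    """Turn click counts into transition probabilities in both directions.

    clicks: dict mapping (query, url) -> click frequency (hypothetical layout).
    Returns (w_qu, w_uq), where w_qu[q][u] is the probability of moving from
    query q to URL u, and w_uq[u][q] is the probability of moving from u to q.
    """
    q_total = defaultdict(float)
    u_total = defaultdict(float)
    for (q, u), c in clicks.items():
        q_total[q] += c
        u_total[u] += c

    w_qu = defaultdict(dict)
    w_uq = defaultdict(dict)
    for (q, u), c in clicks.items():
        w_qu[q][u] = c / q_total[q]  # each query's outgoing weights sum to 1
        w_uq[u][q] = c / u_total[u]  # each URL's outgoing weights sum to 1
    return dict(w_qu), dict(w_uq)


# Made-up click counts for illustration only.
clicks = {
    ("map", "mapquest.com"): 3,
    ("map", "maps.google.com"): 1,
    ("mapquest", "mapquest.com"): 6,
}
w_qu, w_uq = build_transitions(clicks)
print(w_qu["map"])
print(w_uq["mapquest.com"])
```

A sparse dictionary layout is used here because each query is typically linked to only a few URLs.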
Outline • Introduction • Preliminaries • Generalized Co-HITS • Iterative Framework • Regularization Framework • Experiments • Conclusion
Experimental Evaluation • Data collection • AOL query log data • Cleaning the data • Removing queries that appear fewer than 2 times • Combining near-duplicate queries • 883,913 queries and 967,174 URLs • 4,900,387 edges • 250,127 unique terms
Evaluation: ODP Similarity • A simple measure of similarity among queries using ODP categories (query → category) • Definition and example: • Q1: “United States” → “Regional > North America > United States” • Q2: “National Parks” → “Regional > North America > United States > Travel and Tourism > National Parks and Monuments” • The two category paths share 3 of 5 levels, giving a similarity of 3/5 • Precision at rank n (P@n) • 300 distinct queries
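The definition on this slide was not preserved, but the example (3/5) is consistent with dividing the length of the shared category prefix by the length of the longer path. The sketch below implements that reading as an assumption. The function name `odp_similarity` and the " > " separator handling are hypothetical.

```python
def odp_similarity(path_a, path_b, sep=" > "):
    """Shared-prefix length divided by the longer path length (assumed definition)."""
    a = path_a.split(sep)
    b = path_b.split(sep)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


q1 = "Regional > North America > United States"
q2 = ("Regional > North America > United States > "
      "Travel and Tourism > National Parks and Monuments")
print(odp_similarity(q1, q2))  # 0.6, i.e. 3/5
```

Under this reading, identical category paths score 1 and paths with different top-level categories score 0.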
Experimental Results • Comparison of Iterative Framework: personalized PageRank (PPR), one-step propagation (OSP), general Co-HITS (CoIter) • Result 1: The improvements of OSP and CoIter over the baseline (the dashed line) compare favorably with those of PPR. The initial relevance scores from both sides provide valuable information.
Experimental Results • Comparison of Regularization Framework: single-sided regularization (SiRegu), double-sided regularization (CoRegu) • Result 2: SiRegu improves performance over the baseline. CoRegu performs better than SiRegu, owing to the newly developed cost function R3. Moreover, CoRegu is relatively robust.
Experimental Results • Detailed Results • Result 3: CoRegu-0.5 achieves the best performance. Considering the double-sided regularization framework is essential for the bipartite graph.
Conclusions • We propose the Co-HITS algorithm to incorporate the bipartite graph with the content information from both sides. • The Co-HITS algorithm is general: it includes HITS and personalized PageRank as special cases. • CoRegu, with its newly developed cost function, is more robust and achieves the best performance, with consistent improvements.
Q&A Thanks!