
Local Approximation of PageRank and Reverse PageRank

Li-Tal Mashiach. Advisor: Dr. Ziv Bar-Yossef. 13/03/08. Topics: review of PageRank; local PageRank approximation (algorithm, lower bounds); PageRank vs. Reverse PageRank; applications of Reverse PageRank.


Presentation Transcript


  1. Local Approximation of PageRank and Reverse PageRank. Li-Tal Mashiach. Advisor: Dr. Ziv Bar-Yossef. 13/03/08

  2. Overview: • Review of PageRank • Local PageRank approximation (algorithm, lower bounds) • PageRank vs. Reverse PageRank • Applications of Reverse PageRank

  3. PageRank: Most search engines analyze the hyperlink structure of the web to order search results • PageRank is an important ranking measure for all major search engines

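  4. Review of PageRank: The PageRank of a page is a base rank plus the sum of its in-neighbors' ranks, where each in-neighbor's rank is divided among all of its out-neighbors and the sum is scaled by the damping factor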
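
The formula that these labels annotated did not survive in the transcript. A common way to write it, given here as an assumption rather than as the slide's exact notation, uses $\alpha$ for the damping factor, $n$ for the number of pages, and $\mathrm{outdeg}(v)$ for the number of out-links of $v$:

```latex
\[
  PR(u) \;=\;
  \underbrace{\frac{1-\alpha}{n}}_{\text{base rank}}
  \;+\;
  \underbrace{\alpha}_{\text{damping factor}}
  \sum_{v \,:\, v \to u}
  \frac{PR(v)}{\mathrm{outdeg}(v)}
\]
```

The sum runs over the in-neighbors $v$ of $u$, and the division by $\mathrm{outdeg}(v)$ is the "rank divided among all out-neighbors" part.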

  5. PageRank as a Random Walk: A random surfer is visiting the web: • With probability equal to the damping factor, the surfer follows a random out-link • With the complementary probability, the surfer jumps to a random web page

  6. Global PageRank Computation: Run the power method: • Initialize the rank vector • Repeat the update until convergence • Challenges: • Holding the whole web graph • Multiplying a matrix by a vector
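
The power method on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the damping value 0.85, the uniform initialization, the L1 stopping rule, and the uniform treatment of pages without out-links are all assumptions, and the function name `pagerank` is hypothetical.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power-method PageRank on a small in-memory graph.

    out_links maps every node to the list of its out-neighbors.
    alpha is the damping factor (0.85 is an assumed default).
    """
    nodes = list(out_links)
    n = len(nodes)
    # Initialize: uniform distribution over all pages (assumption).
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Every page starts each round with the base rank (1 - alpha) / n.
        new = {u: (1.0 - alpha) / n for u in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # The rank of v is divided among all of its out-neighbors.
                share = alpha * rank[v] / len(targets)
                for u in targets:
                    new[u] += share
            else:
                # Page without out-links: spread its rank uniformly (assumption).
                share = alpha * rank[v] / n
                for u in nodes:
                    new[u] += share
        # Stop when the rank vector changes by less than tol in L1 norm.
        diff = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if diff < tol:
            break
    return rank
```

Each round touches every edge of the graph, which is the matrix-by-vector multiplication the slide lists as a challenge at web scale.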

  7. Local PR Approximation: Global PR computes the PR of all pages, but sometimes we are interested in the PR of only a small number of pages: • A person interested in the PR of their homepage • An online business interested in the PR of its own website and its competitors' websites. Do we need to compute the PR of the whole graph for that?

  8. Problem Statement [Chen, Gan, Suel, 2004]: • Given: local access to a directed graph G and a target node u • Output: PR(u) • Local access: queries to a link server • Cost: number of queries to the link server

  9. Overview: • Review of PageRank • Local PageRank approximation (algorithm, lower bounds) • PageRank vs. Reverse PageRank • Applications of Reverse PageRank

  10. Another Characterization of PR [Jeh, Widom, 2003]: inf_t(v,u) is the fraction of the PR score of v that flows to u on paths of length t (figure: nodes v, u1, u2, and u)

  11. Another Characterization of PR [Jeh, Widom, 2003], cont.: PR_r(u) is the PR score that flows into u from nodes at distance at most r from u. Theorem: PR_r(u) converges to PR(u) as r grows (figure: nodes v, u1, u2, and u)
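
The formulas on these two slides were lost in extraction. One standard way to write the characterization, stated here as an assumption rather than as the slides' exact notation, uses $\alpha$ for the damping factor, $n$ for the number of nodes, and keeps the damping outside the influence term (conventions differ on whether $\alpha^t$ is folded into $\mathrm{inf}_t$):

```latex
\begin{align*}
  \mathrm{inf}_t(v,u) &= \sum_{\substack{p = (v = w_0, w_1, \dots, w_t = u)\\ \text{path of length } t}}
      \; \prod_{i=0}^{t-1} \frac{1}{\mathrm{outdeg}(w_i)} \\[4pt]
  PR_r(u) &= \frac{1-\alpha}{n} \sum_{t=0}^{r} \alpha^{t} \sum_{v} \mathrm{inf}_t(v,u) \\[4pt]
  PR(u) &= \lim_{r \to \infty} PR_r(u)
\end{align*}
```

With $\mathrm{inf}_0(u,u) = 1$, the $t = 0$ term is the base rank $(1-\alpha)/n$, and each further term adds the score arriving over paths that are one step longer. The identity assumes every node has at least one out-link.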

  12. Local PR Brute Force Algorithm [Chen, Gan, Suel, 2004]: • Goal: compute PR_r(u) for a sufficiently large r • Algorithm: • Crawl backwards the subgraph of radius r around u • For each node v at layer t, compute inf_t(v,u) • Sum up the weighted influence values (figure: nodes v, w1, w2, and u)

  13. Local PR Brute Force Algorithm (figure: target node u)

  14. Optimization by Pruning: A heuristic to improve the cost • Prune all nodes whose influence is below some threshold • Shown empirically to sometimes perform better [Chen, Gan, Suel, 2004] (figure: target node u)
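
The brute force algorithm and the pruning heuristic can be sketched together as below. This is an illustrative sketch under stated assumptions, not the implementation of Chen, Gan, and Suel: `in_neighbors` and `out_degree` are hypothetical callbacks standing in for the link server, one query is counted per expanded node, `n` (the total number of nodes) is assumed known, the damping value 0.85 is assumed, and a node is pruned when its damped influence falls below `threshold`.

```python
from collections import defaultdict


def local_pagerank(u, in_neighbors, out_degree, n, r, alpha=0.85, threshold=0.0):
    """Approximate PR_r(u) by crawling backwards from u for r layers.

    in_neighbors(w) -> iterable of nodes linking to w   (hypothetical link-server call)
    out_degree(v)   -> number of out-links of v         (hypothetical link-server call)
    Returns (estimate, number_of_expanded_nodes).
    """
    layer = {u: 1.0}  # layer 0: the influence of u on itself is 1
    total = 0.0
    queries = 0
    for t in range(r + 1):
        # Add the damped influence of every node found at distance t.
        total += (alpha ** t) * sum(layer.values())
        if t == r:
            break
        next_layer = defaultdict(float)
        for w, inf_w in layer.items():
            # Pruning heuristic: do not expand nodes with negligible influence.
            if (alpha ** t) * inf_w < threshold:
                continue
            queries += 1
            for v in in_neighbors(w):
                # v passes a 1/outdeg(v) fraction of its score along the edge v -> w.
                next_layer[v] += inf_w / out_degree(v)
        layer = next_layer
        if not layer:
            break  # nothing left to crawl
    return (1.0 - alpha) / n * total, queries
```

With `threshold=0.0` nothing is pruned and the function is the plain brute force crawl; raising the threshold trades accuracy for fewer link-server queries.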

  15. Analysis of the Algorithm: • The number of queries the algorithm requires is bounded in terms of r and d • r: number of iterations until the PR random walk almost converges • d: maximum in-degree of the graph • With slow PR convergence or high in-degree, the algorithm is not feasible

  16. Limitations of the Algorithm: • The web graph contains many pages with high in-degree • Conclusion: The algorithm is frequently unsuitable for the web graph • Is this a limitation of this specific algorithm only?

  17. Lower Bounds: • Local PR approximation is hard for graphs with: • High in-degree nodes • Slow convergence of the PR random walk

  18. Proof: • By reduction from the OR problem • Input: bits x1, x2, x3, ..., xm • Output: the OR of the input bits • The resulting lower bound on the number of queries holds even for randomized algorithms

  19. The Reduction: • A: algorithm that computes local PR • B: algorithm that computes the OR function (figure: an m-bit input x and the graph G_x built from it, with target node u)

  20. The Reduction, cont.: Claim 1: Let |x| be the number of 1's in x. Then, Claim 2: When , (figure: an m-bit input x and the graph G_x built from it, with target node u)

  21. Proof, cont.: • Given an input x, B simulates A on G_x and u • If PR_x(u) ≥ p1, then OR = 1 • If PR_x(u) ≤ p0, then OR = 0 • Hence the maximum number of queries A uses is at least the number of queries needed to compute OR

  22. Conclusion: Local PageRank approximation is frequently infeasible on the web graph

  23. PageRank vs. Reverse PageRank: • The local approximation algorithm should perform better on the reverse web graph
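
Reverse PageRank (RPR) is PageRank computed on the graph with every edge reversed. A minimal sketch of that definition follows; the function names, the damping value 0.85, the fixed iteration count, and the uniform treatment of nodes without out-links are assumptions made for illustration.

```python
def reverse_graph(out_links):
    """Return the graph with every edge v -> u turned into u -> v."""
    rev = {u: [] for u in out_links}
    for v, targets in out_links.items():
        for u in targets:
            rev.setdefault(u, []).append(v)
    return rev


def pagerank(out_links, alpha=0.85, iterations=100):
    """Plain power-method PageRank with a fixed number of rounds."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - alpha) / n for u in nodes}
        for v in nodes:
            # A node without out-links spreads its rank over all nodes (assumption).
            targets = out_links[v] or nodes
            share = alpha * rank[v] / len(targets)
            for u in targets:
                new[u] += share
        rank = new
    return rank


def reverse_pagerank(out_links, alpha=0.85, iterations=100):
    """RPR of every node: PageRank on the reversed graph."""
    return pagerank(reverse_graph(out_links), alpha, iterations)
```

In the reversed graph the in-degree of a node equals its out-degree in the original graph, which is why the backward crawl of the local algorithm grows differently there.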

  24. Experimental Setup: • 280,000-page crawl of the www.stanford.edu domain • 22,000-page crawl of the www.cnn.com site

  25. Convergence Rate

  26. Crawl Growth Rate: In-deg: 38,606; Out-deg: 255

  27. Performance of the Algorithm

  28. Applications of Reverse PageRank (slide columns: Local RPR app., Novel app.): • TrustRank • Influencers in social networks • Hub web pages • Measuring semantic relatedness • Finding crawl seeds

  29. Influencers in Social Networks: Goal: Market a new product so that it is adopted by a large fraction of a social network. Method: • Initially target a few influential members • Trigger a word-of-mouth process • The process results in adoption by a large number of users. How should we choose these seed members?

  30. Why RPR? [Java et al. 2006]: Nodes with high RPR • have short paths to many other nodes in the network • are frequently the only gateways to these nodes

  31. Influencers in Social Networks

  32. Influencers in Social Networks: www.Livejournal.com, 3.5 million nodes (figure panels: 4-level BFS crawl, 1-level BFS crawl)

  33. Hub Web Pages: Goal: Find good starting points for search when • it is difficult to formulate queries • the search task is broad • the user needs to understand the surrounding context. Method: Find pages with short paths to many relevant pages

  34. Why RPR? [Fogaras, 2003]: High RPR pages tend to have short paths to many authorities

  35. Hub Web Pages: Fraction of hubs in the top 20 results for the queries: 1. “computer scientists” 2. “global warming” 3. “folk dancing” 4. “queen Elizabeth”. Meta-search engine over Yahoo! search

  36. Measuring Semantic Relatedness: Goal: Find the relatedness between two concepts • For natural language processing applications. Method: Use a taxonomy such as the ODP or Wikipedia

  37. Why RPR?: b is a strong sub-concept of a in a taxonomy if • there are many short paths from a to b. RPR: a measure of b as a sub-concept of a. RPR similarity: two concepts are similar if their RPR vectors overlap significantly • measured as the similarity between the vectors RPR_a and RPR_b
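
The slide does not say which vector similarity is used. As one plausible choice, the sketch below compares two RPR vectors with cosine similarity; the choice of cosine, the sparse dictionary representation, and the function name are all assumptions.

```python
import math


def cosine_similarity(rpr_a, rpr_b):
    """Cosine similarity of two sparse vectors given as {node: score} dicts."""
    dot = sum(score * rpr_b.get(node, 0.0) for node, score in rpr_a.items())
    norm_a = math.sqrt(sum(score * score for score in rpr_a.values()))
    norm_b = math.sqrt(sum(score * score for score in rpr_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector overlaps with nothing
    return dot / (norm_a * norm_b)
```

Two concepts whose RPR vectors put their weight on the same sub-concepts score close to 1, while concepts with disjoint sub-concepts score 0.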

  38. Measuring Semantic Relatedness: www.dmoz.org taxonomy, WordSimilarity-353 (figure panels: relatedness to “Einstein” and relatedness to “Computer”; labels include Agriculture, Physics Prize, Newton Isaac, Internet)

  39. Finding Crawl Seeds: Goal: Quickly discover new content on the web while incurring as little overhead as possible • Overhead: old pages / new pages. Method: Find good seeds

  40. Why RPR?: A page p has high RPR if • many pages are reachable from p by short paths • these pages are not reachable from many other pages (figure legend: u is a known page, v is an unknown page)

  41. Finding Crawl Seeds: WebBase project, two crawls of ~1,000,000 pages, one week apart; 4-level BFS crawl (plots: fraction of new pages discovered, overhead)

  42. Summary: • Two graph properties make local PageRank approximation hard • The web graph is not suitable for local PR approximation • The reverse web graph is suitable for local PR approximation • RPR finds nodes that have short paths to many other nodes and are frequently the only gateways to these nodes • Applications of RPR

  43. Thanks!

  44. Appendix

  45. Proof: High In-degree, Deterministic Algorithms: • By reduction from the majority-by-a-margin problem • Input: bits x1, x2, x3, ..., xm • Output: the majority bit • The reduction gives a lower bound on the number of queries needed

  46. The Reduction: • A: algorithm that computes local PR • B: algorithm that computes majority-by-a-margin (figure: an m-bit input x and the graph G_x built from it, with nodes W1, W2, ..., Wm, V1, V2, V3 and target node u)

  47. The Reduction, cont.: Claim 1: Let |x| be the number of 1's in x. Then, Claim 2: When , (figure: an m-bit input x and the graph G_x built from it, with nodes W1, W2, ..., Wm, V1, V2, V3 and target node u)

  48. Proof, cont.: • Given an input x, B simulates A on G_x and u • If PR_x(u) ≥ p1, then the majority bit of x is 1 • If PR_x(u) ≤ p0, then the majority bit of x is 0 • Hence the maximum number of queries A uses is at least the number of queries needed to compute majority-by-a-margin

  49. Proof: Slow PR Convergence, Randomized Algorithms: • By reduction from the OR problem • Input: bits x1, x2, x3, ..., xm • Output: the OR of the input bits • The resulting lower bound on the number of queries holds even for randomized algorithms

  50. The Reduction: • A: algorithm that computes local PR • B: algorithm that computes the OR function (figure: an m-bit input x and the graph G_x built from it, with nodes S1, ..., Sm, T and target node u)
