
Ranking



Presentation Transcript


  1. Ranking

  2. Motivation: searching the web

  3. Motivation: Basis for recommendation

  4. Motivation: Personalized search

  5. Why graphs? • The underlying data is naturally a graph • WWW graph: Websites are linked with each other • Papers linked by citation • Authors linked by co-authorship • Bipartite graph of customers and products • Friendship networks: who knows whom

  6. What are we looking for • Rank nodes for a particular query • Top k matches for “Random Walks” from Citeseer • Who are the most likely co-authors of “Manuel Blum”. • Top k book recommendations for Purna from Amazon • Top k websites matching “Sound of Music” • Top k friend recommendations for Purna when she joins “Facebook”

  7. How do search engines decide how to rank your query results? • How does Google rank the query results? • How would you do it?

  8. Naïve ranking of query results • Given query q • Rank the web pages p in your database based on some similarity measure sim(p,q) • Scenarios where this is not a good idea?
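
The naïve scheme above can be sketched in a few lines. The slide does not say which similarity measure sim(p,q) is meant, so this sketch assumes Jaccard overlap of lowercase terms; the function names and the toy page database are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_rank(pages: dict[str, str], query: str) -> list[tuple[str, float]]:
    """Return (page id, sim(p, q)) pairs sorted by decreasing similarity."""
    q_terms = set(query.lower().split())
    scored = [
        (pid, jaccard(set(text.lower().split()), q_terms))
        for pid, text in pages.items()
    ]
    # Highest score first; ties are broken by page id for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy database, for illustration only.
pages = {
    "p1": "random walks on graphs",
    "p2": "random number generators",
    "p3": "cooking with garlic",
}
ranking = naive_rank(pages, "random walks")
```

A page that simply repeats the query terms would score highly under any such text-only measure, which is one scenario where this approach breaks down.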

  9. Why Link Analysis? • First generation search engines • view documents as flat text files • could not cope with size or user needs • Second generation search engines • ranking becomes critical • use of Web specific data: Link Analysis • shift from relevance to authoritativeness

  10. Link Analysis: Intuition • A link from page p to page q denotes endorsement • page p considers page q an authority on a subject • “mine” the web graph of recommendations • assign an authority value to every page

  11. Link Analysis Ranking (LAR) Algorithms • Start with a collection of web pages • Extract the underlying hyperlink graph • Run the LAR algorithm on the graph • Output: an authority weight w for each node

  12. Two Types of Algorithms • Query dependent: rank a small subset of pages related to a specific query • HITS (Kleinberg 98) was proposed as query dependent • Query independent: rank the whole Web • PageRank (Brin and Page 98) was proposed as query independent

  13. Query-dependent LAR • Given a query q, find a subset of web pages S that are related to q • Rank the pages in S based on some ranking criterion

  14. Query-dependent input • Root set: includes pages relevant to the query q [Figure: Root Set]

  15. Query-dependent input • Root set: includes pages relevant to the query q [Figure: Root Set with OUT and IN links]

  16. Query-dependent input • Root set: includes pages relevant to the query q [Figure: Root Set with OUT and IN links]

  17. Query-dependent input [Figure: Base Set containing the Root Set and the pages reached through OUT and IN links]

  18. Properties of a good base set S • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities

  19. How to construct a good seed set S • Given a query q: • collect the t highest-ranked pages for q from a text-based search engine to form set Γ • S = Γ • add to S all the pages pointing to Γ • add to S all the pages that pages from Γ point to
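
The construction steps above can be sketched as follows. This is a minimal illustration, assuming the link graph is already available as a dictionary from each page to the set of pages it links to; `build_base_set` and the toy graph are hypothetical names, and the text-search step that produces Γ is not shown.

```python
def build_base_set(root: set[str], out_links: dict[str, set[str]]) -> set[str]:
    """Expand the root set Γ with pages pointing to it and pages it points to."""
    base = set(root)  # S = Γ
    for page, targets in out_links.items():
        if targets & root:  # this page points to at least one page in Γ
            base.add(page)
    for page in root:
        base |= out_links.get(page, set())  # pages that Γ points to
    return base


# Hypothetical toy link graph: a -> b -> c, and d -> e.
out_links = {
    "a": {"b"},
    "b": {"c"},
    "c": set(),
    "d": {"e"},
    "e": set(),
}
base = build_base_set({"b"}, out_links)
```

With Γ = {b}, the result contains b itself, its in-neighbor a, and its out-neighbor c, while the unrelated pages d and e stay out.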

  20. Link Filtering • Navigational links: serve the purpose of moving within a site (or to related sites) • www.espn.com → www.espn.com/nba • www.yahoo.com → www.yahoo.it • Filter out navigational links • same domain name • same IP address
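
A rough sketch of the "same domain name" filter, using the two example links from the slide. It assumes that comparing host names is an acceptable stand-in for "same domain"; the helper names are hypothetical, and the "same IP address" test is left out because it would need a DNS lookup.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host name of a URL, with or without a scheme."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat the first component as the host
    return (urlsplit(url).hostname or "").lower()


def drop_navigational(links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only links whose source and target are on different hosts."""
    return [(src, dst) for src, dst in links if host_of(src) != host_of(dst)]


links = [
    ("www.espn.com", "www.espn.com/nba"),  # same host: filtered out
    ("www.yahoo.com", "www.yahoo.it"),     # different hosts: kept by this test
    ("www.espn.com", "www.yahoo.com"),     # hypothetical cross-site link: kept
]
kept = drop_navigational(links)
```

The www.yahoo.com → www.yahoo.it link survives the host comparison, so catching links between related sites would need an additional check such as the IP address test listed on the slide.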

  21. Hubs and Authorities [K98] • Based on the relationship between • authorities (pages) for a topic and • hubs (pages) that link to many related authorities • We observe that: • a certain natural type of equilibrium exists between hubs and authorities in the graph defined by the link structure • We exploit this equilibrium to develop an algorithm that identifies both types of pages simultaneously

  22. Hubs and Authorities [K98] • Authority is not necessarily transferred directly between authorities • Pages have double identity • hub identity • authority identity • Good hubs point to good authorities • Good authorities are pointed to by good hubs [Figure: hubs on one side linking to authorities on the other]

  23. HITS Algorithm • Initialize all weights to 1 • Repeat until convergence • O operation: hubs collect the weight of the authorities they point to • I operation: authorities collect the weight of the hubs that point to them • Normalize weights under some norm • Why do we need normalization?
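
The loop above can be sketched in plain Python. This is an illustrative version, not a reference implementation: it runs a fixed number of rounds instead of testing convergence, applies the I operation before the O operation, and assumes the L2 norm for normalization. The function name and toy graph are hypothetical.

```python
import math


def hits(nodes, edges, iterations=50):
    """Return (hub, authority) weight dicts for a directed edge list."""
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # I operation: each authority collects the weights of hubs pointing to it.
        auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # O operation: each hub collects the weights of authorities it points to.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both vectors (L2 norm assumed) so the weights stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
            for v in weights:
                weights[v] /= norm
    return hub, auth


# Hypothetical toy graph: h1 links to a1 and a2, h2 links to a1 only.
nodes = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
hub, auth = hits(nodes, edges)
```

Without the normalization step the weights grow geometrically with each round, which is one answer to the question on the slide.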

  24. HITS and eigenvectors • Let A be the adjacency matrix of the Link Analysis graph • A(i, j) = weight on the edge from i to j [Figure: example graph with edge weights 1 and its adjacency matrix A]

  25. HITS and eigenvectors • Let $a_t$ be the authority weight vector and $h_t$ the hub weight vector at iteration t • Then in vector terms we have: $a_t = A^T h_{t-1}$ and $h_t = A a_t$, hence $a_t = A^T A \, a_{t-1}$ and $h_t = A A^T h_{t-1}$ • The authority weight vector a is the strongest eigenvector of $A^T A$ • The hub weight vector h is the strongest eigenvector of $A A^T$

  26. Singular Value Decomposition • Linear trend v in matrix A: • the tendency of the row vectors of A to align with vector v • SVD can be used to discover the linear trends in the data • $v_i$: the i-th strongest linear trend • $\lambda_i$: the strength of the i-th strongest linear trend • HITS discovers the strongest linear trend in the authority space [Figure: data points with trend directions $v_1$, $v_2$ and strengths $\lambda_1$, $\lambda_2$]

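  27. Singular Value Decomposition • $A = U \Sigma V^T$, with factors of dimensions [n×r] [r×r] [r×n] • r: rank of matrix A • $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r$: singular values (square roots of the eigenvalues of $A A^T$ and $A^T A$) • $u_1, \dots, u_r$: left singular vectors (eigenvectors of $A A^T$) • $v_1, \dots, v_r$: right singular vectors (eigenvectors of $A^T A$)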

  28. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect [Figure: two communities of hubs (h) and authorities (a), one larger than the other]

  29. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect [Figure: all six hub weights initialized to 1]

  30. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect [Figure: all five authority weights equal to 3]

  31. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect [Figure: hub weights $3^2$ in the larger community and $3 \cdot 2$ in the smaller one]

  32. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect [Figure: authority weights $3^3$ in the larger community and $3^2 \cdot 2$ in the smaller one]

  33. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect [Figure: hub weights $3^4$ in the larger community and $3^2 \cdot 2^2$ in the smaller one]

  34. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect • After n iterations the weight of node p is proportional to the number of paths that leave node p [Figure: hub weights $3^{2n}$ in the larger community and $3^n \cdot 2^n$ in the smaller one]

  35. HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect • After normalization with the max element, as n → ∞ [Figure: weights 1 in the larger community and 0 in the smaller one]
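
The growth of the weights in this example can be checked numerically. The sketch below assumes the graph implied by the values on the slides: one community in which 3 hubs each point to the same 3 authorities, and a second one in which 3 hubs each point to the same 2 authorities. Node names and the function name are hypothetical, and the updates are left unnormalized to expose the raw counts.

```python
def hits_unnormalized(hubs, auths, edges, rounds):
    """Run `rounds` I/O update rounds without normalization (integer weights)."""
    h = {v: 1 for v in hubs}
    a = {v: 0 for v in auths}
    for _ in range(rounds):
        a = {v: 0 for v in auths}
        for src, dst in edges:  # I operation
            a[dst] += h[src]
        h = {v: 0 for v in hubs}
        for src, dst in edges:  # O operation
            h[src] += a[dst]
    return h, a


# Larger community: hubs h0..h2 each point to authorities a0..a2.
edges = [(f"h{i}", f"a{j}") for i in range(3) for j in range(3)]
# Smaller community: hubs g0..g2 each point to authorities b0..b1.
edges += [(f"g{i}", f"b{j}") for i in range(3) for j in range(2)]
hubs = [f"h{i}" for i in range(3)] + [f"g{i}" for i in range(3)]
auths = [f"a{j}" for j in range(3)] + [f"b{j}" for j in range(2)]

n = 10
h, a = hits_unnormalized(hubs, auths, edges, n)
ratio = h["g0"] / h["h0"]  # equals (2/3)**n, which tends to 0 as n grows
```

After n rounds the hub weights are 3^(2n) and 3^n · 2^n, so dividing by the maximum sends the smaller community's weights toward 0.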

  36. Strengths and weaknesses of HITS • Strength: its ability to rank pages according to the query topic, which may provide more relevant authority and hub pages • Weaknesses: • Can be easily spammed: adding out-links to one’s own page is easy • Topic drift: many pages in the expanded set may not be on topic • Inefficiency at query time: collecting the root set, expanding it, and performing eigenvector computation are all expensive operations

  37. Query-independent LAR • Have an a priori ordering π of the web pages • Q: Set of pages that contain the keywords in the query q • Present the pages in Q ordered according to order π

  38. InDegree algorithm • Rank pages according to in-degree • $w_i = |B(i)|$, where B(i) is the set of pages that link to node i [Figure: five example pages (Red, Yellow, Blue, Purple, Green) with in-degree weights 3, 2, 2, 1, 1]
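
In-degree ranking is short enough to sketch directly. The edge list below is a hypothetical toy graph, not the one in the slide's figure, and the function name is illustrative.

```python
def indegree_weights(nodes, edges):
    """Return w[i] = |B(i)|, the number of distinct pages linking to i."""
    w = {v: 0 for v in nodes}
    for _src, dst in set(edges):  # count each distinct link once
        w[dst] += 1
    return w


# Hypothetical toy graph given as (source, target) pairs.
nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "b"), ("c", "a")]
weights = indegree_weights(nodes, edges)
# Highest in-degree first; ties broken by name for determinism.
ranking = sorted(nodes, key=lambda v: (-weights[v], v))
```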

  39. Query-independent LAR • In-degree is a local measure • All links to a page are considered equal, regardless of where they come from • Two pages with the same in-degree are considered equally important, even if one is cited by more prestigious sources than the other

  40. PageRank algorithm [BP98] • Good authorities should be pointed to by good authorities • Random walk on the web graph • pick a page at random • with probability 1 - α jump to a random page • with probability α follow a random outgoing link • PageRank of page p: $PR(p) = \alpha \sum_{q \to p} \frac{PR(q)}{|F(q)|} + (1 - \alpha) \frac{1}{n}$, where n is the number of pages • F(q): the set of outgoing links of page q [Figure: example graph of Red, Purple, Yellow, Blue, and Green pages]
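
The random-walk description above can be turned into a small power-iteration sketch. Several details are assumptions rather than something the slide states: the value α = 0.85, a fixed iteration count instead of a convergence test, and spreading the weight of pages without outgoing links uniformly over all pages. The function name and toy graph are hypothetical, and every link target must also appear as a key.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Return a dict of PageRank scores for a graph given as page -> targets."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}  # start from a random page
    for _ in range(iterations):
        # With probability 1 - alpha the surfer jumps to a random page.
        nxt = {v: (1.0 - alpha) / n for v in nodes}
        for q, targets in out_links.items():
            if targets:
                # With probability alpha, follow one of q's |F(q)| outgoing links.
                share = alpha * pr[q] / len(targets)
                for p in targets:
                    nxt[p] += share
            else:
                # Assumption: a page with no outgoing links jumps uniformly.
                for p in nodes:
                    nxt[p] += alpha * pr[q] / n
        pr = nxt
    return pr


# Hypothetical toy graph.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(out_links)
best = max(scores, key=scores.get)
```

In this toy graph page d has no incoming links, so its score stays at the jump term (1 - α)/n, while c collects weight from three pages and ranks first.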

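  41. What is a random walk? [Figure: example graph with transition probabilities 1 and 1/2, walker position at t=0]

  42. [Figure: random walk on the example graph, positions at t=0 and t=1]

  43. [Figure: random walk on the example graph, positions at t=0, t=1, and t=2]

  44. [Figure: random walk on the example graph, positions at t=0, t=1, t=2, and t=3]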

  45. Markov chains • A Markov chain describes a discrete time stochastic process over a set of states S = {s1, s2, …, sn} according to a transition probability matrix P = {$P_{ij}$} • $P_{ij}$ = probability of moving to state j when at state i • $\sum_j P_{ij} = 1$ (stochastic matrix) • From any node the probability of moving out is 1 • Memorylessness property: The next state of the chain depends only on the current state and not on the past of the process • Transition matrix of the three-state example: $P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \end{pmatrix}$ [Figure: three-state chain with edges 1 → 2 (probability 1), 2 → 3 (probability 1), 3 → 1 and 3 → 2 (probability 1/2 each)]
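
A small sketch of the two properties above: checking that a matrix is stochastic, and simulating a walk in which the next state is drawn from the current state's row only. The matrix is the three-state example as read from the slide (rows are the current state, columns the next state, states 0-indexed); the function names are hypothetical.

```python
import random

# Three-state example; state k here corresponds to s(k+1) on the slide.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]


def is_stochastic(matrix, tol=1e-12):
    """True if every row is non-negative and sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def simulate(matrix, start, steps, rng):
    """Return the list of visited states, starting from `start`."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Memorylessness: the draw uses only the row of the current state.
        nxt = rng.choices(range(len(matrix)), weights=matrix[current])[0]
        path.append(nxt)
    return path


path = simulate(P, start=0, steps=10, rng=random.Random(0))
```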

  46. Random walks • Random walks on graphs correspond to Markov Chains • The set of states S is the set of nodes of the graph G • The transition probability matrix gives the probability that we follow an edge from one node to another

  47. An example [Figure: directed graph on nodes v1, v2, v3, v4, v5]

  48. State probability vector • Define a vector $q^t = (q^t_1, q^t_2, \dots, q^t_n)$ that stores the probability of being at state i at time t • $q^0_i$ = the probability of starting from state i

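  49. An example • $q^{t+1}_1 = \frac{1}{3} q^t_4 + \frac{1}{2} q^t_5$ • $q^{t+1}_2 = \frac{1}{2} q^t_1 + q^t_3 + \frac{1}{3} q^t_4$ • $q^{t+1}_3 = \frac{1}{2} q^t_1 + \frac{1}{3} q^t_4$ • $q^{t+1}_4 = \frac{1}{2} q^t_5$ • $q^{t+1}_5 = q^t_2$ • In matrix form: $q^t = q^{t-1} P$ [Figure: directed graph on nodes v1, v2, v3, v4, v5]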
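
The update equations above can be checked with exact fractions. The transition matrix below is inferred from the coefficients of those equations (row i is the current node, column j the next node, with v1..v5 mapped to indices 0..4); the uniform start vector and the function name are assumptions made for illustration.

```python
from fractions import Fraction as F

P = [
    [0, F(1, 2), F(1, 2), 0, 0],        # v1 -> v2, v3
    [0, 0, 0, 0, 1],                    # v2 -> v5
    [0, 1, 0, 0, 0],                    # v3 -> v2
    [F(1, 3), F(1, 3), F(1, 3), 0, 0],  # v4 -> v1, v2, v3
    [F(1, 2), 0, 0, F(1, 2), 0],        # v5 -> v1, v4
]


def step(q, matrix):
    """One step of the walk: returns the row vector q * matrix."""
    n = len(matrix)
    return [sum(q[i] * matrix[i][j] for i in range(n)) for j in range(n)]


q0 = [F(1, 5)] * 5  # assumed start: uniform over the five nodes
q1 = step(q0, P)
```

Starting from the uniform vector, one step gives (1/6, 11/30, 1/6, 1/10, 1/5), which still sums to 1.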

  50. Stationary Distribution • When the surfer keeps walking for a long time, the distribution does not change anymore • i.e., $q^t = q^{t-1} = \pi$ • We will call π the stationary distribution • For “well-behaved” graphs this does not depend on the start distribution!
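
One way to see the stationary distribution emerge is to apply the update repeatedly until the vector stops changing. The sketch below does this for the three-state chain from the Markov chains slide, as read from that slide; the tolerance, iteration cap, and function name are assumptions.

```python
def stationary(matrix, start=None, tol=1e-12, max_iter=10_000):
    """Iterate q <- q * matrix until successive vectors differ by less than tol."""
    n = len(matrix)
    q = list(start) if start is not None else [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(q[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(q, nxt)) < tol:
            return nxt
        q = nxt
    raise RuntimeError("no convergence; the chain may not be well-behaved")


# Three-state example (rows: current state, columns: next state).
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
pi_uniform = stationary(P)
pi_from_s1 = stationary(P, start=[1.0, 0.0, 0.0])
```

Both runs approach π = (0.2, 0.4, 0.4), which satisfies π = πP, illustrating that for this chain the limit does not depend on the start distribution.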
