Ranking. Motivation : searching the web. Motivation: Basis for recommendation. Motivation: Personalized search. Why graphs?. The underlying data is naturally a graph WWW graph: Websites are linked with each other Papers linked by citation Authors linked by co-authorship
Why graphs? • The underlying data is naturally a graph • WWW graph: Websites are linked with each other • Papers linked by citation • Authors linked by co-authorship • Bipartite graph of customers and products • Friendship networks: who knows whom
What are we looking for • Rank nodes for a particular query • Top k matches for “Random Walks” from Citeseer • Who are the most likely co-authors of “Manuel Blum”. • Top k book recommendations for Purna from Amazon • Top k websites matching “Sound of Music” • Top k friend recommendations for Purna when she joins “Facebook”
How do search engines decide how to rank your query results? • How does Google rank the query results? • How would you do it?
Naïve ranking of query results • Given query q • Rank the web pages p in your database based on some similarity measure sim(p,q) • Scenarios where this is not a good idea?
Why Link Analysis? • First generation search engines • view documents as flat text files • could not cope with size or user needs • Second generation search engines • ranking becomes critical • use of Web specific data: Link Analysis • shift from relevance to authoritativeness
Link Analysis: Intuition • A link from page p to page q denotes endorsement • page p considers page q an authority on a subject • “mine” the web graph of recommendations • assign an authority value to every page
Link Analysis Ranking (LAR) Algorithms • Start with a collection of web pages • Extract the underlying hyperlink graph • Run the LAR algorithm on the graph • Output: an authority weight w for each node
Two Types of Algorithms • Query dependent: rank a small subset of pages related to a specific query • HITS (Kleinberg 98) was proposed as query dependent • Query independent: rank the whole Web • PageRank (Brin and Page 98) was proposed as query independent
Query-dependent LAR • Given a query q, find a subset of web pages S that are related to q • Rank the pages in S based on some ranking criterion
Query-dependent input • Root set: includes pages relevant to the query q • [Figure: the root set, the pages pointing into it (IN), the pages it points to (OUT), and the base set that contains all three]
Properties of a good base set S • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities
How to construct a good base set S • Given a query q: • collect the t highest-ranked pages for q from a text-based search engine to form the root set Γ • S = Γ • add to S all the pages pointing to Γ • add to S all the pages that pages from Γ point to
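The expansion steps above can be sketched in a few lines of Python. This is a minimal illustration, not Kleinberg's implementation: `build_base_set`, the `out_links` dictionary, and the toy link graph are all hypothetical names and data made up for the example, and the text-search step that produces Γ is assumed to have already happened.

```python
def build_base_set(root_set, out_links):
    """Expand a root set (Γ) into a base set S.

    root_set:  iterable of page ids returned by a text-based search engine.
    out_links: dict mapping a page id to the set of page ids it links to.
    """
    root = set(root_set)
    base = set(root)                        # S = Γ
    for page, targets in out_links.items():
        if targets & root:                  # page points to some page in Γ
            base.add(page)
    for page in root:
        base |= out_links.get(page, set())  # pages that Γ points to
    return base


# Toy link graph, invented for illustration only.
toy_links = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": set(),
    "d": {"e"},
    "e": {"a"},
    "f": {"b"},
}

print(sorted(build_base_set(["b"], toy_links)))
```

With root set {b}, pages a and f are added because they point to b, and d is added because b points to it.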
Link Filtering • Navigational links: serve the purpose of moving within a site (or to related sites) • www.espn.com → www.espn.com/nba • www.yahoo.com → www.yahoo.it • Filter out navigational links • same domain name • same IP address
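A rough sketch of such a filter follows. It makes two simplifying assumptions that are not from the slides: "same domain name" is approximated by an exact host-name match, and IP addresses come from an optional, caller-supplied dictionary `ip_of` rather than from a DNS lookup. The function names and the example addresses (taken from the documentation range 192.0.2.0/24) are hypothetical.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the lower-cased host name of a URL, with or without a scheme."""
    if "//" not in url:
        url = "//" + url          # lets urlsplit recognise the host part
    return (urlsplit(url).hostname or "").lower()


def is_navigational(src, dst, ip_of=None):
    """True if the link src -> dst stays on the same host or the same IP."""
    h_src, h_dst = host(src), host(dst)
    if h_src == h_dst:
        return True
    if ip_of is not None:
        ip_src, ip_dst = ip_of.get(h_src), ip_of.get(h_dst)
        return ip_src is not None and ip_src == ip_dst
    return False


def filter_links(links, ip_of=None):
    """Keep only the links that are not navigational."""
    return [(s, d) for s, d in links if not is_navigational(s, d, ip_of)]


links = [
    ("www.espn.com", "www.espn.com/nba"),
    ("www.yahoo.com", "www.yahoo.it"),
    ("www.espn.com", "www.yahoo.com"),
]
# Made-up address table: both Yahoo hosts are assumed to share one IP.
ip_of = {"www.yahoo.com": "192.0.2.1", "www.yahoo.it": "192.0.2.1"}

print(filter_links(links))          # host-name check only
print(filter_links(links, ip_of))   # host-name and IP check
```

The host-name check alone removes the ESPN link; the Yahoo link is only caught once the shared IP address is known.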
Hubs and Authorities [K98] • Based on the relationship between • authorities (pages) for a topic and • hubs (pages) that link to many related authorities • We observe that: • a certain natural type of equilibrium exists between hubs and authorities in the graph defined by the link structure • We exploit this equilibrium to develop an algorithm that identifies both types of pages simultaneously
Hubs and Authorities [K98] • Authority is not necessarily transferred directly between authorities • Pages have a double identity • hub identity • authority identity • Good hubs point to good authorities • Good authorities are pointed to by good hubs • [Figure: hubs on one side pointing to authorities on the other]
HITS Algorithm • Initialize all weights to 1 • Repeat until convergence • O operation: hubs collect the weight of the authorities they point to • I operation: authorities collect the weight of the hubs that point to them • Normalize weights under some norm • Why do we need normalization?
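The loop can be sketched as follows. This is an illustrative version under stated assumptions: it runs a fixed number of iterations instead of testing for convergence, it applies the I operation before the O operation within each round, and it normalizes by the maximum element (one possible choice of norm). The function name `hits` and the toy graph are made up. Without the normalization step the weights would grow without bound from one iteration to the next.

```python
def hits(out_links, iterations=50):
    """Return (hub, authority) weight dictionaries for a directed graph.

    out_links: dict mapping a node to an iterable of nodes it points to.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes |= set(targets)

    hub = {n: 1.0 for n in nodes}           # initialize all weights to 1
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # I operation: each authority collects the weight of its hubs.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # O operation: each hub collects the weight of its authorities.
        new_hub = {
            n: sum(new_auth[d] for d in out_links.get(n, ())) for n in nodes
        }
        # Normalize by the largest weight so the values stay bounded.
        a_max = max(new_auth.values()) or 1.0
        h_max = max(new_hub.values()) or 1.0
        auth = {n: v / a_max for n, v in new_auth.items()}
        hub = {n: v / h_max for n, v in new_hub.items()}
    return hub, auth


# Toy graph, invented for illustration: h1 and h2 act as hubs.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
print(sorted(auth.items()))
print(sorted(hub.items()))
```

In the toy graph, a1 is pointed to by both hubs and ends up with the top authority weight, while h1 points to both authorities and ends up as the top hub.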
HITS and eigenvectors • Let A be the adjacency matrix of the Link Analysis graph • A(i, j) = weight on the edge from i to j • [Figure: example graph with edge weights 1 and its adjacency matrix A]
HITS and eigenvectors • Let $a_t$ be the authority weight vector and $h_t$ the hub weight vector at iteration t • Then in vector terms we have: $a_t = A^T h_{t-1}$ and $h_t = A a_t$ • Hence: $a_t = A^T A \, a_{t-1}$ and $h_t = A A^T h_{t-1}$ • The authority weight vector a is the strongest eigenvector of $A^T A$ • The hub weight vector h is the strongest eigenvector of $A A^T$
Singular Value Decomposition • Linear trend v in matrix A: the tendency of the row vectors of A to align with vector v • SVD can be used to discover the linear trends in the data • $v_i$: the i-th strongest linear trend • $\lambda_i$: the strength of the i-th strongest linear trend • [Figure: data points with the two strongest trends $v_1$, $v_2$ and their strengths $\lambda_1$, $\lambda_2$] • HITS discovers the strongest linear trend in the authority space
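Singular Value Decomposition • $A = U \Sigma V^T$, with $U$ of size [n×r], $\Sigma$ of size [r×r], and $V^T$ of size [r×n] • r: rank of matrix A • $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r$: singular values (square roots of the eigenvalues of $A A^T$ and $A^T A$), the diagonal entries of $\Sigma$ • $u_1, \dots, u_r$: left singular vectors (eigenvectors of $A A^T$), the columns of $U$ • $v_1, \dots, v_r$: right singular vectors (eigenvectors of $A^T A$), the columns of $V$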
HITS and the TKC effect • The HITS algorithm favors the most dense community of hubs and authorities • Tightly Knit Community (TKC) effect • [Figure: two separate communities of hubs (h) and authorities (a), one denser than the other; the weights below list the denser community first] • Initial hub weights: 1, 1, 1 and 1, 1, 1 • Authority weights after the first I operation: 3, 3, 3 and 3, 3 • Hub weights after the first O operation: $3^2, 3^2, 3^2$ and $3 \cdot 2, 3 \cdot 2, 3 \cdot 2$ • Authority weights after the second I operation: $3^3, 3^3, 3^3$ and $3^2 \cdot 2, 3^2 \cdot 2$ • Hub weights after the second O operation: $3^4, 3^4, 3^4$ and $3^2 \cdot 2^2, 3^2 \cdot 2^2, 3^2 \cdot 2^2$ • After n iterations the hub weights are $3^{2n}, 3^{2n}, 3^{2n}$ and $3^n \cdot 2^n, 3^n \cdot 2^n, 3^n \cdot 2^n$: the weight of node p is proportional to the number of paths that leave node p • After normalization with the max element, as n → ∞: 1, 1, 1 and 0, 0, 0
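The effect can be reproduced with a few lines of exact integer arithmetic. The sketch assumes the two communities are complete bipartite blocks, one with 3 hubs and 3 authorities and one with 3 hubs and 2 authorities, which is what the weights on the slides suggest but is not stated explicitly. Because every hub inside a block has the same weight, the hypothetical helper `tkc_weights` tracks one hub weight per community.

```python
def tkc_weights(n):
    """Unnormalized hub weight per community after n HITS iterations.

    Assumed structure (inferred from the slide weights):
      dense community:   3 hubs, each pointing to all 3 authorities
      sparser community: 3 hubs, each pointing to both of 2 authorities
    """
    dense_hubs, dense_auths = 3, 3
    sparse_hubs, sparse_auths = 3, 2
    h_dense = h_sparse = 1                 # all weights start at 1
    for _ in range(n):
        a_dense = dense_hubs * h_dense     # I operation
        a_sparse = sparse_hubs * h_sparse
        h_dense = dense_auths * a_dense    # O operation
        h_sparse = sparse_auths * a_sparse
    return h_dense, h_sparse


for n in (1, 2, 10, 50):
    h_dense, h_sparse = tkc_weights(n)
    # Ratio after normalizing by the max element (the dense community).
    print(n, h_sparse / h_dense)
```

The printed ratio is $(2/3)^n$, so the hubs of the sparser community are driven toward 0 after normalization even though they differ from the dense community by only one authority.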
Strengths and weaknesses of HITS • Strength: it ranks pages according to the query topic, which may provide more relevant authority and hub pages • Weaknesses: • Can be easily spammed: adding out-links to one’s own page is easy • Topic drift: many pages in the expanded set may not be on topic • Inefficiency at query time: collecting the root set, expanding it, and performing the eigenvector computation are all expensive operations
Query-independent LAR • Have an a priori ordering π of the web pages • Q: set of pages that contain the keywords in the query q • Present the pages in Q ordered according to π
InDegree algorithm • Rank pages according to in-degree • $w_i = |B(i)|$, where B(i) is the set of pages that link to node i • [Figure: five pages with in-degree weights w=3, w=2, w=2, w=1, w=1, listed as Red Page, Yellow Page, Blue Page, Purple Page, Green Page]
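A minimal sketch of in-degree ranking, assuming the graph is given as a list of (source, destination) pairs. The function name `indegree_rank` and the toy edge list are invented for illustration; ties are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def indegree_rank(edges):
    """Return (node, in-degree) pairs, highest in-degree first."""
    weight = Counter(dst for _, dst in edges)       # w_i = |B(i)|
    nodes = {n for edge in edges for n in edge}     # include nodes with w = 0
    return sorted(((n, weight[n]) for n in nodes),
                  key=lambda pair: (-pair[1], pair[0]))


# Toy edge list, made up for the example.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a"), ("b", "a")]
print(indegree_rank(edges))
```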
Query-independent LAR • In-degree is a local measure • All links to a page are considered equal, regardless of where they come from • Two pages with the same in-degree are considered equally important, even if one is cited by more prestigious sources than the other
PageRank algorithm [BP98] • Good authorities should be pointed to by good authorities • Random walk on the web graph • pick a page at random • with probability 1 - α jump to a random page • with probability α follow a random outgoing link • PageRank of page p: $PR(p) = \alpha \sum_{q \to p} \frac{PR(q)}{|F(q)|} + (1 - \alpha) \frac{1}{n}$ • F(q): the set of outgoing links of page q • n: the total number of pages • [Figure: example graph with the pages listed as Red Page, Purple Page, Yellow Page, Blue Page, Green Page]
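The random-surfer description translates into a simple iterative computation: in each round a page keeps a share (1 - α)/n from random jumps and receives α · PR(q)/|F(q)| from every page q that links to it. The sketch below is one way to write this, under assumptions that are not from the slides: α = 0.85 as a default, a fixed number of iterations, and pages without outgoing links spreading their weight uniformly over all pages. The function name `pagerank` and the toy graph are hypothetical.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Iteratively approximate PageRank on a directed graph.

    out_links: dict mapping a page to a list of distinct pages it links to.
    alpha:     probability of following a link (1 - alpha: random jump).
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes |= set(targets)
    n = len(nodes)

    rank = {p: 1.0 / n for p in nodes}       # start from a uniform vector
    for _ in range(iterations):
        new_rank = {p: (1.0 - alpha) / n for p in nodes}   # random jump
        for q in nodes:
            targets = out_links.get(q, ())
            if targets:
                share = alpha * rank[q] / len(targets)     # PR(q) / |F(q)|
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page with no out-links jumps uniformly.
                for p in nodes:
                    new_rank[p] += alpha * rank[q] / n
        rank = new_rank
    return rank


# Toy graph, invented for illustration.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

The values sum to 1, so they can be read as the probability of finding the random surfer on each page.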
What is a random walk • [Figure: a walker moving over a small graph with edge probabilities 1 and 1/2, shown at steps t=0, t=1, t=2, t=3]
Markov chains • A Markov chain describes a discrete time stochastic process over a set of states $S = \{s_1, s_2, \dots, s_n\}$ according to a transition probability matrix $P = \{P_{ij}\}$ • $P_{ij}$ = probability of moving to state j when at state i • $\sum_j P_{ij} = 1$ (stochastic matrix) • From any node the probability of moving out is 1 • Memorylessness property: the next state of the chain depends only on the current state and not on the past of the process • [Figure: three states 1, 2, 3 with edge probabilities 1, 1, 1/2, 1/2] • Transition matrix: $P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \end{pmatrix}$
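A small sketch of these definitions, using exact fractions. It assumes the three-state transition matrix from the slide has rows (0, 1, 0), (0, 0, 1), and (1/2, 1/2, 0), which is the only reading of the extracted numbers whose rows each sum to 1. The helper names `is_stochastic` and `step` are made up for the example.

```python
from fractions import Fraction as F

# Assumed reading of the slide's matrix: row i holds the probabilities
# of moving from state i to each state j.
P = [
    [F(0), F(1), F(0)],
    [F(0), F(0), F(1)],
    [F(1, 2), F(1, 2), F(0)],
]


def is_stochastic(matrix):
    """Every row is non-negative and sums to 1."""
    return all(
        sum(row) == 1 and all(x >= 0 for x in row) for row in matrix
    )


def step(q, matrix):
    """One step of the chain: the row vector q multiplied by the matrix."""
    n = len(matrix)
    return [sum(q[i] * matrix[i][j] for i in range(n)) for j in range(n)]


print(is_stochastic(P))

q = [F(1), F(0), F(0)]      # start in state 1 with probability 1
for t in range(1, 4):
    q = step(q, P)          # depends only on the current distribution
    print(t, [str(x) for x in q])
```

Starting from state 1, the walk is forced to state 2 and then state 3, and only at the third step does the probability split evenly between states 1 and 2.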
Random walks • Random walks on graphs correspond to Markov chains • The set of states S is the set of nodes of the graph G • The transition probability matrix gives the probability that we follow an edge from one node to another
An example • [Figure: directed graph on the nodes v1, v2, v3, v4, v5]
State probability vector • Define a vector $q^t = (q^t_1, q^t_2, \dots, q^t_n)$, where $q^t_i$ is the probability of being at state i at time t • $q^0_i$ = the probability of starting from state i
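An example • [Figure: directed graph on the nodes v1, v2, v3, v4, v5] • $q^{t+1}_1 = \frac{1}{3} q^t_4 + \frac{1}{2} q^t_5$ • $q^{t+1}_2 = \frac{1}{2} q^t_1 + q^t_3 + \frac{1}{3} q^t_4$ • $q^{t+1}_3 = \frac{1}{2} q^t_1 + \frac{1}{3} q^t_4$ • $q^{t+1}_4 = \frac{1}{2} q^t_5$ • $q^{t+1}_5 = q^t_2$ • In matrix form: $q^t = q^{t-1} P$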
Stationary Distribution • When the surfer keeps walking for a long time, the distribution does not change anymore • i.e., $q^t = q^{t-1} = \pi$ • We will call π the stationary distribution • For “well-behaved” graphs this does not depend on the start distribution!
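The claim can be checked numerically on the five-node example by repeating the update $q^t = q^{t-1} P$ until the vector stops changing. In the sketch below the matrix `P` is read off the example's update equations (row i holds the probabilities of moving from $v_{i+1}$ to each node); the function names, the tolerance, and the iteration cap are choices made for this illustration.

```python
# Transition matrix of the five-node example, one row per node v1..v5.
P = [
    [0, 1 / 2, 1 / 2, 0, 0],      # v1 -> v2, v3
    [0, 0, 0, 0, 1],              # v2 -> v5
    [0, 1, 0, 0, 0],              # v3 -> v2
    [1 / 3, 1 / 3, 1 / 3, 0, 0],  # v4 -> v1, v2, v3
    [1 / 2, 0, 0, 1 / 2, 0],      # v5 -> v1, v4
]


def step(q, matrix):
    """One step of the walk: the row vector q multiplied by the matrix."""
    n = len(matrix)
    return [sum(q[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def stationary(q0, matrix, tol=1e-12, max_iter=10_000):
    """Iterate until successive distributions differ by less than tol."""
    q = list(q0)
    for _ in range(max_iter):
        nxt = step(q, matrix)
        if max(abs(a - b) for a, b in zip(q, nxt)) < tol:
            return nxt
        q = nxt
    return q


from_v1 = stationary([1, 0, 0, 0, 0], P)    # always start at v1
from_uniform = stationary([0.2] * 5, P)     # start anywhere, uniformly
print([round(x, 4) for x in from_v1])
print([round(x, 4) for x in from_uniform])
```

Both starting distributions lead to the same vector, which matches the exact solution of π = πP for this graph: π = (2/11, 3/11, 3/22, 3/22, 3/11).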