Link Analysis: PageRank
Ranking Nodes on the Graph
• Web pages are not equally "important": www.joe-schmoe.com vs. www.stanford.edu
• Since there is large diversity in the connectivity of the web graph, we can rank pages by their link structure
Slides by Jure Leskovec: Mining Massive Datasets
Link Analysis Algorithms
• We will cover the following link analysis approaches to computing the importance of nodes in a graph:
  • PageRank
  • Hubs and Authorities (HITS)
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms
Links as Votes
• Idea: links as votes
  • A page is more important if it has more links
  • In-coming links? Out-going links?
• Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  • Links from important pages count more
  • Recursive question!
Simple Recursive Formulation
• Each link's vote is proportional to the importance of its source page
• If page p with importance x has n out-links, each link gets x/n votes
• Page p's own importance is the sum of the votes on its in-links
PageRank: The "Flow" Model
• A "vote" from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a "rank" rj for node j: rj = Σi→j ri/di, where di is the out-degree of node i
• Example ("the web in 1839"): three pages y, a, m with links y→y, y→a, a→y, a→m, m→a
• Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Solving the Flow Equations
Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
• 3 equations, 3 unknowns, no constants
  • No unique solution: any solution can be rescaled
• An additional constraint forces uniqueness: ry + ra + rm = 1
  • Solution: ry = 2/5, ra = 2/5, rm = 1/5
• Gaussian elimination works for small examples, but we need a better method for large web-scale graphs
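For this tiny example, the constrained system can be solved directly. A minimal sketch, assuming NumPy is available (NumPy is not part of the original slides):

```python
import numpy as np

# Flow equations for the 3-node example (y, a, m):
#   r_y = r_y/2 + r_a/2
#   r_a = r_y/2 + r_m
#   r_m = r_a/2
# Rewrite as (M - I) r = 0; the rows are linearly dependent, so we
# replace one redundant row with the constraint r_y + r_a + r_m = 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = M - np.eye(3)
A[2] = [1.0, 1.0, 1.0]           # constraint row replaces a redundant equation
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)   # -> [0.4 0.4 0.2], i.e. r_y = r_a = 2/5, r_m = 1/5
```

This confirms the solution on the slide, but direct elimination is O(N³) and hopeless at web scale, which motivates power iteration below.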
PageRank: Matrix Formulation
• Stochastic adjacency matrix M
  • Let page j have dj out-links
  • If j → i, then Mij = 1/dj, else Mij = 0
  • M is a column-stochastic matrix: columns sum to 1
• Rank vector r: a vector with one entry per page
  • ri is the importance score of page i
  • Σi ri = 1
• The flow equations can be written r = M·r
Example: suppose page j links to 3 pages, including i. Then Mij = 1/3, and page j contributes rj/3 to ri in the product r = M·r.
Eigenvector Formulation
• The flow equations can be written r = M·r
• So the rank vector r is an eigenvector of the stochastic web matrix M
  • In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
Example: Flow Equations & M

         y    a    m
  y  [  1/2  1/2   0  ]
  a  [  1/2   0    1  ]
  m  [   0   1/2   0  ]

r = M·r gives the flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Power Iteration Method
• Given a web graph with N nodes, where nodes are pages and edges are hyperlinks
• Power iteration: a simple iterative scheme
  • Initialize: r(0) = [1/N, …, 1/N]T
  • Iterate: r(t+1) = M·r(t)
  • Stop when |r(t+1) – r(t)|1 < ε
• |x|1 = Σ1≤i≤N |xi| is the L1 norm
  • Can use any other vector norm, e.g., Euclidean
• di … out-degree of node i
PageRank: How to Solve?
• Power iteration:
  • Set r(0) = [1/N, …, 1/N]T
  • And iterate ri(t+1) = Σj Mij·rj(t)
• Example (y, a, m graph):
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2

  Iteration 0, 1, 2, …
  ry   1/3   1/3   5/12    9/24   …  6/15
  ra   1/3   3/6   1/3    11/24   …  6/15
  rm   1/3   1/6   3/12    1/6    …  3/15
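The iteration above is easy to sketch in code. A minimal version, assuming NumPy (an assumption; the slides prescribe no language):

```python
import numpy as np

# Column-stochastic matrix for the y, a, m example
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

def power_iterate(M, eps=1e-10):
    """Run r(t+1) = M @ r(t) until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # L1 norm stopping rule
            return r_next
        r = r_next

r = power_iterate(M)
print(r)   # -> approximately [0.4, 0.4, 0.2]
```

This reproduces the limit 6/15, 6/15, 3/15 from the table, i.e. the same 2/5, 2/5, 1/5 found by solving the flow equations directly.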
Random Walk Interpretation
• Imagine a random web surfer:
  • At any time t, the surfer is on some page u
  • At time t+1, the surfer follows an out-link from u uniformly at random
  • Ends up on some page v linked from u
  • Process repeats indefinitely
• Let p(t) be the vector whose ith coordinate is the probability that the surfer is at page i at time t
  • p(t) is a probability distribution over pages
The Stationary Distribution
• Where is the surfer at time t+1?
  • Follows a link uniformly at random: p(t+1) = M·p(t)
• Suppose the random walk reaches a state where p(t+1) = M·p(t) = p(t)
  • Then p(t) is a stationary distribution of the random walk
• Our rank vector r satisfies r = M·r
  • So it is a stationary distribution for the random walk
PageRank: Three Questions
• Does this converge?
• Does it converge to what we want?
• Are results reasonable?
Does This Converge?
• Example: two pages with a→b and b→a, so ra(t+1) = rb(t) and rb(t+1) = ra(t):

  Iteration 0, 1, 2, …
  ra   1   0   1   0   …
  rb   0   1   0   1   …

• The ranks oscillate forever and never converge
Does It Converge to What We Want?
• Example: a→b, where b has no out-links:

  Iteration 0, 1, 2, …
  ra   1   0   0   0
  rb   0   1   0   0

• All importance leaks out and the ranks converge to zero
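The leak is easy to see numerically. A small sketch (assuming NumPy): b's column in M is all zeros, so each multiplication by M can shrink the total rank mass.

```python
import numpy as np

# a -> b, and b is a dead end: column b of M sums to 0,
# so M is not column stochastic and rank mass "leaks out".
M = np.array([[0.0, 0.0],    # nothing points to a
              [1.0, 0.0]])   # a points to b
r = np.array([1.0, 0.0])     # start with all mass on a
for t in range(3):
    r = M @ r
    print("iteration", t + 1, r, "total mass:", r.sum())
# iteration 1: [0. 1.], total mass 1.0
# iteration 2: [0. 0.], total mass 0.0  (the mass fell off the dead end)
# iteration 3: [0. 0.], total mass 0.0
```

One step moves the mass to b; the next step has nowhere to send it, and the total drops to zero, matching the table above.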
Problems with the "Flow" Model
2 problems:
• Some pages are "dead ends" (have no out-links)
  • Such pages cause importance to "leak out"
• Spider traps (all out-links are within the group)
  • Eventually spider traps absorb all importance
Problem: Spider Traps
• Power iteration:
  • Set r(0) = [1/3, 1/3, 1/3]T
  • And iterate r(t+1) = M·r(t)
• Example (m now links only to itself):
  ry = ry/2 + ra/2
  ra = ry/2
  rm = ra/2 + rm

  Iteration 0, 1, 2, …
  ry   1/3   2/6   3/12    5/24   …  0
  ra   1/3   1/6   2/12    3/24   …  0
  rm   1/3   3/6   7/12   16/24   …  1
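The absorption can be verified numerically. A sketch of the spider-trap example above, assuming NumPy:

```python
import numpy as np

# Spider trap: m links only to itself, so once the surfer
# reaches m, it never leaves, and m soaks up all importance.
M = np.array([[0.5, 0.5, 0.0],   # y <- y/2 + a/2
              [0.5, 0.0, 0.0],   # a <- y/2
              [0.0, 0.5, 1.0]])  # m <- a/2 + m
r = np.full(3, 1.0 / 3)
for _ in range(100):
    r = M @ r
print(r.round(3))   # -> [0. 0. 1.]: the trap absorbed everything
```

Unlike a dead end, the total mass stays 1 here (M is still column stochastic); the problem is *where* it ends up.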
Solution: Random Teleports
• The Google solution for spider traps: at each time step, the random surfer has two options:
  • With probability β, follow a link at random
  • With probability 1-β, jump to some page uniformly at random
  • Common values for β are in the range 0.8 to 0.9
• The surfer will teleport out of a spider trap within a few time steps
Problem: Dead Ends
• Power iteration:
  • Set r(0) = [1/3, 1/3, 1/3]T
  • And iterate r(t+1) = M·r(t)
• Example (m now has no out-links):
  ry = ry/2 + ra/2
  ra = ry/2
  rm = ra/2

  Iteration 0, 1, 2, …
  ry   1/3   2/6   3/12   5/24   …  0
  ra   1/3   1/6   2/12   3/24   …  0
  rm   1/3   1/6   1/12   2/24   …  0
Solution: Dead Ends
• Teleports: from dead ends, follow random teleport links with probability 1.0
• Adjust the matrix accordingly
Why Do Teleports Solve the Problem?
Markov chains:
• Set of states X
• Transition matrix P, where Pij = P(Xt = i | Xt-1 = j)
• π: a vector specifying the probability of being at each state x ∈ X
• Goal: find π such that π = P·π
Why is This Analogy Useful?
• Theory of Markov chains
• Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.
Make M Stochastic
• Stochastic: every column sums to 1
• A possible solution: from each dead end, add (green) links to every node, each with probability 1/N
  ry = ry/2 + ra/2 + rm/3
  ra = ry/2 + rm/3
  rm = ra/2 + rm/3
• In matrix form: A = M + (1/N)·1·aT, where
  • ai = 1 if node i has out-degree 0, ai = 0 otherwise
  • 1 … vector of all 1s
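The fix amounts to replacing each all-zero column with a uniform column. A sketch assuming NumPy (the helper name `make_stochastic` is mine, not from the slides):

```python
import numpy as np

def make_stochastic(M):
    """Replace each all-zero (dead-end) column of M with uniform 1/N."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0     # a_i = 1 exactly where out-degree is 0
    M[:, dead] = 1.0 / N          # dead ends now teleport uniformly
    return M

# y -> {y, a}, a -> {y, m}, and m is a dead end (zero column)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
S = make_stochastic(M)
print(S[:, 2])         # -> [0.333..., 0.333..., 0.333...]
print(S.sum(axis=0))   # -> [1. 1. 1.]: every column now sums to 1
```

The resulting column m matches the rm/3 terms in the flow equations above.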
Make M Aperiodic
• A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k
• A possible solution: add green links
Make M Irreducible
• Irreducible: from any state, there is a non-zero probability of reaching any other state
• A possible solution: add green links
Solution: Random Jumps
• Google's solution that does it all: makes M stochastic, aperiodic, and irreducible
• At each step, the random surfer has two options:
  • With probability β, follow a link at random
  • With probability 1-β, jump to some random page
• PageRank equation [Brin-Page, 98]:
  rj = Σi→j β·ri/di + (1-β)·1/N
  (di … out-degree of node i)
• From now on, we assume M has no dead ends: that is, we follow random teleport links with probability 1.0 from dead ends
The Google Matrix
• PageRank equation [Brin-Page, 98]:
  rj = Σi→j β·ri/di + (1-β)·1/N
• The Google Matrix A:
  A = β·M + (1-β)·[1/N]N×N
• A is stochastic, aperiodic, and irreducible, so power iteration r(t+1) = A·r(t) converges
• What is β?
  • In practice β = 0.85 (make ~5 steps on average, then jump)
Random Teleports (β = 0.8)

  M =  [ 1/2  1/2   0  ]        [1/N]N×N =  [ 1/3  1/3  1/3 ]
       [ 1/2   0    0  ]                    [ 1/3  1/3  1/3 ]
       [  0   1/2   1  ]                    [ 1/3  1/3  1/3 ]

  A = 0.8·M + 0.2·[1/N]N×N =  [ 7/15  7/15   1/15 ]
                              [ 7/15  1/15   1/15 ]
                              [ 1/15  7/15  13/15 ]

  Power iteration on A:
  ry   1/3   0.33   0.28   0.26   …   7/33
  ra   1/3   0.20   0.20   0.18   …   5/33
  rm   1/3   0.46   0.52   0.56   …  21/33
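The β = 0.8 example can be checked in a few lines. A minimal sketch, assuming NumPy:

```python
import numpy as np

beta = 0.8
# Column-stochastic link matrix for y, a, m (m is a spider trap)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]
A = beta * M + (1 - beta) / N * np.ones((N, N))   # Google matrix

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)   # -> approximately [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]
```

With teleports, the spider trap m still gets the most rank (21/33) but no longer absorbs everything, and y and a keep positive scores.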
Computing PageRank
• A = β·M + (1-β)·[1/N]N×N
• With β = 0.8 on the y, a, m example:

  A = 0.8· [ 1/2 1/2 0 ]   + 0.2· [ 1/3 1/3 1/3 ]   =  [ 7/15  7/15   1/15 ]
           [ 1/2  0  0 ]          [ 1/3 1/3 1/3 ]      [ 7/15  1/15   1/15 ]
           [  0  1/2 1 ]          [ 1/3 1/3 1/3 ]      [ 1/15  7/15  13/15 ]

• Key step is matrix-vector multiplication: rnew = A·rold
• Easy if we have enough main memory to hold A, rold, rnew
• Say N = 1 billion pages, with 4 bytes per entry
  • 2 billion entries for the two vectors, approx 8GB
  • But matrix A has N2 = 10^18 entries: far too large!
Matrix Formulation
• Suppose there are N pages
• Consider a page j with set of out-links dj
  • Mij = 1/|dj| when j→i, and Mij = 0 otherwise
• The random teleport is equivalent to:
  • Adding a teleport link from j to every other page with probability (1-β)/N
  • Reducing the probability of following each out-link from 1/|dj| to β/|dj|
  • Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly
Rearranging the Equation
  r = A·r, where A = β·M + (1-β)·[1/N]N×N
  r = β·M·r + (1-β)·[1/N]N×N·r
    = β·M·r + [(1-β)/N]N, since Σi ri = 1
So we get:
  r = β·M·r + [(1-β)/N]N
([x]N … a vector of length N with all entries equal to x)
Sparse Matrix Formulation
• We just rearranged the PageRank equation:
  r = β·M·r + [(1-β)/N]N
  where [(1-β)/N]N is a vector with all N entries equal to (1-β)/N
• M is a sparse matrix! (with no dead ends)
  • 10 links per node on average, approx 10N entries
• So in each iteration, we need to:
  • Compute rnew = β·M·rold
  • Add a constant value (1-β)/N to each entry in rnew
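The two-step iteration above never materializes M as a dense matrix. A sketch using a plain adjacency-list representation (assuming NumPy; the dict layout is mine), on the same 3-node teleport example:

```python
import numpy as np

beta = 0.8
N = 3
# Sparse representation of M: out_links[j] = destinations of node j,
# so M[i][j] = 1/len(out_links[j]) for each destination i.
# Graph: y -> {y, a},  a -> {y, m},  m -> {m}  (nodes 0, 1, 2)
out_links = {0: [0, 1], 1: [0, 2], 2: [2]}

r_old = np.full(N, 1.0 / N)
for _ in range(100):
    r_new = np.full(N, (1 - beta) / N)     # constant teleport term first
    for j, dests in out_links.items():     # then r_new += beta * M @ r_old
        for i in dests:
            r_new[i] += beta * r_old[j] / len(dests)
    r_old = r_new
print(r_old)   # -> approximately [7/33, 5/33, 21/33]
```

Each iteration touches only the ~10N nonzero entries instead of N², which is what makes web-scale PageRank feasible.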
Sparse Matrix Encoding
• Encode the sparse matrix using only its nonzero entries: for each source node, store its degree and its destination nodes
• Space proportional roughly to the number of links
  • Say 10N entries, or 4·10·1 billion = 40GB
  • Still won't fit in memory, but will fit on disk
Basic Algorithm: Update Step
• Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
• Then 1 step of power iteration is:
  Initialize all entries of rnew to (1-β)/N
  For each page p (of out-degree n):
    Read into memory: p, n, dest1, …, destn, rold(p)
    for j = 1…n:
      rnew(destj) += β·rold(p)/n
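The update step can be sketched directly from the (source, degree, destinations) record format. The toy 6-node graph below is hypothetical, chosen only to mimic that on-disk layout; with NumPy unnecessary here, plain lists suffice:

```python
beta = 0.85
N = 6
# One record per source page: (src, out-degree, destination list).
# Hypothetical toy graph; on disk these would be streamed, not listed.
records = [
    (0, 3, [1, 2, 3]),
    (1, 2, [0, 4]),
    (2, 1, [5]),
    (3, 2, [4, 5]),
    (4, 1, [0]),
    (5, 1, [0]),
]

r_old = [1.0 / N] * N
for _ in range(50):
    r_new = [(1 - beta) / N] * N           # initialize to (1-beta)/N
    for src, deg, dests in records:        # stream M and r_old "from disk"
        for d in dests:
            r_new[d] += beta * r_old[src] / deg
    r_old = r_new
print([round(x, 3) for x in r_old])
```

Since this graph has no dead ends, the ranks stay a proper probability distribution: they remain positive and sum to 1 every iteration.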
Analysis
• Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
• In each iteration, we have to:
  • Read rold and M
  • Write rnew back to disk
  • I/O cost = 2|r| + |M|
• Question: what if we could not even fit rnew in memory?
Block-based Update Algorithm
(Figure: rnew is broken into blocks that fit in memory; M and rold are scanned once per block, updating only the destinations in the current block)
Analysis of Block Update
• Similar to a nested-loop join in databases
  • Break rnew into k blocks that fit in memory
  • Scan M and rold once for each block
• Total cost:
  • k scans of M and rold
  • k(|M| + |r|) + |r| = k|M| + (k+1)|r|
• Can we do better?
  • Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
Block-Stripe Update Algorithm
(Figure: M is broken into vertical stripes; each stripe contains only the edges whose destinations fall in the corresponding block of rnew, so each stripe is read once per iteration)
Block-Stripe Analysis
• Break M into stripes
  • Each stripe contains only the destination nodes in the corresponding block of rnew
• Some additional overhead per stripe
  • But it is usually worth it
• Cost per iteration: |M|(1+ε) + (k+1)|r|
Some Problems with PageRank
• Measures generic popularity of a page
  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank (next)
• Uses a single measure of importance
  • Other models are possible, e.g., hubs-and-authorities
  • Solution: Hubs-and-Authorities (HITS) (next)
• Susceptible to link spam
  • Artificial link topologies created in order to boost page rank
  • Solution: TrustRank (next)