Towards Scaling Fully Personalized PageRank
Dániel Fogaras, Balázs Rácz
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Budapest University of Technology and Economics
Problem formulation
• PageRank (Brin, Page, ’98)
  • PV: PageRank vector; r: uniform distribution vector
  • Overall quality measure of Web pages
  • Pre-computation: evaluate PV by power iteration
  • Query: order results by PV
• Personalized PageRank (Brin, Page, ’98)
  • r: preference vector of a user, query dependent
  • PPV(r) := PV, a personalized quality measure of Web pages
  • Pre-computation: r is not known. What to compute?
  • Query: power iteration, about 5 hours per query!
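The power-iteration step mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the usual recurrence PV = (1-c)∙PV∙M + c∙r with c as the restart probability, that every page has at least one out-link, and a hypothetical three-page graph. The function name `personalized_pagerank` is made up for this example.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=100):
    """Power iteration for PV = (1 - c) * PV * M + c * r.

    out_links: dict page -> list of out-neighbours (assumed non-empty).
    r: dict page -> preference probability (values sum to 1).
    """
    pv = dict(r)
    for _ in range(iterations):
        # Restart mass c is distributed according to the preference vector r.
        nxt = {v: c * r.get(v, 0.0) for v in out_links}
        # The remaining mass follows out-links uniformly.
        for u, targets in out_links.items():
            share = (1.0 - c) * pv.get(u, 0.0) / len(targets)
            for v in targets:
                nxt[v] += share
        pv = nxt
    return pv


# Hypothetical toy graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
uniform = {v: 1.0 / len(graph) for v in graph}

pagerank = personalized_pagerank(graph, uniform)   # ordinary PageRank
ppv_a = personalized_pagerank(graph, {"a": 1.0})   # personalized to page "a"
```

With r uniform the result is ordinary PageRank; with r concentrated on one page it is the PPV of that page, which is why a query-time run over the whole graph is so expensive.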
Preliminaries
• Linearity: PPV(α1∙r1 + α2∙r2) = α1∙PPV(r1) + α2∙PPV(r2) for preference vectors r1, r2 and weights α1, α2 ≥ 0 with α1 + α2 = 1
• Full personalization
  • Pre-compute PPV(ri) for all pages
  • V^2 disk, V∙(V+E) time, where V ≈ 10^9, E ≈ 10^10 ???
• Topic-Sensitive PageRank (Haveliwala ’01)
  • Linearity
  • Pre-compute PPV(ri) for a topical basis r1,…,rk, k ≈ 20
  • Query: user submits a topic by
  • Query engine combines PPV(ri) vectors
• Scaling Personalized Web Search (Jeh, Widom, ’03)
  • Decomposition, linearity
  • Pre-compute PPV(ri) for unit vectors r1,…,rk, corresponding to k ≈ 10,000 pages
  • Query: personalization over the 10,000 pages
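The linearity property is what lets a query engine combine pre-computed basis vectors instead of running power iteration per query. The sketch below checks this numerically on a hypothetical four-page graph. The helper names `ppv` and `combine`, the graph, and the weights 0.7 / 0.3 are assumptions made for illustration, and the recurrence PV = (1-c)∙PV∙M + c∙r is assumed as the definition.

```python
def ppv(out_links, r, c=0.15, iterations=200):
    """Power iteration for PV = (1 - c) * PV * M + c * r (assumed definition)."""
    pv = dict(r)
    for _ in range(iterations):
        nxt = {v: c * r.get(v, 0.0) for v in out_links}
        for u, targets in out_links.items():
            for v in targets:
                nxt[v] += (1.0 - c) * pv.get(u, 0.0) / len(targets)
        pv = nxt
    return pv


def combine(basis_ppvs, weights):
    """Query-time combination: sum_i weights[i] * PPV(r_i)."""
    vertices = basis_ppvs[0].keys()
    return {v: sum(w * p[v] for w, p in zip(weights, basis_ppvs)) for v in vertices}


# Hypothetical toy graph; every page has at least one out-link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a", "b"]}

# Pre-computation: PPV for two basis (unit) preference vectors.
basis = [ppv(graph, {"a": 1.0}), ppv(graph, {"d": 1.0})]

# Query: the user's preference is 70% page "a", 30% page "d".
combined = combine(basis, [0.7, 0.3])
direct = ppv(graph, {"a": 0.7, "d": 0.3})

# Largest coordinate-wise difference between the two ways of computing it.
max_gap = max(abs(combined[v] - direct[v]) for v in graph)
```

The combination costs one pass over k stored vectors, which is why the size of the basis (k ≈ 20 topics, or k ≈ 10,000 pages) bounds how much personalization these earlier methods can offer.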
Towards full personalization
• Our algorithm
  • Monte Carlo simulation, not power iteration
  • Pre-compute approximate PPV(ri) for all unit vectors r1,…,rk, k = number of pages
  • Scalability: quasi-linear pre-computation and sub-linear query
• Main points of this presentation
  • Outline of the algorithm
  • Pre-computation: external-memory, distributed
  • Query: used to increase precision
  • Error of approximation tends to zero exponentially
  • Exact vs. approximated PPV: space lower bounds
Outline of the Algorithm
• Theorem (Jeh, Widom ’03, F ’03)
  • Random walk starts from page u
  • Uniform step with probability 1-c, stops with probability c
  • PPV(u,v) = Pr{ the walk stops at page v }
• Monte Carlo algorithm
  • Pre-computation
    • From u simulate N independent random walks
    • Database of fingerprints: ending vertices of the walks from all vertices
  • Query
    • PPV(u,v) := #( walks u→v ) / N
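The Monte Carlo outline above translates almost directly into code. The following is a small in-memory sketch, assuming every page has at least one out-link; the function names, the toy graph, and the choice N = 20000 are hypothetical and only serve to illustrate the idea.

```python
import random
from collections import Counter


def simulate_walk(out_links, start, c, rng):
    """One walk from `start`: stop with probability c, else take a uniform step."""
    current = start
    while rng.random() >= c:
        current = rng.choice(out_links[current])
    return current


def build_fingerprints(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: store the ending vertices of n_walks walks per vertex."""
    rng = random.Random(seed)
    return {
        u: [simulate_walk(out_links, u, c, rng) for _ in range(n_walks)]
        for u in out_links
    }


def approx_ppv(fingerprints, u):
    """Query: empirical distribution of the stored walk endings for u."""
    endings = fingerprints[u]
    return {v: k / len(endings) for v, k in Counter(endings).items()}


# Hypothetical toy graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fingerprints = build_fingerprints(graph, n_walks=20000)
estimate = approx_ppv(fingerprints, "a")
```

The database holds only N ending vertices per page, so the query touches a single short record instead of the whole graph.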
External memory pre-computation
• Goal: N independent random walks from each vertex
• Input: webgraph, V ≈ 10^9, E ≈ 10^10
  • V+E > memory
• Accessing the edges
  • Edge scan: stream access
  • Edges sorted by source vertices
External memory pre-computation (2)
• Goal: N independent random walks from each vertex
• Simulate all walks together
• Iteration: 1 blink = 1 edge scan
  • Sort path ends
  • Merge with the sorted graph
• Each walk stops with prob. c
  • E( #walks ) = (1-c)^k∙N∙V after k iterations
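The sort-and-merge iteration can be illustrated with an in-memory stand-in for the external-memory version. This sketch keeps everything in Python lists, so it only mirrors the access pattern (sort the path ends, then one sequential pass over the edges sorted by source). It assumes every vertex has at least one out-edge, and the names `simulate_all_walks` and `alive_counts` are invented for the example.

```python
import random
from itertools import groupby


def simulate_all_walks(sorted_edges, vertices, n_walks, c=0.15, seed=0):
    """Advance all N*V walks together, one edge scan per iteration.

    sorted_edges: list of (source, target) pairs sorted by source.
    Returns (fingerprints, alive_counts), where fingerprints is a list of
    (start, ending vertex) pairs and alive_counts[k-1] is the number of
    walks still running after k stop decisions.
    """
    rng = random.Random(seed)
    alive = [(u, u) for u in vertices for _ in range(n_walks)]  # (path end, start)
    fingerprints, alive_counts = [], []
    while alive:
        # Each walk stops with probability c and emits its fingerprint.
        survivors = []
        for end, start in alive:
            if rng.random() < c:
                fingerprints.append((start, end))
            else:
                survivors.append((end, start))
        alive_counts.append(len(survivors))
        # Sort path ends so they line up with the edge stream.
        survivors.sort()
        # Merge with the sorted graph: a single sequential edge scan.
        extended, i = [], 0
        for source, group in groupby(sorted_edges, key=lambda e: e[0]):
            targets = [t for _, t in group]
            while i < len(survivors) and survivors[i][0] == source:
                extended.append((rng.choice(targets), survivors[i][1]))
                i += 1
        alive = extended
    return fingerprints, alive_counts


# Hypothetical toy graph, for illustration only.
edges = sorted([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
fingerprints, alive_counts = simulate_all_walks(edges, ["a", "b", "c"], n_walks=2000)
```

Because the number of running walks shrinks by a factor of about (1-c) per iteration, later edge scans have progressively less work to merge.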
Distributed indexing
• M machines with fast local network connections
• memory < V+E ≤ M∙(memory)
  • Parallelize for N∙V walks
  • Parts of the graph in RAM
  • Remote transfers batched
  • [Figure: example with M = 3 machines]
• Heuristic partition: one site to one machine
  • Machine 1: www.cnn.com/*, Machine 2: www.yahoo.com/*
  • Uniform load balance ← ordinary PR distributed equally
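A rough sketch of the site-based partition and the batching of remote transfers follows. The slides do not say how sites are mapped to machines, so hashing the host name with CRC32 is purely an assumption here, as are the function names and the sample URLs. It does not attempt the PageRank-based load balancing mentioned above.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Host part of a URL; accepts URLs with or without a scheme."""
    parts = urlsplit(url if "//" in url else "//" + url)
    return parts.netloc


def machine_for(url, n_machines):
    """Assumed rule: all pages of one site land on the same machine."""
    return zlib.crc32(site_of(url).encode("utf-8")) % n_machines


def batch_transfers(walk_ends, n_machines):
    """Group walks by the machine that stores their current end page,
    so each destination receives one batch instead of many small messages."""
    batches = defaultdict(list)
    for walk_id, url in walk_ends:
        batches[machine_for(url, n_machines)].append((walk_id, url))
    return dict(batches)


# Hypothetical walk ends, for illustration only.
walk_ends = [
    (0, "www.cnn.com/world"),
    (1, "www.yahoo.com/news"),
    (2, "www.cnn.com/sport/index.html"),
]
batches = batch_transfers(walk_ends, n_machines=3)
```

Since most links stay within a site, a site-level partition keeps most walk steps local to one machine.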
Query, increasing precision
• Database of N∙V fingerprints (path endings)
• Query: PPV(u) := empirical distribution from N samples
• Theorem (Jeh, Widom, ’03)
  • O(u) denotes out-neighbors of u
• Query: PPV(u) := empirical distribution from N∙|O(u)| samples
• Number of fingerprints for a query
  • F = N∙(db accesses/query)
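The formula of the theorem did not survive in this transcript. The sketch below assumes it is the standard decomposition from Jeh and Widom, PPV(u) = c∙1_u + (1-c)∙(1/|O(u)|)∙Σ_{v∈O(u)} PPV(v), and uses it to answer a query from the fingerprints of the out-neighbors. The function name and the hand-written fingerprint lists are hypothetical.

```python
from collections import Counter


def query_with_neighbours(fingerprints, out_links, u, c=0.15):
    """Estimate PPV(u) from the N * |O(u)| fingerprints of u's out-neighbours,
    using PPV(u) = c * 1_u + (1 - c) * average of PPV(v) over v in O(u)."""
    neighbours = out_links[u]
    estimate = Counter({u: c})  # the walk stops at u itself with probability c
    for v in neighbours:
        endings = fingerprints[v]
        weight = (1.0 - c) / (len(neighbours) * len(endings))
        for w in endings:
            estimate[w] += weight
    return dict(estimate)


# Hypothetical toy data: N = 4 stored walk endings per out-neighbour of "a".
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fingerprints = {
    "b": ["b", "c", "c", "a"],
    "c": ["c", "a", "a", "a"],
}
estimate = query_with_neighbours(fingerprints, out_links, "a")
```

Here the query costs |O(u)| database accesses rather than one, and in return it averages over N∙|O(u)| samples.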
Error of approximation
• Exact: PPV(u,v)
• Approximate by F fingerprints: P̂PV(u,v)
• Theorem
  • If PPV(u,v) > PPV(u,w) + δ holds, then
  • Pr{ P̂PV(u,v) < P̂PV(u,w) } < exp( -0.3∙N∙δ^2 )
• Idea of the proof
  • N∙( P̂PV(u,v) - P̂PV(u,w) ) = #(u→v) - #(u→w) = sum of F i.i.d. random variables with values {-1,0,1}
  • Bernstein’s inequality
• Error of approximation → 0 exponentially with F = (db size/vertex)∙(db accesses/query) → ∞
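To get a feel for the bound exp(-0.3∙N∙δ^2), one can solve it for the number of fingerprints needed at a target error probability. This is plain arithmetic on the stated bound; the function names and the sample values δ = 0.1 and error 0.01 are chosen only for illustration.

```python
import math


def swap_probability_bound(n, delta):
    """Upper bound exp(-0.3 * n * delta^2) on ranking two pages in the wrong order."""
    return math.exp(-0.3 * n * delta ** 2)


def fingerprints_needed(delta, epsilon):
    """Smallest n for which the bound drops to epsilon or below."""
    return math.ceil(math.log(1.0 / epsilon) / (0.3 * delta ** 2))


# Pages whose PPV values differ by more than 0.1, target error probability 1%.
n = fingerprints_needed(delta=0.1, epsilon=0.01)
bound_at_n = swap_probability_bound(n, 0.1)
```

Halving δ quadruples the required number of fingerprints, while tightening the error probability costs only a logarithmic factor.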
Exact versus approximate
• Model of computation
  • Input: graph G with V vertices
  • Pre-compute a database of size D
  • Query: respond by accessing only the db
• Exact
  • Query: u, v, w
  • Decide if PPV(u,v) > PPV(u,w) holds
• Approximate for fixed ε and δ
  • Query: u, v, w
  • Decide if PPV(u,v) > PPV(u,w) holds, with error probability ε, when | PPV(u,v) - PPV(u,w) | > δ
Lower bounds for the db size
• For the webgraph, V ≈ 10^9
• Theorem 1
  • For the Exact problem, a db of size D = Ω(V^2) is required in the worst case
• Theorem 2
  • For the Approximate problem, D = Ω(V)
• Is it possible to improve the 2nd lower bound?
  • Our algorithm uses a db of size D = O(V log V)
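A back-of-the-envelope comparison shows the scale of these bounds at V ≈ 10^9. The numbers below ignore the constants hidden in the Ω and O notation and take the logarithm base 2, both of which are assumptions, so they indicate orders of magnitude only.

```python
import math

V = 10 ** 9  # number of pages, as on the slide

exact_lower = V ** 2              # Omega(V^2): order of the exact database size
approx_lower = V                  # Omega(V): lower bound for the approximate problem
monte_carlo = V * math.log2(V)    # O(V log V): order of the fingerprint database

# How many times larger the exact database is than the fingerprint database.
ratio = exact_lower / monte_carlo
```

The gap between V and V log V is only a factor of about 30 at this scale, while the exact problem is larger by a factor on the order of tens of millions.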
Idea of the lower bound proofs
• One-way communication complexity
• Bit-vector probing (BVP)
  • Alice has a bit vector; input: x = (x1, x2, …, xm)
  • Bob has a number; input: 1 ≤ k ≤ m
  • Communication: Alice sends B bits, then Bob must answer xk = ?
  • Theorem: B ≥ m for any protocol
• Reduction from Exact-PPV to BVP
  • Alice has x = (x1, x2, …, xm): a graph G with V vertices, where V^2 = m; she pre-computes an Exact PPV database of size D
  • Communication: the Exact PPV db, D bits
  • Bob has 1 ≤ k ≤ m: vertices u, v, w; the comparison PPV(u,v) ? PPV(u,w) answers xk = ?
  • Thus D = B ≥ m = V^2
Summary
• Fully personalized PR
  • Monte Carlo method, not power iteration
• Pre-computation
  • External-memory, distributed
• Query
  • Increase precision by (db accesses/query)
• Error of approximation
  • Tends to zero exponentially
• Space lower bounds
  • Quadratic for Exact PPR
  • Linear for Approximate PPR
Thank you!
Misc
• N∙P̂PV(u,v) = #(u→v) ~ Binom(N, PPV(u,v))
• Claim (by Chernoff’s bound):
  • Pr{ P̂PV(u,v) > (1+δ)∙PPV(u,v) } < exp( -N∙PPV(u,v)∙δ^2/4 )
• If for a protocol Pr{right answer} ≥ (1+γ)/2, then B ≥ γ∙m
• PV PageRank vector, c constant, M normalized adjacency matrix,