Towards Scaling Fully Personalized PageRank

Dániel Fogaras, Balázs Rácz
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Budapest University of Technology and Economics

Problem formulation
• PageRank (Brin, Page, ’98)
  • PV: PageRank vector; r: uniform distribution vector
  • Overall quality measure of Web pages
  • Pre-computation: evaluate PV by power iteration
  • Query: order results by PV
• Personalized PageRank (Brin, Page, ’98)
  • r: preference vector of a user, query dependent
  • PPV(r) := PV, a personalized quality measure of Web pages
  • Pre-computation: r is not known. What to compute?
  • Query: power iteration. 5 hours/query!
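
The power iteration mentioned above can be sketched as follows. The transcript does not show the defining equation, so this sketch assumes the common form PV = (1-c)∙PV∙M + c∙r, with M the row-normalized adjacency matrix and c a constant; the function name, the dictionary-based graph format, and the default c = 0.15 are illustrative assumptions, not taken from the slides.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=100):
    """Power iteration for the assumed recurrence PV = (1 - c) * PV * M + c * r.

    out_links: dict mapping each vertex to a non-empty list of out-neighbors
               (every neighbor must itself be a key of out_links).
    r:         dict mapping vertices to preference probabilities (sums to 1).
    """
    pv = dict(r)
    for _ in range(iterations):
        # Start each round from the "teleport" part c * r.
        nxt = {v: c * r.get(v, 0.0) for v in out_links}
        # Spread the remaining (1 - c) mass uniformly over out-links.
        for u, targets in out_links.items():
            share = (1.0 - c) * pv.get(u, 0.0) / len(targets)
            for v in targets:
                nxt[v] += share
        pv = nxt
    return pv


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
global_pr = personalized_pagerank(graph, {v: 1.0 / 3 for v in graph})
personal_a = personalized_pagerank(graph, {"a": 1.0})
```

With a uniform r the result is ordinary PageRank; with r concentrated on one page it is that page's personalized vector. Running this loop at query time on a web-scale graph is what makes the per-query cost prohibitive.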

Preliminaries
• Linearity: PPV(r) is a linear function of the preference vector r
• Full personalization
  • Pre-compute PPV(ri) for all pages
  • V^2 disk, V∙(V+E) time, where V ≈ 10^9, E ≈ 10^10 ???
• Topic-Sensitive PageRank (Haveliwala ’01)
  • Linearity
  • Pre-compute PPV(ri) for a topical basis r1,…,rk, k ≈ 20
  • Query: user submits a topic by
  • Query engine combines PPV(ri) vectors
• Scaling Personalized Web Search (Jeh, Widom, ’03)
  • Decomposition, linearity
  • Pre-compute PPV(ri) for unit vectors r1,…,rk, corresponding to k ≈ 10,000 pages
  • Query: personalization over the 10,000 pages
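
Linearity is what lets a query engine combine pre-computed basis vectors instead of running a new power iteration. A minimal sketch, assuming the basis vectors PPV(ri) are stored as dictionaries and the user's preference is given as weights that sum to 1 (the function name and data layout are hypothetical):

```python
def combine_ppv(basis_ppvs, weights):
    """Return sum_i weights[i] * basis_ppvs[i].

    By linearity this equals PPV(sum_i weights[i] * r_i) when
    basis_ppvs[i] is PPV(r_i).
    """
    combined = {}
    for ppv, w in zip(basis_ppvs, weights):
        for page, score in ppv.items():
            combined[page] = combined.get(page, 0.0) + w * score
    return combined


# Two toy basis vectors mixed with equal weight.
mixed = combine_ppv([{"a": 0.5, "b": 0.5}, {"b": 1.0}], [0.5, 0.5])
```

The cost of a query is then proportional to the number of basis vectors touched, which is why the size of the basis (20 topics, 10,000 pages, or all pages) determines how personal the result can be.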

Towards full personalization
• Our algorithm
  • Monte Carlo simulation, not power iteration
  • Pre-compute approximate PPV(ri) for all unit vectors r1,…,rk, k = number of pages
  • Scalability: quasi-linear pre-computation & sub-linear query
• Main points of this presentation
  • Outline of the algorithm
  • Pre-computation: external-memory, distributed
  • Query: used to increase precision
  • Error of approximation tends to zero exponentially
  • Exact vs. approximated PPV: space lower bounds

Outline of the Algorithm
• Theorem (Jeh, Widom ’03, F ’03)
  • Random walk starts from page u
  • Uniform step with probability 1-c, stops with probability c
  • PPV(u,v) = Pr{ the walk stops at page v }
• Monte Carlo algorithm
  • Pre-computation
    • From u simulate N independent random walks
    • Database of fingerprints: ending vertices of the walks from all vertices
  • Query
    • PPV(u,v) := #( walks u→v ) / N
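
The outline above can be sketched in memory as follows. This is only an illustration of the random-walk definition and the fingerprint database; the function names, the dictionary graph format, the seed handling, and c = 0.15 are assumptions, and every vertex is assumed to have at least one out-link.

```python
import random
from collections import Counter


def simulate_walk(out_links, start, c, rng):
    """Walk from `start`; before each step, stop with probability c."""
    page = start
    while rng.random() >= c:          # continue with probability 1 - c
        page = rng.choice(out_links[page])
    return page                        # the fingerprint: where the walk stopped


def build_fingerprints(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: N independent walk endings for every vertex."""
    rng = random.Random(seed)
    return {u: [simulate_walk(out_links, u, c, rng) for _ in range(n_walks)]
            for u in out_links}


def approx_ppv(fingerprints, u):
    """Query: PPV(u, v) is estimated as #(walks u -> v) / N."""
    ends = fingerprints[u]
    return {v: k / len(ends) for v, k in Counter(ends).items()}


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
db = build_fingerprints(graph, n_walks=1000)
estimate = approx_ppv(db, "a")
```

A query only reads the N stored endings for u, so its cost does not depend on the size of the graph.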

External memory pre-computation
• Goal: N independent random walks from each vertex
• Input: webgraph V ≈ 10^9, E ≈ 10^10
  • V+E > memory
• Accessing the edges
  • Edge scan: stream access
  • Edges sorted by source vertices

External memory pre-computation (2)
• Goal: N independent random walks from each vertex
• Simulate all walks together
• Iteration: 1 blink = 1 edge scan
  • Sort path ends
  • Merge with the sorted graph
• Each walk stops with prob. c
  • E( #walks ) = (1-c)^k∙N∙V after k iterations
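
One iteration of the sort-and-merge scheme can be sketched like this. The sketch keeps everything in Python lists, whereas the slides describe an external-memory setting where the sorted walk ends and the edge list would be streamed from disk; the names, the tuple layout, and the handling of vertices without out-edges are assumptions made for illustration.

```python
import random
from itertools import groupby
from operator import itemgetter


def one_edge_scan(active, sorted_edges, c, rng):
    """Advance all active walks by one step using a single pass over the edges.

    active:       list of (current_end, start) pairs, one per unfinished walk.
    sorted_edges: list of (source, target) pairs sorted by source.
    Returns (finished, still_active); finished holds (start, end) fingerprints.
    """
    finished, survivors = [], []
    for end, start in active:
        if rng.random() < c:                 # each walk stops with prob. c
            finished.append((start, end))
        else:
            survivors.append((end, start))
    survivors.sort()                          # sort path ends
    moved, i = [], 0
    # Merge the sorted path ends with the edge list sorted by source.
    for source, group in groupby(sorted_edges, key=itemgetter(0)):
        targets = [t for _, t in group]
        while i < len(survivors) and survivors[i][0] < source:
            i += 1                            # assumption: no such vertices; such walks are dropped
        while i < len(survivors) and survivors[i][0] == source:
            moved.append((rng.choice(targets), survivors[i][1]))
            i += 1
    return finished, moved


def precompute(sorted_edges, vertices, n_walks, c=0.15, seed=0):
    """Run edge scans until every walk has stopped."""
    rng = random.Random(seed)
    active = [(v, v) for v in vertices for _ in range(n_walks)]
    fingerprints = []
    while active:
        done, active = one_edge_scan(active, sorted_edges, c, rng)
        fingerprints.extend(done)
    return fingerprints


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
fingerprints = precompute(edges, ["a", "b", "c"], n_walks=50)
```

Because each walk survives an iteration with probability 1-c, the list of active walks shrinks geometrically, which matches the expected count (1-c)^k∙N∙V after k iterations.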

Distributed indexing
• M machines with fast local network connections
• memory < V+E ≤ M∙(memory)
• Figure (M=3): parallelize for N∙V walks; parts of the graph in RAM; remote transfers batched
• Heuristic partition: one site to one machine
  • Machine 1: www.cnn.com/*, Machine 2: www.yahoo.com/*
• Uniform load balance ← ordinary PR distributed equally
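
The "one site to one machine" heuristic can be illustrated with a small routing function. The slides do not say how sites are assigned to machines; hashing the host name is an assumption made here, and `machine_for` is a hypothetical helper.

```python
import zlib
from urllib.parse import urlsplit


def machine_for(url, n_machines):
    """Map a page to a machine so that all pages of one site share a machine.

    Assumption: sites are assigned by a hash of the host name. CRC32 is used
    instead of Python's built-in hash(), which is salted per process.
    """
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % n_machines


# Pages of the same site land on the same machine (here M = 3).
same_site = machine_for("http://www.cnn.com/world", 3) == machine_for("http://www.cnn.com/sport/a", 3)
```

Since most links stay within a site, a walk usually remains on the machine that holds its current page, and only walks that cross sites need to be sent over the network in batches.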

Query, increasing precision
• Database of N∙V fingerprints (path endings)
• Query: PPV(u) := empirical distribution from N samples
• Theorem (Jeh, Widom, ’03)
  • O(u) denotes out-neighbors of u
• Query: PPV(u) := empirical distribution from N∙|O(u)| samples
• Number of fingerprints for a query
  • F = N∙(db accesses/query)
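
The transcript omits the formula of the theorem. In its usual statement, the decomposition theorem of Jeh and Widom reads PPV(u) = c∙1_u + (1-c)∙(1/|O(u)|)∙Σ_{v∈O(u)} PPV(v), where 1_u puts all mass on u. Assuming that is the form meant here, a query can average the fingerprints of the out-neighbors of u, as in this sketch (function names and data layout are hypothetical):

```python
from collections import Counter


def empirical(ends):
    """Empirical distribution of a list of walk endings."""
    return {v: k / len(ends) for v, k in Counter(ends).items()}


def refined_ppv(fingerprints, out_links, u, c=0.15):
    """Estimate PPV(u) from the N * |O(u)| fingerprints of u's out-neighbors.

    Uses the assumed decomposition
    PPV(u) = c * 1_u + (1 - c) / |O(u)| * sum over v in O(u) of PPV(v).
    """
    neighbors = out_links[u]
    result = {u: c}
    for v in neighbors:
        for page, p in empirical(fingerprints[v]).items():
            result[page] = result.get(page, 0.0) + (1.0 - c) * p / len(neighbors)
    return result


toy_db = {"a": ["a", "b"], "b": ["b", "b"], "c": ["a", "c"]}
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
refined = refined_ppv(toy_db, toy_graph, "a", c=0.2)
```

Each extra database access contributes another N samples, so the number of fingerprints behind one answer grows as F = N∙(db accesses/query) without enlarging the database.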

Error of approximation
• Exact: PPV(u,v)
• Approximate by F fingerprints: $\widehat{\mathrm{PPV}}(u,v)$
• Theorem
  • If $\mathrm{PPV}(u,v) - \mathrm{PPV}(u,w) > \delta$ holds, then
  • $\Pr\{\widehat{\mathrm{PPV}}(u,v) < \widehat{\mathrm{PPV}}(u,w)\} < \exp(-0.3 \cdot N \cdot \delta^2)$
• Idea of the proof
  • $N \cdot (\widehat{\mathrm{PPV}}(u,v) - \widehat{\mathrm{PPV}}(u,w)) = \#(u \to v) - \#(u \to w)$ = sum of F i.i.d. random variables with values in {-1, 0, 1}
  • Bernstein’s inequality
• Error of approximation → 0 exponentially with F = (db size/vertex)∙(db accesses/query) → ∞
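
To get a feel for the bound, the following sketch evaluates exp(-0.3∙N∙δ^2) and inverts it to find how many fingerprints keep the misordering probability below a target ε. The constant 0.3 is taken from the slide; the function names and the sample values of ε and δ are illustrative assumptions.

```python
import math


def misorder_bound(n, delta):
    """Upper bound exp(-0.3 * N * delta^2) on the probability of ranking
    v below w when PPV(u,v) - PPV(u,w) > delta."""
    return math.exp(-0.3 * n * delta ** 2)


def fingerprints_needed(epsilon, delta):
    """Smallest N for which the bound above is at most epsilon."""
    return math.ceil(math.log(1.0 / epsilon) / (0.3 * delta ** 2))


bound_at_1000 = misorder_bound(1000, 0.1)     # equals exp(-3)
n_for_5_percent = fingerprints_needed(0.05, 0.1)
```

The required N depends only on ε and δ, not on the size of the graph, which is what makes a fixed number of fingerprints per vertex sufficient.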

Exact versus approximate
• Model of computation
  • Input: G graph with V vertices
  • Pre-compute a database of size D
  • Query: respond by accessing only the db
• Exact
  • Query: u, v, w
  • Decide if PPV(u,v) > PPV(u,w) holds
• Approximate for fixed ε and δ
  • Query: u, v, w
  • Decide if PPV(u,v) > PPV(u,w) holds with error probability ε when | PPV(u,v) - PPV(u,w) | > δ

Lower bounds for the db size
• For the webgraph V ≈ 10^9
• Theorem 1
  • For the Exact problem a D = Ω(V^2) sized db is required in the worst case
• Theorem 2
  • For the Approximate problem D = Ω(V)
• Is it possible to improve the 2nd lower bound?
  • Our algorithm uses a D = O(V log V) sized db
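
A rough order-of-magnitude comparison of these growth rates for V = 10^9 is sketched below. The constants hidden by Ω and O are ignored and the base of the logarithm is an arbitrary choice, so the numbers only show how far apart the bounds are; they are not actual database sizes.

```python
import math


def db_size_orders(v):
    """Growth rates from the slide with all constant factors set to 1."""
    return {
        "exact_lower_bound": v ** 2,           # Omega(V^2)
        "approx_lower_bound": v,               # Omega(V)
        "monte_carlo": v * math.log2(v),       # O(V log V), base 2 assumed
    }


orders = db_size_orders(10 ** 9)
gap = orders["exact_lower_bound"] / orders["monte_carlo"]
```

For V = 10^9 the quadratic bound is 10^18, while V∙log2(V) is about 3∙10^10, within a logarithmic factor of the linear lower bound.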

Idea of the lower bound proofs
• One-way communication complexity
• Bit-vector probing (BVP)
  • Alice has a bit vector. Input: x = (x1, x2, …, xm)
  • Bob has a number. Input: 1 ≤ k ≤ m
  • Communication: B bits; Bob has to answer xk = ?
  • Theorem: B ≥ m for any protocol
• Reduction from Exact-PPV to BVP
  • Alice has x = (x1, x2, …, xm): G graph with V vertices, where V^2 = m; pre-compute an Exact PPV database of size D
  • Bob has 1 ≤ k ≤ m: u, v, w vertices; PPV(u,v) ? PPV(u,w) gives xk = ?
  • Communication: Exact PPV db, D bits
  • Thus D = B ≥ m = V^2

Summary
• Fully personalized PR
  • Monte Carlo method, not power iteration
• Pre-computation
  • External-memory, distributed
• Query
  • Increase precision by (db accesses/query)
• Error of approximation
  • Tends to zero exponentially
• Space lower bounds
  • Quadratic for Exact PPR
  • Linear for Approximate PPR

Thank you!

Misc
• $N \cdot \widehat{\mathrm{PPV}}(u,v) = \#(u \to v) \sim \mathrm{Binom}(N, \mathrm{PPV}(u,v))$
• Claim (by Chernoff’s bound): $\Pr\{\widehat{\mathrm{PPV}}(u,v) > (1+\delta)\,\mathrm{PPV}(u,v)\} < \exp(-N \cdot \mathrm{PPV}(u,v) \cdot \delta^2 / 4)$
• If for a protocol Pr{right answer} ≥ (1+γ)/2 then B ≥ γ∙m
• PV: PageRank vector; c: constant; M: normalized adjacency matrix
