
Towards Scaling Fully Personalized PageRank

Presentation Transcript


  1. Towards Scaling Fully Personalized PageRank Dániel Fogaras, Balázs Rácz Computer and Automation Research Institute of the Hungarian Academy of Sciences Budapest University of Technology and Economics

2. Problem formulation
• PageRank (Brin, Page, ’98)
  • PV: PageRank vector; r: uniform distribution vector
  • Overall quality measure of Web pages
  • Pre-computation: evaluate PV by power iteration
  • Query: order results by PV
• Personalized PageRank (Brin, Page, ’98)
  • r: preference vector of a user, query dependent
  • PPV(r) := PV, a personalized quality measure of Web pages
  • Pre-computation: r is not known. What to compute?
  • Query: power iteration. 5 hours/query!!!
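The recursion behind the power-iteration step is not spelled out in the transcript (the formula image did not survive extraction). Below is a minimal sketch, assuming the standard recursion PV = (1 − c)·Mᵀ·PV + c·r with teleportation constant c and row-normalized adjacency matrix M; the toy graph, c, and the iteration count are illustrative choices, not values from the talk.

```python
import numpy as np

def personalized_pagerank(M: np.ndarray, r: np.ndarray, c: float = 0.15,
                          iterations: int = 50) -> np.ndarray:
    """Evaluate PPV(r) by power iteration; r is the (personalized) preference vector."""
    pv = r.copy()
    for _ in range(iterations):
        pv = (1.0 - c) * M.T @ pv + c * r   # spread mass along links, teleport back to r
    return pv

# Toy 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 (rows of M are normalized out-link distributions).
M = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
r = np.array([1.0, 0.0, 0.0])               # preference concentrated on page 0
print(personalized_pagerank(M, r))
```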

3. Preliminaries
• Linearity: PPV(Σᵢ αᵢ·rᵢ) = Σᵢ αᵢ·PPV(rᵢ)
• Full personalization
  • Pre-compute PPV(rᵢ) for all pages
  • V² disk, V·(V+E) time, where V ≈ 10⁹, E ≈ 10¹⁰: infeasible
• Topic-Sensitive PageRank (Haveliwala ’01)
  • Linearity
  • Pre-compute PPV(rᵢ) for a topical basis r1, …, rk, k ≈ 20
  • Query: user submits a topic as a weighted combination of the basis topics
  • Query engine combines the PPV(rᵢ) vectors
• Scaling Personalized Web Search (Jeh, Widom ’03)
  • Decomposition, linearity
  • Pre-compute PPV(rᵢ) for unit vectors r1, …, rk corresponding to k ≈ 10,000 pages
  • Query: personalization over the 10,000 pages
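A tiny illustration of how the linearity property is used at query time in Topic-Sensitive PageRank: the engine only combines pre-computed vectors, with no power iteration. The topic basis, the pre-computed vectors, and the weights below are made-up placeholders.

```python
import numpy as np

precomputed_ppv = {                      # PPV(r_i) for a topical basis r_1, ..., r_k
    "sports":  np.array([0.5, 0.3, 0.2]),
    "science": np.array([0.1, 0.6, 0.3]),
}
topic_weights = {"sports": 0.25, "science": 0.75}   # submitted with the query

# By linearity, PPV(0.25*r_sports + 0.75*r_science) is the same weighted combination.
ppv = sum(w * precomputed_ppv[t] for t, w in topic_weights.items())
print(ppv)
```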

4. Towards full personalization
• Our algorithm
  • Monte Carlo simulation, not power iteration
  • Pre-compute approximate PPV(rᵢ) for all unit vectors r1, …, rk, k = number of pages
  • Scalability: quasi-linear pre-computation & sub-linear query
• Main points of this presentation
  • Outline of the algorithm
  • Pre-computation: external-memory, distributed
  • Query: database accesses used to increase precision
  • Error of approximation tends to zero exponentially
  • Exact vs. approximate PPV: space lower bounds

5. Outline of the Algorithm
• Theorem (Jeh, Widom ’03; Fogaras ’03)
  • A random walk starts from page u
  • At each step it follows a uniformly chosen out-link with probability 1 − c and stops with probability c
  • PPV(u,v) = Pr{ the walk stops at page v }
• Monte Carlo algorithm
  • Pre-computation
    • From u, simulate N independent random walks
    • Database of fingerprints: the ending vertices of the walks from all vertices
  • Query
    • P̂PV(u,v) := #(walks u→v) / N
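A compact sketch of the fingerprint simulation and the query on a small in-memory graph. The real pre-computation streams the webgraph from disk, as the next slides describe; the graph, N, and c here are illustrative.

```python
import random
from collections import Counter

def simulate_fingerprints(out_links, u, N=1000, c=0.15):
    """Simulate N independent walks from u; each step follows a uniform out-link
    with probability 1 - c and stops with probability c. Returns the walk endings."""
    ends = []
    for _ in range(N):
        v = u
        while random.random() > c and out_links.get(v):
            v = random.choice(out_links[v])
        ends.append(v)
    return ends

def approximate_ppv(fingerprints):
    """Estimate PPV(u, v) as #(walks u -> v) / N."""
    counts = Counter(fingerprints)
    return {v: k / len(fingerprints) for v, k in counts.items()}

out_links = {0: [1, 2], 1: [2], 2: [0]}          # toy 3-page web
print(approximate_ppv(simulate_fingerprints(out_links, u=0)))
```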

6. External memory pre-computation
• Goal: N independent random walks from each vertex
• Input: webgraph, V ≈ 10⁹, E ≈ 10¹⁰
  • V + E > memory
• Accessing the edges
  • Edge scan: stream access
  • Edges sorted by source vertices

7. External memory pre-computation (2)
• Goal: N independent random walks from each vertex
• Simulate all walks together
• Iteration: advancing every walk by one step = 1 edge scan
  • Sort the path ends
  • Merge with the sorted graph
• Each walk stops with probability c
  • E(#walks) = (1 − c)^k · N · V after k iterations
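A minimal sketch of one such iteration, modelling the sorted walk file and the source-sorted edge file with in-memory Python structures; on disk these would be sorted runs merged sequentially. The names and the stopping constant are illustrative.

```python
import random

def one_iteration(walks, edges_by_source, c=0.15):
    """Advance every still-running walk by one step in a single pass.
    walks: list of (current_vertex, start_vertex) pairs.
    Returns (still_running, finished), where finished holds (start, end) fingerprints."""
    walks.sort(key=lambda w: w[0])                 # sort path ends by current vertex
    still_running, finished = [], []
    for current, start in walks:                   # merge with the source-sorted edge list
        out = edges_by_source.get(current, [])
        if random.random() < c or not out:         # each walk stops with probability c
            finished.append((start, current))
        else:
            still_running.append((random.choice(out), start))
    return still_running, finished

# Toy run: 3 walks per vertex; the expected number of running walks shrinks by (1 - c) per iteration.
edges = {0: [1, 2], 1: [2], 2: [0]}
walks = [(v, v) for v in edges for _ in range(3)]
fingerprints = []
while walks:
    walks, done = one_iteration(walks, edges)
    fingerprints.extend(done)
print(fingerprints)
```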

8. Distributed indexing
• M machines with fast local network connections
  • memory < V + E ≤ M·(memory)
• [Figure: M = 3 machines; the N·V walks run in parallel, each machine keeps part of the graph in RAM, remote transfers are batched]
• Heuristic partition: one web site to one machine
  • Machine 1: www.cnn.com/*, Machine 2: www.yahoo.com/*
• Uniform load balance, since ordinary PageRank is distributed roughly equally across the machines
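A toy illustration of the "one site to one machine" heuristic: pages sharing a host name land on the same machine, so most walk steps stay local and only cross-site steps need a batched remote transfer. The hashing scheme and M = 3 are illustrative assumptions, not the authors' exact partitioning method.

```python
import zlib
from urllib.parse import urlparse

M = 3  # number of machines

def machine_for(url: str) -> int:
    """Assign a page to a machine by a stable hash of its host name."""
    host = urlparse(url).netloc
    return zlib.crc32(host.encode()) % M

# Both pages of the same site map to the same machine.
print(machine_for("http://www.cnn.com/world"), machine_for("http://www.cnn.com/us"))
```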

9. Query, increasing precision
• Database of N·V fingerprints (path endings)
• Query: P̂PV(u) := empirical distribution of the N samples starting from u
• Theorem (Jeh, Widom ’03)
  • PPV(u) = c·1ᵤ + (1 − c)/|O(u)| · Σ_{w ∈ O(u)} PPV(w), where O(u) denotes the out-neighbors of u
• Query: P̂PV(u) := empirical distribution combined from the N·|O(u)| samples of u’s out-neighbors
• Number of fingerprints used for a query
  • F = N·(db accesses/query)
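A sketch of this precision-boosting query, assuming the decomposition above and a fingerprint database keyed by the walks' start vertex; the names, the toy database, and the value of c are illustrative.

```python
from collections import Counter

def query_ppv(u, out_links, fingerprint_db, c=0.15):
    """Estimate PPV(u) from the fingerprints of u's out-neighbours,
    i.e. from N * |O(u)| samples instead of the N samples stored for u."""
    counts, total = Counter(), 0
    for w in out_links[u]:                      # one database access per out-neighbour
        counts.update(fingerprint_db[w])        # the N walk endings recorded for w
        total += len(fingerprint_db[w])
    ppv = {v: (1 - c) * k / total for v, k in counts.items()}
    ppv[u] = ppv.get(u, 0.0) + c                # the c * 1_u term of the decomposition
    return ppv

# Toy database: each vertex stores the ending vertices of N = 4 walks.
out_links = {0: [1, 2], 1: [2], 2: [0]}
fingerprint_db = {0: [0, 1, 2, 2], 1: [1, 2, 2, 0], 2: [2, 0, 0, 1]}
print(query_ppv(0, out_links, fingerprint_db))
```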

10. Error of approximation
• Exact: PPV(u,v)
• Approximation from F fingerprints: P̂PV(u,v)
• Theorem
  • If PPV(u,v) − PPV(u,w) = δ > 0 holds, then
  • Pr{ P̂PV(u,v) < P̂PV(u,w) } < exp(−0.3·F·δ²)
• Idea of the proof
  • F·( P̂PV(u,v) − P̂PV(u,w) ) = #(u→v) − #(u→w) = sum of F i.i.d. random variables with values in {−1, 0, 1}
  • Bernstein’s inequality
• Error of approximation → 0 exponentially with F = (db size/vertex)·(db accesses/query) → ∞
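The proof idea compresses one step; here is a short sketch of how the deviation bound enters, keeping the slide's constant 0.3 (the precise constant depends on the variance bound used in Bernstein's inequality).

```latex
% Each of the F sampled walks contributes X_i = +1 if it ends at v, -1 if it ends
% at w, and 0 otherwise, so E[X_i] = PPV(u,v) - PPV(u,w) = delta and |X_i| <= 1.
\[
  \Pr\bigl\{\widehat{PPV}(u,v) < \widehat{PPV}(u,w)\bigr\}
  = \Pr\Bigl\{\sum_{i=1}^{F} X_i < 0\Bigr\}
  = \Pr\Bigl\{\sum_{i=1}^{F} \bigl(X_i - \delta\bigr) < -F\delta\Bigr\}
  \le \exp\bigl(-0.3\, F \delta^{2}\bigr)
\]
% by Bernstein's inequality applied to the centered, bounded variables X_i - delta.
```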

11. Exact versus approximate
• Model of computation
  • Input: graph G with V vertices
  • Pre-compute a database of size D
  • Query: respond by accessing only the database
• Exact
  • Query: u, v, w
  • Decide whether PPV(u,v) > PPV(u,w) holds
• Approximate, for fixed ε and δ
  • Query: u, v, w
  • Decide whether PPV(u,v) > PPV(u,w) holds, with error probability ε, whenever |PPV(u,v) − PPV(u,w)| > δ

12. Lower bounds for the db size
• For the webgraph, V ≈ 10⁹
• Theorem 1
  • For the Exact problem, a database of size D = Ω(V²) is required in the worst case
• Theorem 2
  • For the Approximate problem, D = Ω(V)
• Is it possible to improve the 2nd lower bound?
  • Our algorithm uses a database of size D = O(V·log V)

13. Idea of the lower bound proofs
• One-way communication complexity
• Bit-vector probing (BVP)
  • Alice has a bit vector, input: x = (x1, x2, …, xm)
  • Bob has a number, input: 1 ≤ k ≤ m, and wants to learn xk
  • Communication: Alice sends Bob a single message of B bits
  • Theorem: B ≥ m for any protocol
• Reduction from Exact PPV to BVP
  • Alice encodes x = (x1, x2, …, xm) as a graph G with V vertices, where V² = m, and pre-computes an Exact PPV database of size D
  • Bob maps his index 1 ≤ k ≤ m to vertices u, v, w and decides PPV(u,v) ? PPV(u,w) to answer xk = ?
  • Communication: the Exact PPV database, D bits
  • Thus D = B ≥ m = V²

14. Summary
• Fully personalized PageRank
  • Monte Carlo method, not power iteration
• Pre-computation
  • External-memory, distributed
• Query
  • Increase precision by (db accesses/query)
• Error of approximation
  • Tends to zero exponentially
• Space lower bounds
  • Quadratic for Exact PPR
  • Linear for Approximate PPR

15. Thank you!

16. Misc
• N·P̂PV(u,v) = #(u→v) ~ Binom(N, PPV(u,v))
• Claim (by Chernoff’s bound):
  • Pr{ P̂PV(u,v) > (1+δ)·PPV(u,v) } < exp(−N·PPV(u,v)·δ²/4)
• If for a protocol Pr{right answer} ≥ (1+γ)/2, then B ≥ γ·m
• PV = (1 − c)·Mᵀ·PV + c·r, where PV is the PageRank vector, c a constant, and M the normalized adjacency matrix
