Streaming Graph Partitioning for Large Distributed Graphs Isabelle Stanton, UC Berkeley Gabriel Kliot, Microsoft Research XCG
Motivation • Modern graph datasets are huge • The web graph had over a trillion links in 2011. Now? • Facebook has “more than 901 million users with average degree 130” • Protein networks
Motivation • We still need to perform computations, so we have to deal with large data • PageRank (and other matrix-multiply problems) • Broadcasting status updates • Database queries • And on and on and on… The graph has to be distributed across a cluster of machines!
Motivation • Edges cut correspond (approximately) to communication volume required • Too expensive to move data on the network • Interprocessor communication: nanoseconds • Network communication: microseconds • The data has to be loaded onto the cluster at some point… • Can we partition while we load the data?
High Level Background • Graph partitioning is NP-hard on a good day • But then we made it harder: • Graphs like social networks are notoriously difficult to partition (expander-like) • Large data sets drastically reduce the amount of computation that is feasible – O(n) or less • The partitioning algorithms need to be parallel and distributed
The Streaming Model • The graph arrives as a stream: graph stream → partitioner → k machines • Each machine holds roughly n/k nodes; the partitioner may use a buffer of size C • The stream may be ordered by: • Random • Breadth-First Search • Depth-First Search • Goal: generate an approximately balanced k-partitioning
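To make the model concrete, here is a minimal sketch of the streaming partitioner loop (hypothetical names; the heuristic is a pluggable scoring function):

    def stream_partition(stream, k, heuristic):
        # stream yields (vertex, neighbors-already-seen) pairs; each
        # vertex must be placed on one of k machines immediately,
        # using only the partial graph seen so far.
        parts = [set() for _ in range(k)]   # vertices on each machine
        assignment = {}                     # vertex -> machine index
        for v, neighbors in stream:
            i = heuristic(v, neighbors, parts)
            parts[i].add(v)
            assignment[v] = i
        return assignment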
Lower Bounds on Orderings • Theory says these types of algorithms can’t do well: • Adversarial ordering: give every other vertex first, so the partitioner sees no edges until n/2 vertices have arrived – it can’t compete with the best balanced k-partition, which cuts only a few edges • Random ordering: by the birthday paradox, no edges appear until Ω(√n) vertices have arrived – still can’t compete on edges cut • DFS ordering: the stream is connected, so greedy will do optimally
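A back-of-the-envelope for the random-order bound: with m = Θ(n) edges and vertices arriving in random order, the first t arrivals induce roughly m·(t/n)² edges in expectation, since an edge survives only when both of its endpoints land among the first t. That expectation stays below 1 until t = Ω(n/√m) = Ω(√n), so the first Ω(√n) placements are made blind.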
Current Approach in Real Systems • Totally ignore edges and hash the vertex ID • Pro: • Fast to locate data • Doesn’t require a complex DHT or synchronization • Con: • Hashing the vertex ID cuts an expected (1 − 1/k) fraction of the edges, for any order • A great simple approximation… for MAX-CUT
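For concreteness, a sketch of this baseline (the hash function choice is illustrative): each edge is cut whenever its endpoints hash to different machines, which happens with probability 1 − 1/k, hence the MAX-CUT joke.

    import hashlib

    def hash_partition(vertex_id, k):
        # Stateless: any machine can recompute where a vertex lives.
        h = hashlib.md5(str(vertex_id).encode()).digest()
        return int.from_bytes(h, "big") % k

    # Expected fraction of edges cut, independent of stream order: an
    # edge survives only if both endpoints land on the same machine.
    def expected_cut_fraction(k):
        return 1 - 1 / k   # e.g. 0.75 for k = 4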
Our Approach • Evaluate 16 natural heuristics on 21 datasets, under each of the three orderings, with varying numbers of partitions • Find out which heuristics work on each graph • Compare these with the results of: • Random hashing, to get the worst case • METIS, to get ‘best’ offline performance
Caveats • METIS is a heuristic, not a true lower bound • Does fine in practice • Available online for reproducing results • Used publicly available datasets • Public graph datasets tend to be much smaller than what companies have • Using meta-data for partitioning can be good: • Partitioning the web graph by URL • Using geographic location for social network users
Heuristics • Balanced • Chunking • Hashing • (weighted) Deterministic Greedy • (weighted) Randomized Greedy • Triangles • Balance Big • Prefer Big • Avoid Big • Greedy EvoCut • Balance Big, Prefer Big, Avoid Big, and Greedy EvoCut use a buffer of size C • Weight functions: unweighted, linearly weighted, exponentially weighted
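As an example of the winning entry, a sketch of Linear Deterministic Greedy (LDG), assuming each machine has capacity C ≈ n/k: place each vertex where it has the most already-placed neighbors, discounted linearly by how full that machine is. It plugs into the streaming loop above (with C bound, e.g. via functools.partial).

    def linear_deterministic_greedy(v, neighbors, parts, C):
        # Score = (neighbors already on machine i) * a linear penalty
        # that shrinks to 0 as machine i fills to capacity C.
        nbrs = set(neighbors)
        scores = [len(p & nbrs) * (1 - len(p) / C) for p in parts]
        return scores.index(max(scores))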
Datasets • Includes finite element meshes, citation networks, social networks, web graphs, protein networks, and synthetically generated graphs • Sizes: 297 vertices to 41.7 million vertices • Synthetic graph models: • Barabasi-Albert (Preferential Attachment) • RMAT (Kronecker) • Watts-Strogatz • Power-Law Clustered • Biggest graphs: LiveJournal and Twitter
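For reference, three of the four synthetic models can be generated directly with networkx (sizes and parameters here are illustrative, not those used in the experiments); RMAT/Kronecker graphs need a separate generator.

    import networkx as nx

    n = 100_000
    ba  = nx.barabasi_albert_graph(n, m=5)           # preferential attachment
    ws  = nx.watts_strogatz_graph(n, k=10, p=0.1)    # small-world rewiring
    plc = nx.powerlaw_cluster_graph(n, m=5, p=0.3)   # power law + clustering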
Experimental Method • For each graph, heuristic, and ordering, partition into 2, 4, 8, 16 pieces • Compare with a random cut – upper bound • Compare with METIS – lower bound • Performance was measured by the number of edges cut, relative to these two baselines
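As a worked example with hypothetical numbers: if random hashing cuts 100,000 edges on some graph and a heuristic cuts 24,000 on the same graph, the heuristic is credited with a 76% improvement over the random baseline.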
Heuristic Results • [Figure: edges cut by each heuristic under BFS, DFS, and Random orderings, with Hash and METIS as baselines, on a synthetic graph, a finite element mesh, and a social network] • Best heuristic, LDG, gets an average improvement of 76% over all datasets!
Scaling in the Size of Graphs: Exploiting Synthetic Graphs • [Figure: edges cut vs. graph size for Hash, LDG, and METIS]
More Observations • BFS is a superior ordering for all algorithms • Avoid Big does 46% WORSE on average than Random Cut • Further experiments showed Linear Det. Greedy has identical performance to Det. Greedy with load-based tie breaking.
Results on a Real System • Compared the streamed partitioning with random hashing on Spark, a distributed cluster computation system (http://www.spark-project.org/) • Used 2 datasets: • LiveJournal – 4.6 million users, 77 million edges • Twitter – 41.7 million users, 1.468 billion edges • Computed the PageRank of each graph
Results on Spark • LiveJournal improvement: Naïve – 38.7%, Combiner – 28.8% • Twitter improvement: Naïve – 19.1%, Combiner – 18.8%
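The connection between cut edges and these speedups, as a toy sketch (not the Spark implementation): in each PageRank iteration, every edge whose endpoints live on different machines forces one rank contribution across the network.

    def cross_machine_messages(edges, assignment):
        # Communication volume of one PageRank iteration:
        # one network message per cut edge.
        return sum(1 for u, v in edges if assignment[u] != assignment[v])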
Streaming graph partitioning is a really nice, simple, effective preprocessing step.
Where to now? • Can we explain theoretically why the greedy algorithm performs so well?* • What heuristics work better? • What heuristics are optimal for different classes of graphs? • Use multiple parallel streams! • Implement in real systems! • *Work under submission: I. Stanton, Streaming Balanced Graph Partitioning Algorithms for Random Graphs • isabelle@eecs.berkeley.edu
Acknowledgements • At MSR: David B. Wecker, Burton Smith, Reid Andersen, Nikhil Devanur, Sameh Elnikety, Sreenivas Gollapudi, Yuxiong He, Rina Panigrahy, Yuval Peres • At Berkeley (CS and Statistics): Satish Rao, Virginia Vassilevska Williams, Alexandre Stauffer, Ngoc Mai Tran, Miklos Racz, Matei Zaharia • Supported by NSF and NDSEG fellowships, NSF grant CCF-0830797, and an internship at Microsoft Research’s eXtreme Computing Group.