Managing Large Graphs on Multi-Cores With Graph Awareness Vijayan, Ming, Xuetian, Frank, Lidong, Maya Microsoft Research
Motivation
• Tremendous increase in graph data and applications
• New class of graph applications that require real-time responses
• Even batch-processed workloads have strict time constraints
• Multi-core revolution
  • Multi-cores are now the default on most machines
  • Large-scale multi-cores with terabytes of main memory can run workloads that are traditionally run on distributed systems
• Existing graph-processing systems lack support for both trends
Outline
• Overview
• Details of optimizations
• Details on transactions
• Subset of results

A High-level Description of Grace
• Grace is an in-memory graph management and processing system
• Implements several optimizations
  • Graph-specific
  • Multi-core-specific
• Supports snapshots and transactional updates on graphs
• Evaluation shows that these optimizations help Grace run several times faster than other alternatives
An Overview of Grace
• Keeps an entire graph in memory, in smaller parts
• Exposes a C-style API for writing graph workloads, iterative workloads, and updates
• Design driven by two trends
  • Graph-specific locality
  • Partitionable and parallelizable workloads

Grace API (slide example):
  v = GetVertex(Id)
  for (i = 0; i < v.degree; i++)
      neigh = v.GetNeighbor(i)

[Figure: an example graph (vertices A–E) partitioned across Core 0 and Core 1; iterative programs (e.g., PageRank) run on top of the Grace API, with graph and multi-core optimizations below it and RPC over the network]
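A minimal C++ sketch of how the slide's vertex/neighbor API could be used; only GetVertex, degree, and GetNeighbor appear on the slide, and the surrounding types are assumptions made so the sketch compiles, not Grace's actual signatures.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative vertex and graph types for the slide's C-style API.
struct Vertex {
    uint64_t id;
    std::vector<uint64_t> neighbors;                 // this vertex's edge list
    size_t degree() const { return neighbors.size(); }
    uint64_t GetNeighbor(size_t i) const { return neighbors[i]; }
};

struct Graph {
    std::vector<Vertex> vertices;                    // indexed by vertex id
    const Vertex& GetVertex(uint64_t id) const { return vertices[id]; }
};

int main() {
    // Tiny example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}
    Graph g{{{0, {1, 2}}, {1, {2}}, {2, {}}}};

    // Iterate over the neighbors of vertex 0, mirroring the slide's loop.
    const Vertex& v = g.GetVertex(0);
    for (size_t i = 0; i < v.degree(); i++) {
        uint64_t neigh = v.GetNeighbor(i);
        std::printf("neighbor of %llu: %llu\n",
                    (unsigned long long)v.id, (unsigned long long)neigh);
    }
    return 0;
}
```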
Data Structures
[Figure: data structures in a partition — a vertex log (records for A, B, C), an edge log grouping each vertex's edges ("Edges of A", "Edges of B", "Edges of C"), an edge pointer array, a vertex index, and a vertex allocation map]
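A rough C++ sketch of the per-partition structures named in the figure; the field types and exact semantics are assumptions based on the names, not the paper's actual layout.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the "Data Structures in a Partition" figure.
struct Partition {
    // Vertex log: vertex records appended in allocation order.
    struct VertexRecord {
        uint64_t id;
        uint32_t edge_offset;   // where this vertex's edges start in edge_log
        uint32_t edge_count;    // number of outgoing edges
    };
    std::vector<VertexRecord> vertex_log;

    // Edge log: destination vertex ids, grouped per source vertex
    // ("Edges of A", "Edges of B", ... in the figure).
    std::vector<uint64_t> edge_log;

    // Edge pointer array: per-vertex offsets into the edge log.
    std::vector<uint32_t> edge_pointers;

    // Vertex index: maps a vertex id to its slot in the vertex log.
    std::vector<uint32_t> vertex_index;

    // Vertex allocation map: marks which vertex-log slots are in use.
    std::vector<bool> vertex_allocation_map;
};

int main() {
    Partition p;
    (void)p;
    return 0;
}
```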
Graph-Aware Partitioning & Placement
• Partitioning and placement – are they useful on a single machine?
  • Yes, to take advantage of multi-cores and memory hierarchies
• Solve them using graph partitioning algorithms
  • Divide a graph into sub-graphs, minimizing edge-cuts
  • Grace provides an extensible library
    • Graph-aware: heuristic-based, spectral partitioning, Metis
    • Graph-agnostic: hash partitioning
• Achieve a better layout by recursive graph partitioning (see the sketch below)
  • Recursively run graph partitioning until a sub-graph can fit in a cache line
  • Recompose all the sub-graphs to get the vertex layout
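A hedged sketch of the layout idea: recursively bisect a sub-graph until it is small enough (the slide's "fits in a cache line" is approximated here by a vertex-count threshold), then concatenate the leaves to get the vertex order. The bisect function is a stand-in for a real partitioner from Grace's library (e.g., Metis or spectral bisection).

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a real graph partitioner: splits a set of vertex ids into
// two halves. A real implementation would minimize edge-cut; here we simply
// split in the middle so the sketch stays self-contained.
static void bisect(const std::vector<int>& vertices,
                   std::vector<int>& left, std::vector<int>& right) {
    size_t mid = vertices.size() / 2;
    left.assign(vertices.begin(), vertices.begin() + mid);
    right.assign(vertices.begin() + mid, vertices.end());
}

// Recursively partition until a sub-graph is small enough (max_leaf vertices),
// then append the leaves in order to produce the final vertex layout.
static void layout(const std::vector<int>& vertices, size_t max_leaf,
                   std::vector<int>& order) {
    if (vertices.size() <= max_leaf) {
        order.insert(order.end(), vertices.begin(), vertices.end());
        return;
    }
    std::vector<int> left, right;
    bisect(vertices, left, right);
    layout(left, max_leaf, order);
    layout(right, max_leaf, order);
}

int main() {
    std::vector<int> vertices{0, 1, 2, 3, 4, 5, 6, 7};
    std::vector<int> order;
    layout(vertices, /*max_leaf=*/2, order);   // neighbors end up adjacent
    return 0;
}
```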
Platform for Parallel Iterative Computations
• The iterative computation platform implements the "bulk synchronous parallel" model (a minimal sketch follows)
[Figure: in each iteration, parallel computations run over all partitions, updates are propagated, and a barrier separates iteration 1 from iteration 2]
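A minimal sketch of a bulk synchronous parallel loop as described above: each worker computes over its partition, all workers synchronize at a barrier (here, joining the threads), updates are propagated, and the next iteration starts. The thread-per-partition structure and the toy computation are assumptions, not Grace's actual scheduler.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int kPartitions = 4;
    const int kIterations = 3;
    std::vector<double> value(kPartitions, 1.0);   // toy per-partition state

    for (int iter = 0; iter < kIterations; iter++) {
        // Parallel computation phase: one worker per partition.
        std::vector<std::thread> workers;
        for (int p = 0; p < kPartitions; p++) {
            workers.emplace_back([&, p] {
                value[p] = value[p] * 0.85 + 0.15;  // toy update
            });
        }
        for (auto& w : workers) w.join();           // barrier

        // Updates would be propagated between partitions here,
        // then the next iteration begins.
        std::printf("finished iteration %d\n", iter + 1);
    }
    return 0;
}
```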
Load Balancing and Updates Batching
• Problem 1: overloaded partitions can affect performance
  • Solution 1: load balancing is implemented by sharing a portion of vertices
• Problem 2: updates arriving in arbitrary order can increase cache misses
  • Solution 2: updates batching (sketched below) is implemented by
    • grouping updates by their destination partition
    • issuing updates in a round-robin fashion
[Figure: partitions 0–2 pinned to cores 0–2, with vertices A–D packed into cache lines and updates exchanged at the barrier]
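A rough sketch of the updates-batching idea: updates produced in arbitrary order are grouped by destination partition and then issued one partition at a time, so each receiver sees a contiguous run of updates instead of a random interleaving. The Update struct, the partition function, and the single-sender loop are assumptions for illustration.

```cpp
#include <cstdio>
#include <vector>

struct Update {
    int dest_vertex;
    double value;
};

int main() {
    const int kPartitions = 3;
    auto partition_of = [&](int vertex) { return vertex % kPartitions; };

    // Updates produced during one iteration, in arbitrary order.
    std::vector<Update> produced{{5, 0.1}, {1, 0.2}, {3, 0.3}, {2, 0.4}, {7, 0.5}};

    // Group updates by their destination partition...
    std::vector<std::vector<Update>> batches(kPartitions);
    for (const Update& u : produced)
        batches[partition_of(u.dest_vertex)].push_back(u);

    // ...and issue the batches partition by partition (a single sender's view
    // of the round-robin issue order), so each destination receives its
    // updates as one contiguous run.
    for (int p = 0; p < kPartitions; p++)
        for (const Update& u : batches[p])
            std::printf("partition %d <- vertex %d: %.1f\n",
                        p, u.dest_vertex, u.value);
    return 0;
}
```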
Transactions on Graphs
• Grace supports structural changes to a graph
  • BeginTransaction()
  • AddVertex(X)
  • AddEdge(X, Y)
  • EndTransaction()
• Transactions use snapshot isolation
  • Instantaneous snapshots using copy-on-write (CoW) techniques
  • CoW can disturb the careful memory layout!
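The four calls above come straight from the slide; a minimal usage sketch follows, with placeholder definitions added only so the snippet compiles (they are not Grace's real header).

```cpp
#include <cstdio>

// Placeholder definitions; only the four calls in main() appear on the slide.
static void BeginTransaction()    { std::printf("begin\n"); }
static void AddVertex(int x)      { std::printf("add vertex %d\n", x); }
static void AddEdge(int x, int y) { std::printf("add edge %d -> %d\n", x, y); }
static void EndTransaction()      { std::printf("commit (snapshot isolation)\n"); }

int main() {
    const int X = 1, Y = 2;

    // Structural changes grouped into one transaction. Under snapshot
    // isolation, readers keep seeing the pre-transaction snapshot
    // (maintained via copy-on-write) until the transaction commits.
    BeginTransaction();
    AddVertex(X);
    AddEdge(X, Y);
    EndTransaction();
    return 0;
}
```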
Evaluation
• Graphs:
  • Web (v: 88M, e: 275M), sparse
  • Orkut (v: 3M, e: 223M), dense
• Workloads:
  • N-hop-neighbor queries, BFS, DFS, PageRank, Weakly-Connected Components, Shortest Path
• Architecture:
  • Intel Xeon, 12 cores (2 chips with 6 cores each)
  • AMD Opteron, 48 cores (4 chips with 12 cores each)
• Questions:
  • How well do partitioning and placement work?
  • How useful are load balancing and updates batching?
  • How does Grace compare to other systems?
Partitioning and Placement Performance
[Figure: PageRank speedup on Intel for varying numbers of Orkut and Web graph partitions]
• Observation: For smaller numbers of partitions, the partitioning algorithm didn't make a big difference
  • Reason: all the partitions fit within the cores of a single chip, minimizing communication cost
• Observation: Careful vertex arrangement works better when graph partitioning is used, for sparse graphs
  • Reason: graph partitioning puts neighbors in the same partition, enabling better placement
• Observation: Placing neighboring vertices close together improves performance significantly
  • Reason: L1, L2, and L3 cache misses and data-TLB misses are reduced
Load Balancing and Updates Batching
[Figure: PageRank speedup on Intel for Orkut and Web graph partitions; retired-load counts]
• Observation: Load balancing and updates batching didn't improve performance for the Web graph
  • Reason: sparse graphs can be partitioned better, and there are fewer updates to send
• Observation: Batching updates gives a better performance improvement for the Orkut graph
  • Reason: updates batching reduces remote cache accesses
Comparing Grace, BDB, and Neo4j
[Figure: running time in seconds for Grace, BDB, and Neo4j]
Conclusion
• Grace explores graph-specific and multi-core-specific optimizations
• What worked and what didn't (in our setup; your mileage might differ)
  • Careful vertex placement in memory gave good improvements
  • Partitioning and updates batching worked in most cases, but not always
  • Load balancing wasn't as useful