If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)
Our ‘field’ to plow: graph processing. |V| = 1.4B, |E| = 6.6B
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Ripeanu
NetSysLab, The University of British Columbia
http://netsyslab.ece.ubc.ca
Graph Processing: The Challenges
• Poor locality → CPUs: caches + summary data structures
• Data-dependent memory access patterns
• Low compute-to-memory access ratio
• Large memory footprint → CPUs: >128 GB of host memory
• Varying degrees of parallelism (both intra- and inter-stage)
Graph Processing: The GPU Opportunity
• Poor locality → CPUs: caches + summary data structures; GPUs: caches + summary data structures
• Data-dependent memory access patterns, low compute-to-memory access ratio → GPUs: massive hardware multithreading
• Large memory footprint → CPUs: >128 GB; GPUs: only 6 GB!
• Varying degrees of parallelism (both intra- and inter-stage)
⇒ Assemble a heterogeneous platform
Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (8 billion edges)
Methodology
• Performance model
  • Predicts speedup
  • Intuitive
• Totem
  • A graph processing engine for hybrid systems
  • Applies algorithm-agnostic optimizations
• Evaluation
  • Predicted vs. achieved
  • Hybrid vs. symmetric
The Performance Model (I)
Goal: predict the speedup obtained from offloading part of the graph to the GPU, compared to processing only on the host.
• α: fraction of edges that remain on the host
• β: fraction of edges that cross the CPU/GPU partition (boundary edges)
• rcpu: host processing rate, in edges per second (EPS)
• c = b/m: communication rate, for bus bandwidth b and per-edge state m
Assuming the GPU partition is not the bottleneck, hybrid time is host compute plus transfer, so the predicted speedup is:
speedup(α) = (|E|/rcpu) / (α·|E|/rcpu + β·|E|/c) = 1 / (α + β·rcpu/c)

The Performance Model (II)
Instantiating the model for |V| = 32M, |E| = 1B:
• c: assume a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes ⇒ c = 1 billion EPS
• β = 20%: worst case (e.g., a bipartite graph)
• rcpu = 0.5 BEPS: best reported single-node BFS performance [Agarwal, V. 2010]
Takeaway: it is beneficial to process the graph on a hybrid system if the communication overhead is low.
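To make the arithmetic concrete, here is a minimal Python sketch of the model (an illustration, not code from the talk; `predicted_speedup` and its parameter names are invented, with values taken from the instantiation above):

```python
def predicted_speedup(alpha, beta, r_cpu, b, m):
    """Speedup of hybrid CPU+GPU processing over host-only processing.

    alpha: fraction of edges that remain on the host
    beta:  fraction of edges crossing the CPU/GPU partition
    r_cpu: host processing rate (edges/sec)
    b:     bus bandwidth (bytes/sec); m: per-edge state (bytes)
    """
    c = b / m                            # communication rate (edges/sec)
    t_host = 1.0 / r_cpu                 # host-only time, per edge
    t_hybrid = alpha / r_cpu + beta / c  # host compute + bus transfer
    return t_host / t_hybrid

# Slide values: b = 4 GB/s, m = 4 bytes (so c = 1 billion EPS),
# r_cpu = 0.5 BEPS, worst-case beta = 20%, half the edges offloaded:
print(predicted_speedup(alpha=0.5, beta=0.20, r_cpu=0.5e9, b=4e9, m=4))
# -> ~1.67x
```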
Totem: Programming Model
Bulk Synchronous Parallel:
• Rounds of computation and communication phases
• Updates to remote vertices are delivered in the next round
• Partitions vote to terminate execution
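As an illustration of this execution model (a hypothetical sketch, not Totem's actual API; `Partition` and `bsp_run` are invented names): each round computes on local state, messages sent during a round become visible only in the next one, and the loop ends when every partition votes to terminate.

```python
class Partition:
    """Toy partition: counts down a few steps, then votes to halt."""
    def __init__(self, pid, steps):
        self.id, self.steps = pid, steps

    def compute(self, inbox, outboxes):
        self.steps -= 1
        for dst in outboxes:             # send a placeholder message
            if dst != self.id:           # to every other partition
                outboxes[dst].append((self.id, self.steps))
        return self.steps <= 0           # vote to terminate?

def bsp_run(partitions):
    """Rounds of computation + communication; updates sent in round i
    are visible to remote partitions only in round i + 1."""
    inboxes = {p.id: [] for p in partitions}
    while True:
        outboxes = {p.id: [] for p in partitions}
        votes = [p.compute(inboxes[p.id], outboxes) for p in partitions]
        inboxes = outboxes               # deliver for the next round
        if all(votes):                   # all partitions voted to stop
            return

bsp_run([Partition(0, 3), Partition(1, 2)])
```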
Totem: A BSP-based Engine
• Compressed sparse row (CSR) graph representation
• Computation: the kernel manipulates local state
• Comm1: transfer outbox buffers to the remote input buffers
• Comm2: merge received updates with local state
• Updates to remote vertices are aggregated locally
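For reference, a minimal sketch of building the compressed sparse row layout each partition stores (illustrative Python; Totem itself is C/CUDA, and `build_csr` is an invented helper):

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from an edge list [(src, dst), ...].

    Neighbors of vertex v live in col_indices[row_offsets[v]:row_offsets[v+1]].
    """
    row_offsets = [0] * (num_vertices + 1)
    for src, _ in edges:                 # count each vertex's out-degree
        row_offsets[src + 1] += 1
    for v in range(num_vertices):        # prefix-sum into offsets
        row_offsets[v + 1] += row_offsets[v]
    col_indices = [0] * len(edges)
    cursor = row_offsets[:]              # next free slot per row
    for src, dst in edges:
        col_indices[cursor[src]] = dst
        cursor[src] += 1
    return row_offsets, col_indices

offsets, cols = build_csr(3, [(0, 1), (1, 2), (1, 0), (2, 1)])
print(cols[offsets[1]:offsets[2]])       # neighbors of vertex 1 -> [2, 0]
```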
The Aggregation Opportunity
• Real-world graphs are mostly scale-free: skewed degree distribution
• Sparse random graph (|E| = 512 million): ~5x reduction
• A denser graph has better opportunity for aggregation: ~50x reduction
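The aggregation step can be sketched as follows (hypothetical Python; `aggregate_outbox` is an invented name, and an additive combine operation is assumed, as in PageRank). Updates that target the same remote vertex are folded into one outbox entry, so traffic scales with the number of distinct remote neighbors rather than with boundary edges:

```python
from collections import defaultdict

def aggregate_outbox(boundary_updates):
    """Combine per-edge updates (remote_vertex, value) into a single
    entry per remote vertex; the combine operator here is addition."""
    outbox = defaultdict(float)
    for remote_vertex, value in boundary_updates:
        outbox[remote_vertex] += value
    return dict(outbox)

# Three boundary edges pointing at vertex 7 become one message:
print(aggregate_outbox([(7, 0.1), (7, 0.3), (9, 0.2), (7, 0.1)]))
# -> {7: 0.5, 9: 0.2}
```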
Evaluation Setup Workload • R-MAT graphs • |V|=32M, |E|=1B, unless otherwise noted Algorithms • Breadth-first Search • PageRank Metrics • Speedup compared to processing on the host only Testbed • Host: dual-socket Intel Xeon with 16GB • GPU: Nvidia Tesla C2050 with 3GB
Predicted vs. Achieved Speedup
• Linear speedup with respect to the offloaded part
• After aggregation, β = 2%; such a low value is critical for BFS
• The GPU partition fills GPU memory
Breakdown of Execution Time
• Aggregation significantly reduced the communication overhead
• PageRank is dominated by the compute phase
• The GPU is >5x faster than the host
So far …
• Performance modeling
  • Simple
  • Useful for initial system provisioning
• Totem
  • Generic graph processing framework
  • Algorithm-agnostic optimizations
• Evaluation (Graph500 scale-28)
  • 2x speedup over a symmetric system
  • 1.13 billion TEPS on a dual-socket, dual-GPU system
But, random partitioning! Can we do better?
Better partitioning strategies: the search space
• Handles large (billion-edge scale) graphs
• Low space and time complexity (ideally, quasi-linear!)
• Handles scale-free graphs well
• Minimizes the algorithm's execution time by reducing computation time (rather than communication)
The strategies we explore
• HIGH: vertices with high degree left on the host
• LOW: vertices with low degree left on the host
• RAND: random placement
[Figure: the percentage of vertices placed on the CPU for a scale-28 R-MAT graph (|V| = 256M, |E| = 4B)]
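A sketch of how these placement strategies could be implemented (illustrative Python; `place_vertices` and the `cpu_fraction` knob are invented, not Totem's interface):

```python
import random

def place_vertices(degrees, cpu_fraction, strategy):
    """Split vertices between host and GPU under HIGH / LOW / RAND.

    degrees: degrees[v] is the degree of vertex v
    strategy: 'high' keeps the highest-degree vertices on the host,
              'low' the lowest-degree ones, 'rand' a random subset.
    """
    vertices = list(range(len(degrees)))
    if strategy == 'rand':
        random.shuffle(vertices)
    else:
        vertices.sort(key=lambda v: degrees[v],
                      reverse=(strategy == 'high'))
    k = int(cpu_fraction * len(vertices))   # how many stay on the CPU
    return vertices[:k], vertices[k:]       # (cpu_part, gpu_part)

cpu_part, gpu_part = place_vertices([5, 1, 3, 8, 2], 0.4, 'high')
print(cpu_part, gpu_part)                   # -> [3, 0] [2, 4, 1]
```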
Evaluation Platform

                      Intel Nehalem             Fermi GPU
                      Xeon X5650 (2 sockets)    Tesla C2075 (2 GPUs)
Core frequency        2.67 GHz                  1.15 GHz
Num cores (SMs)       6                         14
HW threads/core       2                         48 warps (x32 threads/warp)
Last-level cache      12 MB                     2 MB
Main memory           144 GB                    6 GB
Memory bandwidth      32 GB/sec                 144 GB/sec
Total power (TDP)     95 W                      225 W
BFS Performance
• Exploring the 75% data point: 2x performance gain!
• LOW: no gain over random!
[Figure: BFS traversal rate for a scale-28 R-MAT graph (|V| = 256M, |E| = 4B)]
BFS Performance: More Details
The host is the bottleneck in all cases!
PageRank Performance
• 25% performance gain!
• LOW: minimal gain over random!
• Better packing
[Figure: PageRank processing rate for a scale-28 R-MAT graph (|V| = 256M, |E| = 4B)]
Small Graphs (scale-25 R-MAT: |V| = 32M, |E| = 512M)
[Figures: BFS and PageRank]
• Intelligent partitioning provides benefits
• Key for performance: load balancing
Uniform Graphs (not scale-free)
[Figures: BFS on a scale-25 uniform graph (|V| = 32M, |E| = 512M); BFS on scale-28]
• Hybrid techniques are not useful for uniform graphs
Scalability
• Graph size: R-MAT graphs, scale 25 to 29 (scale 29: |V| = 512M, |E| = 8B)
• Platform size: 1, 2, 4 sockets; 2 sockets + 2 GPUs
[Figures: BFS and PageRank]
Power
Normalizing by power (TDP: thermal design power); metric: million TEPS per watt.
[Figures: million TEPS/watt for BFS and PageRank across platform configurations]
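As a worked example using the deck's own numbers: per the evaluation platform table, the hybrid configuration (2 sockets + 2 GPUs) has a TDP of 2 × 95 W + 2 × 225 W = 640 W; at the 1.13 billion TEPS reported earlier for BFS, that is roughly 1130 / 640 ≈ 1.8 million TEPS per watt.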
Conclusions
• Q: Does it make sense to use a hybrid system?
• A: Yes! (for large scale-free graphs)
• Q: Can one design a processing engine for hybrid platforms that is both generic and efficient?
• A: Yes.
• Q: Are there near-linear-complexity partitioning strategies that enable higher performance?
• A: Yes; partitioning strategies based on vertex connectivity provide better performance than random in all cases.
• Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)?
• A: No (for scale-free graphs).
• Q: Which strategies work best?
• A: It depends! Large graphs: shape the load. Small graphs: load balancing.
If you were plowing a field, which would you rather use?
Two oxen, or 1024 chickens?
Both!
Code available at: netsyslab.ece.ubc.ca
Papers:
• A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, PACT 2012
• On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, IPDPS 2013
Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course … a (nudist) beach (… and 199 days of rain each year)