
If you were plowing a field, which would you rather use?



Presentation Transcript


  1. If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

  2. Our ‘field’ to plow: Graph processing. |V| = 1.4B, |E| = 6.6B

  3. Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto, Matei Ripeanu. NetSysLab, The University of British Columbia. http://netsyslab.ece.ubc.ca

  4. Graph Processing: The Challenges
• Poor locality → CPUs cope with caches + summary data structures
• Data-dependent memory access patterns
• Low compute-to-memory access ratio
• Large memory footprint → CPUs offer >128 GB of host memory
• Varying degrees of parallelism (both intra- and inter-stage)

  5. Graph Processing: The GPU Opportunity
• Poor locality → caches + summary data structures (on both CPUs and GPUs)
• Data-dependent memory access patterns → massive hardware multithreading (GPUs)
• Low compute-to-memory access ratio → massive hardware multithreading (GPUs)
• Large memory footprint → >128 GB on the host vs. only 6 GB on the GPU!
• Varying degrees of parallelism (both intra- and inter-stage)
⇒ Assemble a heterogeneous platform

  6. Motivating Question
Can we efficiently use hybrid systems for large-scale graph processing?
YES WE CAN! 2x speedup (8 billion edges)

  7. Methodology
• Performance model: predicts speedup; intuitive
• Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations
• Evaluation: predicted vs. achieved; hybrid vs. symmetric

  8. The Performance Model (I)
Goal: predict the speedup obtained from offloading part of the graph to the GPU, compared to processing only on the host.
Parameters: α = fraction of the graph offloaded to the GPU; β = fraction of edges that cross the partition boundary (communication volume); c = b/m = communication rate over the bus; rcpu = host processing rate.

  9. The Performance Model (II)
• Communication rate: assume a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes ⇒ c = b/m = 1 billion EPS
• Boundary edges: worst case β = 20% (e.g., a bipartite graph with |V| = 32M, |E| = 1B)
• Host rate: rcpu = 0.5 BEPS, the best reported single-node BFS performance [Agarwal, V. 2010]
Takeaway: it is beneficial to process the graph on a hybrid system if the communication overhead is low.
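To make the model concrete, one plausible closed form (our reconstruction from the parameters above, not a formula quoted from the slides) assumes the GPU partition is never the bottleneck, so hybrid time is the host's compute time plus the transfer time:

    t_host = |E| / rcpu
    t_hybrid = (1 - α)·|E| / rcpu + β·|E| / c
    S(α, β) = t_host / t_hybrid = 1 / ((1 - α) + β·rcpu/c)

Plugging in the worst-case numbers above (β = 0.2, rcpu = 0.5 BEPS, c = 1 BEPS), the communication term is only β·rcpu/c = 0.1, i.e., even a pathological partition pays a small bus penalty relative to host compute time.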

  10. Totem: Programming Model — Bulk Synchronous Parallel
• Rounds of computation and communication phases
• Updates to remote vertices are delivered in the next round
• Partitions vote to terminate execution
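A minimal sketch of this loop in C, under illustrative names (partition_t, compute, exchange are our own, not Totem's actual API): each while-iteration is one BSP round, and execution terminates only when every partition votes to halt.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PARTITIONS 2           /* e.g., one CPU and one GPU partition */

    typedef struct {
      int  id;
      bool vote_to_halt;               /* set by the kernel when it has no work left */
    } partition_t;

    /* Computation phase: run the algorithm kernel on the partition's local
     * state. Dummy kernel here: every partition finishes after three rounds. */
    static void compute(partition_t *p, int round) {
      p->vote_to_halt = (round >= 3);
    }

    /* Communication phase: deliver outbox buffers to remote inboxes; updates
     * become visible to the target partition only in the NEXT round. */
    static void exchange(partition_t *parts, int n) {
      (void)parts; (void)n;            /* transfer elided in this sketch */
    }

    int main(void) {
      partition_t parts[NUM_PARTITIONS] = {{0, false}, {1, false}};
      bool finished = false;
      int round = 0;
      while (!finished) {              /* one BSP round per iteration */
        finished = true;
        for (int i = 0; i < NUM_PARTITIONS; i++) {
          compute(&parts[i], round);
          finished = finished && parts[i].vote_to_halt;
        }
        exchange(parts, NUM_PARTITIONS);
        round++;
      }
      printf("terminated after %d rounds\n", round);
      return 0;
    }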

  11. Totem: A BSP-based Engine
• Compressed sparse row (CSR) representation
• Computation: the kernel manipulates local state
• Comm1: transfer each outbox buffer to the remote input buffer
• Comm2: merge received updates with local state
• Updates to remote vertices are aggregated locally
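For reference, a minimal C sketch of the CSR layout (field names are ours for illustration, not Totem's actual structures): two flat arrays encode the whole subgraph, and a vertex's edges occupy a contiguous slice, which suits both CPU caching and GPU memory coalescing.

    #include <stdio.h>

    typedef struct {
      int        vertex_count;
      const int *row_offsets;          /* vertex_count + 1 entries */
      const int *adjacency;            /* edge destinations, grouped by source */
    } csr_t;

    /* Edges of vertex v occupy adjacency[row_offsets[v] .. row_offsets[v+1]). */
    static void print_neighbors(const csr_t *g, int v) {
      for (int e = g->row_offsets[v]; e < g->row_offsets[v + 1]; e++)
        printf("%d -> %d\n", v, g->adjacency[e]);
    }

    int main(void) {
      /* Tiny 3-vertex graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {} */
      const int offsets[] = {0, 2, 3, 3};
      const int adj[]     = {1, 2, 2};
      csr_t g = {3, offsets, adj};
      for (int v = 0; v < g.vertex_count; v++)
        print_neighbors(&g, v);
      return 0;
    }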

  12. The Aggregation Opportunity
• Real-world graphs are mostly scale-free: skewed degree distribution
• [Figure: message reduction from aggregation on random graphs, |E| = 512 million — sparse graph: ~5x reduction; a denser graph offers a better opportunity for aggregation: ~50x reduction]
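The mechanism behind these reductions, sketched in C under illustrative names: each remote vertex owns a single slot in the outbox, so updates addressed to it combine locally (a sum for PageRank-style contributions, a min for BFS levels) and only one value per remote vertex crosses the bus.

    #include <stdio.h>

    /* Combine a PageRank-style contribution into the remote vertex's slot:
     * a sum replaces per-edge message queueing. */
    static void send_rank(float *outbox, int slot, float contribution) {
      outbox[slot] += contribution;
    }

    int main(void) {
      /* Ten boundary edges targeting one remote vertex collapse into a
       * single outbox value: a 10x message reduction in this toy case. */
      float outbox[1] = {0.0f};
      for (int i = 0; i < 10; i++)
        send_rank(outbox, 0, 0.1f);
      printf("one value transferred: %.2f\n", outbox[0]);
      return 0;
    }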

  13. Evaluation Setup
• Workload: R-MAT graphs; |V| = 32M, |E| = 1B, unless otherwise noted
• Algorithms: Breadth-first Search and PageRank
• Metric: speedup compared to processing on the host only
• Testbed: host = dual-socket Intel Xeon with 16 GB; GPU = Nvidia Tesla C2050 with 3 GB

  14. Predicted vs. Achieved Speedup
• Linear speedup with respect to the offloaded part
• After aggregation, β = 2%; a low value is critical for BFS
• The GPU partition fills GPU memory
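Under the same hedged model sketched after slide 9, β = 2% gives a communication term of β·rcpu/c = 0.01, so S(α) ≈ 1 / (1 − α + 0.01): essentially linear in the offloaded fraction α, consistent with the behavior reported here.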

  15. Breakdown of Execution Time
• Aggregation significantly reduced the communication overhead
• PageRank is dominated by the compute phase
• The GPU is >5x faster than the host

  16. So far …
• Performance modeling: simple; useful for initial system provisioning
• Totem: a generic graph processing framework with algorithm-agnostic optimizations
• Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS (traversed edges per second) on a dual-socket, dual-GPU system
But, random partitioning! Can we do better?

  17. Better Partitioning Strategies: The Search Space
• Handles large (billion-edge scale) graphs
• Low space and time complexity; ideally, quasi-linear!
• Handles scale-free graphs well
• Minimizes the algorithm's execution time by reducing computation time (rather than communication)

  18. The Strategies We Explore
• High: vertices with high degree are left on the host
• Low: vertices with low degree are left on the host
• Rand: random assignment
[Figure: the percentage of vertices placed on the CPU for a scale-28 R-MAT graph (|V| = 256M, |E| = 4B)]
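A sketch of the High strategy in C (the names and the edge-fraction knob are our own illustration, not the paper's implementation): sort vertices by degree and keep the highest-degree ones on the host until the CPU partition holds a target fraction of the edges. One sort plus one scan keeps the cost quasi-linear, matching the search-space requirements of slide 17.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int id; int degree; } vdeg_t;

    static int by_degree_desc(const void *a, const void *b) {
      return ((const vdeg_t *)b)->degree - ((const vdeg_t *)a)->degree;
    }

    /* cpu_part[v] = 1 if vertex v is placed on the CPU, 0 if on the GPU. */
    static void partition_high(vdeg_t *verts, int n, long long total_edges,
                               double cpu_edge_fraction, char *cpu_part) {
      qsort(verts, n, sizeof *verts, by_degree_desc);
      long long budget = (long long)(cpu_edge_fraction * (double)total_edges);
      long long placed = 0;
      for (int i = 0; i < n; i++) {
        int on_cpu = placed < budget;        /* fill the CPU's edge budget first */
        cpu_part[verts[i].id] = (char)on_cpu;
        if (on_cpu) placed += verts[i].degree;
      }
    }

    int main(void) {
      vdeg_t v[4] = {{0, 100}, {1, 3}, {2, 50}, {3, 1}};
      char part[4];
      partition_high(v, 4, 154, 0.75, part);
      for (int i = 0; i < 4; i++)
        printf("vertex %d -> %s\n", i, part[i] ? "CPU" : "GPU");
      return 0;
    }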

  19. Evaluation Platform
                       Intel Nehalem            Fermi GPU
                       Xeon X5650 (2 sockets)   Tesla C2075 (2 GPUs)
  Core frequency       2.67 GHz                 1.15 GHz
  Num cores (SMs)      6                        14
  HW threads/core      2                        48 warps (x32 threads/warp)
  Last-level cache     12 MB                    2 MB
  Main memory          144 GB                   6 GB
  Memory bandwidth     32 GB/sec                144 GB/sec
  Total power (TDP)    95 W                     225 W

  20. BFS Performance
• Exploring the 75% data point: 2x performance gain!
• LOW: no gain over random!
[Figure: BFS traversal rate for a scale-28 R-MAT graph (|V| = 256M, |E| = 4B)]

  21. BFS Performance: More Details
The host is the bottleneck in all cases!

  22. PageRank Performance
• 25% performance gain!
• LOW: minimal gain over random!
• Better packing
[Figure: PageRank processing rate for a scale-28 R-MAT graph (|V| = 256M, |E| = 4B)]

  23. Small Graphs (scale-25 R-MAT graphs: |V| = 32M, |E| = 512M)
[Figures: BFS and PageRank]
• Intelligent partitioning provides benefits
• Key for performance: load balancing

  24. Uniform Graphs (not scale-free)
[Figures: BFS on a scale-25 uniform graph (|V| = 32M, |E| = 512M); BFS on scale-28]
• Hybrid techniques are not useful for uniform graphs

  25. Scalability
• Graph size: R-MAT graphs, scale 25 to 29 (scale 29: |V| = 512M, |E| = 8B)
• Platform size: 1, 2, 4 sockets → 2 sockets + 2 GPUs
[Figures: BFS and PageRank]

  26. Power
• Normalizing by power (TDP: thermal design power)
• Metric: million TEPS per watt
[Figure: MTEPS/watt for BFS and PageRank across platform configurations; values range from 1.0 to 2.4]
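As a worked example using the testbed's TDP figures from slide 19 (an illustration, not a number reported on this slide): a dual-socket, dual-GPU configuration dissipates at most 2 × 95 W + 2 × 225 W = 640 W, so the 1.13 billion TEPS from slide 16 corresponds to roughly 1130 / 640 ≈ 1.8 million TEPS per watt.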

  27. Conclusions
• Q: Does it make sense to use a hybrid system? A: Yes! (for large scale-free graphs)
• Q: Can one design a processing engine for hybrid platforms that is both generic and efficient? A: Yes.
• Q: Are there near-linear complexity partitioning strategies that enable higher performance? A: Yes; partitioning strategies based on vertex connectivity provide better performance than random in all cases.
• Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)? A: No (for scale-free graphs).
• Q: Which strategies work best? A: It depends! Large graphs: shape the load. Small graphs: load balancing.

  28. If you were plowing a field, which would you rather use? - Two oxen, or 1024 chickens? - Both!

  29. Code available at: netsyslab.ece.ubc.ca
Papers:
• A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, PACT 2012
• On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, IPDPS 2013

  30. Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course … a (nudist) beach … (and 199 days of rain each year)
