If you were plowing a field, which would you rather use?

Two oxen, or 1024 chickens? (Attributed to S. Cray). A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing. Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto and Matei Ripeanu, NetSysLab.

Presentation Transcript


  1. If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

  2. A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto and Matei Ripeanu NetSysLab The University of British Columbia http://netsyslab.ece.ubc.ca

  3. Graphs are Everywhere

  4. Graph Processing Challenges [Hong, S. 2011] • Challenges: poor locality; low compute-to-memory access ratio; large memory footprint • CPUs: caches + summary data structures; >128GB of memory • GPUs: massive hardware multithreading; 6GB of memory

  5. Motivating Question • Can we efficiently use hybrid systems for large-scale graph processing? • YES WE CAN! 2x speedup (4 billion edges)

  6. Methodology • Performance Model: predicts speedup; intuitive • Totem: a graph processing engine for hybrid systems; applies algorithm-agnostic optimizations • Evaluation: predicted vs achieved; hybrid vs symmetric

  7. The Performance Model (I) • Predicts the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host) • Model parameters: α, β, r_cpu, and c = b/m (bus bandwidth b over per-edge state m)

  8. The Performance Model (II) • Assume a PCI-E bus with b ≈ 4 GB/sec and per-edge state m = 4 bytes => c = 1 billion EPS • Settings shown: β = 20%, r_cpu = 0.5 BEPS, |V| = 32M, |E| = 1B • Plot annotations: best reported single-node BFS performance [Agarwal, V. 2010]; worst case (e.g., bipartite graph) • It is beneficial to process the graph on a hybrid system if communication overhead is kept low
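The slide's arithmetic can be tried out with a small sketch. The transcript does not preserve the model's formula, so the version below is an assumption: α is taken to be the share of edges that stay on the host, β the share of edges that cross the partition boundary, and the hybrid run is assumed to cost the host's compute time plus the time to communicate the boundary edges. The function names are hypothetical and are not part of Totem.

```python
def communication_rate(bus_bandwidth_bytes_per_s, bytes_per_edge):
    """c = b / m: edges per second that can cross the bus."""
    return bus_bandwidth_bytes_per_s / bytes_per_edge


def predicted_speedup(alpha, beta, r_cpu, c):
    """Speedup over host-only processing, under the assumed model.

    Assumption (not from the transcript): per-edge hybrid time is
    alpha / r_cpu (host compute) + beta / c (boundary communication),
    while host-only time per edge is 1 / r_cpu.
    """
    host_only_time = 1.0 / r_cpu
    hybrid_time = alpha / r_cpu + beta / c
    return host_only_time / hybrid_time


# Values quoted on the slide: b = 4 GB/sec, m = 4 bytes, beta = 20%, r_cpu = 0.5 BEPS.
c = communication_rate(4e9, 4)
for alpha in (0.9, 0.7, 0.5):
    s = predicted_speedup(alpha, beta=0.20, r_cpu=0.5e9, c=c)
    print(f"alpha={alpha:.1f} -> predicted speedup {s:.2f}x")
```

Under these assumptions, lowering β (less boundary traffic) or α (more of the graph offloaded) both raise the predicted speedup, which matches the slide's point that communication overhead must be kept low.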

  9. Totem: Programming Model • Bulk Synchronous Parallel • Rounds of computation and communication phases • Updates to remote vertices are delivered in the next round • Partitions vote to terminate execution
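The three BSP rules on this slide can be illustrated with a toy level-synchronous BFS. This is a minimal single-process sketch, not Totem's implementation: the function `bsp_bfs`, the partition labels, and the dictionary-based graph are all hypothetical, and the partitions are simulated sequentially rather than run on a CPU and a GPU.

```python
def bsp_bfs(adjacency, owner, source):
    """Toy BSP breadth-first search over a partitioned graph.

    adjacency: dict vertex -> list of neighbours
    owner:     dict vertex -> partition id (hypothetical labels, e.g. "cpu"/"gpu")
    Returns a dict vertex -> BFS level (None if unreachable).
    """
    partitions = sorted(set(owner.values()))
    level = {v: None for v in adjacency}  # each entry is only written by its owner
    level[source] = 0
    frontier = {p: set() for p in partitions}
    frontier[owner[source]].add(source)
    inbox = {p: [] for p in partitions}

    while True:
        # Communication phase: updates sent last round are delivered now.
        for p in partitions:
            for v, lvl in inbox[p]:
                if level[v] is None:
                    level[v] = lvl
                    frontier[p].add(v)

        outbox = {p: [] for p in partitions}
        next_frontier = {p: set() for p in partitions}
        votes = {}

        # Computation phase: each partition touches only its local state.
        for p in partitions:
            sent = False
            for u in frontier[p]:
                for v in adjacency[u]:
                    if owner[v] == p:
                        if level[v] is None:
                            level[v] = level[u] + 1
                            next_frontier[p].add(v)
                    else:
                        # Remote vertex: buffer the update for the next round.
                        outbox[owner[v]].append((v, level[u] + 1))
                        sent = True
            # A partition votes to finish when it has no work left to do or send.
            votes[p] = not next_frontier[p] and not sent

        if all(votes.values()):
            return level
        frontier, inbox = next_frontier, outbox


# A path 0-1-2-3 whose vertices alternate between two partitions, plus an isolated vertex.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
placement = {0: "cpu", 1: "gpu", 2: "cpu", 3: "gpu", 4: "cpu"}
print(bsp_bfs(graph, placement, source=0))
```

Alternating ownership along the path forces every hop through the outbox, so each BFS level costs one full round; this is the worst case for communication in this toy setting.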

  10. Totem: A BSP-based Engine • Compressed sparse row representation • Computation: kernel manipulates local state • Comm1: transfer outbox buffer to remote input buffer • Comm2: merge with local state • Updates to remote vertices aggregated locally
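For readers unfamiliar with compressed sparse row (CSR), the following sketch shows the general layout: one array of per-vertex offsets and one flat array of neighbour ids. It illustrates the format only; the helper names are hypothetical and Totem's actual data structures are not shown in this transcript.

```python
def build_csr(num_vertices, edges):
    """Build a CSR graph from a list of directed (src, dst) edges.

    Returns (offsets, neighbours): the neighbours of vertex v are
    neighbours[offsets[v]:offsets[v + 1]].
    """
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count the out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into offsets

    neighbours = [0] * len(edges)
    cursor = offsets[:-1]              # slicing copies: next free slot per vertex
    for src, dst in edges:
        neighbours[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbours


def neighbours_of(offsets, neighbours, v):
    """Return the adjacency list of vertex v as a slice of the flat array."""
    return neighbours[offsets[v]:offsets[v + 1]]


offsets, neighbours = build_csr(3, [(0, 1), (0, 2), (2, 0), (1, 2)])
print(offsets, neighbours, neighbours_of(offsets, neighbours, 0))
```

The flat arrays keep each vertex's neighbours contiguous in memory, which is a common reason to choose CSR for both CPU and GPU graph kernels.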

  11. The Aggregation Opportunity • Real-world graphs are mostly scale-free: skewed degree distribution • Figure labels: |E| = 512 Million, Random • Sparse graph: ~5x reduction • Denser graph has better opportunity for aggregation: ~50x reduction
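The reduction factors quoted here come from collapsing many updates aimed at the same remote vertex into one outbox entry. A minimal sketch of that idea follows; the function names are hypothetical, and the choice of `min` as the combining operator (suitable for BFS levels) is an assumption, since the transcript only says that updates are aggregated locally.

```python
def aggregate_outbox(updates, combine=min):
    """Collapse (remote_vertex, value) updates into one entry per remote vertex.

    combine is an assumed reduction operator: min fits BFS levels,
    while a sum (lambda a, b: a + b) would fit PageRank contributions.
    """
    outbox = {}
    for dst, value in updates:
        outbox[dst] = combine(outbox[dst], value) if dst in outbox else value
    return outbox


def reduction_factor(updates):
    """Messages before aggregation divided by messages after it."""
    if not updates:
        return 1.0
    return len(updates) / len({dst for dst, _ in updates})


# Three boundary edges point at remote vertex 7, one at remote vertex 9.
pending = [(7, 3), (7, 2), (9, 5), (7, 4)]
print(aggregate_outbox(pending))   # one entry per remote vertex
print(reduction_factor(pending))   # ratio of raw to aggregated messages
```

A skewed degree distribution means a few high-degree remote vertices receive most of the boundary edges, so the denser the graph, the more updates collapse into each outbox entry.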

  12. Evaluation Setup • Workload: R-MAT graphs; |V| = 32M, |E| = 1B, unless otherwise noted • Algorithms: Breadth-first Search; PageRank • Metrics: speedup compared to processing on the host only • Testbed: host is a dual-socket Intel Xeon with 16GB; GPU is an Nvidia Tesla C2050 with 3GB

  13. Predicted vs Achieved Speedup • Linear speedup with respect to offloaded part • After aggregation, β = 2%. A low value is critical for BFS • GPU partition fills GPU memory

  14. Breakdown of Execution Time • Aggregation significantly reduced communication overhead • PageRank is dominated by the compute phase • GPU is > 5x faster than the host

  15. Effect of Graph Density • Deviation due to not incorporating pre- and post-transfer overheads in the model • Sparser graphs have higher β

  16. Contributions • Performance modeling: simple; useful for initial system provisioning • Totem: generic graph processing framework; algorithm-agnostic optimizations • Evaluation (Graph500 scale-28): 2x speedup over a symmetric system; 1.13 billion TEPS on a dual-socket, dual-GPU system

  17. Questions? • Code available at: netsyslab.ece.ubc.ca
