1 / 22

If you were plowing a field, which would you rather use?

If you were plowing a field, which would you rather use?. Two oxen, or 1024 chickens? (Attributed to S. Cray). Abdullah Gharaibeh , Lauro Costa, Elizeu Santos-Neto and Matei Ripeanu Netsyslab The University of British Columbia. 1. The ‘Field’ to Plow : Graph processing. 1B users

victor-key
Download Presentation

If you were plowing a field, which would you rather use?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. If you were plowing a field, which would you rather use? • Two oxen, or • 1024 chickens? • (Attributed to S. Cray) • Abdullah Gharaibeh, Lauro Costa, • Elizeu Santos-Neto and Matei Ripeanu • Netsyslab • The University of British Columbia • 1

  2. The ‘Field’ to Plow : Graph processing • 1B users • 150B friendships • 100B neurons • 700T connections • 1.4B pages, 6.6B links

  3. Graph Processing Challenges • CPUs • Poor locality • Caches + summary data structures • Data-dependent memory access patterns • Low compute-to-memory access ratio • >256GB • Large memory footprint • Varying degrees of parallelism • (both intra- and inter- stage)

  4. The GPU Opportunity • GPUs • CPUs • Poor locality • Caches + summary data structures • Caches + summary data structures • Data-dependent memory access patterns • Massive hardware multithreading • Low compute-to-memory access ratio • >256GB • Large memory footprint • 6GB! • Assemble a • hybrid platform • Varying degrees of parallelism • (both intra- and inter- stage)

  5. Motivating Question • Can we efficiently use hybrid systemsfor large-scale graph processing? • YES WE CAN! • 2x speedup • (4 billion edges)

  6. Methodology • Performance Model • Predicts speedup • Intuitive Totem • A graph processing engine for hybrid systems • Applies algorithm-agnostic optimizations Partitioning Strategies • Can one do better than random? Evaluation • Partitioning strategies • Hybrid vs. Symmetric A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing, PACT 2012 On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, IPDPS 2013 • 6

  7. Totem: Compute Model Bulk Synchronous Parallel Rounds of computation and communication phases Updates to remote vertices are delivered in the next round Partitions vote to terminate execution • . • . • . • 7

  8. Totem: Programming Model • A user-defined kernel runs simultaneously on each partition • Vertices are processed in parallel within each partition • Messages to remote vertices are combined if a combiner function is defined msg_combine msg_combine_func

  9. Target Platform • Intel Nehalem Fermi GPU • Xeon X5650 Tesla C2075 • (2x sockets) (2x GPUs) • Core Frequency 2.67GHz 1.15GHz • Num HW-threads 12 448 • Last Level Cache 12MB 2MB • Main Memory 144GB 6GB • Memory Bandwidth 32GB/sec 144GB/sec • 9

  10. Level 0 Level 1 5 9 2 4 Use Case: Breadth-first Search • Graph Traversal algorithm • Very little computation per vertex • Sensitive to cache utilization Visited bit-vector

  11. BSF Performance Can we do better than random partitioning? Linear improvement with respect to offloaded part Limited GPU memory prevents offloading more • Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B) • 1S1G = 1 CPU Socket + 1 GPU • 11

  12. BSF Performance Computation phase dominates run time • Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B) • 1S1G = 1 CPU Socket + 1 GPU • 12

  13. Workload-Processer Matchmaking Many low degree vertices • Observation: Real-world graphs have power-law edge degree distribution • Idea: Place the many low-degreevertices on GPU Place the few high-degreevertices on CPU Few high degree vertices Log-log plot of edge degree distribution for LiveJournal social network

  14. BSF performance Exploring the 75% data point 2x performance gain! Specialization leads to superlinear speedup LOW: No gain over random! Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B) 1S1G = 1 CPU Socket + 1 GPU 14

  15. BSF performance – more details HIGH leads to better load balancing Host is the bottleneck in all cases!

  16. PageRank: A More Compute Intensive Use Case 25% performance gain! LOW: Minimal gain over andom! Better packing Workload: scale-28 Graph500 benchmark (|V|=256m, |E|=4B) 1S1G = 1 CPU Socket + 1 GPU 16

  17. Same Analysis, Smaller Graph(scale-25 RMAT graphs: |V|=32m, |E|=512m) PageRank BFS • Intelligent partitioning provides benefits • Key for performance: load balancing 17

  18. Worst Case: Uniform Graph(not scale free) BFS on scale-25 uniform graph (|V|=32m, |E|=512m) Hybrid techniques not useful for uniform graphs 18

  19. Scalability: BFS • Comparable to top 100 in Graph500 challenge! • Hybrid offers ~2x speedup vs Symmetric • GPU memory is limited, GPU share < 1% of the edges! S = CPU Socket G = GPU • 19

  20. Scalability: PageRank Hybrid offers better gains for more compute intensive algorithms S = CPU Socket G = GPU • 20

  21. Summary • Problem: large-scale graph processing is challenging • Heterogeneous workload – power-law degree distribution • Solution: Totem Framework • Harnesses hybrid systems – Utilizes the strengths of CPUs and GPUs • Powerful abstraction – simplifies implementing algorithms for hybrid systems • Partitioning – workload to processor matchmaking • Algorithms – BFS, SSSP, PageRank, Centrality Measures

  22. Questions code@: netsyslab.ece.ubc.ca

More Related