
Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics






Presentation Transcript


  1. Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics. Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, V. Krishna Nandivada, Marc Snir, Keshav Pingali

  2. Graph Analytics • Applications: machine learning and network analysis • Datasets: unstructured graphs that need TBs of memory (image credits: Wikipedia, SFL Scientific, MakeUseOf, Sentinel Visualizer)

  3. Motivation • Most distributed graph analytics systems are bulk-synchronous parallel (BSP) • Gluon [PLDI’18], Lux [VLDB’18], Gemini [OSDI’16], PowerGraph [OSDI’12], ... • Dynamic workloads => stragglers limit scaling • Asynchronous distributed graph analytics systems are restricted and not competitive • GRAPE+ [SIGMOD’18], PowerSwitch [PPoPP’15], ASPIRE [OOPSLA’14], GraphLab [VLDB’12], ... • Small messages => high synchronization overheads => slower than BSP systems • No way to reuse infrastructure to leverage GPUs

  4. Bulk-Asynchronous Parallel (BASP) • Exploits resilience of graph applications to stale reads • Novel asynchronous programming and execution model • Retains the advantages of bulk-computation and bulk-communication • Allows progress by eliding blocking/waiting during communication • Novel non-blocking reconciliation of updates to the graph • Novel non-blocking termination detection algorithm • Easy to adapt BSP programs and systems to BASP

  5. Gluon-Async: A BASP System • Adapted Gluon, the state-of-the-art distributed graph analytics system, to build Gluon-Async • First asynchronous distributed GPU graph analytics system • Supports arbitrary partitioning policies in CuSP with Gluon’s communication optimizations • Gluon-Async is ~1.5x faster than Gluon(-Sync) at scale (stack diagram: Galois [SoSP’13] on CPU and IrGL [OOPSLA’16]/CUDA on GPU, Gluon plugin, Gluon-Async communication runtime [Gluon PLDI’18], CuSP partitioner [IPDPS’19], network via LCI [IPDPS’18]/MPI)

  6. Outline • Gluon Synchronization Approach • Bulk-Asynchronous Parallel (BASP) Execution Model • Adapting BSP systems to BASP • Bulk-Asynchronous Communication • Non-Blocking Termination Detection • Adapting BSP programs to BASP • Experimental Results

  7. Gluon Synchronization Approach

  8. Vertex Programming Model • Every node has a label • e.g., distance in single-source shortest path (SSSP) • Apply an operator on an active node in the graph • e.g., relaxation operator in SSSP • Push-style: reads its label and writes to neighbors’ labels • Pull-style: reads neighbors’ labels and writes to its label • Termination: no more active nodes (or work) • Applications: breadth-first search, connected components, k-core, pagerank, single-source shortest path, etc.
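The push-style operator and termination condition described above can be sketched in a few lines; this is an illustrative shared-memory Python version, not the Galois/IrGL implementation, and the adjacency-list layout and priority worklist are assumptions made for the example:

```python
import heapq

def sssp_push(graph, source):
    """Push-style SSSP: an active node reads its own label and relaxes
    (writes to) its neighbors' labels.  Terminates when there are no
    more active nodes.  `graph` maps node -> list of (neighbor, weight)."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    worklist = [(0, source)]               # active nodes
    while worklist:                        # termination: no more work
        d, v = heapq.heappop(worklist)
        if d > dist[v]:
            continue                       # stale worklist entry
        for u, w in graph[v]:              # push: write neighbors' labels
            if d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(worklist, (dist[u], u))
    return dist
```

A pull-style variant would instead have each active node read its neighbors' labels and update only its own.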

  9. Partitioning (figure: the original graph with nodes A–J and its partitions on hosts h1 and h2)

  10. Partitioning • Each edge is assigned to a unique host (figure: the original graph and its partitions on hosts h1 and h2)

  11. Partitioning • Each edge is assigned to a unique host • All edges connect proxy nodes on the same host (figure: the original graph and its partitions on hosts h1 and h2)

  12. Partitioning • Each edge is assigned to a unique host • All edges connect proxy nodes on the same host • A node can have multiple proxies: one is the master proxy; the rest are mirror proxies (figure: partitions on hosts h1 and h2, with master and mirror proxies marked)
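The three partitioning invariants above can be illustrated with a minimal sketch; the edge-assignment rule used here (`src % num_hosts`) and the master-selection rule (lowest host holding a proxy) are stand-ins for illustration, not any of CuSP's actual policies:

```python
def partition(edges, num_hosts):
    """Assign each edge to a unique host, create proxies so that every
    edge connects proxies on the same host, and pick one proxy per node
    as the master (the rest become mirrors)."""
    local_edges = {h: [] for h in range(num_hosts)}
    proxies = {h: set() for h in range(num_hosts)}
    for (src, dst) in edges:
        h = src % num_hosts                # each edge -> a unique host
        local_edges[h].append((src, dst))
        proxies[h].update((src, dst))      # both endpoints get proxies
    master = {}                            # node -> host of master proxy
    for h in range(num_hosts):
        for v in sorted(proxies[h]):
            master.setdefault(v, h)        # lowest host seen is master
    mirrors = {h: {v for v in proxies[h] if master[v] != h}
               for h in range(num_hosts)}
    return local_edges, master, mirrors
```

Every node thus has exactly one master proxy, and a node replicated on several hosts has one mirror on each of the others.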

  13. How does Gluon synchronize the proxies? • Exploit domain knowledge • Cached copies can be stale as long as they are eventually synchronized (figure: distances (labels) from source A on the proxies of hosts h1 and h2)

  14. How does Gluon synchronize the proxies? • Exploit domain knowledge • Cached copies can be stale as long as they are eventually synchronized • Use all-reduce: • Reduce from mirror proxies to master proxy • Broadcast from master proxy to mirror proxies (figure: synchronized distances from source A on hosts h1 and h2)
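The all-reduce step above (reduce from mirrors to the master, then broadcast back) can be sketched for a min reduction such as SSSP distances; the dictionary layout is an assumption made for the example, not Gluon's actual data structures:

```python
def all_reduce_min(master_vals, mirror_vals):
    """Synchronize proxies with a 'min' all-reduce (e.g., SSSP distances).
    master_vals: {node: value at the master proxy}
    mirror_vals: {node: {host: value at that host's mirror proxy}}"""
    # Reduce phase: fold every mirror's value into the master's value.
    for node, per_host in mirror_vals.items():
        for v in per_host.values():
            master_vals[node] = min(master_vals[node], v)
    # Broadcast phase: copy the reduced value back to every mirror.
    for node, per_host in mirror_vals.items():
        for host in per_host:
            per_host[host] = master_vals[node]
    return master_vals, mirror_vals
```

After the call, every proxy of a node holds the same (minimum) value, which is what "eventually synchronized" requires.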

  15. Bulk-Asynchronous Parallel (BASP) Execution Model

  16. Bulk-Synchronous Parallel (BSP) • Execution occurs in rounds • In each round: • Each host computes independently • Each host sends a message to every other host • Each host ingests a message from every other host • Virtual barrier at the end (figure: BSP timeline for hosts h1 and h2 showing compute, idle, and communicate phases)

  17. Bulk-Asynchronous Parallel (BASP) • Execution occurs in rounds • In each local round: • Each host computes independently • Each host can send messages to other hosts • Each host can ingest messages from other hosts • No waiting or blocking (figure: BSP vs. BASP timelines for hosts h1 and h2 showing compute, idle, and communicate phases)
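A BASP local round, as described above, neither waits for messages nor blocks on sends. A minimal sketch, using in-process queues as stand-ins for the communication runtime (the `compute`, `inbox`, and `outbox` names are assumptions for the example):

```python
import queue

def basp_local_round(compute, inbox, outbox):
    """One BASP local round on a host: ingest whatever messages have
    already arrived (without blocking), do bulk computation, and send
    updates (without waiting for other hosts to receive them)."""
    received = []
    while True:
        try:
            received.append(inbox.get_nowait())  # ingest: never blocks
        except queue.Empty:
            break                                # nothing more: proceed
    updates = compute(received)                  # bulk computation
    for msg in updates:
        outbox.put(msg)                          # send: fire and forget
    return updates
```

In BSP, by contrast, the ingest step would block until a message from every other host had arrived.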

  18. Discussion: BSP vs. BASP • BASP exploits domain knowledge • Cached copies can be stale as long as they are eventually synchronized • Advantages of BASP • Faster hosts may progress and send updated values • Straggler hosts may receive updated values => less computation • Communication may overlap with computation • Disadvantages of BASP • Faster hosts may use stale values => redundant computation

  19. Challenges in Realizing BASP Systems Removing barrier changes execution semantics: • How to synchronize proxies asynchronously? • How to detect termination without blocking?

  20. Bulk-Asynchronous Communication

  21. Communication: BSP vs. BASP • Problem: synchronization of different proxies of the same vertex • Straightforward in BSP • Requirement: Same value as sequential at the end of a round • Achieved by all-reduce at the end of a round • More complicated in BASP • Requirement: Same value as sequential eventually • Must be achieved without blocking • Values sent in one round can be received in another

  22. Non-Blocking Synchronization of Proxies • Consider the master proxy of vertex v on h1 and its mirror proxy on h2 • Reduction: • Mirror updated => value sent to master and reset • Master received messages => reduce on its value • Broadcast: • Master updated => value sent to mirror • Mirror received messages => reduce on its value (figure: min-reduction example with values flowing between the master on h1 and the mirror on h2)
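The non-blocking reconciliation above can be sketched for a min reduction. The key point is that after sending, the mirror resets to the reduction's identity (infinity for min), so updates arriving in any round still reduce correctly. This `Proxy` class is an illustrative model, not Gluon-Async's implementation:

```python
INF = float("inf")

class Proxy:
    """One vertex's master proxy (on h1) and mirror proxy (on h2)
    under a 'min' reduction."""
    def __init__(self):
        self.master = INF
        self.mirror = INF

    def mirror_update(self, v):
        self.mirror = min(self.mirror, v)      # local write on h2

    def mirror_flush(self):
        """Mirror updated => value sent to master and reset to identity."""
        msg, self.mirror = self.mirror, INF
        return msg

    def master_receive(self, msg):
        """Master received a message => reduce onto its value."""
        self.master = min(self.master, msg)

    def master_broadcast(self):
        """Master updated => value sent to mirror; mirror reduces."""
        self.mirror = min(self.mirror, self.master)
```

Because min is idempotent and the mirror resets to the identity after each flush, no blocking round boundary is needed for correctness.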

  23. Bulk-Asynchronous Communication • Consider two hosts: h2 has mirror proxies for master proxies on h1 • Reduction – h2 : • Aggregates all updated mirrors into one message • Sends message only if non-empty • Broadcast – h1 : • Aggregates all updated masters into one message • Sends message only if non-empty • Reduction – h1 : • May receive zero or more messages • Broadcast – h2 : • May receive zero or more messages
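The aggregation described above (one message per peer host, sent only if non-empty) keeps communication bulk rather than per-vertex. A sketch of the mirror-side reduction step, with the `dirty` set of updated vertices as an assumed bookkeeping structure; the master-side broadcast step is symmetric:

```python
def flush_mirror_updates(dirty, mirror_vals):
    """Aggregate all updated mirrors into one message for the host
    holding their masters; return None (send nothing) if no mirror
    was updated since the last flush."""
    msg = {v: mirror_vals[v] for v in sorted(dirty)}
    dirty.clear()                  # these updates are now in flight
    return msg if msg else None    # empty => nothing to send
```

This is what makes BASP "bulk-asynchronous": it avoids the many small messages that make earlier asynchronous systems slower than BSP.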

  24. Non-Blocking Termination Detection

  25. Termination Detection: BSP vs. BASP • Semantics: No host should terminate if there is work left • Trivial in BSP • Condition: All hosts are inactive in a round • Implementation: Use distributed accumulator (blocking collective) • More complicated in BASP • Cannot use blocking collectives • Conditions are not clear

  26. Termination Detection Algorithm • Invoked at the end of each local round on each host • Implements a distributed consensus protocol • Does not rely on message delivery order • Hosts can directly send/receive messages to each other (clique network) • Uses a state machine on each host • Uses non-blocking collectives or snapshots to coordinate among hosts • Snapshots are numbered • A snapshot broadcasts the current state

  27. States and Goal • Goal: a host must move to T if and only if every other host will move to T • Intuition: hosts move to T only if every host knows that “every host knows that every host wants to move to T” • Requires two RT states • States: • A: Active • I: Idle • RT1: Ready-to-Terminate1 • RT2: Ready-to-Terminate2 • T: Terminate (figure: state diagram A → I → RT1 → RT2 → T)

  28. State Transitions • States: A (Active), I (Idle), RT1 (Ready-to-Terminate1), RT2 (Ready-to-Terminate2), T (Terminate) • Transitions (condition => action): • A => I: inactive • any state => A: active • I => RT1: prior snapshot ended => take snapshot • RT1 => RT2: prior snapshot from RT1 ended => take snapshot • RT2 => T: prior snapshot from RT2 ended => terminate
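The per-host state machine above can be sketched as follows. The snapshot mechanics are stubbed into a single `prior_snapshot_ended` flag, which is an assumption standing in for the numbered, non-blocking snapshots that coordinate the hosts:

```python
class TermDetect:
    """Sketch of the non-blocking termination state machine on one host.
    States: A (active), I (idle), RT1/RT2 (ready-to-terminate), T."""
    def __init__(self):
        self.state = "A"

    def step(self, has_work, prior_snapshot_ended):
        """Advance the state machine at the end of a local round."""
        if has_work:
            self.state = "A"        # any new work cancels termination
        elif self.state == "A":
            self.state = "I"        # host became inactive
        elif self.state == "I" and prior_snapshot_ended:
            self.state = "RT1"      # take (non-blocking) snapshot 1
        elif self.state == "RT1" and prior_snapshot_ended:
            self.state = "RT2"      # all hosts reached RT1: snapshot 2
        elif self.state == "RT2" and prior_snapshot_ended:
            self.state = "T"        # all hosts reached RT2: terminate
        return self.state
```

The two RT states give the two-level knowledge the goal requires: reaching T means every host has observed that every host was ready to terminate.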

  29. Adapting BSP programs to BASP

  30. Programs: BSP (Gluon-Sync) vs. BASP (Gluon-Async) • Asynchronous shared-memory programs can run in BASP • Resilient to stale reads • Agnostic of BSP round number (figure: Gluon-Sync and Gluon-Async program stacks, each with the CuSP partitioner [IPDPS’19], Galois [SoSP’13] on multicore CPU or IrGL [OOPSLA’16] on GPU, its communication runtime, its termination detection, and a break from the loop)

  31. Experimental Results

  32. Experimental Setup • Benchmarks: • Breadth-first search (bfs) • Connected components (cc) • K-core (kcore) • Pagerank (pr) • Single-source shortest path (sssp)

  33. Evaluated Systems

  34. Small Input Graphs • Only used for comparison with Lux, GRAPE+, and PowerSwitch • Execute <100 BSP rounds in Gluon-Sync for all benchmarks • Not expected to gain much from asynchronous execution (even in shared-memory)

  35. Strong scaling on Bridges for small graphs (2 GPUs share a physical node) • Gluon-Async is ~12x faster than Lux

  36. Friendster on 12 CPUs, each with 16 cores • Gluon-Async is ~2.5x and ~9.3x faster than GRAPE+ and PowerSwitch, respectively

  37. Large Input Graphs • Lux, GRAPE+, and PowerSwitch could not run • Execute >100 BSP rounds in Gluon-Sync for almost all benchmarks • Potential for asynchronous execution to perform better

  38. Speedup of Gluon-Async over Gluon-Sync: 64 GPUs of Bridges • Gluon-Async is ~1.4x faster than Gluon-Sync

  39. Speedup of Gluon-Async over Gluon-Sync: 128 hosts of Stampede • Gluon-Async is ~1.6x faster than Gluon-Sync

  40. Breakdown of execution time: wdc12 on 128 hosts of Stampede (diameter = 5274) • Gluon-Async reduces idle time compared to Gluon-Sync: stragglers execute fewer rounds

  41. Conclusions • Designed a bulk-asynchronous model for distributed and heterogeneous graph analytics • Gluon-Async is ~1.5x faster than Gluon-Sync at scale • Use Gluon-Async to scale out your shared-memory graph system • Gluon-Async is publicly available in Galois v5.0 • http://iss.oden.utexas.edu/?p=projects/galois (stack diagram: Galois on CPU and IrGL/CUDA on GPU, Gluon plugin, Gluon-Async communication runtime, CuSP partitioner, network via LCI/MPI)

  42. Backup slides

  43. Best Execution Times: Gluon-Async and Gluon-Sync

  44. Partitioning Time for CuSP Policies [CuSP IPDPS’19] • Additional CuSP policies implemented in a few lines of code

  45. Partitioning Quality at 128 Hosts [CuSP IPDPS’19] No single policy is fastest: depends on input and benchmark

  46. Best Partitioning Policy (clueweb12) [VLDB’18] (table: execution time in seconds)

  47. Decision Tree [VLDB’18] % difference in execution time between policy chosen by decision tree vs. optimal

  48. State Transitions: An Example with 2 Hosts (figure: snapshots numbered 0–3 with the snapshot status on H1 and H2 and the states of each host stepping through A, I, RT1, RT2, and T)
