Gluon-Async is a bulk-asynchronous parallel system for distributed and heterogeneous graph analytics. This presentation introduces its asynchronous programming and execution model and shows how it outperforms traditional BSP systems.
Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics — Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Vishwesh Jatala, V. Krishna Nandivada, Marc Snir, Keshav Pingali
Graph Analytics • Applications: machine learning and network analysis • Datasets: unstructured graphs that need TBs of memory • [Images of example graphs; credits: Wikipedia, SFL Scientific, MakeUseOf, Sentinel Visualizer]
Motivation • Most distributed graph analytics systems are bulk-synchronous parallel (BSP) • Gluon [PLDI’18], Lux [VLDB’18], Gemini [OSDI’16], PowerGraph [OSDI’12], ... • Dynamic workloads => stragglers limit scaling • Asynchronous distributed graph analytics systems are restricted and not competitive • GRAPE+ [SIGMOD’18], PowerSwitch [PPoPP’15], ASPIRE [OOPSLA’14], GraphLab [VLDB’12], … • Small messages => high synchronization overheads => slower than BSP systems • No way to reuse infrastructure to leverage GPUs
Bulk-Asynchronous Parallel (BASP) • Exploits resilience of graph applications to stale reads • Novel asynchronous programming and execution model • Retains the advantages of bulk-computation and bulk-communication • Allows progress by eliding blocking/waiting during communication • Novel non-blocking reconciliation of updates to the graph • Novel non-blocking termination detection algorithm • Easy to adapt BSP programs and systems to BASP
Gluon-Async: A BASP System • Adapted Gluon, the state-of-the-art distributed graph analytics system, to build Gluon-Async • First asynchronous distributed GPU graph analytics system • Supports arbitrary partitioning policies in CuSP with Gluon’s communication optimizations • [Architecture diagram: IrGL/CUDA on GPU and Galois on CPU plug into the Gluon-Async communication runtime, with the CuSP partitioner over the network layer (LCI/MPI)] • Gluon [PLDI’18], Galois [SoSP’13], CuSP [IPDPS’19], IrGL [OOPSLA’16], LCI [IPDPS’18] • Gluon-Async is ~1.5x faster than Gluon(-Sync) at scale
Outline • Gluon Synchronization Approach • Bulk-Asynchronous Parallel (BASP) Execution Model • Adapting BSP systems to BASP • Bulk-Asynchronous Communication • Non-Blocking Termination Detection • Adapting BSP programs to BASP • Experimental Results
Vertex Programming Model • Every node has a label • e.g., distance in single source shortest path (SSSP) • Apply an operator on an active node in the graph • e.g., relaxation operator in SSSP • Push-style: reads its label and writes to neighbors’ labels • Pull-style: reads neighbors’ labels and writes to its label • Termination: no more active nodes (or work) • Applications: breadth first search, connected components, k-core, pagerank, single source shortest path, etc.
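As a concrete illustration of the push-style operator, below is a minimal, sequential C++ sketch of SSSP relaxation over active nodes. The graph representation and worklist are hypothetical simplifications, not the Galois/Gluon API.

```cpp
#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

// Hypothetical adjacency-list graph: for each node, a list of (neighbor, weight) edges.
struct Graph {
  std::vector<std::vector<std::pair<uint32_t, uint32_t>>> edges;
  std::vector<uint32_t> dist;  // node label: distance from the source
};

// Push-style relaxation: read the active node's label, write neighbors' labels.
void sssp(Graph& g, uint32_t source) {
  g.dist.assign(g.edges.size(), std::numeric_limits<uint32_t>::max());
  g.dist[source] = 0;
  std::queue<uint32_t> active;   // worklist of active nodes
  active.push(source);
  while (!active.empty()) {      // terminate when no active nodes (work) remain
    uint32_t n = active.front();
    active.pop();
    for (auto [dst, w] : g.edges[n]) {
      uint32_t newDist = g.dist[n] + w;
      if (newDist < g.dist[dst]) {  // relax: write to the neighbor's label
        g.dist[dst] = newDist;
        active.push(dst);           // the neighbor becomes active
      }
    }
  }
}
```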
Partitioning • Each edge is assigned to a unique host • All edges connect proxy nodes on the same host • A node can have multiple proxies: one is the master proxy; the rest are mirror proxies • [Figure: original graph with nodes A–J and its partitions on hosts h1 and h2, with master and mirror proxies marked]
How does Gluon synchronize the proxies? • Exploit domain knowledge • Cached copies can be stale as long as they are eventually synchronized • Use all-reduce: • Reduce from mirror proxies to master proxy • Broadcast from master proxy to mirror proxies • [Figure: partitions on hosts h1 and h2 with each proxy’s distance label from source A, before and after synchronization]
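To make the reduce-then-broadcast pattern concrete, here is a hedged, in-memory sketch of one synchronization step for a min-reduced label such as SSSP distance. It simulates all hosts in one process and assumes (for brevity) that every host holds a proxy for every vertex; it is not Gluon's actual communication code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-host proxy table: the label of each vertex's local proxy.
struct HostState {
  std::vector<uint32_t> dist;
};

// One all-reduce style synchronization step; masterHost[v] is the host that
// owns the master proxy of vertex v.
void syncProxies(std::vector<HostState>& hosts, const std::vector<int>& masterHost) {
  size_t numVertices = hosts[0].dist.size();
  // Phase 1: reduce mirror values into the master proxy (min reduction).
  for (size_t v = 0; v < numVertices; ++v) {
    int m = masterHost[v];
    for (size_t h = 0; h < hosts.size(); ++h)
      if (static_cast<int>(h) != m)
        hosts[m].dist[v] = std::min(hosts[m].dist[v], hosts[h].dist[v]);
  }
  // Phase 2: broadcast the reduced master value back to all mirror proxies.
  for (size_t v = 0; v < numVertices; ++v)
    for (size_t h = 0; h < hosts.size(); ++h)
      hosts[h].dist[v] = hosts[masterHost[v]].dist[v];
}
```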
Bulk-Synchronous Parallel (BSP) • Execution occurs in rounds • In each round: • Each host computes independently • Each host sends a message to every other host • Each host ingests a message from every other host • Virtual barrier at the end • [Timeline: hosts h1 and h2 alternating compute, idle, and communicate phases, separated by barriers]
Bulk-Asynchronous Parallel (BASP) • Execution occurs in rounds • In each local round: • Each host computes independently • Each host can send messages to other hosts • Each host can ingest messages from other hosts • No waiting or blocking • [Timeline: BSP vs. BASP for hosts h1 and h2; BASP removes the idle time at barriers]
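The contrast between the two driver loops can be sketched as below. The hook names (computeRound, bulkSync, sendUpdates, ingestAvailable, globallyDone) are hypothetical placeholders standing in for the system's actual runtime calls; this is a structural sketch, not Gluon's implementation.

```cpp
#include <functional>

// Hypothetical runtime hooks shared by both drivers.
struct Hooks {
  std::function<bool()> computeRound;    // run local operator applications; true if local work remains
  std::function<bool(bool)> bulkSync;    // BSP: blocking all-reduce of proxies; returns true if any host has work
  std::function<void()> sendUpdates;     // BASP: send updated proxies without waiting
  std::function<void()> ingestAvailable; // BASP: ingest zero or more messages that have already arrived
  std::function<bool()> globallyDone;    // BASP: non-blocking distributed termination detection
};

// BSP: every host stalls at the barrier, so stragglers delay every round.
void runBSP(Hooks& h) {
  bool anyWork = true;
  while (anyWork) {
    bool localWork = h.computeRound();
    anyWork = h.bulkSync(localWork);   // blocking communication + barrier
  }
}

// BASP: local rounds proceed with whatever values are currently available.
void runBASP(Hooks& h) {
  while (true) {
    h.computeRound();                  // compute on possibly stale labels
    h.sendUpdates();                   // non-blocking sends
    h.ingestAvailable();               // never wait for messages
    if (h.globallyDone()) break;       // consensus without blocking collectives
  }
}
```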
Discussion: BSP vs. BASP • BASP exploits domain knowledge • Cached copies can be stale as long as they are eventually synchronized • Advantages of BASP • Faster hosts may progress and send updated values • Straggler hosts may receive updated values => less computation • Communication may overlap with computation • Disadvantages of BASP • Faster hosts may use stale values => redundant computation • [Timeline: hosts h1 and h2 under BASP]
Challenges in Realizing BASP Systems Removing barrier changes execution semantics: • How to synchronize proxies asynchronously? • How to detect termination without blocking?
Communication: BSP vs. BASP • Problem: synchronization of different proxies of the same vertex • Straightforward in BSP • Requirement: Same value as sequential at the end of a round • Achieved by all-reduce at the end of a round • More complicated in BASP • Requirement: Same value as sequential eventually • Must be achieved without blocking • Values sent in one round can be received in another
Non-Blocking Synchronization of Proxies • Consider the master proxy of vertex v on h1 and its mirror proxy on h2 • Reduction: • Mirror updated => value sent to master and reset • Master received messages => reduce on its value • Broadcast: • Master updated => value sent to mirror • Mirror received messages => reduce on its value • [Figure: min-reduction between the master proxy on h1 and the mirror proxy on h2 as values arrive]
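A hedged per-vertex sketch of this reconciliation, for a label reduced with min (e.g., SSSP distance), is shown below. Resetting the mirror to the reduction identity after sending is an assumption about what "reset" means here; the send/receive plumbing is covered on the next slide.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// A proxy holds the label value and a flag recording whether it was updated locally.
struct Proxy {
  uint32_t value   = std::numeric_limits<uint32_t>::max();
  bool     updated = false;
};

// Reduction, mirror side: if the mirror was updated, extract its value for
// shipping to the master and reset the mirror (assumed: to the min identity).
uint32_t extractMirrorUpdate(Proxy& mirror, bool& hasUpdate) {
  hasUpdate = mirror.updated;
  uint32_t v = mirror.value;
  if (mirror.updated) {
    mirror.value   = std::numeric_limits<uint32_t>::max();
    mirror.updated = false;
  }
  return v;
}

// Reduction, master side: fold a received mirror value into the master's value.
void applyToMaster(Proxy& master, uint32_t received) {
  if (received < master.value) { master.value = received; master.updated = true; }
}

// Broadcast, mirror side: fold a received master value into the mirror's value.
// Because min is idempotent, values arriving in any round are safe to apply.
void applyToMirror(Proxy& mirror, uint32_t received) {
  mirror.value = std::min(mirror.value, received);
}
```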
Bulk-Asynchronous Communication • Consider two hosts: h2 has mirror proxies for master proxies on h1 • Reduction – h2 : • Aggregates all updated mirrors into one message • Sends message only if non-empty • Broadcast – h1 : • Aggregates all updated masters into one message • Sends message only if non-empty • Reduction – h1 : • May receive zero or more messages • Broadcast – h2 : • May receive zero or more messages
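Below is a minimal MPI-based sketch of this bulk-asynchronous exchange for a min-reduced label. The (vertexId, value) message layout and function names are assumptions for illustration, not Gluon-Async's wire format; only standard MPI calls are used.

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// One entry per updated mirror: which vertex and its new value.
struct Update { uint32_t vertex; uint32_t value; };

// Sender side: aggregate all updated mirrors into one message and send it only
// if it is non-empty. MPI_Isend returns without waiting for delivery; the
// caller must keep `updated` alive until the request completes (e.g., MPI_Waitall).
void sendUpdates(int masterHost, const std::vector<Update>& updated,
                 std::vector<MPI_Request>& pending) {
  if (updated.empty()) return;              // nothing changed: send nothing
  pending.emplace_back();
  MPI_Isend(updated.data(), static_cast<int>(updated.size() * sizeof(Update)),
            MPI_BYTE, masterHost, /*tag=*/0, MPI_COMM_WORLD, &pending.back());
}

// Receiver side: ingest zero or more messages that have already arrived,
// min-reducing each received value into the local proxy; never block.
void ingestUpdates(std::vector<uint32_t>& dist) {
  while (true) {
    int arrived = 0;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, /*tag=*/0, MPI_COMM_WORLD, &arrived, &status);
    if (!arrived) break;                    // no message waiting: keep computing
    int bytes = 0;
    MPI_Get_count(&status, MPI_BYTE, &bytes);
    std::vector<Update> buf(bytes / sizeof(Update));
    MPI_Recv(buf.data(), bytes, MPI_BYTE, status.MPI_SOURCE, /*tag=*/0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (const Update& u : buf)
      if (u.value < dist[u.vertex]) dist[u.vertex] = u.value;  // min-reduce
  }
}
```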
Termination Detection: BSP vs. BASP • Semantics: No host should terminate if there is work left • Trivial in BSP • Condition: All hosts are inactive in a round • Implementation: Use distributed accumulator (blocking collective) • More complicated in BASP • Cannot use blocking collectives • Conditions are not clear
Termination Detection Algorithm • Invoked at the end of each local round on each host • Implements a distributed consensus protocol • Does not rely on message delivery order • Hosts can directly send/receive messages to each other (clique network) • Uses a state machine on each host • Uses non-blocking collectives or snapshots to coordinate among hosts • Snapshots are numbered • A snapshot broadcasts the current state
States and Goal • Goal: a host must move to T if and only if every other host will move to T • Intuition: hosts move to T only if every host knows that “every host knows that every host wants to move to T” • Requires two RT states • States: • A: Active • I: Idle • RT1: Ready-to-Terminate1 • RT2: Ready-to-Terminate2 • T: Terminate • [State diagram: A, I, RT1, RT2, T]
State Transitions • States: A: Active, I: Idle, RT1: Ready-to-Terminate1, RT2: Ready-to-Terminate2, T: Terminate • Transitions (condition => action): • A => I when the host becomes inactive • I => A when the host becomes active • I => RT1 when the prior snapshot ended (action: take snapshot) • RT1 => RT2 when the prior snapshot from RT1 ended (action: take snapshot) • RT2 => T when the prior snapshot from RT2 ended (action: terminate) • [State diagram with these transitions]
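A hedged sketch of this per-host state machine follows. The snapshot mechanics (non-blocking collectives in Gluon-Async) are abstracted into boolean inputs, and reactivation of a host that receives new work while in an RT state is omitted because it is not shown on the slide; this is not the actual implementation.

```cpp
// Per-host termination-detection states, as on the slide above.
enum class State { Active, Idle, RT1, RT2, Terminate };

// Inputs observed at the end of a local round (abstracting snapshot status).
struct Inputs {
  bool hostActive;            // the host still has local or incoming work
  bool priorSnapshotEnded;    // the previously initiated snapshot has completed
  bool snapshotFromRT1Ended;  // the snapshot taken on entering RT1 has completed
  bool snapshotFromRT2Ended;  // the snapshot taken on entering RT2 has completed
};

// Invoked at the end of each local round; sets takeSnapshot when the
// transition's action is to initiate a new (numbered) snapshot.
State step(State s, const Inputs& in, bool& takeSnapshot) {
  takeSnapshot = false;
  switch (s) {
    case State::Active:
      return in.hostActive ? State::Active : State::Idle;
    case State::Idle:
      if (in.hostActive) return State::Active;
      if (in.priorSnapshotEnded) { takeSnapshot = true; return State::RT1; }
      return State::Idle;
    case State::RT1:
      if (in.snapshotFromRT1Ended) { takeSnapshot = true; return State::RT2; }
      return State::RT1;
    case State::RT2:
      if (in.snapshotFromRT2Ended) return State::Terminate;  // action: terminate
      return State::RT2;
    case State::Terminate:
      return State::Terminate;
  }
  return s;  // unreachable
}
```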
Programs: BSP (Gluon-Sync) vs. BASP (Gluon-Async) • Asynchronous shared-memory programs can run in BASP • Resilient to stale reads • Agnostic of BSP round number • [Diagram: a Gluon-Sync program and a Gluon-Async program both use the CuSP partitioner and Galois on multicore CPU or IrGL on GPU; they differ only in the communication runtime and the termination detection used before breaking from the loop] • CuSP [IPDPS’19], Galois [SoSP’13], IrGL [OOPSLA’16]
Experimental Setup • Benchmarks: • Breadth first search (bfs) • Connected components (cc) • K-core (kcore) • Pagerank (pr) • Single source shortest path (sssp)
Small Input Graphs • Only used for comparison with Lux, GRAPE+, and PowerSwitch • Execute <100 BSP rounds in Gluon-Sync for all benchmarks • Not expected to gain much from asynchronous execution (even in shared-memory)
Strong scaling on Bridges for small graphs (2 GPUs share a physical node): Gluon-Async is ~12x faster than Lux
Friendster on 12 CPUs, each with 16 cores: Gluon-Async is ~2.5x and ~9.3x faster than GRAPE+ and PowerSwitch, respectively
Large Input Graphs • Lux, GRAPE+, and PowerSwitch could not run • Execute >100 BSP rounds in Gluon-Sync for almost all benchmarks • Potential for asynchronous execution to perform better
Speedup of Gluon-Async over Gluon-Sync: 64 GPUs of Bridges • [Plot: speedup per benchmark] • Gluon-Async is ~1.4x faster than Gluon-Sync
Speedup of Gluon-Async over Gluon-Sync: 128 hosts of Stampede • [Plot: speedup per benchmark] • Gluon-Async is ~1.6x faster than Gluon-Sync
Breakdown of execution time: wdc12 on 128 hosts of Stampede (diameter = 5274) • Gluon-Async reduces idle time compared to Gluon-Sync: stragglers execute fewer rounds
Conclusions • Designed a bulk-asynchronous model for distributed and heterogeneous graph analytics • Gluon-Async is ~1.5x faster than Gluon-Sync at scale • Use Gluon-Async to scale out your shared-memory graph system • Gluon-Async is publicly available in Galois v5.0 • [Architecture diagram: IrGL/CUDA on GPU and Galois on CPU plug into the Gluon-Async communication runtime, with the CuSP partitioner over the network layer (LCI/MPI)] • http://iss.oden.utexas.edu/?p=projects/galois
Partitioning Time for CuSP Policies [CuSP IPDPS’19] • Additional CuSP policies implemented in a few lines of code
Partitioning Quality at 128 Hosts [CuSP IPDPS’19] No single policy is fastest: depends on input and benchmark
Best Partitioning Policy (clueweb12) [VLDB’18] • [Table: execution time (sec) per partitioning policy]
Decision Tree [VLDB’18] % difference in execution time between policy chosen by decision tree vs. optimal
State Transitions: An Example with 2 Hosts • [Animation: hosts H1 and H2 each progress through the states A => I => RT1 => RT2 => T, coordinated by numbered snapshots (0 through 3) whose status is tracked on both hosts]