Phoenix is a substrate that enables recovery from fail-stop faults in distributed graph applications. It tolerates any number of failed machines and includes a class-specific recovery protocol for different graph algorithms. Phoenix is 24x faster than GraphX and has no overhead in the absence of faults.
Phoenix: A Substrate for Resilient Distributed Graph Analytics
Roshan Dathathri, Gurbinder Gill, Loc Hoang, Keshav Pingali
Phoenix
• Substrate to recover from fail-stop faults in distributed graph applications
• Tolerates an arbitrary number of failed machines, including cascading failures
• Classifies graph algorithms and uses class-specific recovery protocols
• No overhead in the absence of faults, unlike checkpointing
• 24x faster than GraphX
• Evaluated on 128 hosts using graphs of up to 1TB
• Outperforms checkpointing when up to 16 hosts fail
State of a graph
[Figure: an example graph G with nodes A-H, and the state of the graph stored as one value per node]
Distributed execution model
[Figure: graph G is partitioned across hosts h1 and h2 by CuSP [IPDPS'19]; each host computes on its partition using Galois [SoSP'13] and communicates proxy-node updates using Gluon [PLDI'18]; alternating compute and communicate phases drive the state transition]
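To make the compute/communicate structure concrete, here is a minimal single-process C++ sketch of the bulk-synchronous loop described above. Partition, computeRound, and syncProxies are illustrative names standing in for what CuSP, Galois, and Gluon provide in the real system, and the label-propagation operator is only a placeholder.

```cpp
// Minimal single-process sketch of the execution model: alternate a local
// compute phase over this host's partition with a proxy-synchronization
// phase, until no node's state changes. Not the actual D-Galois code.
#include <cstdio>
#include <vector>

struct Partition {
  std::vector<int> label;             // one state value per local node
  std::vector<std::vector<int>> adj;  // local adjacency lists
};

// Compute phase: one round of a placeholder operator (label propagation).
bool computeRound(Partition& p) {
  bool changed = false;
  for (size_t n = 0; n < p.adj.size(); ++n)
    for (int nbr : p.adj[n])
      if (p.label[nbr] < p.label[n]) { p.label[n] = p.label[nbr]; changed = true; }
  return changed;
}

// Communicate phase: in the real system, Gluon reduces and broadcasts the
// labels of proxy nodes shared with other hosts; here it is a stub.
void syncProxies(Partition&) {}

int main() {
  Partition p{{3, 1, 2}, {{1}, {2}, {0}}};  // tiny 3-node partition
  while (computeRound(p))  // compute ...
    syncProxies(p);        // ... then communicate, until quiescence
  for (int l : p.label) std::printf("%d ", l);
  std::printf("\n");
}
```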
How to recover from crashes or fail-stop faults?
[Figure: a fault is detected during synchronization; the state on the surviving host is preserved, while Phoenix re-initializes the state lost on the failed host before communication resumes]
States during algorithm execution and recovery
[Figure: nested sets of states: globally consistent states within valid states within all states; fault-free execution moves from the initial state to the final state through globally consistent states; after a fault, checkpoint-restart rolls back to a globally consistent state, whereas Phoenix moves execution to a valid state]
Classification of graph algorithms
[Figure: the four classes placed on the state-space diagram according to the set of states from which each can resume correctly: self-stabilizing, locally-correcting, globally-correcting, and globally-consistent algorithms]
Classes: examples and recovery
• Self-stabilizing algorithms: collaborative filtering, belief propagation, pull-style pagerank, pull-style graph coloring. Recovery: reinitialize lost nodes
• Locally-correcting algorithms: breadth-first search, connected components, data-driven pagerank, topology-driven k-core. Recovery: reinitialize lost nodes
• Globally-correcting algorithms: residual-based pagerank, data-driven k-core, latent Dirichlet allocation. Recovery: ?
• Globally-consistent algorithms: betweenness centrality. Recovery: restart from last checkpoint
Problem: find the k-core of an undirected graph
k-core: maximal subgraph in which every node has degree at least k
[Figure: an example graph with nodes A-H and its 3-core]
k-core algorithm (globally-correcting)
• If a node is alive (1) and its degree < k, mark it dead (0) and decrement its neighbors' degrees
[Figure: step-by-step execution on the example graph, showing the alive flags and degrees being updated]
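A minimal shared-memory C++ sketch of this operator is below; it is illustrative only and not the distributed D-Galois implementation. The worklist-driven loop keeps applying the rule on the slide until no alive node has degree below k.

```cpp
// Worklist-driven k-core sketch: if an alive node's degree drops below k,
// mark it dead and decrement its neighbors' degrees, which may in turn
// push them below k. Single-host illustration, not the D-Galois code.
#include <cstdio>
#include <queue>
#include <vector>

void kcore(const std::vector<std::vector<int>>& adj, int k,
           std::vector<int>& alive, std::vector<int>& degree) {
  size_t n = adj.size();
  alive.assign(n, 1);
  degree.resize(n);
  std::queue<int> work;
  for (size_t v = 0; v < n; ++v) {
    degree[v] = static_cast<int>(adj[v].size());
    work.push(static_cast<int>(v));
  }
  while (!work.empty()) {
    int v = work.front(); work.pop();
    if (alive[v] && degree[v] < k) {  // the operator from the slide
      alive[v] = 0;                   // mark dead (0)
      for (int u : adj[v]) {
        --degree[u];                  // decrement neighbors' degrees
        work.push(u);                 // they may now fall below k
      }
    }
  }
}

int main() {
  // Undirected graph: node 0 is a pendant attached to the triangle 1-2-3.
  std::vector<std::vector<int>> adj = {{1}, {0, 2, 3}, {1, 3}, {1, 2}};
  std::vector<int> alive, degree;
  kcore(adj, 2, alive, degree);       // the 2-core is the triangle {1, 2, 3}
  for (size_t v = 0; v < adj.size(); ++v)
    std::printf("node %zu: %s (degree %d)\n", v,
                alive[v] ? "alive" : "dead", degree[v]);
}
```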
Phoenix recovery for k-core algorithm
• Valid state: the degree of every node equals its number of alive (1) neighbors
• Any node can be marked alive (1)
[Figure: after a fault, Phoenix re-initializes the lost node state to a valid state and the algorithm execution continues on the example graph]
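The recovery step follows directly from this invariant: mark the lost nodes alive and recompute every degree as the number of alive neighbors, giving a valid state from which the normal operator can resume. The function below is a hedged illustration of that idea, not the actual Phoenix recovery code; lost, alive, and degree are assumed per-node arrays.

```cpp
// Hedged sketch of Phoenix-style recovery for k-core: restore the invariant
// that degree[v] equals v's number of alive neighbors, after conservatively
// marking lost nodes alive (the k-core operator will kill them again if
// their degree is below k). Illustrative only, not the Phoenix code.
#include <vector>

void phoenixRecoverKcore(const std::vector<std::vector<int>>& adj,
                         const std::vector<bool>& lost,  // state was on a crashed host
                         std::vector<int>& alive,
                         std::vector<int>& degree) {
  // Re-initialization: any node can be alive, so lost nodes become alive.
  for (size_t v = 0; v < adj.size(); ++v)
    if (lost[v]) alive[v] = 1;

  // Re-computation: restore the valid-state invariant
  // degree[v] == number of alive neighbors of v.
  for (size_t v = 0; v < adj.size(); ++v) {
    int d = 0;
    for (int u : adj[v]) d += alive[u];
    degree[v] = d;
  }
  // The normal k-core operator then resumes from this valid state.
}
```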
Phoenix substrate for recovery
• Phoenix is invoked when a fail-stop fault is detected
• Arguments to Phoenix depend on the algorithm class:
  • Re-initialization function
  • Re-computation function (globally-correcting algorithms)
• Phoenix recovery:
  • Re-initialize lost state and synchronize proxies
  • Re-compute and synchronize proxies (optional; not needed for locally-correcting algorithms)
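The sketch below shows how an application might hand these functions to a Phoenix-like substrate. The names phoenix::recover, RecoveryFns, and syncProxies are assumptions made for illustration; the slide only specifies that the application supplies a re-initialization function and, for globally-correcting algorithms, a re-computation function.

```cpp
// Hedged sketch of the recovery entry point of a Phoenix-like substrate.
// The API shape is assumed for illustration; it is not the actual
// Phoenix/D-Galois interface.
#include <functional>
#include <vector>

namespace phoenix {

using NodeSet = std::vector<int>;  // IDs of nodes whose state was lost

struct RecoveryFns {
  std::function<void(const NodeSet&)> reinit;  // required for all classes
  std::function<void()> recompute;             // globally-correcting only
};

// Stand-in for Gluon-style proxy synchronization across hosts.
void syncProxies() {}

// Invoked when a fail-stop fault is detected during synchronization.
void recover(const NodeSet& lostNodes, const RecoveryFns& fns) {
  fns.reinit(lostNodes);  // re-initialize lost node state
  syncProxies();          // make proxies consistent again
  if (fns.recompute) {    // skipped for locally-correcting algorithms
    fns.recompute();
    syncProxies();
  }
}

}  // namespace phoenix
```

For the k-core example, reinit would mark the lost nodes alive and recompute would restore each node's degree to its count of alive neighbors, as in the earlier sketch.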
Experimental setup
• Benchmarks:
  • Connected components (cc)
  • K-core (kcore)
  • Pagerank (pr)
  • Single-source shortest path (sssp)
• Systems:
  • D-Galois
  • Phoenix in D-Galois
  • Checkpoint-Restart (CR) in D-Galois
  • GraphX [GRADES'13]
Wrangler: fault-free total time on 32 hosts
[Figure: speedup over GraphX (log scale); geometric mean: 24x]
Stampede: fault-free execution time on 128 hosts
[Figure: execution time (s) per benchmark; D-Galois and Phoenix are identical; geometric mean overheads: CR-50: 31%, CR-500: 8%]
Stampede: execution time when faults occur on 128 hosts (pr on wdc12)
[Figure: speedup of Phoenix over CR-50 and over CR-500]
Stampede: execution time overhead when faults occur
• Recovery time of Phoenix is negligible
• Compared to fault-free execution of Phoenix, when faults occur on 128 hosts:
[Table: overhead relative to fault-free Phoenix execution, per benchmark]
Future Work
• Extend Phoenix to handle data corruption errors or Byzantine faults
• Use compilers to automatically generate Phoenix recovery functions
• Explore Phoenix-style recovery for other application domains
Conclusion
• Phoenix: substrate to recover from fail-stop faults in distributed graph applications
• Recovery protocols based on a classification of graph algorithms
• Implemented in D-Galois, the state-of-the-art distributed graph system
• Evaluated on 128 hosts using graphs of up to 1TB
• No overhead in the absence of faults, unlike checkpointing
• Outperforms checkpointing when up to 16 hosts crash
Programmer effort for Phoenix
• Globally-correcting kcore and pr:
  • 1 day of programming
  • 150 lines of code added (to 300 lines of code)
• Locally-correcting cc and sssp:
  • Negligible programming effort
  • 30 lines of code added
Stampede: execution time when faults occur on 128 hosts (cc on wdc12)
[Figure: speedup of Phoenix over CR-50 and over CR-500]
Stampede: execution time when faults occur on 128 hosts (kcore on wdc12)
[Figure: speedup of Phoenix over CR-50 and over CR-500]
Stampede: execution time when faults occur on 128 hosts (sssp on wdc12)
[Figure: speedup of Phoenix over CR-50 and over CR-500]