
Phoenix: A Substrate for Resilient Distributed Graph Analytics

Phoenix is a substrate that enables recovery from fail-stop faults in distributed graph applications. It tolerates any number of failed machines and uses class-specific recovery protocols for different classes of graph algorithms. Phoenix adds no overhead in the absence of faults and is, on average, 24x faster than GraphX.





Presentation Transcript


  1. Phoenix: A Substrate for Resilient Distributed Graph Analytics • Roshan Dathathri, Gurbinder Gill, Loc Hoang, Keshav Pingali

  2. Phoenix • Substrate to recover from fail-stop faults in distributed graph applications • Tolerates an arbitrary number of failed machines, including cascading failures • Classifies graph algorithms and uses a class-specific recovery protocol

  3. Phoenix • Substrate to recover from fail-stop faults in distributed graph applications • Tolerates an arbitrary number of failed machines, including cascading failures • Classifies graph algorithms and uses a class-specific recovery protocol • No overhead in the absence of faults, unlike checkpointing • 24x faster than GraphX • Evaluated on 128 hosts using graphs up to 1TB in size • Outperforms checkpointing when up to 16 hosts fail

  4. State of a graph [Figure: an example graph G with nodes A-H; the values labeling the nodes constitute the state of the graph]

  5. Distributed execution model [Figure: graph G is partitioned across hosts h1 and h2 by CuSP [IPDPS'19]; in each round, hosts compute on their local partitions using Galois [SoSP'13] and then communicate to synchronize the proxies of boundary nodes using Gluon [PLDI'18]; each compute-communicate round is a state transition]

  6. How to recover from crashes or fail-stop faults? [Figure: a fault is detected during synchronization; Phoenix preserves the state of nodes on surviving hosts and re-initializes the state of nodes lost on the failed host before resuming communication]

  7. States during algorithm execution and recovery [Figure: nested sets of states: globally consistent states within valid states within all states; execution moves from the initial state through globally consistent states to the final state; after a fault, checkpoint-restart resumes from a globally consistent state, whereas Phoenix resumes from a valid state]

  8. Classification of graph algorithms [Figure: a quadrant over the same state sets: self-stabilizing algorithms converge from any state; locally-correcting algorithms converge from any valid state; globally-correcting algorithms can re-compute a valid state from any state; globally-consistent algorithms require a globally consistent state]
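The slide's set notation is garbled in this transcript; as a hedged paraphrase (notation is mine: s is an execution state, and s ⇝ A means the algorithm converges from s to the correct answer A), the quadrant can be read as:

```latex
% Paraphrase of the classification; not verbatim from the slides.
% GloballyConsistent \subseteq Valid \subseteq All
\begin{align*}
\text{self-stabilizing:}    &\quad \forall s \in \mathit{All}\colon\; s \rightsquigarrow A \\
\text{locally-correcting:}  &\quad \forall s \in \mathit{Valid}\colon\; s \rightsquigarrow A \\
\text{globally-correcting:} &\quad \forall s \in \mathit{All}\;\exists f\colon\; f(s) \in \mathit{Valid} \;\wedge\; f(s) \rightsquigarrow A \\
\text{globally-consistent:} &\quad \text{only } s \in \mathit{GloballyConsistent} \text{ guarantees } s \rightsquigarrow A
\end{align*}
```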

  9. Classes: examples and recovery
  • Self-stabilizing algorithms (recovery: re-initialize lost nodes): collaborative filtering, belief propagation, pull-style pagerank, pull-style graph coloring
  • Locally-correcting algorithms (recovery: re-initialize lost nodes): breadth-first search, connected components, data-driven pagerank, topology-driven k-core
  • Globally-consistent algorithms (recovery: restart from last checkpoint): betweenness centrality
  • Globally-correcting algorithms (recovery: ?, addressed in the next slides): residual-based pagerank, data-driven k-core, latent Dirichlet allocation

  10. Problem: find k-core of an undirected graph • k-core: maximal subgraph in which every node has degree at least k [Figure: an example graph and its 3-core]

  11. k-core algorithm (globally-correcting) • If a node is alive (1) and its degree < k, mark it dead (0) and decrement its neighbors' degrees; repeat until no such node remains [Figure: step-by-step execution on the example graph; a sketch of the rule follows below]
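A minimal sequential sketch of this rule (the function name, adjacency-list representation, and worklist are my assumptions; the actual implementation is distributed, on top of D-Galois):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Sequential sketch of the k-core rule on the slide: repeatedly kill nodes
// whose degree drops below k and decrement their neighbors' degrees.
std::vector<bool> kcore(const std::vector<std::vector<int>>& adj, std::size_t k) {
    std::size_t n = adj.size();
    std::vector<std::size_t> degree(n);
    std::vector<bool> alive(n, true);           // 1 = alive, 0 = dead
    std::queue<int> worklist;

    for (int v = 0; v < static_cast<int>(n); ++v) {
        degree[v] = adj[v].size();
        if (degree[v] < k) worklist.push(v);    // seed with nodes already below k
    }
    while (!worklist.empty()) {
        int v = worklist.front(); worklist.pop();
        if (!alive[v]) continue;
        alive[v] = false;                       // mark dead (0)
        for (int u : adj[v]) {                  // decrement each neighbor's degree
            if (alive[u] && --degree[u] < k) worklist.push(u);
        }
    }
    return alive;                               // survivors form the k-core
}
```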

  12. Phoenix recovery for k-core algorithm • Valid state: the degree of every node equals its number of alive (1) neighbors • Re-initialization may mark any node alive (1) [Figure: after a fault, Phoenix re-initializes the lost nodes and re-computes degrees, then normal algorithm execution resumes]
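A hedged sketch of this globally-correcting recovery, continuing the toy representation above (function and parameter names are illustrative, not the real Phoenix API):

```cpp
#include <cstddef>
#include <vector>

// Re-initialize every lost node as alive, then restore a *valid* state by
// re-computing each node's degree as its number of alive neighbors.
void phoenixRecoverKcore(const std::vector<std::vector<int>>& adj,
                         std::vector<bool>& alive,             // preserved on survivors
                         std::vector<std::size_t>& degree,
                         const std::vector<int>& lostNodes) {  // nodes on failed hosts
    for (int v : lostNodes) alive[v] = true;   // re-initialize: any node may be alive
    for (std::size_t v = 0; v < adj.size(); ++v) {
        std::size_t d = 0;                     // re-compute: count alive neighbors
        for (int u : adj[v]) if (alive[u]) ++d;
        degree[v] = d;
    }
    // The state is now valid (degrees match alive neighbors), though not
    // necessarily globally consistent; normal k-core execution resumes from
    // here and converges to the correct k-core.
}
```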

  13. Phoenix substrate for recovery • Phoenix is invoked when a fail-stop fault is detected • Arguments to Phoenix depend on the algorithm class: a re-initialization function and, for globally-correcting algorithms, a re-computation function • Phoenix recovery: re-initialize and synchronize proxies, then re-compute and synchronize proxies (the second step is optional and is skipped for locally-correcting algorithms) [Figure: recovery flow for a locally-correcting algorithm]
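The protocol on this slide might be structured roughly as follows (all names here are hypothetical; the real substrate lives inside D-Galois and synchronizes via Gluon):

```cpp
#include <functional>
#include <vector>

// Hypothetical recovery driver: the caller supplies the class-specific
// re-initialization and (optionally) re-computation functions.
template <typename Graph>
void phoenixRecover(Graph& g,
                    const std::vector<int>& lostNodes,
                    std::function<void(Graph&, int)> reinit,    // per-node re-initialization
                    std::function<void(Graph&)> recompute,      // globally-correcting only
                    std::function<void(Graph&)> syncProxies) {  // e.g., Gluon-style sync
    for (int v : lostNodes) reinit(g, v);  // step 1: re-initialize lost proxies
    syncProxies(g);                        // ... and synchronize
    if (recompute) {                       // step 2 (optional): re-compute
        recompute(g);
        syncProxies(g);
    }
}
```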

  14. Experimental setup • Benchmarks: • Connected components (cc) • K-core (kcore) • Pagerank (pr) • Single source shortest path (sssp) • Systems: • D-Galois • Phoenix in D-Galois • Checkpoint-Restart (CR) in D-Galois • GraphX [GRADES’13]

  15. Wrangler: fault-free total time on 32 hosts [Chart: speedup over GraphX on a log scale; geometric mean: 24x]

  16. Stampede: fault-free execution time on 128 hosts [Chart: execution time (s); D-Galois and Phoenix are identical; geometric mean overheads of checkpointing: CR-50: 31%, CR-500: 8%]

  17. Stampede: execution time when faults occur on 128 hosts [Charts: pr on wdc12; speedup of Phoenix over CR-50 and over CR-500]

  18. Stampede: execution time overhead when faults occur • Recovery time of Phoenix is negligible • Compared to fault-free execution of Phoenix, when faults occur on 128 hosts: [Table: execution time overheads]

  19. Fail-stop fault-tolerant distributed graph systems [Table: comparison of existing fail-stop fault-tolerant distributed graph systems]

  20. Future Work • Extend Phoenix to handle data-corruption errors and Byzantine faults • Use compilers to generate Phoenix recovery functions automatically • Explore Phoenix-style recovery for other application domains

  21. Conclusion • Phoenix: substrate to recover from fail-stop faults in distributed graph applications • Recovery protocols based on a classification of graph algorithms • Implemented in D-Galois, the state-of-the-art distributed graph system • Evaluated on 128 hosts using graphs up to 1TB in size • No overhead in the absence of faults, unlike checkpointing • Outperforms checkpointing when up to 16 hosts crash

  22. Programmer effort for Phoenix • Globally-correcting kcore and pr: 1 day of programming, 150 lines of code added (to roughly 300 existing lines of code) • Locally-correcting cc and sssp: negligible programming effort, 30 lines of code added

  23. Phoenix substrate for recovery: globally-correcting [Figure: recovery flow for a globally-correcting algorithm]

  24. Stampede: execution time when faults occur on 128 hosts [Charts: cc on wdc12; speedup of Phoenix over CR-50 and over CR-500]

  25. Stampede: execution time when faults occur on 128 hosts [Charts: kcore on wdc12; speedup of Phoenix over CR-50 and over CR-500]

  26. Stampede: execution time when faults occur on 128 hosts [Charts: sssp on wdc12; speedup of Phoenix over CR-50 and over CR-500]
