
Phoenix: A Substrate for Resilient Distributed Graph Analytics

Phoenix is a substrate that enables recovery from fail-stop faults in distributed graph applications. It tolerates any number of failed machines and uses class-specific recovery protocols for different classes of graph algorithms. Phoenix adds no overhead in the absence of faults and is, on average, 24x faster than GraphX.





Presentation Transcript


  1. Phoenix: A Substrate for Resilient Distributed Graph Analytics • Roshan Dathathri, Gurbinder Gill, Loc Hoang, Keshav Pingali

  2. Phoenix • Substrate to recover from fail-stop faults in distributed graph applications • Tolerates an arbitrary number of failed machines, including cascading failures • Classifies graph algorithms and uses a class-specific recovery protocol

  3. Phoenix • Substrate to recover from fail-stop faults in distributed graph applications • Tolerates an arbitrary number of failed machines, including cascading failures • Classifies graph algorithms and uses a class-specific recovery protocol • No overhead in the absence of faults, unlike checkpointing • 24x faster than GraphX • Evaluated on 128 hosts using graphs up to 1TB in size • Outperforms checkpointing when up to 16 hosts fail

  4. State of a graph [Figure: an example graph G with nodes A-H; the values labeling the nodes constitute the state of the graph]

  5. Distributed execution model [Figure: graph G is partitioned across hosts h1 and h2 by CuSP [IPDPS'19]; in each round, hosts compute on their local partitions using Galois [SoSP'13] and then communicate to synchronize the proxies of boundary nodes using Gluon [PLDI'18]; each compute-communicate round is a state transition]

  6. How to recover from crashes or fail-stop faults? [Figure: a fault is detected during synchronization; Phoenix preserves the state of nodes on surviving hosts and re-initializes the state of nodes lost on the failed host before resuming communication]

  7. States during algorithm execution and recovery [Figure: nested sets of states: globally consistent states within valid states within all states; execution moves from the initial state through globally consistent states to the final state; after a fault, checkpoint-restart resumes from a globally consistent state, whereas Phoenix resumes from a valid state]

  8. Classification of graph algorithms [Figure: a quadrant over the same state sets: self-stabilizing algorithms converge from any state; locally-correcting algorithms converge from any valid state; globally-correcting algorithms can re-compute a valid state from any state; globally-consistent algorithms require a globally consistent state]
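The slide's set notation is garbled in this transcript; as a hedged paraphrase (notation is mine: s is an execution state, and s ⇝ A means the algorithm converges from s to the correct answer A), the quadrant can be read as:

```latex
% Paraphrase of the classification; not verbatim from the slides.
% GloballyConsistent \subseteq Valid \subseteq All
\begin{align*}
\text{self-stabilizing:}    &\quad \forall s \in \mathit{All}\colon\; s \rightsquigarrow A \\
\text{locally-correcting:}  &\quad \forall s \in \mathit{Valid}\colon\; s \rightsquigarrow A \\
\text{globally-correcting:} &\quad \forall s \in \mathit{All}\;\exists f\colon\; f(s) \in \mathit{Valid} \;\wedge\; f(s) \rightsquigarrow A \\
\text{globally-consistent:} &\quad \text{only } s \in \mathit{GloballyConsistent} \text{ guarantees } s \rightsquigarrow A
\end{align*}
```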

  9. Classes: examples and recovery
  • Self-stabilizing algorithms (recovery: re-initialize lost nodes): collaborative filtering, belief propagation, pull-style pagerank, pull-style graph coloring
  • Locally-correcting algorithms (recovery: re-initialize lost nodes): breadth-first search, connected components, data-driven pagerank, topology-driven k-core
  • Globally-consistent algorithms (recovery: restart from last checkpoint): betweenness centrality
  • Globally-correcting algorithms (recovery: ?, addressed in the next slides): residual-based pagerank, data-driven k-core, latent Dirichlet allocation

  10. Problem: find k-core of an undirected graph • k-core: maximal subgraph in which every node has degree at least k [Figure: an example graph and its 3-core]

  11. k-core algorithm (globally-correcting) • If a node is alive (1) and its degree < k, mark it dead (0) and decrement its neighbors' degrees; repeat until no such node remains [Figure: step-by-step execution on the example graph; a sketch of the rule follows below]
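A minimal sequential sketch of this rule (the function name, adjacency-list representation, and worklist are my assumptions; the actual implementation is distributed, on top of D-Galois):

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Sequential sketch of the k-core rule on the slide: repeatedly kill nodes
// whose degree drops below k and decrement their neighbors' degrees.
std::vector<bool> kcore(const std::vector<std::vector<int>>& adj, std::size_t k) {
    std::size_t n = adj.size();
    std::vector<std::size_t> degree(n);
    std::vector<bool> alive(n, true);           // 1 = alive, 0 = dead
    std::queue<int> worklist;

    for (int v = 0; v < static_cast<int>(n); ++v) {
        degree[v] = adj[v].size();
        if (degree[v] < k) worklist.push(v);    // seed with nodes already below k
    }
    while (!worklist.empty()) {
        int v = worklist.front(); worklist.pop();
        if (!alive[v]) continue;
        alive[v] = false;                       // mark dead (0)
        for (int u : adj[v]) {                  // decrement each neighbor's degree
            if (alive[u] && --degree[u] < k) worklist.push(u);
        }
    }
    return alive;                               // survivors form the k-core
}
```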

  12. Phoenix recovery for k-core algorithm • Valid state: the degree of every node equals its number of alive (1) neighbors • Re-initialization may mark any node alive (1) [Figure: after a fault, Phoenix re-initializes the lost nodes and re-computes degrees, then normal algorithm execution resumes]
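A hedged sketch of this globally-correcting recovery, continuing the toy representation above (function and parameter names are illustrative, not the real Phoenix API):

```cpp
#include <cstddef>
#include <vector>

// Re-initialize every lost node as alive, then restore a *valid* state by
// re-computing each node's degree as its number of alive neighbors.
void phoenixRecoverKcore(const std::vector<std::vector<int>>& adj,
                         std::vector<bool>& alive,             // preserved on survivors
                         std::vector<std::size_t>& degree,
                         const std::vector<int>& lostNodes) {  // nodes on failed hosts
    for (int v : lostNodes) alive[v] = true;   // re-initialize: any node may be alive
    for (std::size_t v = 0; v < adj.size(); ++v) {
        std::size_t d = 0;                     // re-compute: count alive neighbors
        for (int u : adj[v]) if (alive[u]) ++d;
        degree[v] = d;
    }
    // The state is now valid (degrees match alive neighbors), though not
    // necessarily globally consistent; normal k-core execution resumes from
    // here and converges to the correct k-core.
}
```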

  13. Phoenix substrate for recovery • Phoenix is invoked when a fail-stop fault is detected • Arguments to Phoenix depend on the algorithm class: a re-initialization function and, for globally-correcting algorithms, a re-computation function • Phoenix recovery: re-initialize and synchronize proxies, then re-compute and synchronize proxies (the second step is optional and is skipped for locally-correcting algorithms) [Figure: recovery flow for a locally-correcting algorithm]
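The protocol on this slide might be structured roughly as follows (all names here are hypothetical; the real substrate lives inside D-Galois and synchronizes via Gluon):

```cpp
#include <functional>
#include <vector>

// Hypothetical recovery driver: the caller supplies the class-specific
// re-initialization and (optionally) re-computation functions.
template <typename Graph>
void phoenixRecover(Graph& g,
                    const std::vector<int>& lostNodes,
                    std::function<void(Graph&, int)> reinit,    // per-node re-initialization
                    std::function<void(Graph&)> recompute,      // globally-correcting only
                    std::function<void(Graph&)> syncProxies) {  // e.g., Gluon-style sync
    for (int v : lostNodes) reinit(g, v);  // step 1: re-initialize lost proxies
    syncProxies(g);                        // ... and synchronize
    if (recompute) {                       // step 2 (optional): re-compute
        recompute(g);
        syncProxies(g);
    }
}
```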

  14. Experimental setup • Benchmarks: • Connected components (cc) • K-core (kcore) • Pagerank (pr) • Single source shortest path (sssp) • Systems: • D-Galois • Phoenix in D-Galois • Checkpoint-Restart (CR) in D-Galois • GraphX [GRADES’13]

  15. Wrangler: fault-free total time on 32 hosts [Chart: speedup over GraphX on a log scale; geometric mean: 24x]

  16. Stampede: fault-free execution time on 128 hosts [Chart: execution time (s); D-Galois and Phoenix are identical; geometric mean overheads of checkpointing: CR-50: 31%, CR-500: 8%]

  17. Stampede: execution time when faults occur on 128 hosts [Charts: pr on wdc12; speedup of Phoenix over CR-50 and over CR-500]

  18. Stampede: execution time overhead when faults occur • Recovery time of Phoenix is negligible • Compared to fault-free execution of Phoenix, when faults occur on 128 hosts: [Table: execution time overheads]

  19. Fail-stop fault-tolerant distributed graph systems [Table: comparison of existing fail-stop fault-tolerant distributed graph systems]

  20. Future Work • Extend Phoenix to handle data-corruption errors and Byzantine faults • Use compilers to generate Phoenix recovery functions automatically • Explore Phoenix-style recovery for other application domains

  21. Conclusion • Phoenix: substrate to recover from fail-stop faults in distributed graph applications • Recovery protocols based on a classification of graph algorithms • Implemented in D-Galois, the state-of-the-art distributed graph system • Evaluated on 128 hosts using graphs up to 1TB in size • No overhead in the absence of faults, unlike checkpointing • Outperforms checkpointing when up to 16 hosts crash

  22. Programmer effort for Phoenix • Globally-correcting kcore and pr: 1 day of programming, 150 lines of code added (to roughly 300 existing lines of code) • Locally-correcting cc and sssp: negligible programming effort, 30 lines of code added

  23. Phoenix substrate for recovery: globally-correcting [Figure: recovery flow for a globally-correcting algorithm]

  24. Stampede: execution time when faults occur on 128 hosts [Charts: cc on wdc12; speedup of Phoenix over CR-50 and over CR-500]

  25. Stampede: execution time when faults occur on 128 hosts [Charts: kcore on wdc12; speedup of Phoenix over CR-50 and over CR-500]

  26. Stampede: execution time when faults occur on 128 hosts [Charts: sssp on wdc12; speedup of Phoenix over CR-50 and over CR-500]
