
Towards Scalable Checkpointing for Supercomputing Applications


Presentation Transcript


  1. Towards Scalable Checkpointing for Supercomputing Applications Alison N. Norman, Ph.D. The University of Texas at Austin December 2, 2010

  2. Supercomputers Large-scale systems composed of thousands of processors

  3. Supercomputers are Useful Enable the execution of long-running, large-scale parallel applications • Use many processes • Study many things • Climate change • Protein folding • Oil spills in the Gulf

  4. Supercomputers have Weaknesses • Experience failures • Many types of failure: processor, memory, network, application I/O • Failures are frequent enough that they affect long-running, large-scale applications • Applications stop prematurely

  5. How Failures Impact an Application [Figure: processes 0 through 64K over time; each X marks a process failure that ends the run prematurely]

  6. Ensuring Progress • Applications checkpoint, or save their state, for use during recovery • Each process in the application checkpoints • A set of checkpoints, one per process, forms the application checkpoint • Checkpoints are used to resume the application if needed

  7. How Checkpoints Help [Figure: processes 0 through 64K over time; a checkpoint taken before the failure (X) lets the application resume from the saved state]

  8. Challenge • Application is many processes • Each process saves its own checkpoint • Saved state must be consistent across the application • Must be a state that could have existed

  9. When is a state not consistent? [Figure: a message is sent from Process 0 to Process 2 among processes 0 through 64K. If the receive is saved but the send is not, the state could not have existed; if the send is saved, or the receive is not saved, the state could have existed]
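
  To make the send/receive rule concrete, here is a minimal C sketch (the message table and field names are illustrative, not the dissertation's data structures) that checks whether a candidate set of per-process checkpoints forms a state that could have existed: a receive may be saved only if its matching send is also saved.

      #include <stdbool.h>

      /* One message: sent by process `src` at logical time `send_time`,
       * received by process `dst` at logical time `recv_time`. */
      typedef struct {
          int src, dst;
          int send_time, recv_time;
      } message_t;

      /* cut[p] is the logical time at which process p checkpoints.
       * An event is "saved" if it happens at or before that process's cut. */
      static bool is_consistent(const int *cut, const message_t *msgs, int nmsgs)
      {
          for (int i = 0; i < nmsgs; i++) {
              bool send_saved = msgs[i].send_time <= cut[msgs[i].src];
              bool recv_saved = msgs[i].recv_time <= cut[msgs[i].dst];
              /* Inconsistent state: the receive is saved
               * but the matching send is not. */
              if (recv_saved && !send_saved)
                  return false;
          }
          return true;
      }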

  10. Current Practice • Programmers place checkpoints • Synchronous checkpointing • Processes synchronize and then checkpoint • No in-flight communication • Advantages • Ensures correct state • Leverages programmer knowledge • Two Problems…

  11. Problem One: Programmer Burden • Programmers are trained in other fields • Physics, Chemistry, Geology, … • They should not waste time and talent implementing checkpointing [Figure: the programmer's stack: Algorithm, FORTRAN/C/C++, MPI, and Checkpointing]

  12. In addition… • Programmers want automatic checkpointing • 65% of Texas Advanced Computing Center (TACC) users responding to a survey want automatic checkpointing [1] • Only 2.5% are currently using automatic checkpointing Goal One: Reduce programmer burden by providing checkpointing support in the compiler [1] TACC User Survey, 2009

  13. Problem Two: Checkpoint Overhead • Checkpoints are written to the global file system • Causes contention • Checkpointing hurts application performance • 30 minutes for tens of thousands of processes to checkpoint on BG/L [Liang06] Goal Two: Reduce file system contention by separating process checkpoints in time while guaranteeing a correct state

  14. Our Solution: Compiler-Assisted Staggered Checkpointing • Places staggered checkpoints in application text • Statically guarantees a correct state • Reduces programmer burden by identifying the locations where each process should checkpoint • Reduces file system contention by separating individual checkpoints in time

  15. Our Algorithm: Terminology [Figure: checkpoint locations for processes 0 through 64K. A recovery line [Randell 75] connects one checkpoint location per process; it is VALID when the saved state could have existed, i.e., no receive is saved whose send is not saved]

  16. Algorithm: Three Steps • Determine communication • Identify inter-process dependences • Generate recovery lines

  17. Algorithmic Assumptions • Communication is explicit at compile-time • Number of processes is known at compile-time

  18. Step One: Determine Communication
  Find the neighbors for each process (taken from the NAS benchmark BT):
      p = sqrt(no_nodes)
      cell_coord[0][0] = node % p
      cell_coord[1][0] = node / p
      j = cell_coord[0][0] - 1
      i = cell_coord[1][0] - 1
      from_process = (i - 1 + p) % p + p * j
      MPI_Irecv(x, x, x, from_process, …)
  After substitution, the message source is a function of only the rank and the process count:
      from_process = (node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes)) % sqrt(no_nodes)
                     + sqrt(no_nodes) * (node % sqrt(no_nodes) - 1)
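
  The point of this step is that, under the algorithm's assumptions, each process's communication partners are a pure function of its rank and the process count, so the compiler can evaluate them statically. Below is a small, self-contained C sketch of that folded computation for a BT-style square process grid; it mirrors the excerpt above and is illustrative only, not the compiler's output.

      #include <math.h>
      #include <stdio.h>

      /* BT-style neighbor: ranks form a p x p grid, p = sqrt(no_nodes). */
      static int from_process(int node, int no_nodes)
      {
          int p = (int)sqrt((double)no_nodes);
          int j = node % p - 1;   /* cell_coord[0][0] - 1, as on the slide */
          int i = node / p - 1;   /* cell_coord[1][0] - 1 */
          return (i - 1 + p) % p + p * j;
      }

      int main(void)
      {
          /* For 16 processes, rank 5 receives from a rank that depends only
           * on its own rank and the process count, so it is known statically. */
          printf("%d\n", from_process(5, 16));   /* prints 3 */
          return 0;
      }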

  19. Step One: Determine Communication [Figure: message pattern among processes 0, 1, and 2 over time]

  20. Step Two: Identify Dependences • Track events within a process • Vector clocks capture inter-process dependences [Lamport 78] [Figure: three processes over time; each event carries a vector clock [P0,P1,P2], e.g. [1,0,0], [1,1,0], [2,3,2]]
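
  For reference, the standard vector-clock update rules the analysis relies on, sketched in C with illustrative names (not the dissertation's implementation): a process increments its own entry on each local or send event, and takes an element-wise maximum with the sender's clock on a receive.

      #define NPROCS 3

      /* vc[q] counts how many events of process q this clock has observed. */

      /* Local event or message send at process p. */
      static void vc_tick(int vc[NPROCS], int p)
      {
          vc[p]++;
      }

      /* Message receive at process p: merge the sender's clock, then tick. */
      static void vc_receive(int vc[NPROCS], const int sender_vc[NPROCS], int p)
      {
          for (int q = 0; q < NPROCS; q++)
              if (sender_vc[q] > vc[q])
                  vc[q] = sender_vc[q];
          vc[p]++;
      }

      /* The event with clock b "happened before" the event with clock a if b is
       * componentwise <= a and the clocks differ; these are the inter-process
       * dependences the recovery-line search must respect. */
      static int happened_before(const int b[NPROCS], const int a[NPROCS])
      {
          int leq = 1, lt = 0;
          for (int q = 0; q < NPROCS; q++) {
              if (b[q] > a[q]) leq = 0;
              if (b[q] < a[q]) lt = 1;
          }
          return leq && lt;
      }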

  21. Step Three: Identify Recovery Lines • How do we generate and evaluate valid recovery lines? [Figure: processes 0, 1, and 2 over time]

  22. Recovery Line Algorithm • Finds valid lines first, then evaluates for staggering • What does a recovery line look like? • A set of {process, checkpoint location} pairs, one for each process • Each pair must be valid with every other pair

  23. Naïve Algorithm • Depth-first search with backtracking • Example: four processes, three checkpoint locations • The search extends the line one pair at a time, e.g. {P0,L0}, {P1,L0}, {P2,L0}, {P2,L1}, {P2,L2}, {P3,L2}, {P3,L1}, {P3,L0}, backtracking when a pair is invalid • Valid recovery lines found: {P0,L0}, {P1,L0}, {P2,L0}, {P3,L0} and {P0,L0}, {P1,L0}, {P2,L0}, {P3,L1} • This doesn't work for large numbers of processes! Why?
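
  A minimal C sketch of the naïve search (illustrative names; the pairwise validity test is a stand-in for the result of the dependence analysis, not the dissertation's code). It assigns one checkpoint location to each process in turn and backtracks whenever the new pair conflicts with one already chosen, which is exactly what blows up as the number of processes grows.

      #include <stdbool.h>
      #include <stdio.h>

      #define P 4   /* processes */
      #define L 3   /* checkpoint locations per process */

      /* Stand-in: true if process a checkpointing at la is consistent with
       * process b checkpointing at lb (no receive saved without its send). */
      static bool pair_valid(int a, int la, int b, int lb)
      {
          (void)a; (void)b;
          return la == lb || la == lb + 1;   /* illustrative rule only */
      }

      static long lines_found = 0;

      static void search(int proc, int chosen[P])
      {
          if (proc == P) {                   /* complete valid recovery line */
              lines_found++;
              return;
          }
          for (int loc = 0; loc < L; loc++) {
              bool ok = true;
              for (int q = 0; q < proc; q++) /* each pair must be valid */
                  if (!pair_valid(proc, loc, q, chosen[q])) { ok = false; break; }
              if (ok) {
                  chosen[proc] = loc;
                  search(proc + 1, chosen);  /* backtracks on return */
              }
          }
      }

      int main(void)
      {
          int chosen[P];
          search(0, chosen);
          printf("valid recovery lines: %ld\n", lines_found);
          return 0;
      }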

  24. Search Space is Huge and Sparse • The search space has size L^P • L is the number of checkpoint locations • P is the number of processes • For the BT benchmark using 16 processes, there are 38^16 possible recovery lines • Valid recovery lines are a very small fraction of the total • BT with 16 processes has 191 valid recovery lines • This makes finding valid recovery lines hard

  25. Generating Recovery Lines: Our Algorithm • Reduces the L in L^P by eliminating only invalid lines • Introduces a new basic algorithm that replaces the naïve algorithm • Reduces the P in L^P with a heuristic that eliminates both invalid and valid lines

  26. Reducing L: Phases • Valid recovery lines cannot cross process synchronization points • Barriers • Collective communication • Structure the algorithm to search in phases, the intervals between these points • For the BT benchmark using 16 processes, there are now 14^16 possible recovery lines

  27. Reducing L: Merges • Some checkpoint locations are separated by only a few local operations, which minimizes the benefit of staggering between them • Merge locations that are not separated by a minimum number of operations

  28. But… • The naïve algorithm still doesn't work • Why? Redundant work! • Partial line: {P0,L0}, {P1,L2}, {P1,L1}, {P2,L2} • Whoops! {P2,L2} is incompatible with {P0,L0} • Backtracking... but we already knew this!

  29. New Basic Algorithm • Constraint-satisfaction solution synthesis [Tsang93] • Eliminates redundant work through upward propagation of constraints • Creates complete valid recovery lines • Reduces complexity to O(log(P) * L^(P-2)) How?

  30. Basic Algorithm: Example [Figure: a binary merge tree over eight processes. Each leaf block represents a process. Processes are merged into partitions, and partial valid recovery lines are formed from the partial valid recovery lines previously formed by each source partition, e.g. {{P0,L0},{P3,L0}}, {{P0,L0},{P3,L2}}, and {{P0,L1},{P3,L0}} for the partition of processes 0 and 3. The last partition forms complete valid recovery lines such as {{P1,L0},{P2,L0},{P0,L0},{P3,L0},{P5,L0},{P6,L0},{P4,L0},{P7,L0}}] • Good news: no redundant work is performed! • Bad news: the algorithm still scales with the number of processes
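
  A heavily simplified C sketch of the merge step in that example (illustrative types; the pairwise validity test is again a stand-in). Partial valid recovery lines from two disjoint partitions are combined, and only combinations whose cross-partition pairs are all valid survive, so no invalid prefix is ever re-explored.

      #include <stdbool.h>

      #define MAXPROCS 8

      typedef struct {
          int nprocs;             /* how many processes this partial line covers */
          int proc[MAXPROCS];     /* which processes they are */
          int loc[MAXPROCS];      /* chosen checkpoint location for each */
      } partial_line;

      /* Stand-in for the validity test derived from the dependence analysis. */
      static bool pair_valid(int a, int la, int b, int lb)
      {
          (void)a; (void)b;
          return la == lb;        /* illustrative rule only */
      }

      /* Merge the partial valid recovery lines of two disjoint partitions. */
      static int merge_partitions(const partial_line *left, int nleft,
                                  const partial_line *right, int nright,
                                  partial_line *out)
      {
          int nout = 0;
          for (int i = 0; i < nleft; i++) {
              for (int j = 0; j < nright; j++) {
                  bool ok = true;
                  for (int a = 0; a < left[i].nprocs && ok; a++)
                      for (int b = 0; b < right[j].nprocs && ok; b++)
                          ok = pair_valid(left[i].proc[a], left[i].loc[a],
                                          right[j].proc[b], right[j].loc[b]);
                  if (!ok)
                      continue;
                  partial_line merged = left[i];              /* copy left half */
                  for (int b = 0; b < right[j].nprocs; b++) { /* append right   */
                      merged.proc[merged.nprocs] = right[j].proc[b];
                      merged.loc[merged.nprocs]  = right[j].loc[b];
                      merged.nprocs++;
                  }
                  out[nout++] = merged;
              }
          }
          return nout;
      }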

  31. Generating Recovery Lines: Our Algorithm • Reduces the L in L^P • Introduces a new basic algorithm that replaces the naïve algorithm • Reduces the P in L^P • Introduces a heuristic

  32. Reducing P • Observation: # of checkpoint locations « # of processes • So multiple processes must be checkpointing at the same location Let’s be smart about which processes checkpoint together!

  33. Reducing P: Clumps • Clumps are sets of processes that checkpoint together • Formed from processes that communicate between consecutive checkpoint locations • The algorithm treats a clump the same way it treats a process • Processes can belong to more than one clump • Complexity is O(log(C) * L^(C-2)), where C is the number of clumps
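
  One plausible way to form clumps, sketched with a small union-find in C; the grouping rule here (processes that exchange any message within the interval end up in the same clump) is an assumption about the heuristic, not its published definition.

      #define NPROCS 8

      static int parent[NPROCS];

      static void uf_init(void)
      {
          for (int p = 0; p < NPROCS; p++)
              parent[p] = p;
      }

      static int uf_find(int p)
      {
          return parent[p] == p ? p : (parent[p] = uf_find(parent[p]));
      }

      static void uf_union(int a, int b)
      {
          parent[uf_find(a)] = uf_find(b);
      }

      typedef struct { int src, dst; } comm_t;

      /* Group processes that communicate between two consecutive checkpoint
       * locations; afterwards uf_find(p) == uf_find(q) iff p and q are in
       * the same clump, and the search can place whole clumps at once. */
      static void form_clumps(const comm_t *msgs, int nmsgs)
      {
          uf_init();
          for (int i = 0; i < nmsgs; i++)
              uf_union(msgs[i].src, msgs[i].dst);
      }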

  34. Clumps Example [Figure: four processes (0 through 3) with checkpoint locations 0 through 3]
  Clump | Processes | Locations
  a     | 0,1       | 0,1,2,3
  b     | 2,3       | 2,3
  c     | 0,1,2,3   | 1,2
  For the BT benchmark using 16 processes, there are now 5^12 + 2^1 possible recovery lines

  35. Reducing P: Clump Sets • Clumps are combined into clump sets • Each process is represented exactly once in each clump set • So when all clumps in a set are placed, a complete recovery line is formed • Complexity remains O(log(C) * L^(C-2)), but C now represents the clumps in a clump set
  Clump | Processes
  a     | 0,1
  b     | 2,3
  c     | 0,1,2,3
  For the BT benchmark using 16 processes, there are now 3(5^4) + 2^1 possible recovery lines

  36. Algorithm Summary For the 1,024-process case, we have reduced the search space by over 1,500 orders of magnitude. However, there are still 69 septillion potential lines. There is more work to be done… [Results shown for the BT benchmark]

  37. Aggressively Reducing the Search Space • Prune preliminary results using branch-and-bound • Uses our Wide-and-Flat (WAF) metric • Statically estimates which lines are more staggered • Combines the interval of checkpoint locations (width) and the number of processes that checkpoint at each (flatness) • Works with partial and complete lines • Difficult since there is direct tension between valid and staggered recovery lines
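
  The slide does not give the WAF formula, so the following C sketch is only a hypothetical scoring function in its spirit: a wide interval of checkpoint locations and few processes per location score higher, and the score can be computed for partial lines so branch-and-bound can prune early.

      /* Hypothetical wide-and-flat score for a (possibly partial) recovery line.
       * loc[i] is the checkpoint location chosen for the i-th placed process. */
      static double waf_score(const int *loc, int placed, int nlocations)
      {
          if (placed == 0 || nlocations > 64)
              return 0.0;
          int count[64] = {0};
          int min = loc[0], max = loc[0];
          for (int i = 0; i < placed; i++) {
              if (loc[i] < min) min = loc[i];
              if (loc[i] > max) max = loc[i];
              count[loc[i]]++;
          }
          int worst = 0;                        /* most crowded location */
          for (int l = 0; l < nlocations; l++)
              if (count[l] > worst) worst = count[l];
          double width    = (double)(max - min + 1);
          double flatness = 1.0 / (double)worst;
          return width * flatness;              /* illustrative combination */
      }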

  38. Pruning Policy Results [Figure: pruning-policy results for the BT benchmark]

  39. Algorithm Summary • The algorithm scales to large numbers of processes • It scales with the number of clumps, not processes • Successfully identifies lines in applications using up to 65,536 processes

  40. Evaluation by Simulation [Figure: toolchain of Compiler → Translator → Simulator → file system]
  • Compiler: inside Broadway [Guyer00], a source-to-source compiler • Accepts C code, generates C code with checkpoints
  • Assumptions: communication neighbors depend only on node rank, the number of nodes in the system, and other constants • MPI is used for communication
  • Translator: translates application source code to traces • Uses static analysis and profiling
  • Simulator: event-driven • Approximates local, communication, and checkpointing operations using models • Averages 83% accuracy

  41. Supercomputers • Ranger, a system at TACC • Lustre file system with 40 GB/s throughput • 4 2.3 GHz processors with 4 cores each per node • 62,976 cores total • Newer, more modern system; currently number 11 in the Top 500 • Experimental results use 16 processes per node • Lonestar, also a system at TACC • Lustre file system with 4.6 GB/s throughput • 2 2.66 GHz processors with 2 cores each per node • 5,840 cores total • Currently number 123 in the Top 500 • Experimental results use 1 process per node

  42. WAF Metric: Evaluation [Figure: evaluation results for the BT benchmark with 4,096 processes]

  43. Evaluation of Identified Lines • Lines placed by our algorithm improve checkpointing performance • Average 26% with a three-minute interval • Average 44% with a fifteen-minute interval • Total execution time is basically unchanged • The 7% average improvement is statistically insignificant in our simulator • There is a penalty for disrupting communication

  44. Conclusions • Staggered checkpointing can improve checkpointing performance • Our algorithm is scalable and works successfully for applications using up to 65,536 processes • Lines placed by our algorithm improve checkpointing performance by an average of 35% • Larger time intervals in which to stagger lead to larger improvements

  45. Future Work • Extend the WAF metric • Account for checkpoint size • Consider system characteristics • Consider communication disruption • Link with a checkpointing tool • Actually take checkpoints! • Reduce size through compression • Design and implement recovery algorithm

  46. Special Thanks to: Calvin Lin, Sung-Eun Choi, and the Texas Advanced Computing Center. Thank you!

  47. Thank you!

  48. Backup Slides

  49. Fault Model • Crash • Send omission • Receive omission • General omission • Arbitrary failures with message authentication • Arbitrary (Byzantine) failures

  50. Options for Fault Tolerance • Redundancy in space • Each participating process has a backup process • Expensive! • Redundancy in time • Processes save state and then roll back for recovery • Lighter-weight fault tolerance
