Towards Scalable Checkpointing for Supercomputing Applications
Alison N. Norman, Ph.D.
The University of Texas at Austin
December 2, 2010
Supercomputers Large-scale systems composed of thousands of processors
Supercomputers are Useful
Enable the execution of long-running, large-scale parallel applications
• Use many processes
• Study many things
  • Climate change
  • Protein folding
  • Oil spills in the Gulf
Supercomputers have Weaknesses
• Experience failures
• Many types of failure: processor, memory, network, application I/O
• Failures are frequent enough that they affect long-running, large-scale applications
• Applications stop prematurely
How Failures Impact an Application
[Figure: timeline of processes 0 through 64K; X marks show failures striking individual processes over time, stopping the application.]
Ensuring Progress
• Applications checkpoint, or save their state, for use during recovery
• Each process in the application checkpoints
• A set of checkpoints, one per process, forms the application checkpoint
• Checkpoints are used to resume the application if needed
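To make the idea concrete, here is a minimal sketch of per-process checkpointing in C with MPI. The app_state_t struct, the file naming, and the helper names are illustrative assumptions, not the tool described in this talk.

    /* Minimal per-process checkpointing sketch. Each rank writes its own
     * state file; the set of all such files forms the application
     * checkpoint (one checkpoint per process, as described above). */
    #include <mpi.h>
    #include <stdio.h>

    typedef struct {            /* hypothetical application state */
        int    iteration;
        double data[256];
    } app_state_t;

    static void write_checkpoint(int rank, const app_state_t *s) {
        char path[64];
        snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
    }

    /* Returns 1 if a checkpoint was found and loaded, 0 otherwise. */
    static int read_checkpoint(int rank, app_state_t *s) {
        char path[64];
        snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(int argc, char **argv) {
        int rank;
        app_state_t state = { 0 };
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (!read_checkpoint(rank, &state))   /* resume if possible */
            state.iteration = 0;              /* otherwise start fresh */
        /* ... compute, periodically calling write_checkpoint() ... */
        write_checkpoint(rank, &state);
        MPI_Finalize();
        return 0;
    }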
How Checkpoints Help
[Figure: the same process timeline; a checkpoint taken before a failure lets the application resume from the saved state rather than from the beginning.]
Challenge
• An application is many processes
• Each process saves its own checkpoint
• Saved state must be consistent across the application
  • It must be a state that could have existed
When is a state not consistent?
[Figure: a message sent from Process 0 to Process 2 across the timelines of processes 0 through 64K.]
• If the send is saved and the receive is not, the message is merely in flight: the state could have existed
• If the receive is saved but the send is not, the state could not have existed
Current Practice • Programmers place checkpoints • Synchronous checkpointing • Processes synchronize and then checkpoint • No in-flight communication • Advantages • Ensures correct state • Leverages programmer knowledge • Two Problems…
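A minimal sketch of the synchronous scheme just described, reusing the hypothetical write_checkpoint() helper from the earlier sketch: a barrier synchronizes the processes so that no messages are in flight when state is saved.

    /* Synchronous checkpointing: all processes synchronize, then
     * checkpoint together, so the saved state contains no in-flight
     * messages and is trivially consistent. */
    void checkpoint_synchronously(int rank, const app_state_t *state) {
        MPI_Barrier(MPI_COMM_WORLD);    /* quiesce communication first   */
        write_checkpoint(rank, state);  /* every rank then writes at once,
                                           which is exactly the file-system
                                           contention of Problem Two below */
    }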
Problem One: Programmer Burden
• Programmers are trained in other fields
  • Physics, Chemistry, Geology, …
• Should not waste time and talent implementing checkpointing
[Figure: the layers a programmer must manage, stacked: Algorithm, FORTRAN/C/C++, MPI, Checkpointing.]
In addition… • Programmers want automatic checkpointing • 65% of Texas Advanced Computing Center (TACC) users responding to a survey want automatic checkpointing [1] • Only 2.5% are currently using automatic checkpointing Goal One: Reduce programmer burden by providing checkpointing support in the compiler [1] TACC User Survey, 2009
Problem Two: Checkpoint Overhead
• Checkpoints are written to the global file system
  • Causes contention
• Checkpointing hurts application performance
  • 30 minutes for tens of thousands of processes to checkpoint on BG/L [Liang06]
Goal Two: Reduce file system contention by separating process checkpoints in time while guaranteeing a correct state
Our Solution: Compiler-Assisted Staggered Checkpointing • Places staggered checkpoints in application text • Statically guarantees a correct state • Reduces programmer burden by identifying the locations where each process should checkpoint • Reduces file system contention by separating individual checkpoints in time
Our Algorithm: Terminology
• Checkpoint location: a point in the application text where a process may checkpoint
• Recovery line [Randell 75]: a set of checkpoint locations, one per process; it is VALID if the saved state could have existed
[Figure: the message diagram again, with a valid recovery line drawn across the process timelines.]
Algorithm: Three Steps • Determine communication • Identify inter-process dependences • Generate recovery lines
Algorithmic Assumptions • Communication is explicit at compile-time • Number of processes is known at compile-time
Step One: Determine Communication
Find the neighbors for each process (taken from the NAS benchmark BT):

    p = sqrt(no_nodes)
    cell_coord[0][0] = node % p
    cell_coord[1][0] = node / p
    j = cell_coord[0][0] - 1
    i = cell_coord[1][0] - 1
    from_process = (i - 1 + p) % p + p * j
    MPI_Irecv(x, x, x, from_process, ...)

The analysis resolves from_process symbolically:

    from_process = (node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes)) % sqrt(no_nodes)
                   + sqrt(no_nodes) * (node % sqrt(no_nodes) - 1)
Step One: Determine Communication
[Figure: timelines of Processes 0, 1, and 2 with the discovered messages drawn between them.]
Step Two: Identify Dependences
• Track events within a process
• Vector clocks [P0,P1,P2] capture inter-process dependences [Lamport 78]
[Figure: three process timelines with a vector clock per event. Process 0's events advance [1,0,0] → [2,0,0] → [3,2,0] → [4,5,2]; Process 1's receive of Process 0's [1,0,0] yields [1,1,0], then [1,2,0], [2,3,2], [2,4,2], [2,5,2]; Process 2 advances [2,0,1] → [2,0,2] → [2,4,3].]
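As an illustration of Step Two, here is a small vector-clock sketch in C. The compiler computes these dependences statically; this runtime version, with NPROCS assumed fixed, is only for intuition.

    /* Vector clocks: each process keeps one counter per process.
     * An event's clock records everything it transitively depends on. */
    #define NPROCS 3

    typedef struct { int c[NPROCS]; } vclock_t;

    /* Local event (or send) on process p: tick p's own component. */
    static void vc_tick(vclock_t *v, int p) { v->c[p]++; }

    /* Receive on process p of a message carrying clock m: take the
     * component-wise max, then tick. The result captures the
     * inter-process dependence on the sender. */
    static void vc_receive(vclock_t *v, const vclock_t *m, int p) {
        for (int i = 0; i < NPROCS; i++)
            if (m->c[i] > v->c[i]) v->c[i] = m->c[i];
        v->c[p]++;
    }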
Step Three: Identify Recovery Lines
How do we generate and evaluate valid recovery lines?
[Figure: three process timelines with candidate recovery lines drawn across them.]
Recovery Line Algorithm • Finds valid lines first, then evaluates for staggering • What does a recovery line look like? • A set of {process, checkpoint location} pairs, one for each process • Each pair must be valid with every other pair
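The validity requirement can be written down directly. Assuming a predicate pair_valid() derived from the dependence analysis of Step Two (false when some message's receive would be saved without its send), a line is valid only if every pair is valid with every other pair. A sketch:

    #include <stdbool.h>

    /* Assumed to come from the vector-clock analysis: false when a
     * message crosses the two checkpoints in the inconsistent
     * direction (receive saved, send not saved). */
    extern bool pair_valid(int p, int loc_p, int q, int loc_q);

    /* loc[p] gives the checkpoint location chosen for process p. */
    static bool line_valid(const int *loc, int nprocs) {
        for (int p = 0; p < nprocs; p++)
            for (int q = p + 1; q < nprocs; q++)
                if (!pair_valid(p, loc[p], q, loc[q]))
                    return false;
        return true;
    }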
Naïve Algorithm
• Depth-first search with backtracking
• Example: four processes, three checkpoint locations

Search trace: {P0,L0}, {P1,L0}, {P2,L0}, {P2,L1}, {P2,L2}, {P3,L2}, {P3,L1}, {P3,L0}

Valid recovery lines found:
{P0,L0}, {P1,L0}, {P2,L0}, {P3,L0}
{P0,L0}, {P1,L0}, {P2,L0}, {P3,L1}

This doesn't work for large numbers of processes! Why?
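A sketch of that naïve search, continuing the previous sketch and its assumed pair_valid() predicate: each process tries every location in depth-first order and backtracks when a new pair conflicts with one already placed.

    /* Naive depth-first search with backtracking. In the worst case it
     * explores all L^P assignments, which is why it fails at scale.
     * emit_line() is a hypothetical callback that records a complete
     * valid recovery line. */
    extern void emit_line(const int *loc, int nprocs);

    static void dfs(int *loc, int p, int nprocs, int nlocs) {
        if (p == nprocs) {            /* all processes placed: valid line */
            emit_line(loc, nprocs);
            return;
        }
        for (int l = 0; l < nlocs; l++) {
            bool ok = true;
            for (int q = 0; q < p && ok; q++)  /* new pair vs. placed pairs */
                ok = pair_valid(p, l, q, loc[q]);
            if (ok) {
                loc[p] = l;
                dfs(loc, p + 1, nprocs, nlocs);
            }                                  /* else: backtrack */
        }
    }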
Search Space is Huge and Sparse
• Search space has size L^P
  • L is the number of checkpoint locations
  • P is the number of processes
• For the BT benchmark using 16 processes, there are 38^16 possible recovery lines
• Valid recovery lines are a tiny fraction of the total
  • BT with 16 processes has only 191 valid recovery lines
• This makes finding valid recovery lines hard
Generating Recovery Lines: Our Algorithm
• Reduces the L in L^P — eliminates only invalid lines
• Introduces a new basic algorithm — replaces the naïve algorithm
• Reduces the P in L^P with a heuristic — eliminates both invalid and valid lines
Reducing L: Phases
• Valid recovery lines cannot cross process synchronization points
  • Barriers
  • Collective communication
• Structure the algorithm to search in phases, the intervals between these points
• For the BT benchmark using 16 processes, there are now 14^16 possible recovery lines
Reducing L: Merges
• Some checkpoint locations are separated by only a few local operations
  • This minimizes the benefit of staggered checkpointing
• Merge locations not separated by a minimum number of operations (a sketch follows below)
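A sketch of the merge reduction under an assumed representation: pos[i] is the position of checkpoint location i in the program text, measured in local operations, and locations closer together than MIN_OPS are collapsed into one.

    /* Merge checkpoint locations separated by fewer than MIN_OPS local
     * operations; staggering between such locations buys little.
     * Writes the indices of surviving locations into keep[] and
     * returns how many survive. Assumes pos[] is sorted ascending. */
    #define MIN_OPS 100

    static int merge_locations(const long *pos, int nlocs, int *keep) {
        int kept = 0;
        long last = -MIN_OPS;          /* so the first location is kept */
        for (int i = 0; i < nlocs; i++) {
            if (pos[i] - last >= MIN_OPS) {
                keep[kept++] = i;
                last = pos[i];
            }
        }
        return kept;
    }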
But…
• The naïve algorithm still doesn't work
• Why? Redundant work!

Example line under construction: {P0,L0}, {P1,L2}, {P2,L2}
Whoops! {P2,L2} is incompatible with {P0,L0}. Backtracking…
Try {P1,L1} and reach {P2,L2} again: we already knew this!
New Basic Algorithm
• Constraint-Satisfaction Solution Synthesis [Tsang93]
• Eliminates redundant work through upward propagation of constraints
• Creates complete valid recovery lines
• Reduces complexity to O(log(P) · L^(P-2))
How?
Basic Algorithm: Example
• Each block represents a process; processes are merged into partitions (e.g. {1,2}, {0,3}, {5,6}, {4,7}, then {1,2,0,3} and {5,6,4,7}, then all eight), and partial valid recovery lines are formed at each merge
• Partial valid recovery lines are formed from the partial valid recovery lines previously formed by each source partition, e.g. partition {0,3} holds {{P0,L0}, {P3,L0}}, {{P0,L0}, {P3,L2}}, and {{P0,L1}, {P3,L0}}
• The last partition forms complete valid recovery lines, e.g. {{P1,L0}, {P2,L0}, {P0,L0}, {P3,L0}, {P5,L0}, {P6,L0}, {P4,L0}, {P7,L0}} and {{P1,L1}, {P2,L1}, {P0,L0}, {P3,L2}, {P5,L2}, {P6,L1}, {P4,L0}, {P7,L2}}
• Good news: no redundant work is performed!
• Bad news: the algorithm still scales with the number of processes
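A sketch of one synthesis step, again using the assumed pair_valid() predicate: two partitions are merged by joining their partial valid lines and keeping only combinations whose new cross-partition pairs are valid, so an incompatibility such as {P2,L2} versus {P0,L0} is checked exactly once and never rediscovered. Sizes are capped for brevity.

    #include <stdbool.h>
    #include <string.h>

    #define MAXP     8    /* processes covered by a partition (sketch cap) */
    #define MAXLINES 64   /* partial valid lines per partition (sketch cap) */

    typedef struct {
        int procs[MAXP];          /* which processes this partition covers */
        int nprocs;
        int line[MAXLINES][MAXP]; /* line[i][k]: location of procs[k]      */
        int nlines;
    } partition_t;

    /* Merge partitions a and b: join every partial line of a with every
     * partial line of b, keeping only joins whose cross-pairs are valid.
     * Constraints propagate upward; invalid prefixes are never revisited.
     * Assumes a->nprocs + b->nprocs <= MAXP. */
    static void merge_partitions(const partition_t *a, const partition_t *b,
                                 partition_t *out) {
        out->nprocs = a->nprocs + b->nprocs;
        memcpy(out->procs, a->procs, a->nprocs * sizeof(int));
        memcpy(out->procs + a->nprocs, b->procs, b->nprocs * sizeof(int));
        out->nlines = 0;
        for (int i = 0; i < a->nlines; i++)
            for (int j = 0; j < b->nlines; j++) {
                bool ok = true;
                for (int x = 0; x < a->nprocs && ok; x++)
                    for (int y = 0; y < b->nprocs && ok; y++)
                        ok = pair_valid(a->procs[x], a->line[i][x],
                                        b->procs[y], b->line[j][y]);
                if (ok && out->nlines < MAXLINES) {
                    memcpy(out->line[out->nlines], a->line[i],
                           a->nprocs * sizeof(int));
                    memcpy(out->line[out->nlines] + a->nprocs, b->line[j],
                           b->nprocs * sizeof(int));
                    out->nlines++;
                }
            }
    }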
Generating Recovery Lines: Our Algorithm
• Reduces the L in L^P
• Introduces a new basic algorithm that replaces the naïve algorithm
• Next: reduces the P in L^P with a heuristic
Reducing P • Observation: # of checkpoint locations « # of processes • So multiple processes must be checkpointing at the same location Let’s be smart about which processes checkpoint together!
Reducing P: Clumps
• Clumps are processes that checkpoint together
• Formed from processes that communicate between consecutive checkpoint locations
• The algorithm treats a clump the same way it treats a process
• Processes can belong to more than one clump
• Complexity is O(log(C) · L^(C-2)), where C is the number of clumps
Clumps Example
[Figure: four processes (0–3) with checkpoint locations 0–3 along each timeline.]

Clump   Processes   Locations
a       0,1         0,1,2,3
b       2,3         2,3
c       0,1,2,3     1,2

For the BT benchmark using 16 processes, there are now 5^12 + 2^1 possible recovery lines.
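A sketch of clump formation as a union-find over an assumed message list: any two processes that exchange a message between consecutive checkpoint locations are forced into the same clump and will checkpoint at the same location.

    /* Union-find over processes: after processing all messages in an
     * interval, processes sharing a root belong to the same clump. */
    typedef struct { int src, dst; } msg_t;

    static int find_root(int *parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  /* path halving */
            x = parent[x];
        }
        return x;
    }

    static void form_clumps(const msg_t *msgs, int nmsgs,
                            int *parent, int nprocs) {
        for (int p = 0; p < nprocs; p++) parent[p] = p;
        for (int m = 0; m < nmsgs; m++) {
            int a = find_root(parent, msgs[m].src);
            int b = find_root(parent, msgs[m].dst);
            if (a != b) parent[a] = b;      /* merge the two clumps */
        }
    }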
Reducing P: Clump Sets
• Clumps are combined into clump sets
• Each process is represented exactly once in each clump set
  • So when all clumps in a set are placed, a complete recovery line is formed
• Complexity remains O(log(C) · L^(C-2)), but C now represents the clumps in a clump set

Clump   Processes
a       0,1
b       2,3
c       0,1,2,3

For the BT benchmark using 16 processes, there are now 3(5^4) + 2^1 possible recovery lines.
Algorithm Summary
For the 1,024-process case, we have reduced the search space by over 1,500 orders of magnitude. However, there are still 69 septillion potential lines. There is more work to be done…
(Results for the BT benchmark.)
Aggressively Reducing the Search Space
• Prune preliminary results using branch-and-bound
• Uses our Wide-and-Flat (WAF) metric
  • Statically estimates which lines are more staggered
  • Combines the interval of checkpoint locations (width) and the number of processes that checkpoint at each location (flatness)
  • Works with partial and complete lines
• Difficult, since there is direct tension between valid and staggered recovery lines
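Since the exact WAF formula is not given here, the following is one plausible scoring sketch consistent with the description, not the dissertation's metric: reward a wide interval of locations (width) and penalize many processes piling onto one location (flatness). Branch-and-bound can then discard partial lines whose best achievable score cannot beat the current best complete line.

    /* One possible wide-and-flat style score (an assumption): span of
     * checkpoint locations used, divided by the worst-case number of
     * processes sharing a single location. Higher means more staggered.
     * Assumes every loc[p] is in [0, MAXLOCS). */
    #define MAXLOCS 64

    static double waf_score(const int *loc, int nprocs, int nlocs) {
        int count[MAXLOCS] = { 0 };
        int lo = nlocs, hi = 0, worst = 1;
        for (int p = 0; p < nprocs; p++) {
            count[loc[p]]++;
            if (loc[p] < lo) lo = loc[p];
            if (loc[p] > hi) hi = loc[p];
        }
        for (int l = lo; l <= hi; l++)
            if (count[l] > worst) worst = count[l];
        return (double)(hi - lo + 1) / (double)worst;  /* width / crowding */
    }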
Algorithm Summary • Algorithm scales to large numbers of processes • Scales with clumps not processes • Successfully identifies lines in applications using up to 65,536 processes
Evaluation by Simulation
[Pipeline: Compiler → Translator → Simulator → File System]
• Compiler: inside Broadway [Guyer00], a source-to-source compiler; accepts C code, generates C code with checkpoints
• Translator: translates application source code to traces using static analysis and profiling
• Simulator: event-driven; approximates local, communication, and checkpointing operations using models; averages 83% accuracy
• Assumptions
  • Communication neighbors depend only on node rank, the number of nodes in the system, and other constants
  • MPI used for communication
Supercomputers
• Ranger, a system at TACC
  • Lustre file system with 40 GB/s throughput
  • Four 2.3 GHz processors with 4 cores each per node
  • 62,976 cores total
  • Newer and more modern; currently #11 in the Top 500
  • Experimental results use 16 processes per node
• Lonestar, also a system at TACC
  • Lustre file system with 4.6 GB/s throughput
  • Two 2.66 GHz processors with 2 cores each per node
  • 5,840 cores total
  • Currently #123 in the Top 500
  • Experimental results use 1 process per node
WAF Metric: Evaluation
[Figure: results for the BT benchmark with 4,096 processes.]
Evaluation of Identified Lines
• Lines placed by our algorithm improve checkpointing performance
  • Average 26% improvement with a three-minute checkpoint interval
  • Average 44% improvement with a fifteen-minute checkpoint interval
• Total execution time is essentially unchanged
  • The 7% average improvement is statistically insignificant in our simulator
  • There is a penalty for disrupting communication
Conclusions
• Staggered checkpointing can improve checkpointing performance
• Our algorithm is scalable and works successfully for applications using up to 65,536 processes
• Lines placed by our algorithm improve checkpointing performance by an average of 35%
• Larger time intervals in which to stagger lead to larger improvements
Future Work • Extend the WAF metric • Account for checkpoint size • Consider system characteristics • Consider communication disruption • Link with a checkpointing tool • Actually take checkpoints! • Reduce size through compression • Design and implement recovery algorithm
Special Thanks to: Calvin Lin Sung-Eun Choi Texas Advanced Computing Center Thank you!
Fault Model
• Crash
• Send omission
• Receive omission
• General omission
• Arbitrary failures with message authentication
• Arbitrary (Byzantine) failures
Options for Fault-Tolerance
• Redundancy in space
  • Each participating process has a backup process
  • Expensive!
• Redundancy in time
  • Processes save state and then roll back for recovery
  • Lighter-weight fault tolerance