Towards Scalable Checkpointing for Supercomputing Applications
Alison N. Norman, Ph.D.
The University of Texas at Austin
December 2, 2010
Supercomputers Large-scale systems composed of thousands of processors
Supercomputers are Useful
Enable the execution of long-running, large-scale parallel applications
• Use many processes
• Study many things
  • Climate change
  • Protein folding
  • Oil spills in the Gulf
Supercomputers have Weaknesses
• Experience failures
• Many types of failure: processor, memory, network, application I/O
• Failures are frequent enough that they affect long-running, large-scale applications
• Applications stop prematurely
How Failures Impact an Application
[Figure: timeline of processes 0 through 64K; X marks show failures striking individual processes over time, stopping the application.]
Ensuring Progress
• Applications checkpoint, or save their state, for use during recovery
• Each process in the application checkpoints
• A set of checkpoints, one per process, forms the application checkpoint
• Checkpoints are used to resume the application if needed
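To make the idea concrete, here is a minimal sketch of per-process checkpointing in C with MPI. The app_state_t struct, the file naming, and the helper names are illustrative assumptions, not the tool described in this talk.

    /* Minimal per-process checkpointing sketch. Each rank writes its own
     * state file; the set of all such files forms the application
     * checkpoint (one checkpoint per process, as described above). */
    #include <mpi.h>
    #include <stdio.h>

    typedef struct {            /* hypothetical application state */
        int    iteration;
        double data[256];
    } app_state_t;

    static void write_checkpoint(int rank, const app_state_t *s) {
        char path[64];
        snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
    }

    /* Returns 1 if a checkpoint was found and loaded, 0 otherwise. */
    static int read_checkpoint(int rank, app_state_t *s) {
        char path[64];
        snprintf(path, sizeof path, "ckpt_rank%d.bin", rank);
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(int argc, char **argv) {
        int rank;
        app_state_t state = { 0 };
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (!read_checkpoint(rank, &state))   /* resume if possible */
            state.iteration = 0;              /* otherwise start fresh */
        /* ... compute, periodically calling write_checkpoint() ... */
        write_checkpoint(rank, &state);
        MPI_Finalize();
        return 0;
    }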
How Checkpoints Help
[Figure: the same process timeline; a checkpoint taken before a failure lets the application resume from the saved state rather than from the beginning.]
Challenge
• An application is many processes
• Each process saves its own checkpoint
• Saved state must be consistent across the application
  • It must be a state that could have existed
When is a state not consistent?
[Figure: a message sent from Process 0 to Process 2 across the timelines of processes 0 through 64K.]
• If the send is saved and the receive is not, the message is merely in flight: the state could have existed
• If the receive is saved but the send is not, the state could not have existed
Current Practice • Programmers place checkpoints • Synchronous checkpointing • Processes synchronize and then checkpoint • No in-flight communication • Advantages • Ensures correct state • Leverages programmer knowledge • Two Problems…
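A minimal sketch of the synchronous scheme just described, reusing the hypothetical write_checkpoint() helper from the earlier sketch: a barrier synchronizes the processes so that no messages are in flight when state is saved.

    /* Synchronous checkpointing: all processes synchronize, then
     * checkpoint together, so the saved state contains no in-flight
     * messages and is trivially consistent. */
    void checkpoint_synchronously(int rank, const app_state_t *state) {
        MPI_Barrier(MPI_COMM_WORLD);    /* quiesce communication first   */
        write_checkpoint(rank, state);  /* every rank then writes at once,
                                           which is exactly the file-system
                                           contention of Problem Two below */
    }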
Problem One: Programmer Burden
• Programmers are trained in other fields
  • Physics, Chemistry, Geology, …
• Should not waste time and talent implementing checkpointing
[Figure: the layers a programmer must manage, stacked: Algorithm, FORTRAN/C/C++, MPI, Checkpointing.]
In addition… • Programmers want automatic checkpointing • 65% of Texas Advanced Computing Center (TACC) users responding to a survey want automatic checkpointing [1] • Only 2.5% are currently using automatic checkpointing Goal One: Reduce programmer burden by providing checkpointing support in the compiler [1] TACC User Survey, 2009
Problem Two: Checkpoint Overhead
• Checkpoints are written to the global file system
  • Causes contention
• Checkpointing hurts application performance
  • 30 minutes for tens of thousands of processes to checkpoint on BG/L [Liang06]
Goal Two: Reduce file system contention by separating process checkpoints in time while guaranteeing a correct state
Our Solution: Compiler-Assisted Staggered Checkpointing • Places staggered checkpoints in application text • Statically guarantees a correct state • Reduces programmer burden by identifying the locations where each process should checkpoint • Reduces file system contention by separating individual checkpoints in time
Our Algorithm: Terminology
• Checkpoint location: a point in the application text where a process may checkpoint
• Recovery line [Randell 75]: a set of checkpoint locations, one per process; it is VALID if the saved state could have existed
[Figure: the message diagram again, with a valid recovery line drawn across the process timelines.]
Algorithm: Three Steps • Determine communication • Identify inter-process dependences • Generate recovery lines
Algorithmic Assumptions • Communication is explicit at compile-time • Number of processes is known at compile-time
Step One: Determine Communication
Find the neighbors for each process (taken from the NAS benchmark BT):

    p = sqrt(no_nodes)
    cell_coord[0][0] = node % p
    cell_coord[1][0] = node / p
    j = cell_coord[0][0] - 1
    i = cell_coord[1][0] - 1
    from_process = (i - 1 + p) % p + p * j
    MPI_Irecv(x, x, x, from_process, ...)

The analysis resolves from_process symbolically:

    from_process = (node / sqrt(no_nodes) - 1 - 1 + sqrt(no_nodes)) % sqrt(no_nodes)
                   + sqrt(no_nodes) * (node % sqrt(no_nodes) - 1)
Step One: Determine Communication
[Figure: timelines of Processes 0, 1, and 2 with the discovered messages drawn between them.]
Step Two: Identify Dependences
• Track events within a process
• Vector clocks [P0,P1,P2] capture inter-process dependences [Lamport 78]
[Figure: three process timelines with a vector clock per event. Process 0's events advance [1,0,0] → [2,0,0] → [3,2,0] → [4,5,2]; Process 1's receive of Process 0's [1,0,0] yields [1,1,0], then [1,2,0], [2,3,2], [2,4,2], [2,5,2]; Process 2 advances [2,0,1] → [2,0,2] → [2,4,3].]
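As an illustration of Step Two, here is a small vector-clock sketch in C. The compiler computes these dependences statically; this runtime version, with NPROCS assumed fixed, is only for intuition.

    /* Vector clocks: each process keeps one counter per process.
     * An event's clock records everything it transitively depends on. */
    #define NPROCS 3

    typedef struct { int c[NPROCS]; } vclock_t;

    /* Local event (or send) on process p: tick p's own component. */
    static void vc_tick(vclock_t *v, int p) { v->c[p]++; }

    /* Receive on process p of a message carrying clock m: take the
     * component-wise max, then tick. The result captures the
     * inter-process dependence on the sender. */
    static void vc_receive(vclock_t *v, const vclock_t *m, int p) {
        for (int i = 0; i < NPROCS; i++)
            if (m->c[i] > v->c[i]) v->c[i] = m->c[i];
        v->c[p]++;
    }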
Step Three: Identify Recovery Lines
How do we generate and evaluate valid recovery lines?
[Figure: three process timelines with candidate recovery lines drawn across them.]
Recovery Line Algorithm • Finds valid lines first, then evaluates for staggering • What does a recovery line look like? • A set of {process, checkpoint location} pairs, one for each process • Each pair must be valid with every other pair
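The validity requirement can be written down directly. Assuming a predicate pair_valid() derived from the dependence analysis of Step Two (false when some message's receive would be saved without its send), a line is valid only if every pair is valid with every other pair. A sketch:

    #include <stdbool.h>

    /* Assumed to come from the vector-clock analysis: false when a
     * message crosses the two checkpoints in the inconsistent
     * direction (receive saved, send not saved). */
    extern bool pair_valid(int p, int loc_p, int q, int loc_q);

    /* loc[p] gives the checkpoint location chosen for process p. */
    static bool line_valid(const int *loc, int nprocs) {
        for (int p = 0; p < nprocs; p++)
            for (int q = p + 1; q < nprocs; q++)
                if (!pair_valid(p, loc[p], q, loc[q]))
                    return false;
        return true;
    }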
Naïve Algorithm
• Depth-first search with backtracking
• Example: four processes, three checkpoint locations

Search trace: {P0,L0}, {P1,L0}, {P2,L0}, {P2,L1}, {P2,L2}, {P3,L2}, {P3,L1}, {P3,L0}

Valid recovery lines found:
{P0,L0}, {P1,L0}, {P2,L0}, {P3,L0}
{P0,L0}, {P1,L0}, {P2,L0}, {P3,L1}

This doesn't work for large numbers of processes! Why?
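A sketch of that naïve search, continuing the previous sketch and its assumed pair_valid() predicate: each process tries every location in depth-first order and backtracks when a new pair conflicts with one already placed.

    /* Naive depth-first search with backtracking. In the worst case it
     * explores all L^P assignments, which is why it fails at scale.
     * emit_line() is a hypothetical callback that records a complete
     * valid recovery line. */
    extern void emit_line(const int *loc, int nprocs);

    static void dfs(int *loc, int p, int nprocs, int nlocs) {
        if (p == nprocs) {            /* all processes placed: valid line */
            emit_line(loc, nprocs);
            return;
        }
        for (int l = 0; l < nlocs; l++) {
            bool ok = true;
            for (int q = 0; q < p && ok; q++)  /* new pair vs. placed pairs */
                ok = pair_valid(p, l, q, loc[q]);
            if (ok) {
                loc[p] = l;
                dfs(loc, p + 1, nprocs, nlocs);
            }                                  /* else: backtrack */
        }
    }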
Search Space is Huge and Sparse
• Search space has size L^P
  • L is the number of checkpoint locations
  • P is the number of processes
• For the BT benchmark using 16 processes, there are 38^16 possible recovery lines
• Valid recovery lines are a tiny fraction of the total
  • BT with 16 processes has only 191 valid recovery lines
• This makes finding valid recovery lines hard
Generating Recovery Lines: Our Algorithm
• Reduces the L in L^P — eliminates only invalid lines
• Introduces a new basic algorithm — replaces the naïve algorithm
• Reduces the P in L^P with a heuristic — eliminates both invalid and valid lines
Reducing L: Phases
• Valid recovery lines cannot cross process synchronization points
  • Barriers
  • Collective communication
• Structure the algorithm to search in phases, the intervals between these points
• For the BT benchmark using 16 processes, there are now 14^16 possible recovery lines
Reducing L: Merges
• Some checkpoint locations are separated by only a few local operations
  • This minimizes the benefit of staggered checkpointing
• Merge locations not separated by a minimum number of operations (a sketch follows below)
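A sketch of the merge reduction under an assumed representation: pos[i] is the position of checkpoint location i in the program text, measured in local operations, and locations closer together than MIN_OPS are collapsed into one.

    /* Merge checkpoint locations separated by fewer than MIN_OPS local
     * operations; staggering between such locations buys little.
     * Writes the indices of surviving locations into keep[] and
     * returns how many survive. Assumes pos[] is sorted ascending. */
    #define MIN_OPS 100

    static int merge_locations(const long *pos, int nlocs, int *keep) {
        int kept = 0;
        long last = -MIN_OPS;          /* so the first location is kept */
        for (int i = 0; i < nlocs; i++) {
            if (pos[i] - last >= MIN_OPS) {
                keep[kept++] = i;
                last = pos[i];
            }
        }
        return kept;
    }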
But…
• The naïve algorithm still doesn't work
• Why? Redundant work!

Example line under construction: {P0,L0}, {P1,L2}, {P2,L2}
Whoops! {P2,L2} is incompatible with {P0,L0}. Backtracking…
Try {P1,L1} and reach {P2,L2} again: we already knew this!
New Basic Algorithm
• Constraint-Satisfaction Solution Synthesis [Tsang93]
• Eliminates redundant work through upward propagation of constraints
• Creates complete valid recovery lines
• Reduces complexity to O(log(P) · L^(P-2))
How?
Basic Algorithm: Example
• Each block represents a process; processes are merged into partitions (e.g. {1,2}, {0,3}, {5,6}, {4,7}, then {1,2,0,3} and {5,6,4,7}, then all eight), and partial valid recovery lines are formed at each merge
• Partial valid recovery lines are formed from the partial valid recovery lines previously formed by each source partition, e.g. partition {0,3} holds {{P0,L0}, {P3,L0}}, {{P0,L0}, {P3,L2}}, and {{P0,L1}, {P3,L0}}
• The last partition forms complete valid recovery lines, e.g. {{P1,L0}, {P2,L0}, {P0,L0}, {P3,L0}, {P5,L0}, {P6,L0}, {P4,L0}, {P7,L0}} and {{P1,L1}, {P2,L1}, {P0,L0}, {P3,L2}, {P5,L2}, {P6,L1}, {P4,L0}, {P7,L2}}
• Good news: no redundant work is performed!
• Bad news: the algorithm still scales with the number of processes
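A sketch of one synthesis step, again using the assumed pair_valid() predicate: two partitions are merged by joining their partial valid lines and keeping only combinations whose new cross-partition pairs are valid, so an incompatibility such as {P2,L2} versus {P0,L0} is checked exactly once and never rediscovered. Sizes are capped for brevity.

    #include <stdbool.h>
    #include <string.h>

    #define MAXP     8    /* processes covered by a partition (sketch cap) */
    #define MAXLINES 64   /* partial valid lines per partition (sketch cap) */

    typedef struct {
        int procs[MAXP];          /* which processes this partition covers */
        int nprocs;
        int line[MAXLINES][MAXP]; /* line[i][k]: location of procs[k]      */
        int nlines;
    } partition_t;

    /* Merge partitions a and b: join every partial line of a with every
     * partial line of b, keeping only joins whose cross-pairs are valid.
     * Constraints propagate upward; invalid prefixes are never revisited.
     * Assumes a->nprocs + b->nprocs <= MAXP. */
    static void merge_partitions(const partition_t *a, const partition_t *b,
                                 partition_t *out) {
        out->nprocs = a->nprocs + b->nprocs;
        memcpy(out->procs, a->procs, a->nprocs * sizeof(int));
        memcpy(out->procs + a->nprocs, b->procs, b->nprocs * sizeof(int));
        out->nlines = 0;
        for (int i = 0; i < a->nlines; i++)
            for (int j = 0; j < b->nlines; j++) {
                bool ok = true;
                for (int x = 0; x < a->nprocs && ok; x++)
                    for (int y = 0; y < b->nprocs && ok; y++)
                        ok = pair_valid(a->procs[x], a->line[i][x],
                                        b->procs[y], b->line[j][y]);
                if (ok && out->nlines < MAXLINES) {
                    memcpy(out->line[out->nlines], a->line[i],
                           a->nprocs * sizeof(int));
                    memcpy(out->line[out->nlines] + a->nprocs, b->line[j],
                           b->nprocs * sizeof(int));
                    out->nlines++;
                }
            }
    }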
Generating Recovery Lines: Our Algorithm
• Reduces the L in L^P
• Introduces a new basic algorithm that replaces the naïve algorithm
• Next: reduces the P in L^P with a heuristic
Reducing P • Observation: # of checkpoint locations « # of processes • So multiple processes must be checkpointing at the same location Let’s be smart about which processes checkpoint together!
Reducing P: Clumps
• Clumps are processes that checkpoint together
• Formed from processes that communicate between consecutive checkpoint locations
• The algorithm treats a clump the same way it treats a process
• Processes can belong to more than one clump
• Complexity is O(log(C) · L^(C-2)), where C is the number of clumps
Clumps Example
[Figure: four processes (0–3) with checkpoint locations 0–3 along each timeline.]

Clump   Processes   Locations
a       0,1         0,1,2,3
b       2,3         2,3
c       0,1,2,3     1,2

For the BT benchmark using 16 processes, there are now 5^12 + 2^1 possible recovery lines.
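A sketch of clump formation as a union-find over an assumed message list: any two processes that exchange a message between consecutive checkpoint locations are forced into the same clump and will checkpoint at the same location.

    /* Union-find over processes: after processing all messages in an
     * interval, processes sharing a root belong to the same clump. */
    typedef struct { int src, dst; } msg_t;

    static int find_root(int *parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  /* path halving */
            x = parent[x];
        }
        return x;
    }

    static void form_clumps(const msg_t *msgs, int nmsgs,
                            int *parent, int nprocs) {
        for (int p = 0; p < nprocs; p++) parent[p] = p;
        for (int m = 0; m < nmsgs; m++) {
            int a = find_root(parent, msgs[m].src);
            int b = find_root(parent, msgs[m].dst);
            if (a != b) parent[a] = b;      /* merge the two clumps */
        }
    }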
Reducing P: Clump Sets
• Clumps are combined into clump sets
• Each process is represented exactly once in each clump set
  • So when all clumps in a set are placed, a complete recovery line is formed
• Complexity remains O(log(C) · L^(C-2)), but C now represents the clumps in a clump set

Clump   Processes
a       0,1
b       2,3
c       0,1,2,3

For the BT benchmark using 16 processes, there are now 3(5^4) + 2^1 possible recovery lines.
Algorithm Summary
For the 1,024-process case, we have reduced the search space by over 1,500 orders of magnitude. However, there are still 69 septillion potential lines. There is more work to be done…
(Results for the BT benchmark.)
Aggressively Reducing the Search Space
• Prune preliminary results using branch-and-bound
• Uses our Wide-and-Flat (WAF) metric
  • Statically estimates which lines are more staggered
  • Combines the interval of checkpoint locations (width) and the number of processes that checkpoint at each location (flatness)
  • Works with partial and complete lines
• Difficult, since there is direct tension between valid and staggered recovery lines
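Since the exact WAF formula is not given here, the following is one plausible scoring sketch consistent with the description, not the dissertation's metric: reward a wide interval of locations (width) and penalize many processes piling onto one location (flatness). Branch-and-bound can then discard partial lines whose best achievable score cannot beat the current best complete line.

    /* One possible wide-and-flat style score (an assumption): span of
     * checkpoint locations used, divided by the worst-case number of
     * processes sharing a single location. Higher means more staggered.
     * Assumes every loc[p] is in [0, MAXLOCS). */
    #define MAXLOCS 64

    static double waf_score(const int *loc, int nprocs, int nlocs) {
        int count[MAXLOCS] = { 0 };
        int lo = nlocs, hi = 0, worst = 1;
        for (int p = 0; p < nprocs; p++) {
            count[loc[p]]++;
            if (loc[p] < lo) lo = loc[p];
            if (loc[p] > hi) hi = loc[p];
        }
        for (int l = lo; l <= hi; l++)
            if (count[l] > worst) worst = count[l];
        return (double)(hi - lo + 1) / (double)worst;  /* width / crowding */
    }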
Algorithm Summary • Algorithm scales to large numbers of processes • Scales with clumps not processes • Successfully identifies lines in applications using up to 65,536 processes
Evaluation by Simulation
[Pipeline: Compiler → Translator → Simulator → File System]
• Compiler: inside Broadway [Guyer00], a source-to-source compiler; accepts C code, generates C code with checkpoints
• Translator: translates application source code to traces using static analysis and profiling
• Simulator: event-driven; approximates local, communication, and checkpointing operations using models; averages 83% accuracy
• Assumptions
  • Communication neighbors depend only on node rank, the number of nodes in the system, and other constants
  • MPI used for communication
Supercomputers
• Ranger, a system at TACC
  • Lustre file system with 40 GB/s throughput
  • Four 2.3 GHz processors with 4 cores each per node
  • 62,976 cores total
  • Newer and more modern; currently #11 in the Top 500
  • Experimental results use 16 processes per node
• Lonestar, also a system at TACC
  • Lustre file system with 4.6 GB/s throughput
  • Two 2.66 GHz processors with 2 cores each per node
  • 5,840 cores total
  • Currently #123 in the Top 500
  • Experimental results use 1 process per node
WAF Metric: Evaluation
[Figure: results for the BT benchmark with 4,096 processes.]
Evaluation of Identified Lines
• Lines placed by our algorithm improve checkpointing performance
  • Average 26% improvement with a three-minute checkpoint interval
  • Average 44% improvement with a fifteen-minute checkpoint interval
• Total execution time is essentially unchanged
  • The 7% average improvement is statistically insignificant in our simulator
  • There is a penalty for disrupting communication
Conclusions
• Staggered checkpointing can improve checkpointing performance
• Our algorithm is scalable and works successfully for applications using up to 65,536 processes
• Lines placed by our algorithm improve checkpointing performance by an average of 35%
• Larger time intervals in which to stagger lead to larger improvements
Future Work • Extend the WAF metric • Account for checkpoint size • Consider system characteristics • Consider communication disruption • Link with a checkpointing tool • Actually take checkpoints! • Reduce size through compression • Design and implement recovery algorithm
Special Thanks to: Calvin Lin Sung-Eun Choi Texas Advanced Computing Center Thank you!
Fault Model
• Crash
• Send omission
• Receive omission
• General omission
• Arbitrary failures with message authentication
• Arbitrary (Byzantine) failures
Options for Fault-Tolerance
• Redundancy in space
  • Each participating process has a backup process
  • Expensive!
• Redundancy in time
  • Processes save state and then roll back for recovery
  • Lighter-weight fault tolerance