Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra Presented by Todd Gamblin
Background: Failure
• MTTF of high-performance computers is becoming shorter than the execution times of HPC applications
  • Even 10,000 processors can imply failures every hour
  • BlueGene/L already has 65,000 processors, and will have 131,000
• Why so bad?
  • Commodity parts are cheap to use, so more and more computers are being built with them
  • Commodity parts => commodity-targeted MTTF
  • Great for your desktop, not so great for 100,000 of them
General solution: Checkpointing
• Save application state at synchronization points so that we can recover from a fault
• Option 1: Save state to disk or other stable storage
  • High overhead for copying data to disk
  • Persistent: can be used for two-level fault tolerance, and can survive failure of all processors
• Option 2: Keep redundant copies of state in the memory of other processors
  • Called diskless checkpointing
  • Faster, but can't survive total failure
What policy?
• Runtime system does checkpointing
  • Completely general; no programmer effort
  • Must save everything
  • Binary memory dumps rule out recovery on heterogeneous systems
  • Binary dumps don't work in some cases (round-off error in reversed FP computations during rollback causes failure)
• Application does checkpointing (the authors favor this approach)
  • Requires programmer effort
  • Can streamline the amount of state that needs to be saved
  • Gets consistency for free by placing checkpoints at the application's synchronization points
  • Can store machine-independent data, so recovery works across diverse systems
FT-MPI: Application-level checkpointing
• What happens to communicators after a failure?
  • Abort everything (the default in all MPI implementations)
  • Failed processes just die; others keep running, and MPI_COMM_WORLD has holes
  • Failed processes die, but MPI_COMM_WORLD shrinks and ranks can change
  • Failed processes are respawned; ranks are the same and MPI_COMM_WORLD keeps its size
• What happens to messages on failure?
  • All operations that would have returned MPI_SUCCESS finish properly, even if a process died
  • All operations in a collective communication fail if a process fails
• That's it! Everything else is up to the application
Diskless Checkpointing
• So… we should probably do something about those faults, since FT-MPI doesn't.
• The paper tells us how to restore the state for floating-point data
• Two schemes:
  • Mirrored: store copies of data on neighbors
  • Checksum: store checksums of FP values on neighbors
Neighbor-based Checkpointing
• With redundant (dedicated checkpoint) processors:
  • Survives up to n failures, so long as a checkpoint processor and its compute processor don't both fail
• Without redundant processors (neighbors hold each other's checkpoints):
  • Survives up to floor(n/2) failures, again depending on their distribution
  • Two neighboring processors can't both fail
  • Best fault tolerance of these schemes, but still can't have neighbors fail
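The mirroring idea above can be sketched in a few lines. This is an illustrative Python toy, not the paper's MPI code; the ring layout (each processor mirroring its left neighbor) is an assumption for the example:

```python
# Toy model of neighbor-based mirroring (illustrative sketch):
# processor i keeps a copy of processor (i - 1) % n's checkpoint, so a
# failure is recoverable unless a processor and its mirror both fail.
n = 4
state = [10.0, 20.0, 30.0, 40.0]                  # each processor's FP state
mirror = [state[(i - 1) % n] for i in range(n)]   # proc i holds proc i-1's copy

failed = {2}                                      # processor 2 dies
# recovery works only if no failed processor's mirror (right neighbor) also died
assert all((f + 1) % n not in failed for f in failed)
restored = {f: mirror[(f + 1) % n] for f in failed}
print(restored)  # {2: 30.0}
```

The "neighbors can't both fail" condition on the slide is exactly the `assert` above: the checkpoint of a failed processor lives only on its right neighbor.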
Basic Checksums
• A straight sum of the FP numbers, stored on a checkpoint processor
• Can't withstand more than one failure per group
  • On failure: one equation, one unknown => recalculate the unknown
• Likelihood of surviving depends on the distribution of failures across groups
• Checkpoint encodings can be done in parallel
• Probability of failure: (the slide's formula was an image and is not preserved here)
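The one-equation, one-unknown recovery is simple enough to show directly. A minimal sketch (plain Python, one value per processor for illustration):

```python
# Basic checksum checkpointing (illustrative sketch): one checkpoint
# processor stores the straight FP sum of every compute processor's data.
data = [1.5, 2.25, 4.0, 8.5]   # local FP value on each of 4 compute procs
checksum = sum(data)           # held on the checkpoint processor

lost = 2                       # a single compute processor fails
survivors = sum(v for i, v in enumerate(data) if i != lost)
recovered = checksum - survivors   # one equation, one unknown
print(recovered)  # 4.0
```

With two failures there would be two unknowns but still only one equation, which is why this scheme tolerates at most one failure per group.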
Weighted Checksums
• Can survive failures as long as there are at least as many live checksum processors as dead compute nodes
• Each checksum processor stores a weighted sum, i.e., one equation over the data at each Pi; on failure we solve this system of equations to regenerate the lost data
• Multiple groups of this setup can be used (the slide showed the layout in a figure)
• Weightings and the number of checkpoint nodes can be adapted to the reliability of particular subgroups
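The recovery step can be sketched with NumPy. The sizes and weight matrix here are hypothetical; the point is that each checksum row is one equation, and the columns of the failed processors form the subsystem we solve:

```python
import numpy as np

# Weighted-checksum recovery (illustrative sketch, hypothetical sizes).
rng = np.random.default_rng(0)
n, m = 6, 3                        # 6 compute procs, 3 checksum procs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # each proc's FP state
W = rng.standard_normal((m, n))    # weight matrix (Gaussian; see next slide)
c = W @ x                          # weighted checksums on checkpoint procs

failed = [1, 4]                    # 2 failures <= 3 live checksum procs
alive = [i for i in range(n) if i not in failed]
# c = W[:, alive] @ x[alive] + W[:, failed] @ x[failed]: solve for the lost part
rhs = c - W[:, alive] @ x[alive]
lost_vals, *_ = np.linalg.lstsq(W[:, failed], rhs, rcond=None)
print(np.allclose(lost_vals, x[failed]))  # True
```

Two failures give a system with two unknowns and three equations, so it is solvable as long as the relevant submatrix of W has full rank, which motivates the conditioning discussion on the next slide.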
Need to avoid numerical error
• Recomputing checkpoints involves solving a system of equations
• Need a well-conditioned weighting matrix to do this
  • Also need every submatrix to be well-conditioned
• Solution: use a Gaussian random matrix
  • Gaussian random matrices are well-conditioned with high probability
  • Nice property: any submatrix of a matrix with Gaussian random entries is itself Gaussian
  • Average loss of 1 digit of precision on reconstruction
  • Probability of losing 2 digits is 3.1e-11
• See the paper (actually another referenced paper) for details of the proof
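The conditioning claim can be checked empirically. This sketch (hypothetical sizes, not from the paper) samples square submatrices of a Gaussian weight matrix and inspects their condition numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 32))   # Gaussian weight matrix, 8 checksum rows

# Any column subset of W is itself a Gaussian random matrix, so the
# recovery subsystem stays well-conditioned with high probability.
conds = []
for _ in range(500):
    cols = rng.choice(32, size=8, replace=False)
    conds.append(np.linalg.cond(W[:, cols]))

# Condition numbers stay modest, consistent with the slide's claim of
# roughly one digit of precision lost on reconstruction.
print(float(np.median(conds)))
```

A condition number around 10^k costs roughly k digits of precision in the solve, which is how the "average loss of 1 digit" figure connects to conditioning.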
Results
• Tested checkpointing & FT-MPI with a conjugate gradient solver
  • Only checkpointing 3 vectors and 2 scalars (light load)
  • More performance overhead than the mirrored approach, but half the storage overhead
• Performance of FT-MPI
  • Comparable to MPICH2 (slightly faster), 2x the speed of MPICH-1
• Overhead of weighted checkpointing
  • About 2% for 5 checkpoint nodes, 64 compute nodes
• Overhead of recovery
  • About 1% for 5 checkpoint nodes, 64 compute nodes
• Numerical error in the solver's residuals < 5.0e-6
Questions
• How easy would it be to automate FP checkpointing like this? It seems like a pain to add to everything.
  • The authors suggest adding it to numerical packages
  • Could we make a tool? CpPablo?
• Can we make the weights/groups of checkpointed processors adaptive?
  • e.g., we might want to assign groups based on hot/cold areas in the machine room
• What other ways are there around the problems of binary checkpointing in heterogeneous environments?