Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra Presented by Todd Gamblin
Background: Failure
• MTTF of high-performance computers is becoming shorter than the execution times of HPC applications
  • Even 10,000 processors can imply failures every hour
  • BlueGene/L already has 65,000 processors, and will have 131,000
• Why so bad?
  • Commodity parts are cheap to use, so more and more computers are being built with them
  • Commodity parts => commodity-targeted MTTF
  • Great for your desktop, not so great for 100,000 of them
General solution: Checkpointing
• Save application state at synchronization points so that we can recover from a fault
• Option 1: Save state to disk or other stable storage
  • High overhead for copying data to disk
  • Persistent: can be used for two-level fault tolerance, and can survive failure of all processors
• Option 2: Keep redundant copies of state in the memory of other processors
  • Called diskless checkpointing
  • Faster, but can't survive total failure
What policy?
• Runtime system does checkpointing
  • Completely general; no programmer effort
  • Must save everything
  • Binary memory dumps rule out recovery on heterogeneous systems
  • Binary dumps don't work in some cases (round-off error in reversed FP computations during rollback causes failure)
• Application does checkpointing (the authors favor this approach)
  • Requires programmer effort
  • Can streamline the amount of state that needs to be saved
  • Gets consistency for free by placing checkpoints at the application's synchronization points
  • Can store machine-independent data, so recovery works across diverse systems
FT-MPI: Application-level checkpointing
• What happens to communicators after a failure?
  • Abort everything (the default in all MPI implementations)
  • Failed processes just die; others keep running, and MPI_COMM_WORLD has holes
  • Failed processes die, but MPI_COMM_WORLD shrinks and ranks can change
  • Failed processes are respawned; ranks are the same and MPI_COMM_WORLD keeps its size
• What happens to messages on failure?
  • All operations that would have returned MPI_SUCCESS finish properly, even if a process died
  • All operations in a collective communication fail if a process fails
• That's it! Everything else is up to the application
Diskless Checkpointing
• So… we should probably do something about those faults, since FT-MPI doesn't.
• The paper tells us how to restore the state for floating-point data
• Two schemes:
  • Mirrored: store copies of data on neighbors
  • Checksum: store checksums of FP values on neighbors
Neighbor-based Checkpointing
• With redundant (dedicated checkpoint) processors:
  • Survives up to n failures, so long as a checkpoint processor and its compute processor don't both fail
• Without redundant processors (neighbors hold each other's checkpoints):
  • Survives up to floor(n/2) failures, again depending on their distribution
  • Two neighboring processors can't both fail
  • Best fault tolerance of these schemes, but still can't have neighbors fail
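The mirroring idea above can be sketched in a few lines. This is an illustrative Python toy, not the paper's MPI code; the ring layout (each processor mirroring its left neighbor) is an assumption for the example:

```python
# Toy model of neighbor-based mirroring (illustrative sketch):
# processor i keeps a copy of processor (i - 1) % n's checkpoint, so a
# failure is recoverable unless a processor and its mirror both fail.
n = 4
state = [10.0, 20.0, 30.0, 40.0]                  # each processor's FP state
mirror = [state[(i - 1) % n] for i in range(n)]   # proc i holds proc i-1's copy

failed = {2}                                      # processor 2 dies
# recovery works only if no failed processor's mirror (right neighbor) also died
assert all((f + 1) % n not in failed for f in failed)
restored = {f: mirror[(f + 1) % n] for f in failed}
print(restored)  # {2: 30.0}
```

The "neighbors can't both fail" condition on the slide is exactly the `assert` above: the checkpoint of a failed processor lives only on its right neighbor.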
Basic Checksums
• A straight sum of the FP numbers, stored on a checkpoint processor
• Can't withstand more than one failure per group
  • On failure: one equation, one unknown => recalculate the unknown
• Likelihood of surviving depends on the distribution of failures across groups
• Checkpoint encodings can be done in parallel
• Probability of failure: (the slide's formula was an image and is not preserved here)
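The one-equation, one-unknown recovery is simple enough to show directly. A minimal sketch (plain Python, one value per processor for illustration):

```python
# Basic checksum checkpointing (illustrative sketch): one checkpoint
# processor stores the straight FP sum of every compute processor's data.
data = [1.5, 2.25, 4.0, 8.5]   # local FP value on each of 4 compute procs
checksum = sum(data)           # held on the checkpoint processor

lost = 2                       # a single compute processor fails
survivors = sum(v for i, v in enumerate(data) if i != lost)
recovered = checksum - survivors   # one equation, one unknown
print(recovered)  # 4.0
```

With two failures there would be two unknowns but still only one equation, which is why this scheme tolerates at most one failure per group.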
Weighted Checksums
• Can survive failures as long as there are at least as many live checksum processors as dead compute nodes
• Each checksum processor stores a weighted sum, i.e., one equation over the data at each Pi; on failure we solve this system of equations to regenerate the lost data
• Multiple groups of this setup can be used (the slide showed the layout in a figure)
• Weightings and the number of checkpoint nodes can be adapted to the reliability of particular subgroups
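The recovery step can be sketched with NumPy. The sizes and weight matrix here are hypothetical; the point is that each checksum row is one equation, and the columns of the failed processors form the subsystem we solve:

```python
import numpy as np

# Weighted-checksum recovery (illustrative sketch, hypothetical sizes).
rng = np.random.default_rng(0)
n, m = 6, 3                        # 6 compute procs, 3 checksum procs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # each proc's FP state
W = rng.standard_normal((m, n))    # weight matrix (Gaussian; see next slide)
c = W @ x                          # weighted checksums on checkpoint procs

failed = [1, 4]                    # 2 failures <= 3 live checksum procs
alive = [i for i in range(n) if i not in failed]
# c = W[:, alive] @ x[alive] + W[:, failed] @ x[failed]: solve for the lost part
rhs = c - W[:, alive] @ x[alive]
lost_vals, *_ = np.linalg.lstsq(W[:, failed], rhs, rcond=None)
print(np.allclose(lost_vals, x[failed]))  # True
```

Two failures give a system with two unknowns and three equations, so it is solvable as long as the relevant submatrix of W has full rank, which motivates the conditioning discussion on the next slide.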
Need to avoid numerical error
• Recomputing checkpoints involves solving a system of equations
• Need a well-conditioned weighting matrix to do this
  • Also need every submatrix to be well-conditioned
• Solution: use a Gaussian random matrix
  • Gaussian random matrices are well-conditioned with high probability
  • Nice property: any submatrix of a matrix with Gaussian random entries is itself Gaussian
  • Average loss of 1 digit of precision on reconstruction
  • Probability of losing 2 digits is 3.1e-11
• See the paper (actually another referenced paper) for details of the proof
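The conditioning claim can be checked empirically. This sketch (hypothetical sizes, not from the paper) samples square submatrices of a Gaussian weight matrix and inspects their condition numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 32))   # Gaussian weight matrix, 8 checksum rows

# Any column subset of W is itself a Gaussian random matrix, so the
# recovery subsystem stays well-conditioned with high probability.
conds = []
for _ in range(500):
    cols = rng.choice(32, size=8, replace=False)
    conds.append(np.linalg.cond(W[:, cols]))

# Condition numbers stay modest, consistent with the slide's claim of
# roughly one digit of precision lost on reconstruction.
print(float(np.median(conds)))
```

A condition number around 10^k costs roughly k digits of precision in the solve, which is how the "average loss of 1 digit" figure connects to conditioning.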
Results
• Tested checkpointing & FT-MPI with a conjugate gradient solver
  • Only checkpointing 3 vectors and 2 scalars (light load)
  • More performance overhead than the mirrored approach, but half the storage overhead
• Performance of FT-MPI
  • Comparable to MPICH2 (slightly faster), 2x the speed of MPICH-1
• Overhead of weighted checkpointing
  • About 2% for 5 checkpoint nodes, 64 compute nodes
• Overhead of recovery
  • About 1% for 5 checkpoint nodes, 64 compute nodes
• Numerical error in the solver's residuals < 5.0e-6
Questions
• How easy would it be to automate FP checkpointing like this? It seems like a pain to add to everything.
  • The authors suggest adding it to numerical packages
  • Could we make a tool? CpPablo?
• Can we make the weights/groups of checkpointed processors adaptive?
  • e.g., we might want to assign groups based on hot/cold areas in the machine room
• What other ways are there around the problems of binary checkpointing in heterogeneous environments?