
Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing

Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra. Presented by Todd Gamblin.


Presentation Transcript


  1. Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing • Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, Jack Dongarra • Presented by Todd Gamblin

  2. Background: Failure • The mean time to failure (MTTF) of high-performance computers is becoming shorter than the execution times of HPC applications • Even 10,000 processors can imply failures every hour • BlueGene/L already has 65,000 processors, and will have 131,000 • Why so bad? • Commodity parts are cheap, so more and more computers are built with them • Commodity parts => commodity-targeted MTTF • Great for your desktop, not so great for 100,000 of them
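
The "failures every hour" figure follows from simple arithmetic. The sketch below assumes independent, exponentially distributed node failures (so failure rates add) and a per-node MTTF of 10,000 hours; both are illustrative assumptions, not numbers from the slides. The function name `system_mttf_hours` is made up for this example.

```python
def system_mttf_hours(node_mttf_hours: float, num_nodes: int) -> float:
    """MTTF of the whole machine, assuming independent, exponentially
    distributed node failures, so that the per-node failure rates add up."""
    return node_mttf_hours / num_nodes


# Assumed per-node MTTF (a bit over one year); illustrative only.
NODE_MTTF_HOURS = 10_000.0

for n in (1, 10_000, 65_000, 131_000):
    minutes = system_mttf_hours(NODE_MTTF_HOURS, n) * 60
    print(f"{n:>7} processors: one failure every {minutes:.1f} minutes on average")
```

Under these assumptions the machine-level MTTF shrinks linearly with the processor count, which is why a part that is reliable enough for a desktop becomes a problem at 100,000 nodes.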

  3. General solution: Checkpointing • Save application state at synch points so that we can recover from a fault • Save state to disk or other stable storage • High overhead for copying data to disk • Persistent, can use for 2-level fault tolerance, can survive failure of all processors • Can keep redundant copies of state in memory of other CPUs • Called Diskless Checkpointing • Faster, can’t survive total failure

  4. What policy? • Runtime system does checkpointing • Completely general, no programmer effort • Must save everything • Binary memory dumps rule out recovery on heterogeneous systems • Binary dumps don’t work in some cases (round-off error in reversed FP computations during rollback causes failure) • Application does checkpointing (the authors prefer this one) • Requires programmer effort • Can streamline the amount of state that needs to be saved • Can get consistency for free by placing checkpoints at the application’s synch points • Can store machine-independent data, recover on diverse systems

  5. FT-MPI : Application level checkpointing • What happens to communicators after failure? • Abort everything (default in all MPI) • Failed processes just die, others keep running, MPI_COMM_WORLD has holes • Failed processes die, but MPI_COMM_WORLD shrinks and ranks can change • Failed processes are respawned, ranks are same, MPI_COMM_WORLD same size • What happens to messages on failure? • All ops that would have returned MPI_SUCCESS finish properly, even if a process died • All operations in a collective communication fail if a process fails • That’s it! Everything else is up to the application
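
The four communicator behaviours listed on this slide can be pictured with a small simulation. This is not FT-MPI code and uses no MPI calls; the mode labels `"abort"`, `"holes"`, `"shrink"`, and `"respawn"` and the function `ranks_after_failure` are names invented here for illustration.

```python
def ranks_after_failure(size, failed, mode):
    """Return the rank layout of MPI_COMM_WORLD after some processes fail.

    Each list entry is the original rank that occupies that slot, or None
    for a hole. The mode labels are invented for this sketch.
    """
    failed = set(failed)
    if mode == "abort":
        # Everything is torn down: no communicator is left.
        return []
    if mode == "holes":
        # Survivors keep their ranks; dead slots stay empty.
        return [None if r in failed else r for r in range(size)]
    if mode == "shrink":
        # Survivors are packed together, so ranks after a hole shift down.
        return [r for r in range(size) if r not in failed]
    if mode == "respawn":
        # Failed processes are restarted in place; size and ranks are unchanged.
        return list(range(size))
    raise ValueError(f"unknown mode: {mode}")


for mode in ("abort", "holes", "shrink", "respawn"):
    print(mode, ranks_after_failure(4, [1], mode))
```

In the shrink case the process that was rank 2 becomes rank 1, which is why an application using that behaviour cannot rely on rank numbers to locate its data after a failure.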

  6. Diskless Checkpointing • So… we should probably do something about those faults, since FT-MPI doesn’t. • The paper tells us how to restore the state for floating point data • 2 Schemes • Mirrored - store copies of data on neighbors • Checksum - Store checksum of FP values on neighbors

  7. Neighbor-based Checksums • Up to n failures, so long as checkpoint and compute processor don’t fail • Redundant processors • Survives up to floor(n/2) failures, again depending on distribution • 2 neighbors can’t fail • No redundant procs. • Best fault tolerance of these • Still can’t have neighbors fail

  8. Basic Checksums • Can’t withstand more than one failure • Straight sum of FP numbers • On failure, 1 eqn, 1 unknown =>recalculate the unknown • Likelihood of failure depends on distribution of failures in groups • Checkpoint encodings can be done in parallel • Probability of failure is
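
A minimal sketch of the "1 equation, 1 unknown" recovery: the checksum is the element-wise sum of every process's data, so one lost vector equals the checksum minus the sum of the survivors. Processes are modelled as plain Python lists rather than MPI ranks, and the names `encode_checksum` and `recover_lost` are hypothetical.

```python
def encode_checksum(local_data):
    """Element-wise floating-point sum of every process's vector."""
    return [sum(column) for column in zip(*local_data)]


def recover_lost(local_data, checksum, failed):
    """Rebuild the vector of the single failed process.

    The checksum equation has exactly one unknown (the lost vector), so it
    is the checksum minus the sum of the surviving vectors.
    """
    survivors = [v for i, v in enumerate(local_data) if i != failed]
    return [c - sum(column) for c, column in zip(checksum, zip(*survivors))]


# Three "processes", each holding a two-element vector.
data = [[1.0, 2.0], [0.5, 4.0], [3.0, -1.25]]
checksum = encode_checksum(data)

# Process 1 fails; its data is rebuilt from the checksum and the survivors.
rebuilt = recover_lost(data, checksum, failed=1)
print(rebuilt)
```

The values above are exactly representable in binary floating point, so the rebuilt vector matches the original bit for bit. In general the subtraction introduces a small round-off error.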

  9. Weighted Checksums • Can survive as long as there are more live checksum processors than dead nodes • Each checksum processor Ci holds one weighted sum of the data on the compute processors, which gives one equation we’ll need to solve to regenerate data at each Pi: ai1 P1 + ... + ain Pn = Ci • Multiple groups of the setup on the left. • Can adapt weightings, number of checkpoint nodes to reliability of particular subgroups

  10. Need to avoid numerical error • Recomputing checkpoints involves solving a system of equations • Need a well-conditioned weighting matrix to do this • Also need any submatrix to be well-conditioned • Solution: Use a Gaussian random matrix • Gaussian random matrices are well-conditioned (with high probability) • Nice property: Submatrix of a matrix with Gaussian random values is Gaussian • Average loss of 1 digit of precision on reconstruction • Probability of the loss of 2 digits is 3.1e-11 • See paper (actually another referenced paper) for details on proof of this.
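
The weighted checksum recovery can be sketched in a few lines: each checksum is a weighted sum of the process values with Gaussian random weights, and after k failures the lost values are found by solving a k-by-k linear system. This is a sketch under simplifying assumptions, not the authors' implementation: every process holds a single scalar, the checksum processes themselves survive, and the first k checksum rows are used. All function names are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def make_weights(num_checksums, num_procs, rng):
    """Gaussian random weight matrix, one row per checksum process."""
    return [[rng.gauss(0.0, 1.0) for _ in range(num_procs)]
            for _ in range(num_checksums)]


def encode(weights, values):
    """Each checksum is one weighted sum of all process values."""
    return [sum(a * x for a, x in zip(row, values)) for row in weights]


def recover(weights, checksums, values, failed):
    """Rebuild the values at the failed indices (None entries in values)."""
    k = len(failed)
    # Move the surviving terms to the right-hand side; the unknowns remain.
    A = [[weights[i][j] for j in failed] for i in range(k)]
    b = [checksums[i] - sum(weights[i][j] * x
                            for j, x in enumerate(values) if x is not None)
         for i in range(k)]
    return solve(A, b)


rng = random.Random(0)
values = [1.5, -2.0, 3.25, 0.75, 4.0, -1.0]   # one scalar per compute process
weights = make_weights(2, len(values), rng)   # two checksum processes
checksums = encode(weights, values)

# Processes 1 and 4 fail at the same time.
damaged = [None if j in (1, 4) else x for j, x in enumerate(values)]
restored = recover(weights, checksums, damaged, [1, 4])
print(restored)
```

The restored values agree with the originals only up to round-off. How much precision is lost depends on the conditioning of the k-by-k submatrix, which is the reason the slide asks for every submatrix of the weight matrix to be well-conditioned.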

  11. Results • Tested Checkpointing & FT-MPI with Conjugate Gradient Solver • Only checkpointing 3 vectors, 2 scalars (light load) • More performance overhead than mirrored approach, but 1/2 the storage overhead • Performance of FT-MPI • Comparable to MPICH2 (slightly faster), 2x speed of MPICH 1 • Overhead of weighted checkpointing • About 2% for 5 checkpoint nodes, 64 compute nodes • Overhead of recovery • About 1% for 5 CP nodes, 64 compute nodes • Numerical Error in residuals in solver < 5.0e-6

  12. Questions • How easy would it be to automate FP checkpointing like this? It seems like a pain to add to everything. • Authors suggest adding it to numerical packages • Could we make a tool? CpPablo? • Can we make weights/groups of checkpointed processors adaptive? • e.g. might want to assign groups based on hot/cold areas in machine room • What other ways are there around problems of binary checkpointing in heterogeneous environments?
