Building and using an FT MPI implementation Graham Fagg HLRS / UTK fagg@hlrs.de
What is FT-MPI • FT-MPI is a fault tolerant MPI system developed under the DOE HARNESS project • What does Fault Tolerant mean? • Failures, whether in the application or in the system, do not cause instant application termination. • The application gets to decide, at the MPI API, how to react.
What is FT-MPI • Why Fault Tolerant MPI? • MTBFnode > JobRun is fine for small jobs on a small number of nodes… • But the aggregate MTBF of the machine is roughly MTBFnode / nodes, and that drops below the run time of a much bigger job (see the sketch below) • Or you have a distributed application • Or a very distributed application • Or a very large, very distributed application -> GRID…
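A back-of-the-envelope illustration of that scaling argument; the numbers below (one node failure per year, 1000 nodes, a two-day job) are assumptions for illustration only, not figures from the talk.

/* Illustration only: with independent node failures the aggregate MTBF of a
 * cluster is roughly MTBF_node / nodes. All numbers here are made up.        */
#include <stdio.h>

int main(void)
{
    double mtbf_node_hours = 365.0 * 24.0;  /* assume one failure per node per year */
    int    nodes           = 1000;
    double job_run_hours   = 48.0;          /* a two-day job                        */

    double mtbf_system = mtbf_node_hours / nodes;   /* about 8.8 hours */

    printf("System MTBF %.1f h, job run %.1f h\n", mtbf_system, job_run_hours);
    if (mtbf_system < job_run_hours)
        printf("A failure during the run is likely -> fault tolerance needed\n");
    return 0;
}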
[Diagram: normal MPI semantics. A communicator in the logical layer (Rank 0, Rank 1, Rank 2, …) is mapped onto processes in the physical layer; when one process fails the whole job terminates: "Host% Abort: error code XX"]
FT-MPI handling of errors • Under FT-MPI, when a member of a communicator dies: • The communicator state changes to indicate a problem • Message transfers can continue if safe, or be stopped (ignored) • To continue: • The user's application can 'fix' the communicator or abort.
Fixing communicators • Fixing a communicator really means deciding when it is safe to continue. • The application must decide: • Which processes are still members • Which messages need to be sent • The fix happens when a collective communicator creation occurs • MPI_Comm_create / MPI_Comm_dup etc. • Special shortcut: a dup of MPI_COMM_WORLD
5 ways to patch it up • There are 5 modes of recovery; they affect the size (extent) and ordering of the communicators • ABORT: terminate, just as other implementations do • BLANK: leave holes • But make sure collectives do the right thing afterwards • SHRINK: re-order processes to make a contiguous communicator • Some ranks change • REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD • REBUILD_ALL: same as REBUILD except it rebuilds all communicators and groups and resets all key values etc.
Uses of different modes • BLANK • Good for parameter sweeps / Monte Carlo simulations, where losing a process only means resending its data • SHRINK • Same as BLANK except for users who need the communicator's size to match its extent • E.g. when using home-grown collectives (a repartitioning sketch follows below)
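As a rough illustration of the SHRINK case (generic MPI code, not FT-MPI-specific): after the communicator has been fixed it is smaller but contiguous, so a data-parallel code can re-derive its partition from the new size and rank. The row-partitioning scheme here is a made-up example.

/* Sketch: after a SHRINK recovery the communicator is smaller but contiguous,
 * so the work partition can simply be recomputed from the new size and rank. */
#include <mpi.h>

void repartition(MPI_Comm comm, int total_rows, int *first_row, int *num_rows)
{
    int size, rank;
    MPI_Comm_size(comm, &size);            /* new (possibly smaller) extent   */
    MPI_Comm_rank(comm, &rank);            /* ranks may have been re-ordered  */

    int rows_per_rank = (total_rows + size - 1) / size;
    *first_row = rank * rows_per_rank;
    *num_rows  = rows_per_rank;
    if (*first_row >= total_rows) {            /* more ranks than rows           */
        *num_rows = 0;
    } else if (*first_row + *num_rows > total_rows) {
        *num_rows = total_rows - *first_row;   /* last rank takes the remainder  */
    }
    /* The caller would now reload or redistribute rows
       [*first_row, *first_row + *num_rows).                                     */
}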
Uses of different modes • REBUILD • Applications that need a constant number of processes • Fixed grids / most solvers • REBUILD_ALL • Same as REBUILD except it does a lot more work behind the scenes • Useful for applications that have multiple communicators (e.g. one per dimension) and some of the key values etc. • Slower, and has slightly higher overhead due to the extra state it has to distribute
Using FT-MPI
/* 'make sure it is sent' example */
MPI_Comm newcom;
int rc;
do {
    rc = MPI_Send (…, com);
    if (rc == MPI_ERR_OTHER) {        /* a member of 'com' has died       */
        MPI_Comm_dup (com, &newcom);  /* collective create 'fixes' it     */
        MPI_Comm_free (&com);
        com = newcom;
    }
} while (rc != MPI_SUCCESS);
Using FT-MPI • Checking every call is not always necessary • A master-slave code may only need a few of the operations in the master's code checked (see the sketch below).
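A hedged sketch of what "checking only in the master" might look like; it assumes errors are returned to the caller rather than aborting, and reuses the MPI_Comm_dup "fix" from the earlier example. The bookkeeping of which task to resend is only hinted at.

/* Sketch: only the master checks return codes. When a slave dies, the master
 * fixes the communicator with MPI_Comm_dup and hands the lost task out again. */
#include <mpi.h>

void master_loop(MPI_Comm *com, int ntasks)
{
    MPI_Status st;
    double result;

    for (int t = 0; t < ntasks; t++) {
        int rc = MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                          MPI_ANY_TAG, *com, &st);
        if (rc != MPI_SUCCESS) {
            MPI_Comm newcom;
            MPI_Comm_dup(*com, &newcom);   /* collective create 'fixes' com   */
            MPI_Comm_free(com);
            *com = newcom;
            t--;                           /* result never arrived: this task
                                              must be handed out again        */
        }
    }
}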
Using FT-MPI • Using FT-MPI on something more complex… • Made worse by structured programming that hides MPI calls below many layers • The layers are usually in different libraries.
Using FT-MPI [Diagram: call chain: Build an unstructured grid -> Distribute some work -> Solve my part -> Do I=0, XXX … MPI_Sendrecv ( ) …]
Using FT-MPI [Same diagram: the failure is detected via an MPI call deep inside the solver loop ("we detect a failure via an MPI call here…")]
Using FT-MPI [Same diagram: the recovery has to happen near the top of the call chain ("we need to fix it up here"), while "you are here" marks the MPI_Sendrecv deep inside the loop]
Using FT-MPI • Using MPI error handlers makes for neater code. • All application recovery operations can occur in the user's handler, so every MPI call does not need to be checked (a sketch of the handler follows below).
/* install my recovery handler just once */
MPI_Errhandler_create (my_recover_function, &errh);
MPI_Errhandler_get (MPI_COMM_WORLD, &orghandler);
MPI_Errhandler_free (&orghandler);
MPI_Errhandler_set (MPI_COMM_WORLD, errh);
/* all communicators created from now on get this handler */
/* line-by-line checking: */
if (MPI_Send (…)) { call recovery }
if (MPI_Recv (…)) { call recovery }
if (MPI_Scatter (…)) { call recovery }
/* versus automatic checking (handler installed): */
MPI_Send (…)
MPI_Recv (…)
MPI_Scatter (…)
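A minimal sketch, assuming the MPI-1 style MPI_Errhandler_create used on the slide, of what my_recover_function might look like; this is not FT-MPI's own code. The global restart_point is set with setjmp() in main() (see the flow slides and skeleton below), and fixed_comm is a hypothetical place for the application to keep the repaired communicator.

/* Sketch of a recovery error handler matching the MPI_Errhandler_create()
 * call above. The recovery body is entirely the application's choice.       */
#include <mpi.h>
#include <setjmp.h>

jmp_buf  restart_point;   /* longjmp target, set before calling the solver   */
MPI_Comm fixed_comm;      /* hypothetical: where the repaired communicator
                             is kept after recovery                          */

void my_recover_function(MPI_Comm *comm, int *errcode, ...)
{
    /* In FT-MPI a collective communicator creation 'fixes' the communicator
       (REBUILD / SHRINK / BLANK semantics chosen by the application).        */
    MPI_Comm_dup(*comm, &fixed_comm);

    /* Application-specific recovery (reload checkpoint, redistribute data,
       decide which messages to resend) would go here, then jump back.        */
    longjmp(restart_point, *errcode);
}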
[Flow diagram: normal startup] rc = MPI_Init (…) -> if normal startup: install error handler & set longjmp target -> Call Solver (…) -> MPI_Finalize (…)
[Flow diagram: on error] rc = MPI_Init (…) -> set longjmp target / error handler -> on error (invoked automatically by the MPI library): do recover ( ) and longjmp back -> Call Solver (…) -> MPI_Finalize (…)
[Flow diagram: restarted process] rc = MPI_Init (…) -> if rc == MPI_Restarted ("I am new"): do recover ( ) -> error handler / set longjmp target -> Call Solver (…) -> MPI_Finalize (…) (a code skeleton follows below)
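Putting the three flow slides together, a skeleton of the restart logic might look like the following. MPI_Restarted is the FT-MPI-specific MPI_Init return value as named on the slide, and recover() / solver() are hypothetical application routines; restart_point and my_recover_function come from the handler sketch above.

/* Sketch of the startup / restart flow shown on the slides above.           */
#include <mpi.h>
#include <setjmp.h>

extern jmp_buf restart_point;
void my_recover_function(MPI_Comm *comm, int *errcode, ...);
void recover(void);   /* hypothetical: rebuild application state             */
void solver(void);    /* hypothetical: the actual computation                */

int main(int argc, char **argv)
{
    int rc = MPI_Init(&argc, &argv);

    MPI_Errhandler errh;
    MPI_Errhandler_create(my_recover_function, &errh);
    MPI_Errhandler_set(MPI_COMM_WORLD, errh);

    if (rc == MPI_Restarted)          /* FT-MPI-specific: I am a re-spawned
                                         process, get state from survivors   */
        recover();

    if (setjmp(restart_point) != 0)   /* we land here after a failure + longjmp */
        recover();

    solver();
    MPI_Finalize();
    return 0;
}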
Implementation details • Built in multiple layers • Has tuned collectives and user-derived datatype handling • Users need to re-compile against libftmpi and start the application with the ftmpirun command • Can be run both with and without a HARNESS core: • with a core it uses FT-MPI plug-ins • standalone it uses an extra daemon on each host to facilitate startup and (failure) monitoring
Implementation details • Distributed recovery • Uses a single, dynamically created 'master' list of 'living' nodes • The list is compiled by a 'leader' • The leader is picked by using an atomic swap on a record in 'some naming service' (a toy illustration follows below) • The list is distributed by an atomic broadcast • Can survive multiple nested failures…
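A toy illustration of the leader-election idea only: whoever wins an atomic swap on a record becomes the leader that compiles and broadcasts the survivor list. The name-service record is faked here with a local atomic flag; in FT-MPI the real record lives in its naming service and the list is distributed by an atomic (all-or-nothing) broadcast.

/* Toy stand-in for the recovery leader election described above.            */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int recovery_leader_record = 0;   /* stand-in for the NS record */

int try_become_leader(int my_id)
{
    int expected = 0;
    /* Atomic swap: only the first caller sees 0 and installs its own id.    */
    return atomic_compare_exchange_strong(&recovery_leader_record,
                                          &expected, my_id);
}

int main(void)
{
    if (try_become_leader(42))
        printf("I am the recovery leader: I build and broadcast the list\n");
    else
        printf("Leader is process %d: I wait for the atomic broadcast\n",
               (int)atomic_load(&recovery_leader_record));
    return 0;
}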
Implementation details [Diagram: each host runs an MPI application linked against libftmpi plus a Startup_d daemon; a Name Service and the Ftmpi_notifier tie the hosts together]
Status and future • Beta version • Limited number of MPI functions supported • Currently working on getting PETSc (the Portable, Extensible Toolkit for Scientific Computation from ANL) working in an FT mode • Target of 100+ functions by SC2002 • Why so many? Every real-world library uses more than the '6' required MPI functions… if it is in the standard then it will be used • Covers all major classes of functions in MPI • Future work • Templates for different classes of MPI applications so users can build on our work • Some MPI-2 support (PIO?); dynamic tasks are easy for us!
Conclusion • Not Condor for MPI • Can do more than a reload-restart • The application must do some work • But it decides what • Middleware for building FT applications with • i.e. do we even know yet how to do this kind of recovery? • Not a slower alternative • The cost comes at recovery time (mostly) • The standard gets in the way
Links and further information • HARNESS and FT-MPI at UTK/ICL http://icl.cs.utk.edu/harness/ • HARNESS at Emory University http://www.mathcs.emory.edu/harness/ • HARNESS at ORNL http://www.epm.ornl.gov/harness/