
Building and using an FT MPI implementation


Presentation Transcript


  1. Building and using an FT MPI implementation Graham Fagg HLRS / UTK fagg@hlrs.de

  2. What is FT-MPI • FT-MPI is a fault tolerant MPI system developed under the DOE HARNESS project • What does Fault Tolerant mean? • A failure, whether in the application or in the system, does not cause instant application termination. • The application gets to decide, at the MPI API, how to respond.

  3. What is FT-MPI • Why a fault tolerant MPI? • MTBF(node) vs. JobRun: OK for small jobs on a small number of nodes… • but the machine's effective MTBF shrinks as nodes are added, so for a much bigger job MTBF(node) / nodes < JobRun (a back-of-the-envelope example follows) • Or you have a distributed application • Or a very distributed application • Or a very large, very distributed application -> GRID…
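To make the scaling argument above concrete, here is a tiny back-of-the-envelope calculation; the numbers are invented for illustration, not taken from the slides: with a per-node MTBF of 30 days, a 1000-node machine can expect a failure roughly every 43 minutes, far less than a long job run.

  /* Back-of-the-envelope only: numbers are invented for illustration.
     The effective MTBF of N independent nodes is roughly MTBF_node / N. */
  #include <stdio.h>

  int main (void)
  {
      double mtbf_node_hours = 30.0 * 24.0;   /* assume 30 days per node */
      int    nodes           = 1000;          /* a "much bigger job"     */

      double mtbf_job_hours  = mtbf_node_hours / nodes;
      printf ("expected failure every %.1f minutes\n", mtbf_job_hours * 60.0);
      return 0;   /* prints roughly 43.2 minutes */
  }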

  4. Normal MPI semantics [Diagram: a communicator (ranks 0, 1, 2) at the logical layer mapped onto processes at the physical layer; when one process fails, the whole job terminates and the host prompt shows "Abort: error code XX".]

  5. FT-MPI handling of errors • Under FT-MPI, when a member of a communicator dies: • The communicator state changes to indicate a problem • Message transfers can continue if safe, or be stopped (ignored) • To continue: • The user's application can ‘fix’ the communicators or abort.

  6. Fixing communicators • Fixing a communicator really means deciding when it is safe to continue. • The application must decide: • Which processes are still members • Which messages need to be sent • The fix happens when a collective communicator creation occurs • MPI_Comm_create / MPI_Comm_dup etc. • Special shortcut: a dup on MPI_COMM_WORLD (a minimal sketch follows)
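A minimal sketch of that shortcut, assuming FT-MPI's semantics that a collective communicator creation performs the repair; the error-checking pattern is plain MPI, while the "dup fixes the communicator" meaning is FT-MPI specific, so treat this as illustrative rather than definitive.

  #include <mpi.h>

  /* Illustrative sketch only: try a collective, and if FT-MPI reports a
     failure, use the "dup of MPI_COMM_WORLD" shortcut to repair it.
     Plain MPI with the default error handler would abort instead of
     returning an error code here. */
  static MPI_Comm barrier_or_recover (MPI_Comm comm)
  {
      int rc = MPI_Barrier (comm);
      if (rc != MPI_SUCCESS) {
          MPI_Comm fixed;
          MPI_Comm_dup (MPI_COMM_WORLD, &fixed);   /* collective create = fix */
          return fixed;                            /* carry on using the repaired comm */
      }
      return comm;
  }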

  7. 5 ways to patch it up • There are 5 modes of recovery; they affect the size (extent) and ordering of the communicators • ABORT: just abort, as other implementations do • BLANK: leave holes • But make sure collectives do the right thing afterwards • SHRINK: re-order processes to make a contiguous communicator • Some ranks change • REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD • REBUILD_ALL: same as REBUILD except it rebuilds all communicators and groups and resets all key values etc.

  8. Shrink example

  9. Uses of different modes • BLANK • Good for parameter sweeps / Monte Carlo simulations where a process loss only means resending its data. • SHRINK • Same as BLANK, except where users need the communicator's size to match its extent • e.g. when using home-grown collectives (a sketch of why follows)
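As a made-up illustration of the home-grown-collective point (this sketch is not from the slides): the hand-rolled reduction below assumes every rank from 0 to size-1 is alive, which SHRINK guarantees and BLANK does not.

  #include <mpi.h>

  /* Hand-rolled sum over all ranks: rank 0 receives one value from every
     other rank.  After SHRINK, size matches the surviving processes and
     this still works; after BLANK, some ranks in 0..size-1 are holes and
     the matching sends never arrive. */
  static double gather_sum (double my_value, MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank (comm, &rank);
      MPI_Comm_size (comm, &size);

      if (rank != 0) {
          MPI_Send (&my_value, 1, MPI_DOUBLE, 0, 99, comm);
          return 0.0;
      }

      double sum = my_value, tmp;
      for (int src = 1; src < size; src++) {
          MPI_Recv (&tmp, 1, MPI_DOUBLE, src, 99, comm, MPI_STATUS_IGNORE);
          sum += tmp;
      }
      return sum;
  }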

  10. Uses of different modes • REBUILD • Applications that need a constant number of processes • Fixed grids / most solvers • REBUILD_ALL • Same as REBUILD except it does a lot more work behind the scenes. • Useful for applications with multiple communicators (one per dimension) and some of the key values etc. • Slower, with a slightly higher overhead due to the extra state it has to distribute

  11. Using FT-MPI

  /* "make sure it is sent" example */
  do {
      rc = MPI_Send (…, com);
      if (rc == MPI_ERR_OTHER) {
          /* a failure was reported: fix the communicator, then retry */
          MPI_Comm_dup (com, &newcom);
          MPI_Comm_free (&com);
          com = newcom;
      }
  } while (rc != MPI_SUCCESS);

  12. Using FT-MPI • Checking every call is not always necessary • A master-slave code may only need a few of the master's operations checked (see the sketch below).
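A hedged sketch of that master-slave idea (not from the slides; requeue_lost_work() and record_result() are hypothetical bookkeeping helpers a real master would already have): only the master's result receive is checked, and lost work is simply handed out again once the communicator is fixed.

  #include <mpi.h>

  /* Hypothetical helpers, assumed to exist in the application. */
  extern void requeue_lost_work (void);
  extern void record_result (int slave, double result);

  /* Sketch only: the slaves' calls and the master's sends are left
     unchecked; only this receive loop watches for failures. */
  static void master_collect (MPI_Comm com, int ntasks)
  {
      int done = 0;
      while (done < ntasks) {
          MPI_Status st;
          double     result;
          int rc = MPI_Recv (&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                             MPI_ANY_TAG, com, &st);
          if (rc != MPI_SUCCESS) {
              MPI_Comm newcom;
              MPI_Comm_dup (com, &newcom);    /* fix the communicator        */
              MPI_Comm_free (&com);
              com = newcom;
              requeue_lost_work ();           /* re-issue work of dead slaves */
              continue;                       /* nothing was received         */
          }
          record_result (st.MPI_SOURCE, result);
          done++;
      }
  }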

  13. Using FT-MPI • Using FT-MPI on something more complex. • Made worse by structured programming that hides MPI calls below many layers. • The layers are usually in different libraries.

  14. Using FT-MPI [Layer diagram: Build an unstructured grid → Distribute some work → Solve my part → inner loop: Do I=0, XXX … MPI_Sendrecv ( ) …]

  15. Using FT-MPI [Same layer diagram: the failure is detected via an MPI call deep inside the inner loop, at the MPI_Sendrecv ( ).]

  16. Using FT-MPI [Same layer diagram: "you are here", deep in the inner loop, but the fix-up needs to happen back up at the top, where the grid is built and the work is distributed.]

  17. Using FT-MPI • Using MPI error handlers makes for neater code. • All application recovery operations can occur in the user's handler, so not every MPI call needs to be checked.

  /* install my recovery handler just once */
  MPI_Errhandler_create (my_recover_function, &errh);
  MPI_Errhandler_get (MPI_COMM_WORLD, &orghandler);
  MPI_Errhandler_free (&orghandler);
  MPI_Errhandler_set (MPI_COMM_WORLD, errh);
  /* all communicators created from now on get this handler */

  /* line by line checking */
  if (MPI_Send (…)) { call recovery }
  if (MPI_Recv (…)) { call recovery }
  if (MPI_Scatter (…)) { call recovery }

  /* automatic checking (handler installed) */
  MPI_Send (…)
  MPI_Recv (…)
  MPI_Scatter (…)

  18. [Flow diagram] rc = MPI_Init (…) → if normal startup: install error handler & set longjmp point → call Solver (…) → MPI_Finalize (…)

  19. [Flow diagram] rc = MPI_Init (…) → install error handler & set longjmp point → call Solver (…); on an error (automatic, via the MPI library) the handler does recover ( ) and longjmps back to just before the solver → MPI_Finalize (…)

  20. [Flow diagram] rc = MPI_Init (…) → if rc == MPI_Restarted: "I am new", do recover ( ) → install error handler & set longjmp point → call Solver (…) → MPI_Finalize (…). A combined sketch of slides 17–20 follows.
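Putting slides 17–20 together, here is a hedged skeleton of the longjmp-style recovery loop; do_recover() and solver() are placeholders, the handler uses the MPI-1 signature, and the "restarted" return code is FT-MPI specific, so it is only hinted at in a comment rather than named as a real constant.

  #include <mpi.h>
  #include <setjmp.h>

  static jmp_buf restart_point;

  /* Hypothetical application routines. */
  static void do_recover (void) { /* e.g. dup/fix MPI_COMM_WORLD, redistribute work */ }
  static void solver     (void) { /* the real computation goes here */ }

  /* Installed as the MPI error handler (slide 17); called automatically by
     the library on a failure (slide 19). */
  static void my_recover_function (MPI_Comm *comm, int *err, ...)
  {
      do_recover ();
      longjmp (restart_point, 1);    /* jump back to just before the solver */
  }

  int main (int argc, char **argv)
  {
      MPI_Errhandler errh;
      int rc = MPI_Init (&argc, &argv);

      MPI_Errhandler_create (my_recover_function, &errh);
      MPI_Errhandler_set (MPI_COMM_WORLD, errh);

      if (rc != MPI_SUCCESS) {
          /* FT-MPI signals a re-spawned replacement process through a special
             return code ("MPI_Restarted" on slide 20); the exact constant is
             FT-MPI specific.  The new process must catch up with the others. */
          do_recover ();
      }

      setjmp (restart_point);        /* the error handler longjmps back to here */
      solver ();

      MPI_Finalize ();
      return 0;
  }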

  21. Implementation details • Built in multiple layers • Has tuned collectives and user derived-datatype handling • Users need to re-compile against libftmpi and start the application with the ftmpirun command • Can be run both with and without a HARNESS core: • with a core it uses FT-MPI plug-ins • standalone it uses an extra daemon on each host to facilitate startup and (failure) monitoring

  22. Implementation details • Distributed recovery • Uses a single, dynamically created ‘master’ list of ‘living’ nodes • The list is compiled by a ‘leader’ • The leader is picked by using an atomic swap on a record in ‘some naming service’ (sketched below) • The list is distributed by an atomic broadcast • Can survive multiple nested failures…
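A heavily hedged sketch of the leader-election step described above; every function name below is an invented stand-in for whatever the naming service and startup daemons actually provide, since the slide only says "an atomic swap on a record" and "an atomic broadcast".

  /* All hypothetical stand-ins, not FT-MPI APIs. */
  extern int  ns_test_and_set (const char *record, int my_id);
  extern void build_list_of_living_nodes (void);
  extern void atomic_broadcast_list (void);
  extern void wait_for_list (void);

  static void elect_and_distribute (int my_id)
  {
      /* Whoever swaps the naming-service record first becomes the leader. */
      if (ns_test_and_set ("recovery-leader", my_id)) {
          build_list_of_living_nodes ();   /* compile the 'master' list        */
          atomic_broadcast_list ();        /* all-or-nothing delivery to peers */
      } else {
          wait_for_list ();                /* everyone else receives the list  */
      }
  }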

  23. Implementation details [Architecture diagram: a Name Service; on each host an MPI application linked against libftmpi together with a Startup_d daemon; an Ftmpi_notifier process alongside them.]

  24. Performance

  25. Performance

  26. Status and future • Beta version • Limited number of MPI functions supported • Currently working on getting PETSc (the Portable, Extensible Toolkit for Scientific Computation from ANL) working in an FT mode • Target of 100+ functions by SC2002. • WHY so many? Every real-world library uses more than the ‘6’ required MPI functions. If it is in the standard then it will be used. • Covers all major classes of functions in MPI. • Future work • Templates for different classes of MPI applications so users can build on our work • Some MPI-2 support (PIO?) Dynamic tasks are easy for us!

  27. Conclusion • Not Condor for MPI • Can do more than a reload-restart • The application must do some work • But it decides what • Middleware for building FT applications with • i.e. do we know how to do this kind of recovery yet?? • Not a slower alternative • The cost is at recovery time (mostly) • The standard gets in the way

  28. Links and further information • HARNESS and FT-MPI at UTK/ICL http://icl.cs.utk.edu/harness/ • HARNESS at Emory University http://www.mathcs.emory.edu/harness/ • HARNESS at ORNL http://www.epm.ornl.gov/harness/
