
FT-MPI




  1. FT-MPI Graham E Fagg Making of the holy grail or a YAMI that is FT

  2. FT-MPI • What is FT-MPI (it's no YAMI) • Building an MPI for Harness • First sketch of FT-MPI • Simple FT enabled Example • A bigger meaner example (PSTSWM) • Second view of FT-MPI • Future directions

  3. FT-MPI is not just a YAMI • FT-MPI as in Fault Tolerant MPI • Why make an FT version? • Harness is going to be very robust compared to previous systems. • No single point of failure, unlike PVM • Allow MPI users to take advantage of this high level of robustness, rather than just provide Yet Another MPI Implementation (YAMI)

  4. Why FT-MPI • Current MPI applications live under the MPI fault-tolerance model of no faults allowed. • This is fine on an MPP, as if you lose a node you generally lose the partition anyway. • Makes reasoning about results easy. If there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.

  5. Why FT-MPI • No matter how we implement FT-MPI, it must follow current MPI-1.2 (or 2) practices. I.e. we can't really change too much about how it works (semantics) or how it looks (syntax). • Makes coding for FT a little interesting and very dependent on the target application classes, as will be shown.

  6. So first, what does MPI do? • All communication is via a communicator • Communicators form an envelope in which communication can occur, and contain information such as process groups, topology information and attributes (key values)

  7. What does MPI do? • When an application starts up, it has a single communicator that contains all members, known as MPI_COMM_WORLD • Other communicators containing sub-sections of the original communicator can be created from it using collective (meaning blocking, group-wide) operations.
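
As a reminder of what that communicator model looks like in code, here is a minimal, standard MPI-1 sketch (nothing FT-MPI specific) that derives a sub-communicator from MPI_COMM_WORLD with the collective MPI_Comm_split call:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_rank, world_size, sub_rank;
        MPI_Comm sub_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Collective call: every member of MPI_COMM_WORLD must take part.
           Processes supplying the same colour end up in the same new communicator. */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
        MPI_Comm_rank(sub_comm, &sub_rank);

        printf("world rank %d of %d -> sub rank %d\n", world_rank, world_size, sub_rank);

        MPI_Comm_free(&sub_comm);
        MPI_Finalize();
        return 0;
    }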

  8. What does MPI do? • Until MPI-2 and the advent of MPI_Spawn (which is not really supported by any implementations except LAM) it was not possible to add new members to the range of addressable members in an MPI application. • If you can't address (name) them, you can't communicate directly with them.

  9. What does MPI do? • If a member of a communicator fails for some reason, the specification mandates that, rather than continuing (which would lead to unknown results in a doomed application), the communicator is invalidated and the application halted in a clean manner. • In short: if something fails, everything does.
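
Under the default error handler (MPI_ERRORS_ARE_FATAL) that is exactly what happens. A minimal sketch of the standard MPI hook an application can use to at least see an error code instead of being aborted, by switching to MPI_ERRORS_RETURN (what FT-MPI then lets you do after the error is the subject of the rest of the talk):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, err = MPI_SUCCESS, buf = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Default is MPI_ERRORS_ARE_FATAL: any failure aborts the whole job.
           With MPI_ERRORS_RETURN, calls hand an error code back to the caller. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* A message involving a peer that has died would now return an error code. */
        if (rank == 0)
            err = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            err = MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

        if (err != MPI_SUCCESS)
            fprintf(stderr, "rank %d: communication failed with code %d\n", rank, err);

        MPI_Finalize();
        return 0;
    }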

  10. What would we like? • Many applications are capable, or can be made capable, of surviving such a random failure. • Initial Goal: • Provide a version of MPI that allows a range of alternatives to an application when a sub-part of the application has failed. • The range of alternatives depends on how the applications themselves will handle the failure.

  11. Building an MPI for Harness • Before we get into the nitty-gritty of what we do when we get an error, how are we going to build something in the first place? • Two methods: • Take an existing implementation (a la MPICH) and re-engineer it for our own uses (the most popular method currently) • Build an implementation from the ground up.

  12. Building a YAMI • Taking MPICH and building an FT version should be simple…? • It has a layered design: the MPI API sits on top of the data structures, which sit on top of a collective communication model, which calls an ADI (Abstract Device Interface) that provides p2p communications.

  13. Building a YAMI • MSS tried this with their version of MPI for the Cray T3E • Found that the layering was not very clean; lots of short cuts and data passed between the layers without going through the expected APIs. • Especially true of routines that handle startup (i.e. process management)

  14. Building a YAMI • Building a YAMI from scratch • Not impossible but time consuming • Too many function calls to support (200+) • Can implement a subset (just like compiler writers did for HPF with subset HPF) • If we later want a *full* implementation then we need a much larger team than we currently have. (Look at how long it has taken ANL to keep up to date, and look at their currently outstanding bug list).

  15. Building a YAMI • A subset of operations is the best way to go • Allows us to test a few key applications and find out just how useful and applicable an FT-MPI would be.

  16. Building an MPI for Harness • What does Harness give us, and what do we have to build ourselves? • Harness will give us basic functionality of starting tasks, some basic comms between them, some attribute storage (mboxes) and some indication of errors and failures. • I.e. mostly what PVM gives us at the moment. • As well as the ability to plug extra bits in...

  17.–22. Harness Basic Structure [diagram-only slides: applications sit on the Harness run-time (in the final figure extended with an FM-Comms-Plugin), talk to each other over basic TCP/IP links, and connect via pipes/sockets and TCP/IP to the HARNESS daemon, which holds internal Harness meta-data storage and is linked to a repository]

  23. So what do we need to build for FT-MPI? • Build the run-time components that provide the user application with an MPI API • Build an interface in this run-time component that allows for fast communications, so that we at least provide something that doesn't run like a three-legged dog.

  24. Building the run-time system • The system can be built as several layers. • The top layer is the MPI API • The next layer handles the internal MPI data structures and some of the data buffering. • The next layer handles the collective communications. • Breaks them down to p2p, but in a modular way so that different collective operations can be optimised differently depending on the target architecture. • The lowest layer handles p2p communications.
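
A hypothetical stub sketch of that layering in C (every name here is invented for illustration; the real layer boundaries and interfaces in FT-MPI will differ), showing how an API-level call is decomposed layer by layer, with the collective module free to choose its own decomposition into p2p operations:

    #include <stdio.h>

    /* Lowest layer: point-to-point transport (stubbed out). */
    static int p2p_send(int dest, const void *buf, int len)
    {
        printf("p2p_send: %d bytes to process %d\n", len, dest);
        (void)buf;
        return 0;
    }

    /* Collective layer: breaks a broadcast down into p2p sends.
       A different module could substitute a tree or pipeline here. */
    static int coll_bcast(void *buf, int len, int root, int nprocs)
    {
        for (int p = 0; p < nprocs; p++)
            if (p != root)
                p2p_send(p, buf, len);
        return 0;
    }

    /* Data-structure layer: would resolve a communicator handle to its
       member list and buffering; reduced to a constant-sized group here. */
    enum { GROUP_SIZE = 4 };

    /* Top layer: the MPI-like API seen by the application. */
    static int api_bcast(void *buf, int len, int root)
    {
        return coll_bcast(buf, len, root, GROUP_SIZE);
    }

    int main(void)
    {
        int x = 42;
        return api_bcast(&x, (int)sizeof x, 0);
    }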

  25. Building the run-time system • Do we have any of this already? • Yes… the MPI API layer is currently in a file called MPI_Connect/src/com_layer.c • Most of the data structures are in com_list, msg_list.c, lists.c and hash.c • Hint, try compiling the library with the flag -DNOMPI • Means we know what we are up against.

  26. Building the run-time system • The most complex part is handling the collective operations and all the variants of vector operations. • PACX and MetaMPI do not support them all, but MagPie is getting closer.

  27. What is MagPie? • A black and white bird that collects shiny objects. • A software system by Thilo Kielmann of Vrije Universiteit, Amsterdam, NL. • 'Collects' is the important word here, as it is a package that supports efficient collective operations across multiple clusters. • Most collective operations in most MPI implementations break down into a series of broadcasts, which scale well across switches as long as the switches are homogeneous, which is not the case for a cluster of clusters. • I.e. we can use MagPie to provide the collective substrate.
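
To illustrate the idea behind MagPie-style cluster-aware collectives (an illustrative sketch only, not MagPie's actual code), a broadcast across a cluster of clusters can be staged in two levels: once between per-cluster coordinators over the slow wide-area links, then once inside each cluster over the fast local network. The cluster_of() mapping below is an assumption; in practice the cluster membership would come from configuration:

    #include <mpi.h>

    /* Hypothetical mapping of world rank to cluster id (here: 4 ranks per cluster). */
    static int cluster_of(int world_rank) { return world_rank / 4; }

    /* Two-level broadcast rooted at world rank 0. */
    static void hierarchical_bcast(void *buf, int count, MPI_Datatype type)
    {
        int wrank, lrank;
        MPI_Comm local, leaders;

        MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

        /* One communicator per cluster. */
        MPI_Comm_split(MPI_COMM_WORLD, cluster_of(wrank), wrank, &local);
        MPI_Comm_rank(local, &lrank);

        /* A communicator containing only the local roots (one per cluster). */
        MPI_Comm_split(MPI_COMM_WORLD, lrank == 0 ? 0 : MPI_UNDEFINED, wrank, &leaders);

        if (leaders != MPI_COMM_NULL) {            /* stage 1: across the wide area */
            MPI_Bcast(buf, count, type, 0, leaders);
            MPI_Comm_free(&leaders);
        }
        MPI_Bcast(buf, count, type, 0, local);     /* stage 2: inside each cluster */
        MPI_Comm_free(&local);
    }

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            data = 42;
        hierarchical_bcast(&data, 1, MPI_INT);
        MPI_Finalize();
        return 0;
    }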

  28. Building the run-time system • Just leaves the p2p system, and the interface to the Harness daemons themselves. • The p2p system can be built on Martin's fast message layer. • The Harness interface can be implemented on top of PVM 3.4 for now, until Harness itself becomes available.

  29. Building the run-time system • The last details to worry about are how we are going to change the MPI semantics to report errors and how we continue after them. • Taking note of how we know there is a failure in the first place.

  30. First sketch of FT-MPI • The first view of FT-MPI is where the user's application is able to handle errors and all we have to provide is: • A simple method for indicating errors/failures • A simple method for recovering from errors

  31. First sketch of FT-MPI • 3 initial models of failure (another later on) • (1) There is a failure and the application is shut down (MPI default; gains us little other than meeting the standard). • (2) Failure only affects members of a communicator who communicate with the failed party. I.e. p2p comms still work within the communicator. • (3) That communicator is invalidated completely.

  32. First sketch of FT-MPI • How do we detect failure? • 4 ways… (1) We are told it's going to happen by a member of a particular application (i.e. "I have NaNs everywhere… panic") (2) A point-to-point communication fails (3) The p2p system tells us that someone failed (error propagation within a communicator at the run-time system layer) (much like (1)) (4) Harness tells us via a message from the daemon.

  33. First sketch of FT-MPI • How do we tell the user application? • Return it an MPI_ERR_OTHER • Force it to check an additional MPI error call to find where the failure occurred. • Via the cached attribute key values • FT_MPI_PROC_FAILED, which is a vector whose length is the size of the original communicator. • How do we recover if we have just invalidated the communicator the application will use to recover on?
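
A sketch of how an application might consult that cached attribute after a call returns MPI_ERR_OTHER. The keyval FT_MPI_PROC_FAILED is named on the slide, but its real definition lives in FT-MPI; the placeholder below (and the assumption that the attribute is an int vector with one flag per original rank) just keeps the sketch compilable against a standard MPI-1 library:

    #include <mpi.h>
    #include <stdio.h>

    /* Placeholder: the real keyval would be exported by FT-MPI. */
    static int FT_MPI_PROC_FAILED = MPI_KEYVAL_INVALID;

    static void report_failures(MPI_Comm comm)
    {
        int size, flag = 0, i;
        int *failed = NULL;              /* assumed: one flag per original rank */

        if (FT_MPI_PROC_FAILED == MPI_KEYVAL_INVALID)
            return;                      /* no FT-MPI underneath us */

        MPI_Comm_size(comm, &size);
        MPI_Attr_get(comm, FT_MPI_PROC_FAILED, &failed, &flag);
        if (!flag || failed == NULL)
            return;                      /* attribute not cached: no failure info */

        for (i = 0; i < size; i++)
            if (failed[i])
                fprintf(stderr, "rank %d of the original communicator failed\n", i);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* Typical use: an MPI call returns MPI_ERR_OTHER, then we ask what happened. */
        report_failures(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }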

  34. First sketch of FT-MPI • Some functions are allowed to be used in a partial form to facilitate recovery. • I.e. MPI_Barrier ( ) can still be used to sync processes, but will only wait for the surviving processes… • The formation of a new communicator will also be allowed to work with a broken communicator. • MPI_Finalize does not need a communicator specified.

  35. First sketch of FT-MPI • Forming a new communicator that the application can use to continue is the important part. • Two functions can be modified for this use: • MPI_COMM_CREATE (comm, group, newcomm ) • MPI_COMM_SPLIT (comm, colour, key, newcomm )

  36. First sketch of FT-MPI • MPI_COMM_CREATE ( ) • Called with the group set to a new constant • FT_MPI_LIVING (!) • Creates a new communicator that contains all the processes that continue to survive. • A special case could be to allow MPI_COMM_WORLD to be specified as both the input and output communicator.

  37. First sketch of FT-MPI • MPI_COMM_SPLIT ( ) • Called with the colour set to a new constant • FT_MPI_NOT_DEAD_YET (!) • key can be used to control the new rank of processes within the new communicator. • Again creates a new communicator that contains all the processes that continue to survive.
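
A sketch of a recovery step using the two proposed calls from slides 36 and 37. FT_MPI_LIVING and FT_MPI_NOT_DEAD_YET are the constants proposed on the slides; their actual definitions (and the header that would provide them) are assumptions, so this is a control-flow sketch against the proposed interface rather than code that compiles today:

    /* Rebuild a working communicator from a broken one after a failure. */
    MPI_Comm recover(MPI_Comm broken)
    {
        MPI_Comm repaired;
        int myrank;

        MPI_Comm_rank(broken, &myrank);   /* local call: still usable on a broken comm */

        /* Option A: gather every surviving process into a new communicator. */
        MPI_Comm_create(broken, FT_MPI_LIVING, &repaired);

        /* Option B (alternative): the same effect via split; passing the old rank
           as the key keeps the survivors in their old relative order.
           MPI_Comm_split(broken, FT_MPI_NOT_DEAD_YET, myrank, &repaired); */

        return repaired;
    }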

  38. Simple FT enabled Example • Simple application at first • Bag of tasks, where the tasks know how to handle a failure. • Server just divides up the next set of data to be calculated between the survivors. • Clients nominate a new server if they have enough state. • (Can get the state by using ALL2ALL communications for results).
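
A hypothetical skeleton of such a bag-of-tasks application (not taken from the talk; assumes at least one worker, i.e. two or more processes): rank 0 hands out task indices and collects results, and the place where a worker failure would be handled is marked with a comment, since the actual recovery calls (slides 33–37) are still proposals. Everything else is standard MPI:

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   16
    #define TAG_WORK  1
    #define TAG_DONE  2

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0) {                              /* the server */
            int next = 0, done = 0, result, stop = -1;
            MPI_Status st;

            for (int w = 1; w < size && next < NTASKS; w++) {   /* prime the workers */
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            }
            while (done < NTASKS) {
                int err = MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE,
                                   TAG_DONE, MPI_COMM_WORLD, &st);
                if (err != MPI_SUCCESS) {
                    /* A worker has died: re-queue its outstanding task, rebuild the
                       communicator with the survivors (slides 36-37), redistribute. */
                    continue;
                }
                done++;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    next++;
                }
            }
            for (int w = 1; w < size; w++)            /* tell the workers to stop */
                MPI_Send(&stop, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
        } else {                                      /* a worker */
            int task, result;
            MPI_Status st;

            while (MPI_Recv(&task, 1, MPI_INT, 0, TAG_WORK,
                            MPI_COMM_WORLD, &st) == MPI_SUCCESS && task >= 0) {
                result = task * task;                 /* stand-in for the real work */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }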

  39. A bigger meaner example (PSTSWM) • Parallel Spectral Transform Shallow Water Model • 2D grid calculation • 3D in the actual computation, with one axis performing FFTs, the second global reductions and the third layering sequentially upon each logical processor. • The calculation cannot support reduced grids like those supported by the Parallel Community Climate Model (PCCM), a future target application for FT-MPI. • I.e. if we lose a logical grid point (node) we must replace it!

  40. A bigger meaner example (PSTSWM) • First Sketch ideas for FT-MPI are fine for applications that can handle a failure and have functional calling sequences that are not too deep… • I.e. MPI API calls can be buried deep within routines and any errors may take quite a while to bubble to the surface where the application can take effective action to handle them and recover.

  41. A bigger meaner example (PSTSWM) • This application proceeds in a number of well defined stages and can only handle failure by restarting from a known set of data. • I.e. user checkpoints have to be taken, and must still be reachable. • User requirement is for the application to be started and run to completion with the system automatically handling errors without manual intervention.

  42. A bigger meaner example (PSTSWM) • Invalidating only the failed communicators, as in the first sketch, is not enough for this application. • PSTSWM creates communicators for each row and column of the 2-D grid.

  43.–47. A bigger meaner example (PSTSWM) [diagram-only slides showing the 2-D process grid: first the grid itself; then a failed node; then the row and column communicators containing that node marked as failed; then, on the surviving rows and columns, some communication still works while the butterfly p2p patterns are unknown; finally, communication along an axis is also unknown because a previous failure on that axis might not have been detected]

  48. A bigger meaner example (PSTSWM) • What is really wanted is for four things to happen…. • Firstly, ALL communicators are marked as broken… even if some are recoverable. • The underlying system propagates error messages to all communicators, not just the ones directly affected by the failure. • Secondly, all MPI operations become no-ops where possible, so that the application can bubble the error to the top level as fast as possible.

  49. A bigger meaner example (PSTSWM) • Thirdly, the run-time system spawns a replacement node on behalf of the application using a predetermined set of metrics. • Finally, the system allows this new process to be combined with the surviving communicators at MPI_Comm_create time. • The position (rank) of the new processes is not so important in this application, as restart data has to be redistributed anyway, but may be important for other applications.

  50. A bigger meaner example (PSTSWM) • For this to occur, we need a means of identifying whether a process has been spawned for the purpose of recovery (by either the run-time system or an application itself). • MPI_Comm_split (com, ft_mpi_still_alive,..) vs • MPI_Comm_split (ft_mpi_external_com, ft_mpi_new_spawned,..) • PSTSWM doesn't care which task died and frankly doesn't want to know! • It just wants to continue calculating…
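
A control-flow sketch of the distinction slide 50 draws between a surviving process and a freshly spawned replacement. The names ft_mpi_still_alive, ft_mpi_new_spawned and ft_mpi_external_com come from the slide; the helper ftmpi_spawned_for_recovery() is an assumption standing in for however the run-time would actually tell a process why it was started, so this is a sketch of the intended flow rather than compilable code:

    /* Rejoin the computation after a failure, whichever side of it we are on. */
    MPI_Comm rejoin(void)
    {
        MPI_Comm newcomm;
        int myrank = 0;

        if (ftmpi_spawned_for_recovery()) {
            /* Replacement process: it has no old communicator, so it joins through
               a communicator supplied by the run-time system. */
            MPI_Comm_split(ft_mpi_external_com, ft_mpi_new_spawned, myrank, &newcomm);
        } else {
            /* Survivor: rebuild from the (broken) original communicator. */
            MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
            MPI_Comm_split(MPI_COMM_WORLD, ft_mpi_still_alive, myrank, &newcomm);
        }

        /* Either way, the application now redistributes its restart data over
           newcomm and continues calculating. */
        return newcomm;
    }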
