FT-MPI Graham E Fagg Making of the holy grail or a YAMI that is FT
FT-MPI • What is FT-MPI (it's no YAMI) • Building an MPI for Harness • First sketch of FT-MPI • Simple FT enabled Example • A bigger meaner example (PSTSWM) • Second view of FT-MPI • Future directions
FT-MPI is not just a YAMI • FT-MPI as in Fault Tolerant MPI • Why make an FT version? • Harness is going to be very robust compared to previous systems. • No single point of failure, unlike PVM • Allow MPI users to take advantage of this high level of robustness, rather than just provide Yet Another MPI Implementation (YAMI)
Why FT-MPI • Current MPI applications live under the MPI fault-tolerance model of no faults allowed. • This is fine on an MPP, as losing a node generally means losing the partition anyway. • Makes reasoning about results easy. If there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.
Why FT-MPI • No matter how we implement FT-MPI, it must follow current MPI-1.2 (or 2) practice, i.e. we can't really change too much about how it works (semantics) or how it looks (syntax). • Makes coding for FT a little interesting and very dependent on the target application classes, as will be shown.
So first, what does MPI do? • All communication is via a communicator • Communicators form an envelope in which communication can occur, and contain information such as process groups, topology information and attributes (key values)
What does MPI do? • When an application starts up, it has a single communicator that contains all members, known as MPI_COMM_WORLD • Other communicators containing sub-sections of the original communicator can be created from it using collective (i.e. blocking, group) operations.
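A minimal sketch of that standard mechanism (plain MPI-1, nothing FT-specific yet): every process collectively splits MPI_COMM_WORLD into smaller communicators; the group size of 4 is just an example.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective (blocking, group) operation: every member of MPI_COMM_WORLD
       must call it.  Processes with the same colour end up in the same new
       communicator; here, groups of 4 consecutive ranks. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}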
What does MPI do? • Until MPI-2 and the advent of MPI_Spawn (which is not really supported by any implementation except LAM) it was not possible to add new members to the range of addressable members in an MPI application. • If you can't address (name) them, you can't communicate directly with them.
What does MPI do? • If a member of a communicator fails for some reason, the specification mandates that, rather than continuing (which would lead to unknown results in a doomed application), the communicator is invalidated and the application halted in a clean manner. • In short: if something fails, everything does.
What would we like? • Many applications are capable, or can be made capable, of surviving such a random failure. • Initial Goal: • Provide a version of MPI that allows a range of alternatives to an application when a sub-part of the application has failed. • The range of alternatives depends on how the applications themselves will handle the failure.
Building an MPI for Harness • Before we get into the nitty-gritty of what we do when we get an error, how are we going to build something in the first place? • Two methods: • Take an existing implementation (a la MPICH) and re-engineer it for our own uses (the most popular method currently) • Build an implementation from the ground up.
Building a YAMI • Taking MPICH and building an FT version should be simple…? • It has a layered design: the MPI API sits on top of the data structures, which sit on top of a collective communication model, which calls an ADI that provides p2p communications.
Building a YAMI • MSS tried this with their version of MPI for the Cray T3E • Found that the layering was not very clean: lots of short cuts, and data passed between the layers without going through the expected APIs. • Especially true of routines that handle startup (i.e. process management)
Building a YAMI • Building a YAMI from scratch • Not impossible but time consuming • Too many function calls to support (200+) • Can implement a subset (just like compiler writers did for HPF with subset HPF) • If we later want a *full* implementation then we need a much larger team than we currently have. (Look at how long it has taken ANL to keep up to date, and look at their currently outstanding bug list).
Building a YAMI • A subset of operations is the best way to go • Allows us to test a few key applications and find out just how useful and applicable an FT-MPI would be.
Building an MPI for Harness • What does Harness give us, and what do we have to build ourselves? • Harness will give us basic functionality of starting tasks, some basic comms between them, some attribute storage (mboxes) and some indication of errors and failures. • I.e. mostly what PVM gives us at the moment. • As well as the ability to plug extra bits in...
[Diagrams: Harness basic structure. Applications attach to the HARNESS daemon over pipes/sockets and TCP/IP, with a basic TCP/IP link between applications via the Harness run-time; the daemon keeps internal Harness meta-data storage (the repository). In the final view, each application's Harness run-time carries an FM-Comms-Plugin providing the direct link between applications.]
So what do we need to build for FT-MPI? • Build the run-time components that provide the user application with an MPI API • Build an interface in this run-time component that allows for fast communications, so that we at least provide something that doesn't run like a three-legged dog.
Building the run-time system • The system can be built as several layers. • The top layer is the MPI API • The next layer handles the internal MPI data structures and some of the data buffering. • The next layer handles the collective communications. • Breaks them down to p2p, but in a modular way so that different collective operations can be optimised differently depending on the target architecture. • The lowest layer handles p2p communications.
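To make the layering concrete, here is a deliberately naive sketch (not FT-MPI's actual code) of how the collective layer can express a broadcast purely in terms of the p2p layer beneath it; a modular design would swap in a different algorithm per target architecture.

#include <mpi.h>

/* Naive linear broadcast built only on p2p operations.  A production
   collective layer would choose tree shapes per architecture. */
static int bcast_over_p2p(void *buf, int count, MPI_Datatype type,
                          int root, MPI_Comm comm)
{
    int rank, size, i;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (i = 0; i < size; i++)
            if (i != root)
                MPI_Send(buf, count, type, i, 0, comm);
    } else {
        MPI_Recv(buf, count, type, root, 0, comm, &status);
    }
    return MPI_SUCCESS;
}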
Building the run-time system • Do we have any of this already? • Yes… the MPI API layer is currently in a file called MPI_Connect/src/com_layer.c • Most of the data structures are in com_list, msg_list.c, lists.c and hash.c • Hint: try compiling the library with the flag -DNOMPI • This means we know what we are up against.
Building the run-time system • The most complex part is handling the collective operations and all the variants of vector operations. • PACX and MetaMPI do not support them all, but MagPie is getting closer.
What is MagPie? • A black and white bird that collects shiny objects. • A software system by Thilo Kielmann of Vrije Universiteit, Amsterdam, NL. • 'Collects' is the important word here, as it is a package that supports efficient collective operations across multiple clusters. • Most collective operations in most MPI implementations break down into a series of broadcasts, which scale well across switches as long as the switches are homogeneous, which is not the case for a cluster of clusters. • I.e. we can use MagPie to provide the collective substrate.
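The idea, in a hedged sketch (not MagPie's actual code): pay for the slow wide-area link only once per message, then fan out locally. The two communicators below (coord_comm between cluster coordinators, local_comm within a cluster) are assumed to have been built beforehand.

#include <mpi.h>

/* Two-level broadcast: WAN hop between coordinators, then local fan-out.
   Assumes rank 0 of each local_comm is that cluster's coordinator and
   that coord_comm is MPI_COMM_NULL on non-coordinators. */
void cluster_aware_bcast(void *buf, int count, MPI_Datatype type,
                         MPI_Comm coord_comm, MPI_Comm local_comm)
{
    if (coord_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, coord_comm);   /* across clusters   */

    MPI_Bcast(buf, count, type, 0, local_comm);       /* inside my cluster */
}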
Building the run-time system • That just leaves the p2p system, and the interface to the Harness daemons themselves. • The p2p system can be built on Martin's fast message layer. • The Harness interface can be implemented on top of PVM 3.4 for now, until Harness itself becomes available.
Building the run-time system • The last details to worry about are how we are going to change the MPI semantics to report errors and how we continue after them. • Taking note of how we know there is a failure in the first place.
First sketch of FT-MPI • The first view of FT-MPI is where the user's application is able to handle errors and all we have to provide is: • A simple method for indicating errors/failures • A simple method for recovering from errors
First sketch of FT-MPI • 3 initial models of failure (another later on) • (1) There is a failure and the application is shut down (MPI default; gains us little other than meeting the standard). • (2) Failure only affects members of a communicator who communicate with the failed party, i.e. p2p comms still work within the communicator. • (3) That communicator is invalidated completely.
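Purely for illustration, the three models could be named along these lines (the slides define no such constants; the names are hypothetical):

/* Hypothetical labels for the three failure models above. */
typedef enum {
    FTMPI_MODE_ABORT,        /* (1) MPI default: shut the whole application down  */
    FTMPI_MODE_PARTIAL,      /* (2) only communication with the failed party
                                    breaks; other p2p in the communicator works   */
    FTMPI_MODE_INVALIDATE    /* (3) the affected communicator is invalidated      */
} ftmpi_failure_mode;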
First sketch of FT-MPI • How do we detect failure? • 4 ways… (1) We are told it's going to happen by a member of a particular application (i.e. I have NaNs everywhere… panic) (2) A point-to-point communication fails (3) The p2p system tells us that someone failed (error propagation within a communicator at the run-time system layer) (much like (1)) (4) Harness tells us via a message from the daemon.
First sketch of FT-MPI • How do we tell the user application? • Return it an MPI_ERR_OTHER • Force it to check an additional MPI error call to find where the failure occurred. • Via the cached attribute key values • FT_MPI_PROC_FAILED, which is a vector of length MPI_COMM_SIZE of the original communicator. • How do we recover if we have just invalidated the communicator the application will use to recover on?
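A hedged sketch of what that could look like to the application. FT_MPI_PROC_FAILED is the attribute key proposed above (not a standard MPI constant), and the code assumes the communicator's error handler is set to return error codes rather than abort.

#include <mpi.h>
#include <stdio.h>

void check_for_failures(MPI_Comm comm)
{
    int rc, flag, size, i;
    int *failed;                           /* vector of length MPI_COMM_SIZE */
    int token = 0;

    rc = MPI_Bcast(&token, 1, MPI_INT, 0, comm);
    if (rc != MPI_SUCCESS) {               /* e.g. MPI_ERR_OTHER after a failure */
        MPI_Comm_size(comm, &size);
        MPI_Attr_get(comm, FT_MPI_PROC_FAILED, &failed, &flag);
        if (flag)
            for (i = 0; i < size; i++)
                if (failed[i])
                    fprintf(stderr, "rank %d has failed\n", i);
    }
}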
First sketch of FT-MPI • Some functions are allowed to be used in a partial form to facilitate recovery. • I.e. MPI_Barrier() can still be used to sync processes, but will only wait for the surviving processes… • The formation of a new communicator will also be allowed to work with a broken communicator. • MPI_Finalize does not need a communicator specified.
First sketch of FT-MPI • Forming a new communicator that the application can use to continue is the important part. • Two functions can be modified to be used: • MPI_COMM_CREATE (comm, group, newcomm) • MPI_COMM_SPLIT (comm, colour, key, newcomm)
First sketch of FT-MPI • MPI_COMM_CREATE ( ) • Called with the group set to a new constant • FT_MPI_LIVING (!) • Creates a new communicator that contains all the processes that continue to survive. • A special case could be to allow MPI_COMM_WORLD to be specified as both input and output communicator.
First sketch of FT-MPI • MPI_COMM_SPLIT ( ) • Called with the colour set to a new constant • FT_MPI_NOT_DEAD_YET (!) • key can be used to control the new rank of processes within the new communicator. • Again creates a new communicator that contains all the processes that continue to survive.
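Putting the two slides together, recovery might look roughly like this. FT_MPI_LIVING and FT_MPI_NOT_DEAD_YET are the constants proposed above, not part of any existing MPI implementation.

#include <mpi.h>

void recover_world(MPI_Comm *world)
{
    MPI_Comm survivors;
    int old_rank;

    MPI_Comm_rank(*world, &old_rank);

    /* Rebuild a communicator from the processes that are still alive. */
    MPI_Comm_create(*world, FT_MPI_LIVING, &survivors);

    /* Equivalent via split; the key preserves the old relative ordering:
       MPI_Comm_split(*world, FT_MPI_NOT_DEAD_YET, old_rank, &survivors);  */

    /* Per the first sketch, a barrier only waits for surviving processes,
       so it can still be used to synchronise the recovery itself. */
    MPI_Barrier(survivors);

    *world = survivors;
}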
Simple FT enabled Example • Simple application at first • Bag of tasks, where the tasks know how to handle a failure. • Server just divides up the next set of data to be calculated between the survivors. • Clients nominate a new server if they have enough state. • (Can get the state by using ALL2ALL communications for results).
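A rough sketch of the server side of that example, under the first-sketch API above. FT_MPI_LIVING is the proposed constant; work_remains(), hand_out_chunks() and collect_results() are placeholders for the application's own routines.

#include <mpi.h>

int  work_remains(void);
int  hand_out_chunks(MPI_Comm comm);      /* returns an MPI error code */
void collect_results(MPI_Comm comm);

void server_loop(MPI_Comm comm)
{
    MPI_Comm survivors;

    while (work_remains()) {
        if (hand_out_chunks(comm) != MPI_SUCCESS) {   /* a client has died */
            /* Rebuild a communicator of survivors and redistribute the work. */
            MPI_Comm_create(comm, FT_MPI_LIVING, &survivors);
            comm = survivors;
            continue;                                 /* re-issue the failed chunks */
        }
        collect_results(comm);
    }
}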
A bigger meaner example (PSTSWM) • Parallel Spectral Transform Shallow Water Model • 2D grid calculation • 3D in the actual computation, with one axis performing FFTs, the second global reductions and the third layered sequentially upon each logical processor. • The calculation cannot support reduced grids like those supported by the Parallel Community Climate Model (PCCM), a future target application for FT-MPI. • I.e. if we lose a logical grid point (node) we must replace it!
A bigger meaner example (PSTSWM) • First Sketch ideas for FT-MPI are fine for applications that can handle a failure and whose functional calling sequences are not too deep… • Otherwise MPI API calls can be buried deep within routines, and any errors may take quite a while to bubble to the surface where the application can take effective action to handle them and recover.
A bigger meaner example (PSTSWM) • This application proceeds in a number of well-defined stages and can only handle failure by restarting from a known set of data. • I.e. user checkpoints have to be taken, and must still be reachable. • The user requirement is for the application to be started and run to completion with the system automatically handling errors without manual intervention.
A bigger meaner example (PSTSWM) • Invalidating only the failed communicators, as in the first sketch, is not enough for this application. • PSTSWM creates communicators for each row and column of the 2-D grid.
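For reference, this is roughly how such row and column communicators are built from a logical nprow x npcol grid (an assumed layout, not PSTSWM's actual code):

#include <mpi.h>

void build_grid_comms(int nprow, int npcol,
                      MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank, myrow, mycol;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    myrow = rank / npcol;
    mycol = rank % npcol;

    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm);  /* one per row    */
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm);  /* one per column */

    /* A failed node breaks the row and column communicators it belongs to,
       as the next diagrams show. */
}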
[Diagrams: A bigger meaner example (PSTSWM). A failed node in the 2-D grid of communicators: the row and column communicators containing the failed node are failed; communication along the other rows and columns still works; the butterfly p2p exchanges that cross the failed node are unknown; and later traffic on an axis is also unknown, as a previous failure on that axis might not yet have been detected.]
A bigger meaner example (PSTSWM) • What is really wanted is for four things to happen…. • Firstly, ALL communicators are marked as broken… even if some are recoverable. • The underlying system propagates error messages to all communicators, not just the ones directly affected by the failure. • Secondly, all MPI operations become NOPs where possible, so that the application can bubble the error to the top level as fast as possible.
A bigger meaner example (PSTSWM) • Thirdly, the run-time system spawns a replacement node on behalf of the application using a predetermined set of metrics. • Finally, the system allows this new process to be combined with the surviving communicators at MPI_Comm_create time. • The position (rank) of the new processes is not so important in this application, as restart data has to be redistributed anyway, but may be important for other applications.
A bigger meaner example (PSTSWM) • For this to occur, we need a means of identifying whether a process has been spawned for the purpose of recovery (by either the run-time system or an application itself), as in the sketch below. • MPI_Comm_split (com, ft_mpi_still_alive,..) vs • MPI_Comm_split (ft_mpi_external_com, ft_mpi_new_spawned,..) • PSTSWM doesn't care which task died and frankly doesn't want to know! • It just wants to continue calculating…
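A sketch of how a process might tell the two cases apart, using the names implied by this slide (ft_mpi_still_alive, ft_mpi_new_spawned, ft_mpi_external_com and the query was_spawned_for_recovery() are all hypothetical, not a real API):

#include <mpi.h>

int was_spawned_for_recovery(void);   /* hypothetical query to the run-time */

MPI_Comm rejoin(MPI_Comm broken_comm)
{
    MPI_Comm recovered;
    int rank = 0;

    if (!was_spawned_for_recovery()) {
        /* An original survivor rejoins through its old (broken) communicator. */
        MPI_Comm_rank(broken_comm, &rank);
        MPI_Comm_split(broken_comm, ft_mpi_still_alive, rank, &recovered);
    } else {
        /* A replacement spawned by the run-time system joins from outside. */
        MPI_Comm_split(ft_mpi_external_com, ft_mpi_new_spawned, 0, &recovered);
    }

    /* Either way, every process reloads the last user checkpoint and carries
       on calculating; PSTSWM never needs to know which task actually died. */
    return recovered;
}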