HARNESS and Fault Tolerant MPI Graham E Fagg A Joint project with Emory University and ORNL
Harness and FT-MPI • A little on HARNESS and plug-ins for FT-MPI • Why FT-MPI • FT-MPI • G_hcore • Conclusions and Futures
HARNESS • HARNESS: Heterogeneous Adaptable Reconfigurable NEtworked SystemS • Also described as a “distributed, reconfigurable and heterogeneous computing environment that supports dynamically adaptable parallel applications”
HARNESS • Distributed Virtual Machines (DVMs) built from HARNESS core (hcore) services provided by daemons • Think back to PVM and pvmds • A local DVM -> Personal Grid?
HARNESS Plug-in repository
HARNESS • Research project, so there are different versions of the hcore • Emory has produced a Java DVM system • Vaidy talked about this earlier • ORNL: a C-based system for experimenting with a Symmetric Control Algorithm • ICL/UTK: a C-based system for experimenting with FT-MPI, replication of MetaData and Remote Invocation techniques • More later
FT-MPI • Why FT-MPI and what is it? • A fault-tolerant MPI implementation • Harness needs an MPI implementation • Why not make it as survivable and dynamic as Harness?
FT-MPI • Current MPI applications live under the MPI fault-tolerance model of no faults allowed. • This is great on an MPP: if you lose a node you generally lose the partition/job anyway. • It also makes reasoning about results easy: if there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.
FT-MPI • Do we need an FT version? • As we push towards PetaFlop systems with 10,000-100,000+ nodes, the system MTBF starts becoming a problem • The Pacific Blue benchmark of 5800 CPUs for almost 7 hours took two attempts due to hardware failures.
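A rough back-of-the-envelope illustration (the numbers are illustrative, not from the talk): assuming independent node failures, the system MTBF scales roughly as MTBF_system ≈ MTBF_node / N. With a per-node MTBF of 5 years (about 2.6 million minutes) and N = 100,000 nodes, the machine would see a failure roughly every 26 minutes, far shorter than a typical long-running job.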
FT-MPI • Real goals: • an efficient MPI implementation for Harness • A test bed to develop a new generation of dynamic parallel algorithms on.
Semantics of Failure • What is a failure ? • How do we detect failures? • What would they mean to communicators and communications? • Who is responsible for handling them? • How can we handle them?
Semantics of Failure • What is a failure? • Direct loss of an MPI process • crash of the application sub-program • loss of a harness core • loss of a physical node
Semantics of Failure • What is a failure? • Loss of communications with a node • crash of the application sub-program • loss of a harness core • loss of a physical node / NIC • partitioning of the network
Semantics of Failure • What would they mean to communicators and communications? • Communicators are invalidated if there is a failure • They must be recovered or rebuilt before they are valid again • Yes, this means there are operations that you can call on an invalid communicator.
Semantics of Failure • Who is responsible for handling them? • The user's application is responsible, unless it has indicated to the system how to handle failures for it.
Semantics of Failure • Constraints on what we can do: • We support the MPI-1.X (and some of the MPI-2) API • current code should drop in unchanged • I.e. we avoid changing the MPI API; instead we change the semantics of some calls, overload others, introduce new constants, etc.
Semantics of Failure • Communicators, and communications within them, follow modes of operation upon errors that are based on their states. • There are two types of mode controllable by the application: • communicator modes and communication (message) modes
Semantics of Failure • Communicator states and modes • Under normal MPI • initialized?->OK->failed or exit (dead either way) • Under FT-MPI • FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED • or JUST • OK -> problem -> fail/dead or OK
Semantics of Failure • Communicator states and modes • Modes set using MPI attribute calls • Modes: SHRINK, BLANK, REBUILD and ABORT • ABORT… default MPI behavior
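A minimal sketch of selecting a mode, assuming the mode is exposed through a predefined attribute keyval (FTMPI_ERRMODE_KEY and FTMPI_MODE_REBUILD are illustrative names, not necessarily FT-MPI's actual constants):

/* Select the REBUILD communicator mode via the MPI attribute mechanism.  */
/* FTMPI_ERRMODE_KEY and FTMPI_MODE_REBUILD are illustrative names only.  */
int mode = FTMPI_MODE_REBUILD;
MPI_Attr_put (MPI_COMM_WORLD, FTMPI_ERRMODE_KEY, &mode);

/* Standard MPI: make calls return error codes rather than abort the job. */
MPI_Errhandler_set (MPI_COMM_WORLD, MPI_ERRORS_RETURN);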
Semantics of Failure • Communicator states and modes • SHRINK • On a rebuild this forces the missing processes to disappear from the communicator • The size changes, and process ranks may also change
Semantics of Failure • Communicator states and modes : BLANK • Rebuild the communicator so that gaps are allowed • Size returns the extent of the communicator • P2P operations to a gap fail • collective operations will work, but beware of what you think the result should be...
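A small sketch of driving a BLANK communicator (illustrative only; work, TAG_WORK and mark_rank_as_gap() are hypothetical): loop over the full extent and tolerate sends to gap ranks failing.

/* BLANK mode: MPI_Comm_size reports the full extent, gaps included, and  */
/* a P2P send to a gap rank fails, so treat that error as non-fatal.      */
int size, rank, rc;
MPI_Comm_size (comm, &size);
for (rank = 0; rank < size; rank++) {
    rc = MPI_Send (&work[rank], 1, MPI_INT, rank, TAG_WORK, comm);
    if (rc != MPI_SUCCESS)
        mark_rank_as_gap (rank);    /* hypothetical bookkeeping helper */
}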
Semantics of Failure • Communicator states and modes : REBUILD • Automatic process spawning/recovery when you rebuild a communicator that has died • the new process is inserted either in a gap or at the end • The new process is notified by the return value from MPI_Init • yes, check that value
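A sketch of "checking that value" (FTMPI_INIT_RESTARTED, restart_from_checkpoint() and normal_startup() are illustrative names, not FT-MPI's actual API):

rc = MPI_Init (&argc, &argv);
if (rc == FTMPI_INIT_RESTARTED) {       /* illustrative constant            */
    restart_from_checkpoint ();         /* we replace a dead process:       */
                                        /* recover state, don't start over  */
} else {
    normal_startup ();                  /* original start of the job        */
}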
Semantics of Failure • Communicator states and modes • How do we know? • MPI operation returns MPI_ERR_OTHER • Then you have to check attributes of the communicator
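A sketch of that check (FTMPI_STATE_KEY is an illustrative keyval name; FT_DETECTED is one of the states listed earlier):

rc = MPI_Send (buf, count, MPI_INT, dest, tag, comm);
if (rc == MPI_ERR_OTHER) {
    int flag, *state;
    MPI_Attr_get (comm, FTMPI_STATE_KEY, &state, &flag);
    if (flag && *state == FT_DETECTED) {
        /* a peer has failed: recover or rebuild the communicator */
    }
}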
Semantics of Failure • Communicator states and modes • How do we get from one state to another? • A communicator operation such as: • MPI_Comm_split, MPI_Comm_create, MPI_Comm_dup • MPI_COMM_WORLD can rebuild itself!
Semantics of Failure • Communicator states and modes

rc = MPI_Send (----, com);
if (rc == MPI_ERR_OTHER) {
    MPI_Comm_dup (com, &newcom);    /* dup rebuilds the failed communicator */
    MPI_Comm_free (&com);
    com = newcom;
    /* continue.. */
    /* retry the Send on com here.. */
}
rc = MPI_Bcast ( initial_work…. );
if (rc == MPI_ERR_OTHER) reclaim_lost_work (…);
while ( !all_work_done ) {
    if (work_allocated) {
        rc = MPI_Recv ( buf, ans_size, result_dt, MPI_ANY_SOURCE,
                        MPI_ANY_TAG, comm, &status);
        if (rc == MPI_SUCCESS) {
            handle_work (buf);
            free_worker (status.MPI_SOURCE);
            all_work_done--;
        }
        else {
            reclaim_lost_work (status.MPI_SOURCE);
            if (no_surviving_workers) { /* ! do something ! */ }
        }
    } /* work allocated */

    /* Get a new worker as we must have received a result or a death */
    rank = get_free_worker_and_allocate_work ();
    if (rank) {
        rc = MPI_Send ( … rank … );
        if (rc == MPI_ERR_OTHER) reclaim_lost_work (rank);
        if (no_surviving_workers) { /* ! do something ! */ }
    } /* if free worker */
} /* while work to do */
(Figure: Master and workers within a communicator)
Semantics of Failure • Communication states and message modes • How communications are handled can also be controlled • Just because a communicator has a problem does not mean the application halts until it is fixed..
Semantics of Failure • Communication states and message modes • Two flavors • CONTINUE (cont) • NO-OP (NOP)
Semantics of Failure • Communication states and message modes • CONT • All messages that can be sent are sent • (You always get to receive if a message is already waiting for you)
Semantics of Failure • Communication states and message modes • NOP • You cannot initiate any NEW communications • previous operations should complete if they are still valid • Designed to allow the thread of control of a failed application to float up the layers as fast as possible.
Semantics of Failure • The layer problem • Made worse by good software engineering and the use of multiple nested libraries.
Semantics of Failure • The layer problem, illustrated • Top layer: build an unstructured grid • Next layer: distribute some work • Next layer: solve my part • Inner loop: Do I=0, XXX …. MPI_Sendrecv ( ) ….. • Someone died somewhere • I can fix it up here (at the top layer), not down here (in the inner loop) • NOPs allow me to get out of this part FAST (sketched below)
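A minimal sketch of the inner solver layer (illustrative, not FT-MPI reference code): on a failure it simply propagates the error code upward; with the communicator in NOP message mode, any MPI calls met while unwinding become no-ops, so control floats back up to the layer that can rebuild and redistribute the work.

static int solve_my_part (MPI_Comm comm, double *sendbuf, double *recvbuf,
                          int n, int up, int down, int max_iters)
{
    MPI_Status status;
    for (int i = 0; i < max_iters; i++) {
        int rc = MPI_Sendrecv (sendbuf, n, MPI_DOUBLE, up,   0,
                               recvbuf, n, MPI_DOUBLE, down, 0,
                               comm, &status);
        if (rc != MPI_SUCCESS)
            return rc;   /* someone died somewhere: bail out, let the top layer fix it */
    }
    return MPI_SUCCESS;
}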
Semantics of Failure • Communication states and message modes • Collective operations are dealt with differently from p2p • They will only return successfully if the operation would have given the surviving members the same answer as if no failure had occurred
Semantics of Failure • Communication states and message modes • Collective operations fall into two classes: • broadcast / scatter: succeed if a non-root node fails and the data survives • gather / reduce: fail if there is an error
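A small sketch of what the two classes mean for calling code (an illustrative reading of the slides, not FT-MPI reference code):

rc = MPI_Bcast (work, n, MPI_DOUBLE, 0, comm);     /* may still succeed if a    */
                                                   /* non-root worker died      */
rc = MPI_Reduce (part, total, n, MPI_DOUBLE, MPI_SUM, 0, comm);
if (rc == MPI_ERR_OTHER) {                         /* reduce fails on any death */
    MPI_Comm_dup (comm, &newcomm);                 /* rebuild (REBUILD mode)    */
    MPI_Comm_free (&comm);
    comm = newcomm;
    /* repeat the reduce on the rebuilt communicator */
}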