HARNESS and Fault Tolerant MPI Graham E Fagg A Joint project with Emory University and ORNL
Harness and FT-MPI • A little on HARNESS and plug-ins for FT-MPI • Why FT-MPI • FT-MPI • G_hcore • Conclusions and Futures
HARNESS • HARNESS: Heterogeneous Adaptable Reconfigurable NEtworked SystemS • Also described as a “distributed, reconfigurable and heterogeneous computing environment that supports dynamically adaptable parallel applications”
HARNESS • Distributed Virtual Machines (DVMs) built from HARNESS core (hcore) services provided by daemons • Think back to PVM and pvmds • A local DVM -> Personal Grid?
HARNESS Plug-in repository
HARNESS • Research project, so there are different versions of the hcore • Emory has produced a Java DVM system • Vaidy talked about this earlier • ORNL: a C-based system for experimenting with a Symmetric Control Algorithm • ICL/UTK: a C-based system for experimenting with FT-MPI, replication of MetaData and Remote Invocation techniques • More later
FT-MPI • Why FT-MPI and what is it? • A fault-tolerant MPI implementation • Harness needs an MPI implementation • Why not make it as survivable and dynamic as Harness?
FT-MPI • Current MPI applications live under the MPI fault-tolerance model of no faults allowed. • This is great on an MPP: if you lose a node you generally lose the partition/job anyway. • It also makes reasoning about results easy: if there was a fault you might have received incomplete/incorrect values and hence have the wrong result anyway.
FT-MPI • Do we need an FT version? • As we push towards PetaFlop systems with 10,000-100,000+ nodes, the system MTBF starts becoming a problem • The Pacific Blue benchmark of 5800 CPUs for almost 7 hours took two attempts due to hardware failures.
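A rough back-of-the-envelope illustration (the numbers are illustrative, not from the talk): assuming independent node failures, the system MTBF scales roughly as MTBF_system ≈ MTBF_node / N. With a per-node MTBF of 5 years (about 2.6 million minutes) and N = 100,000 nodes, the machine would see a failure roughly every 26 minutes, far shorter than a typical long-running job.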
FT-MPI • Real goals: • an efficient MPI implementation for Harness • A test bed to develop a new generation of dynamic parallel algorithms on.
Semantics of Failure • What is a failure ? • How do we detect failures? • What would they mean to communicators and communications? • Who is responsible for handling them? • How can we handle them?
Semantics of Failure • What is a failure? • Direct loss of an MPI process • crash of the application sub-program • loss of a harness core • loss of a physical node
Semantics of Failure • What is a failure? • Loss of communications with a node • crash of the application sub-program • loss of a harness core • loss of a physical node / NIC • partitioning of the network
Semantics of Failure • What would they mean to communicators and communications? • Communicators are invalidated if there is a failure • They must be recovered or rebuilt before they are valid again • Yes, this means there are operations that you can call on an invalid communicator.
Semantics of Failure • Who is responsible for handling them? • The user's application is responsible, unless it has indicated to the system how to handle failures for it.
Semantics of Failure • Constraints on what we can do: • We support the MPI-1.X (and some of the MPI-2) API • current code should drop in unchanged • I.e. we avoid changing the MPI API; instead we change the semantics of some calls, overload others, introduce new constants, etc.
Semantics of Failure • Communicators, and communications within them, follow modes of operation upon errors that are based on their states. • There are two types of mode controllable by the application: • communicator modes and communication (message) modes
Semantics of Failure • Communicator states and modes • Under normal MPI • initialized?->OK->failed or exit (dead either way) • Under FT-MPI • FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED • or JUST • OK -> problem -> fail/dead or OK
Semantics of Failure • Communicator states and modes • Modes set using MPI attribute calls • Modes: SHRINK, BLANK, REBUILD and ABORT • ABORT… default MPI behavior
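A minimal sketch of selecting a mode, assuming the mode is exposed through a predefined attribute keyval (FTMPI_ERRMODE_KEY and FTMPI_MODE_REBUILD are illustrative names, not necessarily FT-MPI's actual constants):

/* Select the REBUILD communicator mode via the MPI attribute mechanism.  */
/* FTMPI_ERRMODE_KEY and FTMPI_MODE_REBUILD are illustrative names only.  */
int mode = FTMPI_MODE_REBUILD;
MPI_Attr_put (MPI_COMM_WORLD, FTMPI_ERRMODE_KEY, &mode);

/* Standard MPI: make calls return error codes rather than abort the job. */
MPI_Errhandler_set (MPI_COMM_WORLD, MPI_ERRORS_RETURN);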
Semantics of Failure • Communicator states and modes • SHRINK • On a rebuild this forces the missing processes to disappear from the communicator • The size changes, and process ranks may also change
Semantics of Failure • Communicator states and modes : BLANK • Rebuild the communicator so that gaps are allowed • Size returns the extent of the communicator • P2P operations to a gap fail • collective operations will work, but beware of what you think the result should be...
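A small sketch of driving a BLANK communicator (illustrative only; work, TAG_WORK and mark_rank_as_gap() are hypothetical): loop over the full extent and tolerate sends to gap ranks failing.

/* BLANK mode: MPI_Comm_size reports the full extent, gaps included, and  */
/* a P2P send to a gap rank fails, so treat that error as non-fatal.      */
int size, rank, rc;
MPI_Comm_size (comm, &size);
for (rank = 0; rank < size; rank++) {
    rc = MPI_Send (&work[rank], 1, MPI_INT, rank, TAG_WORK, comm);
    if (rc != MPI_SUCCESS)
        mark_rank_as_gap (rank);    /* hypothetical bookkeeping helper */
}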
Semantics of Failure • Communicator states and modes : REBUILD • Automatic process spawning/recovery when you rebuild a communicator that has died • the new process is inserted either in a gap or at the end • The new process is notified by the return value from MPI_Init • yes, check that value
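A sketch of "checking that value" (FTMPI_INIT_RESTARTED, restart_from_checkpoint() and normal_startup() are illustrative names, not FT-MPI's actual API):

rc = MPI_Init (&argc, &argv);
if (rc == FTMPI_INIT_RESTARTED) {       /* illustrative constant            */
    restart_from_checkpoint ();         /* we replace a dead process:       */
                                        /* recover state, don't start over  */
} else {
    normal_startup ();                  /* original start of the job        */
}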
Semantics of Failure • Communicator states and modes • How do we know? • MPI operation returns MPI_ERR_OTHER • Then you have to check attributes of the communicator
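A sketch of that check (FTMPI_STATE_KEY is an illustrative keyval name; FT_DETECTED is one of the states listed earlier):

rc = MPI_Send (buf, count, MPI_INT, dest, tag, comm);
if (rc == MPI_ERR_OTHER) {
    int flag, *state;
    MPI_Attr_get (comm, FTMPI_STATE_KEY, &state, &flag);
    if (flag && *state == FT_DETECTED) {
        /* a peer has failed: recover or rebuild the communicator */
    }
}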
Semantics of Failure • Communicator states and modes • How do we get from one state to another? • A communicator operation such as: • MPI_Comm_split, MPI_Comm_create, MPI_Comm_dup • MPI_COMM_WORLD can rebuild itself!
Semantics of Failure • Communicator states and modes

rc = MPI_Send (----, com);
if (rc == MPI_ERR_OTHER) {
    MPI_Comm_dup (com, &newcom);    /* dup rebuilds the failed communicator */
    MPI_Comm_free (&com);
    com = newcom;
    /* continue.. */
    /* retry the Send on com here.. */
}
rc = MPI_Bcast ( initial_work…. );
if (rc == MPI_ERR_OTHER) reclaim_lost_work (…);
while ( !all_work_done ) {
    if (work_allocated) {
        rc = MPI_Recv ( buf, ans_size, result_dt, MPI_ANY_SOURCE,
                        MPI_ANY_TAG, comm, &status);
        if (rc == MPI_SUCCESS) {
            handle_work (buf);
            free_worker (status.MPI_SOURCE);
            all_work_done--;
        }
        else {
            reclaim_lost_work (status.MPI_SOURCE);
            if (no_surviving_workers) { /* ! do something ! */ }
        }
    } /* work allocated */

    /* Get a new worker as we must have received a result or a death */
    rank = get_free_worker_and_allocate_work ();
    if (rank) {
        rc = MPI_Send ( … rank … );
        if (rc == MPI_ERR_OTHER) reclaim_lost_work (rank);
        if (no_surviving_workers) { /* ! do something ! */ }
    } /* if free worker */
} /* while work to do */
(Figure: Master and workers within a communicator)
Semantics of Failure • Communication states and message modes • How communications are handled can also be controlled • Just because a communicator has a problem does not mean the application halts until it is fixed..
Semantics of Failure • Communication states and message modes • Two flavors • CONTINUE (cont) • NO-OP (NOP)
Semantics of Failure • Communication states and message modes • CONT • All messages that can be sent are sent • (You always get to receive if a message is already waiting for you)
Semantics of Failure • Communication states and message modes • NOP • You cannot initiate any NEW communications • previous operations should complete if they are still valid • Designed to allow the thread of control of a failed application to float up the layers as fast as possible.
Semantics of Failure • The layer problem • Made worse by good software engineering and the use of multiple nested libraries.
Semantics of Failure • The layer problem, illustrated • Top layer: build an unstructured grid • Next layer: distribute some work • Next layer: solve my part • Inner loop: Do I=0, XXX …. MPI_Sendrecv ( ) ….. • Someone died somewhere • I can fix it up here (at the top layer), not down here (in the inner loop) • NOPs allow me to get out of this part FAST (sketched below)
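A minimal sketch of the inner solver layer (illustrative, not FT-MPI reference code): on a failure it simply propagates the error code upward; with the communicator in NOP message mode, any MPI calls met while unwinding become no-ops, so control floats back up to the layer that can rebuild and redistribute the work.

static int solve_my_part (MPI_Comm comm, double *sendbuf, double *recvbuf,
                          int n, int up, int down, int max_iters)
{
    MPI_Status status;
    for (int i = 0; i < max_iters; i++) {
        int rc = MPI_Sendrecv (sendbuf, n, MPI_DOUBLE, up,   0,
                               recvbuf, n, MPI_DOUBLE, down, 0,
                               comm, &status);
        if (rc != MPI_SUCCESS)
            return rc;   /* someone died somewhere: bail out, let the top layer fix it */
    }
    return MPI_SUCCESS;
}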
Semantics of Failure • Communication states and message modes • Collective operations are dealt with differently from p2p • They will only return successfully if the operation would have given the surviving members the same answer as if no failure had occurred
Semantics of Failure • Communication states and message modes • Collective operations fall into two classes: • broadcast / scatter: succeed if a non-root node fails and the data survives • gather / reduce: fail if there is an error
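A small sketch of what the two classes mean for calling code (an illustrative reading of the slides, not FT-MPI reference code):

rc = MPI_Bcast (work, n, MPI_DOUBLE, 0, comm);     /* may still succeed if a    */
                                                   /* non-root worker died      */
rc = MPI_Reduce (part, total, n, MPI_DOUBLE, MPI_SUM, 0, comm);
if (rc == MPI_ERR_OTHER) {                         /* reduce fails on any death */
    MPI_Comm_dup (comm, &newcomm);                 /* rebuild (REBUILD mode)    */
    MPI_Comm_free (&comm);
    comm = newcomm;
    /* repeat the reduce on the rebuilt communicator */
}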