Fault Tolerant MPI Anthony Skjellum*$, Yoginder Dandass$, Pirabhu Raman* MPI Software Technology, Inc* Mississippi State University$ FALSE2002 Workshop November 14, 2002
Outline • Motivations • Strategy • Audience • Technical Approaches • Summary and Conclusions
Motivations for MPI/FT • Well written and well tested legacy MPI applications will abort, hang or die more often in harsh or long-running environments because of extraneously introduced errors. • Parallel Computations are Fragile at Present • There is apparent demand for recovery of running parallel applications • Learn how “fault tolerant” we can make MPI programs and implementations without abandoning this programming model
Strategy • Build on MPI/Pro, a commercial MPI • Support extant MPI applications • Define application requirements/subdomains • Do a very good job for Master/Slave Model First • Offer higher-availability services • Harden Transports • Work with OEMs to offer more recoverable services • Use specialized parallel computational models to enhance effective coverage • Exploit third-party checkpoint/restart, nMR, etc. • Exploit Gossip for Detection
Audience • High Scale, Higher Reliability Users • Low Scale, Extremely High Reliability Users (nMR involved for some nodes) • Users of clusters for production runs • Users of embedded multicomputers • Space-based, and highly embedded settings • Grid-based MPI applications
Detection and Recovery From Extraneously Induced Errors • [Figure: layered view of errors/failures, detection, and recovery] • Detection, by layer (depends on application execution model specifics): • Application: ABFT/aBFT • MPI: MPI sanity • Network, drivers, NIC: N/W sanity • OS, R/T, monitors: watchdog/BIT/other • Recovery: process; application recovers
Coarse Grain Detection and Recovery (adequate if no SEU) • [Figure: flowchart for a job launched with mpirun -np NP my.app] • No error: my.app finishes, mpirun finishes (success) • Process dies: my.app hangs, mpirun hangs (failure) • MPI-lib error: my.app aborts, mpirun finishes (failure) • Detection: aborted run? hung job? • Recovery: if yes, abort my.app and recover; if no, continue waiting • Additional flowchart label: send it to ground
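The coarse-grain scheme above can be sketched as a small launcher-side watchdog. This is only an illustration under stated assumptions: the function names, the fixed timeout used as the "hung job" test, and the restart limit are hypothetical and do not come from the slides.

```python
import subprocess

def run_and_classify(cmd, timeout_s):
    """Run a job once and classify the outcome as 'success', 'aborted', or 'hung'."""
    try:
        done = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising on timeout,
        # which plays the role of "abort my.app" for a hung job.
        return "hung"
    return "success" if done.returncode == 0 else "aborted"

def run_with_recovery(cmd, timeout_s, max_restarts=2):
    """Restart the whole job after an abort or hang (assumed recovery policy)."""
    history = []
    for _ in range(max_restarts + 1):
        outcome = run_and_classify(cmd, timeout_s)
        history.append(outcome)
        if outcome == "success":
            break
    return history
```

In practice cmd would be something like ["mpirun", "-np", "4", "my.app"]. A timeout cannot tell a slow job from a hung one, which is one reason this level of detection is described as coarse.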
Example: NIC errors • [Figure: message path from the send buf on rank r0 to the recv buf on rank r1, passing through the user level, MPI, device level, and NIC on each side] • NIC: 2nd highest SEU strike-rate after main CPU • Legacy MPI applications will be run in simplex mode
“Obligations” of a Fault-Tolerant MPI • Ensure Reliability of Data Transfer at the MPI Level • Build Reliable Header Fields • Detect Process Failures • Transient Error Detection and Handling • Checkpointing support • Two-way negotiation with scheduler and checkpointing components • Sanity checking of MPI applications and underlying resources (non-local)
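As an illustration of the first two obligations, the sketch below protects both the header fields and the payload of a message with one CRC-32. The frame layout (source, destination, tag, length) and the function names are assumptions made for this example; they are not the MPI/FT wire format.

```python
import struct
import zlib

HEADER = struct.Struct("!iiiI")  # assumed fields: source, destination, tag, payload length
CRC = struct.Struct("!I")        # CRC-32 trailer

def pack_message(src, dst, tag, payload):
    """Build a frame whose CRC covers the header as well as the payload."""
    header = HEADER.pack(src, dst, tag, len(payload))
    return header + payload + CRC.pack(zlib.crc32(header + payload))

def unpack_message(frame):
    """Verify and decode a frame; raise ValueError on any detected corruption."""
    if len(frame) < HEADER.size + CRC.size:
        raise ValueError("frame too short")
    body = frame[:-CRC.size]
    (crc,) = CRC.unpack(frame[-CRC.size:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    src, dst, tag, length = HEADER.unpack(body[:HEADER.size])
    payload = body[HEADER.size:]
    if length != len(payload):
        raise ValueError("length mismatch")
    return src, dst, tag, payload
```

A receiver that gets ValueError can treat the transfer as a transient error and request a resend, instead of delivering a corrupted tag or length to the application.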
Low Level Detection Strategy for Errors and Dead Processes • [Figure: flowchart] • A: Initiate device-level communication • Low-level success? If yes, return MPI_SUCCESS • If no, check the error type • Timeout: ask SCT whether the peer is alive; if yes, reset timeout and return to A; if no, trigger event • Other: ask EH whether the error is recoverable; if yes, reset timeout and return to A; if no, trigger event • EH: Error Handler • SCT: Self-Checking Thread
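One possible reading of this flowchart as a retry loop is sketched below. The exception classes, the callback names, the returned strings, and the bound on attempts are all hypothetical; the slide does not say whether retries are bounded.

```python
class DeviceTimeout(Exception):
    """Assumed signal that a device-level operation timed out."""

class DeviceError(Exception):
    """Assumed signal for any other device-level error."""

def send_with_detection(device_send, peer_alive, recoverable, max_attempts=5):
    """Retry a device-level operation, consulting the SCT and EH on failure.

    device_send: callable that performs the low-level communication
    peer_alive:  callable standing in for "Ask SCT: is peer alive?"
    recoverable: callable standing in for "Ask EH: recoverable?"
    """
    for _ in range(max_attempts):
        try:
            device_send()
            return "MPI_SUCCESS"
        except DeviceTimeout:
            if not peer_alive():
                return "EVENT_PEER_DEAD"        # trigger event
        except DeviceError as err:
            if not recoverable(err):
                return "EVENT_UNRECOVERABLE"    # trigger event
        # Otherwise: reset the timeout and go back to point A.
    return "EVENT_RETRIES_EXHAUSTED"            # assumption: give up eventually
```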
Design Choices • Message Replication (nMR to simplex) • Replicated ranks send/receive messages independently from each other • One copy of the replicated rank acts as the message conduit • Message Replication (nMR to nMR) • Replicated ranks send/receive messages independently from each other • One copy of the replicated rank acts as the message conduit
Parallel nMR Advantages • Voting on messages only (not on each word of state) • Local errors remain local • Requires two failures to fail (e.g., A0 and C0) • [Figure labels: A, B, C; n=3; np=4]
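Voting on messages can be illustrated with a strict-majority vote over the copies of one message received from n replicas. This is a minimal sketch; the function name and the choice to also report the disagreeing replicas are assumptions for illustration.

```python
from collections import Counter

def vote(copies):
    """Return (winning payload, indices of replicas that disagreed).

    copies: list of payloads (bytes), one per replica of the sending rank.
    Raises RuntimeError when no payload has a strict majority.
    """
    if not copies:
        raise ValueError("no copies to vote on")
    winner, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        raise RuntimeError("no majority among replicas")
    dissenters = [i for i, c in enumerate(copies) if c != winner]
    return winner, dissenters
```

With n=3, a single corrupted copy is outvoted and identified, while two disagreeing copies leave no majority, which matches the point that two failures are required to fail.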
MFT-IIIs: Master/Slave with MPI-1.2 • Master (rank 0) is nMR • MPI_COMM_WORLD is maintained in nMR • MPI-1.2 only • Application cannot actively manage processes • Only middleware can restart processes • Pros: • Supports send/receive MPI-1.2 • Minimizes excess messaging • Largely MPI application transparent • Quick recovery possible • ABFT based process recovery assumed. • Cons: • Scales to O(10) Ranks only • Voting still limited • Application explicitly fault aware
MFT-IVs: Master/Slave with MPI-2 • Master (rank 0) is nMR • MPI_COMM_WORLD is maintained in nMR • MPI_COMM_SPAWN() • Application can actively restart processes • Pros: • Supports send/receive MPI-1.2 + DPM • Minimizes excess messaging • Largely MPI application transparent • Quick recovery possible, simpler than MFT-IIIs • ABFT based process recovery assumed. • Cons: • Scales to O(10) Ranks only • Voting still limited • Application explicitly fault aware
Checkpointing the Master for recovery • Master checkpoints • Voting on master liveness • Master failure detected • Lowest rank slave restarts master from checkpointed data • Any of the slaves could promote and assume the role of master • Peer liveness knowledge required to decide the lowest rank • Pros: • Recovery independent of the number of faults • No additional resources • Cons: • Checkpointing further reduces scalability • Recovery time depends on the checkpointing frequency • [Figure: Rank 0 (master) exchanging MPI messages with Rank 1, Rank 2, ..., Rank n (slaves), e.g., a message from 1 to 0 and a message from 0 to 2; Rank 0 writes checkpointing data to a storage medium]
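The recovery rule on this slide (the lowest-ranked live slave restarts the master from checkpointed data) can be sketched as follows. The JSON file format, the write-then-rename step, and all function names are assumptions made for the example.

```python
import json
import os
import tempfile

def write_checkpoint(path, state):
    """Write master state to a temporary file, then rename it into place,
    so that a crash during the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def read_checkpoint(path):
    with open(path) as f:
        return json.load(f)

def elect_restarter(live_slaves):
    """Every slave needs the same view of peer liveness for this to agree."""
    return min(live_slaves)

def recover_master(my_rank, live_slaves, path):
    """Return the restored master state on the elected slave, None elsewhere."""
    if my_rank != elect_restarter(live_slaves):
        return None
    return read_checkpoint(path)
```

Work done by the master after its last checkpoint is lost, which is why recovery time depends on the checkpointing frequency.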
Checkpointing Slaves for Recovery (Speculative) • Slaves checkpoint periodically at a low frequency • Prob. of failure of a slave > prob. of failure of the master • Master failure detected • Recovered from data checkpointed at various slaves • Peer liveness knowledge required to decide the lowest rank • Pros: • Checkpointing overhead of master eliminated • Aids in faster recovery of slaves • Cons: • Increase in master recovery time • Increase in overhead due to checkpointing of all the slaves • Slaves are stateless, so checkpointing them does not help in any way in restarting them • Checkpointing at all the slaves could be very expensive • Instead of checkpointing, slaves could return the results to the master • [Figure: Rank 0 (master) exchanging MPI messages with Rank 1, Rank 2, ..., Rank n (slaves), e.g., a message from 1 to 0 and a message from 0 to 2; each slave writes checkpointing data to its own storage medium (SM)]
Adaptive checkpointing and nMR of the master for recovery • Start with ‘n’ replicates • Initial checkpointing calls generate no-ops • Slaves track the liveness of master and the replicates • Failure of last replicate initiates checkpointing • Pros: • Tolerates ‘n’ faults with negligible recovery time • Subsequent faults can still be recovered • Cons: • Increase in overhead of tracking the replicates • [Figure: replicated Rank 0 (several copies) with a storage medium for checkpointing data, and Rank 1, Rank 2, ..., Rank n (slaves); the logical flow of MPI messages is shown separately from the actual flow, e.g., a message from 1 to 0 and a message from 0 to 2]
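The adaptive behavior can be sketched as a checkpoint call that stays a no-op while more than one copy of the master is alive. That threshold is an interpretation of "failure of last replicate initiates checkpointing", and the class and method names are hypothetical.

```python
class AdaptiveCheckpointer:
    """Checkpoint calls are no-ops while replication still provides redundancy."""

    def __init__(self, master_copies, save):
        self.live = set(master_copies)  # ids of master copies believed alive
        self.save = save                # callable that persists the state

    def copy_failed(self, copy_id):
        self.live.discard(copy_id)

    def checkpoint(self, state):
        """Return True if the state was actually written."""
        if len(self.live) > 1:
            return False      # redundancy remains: generate a no-op
        self.save(state)      # last copy standing: start real checkpoints
        return True
```

The application calls checkpoint() at the same points throughout the run; the cost of writing state is only paid once the replicas have been used up.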
Self-Checking Threads (scales > O(10) nodes can be considered) • [Diagram labels: invoked by MPI library; queries by coordinator] • Vote on communicator state • Check buffers • Check queues for aging • Check local program state • Invoked periodically • Invoked when suspicion arises • Checks whether peers are alive • Checks for network sanity • Server to coordinator queries • Exploits timeouts • Periodic execution, no polling • Provides heart-beat across app. • Can check internal MPI state
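One of the checks listed above, checking queues for aging, might look like the sketch below. The representation of a queue as a mapping from request id to post time, and the age threshold, are assumptions for illustration only.

```python
def aged_requests(pending, now, max_age_s):
    """Return ids of pending requests older than max_age_s, oldest first.

    pending: dict mapping a request id to the time it was posted (seconds).
    """
    stale = [(now - posted, rid) for rid, posted in pending.items()
             if now - posted > max_age_s]
    return [rid for _, rid in sorted(stale, reverse=True)]
```

A self-checking thread could run such a check on each periodic invocation and raise suspicion about the peers involved in the stale requests.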
MPI/FT SCT Support Levels • Simple, non-portable, uses internal state of MPI and/or system (“I”) • Simple, portable, exploits threads and PMPI_ or PERUSE API (“II”) • Complex state checks, portable, exploits queue interfaces (“III”) • All of the above
MPI/FT Coordinator • Spawned by mpirun or a similar launcher • Closely coupled / friend with the MPI library of application • User transparent • Periodically collects status information from the application • Can kill and restart the application or individual ranks • Preferably implemented using MPI-2 • We’d like to replicate and distribute this functionality
Use of Gossip in MPI/FT • Applications in Model III and IV assume a star topology • Gossip requires a virtual all-to-all topology • Data network may be distinct from the control network • Gossip provides: • Potentially scalable and fully distributed scheme for failure detection and notification with reduced overhead • Notification of failures in the form of broadcast
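Gossip-based Failure Detection • [Figure: nodes 1, 2, and 3 exchange heartbeat data by gossip; each node holds per-node heartbeat values and a suspect vector S] • Labeled comparisons: heartbeat 0 < 2, no update; heartbeat 3 > 0, update; heartbeat 4 > 2, update • Timing label: Tcleanup 5 * Tgossip • A node whose entry is not updated within the cleanup time is marked “Node dead !!!” • S: suspect vector • Panels show Node 1’s, Node 2’s, and Node 3’s data with the clock and cycles elapsed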
Consensus about Failure • [Figure: suspect matrices over nodes 1, 2, and 3, with live list L, shown at Node 1 and at Node 2; the suspect matrices are merged at Node 1; outcome at Nodes 1 and 2: Node 3 dead] • L: live list
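The two gossip slides can be illustrated with a small single-process simulation. The conventions here are assumptions rather than details recovered from the figures: each heartbeat entry counts gossip cycles since a node was last heard from (smaller is fresher), a node is suspected once its entry reaches 5 cycles (taken from the "Tcleanup 5 * Tgossip" label), and consensus means every node on the local live list suspects the same node.

```python
T_CLEANUP_CYCLES = 5  # assumed suspicion threshold, in gossip cycles

class GossipNode:
    def __init__(self, node_id, all_ids):
        self.id = node_id
        self.ids = list(all_ids)
        # age[n]: gossip cycles since n was last heard from (own entry stays 0)
        self.age = {n: 0 for n in self.ids}
        # suspect[n][m]: does node n suspect node m? (row n is n's suspect vector)
        self.suspect = {n: {m: False for m in self.ids} for n in self.ids}

    def _refresh_own_row(self):
        self.suspect[self.id] = {n: self.age[n] >= T_CLEANUP_CYCLES
                                 for n in self.ids}

    def tick(self):
        """One gossip interval elapses without news from anyone."""
        for n in self.ids:
            if n != self.id:
                self.age[n] += 1
        self._refresh_own_row()

    def message(self):
        return dict(self.age), {n: dict(row) for n, row in self.suspect.items()}

    def receive(self, sender, msg):
        """Merge a gossip message: keep whichever information is fresher."""
        ages, matrix = msg
        for n in self.ids:
            if n == self.id:
                continue
            if n == sender or ages[n] < self.age[n]:
                self.age[n] = ages[n]
                self.suspect[n] = dict(matrix[n])
        self._refresh_own_row()

    def consensus_failed(self):
        """Nodes that every member of the local live list suspects."""
        mine = self.suspect[self.id]
        live = [n for n in self.ids if not mine[n]]
        return [j for j in self.ids
                if mine[j] and all(self.suspect[n][j] for n in live)]
```

In a three-node run where node 3 never gossips, nodes 1 and 2 keep refreshing each other's entries while node 3's entry ages, and both reach consensus on its failure once the threshold is crossed.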
Issues with Gossip - I • After node a fails • If node b, the node that arrives at consensus on node a’s failure last (the notification broadcaster), also fails before the broadcast • Gossiping continues until another node, c, suspects that node b has failed • Node c broadcasts the failure notification of node a • Eventually node b is also determined to have failed
Issues with Gossip - II • If control and data networks are separate: • MPI progress threads monitor the status of the data network • Failure of the link to the master is indicated when communication operations time out • Gossip monitors the status of the control network • The progress threads will communicate the suspected status of the master node to the gossip thread • Gossip will incorporate this information in its own failure detection mechanism
Issues with Recovery • If network failure causes the partitioning of processes: • Two or more isolated groups may form that communicate within themselves • Each group assumes that the other processes have failed and attempts recovery • Only the group that can reach the checkpoint data is allowed to initiate recovery and proceed • The issue of recovering when multiple groups can access the checkpoint data is under investigation • If only nMR is used, the group with the master is allowed to proceed • The issue of recovering when the replicated master processes are split between groups is under investigation
Shifted APIs • Try to “morally conserve” MPI standard • Timeout parameter added to messaging calls to control the behavior of individual MPI calls • Modify existing MPI calls • Add new calls with the added functionality to support idea • Add a callback function to MPI calls (for error handling) • Modify existing MPI calls • Add new calls with the added functionality • Support in-band or out-of-band error management made explicit to application • Runs in concert with MPI_ERRORS_RETURN. • Offers opportunity to give hints as well, where meaningful.
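To make the idea of a shifted call concrete, the sketch below adds a timeout parameter and an optional error callback to a blocking receive. It uses a plain in-process queue rather than MPI, and the function name, status codes, and callback signature are hypothetical; none of them are part of the MPI standard or of MPI/Pro.

```python
import queue

STATUS_OK = 0        # hypothetical status codes for this sketch
STATUS_TIMEOUT = 1

def recv_with_timeout(mailbox, timeout_s, on_error=None):
    """Receive one message, or give up after timeout_s seconds.

    Returns (status, message). The status is the in-band error report;
    on_error, if given, is the out-of-band report via callback.
    """
    try:
        return STATUS_OK, mailbox.get(timeout=timeout_s)
    except queue.Empty:
        if on_error is not None:
            on_error(STATUS_TIMEOUT)
        return STATUS_TIMEOUT, None
```

Returning a status instead of aborting mirrors what MPI_ERRORS_RETURN does for standard calls: the application sees the failure and decides how to proceed.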
Application-based Checkpoint • Point of synchronization for a cohort of processes • Minimal fault tolerance could be applied only at such checkpoints • Defines “save state” or “restart data” needed to resume • Common practice in parallel CFD and other MPI codes, because of reality of failures • Essentially gets no special help from system • Look to Parallel I/O (MPI-2) for improvement • Why? Minimum complexity of I/O + Feasible
In situ Checkpoint Options • Checkpoint to bulk memory • Checkpoint to flash • Checkpoint to other distributed RAM • Other choices? • Are these useful … depends on error model
Early Results with Hardening Transport: CRC vs. time-based nMR • [Figure: comparison of nMR and CRC with baseline using MPI/Pro (version 1.6.1-1tv)] • [Figure: MPI/Pro comparisons of time ratios, normalized against baseline performance]
Early Results, II • [Figure: comparison of nMR and CRC with baseline using MPICH (version 1.2.1)] • [Figure: MPICH comparison of time ratios using baseline MPI/Pro timings]
Early Results, III: time-based nMR with MPI/Pro • [Figure: total time for 10,000 runs vs. message size for various nMR] • [Figure: MPI/Pro time ratio comparisons for various nMR to baseline]
Other Possible Models • Master Slave was considered before • Broadcast/Reduce Data Parallel Apps. • Independent Processing + Corner Turns • Ring Computing • Pipeline Bi-Partite Computing • General MPI-1 models (all-to-all) • Idea: Trade Generality for Coverage
What about Receiver-Based Models? • Should we offer, instead or in addition to MPI/Pro, a receiver-based model? • Utilize publish/subscribe semantics • Bulletin boards? Tagged messages, etc. • Try to get rid of single point of failure this way • Sounds good, can it be done? • Will anything like an MPI code work? • Does anyone code this way!? (e.g., Java Spaces, Linda, military embedded distributed computing)
Plans for Upcoming 12 Months • Continue Implementation of MPI/FT to Support Applications in simplex mode • Remove single points of failure for master/slave • Support for multi-SPMD Models • Explore additional application-relevant models • Performance Studies • Integrate fully with Gossip protocol for detection
Summary & Conclusions • MPI/FT = MPI/Pro + one or more availability enhancements • Fault-tolerance concerns lead to new MPI implementations • Support for Simplex, Parallel nMR and/or Mixed Mode • nMR not scalable • Both time-based nMR and CRC (depends upon message size and the MPI implementation) - can do now • Self-Checking Threads - can do now • Coordinator (Execution Models) - can do very soon • Gossip for detection - can do, need to integrate • Shifted APIs/callbacks - easy to do, will people use? • Early results with CRC vs. nMR over TCP/IP cluster shown
Related Work • G. Stellner (CoCheck, 1996) • Checkpointing that works with MPI and Condor • M. Hayden (The Ensemble System, 1997) • Next-generation Horus communication toolkit • Evripidou et al. (A Portable Fault Tolerant Scheme for MPI, 1998) • Redundant processes approach to masking failed nodes • A. Agbaria and R. Friedman (Starfish, 1999) • Event bus, works in specialized language, related to Ensemble • G.F. Fagg and J.J. Dongarra (FT-MPI, 2000) • Growing/shrinking communicators in response to node failures, memory-based checkpointing, reversing calculation? • G. Bosilca et al. (MPICH-V, 2002), new, to be presented at SC2002 • Grid-based modifications to MPICH for “volatile nodes” • Automated checkpoint, rollback, message logging • Also, substantial literature related to ISIS/HORUS (Dr. Birman et al. at Cornell) that is interesting for distributed computing