MPICH-GF: Providing Fault Tolerance on Grid Environments

Namyoon Woo, Soonho Choi, Hyungsoo Jung, Jungwhan Moon, Heon Y. Yeom, Taesoon Park and Hyungwoo Park
Seoul Nat’l Univ, Korea. http://dcslab.snu.ac.kr/projects/grid
{nywoo,shchoi,jhs,jhmoon,yeom}@arirang.snu.ac.kr, tspark@sejong.ac.kr, hwpark@hpcnet.ne.kr

Objectives
Implementation of a fault-tolerant MPICH-G2 based on checkpointing

Features
• User transparency
• Direct communication type MPI implementation
• Both blocking and non-blocking operations are supported
• Checkpointing without closing channels

Structure
[Figure: MPICH-GF layered structure. Labels: Collective Operations, P2P Operations, ADI, globus2, ft-globus, ch_shmem, ch_p4, Atomic Message Transfer, MPI_Rejoin(), Channel Info Update Module, Message Logging / Replay Modules, libckpt]

Previous Works
[Table: previous fault-tolerant MPI work; the original layout is not recoverable. Labels in extracted order: Indirect Communication Method, Direct Communication Method, CoCheck, Framework, Starfish, MPI/FT, Batchu, Egida, FT-MPI, Fagg, MPI API, RENEW, Hector, Virtual Device, Comm. Lib., MPICH-GF, MPI-FT, Loucag, MPICH-V, Log-based, CC, Log-based, CIC, CC]
Coordinated Checkpointing
[Figure: coordinated checkpointing flow between processes. Labels: checkpoint timing start; Signal: ckpt request; job to ready state for checkpoint; Barrier(); Do checkpointing; Checkpointing Complete; Collect n complete messages; Confirm]

Recovery Protocol
[Figure: recovery flow after a process failure. Labels: Fail; monitoring SIGCHLD; waitpid(); job state check; information callback (ABNORMAL END); request for job cancel; Job Resubmit; Channel Reconstruction; MPI_Rejoin()]
• ABNORMAL END
• job cancel
• globus_module_deactivate
• subjob_label free
• subjob_state free
• job contact free
• change RSL
* changing RSL
  - replace executable file name
  - set a base ckpt file name
  - set a machine name for migration

Recovery
1. Failure event report
2. Decide Rollback Point
3. Rollback Request
4. Job Resubmit
5. Job Recreate
6. MPID_Rejoin()
7. Channel Update
[Figure labels: failed task; task; New task; Mesg. Send; 1. Channel Invalidation]

Atomicity of Message Transfer
• MPICH-GF does not allow checkpointing while the send queue is non-empty.
• The message transfer module and the checkpointing module are mutually exclusive.

Deadlock in Coordinated Checkpointing
[Figure]
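
The atomicity rule stated under "Atomicity of Message Transfer" can be illustrated with a small sketch. It is not MPICH-GF code: the `Channel` class, its method names and the use of a single lock are assumptions made for illustration only. It shows one way to make message transfer and checkpointing mutually exclusive and to refuse a checkpoint while the send queue still holds messages.

```python
import threading
from collections import deque


class Channel:
    """Hypothetical stand-in for a channel with a send queue (not MPICH-GF code)."""

    def __init__(self):
        # One lock shared by both modules, so they never run at the same time.
        self._lock = threading.Lock()
        self._send_queue = deque()
        self.delivered = []
        self.checkpoints = []

    def enqueue(self, msg):
        """Queue a message for later transfer."""
        with self._lock:
            self._send_queue.append(msg)

    def transfer_one(self):
        """Message transfer module: move one queued message, atomically."""
        with self._lock:
            if not self._send_queue:
                return False
            self.delivered.append(self._send_queue.popleft())
            return True

    def try_checkpoint(self):
        """Checkpointing module: refuse while the send queue is non-empty."""
        with self._lock:
            if self._send_queue:
                return False
            # Record a snapshot of the state seen at checkpoint time.
            self.checkpoints.append(list(self.delivered))
            return True


ch = Channel()
ch.enqueue("m1")
first_attempt = ch.try_checkpoint()   # queue not empty, checkpoint refused
ch.transfer_one()                     # drain the queue
second_attempt = ch.try_checkpoint()  # queue empty, checkpoint taken
```

In this sketch a refused checkpoint is simply retried later by the caller; how MPICH-GF itself defers the checkpoint is not stated on the poster.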
What Is Checkpointing?
• An operation that stores the states of processes in stable storage for the purpose of recovery or migration

Consistent Recovery
• The states of processes may be causally dependent on one another through message passing.
• Hence, the consistency of states must be considered during recovery.
• A consistent global checkpoint contains no orphan messages.

[Figure: Coordinated Checkpointing and Message Logging. Processes P1, P2 and P3 with checkpoints C(1,1), C(1,2), C(2,1) and C(3,1), messages m1, m2, m3 and m4, a failure, coordination, and rollback-recovery. Legend: Pi: Process; (symbol lost in extraction): Checkpoint. Caption: Checkpointing / Logging]

System Configuration
Linux Kernel v2.4
Globus 2.2
MPICH v1.2.3

Performance
[Figures: Total Execution Time; Checkpointing Overhead; Recovery Cost]
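
The consistency condition under "Consistent Recovery" (a consistent global checkpoint contains no orphan messages) can be checked mechanically. The sketch below is illustrative only: the `Message` record, the use of per-process local event indices, and the function names are assumptions, not part of MPICH-GF. A message is treated as an orphan when its receipt is recorded in the receiver's checkpoint but its send is not recorded in the sender's checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: int  # sender's local event index at the send
    recv_time: int  # receiver's local event index at the receive


def orphan_messages(checkpoints, messages):
    """Return messages whose receive is inside the receiver's checkpoint
    while the send is outside the sender's checkpoint.

    checkpoints maps a process name to the local event index at which
    its checkpoint was taken (events up to that index are saved).
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A global checkpoint is consistent when it has no orphan messages."""
    return not orphan_messages(checkpoints, messages)


# Hypothetical run: P1 sends m at its event 3, P2 receives it at its event 4.
m = Message(sender="P1", receiver="P2", send_time=3, recv_time=4)

late_sender_ckpt = {"P1": 3, "P2": 5}   # send is saved, receive is saved
early_sender_ckpt = {"P1": 2, "P2": 5}  # receive is saved, send is not

consistent_case = is_consistent(late_sender_ckpt, [m])
orphan_case = orphan_messages(early_sender_ckpt, [m])
```

With coordinated checkpointing the processes agree on when to checkpoint, so the second situation should not arise; the check is mainly useful for reasoning about uncoordinated or logged executions.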