Hierarchical Coordinated Checkpointing Protocol
Himadri Sekhar Paul, Arobinda Gupta, R. Badrinath
Dept. of Computer Sc. & Engg., Indian Institute of Technology, Kharagpur, INDIA 721302
<hpaul,agupta,badri>@cse.iitkgp.ernet.in
Motivation
• Long-running applications execute on distributed systems.
• Metacomputers run over a WAN.
• Such systems are prone to failure, so fault tolerance is important.
• Checkpoint-and-recovery techniques provide this fault tolerance.
Motivation
• Coordinated checkpointing is a popular scheme.
• A coordinated checkpointing protocol is bottlenecked by the slowest link in the network.
• The Hierarchical Coordinated Checkpointing Protocol accommodates heterogeneous link speeds, as found in a WAN.
System Model
• Nodes are fail-safe.
• Network is immune to partitioning.
• Links are unreliable.
• All computing nodes are reachable from the others.
• Network is hierarchically connected:
  • Clusters of computing nodes are realized by high-speed networks.
  • Clusters are interconnected by lower-speed networks.
System Model
[Figure: clusters of computation nodes.]
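The two-level system model can be captured in a few lines of code. The following is a minimal sketch only: the class name `TwoLevelNetwork`, the node and cluster labels, and the `link_class` helper are illustrative assumptions and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class TwoLevelNetwork:
    """Hypothetical model of the system: maps node name -> cluster name."""
    cluster_of: dict

    def link_class(self, a: str, b: str) -> str:
        """Return 'intra' if both nodes share a cluster (fast link),
        otherwise 'inter' (slower link between clusters)."""
        return "intra" if self.cluster_of[a] == self.cluster_of[b] else "inter"

    def members(self, cluster: str) -> list:
        """Return the sorted node names belonging to one cluster."""
        return sorted(n for n, c in self.cluster_of.items() if c == cluster)


# Example with two clusters of two nodes each (labels are made up).
net = TwoLevelNetwork({"a1": "A", "a2": "A", "b1": "B", "b2": "B"})
print(net.link_class("a1", "a2"))  # intra
print(net.link_class("a1", "b1"))  # inter
```

Classifying each pair of nodes as intra-cluster or inter-cluster is the only topology information the later slides rely on.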
Flat Coordinated Checkpointing Protocol (2-phase commit)
[Figure: message diagram between a Coordinator and two Followers. Messages: Ckpt Rqst, Ack Ckpt Rqst, Ckpt Estb, Ack Ckpt Estb. Legend: message, checkpoint, process blocked.]
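The flat protocol diagram can be read as a two-phase exchange, and a failure-free round can be sketched as a small sequential simulation. This is a sketch under stated assumptions, not the authors' implementation: the `Follower` class, the `flat_checkpoint` function, and the exact points at which a follower blocks and unblocks are guesses based on the message names in the diagram. The coordinator's own checkpoint is omitted.

```python
class Follower:
    """Hypothetical follower process in the flat protocol."""

    def __init__(self, name):
        self.name = name
        self.state = 0          # stand-in for application state
        self.blocked = False
        self.tentative = None   # checkpoint taken but not yet established
        self.permanent = None   # last established checkpoint

    def on_ckpt_rqst(self, ckpt_id):
        # Assumed behaviour: block and take a tentative checkpoint.
        self.blocked = True
        self.tentative = (ckpt_id, self.state)
        return ("Ack Ckpt Rqst", self.name)

    def on_ckpt_estb(self, ckpt_id):
        # Assumed behaviour: make the checkpoint permanent and resume.
        assert self.tentative is not None and self.tentative[0] == ckpt_id
        self.permanent, self.tentative = self.tentative, None
        self.blocked = False
        return ("Ack Ckpt Estb", self.name)


def flat_checkpoint(followers, ckpt_id):
    """Run one failure-free round and return the message log."""
    log = []
    # Phase 1: request a checkpoint from every follower, collect all acks.
    log += [("Ckpt Rqst", f.name) for f in followers]
    acks = [f.on_ckpt_rqst(ckpt_id) for f in followers]
    assert len(acks) == len(followers)  # coordinator waits for every ack
    log += acks
    # Phase 2: establish the checkpoint everywhere, collect all acks.
    log += [("Ckpt Estb", f.name) for f in followers]
    log += [f.on_ckpt_estb(ckpt_id) for f in followers]
    return log


followers = [Follower("p1"), Follower("p2")]
for msg in flat_checkpoint(followers, ckpt_id=1):
    print(msg)
```

Because the coordinator waits for every acknowledgement before starting the second phase, the round cannot finish faster than the slowest follower link allows, which is the bottleneck named in the Motivation slides.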
Hierarchical Coordinated Checkpointing Protocol
[Figure: message diagram between an Initiator, a Leader, and two Followers. Messages: Ckpt_rqst, Ckpt_estb, Ckpt_commit, with acknowledgements AckCkpt_rqst, AckCkpt_estb, AckCkpt_commit. Legend: message, checkpoint, blocked, blocking at extra-cluster message.]
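One way to see why a hierarchy helps is to count the control messages that must cross the slow inter-cluster links. The sketch below rests on an assumption the slides do not spell out: the initiator exchanges messages only with one leader per remote cluster, and each leader relays within its own cluster over fast links. The phase counts (2 for flat, 3 for hierarchical) are read off the message names in the two diagrams, and the function names and cluster layout are hypothetical.

```python
def flat_inter_cluster_msgs(clusters, coordinator_cluster, phases=2):
    """Control messages crossing cluster boundaries when a single
    coordinator contacts every remote process directly.
    Each phase costs one message plus one ack per remote process."""
    remote = sum(len(members) for name, members in clusters.items()
                 if name != coordinator_cluster)
    return 2 * phases * remote


def hier_inter_cluster_msgs(clusters, initiator_cluster, phases=3):
    """Same count when the initiator contacts only one leader per
    remote cluster (assumed reading of the hierarchical diagram).
    Each phase costs one message plus one ack per remote leader."""
    remote_leaders = sum(1 for name in clusters if name != initiator_cluster)
    return 2 * phases * remote_leaders


# Hypothetical layout: three clusters of four processes each.
clusters = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
}
print(flat_inter_cluster_msgs(clusters, "A"))  # 32
print(hier_inter_cluster_msgs(clusters, "A"))  # 12
```

Under these assumptions the hierarchical scheme uses an extra phase but sends fewer messages over the slow links, since that traffic grows with the number of clusters rather than the number of processes.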
Simulation Result
• Simulation setup:
  • Two-level network, with an intra-cluster link speed of 10 Mbps and an inter-cluster link speed of 1 Mbps.
  • The communication pattern of the application is random.
  • The fraction of extra-cluster application messages is varied.
(Flat = Flat Coordinated Checkpointing Protocol)
(Hier = Hierarchical Coordinated Checkpointing Protocol)
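The effect of the two link speeds in this setup can be illustrated with simple transmission-time arithmetic. The link speeds are taken from the setup; the 1000-byte message size is a hypothetical value chosen for illustration, and propagation and queueing delays are ignored.

```python
INTRA_CLUSTER_BPS = 10_000_000  # 10 Mbps, from the simulation setup
INTER_CLUSTER_BPS = 1_000_000   # 1 Mbps, from the simulation setup


def transmission_time_ms(size_bytes, link_bps):
    """Time to push size_bytes onto a link of link_bps, in milliseconds.
    Ignores propagation and queueing delay."""
    return size_bytes * 8 * 1000 / link_bps


MSG_BYTES = 1000  # hypothetical control-message size

intra = transmission_time_ms(MSG_BYTES, INTRA_CLUSTER_BPS)
inter = transmission_time_ms(MSG_BYTES, INTER_CLUSTER_BPS)
print(f"intra-cluster: {intra:.1f} ms")  # intra-cluster: 0.8 ms
print(f"inter-cluster: {inter:.1f} ms")  # inter-cluster: 8.0 ms
```

Every message sent between clusters takes ten times as long to transmit as the same message sent within a cluster, so any step that waits on an inter-cluster exchange dominates the checkpoint latency.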
Conclusion & Future Work
• In a two-level hierarchical network, the hierarchical checkpointing protocol incurs less latency than the flat checkpointing protocol, even for very high communication intensity.
• The protocol can be extended to a generic hierarchical network.