A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

Sébastien Monnet, Christine Morin, Ramamurthy Badrinath • PARIS Research group / IRISA Rennes • IPDPS Workshop: Fault-Tolerant Parallel, Distributed and Network-Centric Systems • Santa Fe, Friday, April 30th

Outline • The context: fault tolerance through checkpoint / restart • The problem: going large scale • Checkpoint / restart principles • Our contribution: a hierarchical protocol

Context (1): Target applications • [Figure: application modules labeled Simulation (three instances), Processing and Display]

Context (2): Cluster federation • Inside clusters: high-performance networks (SAN), efficient synchronization • Between clusters: LAN or WAN (high delays and low bandwidth) • Large number of nodes • Short MTBF • Heterogeneous architecture

Fault tolerance • Fault: fail-stop (crash) • Two approaches • Replicated computation: requires extra nodes • Checkpoint / restart protocol: requires extra memory • Regular low-cost PC clusters have large memory, hence checkpoint / restart

Basic principles • For a single process • Saving the process state (checkpoint) • In case of fault -> the process is restarted from its last stored checkpoint • For a parallel application • Communications -> dependencies • [Figure: timeline of a process P with checkpoints and a fault]

Dependence due to a message • Lamport's happened-before relation • For a single process: events are totally ordered • Emission(M) -> reception(M) • Transitivity • [Figure: processes P0 and P1 over time, states S0 and S1, message m]

Recovery line: validity • Restart: finding a recovery line • [Figure: processes P0 and P1 over time, checkpoints A, B, C, D and a message m]

Recovery line: in-transit messages • Restart: finding a recovery line • In-transit message (logging) • [Figure: processes P0 and P1 over time, checkpoints A, B, C, D and a message m]

Recovery line: ghost messages • Restart: finding a recovery line • Ghost message • [Figure: processes P0 and P1 over time, checkpoints A, B, C, D and a message m]
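The three recovery-line slides above can be summarized by a small check. The sketch below is only an illustration, assuming each checkpoint and each send/receive event carries a comparable timestamp; the names `Message`, `classify` and `is_valid` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: float
    receiver: str
    recv_time: float


def classify(msg, line):
    """Classify a message against a candidate recovery line.

    `line` maps each process name to the time of the checkpoint it
    would restart from (an assumed representation).
    """
    sent_before = msg.send_time <= line[msg.sender]
    received_before = msg.recv_time <= line[msg.receiver]
    if sent_before and not received_before:
        # Sent in the saved state of the sender, not yet received in the
        # saved state of the receiver: it must be logged and replayed.
        return "in-transit"
    if received_before and not sent_before:
        # Received in the saved state, but the send is undone by the
        # rollback: the recovery line is inconsistent.
        return "ghost"
    return "ok"


def is_valid(line, messages):
    """A recovery line is valid when it contains no ghost message."""
    return all(classify(m, line) != "ghost" for m in messages)


m = Message("P0", 5, "P1", 7)
print(classify(m, {"P0": 6, "P1": 6}))
print(classify(m, {"P0": 4, "P1": 8}))
```

In-transit messages do not invalidate the line, which is why the slide pairs them with logging, whereas a single ghost message does.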

Coordinated checkpointing • Recovery line valid by construction at checkpointing time • Simple 2-phase commit protocol • Relatively easy to implement • Drawback: synchronization, which does not scale
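To make the 2-phase commit idea concrete, here is a minimal in-memory sketch. It is an assumption-laden illustration, not the protocol from the talk: the `Node` class, its `healthy` flag and the function `coordinated_checkpoint` are hypothetical, and a real implementation would also hold back application messages between the two phases.

```python
class Node:
    def __init__(self, name, state, healthy=True):
        self.name = name
        self.state = state
        self.healthy = healthy      # hypothetical stand-in for "can reply"
        self.tentative = None       # checkpoint taken in phase 1
        self.committed = None       # last checkpoint usable for restart

    def prepare(self):
        """Phase 1: take a tentative checkpoint and acknowledge."""
        if not self.healthy:
            return False
        self.tentative = dict(self.state)
        return True

    def commit(self):
        """Phase 2 (success): the tentative checkpoint becomes permanent."""
        self.committed = self.tentative
        self.tentative = None

    def abort(self):
        """Phase 2 (failure): discard the tentative checkpoint."""
        self.tentative = None


def coordinated_checkpoint(nodes):
    # Phase 1: every node is asked, even if an earlier one refused.
    acks = [node.prepare() for node in nodes]
    # Phase 2: commit everywhere or nowhere.
    if all(acks):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False


nodes = [Node("n0", {"x": 1}), Node("n1", {"x": 2})]
print(coordinated_checkpoint(nodes))
```

Because every node waits for the coordinator in both phases, the cost grows with the number of nodes and the network latency, which is the scaling drawback noted on the slide.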

Independent checkpointing • Reconstruct a valid recovery line at rollback time • Local states stored independently • Well suited to large scale • Drawbacks • Need to store multiple checkpoints (garbage collection) • Need to maintain up-to-date antecedence graphs • Domino effect: long rollback needed in practice

Domino effect • [Figure, animated over eight slides: processes P0, P1 and P2 over time, with checkpoints A to F and messages M1 to M8; at each step a rollback removes further checkpoints and messages, until none remain in the last frame]
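The cascade shown in the animation can be reproduced with a short fixed-point computation. This is a sketch under assumptions: timestamps are global, each process has an initial checkpoint at time 0, and the two-process scenario below is invented for illustration (it is not the P0/P1/P2 scenario of the figure). The names `latest_before` and `recovery_line` are hypothetical.

```python
import bisect
import math


def latest_before(times, t):
    """Latest checkpoint time strictly before t (times sorted, starts at 0)."""
    return times[bisect.bisect_left(times, t) - 1]


def recovery_line(checkpoints, messages, failed, fail_time):
    """Propagate rollbacks until no ghost message remains.

    checkpoints: process -> sorted list of checkpoint times
    messages: list of (sender, send_time, receiver, recv_time)
    Returns process -> restart time (math.inf means "no rollback").
    """
    line = {p: math.inf for p in checkpoints}
    line[failed] = latest_before(checkpoints[failed], fail_time)
    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            # Ghost message: its send is undone but its receipt is kept.
            if send_t > line[sender] and recv_t <= line[receiver]:
                line[receiver] = latest_before(checkpoints[receiver], recv_t)
                changed = True
    return line


checkpoints = {"P0": [0, 4, 8], "P1": [0, 2, 6]}
messages = [
    ("P0", 1, "P1", 1.5),
    ("P1", 3, "P0", 3.5),
    ("P0", 5, "P1", 5.5),
    ("P1", 7, "P0", 7.5),
    ("P0", 9, "P1", 9.5),
]
print(recovery_line(checkpoints, messages, failed="P0", fail_time=10))
```

In this invented scenario every checkpoint is crossed by a message in the wrong direction, so a single fault at time 10 sends both processes back to time 0.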

Addressing the domino effect (1) • Idea: log communications • Drawback: relies on an assumption about determinism (the piecewise deterministic assumption)

Addressing the domino effect (2) • Idea: communication-induced checkpointing • Independent checkpointing • Still need to store multiple checkpoints (garbage collection) • Force checkpoints at communication time • Additional information is piggy-backed • If the communication generates a new dependence => forced checkpoint • Updates the current recovery line

A hierarchical protocol for a hierarchical architecture • The protocol needs to reflect the architecture • Relaxed inter-cluster synchronism • Principle • Intra-cluster: coordinated checkpointing • Inter-cluster: communication-induced checkpointing

Limit the number of forced checkpoints • It is not necessary to save a checkpoint at each receive • Force a checkpoint only if the sender has saved a checkpoint since its last send • Sequence number • Direct Dependencies Vector (DDV) • [Figure: clusters c1 and c2 over time, states s1, s2, s3 and messages m1, m2]
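The forced-checkpoint rule can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each cluster is assumed to start with sequence number 1 after an initial checkpoint, every inter-cluster message is assumed to piggyback the sender's index and sequence number, and the DDV entry for the sender is assumed to be updated just before the forced checkpoint is stored. The class `Cluster` and its methods are hypothetical names.

```python
class Cluster:
    def __init__(self, index, federation_size):
        self.index = index
        self.ddv = [0] * federation_size
        self.ddv[index] = 1                  # own sequence number (assumed start)
        self.checkpoints = [list(self.ddv)]  # DDV stored with each checkpoint

    @property
    def sn(self):
        """Own sequence number, kept in the cluster's own DDV entry."""
        return self.ddv[self.index]

    def checkpoint(self):
        """Take a cluster-wide checkpoint (unforced or forced)."""
        self.ddv[self.index] += 1
        self.checkpoints.append(list(self.ddv))

    def send(self):
        """Piggybacked data carried by an inter-cluster message."""
        return (self.index, self.sn)

    def receive(self, piggyback):
        """Return True if the message forced a checkpoint."""
        sender, sender_sn = piggyback
        # New dependency only if the sender checkpointed since the last
        # message this cluster took into account from it.
        forced = sender_sn > self.ddv[sender]
        if forced:
            self.ddv[sender] = sender_sn
            self.checkpoint()   # taken before the message is delivered
        return forced


c1, c2 = Cluster(0, 3), Cluster(1, 3)
print(c2.receive(c1.send()), c2.ddv)
print(c2.receive(c1.send()), c2.ddv)
```

A second message from the same sender interval carries the same sequence number, so it creates no new dependency and forces nothing; this is how the rule avoids a checkpoint at every receive.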

Rollback algorithm • A node crashes • Its cluster rolls back • Its cluster sends a Rollback Alert to all clusters in the federation • When receiving a Rollback Alert: check the need to roll back • If a rollback is needed, send a Rollback Alert with the new sequence number • Else send a "do not need to roll back" message • Loop • Wait for n messages • If one of the n messages is a Rollback Alert • Check the need to roll back (using all the piggybacked sequence numbers) • If a rollback is needed, send a Rollback Alert with the new sequence number • Else send a "do not need to roll back" message • Else leave the loop (break)
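The "check the need to roll back" step could look like the sketch below. The slide does not spell out the test, so the rule used here is an assumption: a cluster rolls back when its current DDV entry for the alerting cluster is at least the alerted sequence number, and it restarts from its oldest checkpoint whose stored DDV shows that dependency. The function name `on_rollback_alert` and the message tags are hypothetical.

```python
def on_rollback_alert(self_index, checkpoints, current_ddv, alert_from, alert_sn):
    """Decide how one cluster answers a Rollback Alert.

    checkpoints: DDVs stored with the local checkpoints, oldest first
    current_ddv: DDV of the current state
    alert_from, alert_sn: alerting cluster and the sequence number it
        rolled back to
    Returns the reply to broadcast as (kind, sequence_number).
    """
    # Assumed rule: no dependency on the undone states, nothing to do.
    if current_ddv[alert_from] < alert_sn:
        return ("no_rollback", None)
    # Oldest checkpoint that already records the dependency; if forced
    # checkpoints are taken before delivery, its saved state precedes
    # the receipt of the undone messages. Raises StopIteration if no
    # checkpoint records the dependency.
    target = next(d for d in checkpoints if d[alert_from] >= alert_sn)
    return ("rollback_alert", target[self_index])


# Third cluster (index 2) depends on sequence number 3 of cluster 1.
reply = on_rollback_alert(
    self_index=2,
    checkpoints=[[0, 0, 1], [0, 0, 2], [0, 3, 3]],
    current_ddv=[0, 3, 3],
    alert_from=1,
    alert_sn=3,
)
print(reply)
```

The surrounding loop on the slide would call such a check for every alert received in a round and stop once a round of n messages contains no Rollback Alert.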

Example • [Figure, animated over eleven slides: clusters c1, c2 and c3 over time, exchanging messages m1 to m4; initial DDVs <1,0,0>, <0,1,0> and <0,0,1>; later checkpoints carry DDVs including <1,2,0>, <0,0,2>, <1,3,0>, <0,3,3>, <2,0,0> and <3,0,3>; a Rollback Alert labeled "Alert (3)" appears in the final steps] • Legend: unforced checkpoint with DDV <x,y,z>; forced checkpoint, taken before the message is taken into account

Simulator • Discrete event simulator • Configurable: topology, application, timers • Based on the C++SIM library, University of Newcastle upon Tyne (http://cxxsim.ncl.ac.uk): threads, scheduler, random flows

Experiments: configuration • Application running for 10 hours • Cluster 0: 100 nodes (Myrinet) • Cluster 1: 100 nodes (Myrinet) • Ethernet 100 between clusters • [Figure labels: ~3 000 intra-cluster messages (x2); one message every ~4 minutes; one control message every 50 minutes]

Experiments: number of forced checkpoints • Impact of Cluster 0 unforced checkpoints on Cluster 1 • No unforced checkpoints in Cluster 1

Experiments: number of forced checkpoints • Increasing number of unforced checkpoints in Cluster 1

Experiments: communication patterns • Increasing the number of messages from Cluster 1 to Cluster 0 • Unforced checkpoints initiated every 30 minutes in each cluster

Conclusion • Quasi-synchronous, hierarchical, hybrid protocol • Works well if • Few inter-cluster communications • Quasi-unidirectional inter-cluster communications • Improvements • Support for more communication patterns • Simultaneous faults in different clusters • Dynamic architecture modification • Implementation

Optimization • The sender doesn't need to roll back if messages are logged • Optimistic logging on the sender • Which messages to replay? • Inter-cluster messages are acknowledged with the sequence number • [Figure: processes p1 and p2 over time, checkpoints A and B, message m labeled (B)]
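One way to answer "which messages to replay?" is sketched below. It assumes that each acknowledgement carries the receiver's sequence number at delivery time, and that a message must be replayed when that number is at least the sequence number the receiver rolled back to; treating unacknowledged messages as candidates for replay is a further assumption. The class `SenderLog` and its methods are hypothetical names.

```python
class SenderLog:
    """Volatile (optimistic) log of inter-cluster messages on the sender."""

    def __init__(self):
        self.entries = []  # [payload, receiver sequence number or None]

    def log_send(self, payload):
        """Record a message when it is sent; return its log identifier."""
        self.entries.append([payload, None])
        return len(self.entries) - 1

    def ack(self, msg_id, receiver_sn):
        """Record the sequence number piggybacked on the acknowledgement."""
        self.entries[msg_id][1] = receiver_sn

    def to_replay(self, rollback_sn):
        """Messages to resend after the receiver rolls back to rollback_sn.

        A message delivered while the receiver was at sequence number
        rollback_sn or later was delivered after the restored checkpoint.
        """
        return [
            payload
            for payload, sn in self.entries
            if sn is None or sn >= rollback_sn
        ]


log = SenderLog()
a, b, c = log.log_send("a"), log.log_send("b"), log.log_send("c")
log.ack(a, 1)
log.ack(b, 2)
print(log.to_replay(2))
```

With such a log the sender keeps its current state and only resends, so a rollback in the receiving cluster does not have to propagate to the sending cluster.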
