A Scalable Double In-memory Checkpoint and Restart Scheme Towards Exascale Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab University of Illinois at Urbana-Champaign
Motivation
• As machines grow in size, MTBF decreases
  • Jaguar averaged 2.33 failures per day from 2008 to 2010
  • Applications have to tolerate faults
• Challenges for exascale:
  • Disk-based checkpointing (to a reliable NFS disk) is slow
  • System-level checkpointing can be expensive
  • Scalable checkpoint/restart is a communication-intensive process
  • Job schedulers prevent fault tolerance support in the runtime
Charm++ Workshop 2012
Motivation (cont.)
• Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support
• Previous work: a double in-memory checkpoint/restart scheme
  • In the production version of Charm++ since 2004
Charm++ Workshop 2012
Double in-memory Checkpoint/Restart Protocol
[Diagram: objects A–J distributed over PE0–PE3, each object kept in two in-memory checkpoints (checkpoint 1 on its home PE, checkpoint 2 on a buddy PE); when PE1 crashes (one processor lost), its objects are restored on the surviving PEs from the remaining checkpoint copies. A simplified sketch of the scheme follows.]
Charm++ Workshop 2012
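The key idea in the diagram is that every object's checkpoint exists twice in memory: once on its home PE and once on a buddy PE, so a single failure never destroys both copies. Below is a minimal, illustrative C++ sketch of that idea under simple assumptions; the ring buddy mapping and the ProcessorStore, buddyOf, checkpointObject, and recoverLostObjects names are hypothetical, not Charm++ internals.

```cpp
// Illustrative sketch of double in-memory checkpointing (not Charm++ code).
#include <map>
#include <vector>

using ObjectId   = int;
using Checkpoint = std::vector<char>;   // serialized object state

struct ProcessorStore {
    // Two kinds of copies per PE: checkpoints of objects living here,
    // and checkpoints received from PEs that chose this PE as their buddy.
    std::map<ObjectId, Checkpoint> localCopies;
    std::map<ObjectId, Checkpoint> buddyCopies;
};

// Assumed buddy mapping: the next PE in a ring.
int buddyOf(int pe, int numPEs) { return (pe + 1) % numPEs; }

// At checkpoint time each object serializes its state; the runtime keeps
// one copy locally and sends one copy to the buddy PE.
void checkpointObject(ObjectId id, const Checkpoint &state, int homePE,
                      std::vector<ProcessorStore> &stores, int numPEs) {
    stores[homePE].localCopies[id] = state;                    // local copy
    stores[buddyOf(homePE, numPEs)].buddyCopies[id] = state;   // remote copy
}

// On failure of PE `failed`, the objects that lived there are recovered
// from the buddy's copies and redistributed over the surviving PEs.
std::map<ObjectId, Checkpoint>
recoverLostObjects(int failed, std::vector<ProcessorStore> &stores, int numPEs) {
    return stores[buddyOf(failed, numPEs)].buddyCopies;
}
```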
Runtime Support for FT
• Automatic checkpointing of threads
  • Including stack and heap (isomalloc)
• User helper functions to pack and unpack data (a PUP sketch follows this slide)
  • Checkpointing only the live variables
Charm++ Workshop 2012
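In Charm++ the pack/unpack helpers are written with the PUP framework: the user supplies one pup() routine that serializes the live state at checkpoint time and restores it at restart. A minimal sketch, assuming a hypothetical Particle class and members; only the PUP::er pattern itself is the standard interface.

```cpp
// Minimal sketch of a user pack/unpack helper with the Charm++ PUP framework;
// the Particle class and its members are hypothetical.
#include "pup.h"
#include "pup_stl.h"          // PUP support for STL containers such as std::vector
#include <vector>

class Particle {
public:
    double x, y, z;
    std::vector<double> forces;   // only live state; scratch buffers are omitted

    // Called by the runtime both when packing (checkpoint) and when
    // unpacking (restart); the same code handles both directions.
    void pup(PUP::er &p) {
        p | x;  p | y;  p | z;
        p | forces;
    }
};
```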
Local Disk-Based Protocol
• Double in-memory checkpointing raises a memory concern
  • Pick a checkpointing time where the global state is small
  • Works well for MD, N-body, quantum chemistry
• Double in-disk checkpointing
  • Makes use of the local disk (or SSD)
  • Also does not rely on any reliable storage
  • Useful for applications with a very large memory footprint
Charm++ Workshop 2012
Previous Results: Performance Comparisons with Traditional Disk-based Checkpointing Charm++ Workshop 2012
Previous Results: Restart with Load Balancing LeanMD, Apoa1, 128 processors Charm++ Workshop 2012
Previous Results: Recovery Performance
• 10 crashes
• 128 processors
• Checkpoint every 10 time steps
Charm++ Workshop 2012
FT on MPI-based Charm++
• Practical challenge: the job scheduler kills the entire job when a process fails
• MPI-based Charm++ is portable across major supercomputers
• A fault injection scheme in the MPI machine layer
  • DieNow(): the MPI process stops responding
• Fault detection by keep-alive messages (see the sketch below)
• Spare processors replace failed ones
• Demonstrated on 64K cores of a BG/P machine
Charm++ Workshop 2012
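A minimal sketch of keep-alive based detection at the MPI level, assuming a fixed timeout and a watched buddy rank; the tag, timeout, and buddyAlive() helper are assumptions for illustration, not the actual Charm++ machine-layer code.

```cpp
// Illustrative keep-alive failure detection over MPI (not the Charm++
// machine layer); KEEPALIVE_TAG, TIMEOUT_SEC and buddyAlive() are assumed.
#include <mpi.h>

const int    KEEPALIVE_TAG = 77;
const double TIMEOUT_SEC   = 4.0;   // assumed detection timeout

// Each PE periodically sends a small keep-alive message to its watcher;
// the watcher declares the sender failed if nothing arrives in time,
// so a spare processor can take over its objects.
bool buddyAlive(int buddyRank) {
    int ping = 0;
    MPI_Request req;
    MPI_Irecv(&ping, 1, MPI_INT, buddyRank, KEEPALIVE_TAG,
              MPI_COMM_WORLD, &req);

    double start = MPI_Wtime();
    int arrived = 0;
    while (!arrived && MPI_Wtime() - start < TIMEOUT_SEC)
        MPI_Test(&req, &arrived, MPI_STATUS_IGNORE);

    if (!arrived) {                      // no keep-alive: presume failure
        MPI_Cancel(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        return false;
    }
    return true;
}
```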
Performance at Large Scale Charm++ Workshop 2012
Optimization for Scalability
• Communication bottleneck: checkpoint/restart takes O(P) time
• Optimizations:
  • Collectives (barriers): replace the O(P) barrier with a tree-based barrier
  • Stale message handling: tag messages with an epoch number and add a phase that discards stale messages as quickly as possible (see the sketch below)
  • Small messages: streaming optimization
Charm++ Workshop 2012
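To illustrate the epoch mechanism: every message carries the checkpoint/restart epoch it was sent in, a restart bumps the global epoch, and anything stamped with an older epoch is dropped at the receiver instead of being delivered. The Envelope struct and the onRestart()/acceptMessage() names below are hypothetical, not Charm++ internals.

```cpp
// Illustrative epoch-based stale-message filtering (hypothetical names).
#include <cstdint>

struct Envelope {
    uint32_t epoch;     // checkpoint/restart epoch the message was sent in
    // ... payload ...
};

static uint32_t currentEpoch = 0;

// A restart bumps the epoch, so every message sent before the crash
// becomes identifiably stale.
void onRestart() { ++currentEpoch; }

// Receive path: drop anything stamped with an older epoch instead of
// delivering it to the application.
bool acceptMessage(const Envelope &env) {
    return env.epoch >= currentEpoch;
}
```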
LeanMD Checkpoint Time before/after Optimization Charm++ Workshop 2012
Checkpoint Time for Jacobi/AMPI on Kraken Charm++ Workshop 2012
LeanMD Restart Time Charm++ Workshop 2012
Conclusions and Future Work
• In-memory checkpointing, after optimization, is scalable towards exascale
• A short paper was accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)
• Future work: non-blocking checkpointing
Charm++ Workshop 2012