Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana-Champaign
Motivation • As machines grow in size, MTBF decreases • Applications have to tolerate faults • Applications need fast, low-cost, and scalable fault tolerance support • Fault-tolerant runtime for: • Charm++ • Adaptive MPI
Outline • Disk Checkpoint/Restart • FTC-Charm++ • in-memory checkpoint/restart • Proactive Fault Tolerance • FTL-Charm++ • message logging
Checkpoint/Restart • Simplest scheme for application fault tolerance • A long-running application saves its state to disk periodically at certain points • Coordinated checkpointing strategy (barrier) • State information is saved in a directory of your choosing • The application data is checkpointed by invoking the pup routine of every object • Restore also uses pup, so no additional application code is needed (pup is all you need)
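To make the pup-based checkpointing concrete, here is a minimal sketch of how a chare might implement its pup routine. MyChare, its members, and the omitted .ci declarations are hypothetical examples, not part of the Charm++ API.

    // Hypothetical chare whose state is saved and restored through pup.
    #include "pup_stl.h"   // PUP support for STL containers
    #include <vector>

    class MyChare : public CBase_MyChare {
      int step;                  // iteration counter
      std::vector<double> data;  // application state
    public:
      MyChare() : step(0) {}
      MyChare(CkMigrateMessage *m) {}   // needed when objects are restored or migrated
      void pup(PUP::er &p) {
        CBase_MyChare::pup(p);   // pup the base class first
        p | step;                // the same code both packs (checkpoint)
        p | data;                // and unpacks (restart)
      }
    };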
Checkpointing Job • In Charm++, use: • void CkStartCheckpoint(char* dirname, const CkCallback& cb) • Called on one processor; the callback cb is invoked when the checkpoint is complete • In AMPI, use: • MPI_Checkpoint(<dir>); • Collective call; returns when the checkpoint is complete
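A sketch of how a Charm++ program might trigger the disk checkpoint described above; Main, mainProxy, step, and resumeIteration are hypothetical application names, and "log" is just an example directory.

    // Checkpoint to disk every 100 steps, then continue via the callback.
    void Main::nextStep() {
      if (step % 100 == 0) {
        CkCallback cb(CkIndex_Main::resumeIteration(), mainProxy);
        CkStartCheckpoint("log", cb);   // cb fires once the checkpoint is written
      } else {
        resumeIteration();
      }
    }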
Restart Job from Checkpoint • The charmrun option ++restart <dir> is used to restart • ./charmrun +p4 ./pgm ++restart log • Number of processors need not be the same • Parallel objects are redistributed when needed
FTC-Charm++ In-Memory Checkpoint/Restart
Disk vs. In-Memory Scheme • Drawbacks of disk checkpointing: • Needs user intervention to restart a job • Assumes reliable storage (disk) • Disk I/O is slow • In-memory checkpoint/restart scheme: • Online version of the previous scheme • Low impact on fault-free execution • Provides fast and automatic restart • Does not rely on extra processors • Maintains execution efficiency after restart • Does not rely on any fault-free component • Does not assume stable storage
Overview • Coordinated checkpointing scheme • Simple, with low overhead on fault-free execution • Targets iterative scientific applications • Double checkpointing • Tolerates one failure at a time • In-memory checkpointing • Diskless checkpointing • Efficient for applications with a small memory footprint • When there are no extra processors, the program continues to run on the remaining processors • Load balancing on restart
Checkpoint Protocol • Similar to the previous scheme: coordinated checkpointing strategy • Programmers decide what to checkpoint • void CkStartMemCheckpoint(CkCallback &cb) • Each object packs its data and sends it to two different (buddy) processors
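A hedged sketch of triggering the in-memory checkpoint at an application synchronization point; Main, mainProxy, and afterCheckpoint are hypothetical names.

    // Double in-memory checkpoint: each object's pup data goes to two buddy PEs.
    void Main::atSyncPoint() {
      CkCallback cb(CkIndex_Main::afterCheckpoint(), mainProxy);
      CkStartMemCheckpoint(cb);   // resumes through the callback when the checkpoint is done
    }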
Restart Protocol • Initiated by the failure of a physical processor • Every object rolls back to the state preserved in the most recent checkpoint • Combined with the load balancer to sustain performance after restart
Checkpoint/Restart Protocol • (Diagram) Objects A–J are distributed across PE0–PE3; each object keeps two checkpoints (checkpoint 1 and checkpoint 2) on two different buddy processors. When PE1 crashes (one processor lost), its objects are restored on the surviving processors PE0, PE2, and PE3 from the in-memory checkpoints.
Local Disk-Based Protocol • Double in-memory checkpointing • Memory concern: pick a checkpointing time where the global state is small • Double in-disk checkpointing • Makes use of the local disk • Also does not rely on any reliable storage • Useful for applications with a very large memory footprint
Compiling FTC-Charm++ • Build charm with the "syncft" option • ./build charm++ net-linux syncft -O • Command line switch +ftc_disk for disk/memory checkpointing: • charmrun ./pgm +ftc_disk
Performance Evaluation • IA-32 Linux cluster at NCSA • 512 dual 1 GHz Intel Pentium III processors • 1.5 GB RAM each • Connected by both Myrinet and 100 Mbit Ethernet
Performance Comparisons with Traditional Disk-based Checkpointing
Recovery Performance • Molecular Dynamics Simulation application - LeanMD • Apoa1 benchmark (92K atoms) • 128 processors • Crash simulated by killing processes • No backup processors • With load balancing
Performance improves with Load Balancing • LeanMD, Apoa1, 128 processors
Recovery Performance • 10 crashes • 128 processors • Checkpoint every 10 time steps
LeanMD with Apoa1 benchmark • 90K atoms • 8498 objects
Motivation • Instead of only reacting to a failure after it occurs, the runtime can proactively migrate work off a processor that is about to fail • Modern hardware supports early fault indication • SMART protocol, motherboard temperature sensors, Myrinet interface cards • Possible to create a mechanism for fault prediction
Requirements • Response time should be as low as possible • No new processes should be required • Collective operations should still work • Efficiency loss should be proportional to computing power loss
System • The application is warned of an impending fault via a signal • The processor, memory, and interconnect are expected to continue working correctly for some time after the warning • The runtime ensures that the application continues to run on the remaining processors even if one processor crashes
Solution Design • Migrate Charm++ objects off the warned processor • Point-to-point message delivery should continue to work • Collective operations should cope with the possible loss of multiple processors • Modify the runtime system's reduction tree to remove the warned processor • A minimal number of processors should be affected • The runtime system should remain load balanced after a processor has been evacuated
Proactive FT: Current Status • Support for multiple faults ready; currently testing support for simultaneous faults • Faults simulated via a signal sent to the process • Current version fully integrated into Charm++ and AMPI • Example: sweep3d (MPI code) on NCSA's tungsten • (Plots show original utilization, utilization after the fault, and utilization after load balancing)
How to Use • Part of the default version of Charm++ • No extra compiler flags required • The evacuation code does not execute until a warning arrives • Any detection system can be plugged in • Can send a signal (USR1) to the process on the compute node • Can call a method (CkDecideEvacPe) to evacuate a processor • Works with any Charm++ or AMPI program • AMPI programs must be built with -memory isomalloc
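As an illustration only (not the actual detection code), a fault-prediction hook might evacuate a processor like this; onFaultPredicted is a hypothetical function, while CkDecideEvacPe and the USR1 signal path are the mechanisms named above.

    // Called when an external monitor predicts that this node will fail soon.
    // Alternatively, the monitor can simply send SIGUSR1 to the Charm++ process.
    void onFaultPredicted() {
      CkPrintf("Fault warning on PE %d, evacuating\n", CkMyPe());
      CkDecideEvacPe();   // migrate objects off this PE; the runtime repairs its reduction tree
    }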
FTL-Charm++ Message Logging
Motivation • Checkpointing is not fully automatic • Coordinated checkpointing is expensive • Checkpoint/rollback doesn't scale • All nodes are rolled back just because one crashed • Even nodes independent of the crashed node are restarted
Design • Message logging • Sender-side message logging • Asynchronous checkpoints • Each processor has a buddy processor and stores its checkpoint in the buddy's memory • Checkpoints on its own (no barrier)
Message to Remote Chares • (Protocol diagram) The sender chare P and receiver chare Q exchange <Sender, SN>, <SN, TN>, and <SN, TN, Message> so that the receiver assigns a ticket number (TN) to each sender sequence number (SN) • If <Sender, SN> has been seen earlier, the existing TN is marked as received • Otherwise, a new TN is created and <Sender, SN, TN> is stored
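To illustrate the receiver-side ticket bookkeeping described above (this is a sketch, not the FTL-Charm++ implementation; TicketTable and its members are hypothetical):

    #include <map>
    #include <utility>

    // SN = sender's sequence number, TN = receiver-assigned ticket number.
    struct TicketTable {
      std::map<std::pair<int,int>, int> seen;  // (senderId, SN) -> TN
      int nextTN = 1;

      // Return the TN for a <sender, SN> pair, assigning a new one if unseen.
      int getTicket(int senderId, int SN) {
        auto key = std::make_pair(senderId, SN);
        auto it = seen.find(key);
        if (it != seen.end())
          return it->second;      // duplicate request: reuse the stored TN
        int tn = nextTN++;
        seen[key] = tn;           // remember <sender, SN, TN>
        return tn;
      }
    };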
Status • Most of Charm++ and AMPI has been ported • Support for migration has not yet been implemented in the fault-tolerant protocol • Parallel restart not yet implemented • Not in the Charm main branch
Thank You! Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/ Parallel Programming Lab at University of Illinois