Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana-Champaign
Motivation • As machines grow in size, MTBF decreases • Applications have to tolerate faults • Applications need fast, low-cost, and scalable fault tolerance support • Fault-tolerant runtime for: • Charm++ • Adaptive MPI
Outline • Disk Checkpoint/Restart • FTC-Charm++ • in-memory checkpoint/restart • Proactive Fault Tolerance • FTL-Charm++ • message logging
Checkpoint/Restart • Simplest scheme for application fault tolerance • A long-running application saves its state to disk periodically at certain points • Coordinated checkpointing strategy (barrier) • State information is saved in a directory of your choosing • The application data is checkpointed by invoking the pup routine of every object • Restore also uses pup, so no additional application code is needed (pup is all you need)
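To make the pup-based checkpointing concrete, here is a minimal sketch of how a chare might implement its pup routine. MyChare, its members, and the omitted .ci declarations are hypothetical examples, not part of the Charm++ API.

    // Hypothetical chare whose state is saved and restored through pup.
    #include "pup_stl.h"   // PUP support for STL containers
    #include <vector>

    class MyChare : public CBase_MyChare {
      int step;                  // iteration counter
      std::vector<double> data;  // application state
    public:
      MyChare() : step(0) {}
      MyChare(CkMigrateMessage *m) {}   // needed when objects are restored or migrated
      void pup(PUP::er &p) {
        CBase_MyChare::pup(p);   // pup the base class first
        p | step;                // the same code both packs (checkpoint)
        p | data;                // and unpacks (restart)
      }
    };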
Checkpointing Job • In Charm++, use: • void CkStartCheckpoint(char* dirname, const CkCallback& cb) • Called on one processor; the callback cb is invoked when the checkpoint is complete • In AMPI, use: • MPI_Checkpoint(<dir>); • Collective call; returns when the checkpoint is complete
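A sketch of how a Charm++ program might trigger the disk checkpoint described above; Main, mainProxy, step, and resumeIteration are hypothetical application names, and "log" is just an example directory.

    // Checkpoint to disk every 100 steps, then continue via the callback.
    void Main::nextStep() {
      if (step % 100 == 0) {
        CkCallback cb(CkIndex_Main::resumeIteration(), mainProxy);
        CkStartCheckpoint("log", cb);   // cb fires once the checkpoint is written
      } else {
        resumeIteration();
      }
    }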
Restart Job from Checkpoint • The charmrun option ++restart <dir> is used to restart • ./charmrun +p4 ./pgm ++restart log • Number of processors need not be the same • Parallel objects are redistributed when needed
FTC-Charm++ In-Memory Checkpoint/Restart
Disk vs. In-Memory Scheme • Drawbacks of disk checkpointing: • Needs user intervention to restart a job • Assumes reliable storage (disk) • Disk I/O is slow • In-memory checkpoint/restart scheme: • Online version of the previous scheme • Low impact on fault-free execution • Provides fast and automatic restart • Does not rely on extra processors • Maintains execution efficiency after restart • Does not rely on any fault-free component • Does not assume stable storage
Overview • Coordinated checkpointing scheme • Simple, with low overhead on fault-free execution • Targets iterative scientific applications • Double checkpointing • Tolerates one failure at a time • In-memory checkpointing • Diskless checkpointing • Efficient for applications with a small memory footprint • When there are no extra processors, the program continues to run on the remaining processors • Load balancing on restart
Checkpoint Protocol • Similar to the previous scheme: coordinated checkpointing strategy • Programmers decide what to checkpoint • void CkStartMemCheckpoint(CkCallback &cb) • Each object packs its data and sends it to two different (buddy) processors
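A hedged sketch of triggering the in-memory checkpoint at an application synchronization point; Main, mainProxy, and afterCheckpoint are hypothetical names.

    // Double in-memory checkpoint: each object's pup data goes to two buddy PEs.
    void Main::atSyncPoint() {
      CkCallback cb(CkIndex_Main::afterCheckpoint(), mainProxy);
      CkStartMemCheckpoint(cb);   // resumes through the callback when the checkpoint is done
    }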
Restart Protocol • Initiated by the failure of a physical processor • Every object rolls back to the state preserved in the most recent checkpoint • Combined with the load balancer to sustain performance after restart
Checkpoint/Restart Protocol • (Diagram) Objects A–J are distributed across PE0–PE3; each object keeps two checkpoints (checkpoint 1 and checkpoint 2) on two different buddy processors. When PE1 crashes (one processor lost), its objects are restored on the surviving processors PE0, PE2, and PE3 from the in-memory checkpoints.
Local Disk-Based Protocol • Double in-memory checkpointing • Memory concern: pick a checkpointing time where the global state is small • Double in-disk checkpointing • Makes use of the local disk • Also does not rely on any reliable storage • Useful for applications with a very large memory footprint
Compiling FTC-Charm++ • Build charm with the "syncft" option • ./build charm++ net-linux syncft -O • Command line switch +ftc_disk for disk/memory checkpointing: • charmrun ./pgm +ftc_disk
Performance Evaluation • IA-32 Linux cluster at NCSA • 512 dual 1 GHz Intel Pentium III processors • 1.5 GB RAM each • Connected by both Myrinet and 100 Mbit Ethernet
Performance Comparisons with Traditional Disk-based Checkpointing
Recovery Performance • Molecular Dynamics Simulation application - LeanMD • Apoa1 benchmark (92K atoms) • 128 processors • Crash simulated by killing processes • No backup processors • With load balancing
Performance improves with Load Balancing • LeanMD, Apoa1, 128 processors
Recovery Performance • 10 crashes • 128 processors • Checkpoint every 10 time steps
LeanMD with Apoa1 benchmark • 90K atoms • 8498 objects
Motivation • Instead of only reacting to a failure after it occurs, the runtime can proactively migrate work off a processor that is about to fail • Modern hardware supports early fault indication • SMART protocol, motherboard temperature sensors, Myrinet interface cards • Possible to create a mechanism for fault prediction
Requirements • Response time should be as low as possible • No new processes should be required • Collective operations should still work • Efficiency loss should be proportional to computing power loss
System • The application is warned of an impending fault via a signal • The processor, memory, and interconnect are expected to continue working correctly for some time after the warning • The runtime ensures that the application continues to run on the remaining processors even if one processor crashes
Solution Design • Migrate Charm++ objects off the warned processor • Point-to-point message delivery should continue to work • Collective operations should cope with the possible loss of multiple processors • Modify the runtime system's reduction tree to remove the warned processor • A minimal number of processors should be affected • The runtime system should remain load balanced after a processor has been evacuated
Proactive FT: Current Status • Support for multiple faults ready; currently testing support for simultaneous faults • Faults simulated via a signal sent to the process • Current version fully integrated into Charm++ and AMPI • Example: sweep3d (MPI code) on NCSA's tungsten • (Plots show original utilization, utilization after the fault, and utilization after load balancing)
How to Use • Part of the default version of Charm++ • No extra compiler flags required • The evacuation code does not execute until a warning arrives • Any detection system can be plugged in • Can send a signal (USR1) to the process on the compute node • Can call a method (CkDecideEvacPe) to evacuate a processor • Works with any Charm++ or AMPI program • AMPI programs must be built with -memory isomalloc
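As an illustration only (not the actual detection code), a fault-prediction hook might evacuate a processor like this; onFaultPredicted is a hypothetical function, while CkDecideEvacPe and the USR1 signal path are the mechanisms named above.

    // Called when an external monitor predicts that this node will fail soon.
    // Alternatively, the monitor can simply send SIGUSR1 to the Charm++ process.
    void onFaultPredicted() {
      CkPrintf("Fault warning on PE %d, evacuating\n", CkMyPe());
      CkDecideEvacPe();   // migrate objects off this PE; the runtime repairs its reduction tree
    }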
FTL-Charm++ Message Logging
Motivation • Checkpointing is not fully automatic • Coordinated checkpointing is expensive • Checkpoint/rollback doesn't scale • All nodes are rolled back just because one crashed • Even nodes independent of the crashed node are restarted
Design • Message logging • Sender-side message logging • Asynchronous checkpoints • Each processor has a buddy processor and stores its checkpoint in the buddy's memory • Checkpoints on its own (no barrier)
Message to Remote Chares • (Protocol diagram) The sender chare P and receiver chare Q exchange <Sender, SN>, <SN, TN>, and <SN, TN, Message> so that the receiver assigns a ticket number (TN) to each sender sequence number (SN) • If <Sender, SN> has been seen earlier, the existing TN is marked as received • Otherwise, a new TN is created and <Sender, SN, TN> is stored
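To illustrate the receiver-side ticket bookkeeping described above (this is a sketch, not the FTL-Charm++ implementation; TicketTable and its members are hypothetical):

    #include <map>
    #include <utility>

    // SN = sender's sequence number, TN = receiver-assigned ticket number.
    struct TicketTable {
      std::map<std::pair<int,int>, int> seen;  // (senderId, SN) -> TN
      int nextTN = 1;

      // Return the TN for a <sender, SN> pair, assigning a new one if unseen.
      int getTicket(int senderId, int SN) {
        auto key = std::make_pair(senderId, SN);
        auto it = seen.find(key);
        if (it != seen.end())
          return it->second;      // duplicate request: reuse the stored TN
        int tn = nextTN++;
        seen[key] = tn;           // remember <sender, SN, TN>
        return tn;
      }
    };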
Status • Most of Charm++ and AMPI has been ported • Support for migration has not yet been implemented in the fault-tolerant protocol • Parallel restart not yet implemented • Not in the Charm main branch
Thank You! Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/ Parallel Programming Lab at University of Illinois