SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
March 31st, 2006

Target:
• Systems where availability is crucial
• SMP commercial servers: application services, database management systems

Motivation:
• Increase in performance => decrease in feature size => decrease in reliability
• Cost of fault-tolerant solution: important

Approach and Challenges
• Decouple:
    • Local fault detection - ECC, timeout, etc.
    • Lightweight & global fault recovery - SafetyNet
• Challenges for lightweight recovery schemes:
    • Amount of storage (checkpoint logs)
    • Maintain a consistent global recovery point
    • Advance the global recovery point

SafetyNet: High-Level View
• Maintain per-processor checkpoints:
    • One globally validated recovery point
    • Multiple coordinated checkpoints pending validation
    • Identified by a global logical timestamp
• Fault detected => recover state to the (global) recovery point

Solution: Storage
• Checkpoint architectural state:
    • Registers:
        • Shadow registers or cached copies
        • Copy once at the beginning of each checkpoint
    • Memory and caches:
        • Checkpoint Log Buffers (CLBs)
        • Log incrementally: stores, ownership changes
        • Log only the first update per block per checkpoint

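The "log only the first update per block per checkpoint" rule can be sketched in a few lines. This is a minimal software model, not the hardware design: the class `LoggedCache`, its fields, and the use of `None` for "block not present" are all hypothetical names and simplifications chosen for illustration.

```python
class LoggedCache:
    """Toy model of a cache with a Checkpoint Log Buffer (CLB)."""

    def __init__(self):
        self.current_cn = 1   # checkpoint interval currently executing
        self.data = {}        # block address -> current value
        self.block_cn = {}    # block address -> CN in which it was last logged
        self.clb = []         # log entries: (cn, address, old_value)

    def store(self, addr, value):
        # Log the old value only on the first update to this block
        # within the current checkpoint interval.
        if self.block_cn.get(addr) != self.current_cn:
            self.clb.append((self.current_cn, addr, self.data.get(addr)))
            self.block_cn[addr] = self.current_cn
        self.data[addr] = value

    def begin_checkpoint(self):
        # Starting a new interval makes every block "unlogged" again.
        self.current_cn += 1


cache = LoggedCache()
cache.store(0x40, 1)      # first update in interval 1: logged
cache.store(0x40, 2)      # same block, same interval: not logged
cache.begin_checkpoint()  # interval 2 begins
cache.store(0x40, 3)      # first update in interval 2: logged with old value 2
print(cache.clb)
```

Three stores produce only two log entries, which is why the per-block checkpoint number keeps CLB storage proportional to the number of distinct blocks touched per interval rather than to the number of stores.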
Solution: Global Coherence
• Logical time base:
    • General agreement on the checkpoint interval for each coherence transaction
    • Loosely synchronous checkpoint clock
    • Maintain a per-block checkpoint number (CN)

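One way to see why a loosely synchronous clock can still serve as a logical time base is the sketch below. It assumes, as a condition of this sketch rather than something stated on the slide, that the clock skew between two nodes is smaller than the minimum message latency between them; under that assumption a message never arrives in an earlier checkpoint interval than the one it was sent in. `INTERVAL`, `checkpoint_number`, and `message_cns` are hypothetical names and values.

```python
INTERVAL = 100  # cycles per checkpoint interval (hypothetical value)


def checkpoint_number(global_cycle, skew):
    """CN seen by a node whose checkpoint clock lags the reference by `skew` cycles."""
    return (global_cycle - skew) // INTERVAL + 1


def message_cns(send_cycle, latency, sender_skew, receiver_skew):
    """Return (CN at the sender when sent, CN at the receiver on arrival)."""
    sent = checkpoint_number(send_cycle, sender_skew)
    received = checkpoint_number(send_cycle + latency, receiver_skew)
    return sent, received


# Skew (8 cycles) below the message latency (10 cycles): arrival CN >= send CN.
ok_case = message_cns(send_cycle=199, latency=10, sender_skew=0, receiver_skew=8)

# Skew (15 cycles) above the latency: the message appears to travel back in logical time.
bad_case = message_cns(send_cycle=200, latency=10, sender_skew=0, receiver_skew=15)

print(ok_case, bad_case)  # (2, 3) (3, 2)
```

In the second case the receiver would place the transaction in an interval before the one the sender used, so the nodes would not agree on which checkpoint the transaction belongs to.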
Solution: Global Recovery Point
• Checkpoint validation:
    • All nodes agree that execution up to that point was error-free
    • Broadcast the new recovery point checkpoint number
• Restart:
    • Drain the interconnection network
    • Discard in-progress coherence state
    • Processors: restore the register checkpoint
    • Memory: undo actions in the Checkpoint Log Buffers (CLBs)
    • Caches: undo actions in the CLBs

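The validation and restart steps above can be sketched for a single node as follows. This is a simplified software model under stated assumptions: the class `Node` and its methods are hypothetical, checkpoint `k` is taken to mean "state at the start of interval `k`", `None` stands for "block not present", and network draining and coherence state are omitted.

```python
class Node:
    """Toy single-node model of checkpoint validation and recovery."""

    def __init__(self):
        self.registers = {"pc": 0}
        self.memory = {}
        self.current_cn = 1
        self.recovery_cn = 1                       # validated recovery point
        self.reg_checkpoints = {1: dict(self.registers)}
        self.clb = []                              # (cn, addr, old_value)
        self.logged = set()                        # (cn, addr) pairs already logged

    def store(self, addr, value):
        key = (self.current_cn, addr)
        if key not in self.logged:                 # first update in this interval
            self.logged.add(key)
            self.clb.append((self.current_cn, addr, self.memory.get(addr)))
        self.memory[addr] = value

    def take_checkpoint(self):
        self.current_cn += 1
        self.reg_checkpoints[self.current_cn] = dict(self.registers)

    def advance_recovery_point(self, new_cn):
        # Validation: state older than the new recovery point is no longer needed.
        self.clb = [e for e in self.clb if e[0] >= new_cn]
        self.logged = {k for k in self.logged if k[0] >= new_cn}
        self.reg_checkpoints = {
            cn: regs for cn, regs in self.reg_checkpoints.items() if cn >= new_cn
        }
        self.recovery_cn = new_cn

    def recover(self):
        # Undo logged updates newest-first, then restore the register checkpoint.
        for _cn, addr, old in reversed(self.clb):
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.clb.clear()
        self.logged.clear()
        self.registers = dict(self.reg_checkpoints[self.recovery_cn])
        self.reg_checkpoints = {self.recovery_cn: dict(self.registers)}
        self.current_cn = self.recovery_cn


node = Node()
node.store("A", 1)               # interval 1
node.registers["pc"] = 10
node.take_checkpoint()           # checkpoint 2 (pending validation)
node.store("A", 2)
node.advance_recovery_point(2)   # interval 1 validated as error-free
node.registers["pc"] = 20
node.take_checkpoint()           # checkpoint 3 (pending validation)
node.store("A", 3)
node.store("B", 7)
node.recover()                   # fault detected: roll back to checkpoint 2
print(node.memory, node.registers)
```

After recovery the node holds the state it had when checkpoint 2 was taken: `A` is back to 1, `B` is gone, and `pc` is 10. Unwinding the log in reverse order matters because the same block may have been logged in more than one pending interval.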
Questions
• Why is having a coordinated checkpoint important?
• Why broadcast the recovery point checkpoint number twice:
    • when advancing the recovery point
    • when triggering recovery?
• Why a sequentially consistent model?
• Is the scheme valid for processor consistency?
• Is this a good idea? Has it caught on?