SafetyNet

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
March 31st, 2006

Target:
• Systems where availability is crucial
• SMP commercial servers: application services, database management systems

Motivation:
• Higher performance => smaller feature sizes => lower reliability
• The cost of the fault-tolerance solution matters

Approach and Challenges
• Decouple:
  • Local fault detection: ECC, timeouts, etc.
  • Lightweight, global fault recovery: SafetyNet
• Challenges for lightweight recovery schemes:
  • Amount of storage (checkpoint logs)
  • Maintaining a consistent global recovery point
  • Advancing the global recovery point

SafetyNet: High-Level View
• Maintain per-processor checkpoints:
  • One globally validated recovery point
  • Multiple coordinated checkpoints pending validation
  • Identified by a global logical timestamp
• Fault detected => recover state to the global recovery point

Solutions: Storage
• Checkpoint architectural state:
  • Registers:
    • Shadow registers or cached copies
    • Copied once at the beginning of each checkpoint
  • Memory and caches:
    • Checkpoint Log Buffers (CLBs)
    • Incrementally log stores and ownership changes
    • Log only the first update per block per checkpoint
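
The "log only the first update per block per checkpoint" rule can be pictured with a small software model. This is a minimal sketch, not the hardware design: the class and method names (`CheckpointLogBuffer`, `begin_checkpoint`, `store`) are hypothetical, memory is modeled as a dictionary keyed by block address, and only stores are modeled (ownership changes are left out).

```python
class CheckpointLogBuffer:
    """Toy model of a CLB in front of a block-addressed memory (hypothetical API)."""

    def __init__(self):
        self.memory = {}           # block address -> current value
        self.current_cn = 0        # checkpoint interval currently executing
        self.logged = set()        # blocks already logged in this interval
        self.log = []              # entries: (checkpoint number, address, old value)

    def begin_checkpoint(self):
        # A new interval starts: every block is eligible to be logged again.
        self.current_cn += 1
        self.logged.clear()

    def store(self, addr, value):
        # Save the old value only on the first update to this block in the
        # current interval; later updates in the same interval cost no log space.
        if addr not in self.logged:
            self.log.append((self.current_cn, addr, self.memory.get(addr)))
            self.logged.add(addr)
        self.memory[addr] = value


clb = CheckpointLogBuffer()
clb.store(0x40, 1)
clb.store(0x40, 2)      # same block, same interval: not logged again
clb.store(0x80, 3)
clb.begin_checkpoint()
clb.store(0x40, 5)      # first update in the new interval: logged with old value 2
```

In this model the log grows with the number of distinct blocks touched per interval, not with the number of stores, which is the property the storage slide relies on.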

Solution: Global Coherence
• Logical time base:
  • All nodes agree on the checkpoint interval in which each coherence transaction occurs
  • Loosely synchronous checkpoint clock
  • Maintain a per-block checkpoint number (CN)
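
A loosely synchronous checkpoint clock can be illustrated with a toy timing model. The sketch below rests on an assumption the slides do not state: that the skew between two nodes' checkpoint clocks is smaller than the minimum message latency between them, so a message sent in checkpoint interval k never arrives at a node that is still in an earlier interval. `INTERVAL`, `Node`, and `receive_cn` are hypothetical names, and the cycle counts are illustrative only.

```python
from dataclasses import dataclass

INTERVAL = 100_000  # cycles per checkpoint interval (illustrative value)


@dataclass
class Node:
    skew: int  # offset of this node's checkpoint clock from a reference, in cycles

    def current_cn(self, global_cycle: int) -> int:
        # Each node derives its checkpoint number from its own skewed clock.
        return (global_cycle + self.skew) // INTERVAL


def receive_cn(sender: Node, receiver: Node, send_cycle: int, latency: int):
    """Return (CN at the sender when sending, CN at the receiver on arrival)."""
    sent_cn = sender.current_cn(send_cycle)
    recv_cn = receiver.current_cn(send_cycle + latency)
    return sent_cn, recv_cn


ahead, behind = Node(skew=30), Node(skew=0)

# Latency (50) exceeds skew (30): the message arrives in the same interval or later.
ok = receive_cn(ahead, behind, send_cycle=99_980, latency=50)

# Latency (10) is below skew (30): the message would arrive "in the past".
bad = receive_cn(ahead, behind, send_cycle=99_980, latency=10)
```

Under the stated assumption the first case gives matching checkpoint numbers, while the second shows the receiver still in the previous interval, which is the situation a valid logical time base has to rule out.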

Solution: Global Recovery Point
• Checkpoint validation:
  • All nodes agree that execution up to that point was error-free
  • Broadcast the new recovery point checkpoint number
• Restart:
  • Drain the interconnection network
  • Discard in-progress coherence state
  • Processors: restore the register checkpoint
  • Memory: undo the actions logged in the Checkpoint Log Buffers (CLBs)
  • Caches: undo the actions logged in the CLBs
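
The validation and restart steps can be sketched in software as follows. This is an illustrative model under stated assumptions, not the paper's mechanism: each node reports the latest checkpoint it has validated as error-free, the recovery point is taken as the minimum of those reports, and a log entry `(cn, addr, old_value)` holds the value a block had before its first update in interval `cn`. The function names are hypothetical, and `None` is used to mean "block did not exist yet".

```python
def advance_recovery_point(validated_by_node):
    """Latest checkpoint that every node has validated (assumed rule: the minimum)."""
    return min(validated_by_node.values())


def recover(memory, log, recovery_cn):
    """Undo logged updates from intervals after recovery_cn, newest first."""
    while log and log[-1][0] > recovery_cn:
        _, addr, old_value = log.pop()
        if old_value is None:
            memory.pop(addr, None)    # block had no value before this interval
        else:
            memory[addr] = old_value  # restore the pre-update value


# Log is in time order, so checkpoint numbers never decrease along it.
memory = {"A": 9, "B": 7}
log = [(1, "A", 0), (2, "A", 4), (3, "B", None)]

validated = {"p0": 2, "p1": 1, "mem": 3}
rp = advance_recovery_point(validated)   # slowest node bounds the recovery point

recover(memory, log, rp)                 # roll back everything after interval rp
```

Walking the log backwards matters when a block was updated in several intervals: the oldest surviving undo entry is applied last, so the block ends with the value it had at the recovery point.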

Evaluation: Performance Impact

Evaluation: Sensitivity

Evaluation: Sensitivity (Cont)

Questions
• Why is having a coordinated checkpoint important?
• Why broadcast the recovery point checkpoint number twice:
  • when advancing the recovery point
  • when triggering recovery?
• Why a sequentially consistent model?
• Is the scheme valid for processor consistency?
• Is this a good idea? Has it caught on?
