SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery • D. Sorin, M. Martin, M. Hill, D. Wood • Presented by Akin Olugbade, 03/05/2010

Motivation • Faster processors and shrinking process technology make chips more susceptible to errors • Systems need high availability • Many Internet servers are shared memory multiprocessors • Rebooting or crashing the system is an undesirable way to deal with errors

SafetyNet Design • Create globally consistent checkpoints that the system can recover to if an error is detected • Save the architected state: processor registers, memory state, and coherence state • Validate that a checkpoint is fault-free • On an error, recover to the most recent validated checkpoint
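The validate-then-recover cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `RecoveryPointTracker`, its method names, and the idea of a single central tracker are hypothetical and are not taken from the slides, which do not say how agreement is implemented in hardware. The sketch shows just the rule that the recovery point advances only to a checkpoint that every component considers fault-free.

```python
class RecoveryPointTracker:
    """Toy model of checkpoint validation (hypothetical names, not the hardware design)."""

    def __init__(self, components):
        # Highest checkpoint number each component has declared fault-free.
        self.fault_free = {name: 0 for name in components}
        # Most recent checkpoint validated by all components.
        self.recovery_point = 0

    def report_fault_free(self, name, checkpoint):
        """Record a component's report and advance the recovery point if all agree."""
        self.fault_free[name] = max(self.fault_free[name], checkpoint)
        agreed = min(self.fault_free.values())
        self.recovery_point = max(self.recovery_point, agreed)
        return self.recovery_point

    def on_error(self):
        """Return the checkpoint to roll back to; later checkpoints are discarded."""
        for name in self.fault_free:
            self.fault_free[name] = self.recovery_point
        return self.recovery_point


tracker = RecoveryPointTracker(["cpu0", "cpu1", "memory"])
tracker.report_fault_free("cpu0", 3)    # others still at 0, recovery point stays 0
tracker.report_fault_free("cpu1", 2)    # memory still at 0, recovery point stays 0
tracker.report_fault_free("memory", 2)  # all agree on checkpoint 2
rollback_target = tracker.on_error()    # roll back to checkpoint 2
```

Taking the minimum over all components is what makes a validated checkpoint global: one slow or suspicious component holds the recovery point back for everyone.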

SafetyNet Design • Reduced logging space: log a change to a given register, memory block, or coherence permission only once per checkpoint interval • Point of atomicity: the requestor does not advance its recovery point until all of its outstanding requests have completed; consistent logical time ensures that checkpoints are globally consistent • Validation: all components must agree that a checkpoint is fault-free before it is validated
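The once-per-interval logging rule can be illustrated with a toy undo log. This is a minimal sketch, assuming a software model in which memory is a dictionary and unwritten blocks read as 0; the class `LoggedMemory` and all of its method names are hypothetical. Checkpoint k here means the state at the moment the checkpoint number became k.

```python
class LoggedMemory:
    """Toy memory with an undo log, logging each block once per checkpoint interval."""

    def __init__(self):
        self.mem = {}          # block address -> current value
        self.ccn = 0           # current checkpoint number
        self.last_logged = {}  # block address -> interval of its latest log entry
        self.clb = []          # log entries: (interval, address, old value)

    def load(self, addr):
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        # Save the old value only on the first store to this block in this interval.
        if self.last_logged.get(addr) != self.ccn:
            self.clb.append((self.ccn, addr, self.load(addr)))
            self.last_logged[addr] = self.ccn
        self.mem[addr] = value

    def new_checkpoint(self):
        self.ccn += 1

    def discard_validated(self, checkpoint):
        # Once a checkpoint is validated, older log entries are never needed again.
        self.clb = [e for e in self.clb if e[0] >= checkpoint]

    def recover(self, checkpoint):
        # Undo, newest first, every change logged at or after the checkpoint.
        while self.clb and self.clb[-1][0] >= checkpoint:
            _, addr, old = self.clb.pop()
            self.mem[addr] = old
        self.ccn = checkpoint
        self.last_logged = {a: c for a, c in self.last_logged.items() if c < checkpoint}


m = LoggedMemory()
m.store(0x10, 1)    # interval 0: logged
m.new_checkpoint()  # checkpoint 1
m.store(0x10, 2)    # interval 1: logged (old value 1)
m.store(0x10, 3)    # interval 1: same block again, not logged
m.store(0x20, 7)    # interval 1: logged (old value 0)
entries_before_recovery = len(m.clb)  # 3 entries for 4 stores
m.recover(1)        # block 0x10 reads 1 again, block 0x20 reads 0
```

Because only the first store to a block in an interval needs its old value saved, repeated stores to hot blocks cost no extra log space until the next checkpoint.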

Logical Time

Evaluation

Conclusion • + The checkpoint/recovery system can be independent of the error detection mechanism • + Negligible performance overhead in the error-free common case • + Storage and bandwidth overhead can be reduced greatly by lengthening the checkpoint interval
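The last point follows from logging each block only once per interval: a longer interval absorbs more repeated stores to the same block. The sketch below counts log entries for a synthetic store trace. Everything in it is an assumption made for illustration: the function `log_entries` is hypothetical, the trace is made up, and intervals are measured in number of stores rather than in logical time. It is not data from the evaluation.

```python
def log_entries(trace, interval_len):
    """Count log entries when each block is logged once per checkpoint interval."""
    entries = 0
    for start in range(0, len(trace), interval_len):
        # Within one interval, only the first store to each block is logged.
        entries += len(set(trace[start:start + interval_len]))
    return entries


# Synthetic trace: 24 stores cycling over four block addresses.
trace = [addr for _ in range(6) for addr in (0xA0, 0xB0, 0xC0, 0xD0)]

counts = {n: log_entries(trace, n) for n in (2, 8, 24)}
# counts == {2: 24, 8: 12, 24: 4}
```

The saving depends on how often stores revisit the same blocks; a trace that never repeats a block would log every store regardless of interval length.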

Questions • Does validation latency matter for output commit? • How do we handle stores when the CLB (checkpoint log buffer) fills up? • Is SafetyNet suitable for mission-critical systems? • If validation is fast enough, would we want to shorten the checkpoint interval?
