SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

03/05/2010. Presented by Akin Olugbade. Paper by D. Sorin, M. Martin, M. Hill, and D. Wood.

Presentation Transcript


  1. SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery • 03/05/2010 • Presented by Akin Olugbade • D. Sorin, M. Martin, M. Hill, D. Wood

  2. Motivation • Rising processor speeds and shrinking process technology make chips more susceptible to errors • Systems need high availability • Shared memory multiprocessors make up a large share of Internet servers • Crashing or rebooting the system is an undesirable way to deal with errors

  3. SafetyNet Design • Create globally consistent checkpoints that the system can recover to if an error is detected • Save the architected state, which consists of processor registers, memory state, and coherence state • Validate that a checkpoint is fault free • Recover to the most recent validated checkpoint in case of an error
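
The checkpoint, validate, recover cycle on this slide can be sketched as a toy model. This is only an illustration under simplifying assumptions: the architected state is a plain dictionary, each checkpoint is a full copy of it, and the names `CheckpointedSystem`, `take_checkpoint`, `validate`, and `recover` are hypothetical, not taken from the slides or the paper.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointedSystem:
    """Toy model: state is a dict and every checkpoint is a full copy (an assumption)."""
    state: dict = field(default_factory=dict)
    checkpoints: dict = field(default_factory=dict)  # checkpoint number -> saved state
    validated: int = 0  # most recent checkpoint known to be fault free
    current: int = 0    # most recent checkpoint taken

    def __post_init__(self):
        # Checkpoint 0 is the initial state and is treated as validated.
        self.checkpoints[0] = dict(self.state)

    def take_checkpoint(self):
        """Save the current state as a new, not yet validated checkpoint."""
        self.current += 1
        self.checkpoints[self.current] = dict(self.state)
        return self.current

    def validate(self, n):
        """Mark checkpoint n as fault free and discard older checkpoints."""
        if n > self.current:
            raise ValueError("cannot validate a checkpoint that was never taken")
        self.validated = max(self.validated, n)
        for k in [k for k in self.checkpoints if k < self.validated]:
            del self.checkpoints[k]

    def recover(self):
        """Roll back to the most recent validated checkpoint."""
        self.state = dict(self.checkpoints[self.validated])
        for k in [k for k in self.checkpoints if k > self.validated]:
            del self.checkpoints[k]
        self.current = self.validated
        return self.validated
```

In this sketch an error detected after an unvalidated checkpoint discards that checkpoint too, since only validated checkpoints are known to be fault free.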

  4. SafetyNet Design • Logging space reduced: a change to a given register, memory block, or coherence permission is logged only once per checkpoint interval • Point of atomicity: the requestor does not increment its recovery point until all outstanding requests are completed; consistent logical time ensures global consistency of checkpoints • Validation: all components must agree that a checkpoint is a valid, fault-free point before it is validated
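
A minimal sketch of the log-once-per-interval idea and the all-components-agree validation rule follows. It assumes an undo log that stores the old value of a block the first time the block is written in an interval, and it treats checkpoint number n as the state at the start of interval n. The names `CheckpointLog`, `store`, `advance`, `recover_to`, and `all_agree` are hypothetical.

```python
_ABSENT = object()  # marks "block had no value before this interval"


class CheckpointLog:
    """Toy undo log: each block is logged at most once per checkpoint interval."""

    def __init__(self):
        self.memory = {}
        self.ccn = 0           # current checkpoint number (assumed naming)
        self.last_logged = {}  # block -> checkpoint number in which it was last logged
        self.log = []          # entries: (checkpoint number, block, old value)

    def store(self, addr, value):
        # Log the old value only on the first write to addr in this interval.
        if self.last_logged.get(addr) != self.ccn:
            self.log.append((self.ccn, addr, self.memory.get(addr, _ABSENT)))
            self.last_logged[addr] = self.ccn
        self.memory[addr] = value

    def advance(self):
        """Start a new checkpoint interval."""
        self.ccn += 1

    def recover_to(self, n):
        """Undo every change made in interval n or later, newest first."""
        while self.log and self.log[-1][0] >= n:
            _, addr, old = self.log.pop()
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.ccn = n
        self.last_logged = {a: c for a, c in self.last_logged.items() if c < n}


def all_agree(votes):
    """A checkpoint is validated only if every component reports it fault free."""
    return all(votes.values())
```

Three stores to the same block within one interval produce a single log entry, which is where the space saving in this sketch comes from.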

  5. Logical Time

  6. Evaluation

  7. Evaluation

  8. Conclusion • + The checkpoint/recovery system can be independent of the error detection mechanism • + Negligible performance overhead in the error-free common case • + Storage and bandwidth overhead can be reduced greatly by increasing the checkpoint interval
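
One way to see why a longer checkpoint interval can reduce logging overhead is to count log entries for a store trace when each block is logged at most once per interval. The sketch below measures the interval in number of stores and uses a synthetic trace; both are assumptions made for illustration, and the counts are not results from the paper. The function name `log_entries` is hypothetical.

```python
def log_entries(trace, interval):
    """Count log entries when each block is logged at most once per interval.

    trace    -- sequence of block addresses written, in order
    interval -- checkpoint interval length, measured in stores (an assumption)
    """
    entries = 0
    seen = set()
    for i, addr in enumerate(trace):
        if i % interval == 0:
            seen.clear()  # a new checkpoint interval begins
        if addr not in seen:
            seen.add(addr)
            entries += 1
    return entries


# Synthetic trace: 1000 stores cycling over 4 blocks.
trace = [i % 4 for i in range(1000)]
short = log_entries(trace, 10)   # 400 entries
long = log_entries(trace, 100)   # 40 entries
```

With this trace, a tenfold longer interval needs a tenth of the log entries, because repeated writes to the same block within an interval are absorbed by the first entry.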

  9. Questions • Does the validation latency matter in the case of output commit? • How do we deal with stores when the checkpoint log buffer (CLB) fills up? • Is SafetyNet suitable for mission-critical situations? • If validation is fast enough, would we want to shorten the checkpoint interval?
