SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, et al. University of Wisconsin-Madison Presented by: Nick Kirchem March 26, 2004
Motivation and Goals • Availability is crucial • Internet services and database management systems are heavily relied upon • Unless the architecture changes, availability will decrease as the number of components increases • Goals for the paper • A lightweight mechanism providing end-to-end recovery from transient and permanent faults • Decouple recovery from detection (use traditional detection techniques: RAID, ECC, duplicate ALUs, etc.)
Solution • SafetyNet • A global checkpoint/recovery scheme • Creates periodic, system-wide logical checkpoints • Logs all changes to the architected state
Recovery Scheme Challenges • (1) Saving previous values before every register/cache update or coherence message would require too much storage space • (2) All processors and components must recover to a consistent point • (3) SafetyNet must determine when it is safe to roll back to a recovery checkpoint
(1) Checkpointing Via Logging • Each checkpoint logically captures a complete copy of the system's architected state • Checkpoints are taken at coarse granularity (e.g., every 100,000 cycles) • A block is logged only on the first altering action per checkpoint interval
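To make the logging idea concrete, here is a minimal sketch of what a checkpoint log buffer entry might hold; the struct and field names are illustrative assumptions, not taken from the paper, and a 64-byte block size is assumed.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Illustrative CLB entry (names and sizes are assumptions): each entry
// preserves a block's pre-update value, so a checkpoint stays logical
// instead of being a stored full copy of memory.
struct ClbEntry {
    uint64_t blockAddress;             // which block was altered
    uint32_t checkpointNumber;         // interval this entry can undo
    std::array<uint8_t, 64> oldValue;  // pre-update copy of the block
};

// A checkpoint is reconstructed by unrolling entries, newest first,
// rather than by restoring a saved snapshot.
using CheckpointLogBuffer = std::vector<ClbEntry>;
```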
(2) Consistent Checkpoints • All components coordinate local checkpoints through logical time • A coherence transaction appears logically atomic once it completes • Point of atomicity (PoA): when the previous owner processes the request • The response includes the checkpoint number (CN) of this PoA • The requestor does not advance its recovery point until all outstanding transactions are complete (see the sketch below)
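A hedged sketch of the requestor-side bookkeeping this implies (class and method names are hypothetical, not the paper's hardware): the node simply refuses to agree to advance the recovery point while any of its coherence transactions is still in flight.

```cpp
#include <cstdint>

// Hypothetical requestor-side bookkeeping: a transaction is bound to
// the checkpoint interval named in the owner's response, and the
// recovery point may only advance once nothing remains outstanding.
class RequestorState {
public:
    void transactionIssued() { ++outstanding_; }

    // The response carries the CN of the point of atomicity.
    void responseReceived(uint32_t pointOfAtomicityCn) {
        lastBoundCn_ = pointOfAtomicityCn;
        --outstanding_;
    }

    bool mayAdvanceRecoveryPoint() const { return outstanding_ == 0; }

private:
    uint32_t outstanding_ = 0;   // coherence transactions in flight
    uint32_t lastBoundCn_ = 0;   // interval of the latest completed one
};
```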
(3) Validating Checkpoints • States: the current (active) state, checkpoints waiting to be validated, and the recovery point • Validation: determining which checkpoint is the recovery point • All prior execution must be fault-free • Coordination is pipelined and performed in the background (off the critical path)
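The lifecycle can be summarized with a small enum; the terminology follows the slide, but the encoding itself is illustrative.

```cpp
// Illustrative encoding of the roles a checkpoint interval can play.
enum class CheckpointState {
    Active,             // the current, still-executing interval
    PendingValidation,  // complete, awaiting fault-detection clearance
    RecoveryPoint       // newest interval known to be fault-free
};
```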
(3) Validation Continued • Validation latency depends on fault detection latency • Output commit problem • Delay all output events until their checkpoint is validated (sketched below) • The delay depends on validation latency • Input commit problem • Log incoming messages
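A minimal sketch of an output-commit buffer under these rules (names and the payload type are assumptions): outbound events are tagged with the interval that produced them and released only once the recovery point has passed that interval.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Illustrative output-commit buffer: a rollback can never "unsend"
// an output, because nothing leaves before its interval validates.
struct OutputEvent {
    uint32_t checkpointNumber;  // interval that produced the event
    std::string payload;        // e.g., an outgoing network packet
};

class OutputCommitBuffer {
public:
    void enqueue(OutputEvent e) { pending_.push_back(std::move(e)); }

    // Release everything produced at or before the new recovery point.
    template <typename Sink>
    void onRecoveryPointAdvanced(uint32_t rpcn, Sink&& send) {
        while (!pending_.empty() &&
               pending_.front().checkpointNumber <= rpcn) {
            send(pending_.front());
            pending_.pop_front();
        }
    }

private:
    std::deque<OutputEvent> pending_;  // oldest event at the front
};
```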
Recovery • Processors restore their register checkpoints • Caches and memories unroll local logs • State from coherence transactions in progress is discarded • Reconfiguration if necessary
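A minimal sketch of the log unroll, reusing the illustrative ClbEntry from the logging sketch above (the flat memory model and names are assumptions): entries are undone newest-first, so each block ends up holding the value it had at the recovery point.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

struct ClbEntry {                      // as sketched earlier
    uint64_t blockAddress;
    uint32_t checkpointNumber;
    std::array<uint8_t, 64> oldValue;
};

// Undo, newest-first, every update made after the recovery point.
void unrollLog(std::vector<ClbEntry>& log, uint8_t* memoryBase,
               uint32_t rpcn) {
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        if (it->checkpointNumber > rpcn) {
            std::memcpy(memoryBase + it->blockAddress,
                        it->oldValue.data(), it->oldValue.size());
        }
    }
    log.clear();  // in-flight coherence state is simply discarded
}
```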
Implementation • Checkpoint Log Buffers (CLBs) • One associated with each cache and memory component • Store logged state • Shadow registers for processor checkpoints • 2D torus interconnect • MOSI directory protocol
Logical Time Base • A loosely synchronous checkpoint clock is distributed redundantly • Ensures no single point of failure • Each clock edge increments the current checkpoint number (CCN) • Works as long as skew < minimum communication time between nodes • Assigning a transaction to a checkpoint interval is protocol-dependent
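The safety condition is simple enough to state in code; the invariant comes from the slide, but the names and structure here are assumptions.

```cpp
#include <cstdint>

// Each edge of the distributed checkpoint clock opens a new interval.
struct CheckpointClock {
    uint32_t ccn = 0;               // current checkpoint number
    void onClockEdge() { ++ccn; }
};

// Loosely synchronous time is safe when skew between any two nodes is
// smaller than the minimum node-to-node message latency, so a message
// sent in interval k cannot arrive before its receiver enters k.
bool skewIsSafe(uint64_t maxSkewCycles, uint64_t minLatencyCycles) {
    return maxSkewCycles < minLatencyCycles;
}
```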
Logging • A memory block is written to the CLB whenever an action might have to be undone • CLBs are write-only (except during recovery) and off the critical path • A CN is added to each block in the cache • Steps taken for an update-action (sketched below): • Compare the CCN with the block's CN • Log the block if CCN >= CN • Update the block's CN to CCN+1 • Perform the update action • Send the updated CN with the coherence response
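Those steps map directly onto a small function (a sketch of the slide's algorithm; field and function names are assumptions). Because logging sets the block's CN to CCN+1, the test CCN >= CN fails for the rest of the interval, which is exactly the "log only on first altering action" rule from earlier.

```cpp
#include <cstdint>

// Sketch of the slide's update-action steps for one cache block.
struct CacheBlock {
    uint32_t cn = 0;  // interval at which this block next needs logging
    // ... tag, data, and coherence state elided ...
};

// Returns the CN to send with the coherence response.
template <typename LogFn>
uint32_t onUpdateAction(CacheBlock& block, uint32_t ccn,
                        LogFn&& logOldValue) {
    if (ccn >= block.cn) {     // first altering action this interval?
        logOldValue(block);    // write the pre-update value to the CLB
        block.cn = ccn + 1;    // suppress further logging this interval
    }
    // ... perform the update action itself ...
    return block.cn;           // updated CN travels with the response
}
```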
Checkpoint Creation/Validation • Choose a suitable checkpoint clock frequency • Trade-off: detection latency tolerance vs. total CLB storage • Lost messages (detected by timeout) trigger recovery • The recovery point checkpoint number (RPCN) is broadcast whenever the recovery point advances • After a fault, a recovery message (including the RPCN) is sent • The interconnection network is drained • Processors, caches, and memories recover to the RPCN
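A hypothetical sketch of the system-level flow (names and structure are assumptions, not the paper's mechanism): validation advances the RPCN in the background, while a detected fault freezes forward progress and rolls everyone back to it.

```cpp
#include <cstdint>

// Hypothetical coordinator view of validation and recovery.
struct ValidationCoordinator {
    uint32_t rpcn = 0;  // recovery point checkpoint number

    // All nodes have reported checkpoint `cn` fault-free: advance the
    // recovery point (broadcast of the new RPCN elided here).
    void onCheckpointValidated(uint32_t cn) { rpcn = cn; }

    // A fault was detected, e.g. a coherence request timed out.
    void onFault() {
        // 1. broadcast a recovery message carrying rpcn
        // 2. drain the interconnection network
        // 3. processors, caches, and memories recover to rpcn
    }
};
```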
Implementation Summary • Processor/cache changes required • Processor must be able to checkpoint its register state • Cache must be able to copy old versions of blocks out before transferring them • CNs added to L1 cache blocks • Directory protocol changes • CNs added to data response messages • Coherence requests can be nacked • A final ack is required from the requestor to the directory
Evaluation Parameters • 16-processor target system • Simics + memory hierarchy simulator • 4 commercial workloads, 1 scientific • In-order processors, 4 billion instr/sec • MOSI directory protocol with 2D torus • Checkpoint interval = 100,000 cycles
Experiments • (1) Fault-free performance • Overhead determined to be negligible • (2) Dropped messages • Periodic transient faults injected (10/sec) • Recovery latency << crash + reboot • (3) Lost switch • Hard fault: half of a switch is killed • Crash avoided, but performance suffers due to restricted bandwidth
Sensitivity Analysis • Cache bandwidth • Depends on the frequency of stores requiring logging (additional bandwidth is consumed reading the old copy of the block) • Cache ownership transfers: no additional bandwidth • Storage cost • CLBs are sized to avoid performance degradation due to full buffers • Entries per checkpoint correspond to the logging frequency
Conclusion • SafetyNet is efficient in the common case (error-free execution): little to no latency added • Latency is hidden by pipelining the validation of checkpoints • Checkpoints are coordinated in logical time (no synchronization exchanges necessary)
Questions • What about faults/errors in the saved state itself? • What if there's a permanent fault for which you can't reconfigure (an endless loop of recovering to the last checkpoint)?