SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, et al. University of Wisconsin-Madison Presented by: Nick Kirchem March 26, 2004
Motivation and Goals • Availability is crucial • Internet services and database management systems are heavily relied upon • Unless the architecture changes, availability will decrease as the number of components increases • Goals for the paper • A lightweight mechanism providing end-to-end recovery from transient and permanent faults • Decouple recovery from detection (use traditional detection techniques: RAID, ECC, duplicate ALUs, etc.)
Solution • SafetyNet • A global checkpoint/recovery scheme • Creates periodic, system-wide logical checkpoints • Logs all changes to the architected state
Recovery Scheme Challenges • (1) Saving previous values before every register/cache update or coherence message would require too much storage space • (2) All processors and components must recover to a consistent point • (3) SafetyNet must determine when it is safe to roll back to a recovery checkpoint
(1) Checkpointing Via Logging • Each checkpoint logically captures a complete copy of the system's architected state • Checkpoints are taken at coarse granularity (e.g., every 100,000 cycles) • A block is logged only on the first altering action per checkpoint interval
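To make the logging idea concrete, here is a minimal sketch of what a checkpoint log buffer entry might hold; the struct and field names are illustrative assumptions, not taken from the paper, and a 64-byte block size is assumed.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Illustrative CLB entry (names and sizes are assumptions): each entry
// preserves a block's pre-update value, so a checkpoint stays logical
// instead of being a stored full copy of memory.
struct ClbEntry {
    uint64_t blockAddress;             // which block was altered
    uint32_t checkpointNumber;         // interval this entry can undo
    std::array<uint8_t, 64> oldValue;  // pre-update copy of the block
};

// A checkpoint is reconstructed by unrolling entries, newest first,
// rather than by restoring a saved snapshot.
using CheckpointLogBuffer = std::vector<ClbEntry>;
```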
(2) Consistent Checkpoints • All components coordinate local checkpoints through logical time • A coherence transaction appears logically atomic once it completes • Point of atomicity (PoA): when the previous owner processes the request • The response includes the checkpoint number (CN) of this PoA • The requestor does not advance its recovery point until all outstanding transactions are complete (see the sketch below)
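A hedged sketch of the requestor-side bookkeeping this implies (class and method names are hypothetical, not the paper's hardware): the node simply refuses to agree to advance the recovery point while any of its coherence transactions is still in flight.

```cpp
#include <cstdint>

// Hypothetical requestor-side bookkeeping: a transaction is bound to
// the checkpoint interval named in the owner's response, and the
// recovery point may only advance once nothing remains outstanding.
class RequestorState {
public:
    void transactionIssued() { ++outstanding_; }

    // The response carries the CN of the point of atomicity.
    void responseReceived(uint32_t pointOfAtomicityCn) {
        lastBoundCn_ = pointOfAtomicityCn;
        --outstanding_;
    }

    bool mayAdvanceRecoveryPoint() const { return outstanding_ == 0; }

private:
    uint32_t outstanding_ = 0;   // coherence transactions in flight
    uint32_t lastBoundCn_ = 0;   // interval of the latest completed one
};
```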
(3) Validating Checkpoints • States: the current (active) state, checkpoints waiting to be validated, and the recovery point • Validation: determining which checkpoint is the recovery point • All prior execution must be fault-free • Coordination is pipelined and performed in the background (off the critical path)
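The lifecycle can be summarized with a small enum; the terminology follows the slide, but the encoding itself is illustrative.

```cpp
// Illustrative encoding of the roles a checkpoint interval can play.
enum class CheckpointState {
    Active,             // the current, still-executing interval
    PendingValidation,  // complete, awaiting fault-detection clearance
    RecoveryPoint       // newest interval known to be fault-free
};
```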
(3) Validation Continued • Validation latency depends on fault detection latency • Output commit problem • Delay all output events until their checkpoint is validated (sketched below) • The delay depends on validation latency • Input commit problem • Log incoming messages
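A minimal sketch of an output-commit buffer under these rules (names and the payload type are assumptions): outbound events are tagged with the interval that produced them and released only once the recovery point has passed that interval.

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Illustrative output-commit buffer: a rollback can never "unsend"
// an output, because nothing leaves before its interval validates.
struct OutputEvent {
    uint32_t checkpointNumber;  // interval that produced the event
    std::string payload;        // e.g., an outgoing network packet
};

class OutputCommitBuffer {
public:
    void enqueue(OutputEvent e) { pending_.push_back(std::move(e)); }

    // Release everything produced at or before the new recovery point.
    template <typename Sink>
    void onRecoveryPointAdvanced(uint32_t rpcn, Sink&& send) {
        while (!pending_.empty() &&
               pending_.front().checkpointNumber <= rpcn) {
            send(pending_.front());
            pending_.pop_front();
        }
    }

private:
    std::deque<OutputEvent> pending_;  // oldest event at the front
};
```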
Recovery • Processors restore their register checkpoints • Caches and memories unroll local logs • State from coherence transactions in progress is discarded • Reconfiguration if necessary
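A minimal sketch of the log unroll, reusing the illustrative ClbEntry from the logging sketch above (the flat memory model and names are assumptions): entries are undone newest-first, so each block ends up holding the value it had at the recovery point.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

struct ClbEntry {                      // as sketched earlier
    uint64_t blockAddress;
    uint32_t checkpointNumber;
    std::array<uint8_t, 64> oldValue;
};

// Undo, newest-first, every update made after the recovery point.
void unrollLog(std::vector<ClbEntry>& log, uint8_t* memoryBase,
               uint32_t rpcn) {
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        if (it->checkpointNumber > rpcn) {
            std::memcpy(memoryBase + it->blockAddress,
                        it->oldValue.data(), it->oldValue.size());
        }
    }
    log.clear();  // in-flight coherence state is simply discarded
}
```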
Implementation • Checkpoint Log Buffers (CLBs) • One associated with each cache and memory component • Store logged state • Shadow registers for processor checkpoints • 2D torus interconnect • MOSI directory protocol
Logical Time Base • A loosely synchronous checkpoint clock is distributed redundantly • Ensures no single point of failure • Each clock edge increments the current checkpoint number (CCN) • Works as long as skew < minimum communication time between nodes • Assigning a transaction to a checkpoint interval is protocol-dependent
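The safety condition is simple enough to state in code; the invariant comes from the slide, but the names and structure here are assumptions.

```cpp
#include <cstdint>

// Each edge of the distributed checkpoint clock opens a new interval.
struct CheckpointClock {
    uint32_t ccn = 0;               // current checkpoint number
    void onClockEdge() { ++ccn; }
};

// Loosely synchronous time is safe when skew between any two nodes is
// smaller than the minimum node-to-node message latency, so a message
// sent in interval k cannot arrive before its receiver enters k.
bool skewIsSafe(uint64_t maxSkewCycles, uint64_t minLatencyCycles) {
    return maxSkewCycles < minLatencyCycles;
}
```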
Logging • A memory block is written to the CLB whenever an action might have to be undone • CLBs are write-only (except during recovery) and off the critical path • A CN is added to each block in the cache • Steps taken for an update-action (sketched below): • Compare the CCN with the block's CN • Log the block if CCN >= CN • Update the block's CN to CCN+1 • Perform the update action • Send the updated CN with the coherence response
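Those steps map directly onto a small function (a sketch of the slide's algorithm; field and function names are assumptions). Because logging sets the block's CN to CCN+1, the test CCN >= CN fails for the rest of the interval, which is exactly the "log only on first altering action" rule from earlier.

```cpp
#include <cstdint>

// Sketch of the slide's update-action steps for one cache block.
struct CacheBlock {
    uint32_t cn = 0;  // interval at which this block next needs logging
    // ... tag, data, and coherence state elided ...
};

// Returns the CN to send with the coherence response.
template <typename LogFn>
uint32_t onUpdateAction(CacheBlock& block, uint32_t ccn,
                        LogFn&& logOldValue) {
    if (ccn >= block.cn) {     // first altering action this interval?
        logOldValue(block);    // write the pre-update value to the CLB
        block.cn = ccn + 1;    // suppress further logging this interval
    }
    // ... perform the update action itself ...
    return block.cn;           // updated CN travels with the response
}
```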
Checkpoint Creation/Validation • Choose a suitable checkpoint clock frequency • Trade-off: detection latency tolerance vs. total CLB storage • Lost messages (detected by timeout) trigger recovery • The recovery point checkpoint number (RPCN) is broadcast whenever the recovery point advances • After a fault, a recovery message (including the RPCN) is sent • The interconnection network is drained • Processors, caches, and memories recover to the RPCN
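A hypothetical sketch of the system-level flow (names and structure are assumptions, not the paper's mechanism): validation advances the RPCN in the background, while a detected fault freezes forward progress and rolls everyone back to it.

```cpp
#include <cstdint>

// Hypothetical coordinator view of validation and recovery.
struct ValidationCoordinator {
    uint32_t rpcn = 0;  // recovery point checkpoint number

    // All nodes have reported checkpoint `cn` fault-free: advance the
    // recovery point (broadcast of the new RPCN elided here).
    void onCheckpointValidated(uint32_t cn) { rpcn = cn; }

    // A fault was detected, e.g. a coherence request timed out.
    void onFault() {
        // 1. broadcast a recovery message carrying rpcn
        // 2. drain the interconnection network
        // 3. processors, caches, and memories recover to rpcn
    }
};
```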
Implementation Summary • Processor/cache changes required • Processor must be able to checkpoint its register state • Cache must be able to copy old versions of blocks out before transferring them • CNs added to L1 cache blocks • Directory protocol changes • CNs added to data response messages • Coherence requests can be nacked • A final ack is required from the requestor to the directory
Evaluation Parameters • 16-processor target system • Simics + memory hierarchy simulator • 4 commercial workloads, 1 scientific • In-order processors, 4 billion instr/sec • MOSI directory protocol with 2D torus • Checkpoint interval = 100,000 cycles
Experiments • (1) Fault-free performance • Overhead determined to be negligible • (2) Dropped messages • Periodic transient faults injected (10/sec) • Recovery latency << crash + reboot • (3) Lost switch • Hard fault: half of a switch is killed • Crash avoided, but performance suffers due to restricted bandwidth
Sensitivity Analysis • Cache bandwidth • Depends on the frequency of stores requiring logging (additional bandwidth is consumed reading the old copy of the block) • Cache ownership transfers: no additional bandwidth • Storage cost • CLBs are sized to avoid performance degradation due to full buffers • Entries per checkpoint correspond to the logging frequency
Conclusion • SafetyNet is efficient in the common case (error-free execution): little to no latency added • Latency is hidden by pipelining the validation of checkpoints • Checkpoints are coordinated in logical time (no synchronization exchanges necessary)
Questions • What about faults/errors in the saved state itself? • What if there's a permanent fault for which you can't reconfigure (an endless loop of recovering to the last checkpoint)?