ReVive: Cost Effective Architectural Support for Rollback Recovery in Shared Memory Multiprocessors. Milos Prvulovic, Zheng Zhang, Josep Torrellas, University of Illinois at Urbana-Champaign. Outline: Introduction, ReVive Hardware, ReVive Operation, Evaluation, Conclusion.
ReVive: Cost Effective Architectural Support for Rollback Recovery in Shared Memory Multiprocessors • Milos Prvulovic, Zheng Zhang, Josep Torrellas • University of Illinois at Urbana-Champaign
Outline • Introduction • ReVive Hardware • ReVive Operation • Evaluation • Conclusion
Introduction to ReVive - Goals • Rollback recovery mechanism for shared-memory multiprocessors • General purpose • Low hardware cost • Works with unmodified CPUs and caches • Low overhead • High availability (better than five 9s)
Background - Taxonomy of Backward Error Recovery in Multiprocessors
ReVive Hardware - Overview • Figure legend: White: not modified • Grey: changed by ReVive only • Figure annotations: N errors permitted • 1 may fail • Parity bits protect log, memory, all messages
ReVive Hardware: Distributed Parity Protection in Memory • Parity bits protect the memory • And the log • Distributed parity organization • Relieves pressure on a single parity node • Parity is updated when a line changes
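The parity update and the reconstruction of a lost line can be illustrated with a minimal sketch. It assumes plain XOR parity over one group of data lines; the class name `ParityGroup`, the helper `xor_bytes`, and the 8-byte line size are hypothetical and chosen only for illustration, not taken from the ReVive design.

```python
from functools import reduce

LINE_BYTES = 8  # assumed tiny line size, for illustration only


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityGroup:
    """N data lines (conceptually one per node) protected by one parity line."""

    def __init__(self, n_data: int):
        self.data = [bytes(LINE_BYTES) for _ in range(n_data)]
        self.parity = bytes(LINE_BYTES)  # XOR of all-zero lines is zero

    def write(self, idx: int, new: bytes) -> None:
        old = self.data[idx]
        # Incremental update: P' = P xor old xor new, no need to read other lines.
        self.parity = xor_bytes(self.parity, xor_bytes(old, new))
        self.data[idx] = new

    def reconstruct(self, lost_idx: int) -> bytes:
        # A lost line equals the XOR of the parity line and all surviving lines.
        survivors = [d for i, d in enumerate(self.data) if i != lost_idx]
        return reduce(xor_bytes, survivors, self.parity)


group = ParityGroup(n_data=7)            # a 7 + 1 parity group
group.write(2, b"ABCDEFGH")
group.write(5, b"12345678")
recovered = group.reconstruct(2)         # pretend the node holding line 2 failed
```

Only one line per group can be rebuilt this way, which is why the scheme tolerates the loss of a single node per parity group.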
ReVive Logging • Logs a memory line when it is first updated after a checkpoint • CPU may warn with ReadX or Upgrade • A Logged bit per line is maintained in the directory controller • Log entries are guarded by end markers
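The log-on-first-write behaviour can be sketched as follows. This is a simplified software model, not the hardware protocol: the class `DirectoryController`, the `END_MARKER` sentinel, and the choice to discard the log at a checkpoint are assumptions made for illustration.

```python
END_MARKER = "END"  # hypothetical sentinel that marks a completely written entry


class DirectoryController:
    """Toy model: keeps a Logged bit per line and logs old values once per interval."""

    def __init__(self, memory=None):
        self.memory = dict(memory or {})  # line address -> current value
        self.logged = set()               # addresses whose Logged bit is set
        self.log = []                     # (address, old value, END_MARKER)

    def write(self, addr, value):
        if addr not in self.logged:
            # First update since the last checkpoint: save the pre-image.
            old = self.memory.get(addr, 0)
            self.log.append((addr, old, END_MARKER))
            self.logged.add(addr)
        self.memory[addr] = value

    def checkpoint(self):
        # Assumed simplification: clear all Logged bits and drop the old log.
        self.logged.clear()
        self.log.clear()


ctrl = DirectoryController({0x40: 1})
ctrl.write(0x40, 2)   # logged: pre-image 1
ctrl.write(0x40, 3)   # not logged again in the same interval
```

Logging only the first write per interval keeps the log size proportional to the number of distinct lines written, not to the number of writes.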
ReVive Logging • No logging operation is on the critical path • However, acknowledgments may be delayed • Logging causes multiple accesses to the same or consecutive lines • Current DRAMs may exploit this locality
ReVive: Global Checkpoint • Timer interrupt • Cache flush • Barrier • Write marker • Barrier (2-phase commit)
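The checkpoint sequence above can be sketched with threads standing in for nodes. This is a software analogy under stated assumptions: `Node`, `checkpoint`, and `run_global_checkpoint` are hypothetical names, the timer interrupt is replaced by an explicit call, and the cache is modelled as a dictionary of dirty lines.

```python
import threading


class Node:
    """Toy node: a dirty-line cache, a local memory, and a local log."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = {}    # dirty lines: address -> value
        self.memory = {}   # address -> value
        self.log = []


def checkpoint(node, barrier, epoch):
    # Cache flush: write all dirty lines back to memory.
    node.memory.update(node.cache)
    node.cache.clear()
    barrier.wait()                           # phase 1: every node has flushed
    node.log.append(("CHECKPOINT", epoch))   # write the checkpoint marker
    barrier.wait()                           # phase 2: every marker is written


def run_global_checkpoint(nodes, epoch):
    # Stands in for the timer interrupt that starts the checkpoint on all nodes.
    barrier = threading.Barrier(len(nodes))
    threads = [threading.Thread(target=checkpoint, args=(n, barrier, epoch))
               for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


nodes = [Node(i) for i in range(4)]
nodes[1].cache[0x80] = 7
run_global_checkpoint(nodes, epoch=1)
```

The two barriers play the role of a two-phase commit: no node writes its marker until all nodes have flushed, and no node resumes until all markers exist.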
ReVive: Rollback • Phase 0: Detect the error (80ms) • Phase 1: Identify faulty node, reroute (50ms) • Phase 2: Reconstruct Log of faulty node (100ms) • Phase 3: “Rollback” (490ms) – Mark compromised parity blocks as inaccessible • Phase 4: Reconstruct Data in parallel with normal execution (20sec) • Access to inaccessible data causes page protection fault and priority reconstruction • Performance degraded due to lost node, background reconstruction and page faults
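The core of the rollback step (Phase 3) is an undo pass over the log. A minimal sketch, assuming the log is a list of (address, old value) pairs in the order they were written; the function name `rollback` and this log layout are illustrative assumptions, not the paper's format.

```python
def rollback(memory, log):
    """Restore logged pre-images, newest entry first, then empty the log."""
    for addr, old in reversed(log):
        memory[addr] = old
    log.clear()


memory = {0: 5, 1: 9}
# Address 0 appears twice, as it would if the log spans two checkpoint intervals.
log = [(0, 1), (1, 2), (0, 3)]
rollback(memory, log)
# memory is now {0: 1, 1: 2}: the oldest pre-image of each line wins.
```

Applying entries newest-first matters when a line appears more than once in the log, because the oldest pre-image must be the last one written back.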
ReVive Evaluation: Model System, Benchmarks • Not the best selection of benchmarks or base system
ReVive Evaluation: Error-Free Overheads • Error-free operation overheads are caused by: • Logging: higher as the number of distinct writes increases • Parity maintenance: higher as the footprint increases • Logging and parity do not affect the processor directly, since they are not on the critical path • Global checkpointing
ReVive Evaluation: Error-Free Overheads • 7 + 1 parity group or 1 + 1 mirroring • Mirroring is simpler than parity but has increased storage requirements • Checkpoint every 10 ms (equivalent to 100 ms) or never
ReVive Evaluation: Memory and Network Traffic Breakdown • RD/RDX: normal data miss • Exe WB: normal writeback • Ckp WB: checkpoint writeback • LOG: log traffic • PAR: parity traffic (for data and log) • High traffic causes high overheads
ReVive Evaluation: Memory Storage Requirements • Log: 2.5 MB for their programs => 25 MB at a 100 ms checkpoint interval • For 2 GB of memory per node: • 7 + 1 parity: 14%-25% memory overhead (0.1-1 s checkpoint interval) • 1 + 1 mirror: up to 62% overhead
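One way to reproduce these percentages is to add the fraction of memory used for redundancy to the fraction used by the log. The sketch below is an interpretation, not the authors' calculation: it assumes the log grows linearly with the checkpoint interval (25 MB at 100 ms, hence 250 MB at 1 s), that 1/8 of memory holds parity in a 7 + 1 group and 1/2 holds the copy in a 1 + 1 mirror, and that 1 GB is 1024 MB. The function name `memory_overhead` is hypothetical.

```python
MB_PER_GB = 1024


def memory_overhead(redundancy_fraction, log_mb, node_memory_gb=2):
    """Fraction of node memory taken by redundancy plus the log (assumed model)."""
    return redundancy_fraction + log_mb / (node_memory_gb * MB_PER_GB)


log_100ms_mb = 25.0                 # from the slide
log_1s_mb = log_100ms_mb * 10       # assumed linear scaling with the interval

parity_100ms = memory_overhead(1 / 8, log_100ms_mb)   # about 0.137 -> 14%
parity_1s = memory_overhead(1 / 8, log_1s_mb)         # about 0.247 -> 25%
mirror_1s = memory_overhead(1 / 2, log_1s_mb)         # about 0.622 -> 62%
```

Under these assumptions the redundancy itself dominates at short checkpoint intervals, and the log becomes comparable to the parity storage only near a 1 s interval.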
ReVive Evaluation: Rollback Overheads - Breakdown • Figure shows Phase 2 and 3 overheads (must be multiplied by 10) => less than 590 ms • Detection latency, lost work, and hardware rerouting: 80 ms, 100 ms, 50 ms • 820 ms worst case • If an error occurs once per day, availability is 99.999%
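The arithmetic behind the 820 ms and 99.999% figures can be checked with a short sketch. It assumes exactly one error per day and counts the whole worst-case recovery time as unavailability; the function name `availability` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def availability(downtime_s, errors_per_day=1.0):
    """Fraction of time the machine is available, given downtime per error."""
    return 1 - downtime_s * errors_per_day / SECONDS_PER_DAY


# Detection + lost work + rerouting + Phases 2 and 3, all in milliseconds.
worst_case_ms = 80 + 100 + 50 + 590          # 820 ms
avail = availability(worst_case_ms / 1000)   # about 0.9999905
```

At one error per day, 0.82 s of unavailability out of 86,400 s leaves the system just above the five-9s mark.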
Conclusion • Simple mechanism • Fairly few hardware modifications • However, the protection offered is limited • If 2 nodes lose DRAM…