Encore: Low-Cost, Fine-Grained Transient Fault Recovery Authors: Shuguang Feng *

Encore: Low-Cost, Fine-Grained Transient Fault Recovery
Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August
University of Michigan
*Currently with Northrop Grumman, Information Systems Sector

“Failure to prepare is preparing to fail…” - Benjamin Franklin
• …many ways to fail
• Transient (“soft”) Faults
• Permanent (“hard”) Faults
• The distinction between a transient and permanent fault is becoming blurred
• Many permanent faults, particularly wearout-induced faults, initially manifest as timing errors.
[Figure: fault sources and how often they occur. Labels: Cosmic Radiation, Packaging Impurities, PVT Variation, Electromigration, Oxide Breakdown, Negative Bias Temperature Instability, [Dreslinski`10] NTC Computing [Gupta`09]; Rare, Periodic, Continuous]

The Future of Soft Errors
[Figure: timeline (Past, Present, Future) of soft error rates. Labels: One failure per MONTH per 100 chips; One failure per DAY per 100 chips; One failure per DAY per chip; Aggressive voltage scaling (near-threshold computing)]

Realizing a Reliability “Pipeline”
[Figure: reliability pipeline. Vulnerable Computation → Detection → Diagnosis → Repair → Recovery → Reliable Output]
• Detection: recent interest in low-cost fault detection
  • ReStore [DSN`05]
  • SWAT [ASPLOS`08]
  • Shoestring [ASPLOS`10]
  • Not perfect…but very low-cost
• Recovery: generally involves some form of rollback/re-execution
  • 1) Identify fault site
  • 2) Restore processor to pre-fault state, before 1)
  • 3) Resume execution from 1)
  • Many low-cost detection techniques rely on hardware speculation support
• Commodity systems present both challenges and opportunities
  • Challenge: HW speculation support (if it exists) is limited
  • Challenge: Cannot afford expensive, heavyweight SW checkpointing
  • Opportunity: Typically not running mission-critical applications
  • Sacrifice a small degree of reliability
  • Exploit (probabilistic) idempotence in program execution

The Role of Idempotence
• Mathematical Definition: an operation that can be applied multiple times without changing the result
• Computer Science Definition: a region of code without any exposed write-after-read (WAR, anti-) dependencies
• Idempotent code regions can be safely re-executed without additional checkpointing
• Idempotent example: X = … ; … = X ; … = X
• Non-idempotent example: X++ (X is read, then overwritten)
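The WAR-based definition above can be checked mechanically for straight-line code. The following is a minimal sketch, not the authors' analysis: it assumes a region is given as an ordered list of `("read" | "write", variable)` pairs (a hypothetical encoding) and ignores aliasing and control flow.

```python
def is_idempotent(ops):
    """Return True if the straight-line region has no exposed WAR dependency.

    ops: list of ("read" | "write", variable) pairs in program order.
    """
    written = set()   # variables already overwritten inside the region
    exposed = set()   # variables read before any write inside the region
    for kind, var in ops:
        if kind == "read":
            if var not in written:
                exposed.add(var)       # this read sees a value from outside the region
        elif kind == "write":
            if var in exposed:
                return False           # overwrites a value an earlier read depended on
            written.add(var)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return True


# X = ... ; ... = X ; ... = X   (the idempotent example from the slide)
idempotent_region = [("write", "X"), ("read", "X"), ("read", "X")]
# X++   (read X, then write X)
increment_region = [("read", "X"), ("write", "X")]

print(is_idempotent(idempotent_region))  # True
print(is_idempotent(increment_region))   # False
```

A read that follows a write to the same variable inside the region is not exposed, which is why a later write to that variable does not break idempotence.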

Does Idempotence Exist?
• Selectively checkpointing a *few* offending stores

Challenges to Exploiting Idempotence
• Must identify where to resume execution
  • Control flow
  • Rollback distance
• Statically identifying optimal rollback distance is inherently intractable
  • ↑ rollback dist. → ↑ Pr(recoverable)
  • ↓ rollback dist. → ↑ Pr(idempotent)
• Simplifying engineering solution based on single-entry, multiple-exit (SEME) regions
[Figure: example CFG with basic blocks bb 1 through bb 7 and a highlighted Execution Path]

Encore Vision
[Figure: Encore overview. Compile-time stages: Source Code, Code Partitioning (CFG-based), Idempotence Analysis (per region), Instrumentation (per region). Example regions are labeled Idempotent (… = X) and Non-idempotent (… = X, X++); the non-idempotent region is instrumented with Chkpt X. Runtime Behavior (post-fault): Fault Detected, Redirect Control, Restore State, Recovery]

Identifying Idempotence (High-level)
• With respect to a point, p, in the CFG…
  • Reachable Stores (RS): a store that may execute after p
  • Guarded Addresses (GA): an address that is guaranteed to be overwritten before reaching p
  • Exposed Addresses (EA): an address that may be referenced by an unguarded load prior to p
• Idempotent IFF EA ∩ RS = Ø
• Additional Details…
  • 1) Applies to both memory and registers
    • Static, conservative alias analysis
  • 2) Scalable hierarchical analysis
    • Handles cyclic code
[Figure: example CFGs with basic blocks bb 1 through bb 8]
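The EA ∩ RS = Ø test can be sketched as a forward dataflow pass over a region. This is an illustrative simplification under stated assumptions, not the hierarchical analysis the slide refers to: the region is acyclic, blocks are supplied in topological order, addresses are symbolic names with no aliasing, and the function name and example CFG are hypothetical.

```python
def region_is_idempotent(blocks, preds, order):
    """Check a region for a store that overwrites an exposed address.

    blocks: block name -> list of ("load" | "store", address) in program order
    preds:  block name -> list of predecessor block names
    order:  block names in topological order, entry block first (acyclic region)
    """
    ga_out, ea_out = {}, {}
    for name in order:
        ps = preds.get(name, [])
        # GA is a must-property: guarded only if overwritten on every incoming path.
        ga = set.intersection(*(ga_out[p] for p in ps)) if ps else set()
        # EA is a may-property: exposed if exposed on any incoming path.
        ea = set().union(*(ea_out[p] for p in ps)) if ps else set()
        for kind, addr in blocks[name]:
            if kind == "load":
                if addr not in ga:
                    ea.add(addr)       # unguarded load exposes the address
            else:
                if addr in ea:
                    return False       # a reachable store hits an exposed address
                ga.add(addr)
        ga_out[name], ea_out[name] = ga, ea
    return True


# Hypothetical diamond-shaped region: entry -> {left, right} -> exit
preds = {"left": ["entry"], "right": ["entry"], "exit": ["left", "right"]}
order = ["entry", "left", "right", "exit"]
blocks = {
    "entry": [("store", "A")],
    "left":  [("load", "A")],          # guarded by the store in entry
    "right": [("load", "B")],          # unguarded: B becomes exposed
    "exit":  [("store", "B")],         # overwrites exposed B
}
print(region_is_idempotent(blocks, preds, order))   # False

blocks["exit"] = [("store", "C")]
print(region_is_idempotent(blocks, preds, order))   # True
```

Checking each store against the exposed set at that point is equivalent to testing EA ∩ RS at every point p on an acyclic region; handling loops would require iterating to a fixed point, which this sketch omits.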

Code Instrumentation
• Encore Heuristics
  • 1) Selectively prune dynamically-dead code
    • ↓ offending stores → ↑ Pr(idempotent)
  • 2) Selectively fuse adjacent regions
    • ↑ region size → ↑ Pr(recoverable)
  • 3) Selectively instrument profitable regions
• Live-in Checkpointing: Save R1, Save R2, …, Save Rn
• “On-demand” Checkpointing: MemCopy B, Save Address[B]
• Recovery Code (Upon Fault Detection): *Restore B, Restore R1, Restore R2, …, Restore Rn
[Figure: instrumented example CFG with basic blocks bb r and bb 0 through bb 8, containing 1: Store A; 2: Store B; 3: Store C; 4: Load A; 5: Store C; 6: Load B; 7: Load B; 8: Load C; 9: Store A; 10: Store B; 11: Load C; 12: Store C]
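The interplay of on-demand checkpointing and recovery code can be illustrated with a toy model. This is a sketch under assumptions, not Encore's generated code: memory is a Python dict, the class and function names are hypothetical, and the fault is simulated by flipping one bit of a loaded value.

```python
class CheckpointedRegion:
    """Toy model of one instrumented region (all names are hypothetical)."""

    def __init__(self, memory):
        self.memory = memory
        self.log = []                      # checkpoint stack of (address, old data)

    def enter(self):
        self.log.clear()                   # each region entry starts with an empty log

    def checkpointed_store(self, addr, value):
        # "On-demand" checkpoint: save the address and copy the old data first.
        self.log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def recover(self):
        # Recovery code: undo checkpointed stores in reverse order.
        while self.log:
            addr, old = self.log.pop()
            self.memory[addr] = old


def increment_region(region, corrupt=False):
    """Non-idempotent region body: X++ (load X, then store X)."""
    region.enter()
    value = region.memory["X"]
    if corrupt:
        value ^= 0x10                      # simulated transient bit flip
    region.checkpointed_store("X", value + 1)


memory = {"X": 5}
region = CheckpointedRegion(memory)

increment_region(region, corrupt=True)     # faulty execution
faulty_value = memory["X"]
region.recover()                           # fault detected: restore state
restored_value = memory["X"]
increment_region(region)                   # redirect control: re-execute the region

print(faulty_value, restored_value, memory["X"])   # 22 5 6
```

Without the checkpoint, re-executing X++ after the faulty run would start from the corrupted value rather than from 5, which is exactly the exposed WAR problem.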

Lightweight Checkpointing
• Checkpoint cost labels: 1 reg2mem store, 1 mem2mem copy, 1 stack ptr increment; 1 reg2mem store
• Stack grows dynamically to accommodate checkpoint storage
[Figure: STACK layout. Traditional Call Stack: Local Variables, Return Address, Input Parameters, with Stack Pointer and Frame Pointer markers. Encore Extensions: Live-in Registers, followed by checkpoint entries (addr_0, data_0), (addr_1, data_1), …, (addr_N, data_N)]

Evaluation Methodology
• Program analysis/instrumentation performed in the LLVM compiler
• In-order, single-issue, embedded-class processor
• Dynamic instruction model based on profiled execution
• Reliability coverage
• Analytical model in lieu of traditional fault injection
• Decouples evaluation from microarchitectural details

Inherent Idempotence
[Figure: per-application breakdown chart. Legend: 0% (dynamically-dead), <5%, <10%]
• 76% of application code is naturally idempotent

Dynamic Execution Breakdown
• Impact of detection latency
  • If control has left the region containing the original fault site, re-execution cannot correct the error
• 91% of execution time is spent within recoverable regions
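The effect of detection latency can be made concrete with a toy calculation. This is not the analytical model used in the evaluation; it is a hypothetical simplification that assumes a fault strikes uniformly at random within a region, is detected after a fixed number of instructions, and is recoverable only if detection happens before control leaves the region. The region sizes and time shares below are made-up inputs.

```python
def recoverable_fraction(region_len, latency):
    """Fraction of fault sites in a region that are detected before the region exits."""
    return max(0, region_len - latency) / region_len


def coverage(regions, latency):
    """Estimate coverage for a program.

    regions: list of (dynamic_length, share_of_execution_time, is_recoverable)
    latency: detection latency in dynamic instructions
    """
    return sum(
        share * recoverable_fraction(length, latency)
        for length, share, is_recoverable in regions
        if is_recoverable
    )


# Hypothetical program: two recoverable regions and one unrecoverable region.
regions = [(1000, 0.6, True), (200, 0.3, True), (50, 0.1, False)]

for latency in (10, 100, 1000):
    print(f"latency {latency:>4} instrs: coverage {coverage(regions, latency):.3f}")
```

Under these assumptions, coverage falls as latency approaches the region length, which mirrors the rollback-distance trade-off: larger regions tolerate slower detectors but are less likely to be idempotent.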

Full System “Coverage”
[Figure: coverage per application for three detection latencies. Legend: Existing (~100 instrs), Future (~10 instrs), Future (~1000 instrs)]
• 93% to 99.99% coverage, highly application dependent

Overheads
• 3% to 22% performance degradation

Summary
• Large portions of applications, across domains, are (probabilistically) idempotent
• Encore is a software-only solution that exploits this property to provide low-cost fault recovery
• 97% of faults on average are recoverable with current detection schemes
  • At a 15% performance penalty
• Implementing Encore in a runtime system / virtual machine has the potential to yield even better results
  • Larger dynamic traces vs. static intervals
  • Dynamic vs. static memory analysis

Questions?
http://cccp.eecs.umich.edu
