Olay: Combat the Signs of Aging with Introspective Reliability Management
Authors: Shuguang Feng, Shantanu Gupta, Scott Mahlke
W-QUAD (ISCA-35), June 21, 2008
Motivation
• “Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)
• Failures will be wearout induced
• More failures to come
[Srinivasan, DSN'04] [Borkar, MICRO'05]
Approaches to Reliability
Tolerate faults (reactive)
• Detect
• Diagnose
• Repair/reconfigure/recover
or prevent faults (proactive)
• High-K dielectrics
• Passivation
• Dynamic thermal mgmt (DTM)
• Introspective reliability mgmt (IRM)
• Margining
• Robust cell topologies
[Figure labels: Architecture-level, Circuit-level; Diva, WDU, Reliability Banking, Heat-and-Run, Argus, RAMP]
Targeted management based on wearout monitoring
Not All Cores Are Created Equal
• Chip-multiprocessors will be subject to severe process variation
• Dynamic thermal/power budgeting can be suboptimal
• Temperature is only part of the picture
• Need low-level reliability awareness
• Low-level sensors measure physical changes
• Wearout-aware management enhances reliability
  • System reconfiguration
  • Dynamic voltage and frequency scaling (DVFS)
  • Job assignment
Introspective Reliability Management (IRM)
• Low-level sensors
  • delay
  • leakage
  • temperature
  • etc.
• WDU [MICRO'07]
  • measure propagation delay
  • track statistical trends
• Olay
  • track the progression of wearout
  • profile workload behavior
  • generate wearout-aware job schedules
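To make "measure propagation delay, track statistical trends" concrete, here is a minimal sketch of one way a trend could be tracked. The exponential moving average, the class name `DelayTrend`, and the `smoothing` parameter are all assumptions chosen for illustration; the slides do not say which statistic the WDU actually uses.

```python
from dataclasses import dataclass


@dataclass
class DelayTrend:
    """Smoothed propagation-delay estimate for one module (hypothetical helper)."""

    baseline_ps: float        # delay measured when the part was new
    smoothing: float = 0.1    # weight of each new sample (assumed value)
    smoothed_ps: float = 0.0  # overwritten with the baseline in __post_init__

    def __post_init__(self) -> None:
        self.smoothed_ps = self.baseline_ps

    def update(self, sample_ps: float) -> float:
        """Fold one sensor sample into the moving average and return it."""
        self.smoothed_ps += self.smoothing * (sample_ps - self.smoothed_ps)
        return self.smoothed_ps

    def slowdown(self) -> float:
        """Fractional increase of the smoothed delay over the baseline."""
        return self.smoothed_ps / self.baseline_ps - 1.0
```

A scheduler could treat `slowdown()` as a per-module wearout indicator; smoothing keeps a single noisy sensor reading from changing a core's ranking.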
Wearout-aware Scheduling
[Figure: per-module reliability profiles of the available cores are combined with the per-module activity (given as percentages) of the active jobs T0, T1, T2, T3, …, Tn to produce a job schedule]
Wearout-aware Policies
• GreedyE
  • Optimizes for early life performance
  • Minimizes premature failures with wear-leveling
[Figure: jobs (T0 … Tn) ordered from light to heavy and cores (C0 … Cn) ordered from weak to strong are paired into a schedule]
Wearout-aware Policies (continued)
• GreedyL
  • Optimizes for end of life performance
  • Victimizes weak cores to maximize the life of stronger cores
• GreedyA
  • Hybrid of GreedyE and GreedyL
  • Adapts behavior based on system utilization
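The three policies can be sketched as simple sort-and-pair heuristics. This is an illustration under stated assumptions, not the authors' implementation: the function names, the scalar `core_health` and `job_load` scores, the 0.5 utilization threshold, and the direction in which GreedyA switches are all hypothetical.

```python
from typing import Dict


def greedy_e(core_health: Dict[str, float], job_load: Dict[str, float]) -> Dict[str, str]:
    """Wear-leveling sketch: the heaviest job goes to the strongest core."""
    cores = sorted(core_health, key=lambda c: core_health[c], reverse=True)  # strong first
    jobs = sorted(job_load, key=lambda j: job_load[j], reverse=True)         # heavy first
    # zip stops at the shorter list, so this assumes no more jobs than cores.
    return dict(zip(jobs, cores))


def greedy_l(core_health: Dict[str, float], job_load: Dict[str, float]) -> Dict[str, str]:
    """Victimizing sketch: the heaviest job goes to the weakest core."""
    cores = sorted(core_health, key=lambda c: core_health[c])                # weak first
    jobs = sorted(job_load, key=lambda j: job_load[j], reverse=True)         # heavy first
    return dict(zip(jobs, cores))


def greedy_a(core_health: Dict[str, float], job_load: Dict[str, float],
             threshold: float = 0.5) -> Dict[str, str]:
    """Hybrid sketch: choose a policy from the fraction of cores in use.

    ASSUMPTION: the threshold and the choice of GreedyE at high utilization
    are placeholders; the slides only say the policy adapts to utilization.
    """
    utilization = len(job_load) / len(core_health)
    policy = greedy_e if utilization >= threshold else greedy_l
    return policy(core_health, job_load)
```

With cores `{"C0": 0.9, "C1": 0.4, "C2": 0.7}` (higher means healthier) and jobs `{"T0": 0.8, "T1": 0.2}`, `greedy_e` places the heavy job T0 on the strongest core C0, while `greedy_l` places it on the weakest core C1 and leaves C0 idle.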
Lifetime Reliability Simulation (FACE)
Offline Characterization
• SPEC2000 (INT & FP)
• Synthetic Benchmarks
  • representative of SPEC2000 suite
  • reduces online profiling complexity
[Figure labels: Temperature Trace, Execution Trace, Power Trace]
Lifetime Reliability Simulation (FACE)
[Figure labels: Offline Characterization, Online Simulation]
• Reliability Management
  • monitors CMP health
  • wearout-aware scheduling
  • profiling
  • intelligent heuristics
• Parameter Specification
  • Device lifetimes
  • Utilization pattern
• Simulate CMP Aging
  • tracks progression of wearout mechanisms
  • hierarchical design
• Workload Generation
  • emulates OS scheduler
  • temperature traces
  • power traces
Wearout Modeling
• Mean time to failure (MTTF)
  • defines distribution of device lifetimes
• Damage accumulation
  • where α is the degradation rate
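The slide's equations are not preserved in this transcript, so the following is only a sketch of one common way to model damage accumulation. It assumes linear accumulation: over each interval a device gains damage α·Δt, with α = 1/MTTF for that interval's operating conditions, and the device fails once the accumulated damage reaches 1. The function names and the deterministic use of MTTF are illustrative assumptions, not the model from the talk.

```python
from typing import Iterable, Optional, Tuple


def degradation_rate(mttf_hours: float) -> float:
    """ASSUMPTION: alpha is the reciprocal of the MTTF under current conditions."""
    return 1.0 / mttf_hours


def time_to_failure(intervals: Iterable[Tuple[float, float]]) -> Optional[float]:
    """Accumulate damage over (duration_hours, mttf_hours) intervals.

    Returns the elapsed time at which the damage reaches 1.0, or None if the
    device survives every interval.
    """
    damage = 0.0
    elapsed = 0.0
    for duration, mttf in intervals:
        alpha = degradation_rate(mttf)
        if damage + alpha * duration >= 1.0:
            # Failure falls inside this interval; interpolate the exact time.
            return elapsed + (1.0 - damage) / alpha
        damage += alpha * duration
        elapsed += duration
    return None
```

For example, 500 hours at conditions with an MTTF of 1000 hours consume half of the budget; if the device then runs at conditions with an MTTF of 500 hours, the remaining half is consumed after another 250 hours.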
CMP Reliability Simulation
• CMPs:
  • variable number of cores
  • model systematic variation
• Cores:
  • Alpha 21264-type processor
• Modules:
  • experience load-dependent stress
  • smallest granularity of temperature modeling
• Transistors:
  • multiple mechanisms evolve independently
Evaluation
• Policies
  • Random (baseline), GreedyE, GreedyL, GreedyA
• Figures of merit
  • Failure distribution
  • Useful work performed prior to system failure
• Varied system parameters
  • CMP size
  • System utilization
  • Sensor error
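The "useful work performed prior to system failure" figure of merit can be illustrated with a toy epoch-based simulation. Everything here is a simplifying assumption made for the sketch: each core has a scalar wear budget, a job consumes budget equal to its load, a core is retired at the end of the epoch in which its budget runs out, and the system fails once fewer cores than jobs remain. It is not the FACE simulator.

```python
import random
from typing import Callable, Dict, List

# A policy maps (remaining budget per live core, job loads) to one core id per job.
Policy = Callable[[Dict[int, float], List[float]], List[int]]


def wear_leveling(remaining: Dict[int, float], loads: List[float]) -> List[int]:
    """Give the heaviest job to the core with the most wear budget left."""
    cores = sorted(remaining, key=lambda c: remaining[c], reverse=True)
    heavy_first = sorted(range(len(loads)), key=lambda j: loads[j], reverse=True)
    assignment = [0] * len(loads)
    for core, job in zip(cores, heavy_first):
        assignment[job] = core
    return assignment


def make_random_policy(seed: int) -> Policy:
    """Wearout-unaware baseline: pick distinct live cores at random."""
    rng = random.Random(seed)

    def policy(remaining: Dict[int, float], loads: List[float]) -> List[int]:
        return rng.sample(sorted(remaining), len(loads))

    return policy


def useful_work(budgets: List[float], loads: List[float], policy: Policy) -> float:
    """Total load executed before fewer live cores than jobs remain."""
    if not loads or min(loads) <= 0:
        raise ValueError("need at least one job, all with positive load")
    remaining = dict(enumerate(budgets))
    work = 0.0
    while len(remaining) >= len(loads):
        for core, load in zip(policy(remaining, loads), loads):
            remaining[core] -= load
            work += load
        # Retire cores whose budget was exhausted during this epoch.
        remaining = {c: b for c, b in remaining.items() if b > 0}
    return work
```

With budgets `[4.0, 2.0]` and loads `[2.0, 1.0]`, `useful_work(..., wear_leveling)` returns 6.0, because both cores wear out together after two epochs. A random assignment that puts the heavy job on the weaker core in the first epoch stops at 3.0.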
Failure Distribution w/ 16 cores
[Figure]
Sensitivity to System Utilization w/ 16 cores
[Figure]
Sensitivity to CMP Size w/ 100% utilization & GreedyE
[Figure]
Sensitivity to Sensor Error w/ 16 cores, 100% utilization, & GreedyE
[Figure]
Conclusions
• Heterogeneity exists in both CMPs and their workloads
• Wearout-aware job assignments effectively exploit this heterogeneity
• Real-time health monitoring (low-level sensors)
• CMPs augmented with Olay perform up to 20% more useful work
• Proper high-level analysis and profiling are essential for enhancing lifetime reliability
Questions?