Hardware Faults on CMPs
15-740 In-Class Discussion, 10/15/2009
Lavanya Subramanian, B. Aditya Prakash, Sven Stork
What faults do we tolerate?
• Manufacturing Faults
  • Hopefully caught during IC testing
• Soft Errors
  • Caused by particle hits
  • One-time event
• Intermittent Errors
  • First indicators of worse to come
• Hard Errors
  • Render a piece of logic unusable
Where to handle/abstract?
• Levels of fault tolerance/handling
  • Hardware
  • Software
  • Firmware
• Wells et al. [2008] take the firmware route
  • Virtual CPU interface presented to software
  • A VCPU maps to one or more CPUs (see the sketch below)
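A minimal C sketch of the VCPU indirection idea, under stated assumptions: the table layout and names (`vcpu_to_pcpu`, `remap_vcpu`) are hypothetical, not Wells et al.'s actual firmware. The point it illustrates is that software only ever sees VCPU IDs, so firmware can transparently move a VCPU off a core that shows intermittent faults.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_VCPUS 4
#define NUM_PCPUS 6   /* more physical cores than VCPUs exposed to software */

static int  vcpu_to_pcpu[NUM_VCPUS] = {0, 1, 2, 3};
static bool pcpu_faulty[NUM_PCPUS];
static bool pcpu_in_use[NUM_PCPUS] = {true, true, true, true, false, false};

/* Firmware handler: a physical core reported intermittent faults, so
 * migrate its VCPU to a healthy, unused core. Software never notices. */
static bool remap_vcpu(int vcpu)
{
    int old = vcpu_to_pcpu[vcpu];
    pcpu_faulty[old] = true;
    pcpu_in_use[old] = false;
    for (int p = 0; p < NUM_PCPUS; p++) {
        if (!pcpu_faulty[p] && !pcpu_in_use[p]) {
            vcpu_to_pcpu[vcpu] = p;
            pcpu_in_use[p] = true;
            return true;
        }
    }
    return false;   /* no healthy core left: fall back to pausing the VCPU */
}

int main(void)
{
    if (remap_vcpu(2))
        printf("vcpu 2 now runs on pcpu %d\n", vcpu_to_pcpu[2]);
    return 0;
}
```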
Intermittent Fault Tolerance
• Conventional Techniques
  • Pause Execution
  • Spare Cores
  • Reconfiguration
Any potential issues with these?
• Pause Execution
  • Unfair to the paused thread
• Reconfiguration
  • Throws the problem to the software; takes a long time
• Spare Cores
  • Too risk-averse; big area overhead
Paper's Approach
• Thumbs Up
  • Not visible to software
  • Low area overhead
  • Fairness can be enforced
• Thumbs Down
  • Works effectively only for intermittent faults
Mixed-Mode Reliability
• Dual-Modular Redundancy (DMR)
  • Run the same thread on 2 cores and compare their outputs (sketch below)
• DMR can sometimes be a liability!
  • e.g., media software, streaming: error-tolerant but throughput-hungry
• Hence mixed-mode is necessary
• Idea is simple: use DMR for reliability-critical applications, and turn off DMR for performance-critical applications
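A toy C sketch of the DMR check, assuming a simplification the hardware doesn't make: real DMR runs the two copies on two cores in lockstep, while here the "two cores" are just two calls to the same function. `dmr_execute` is an illustrative name, not an API from the paper.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t (*work_fn)(uint64_t);

/* Run fn redundantly and commit the result only if the copies agree. */
static int dmr_execute(work_fn fn, uint64_t in, uint64_t *out)
{
    uint64_t a = fn(in);   /* would execute on core A */
    uint64_t b = fn(in);   /* would execute on core B */
    if (a != b)
        return -1;         /* mismatch: fault detected, recover or retry */
    *out = a;
    return 0;
}

static uint64_t square(uint64_t x) { return x * x; }

int main(void)
{
    uint64_t r;
    if (dmr_execute(square, 7, &r) == 0)
        printf("checked result: %llu\n", (unsigned long long)r);
    return 0;
}
```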
Challenges
• Isolate the memory and register state of reliable applications
  • System-software protection
  • Even while scheduling high-performance apps alongside
• Low overhead for transitions from performance to reliable mode
  • Transitions themselves have to be reliable!
Solutions
• Protecting Memory
  • PAT: Protection Assistance Table
    • An extra bit per page indicating whether it may be accessed in performance mode (see the lookup sketch below)
  • PAB: a cache of PAT entries
  • Verify writes after the TLB lookup but before they reach the rest of the memory hierarchy
• Scheduling and Transitions
  • MMM-IPC
    • Idle the redundant cores: saves on synchronization delays
    • Protecting registers during transitions is easy
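A minimal sketch of the PAT lookup, assuming one bit per page; the names (`pat_bits`, `pat_check_write`) and the toy 1024-page address space are illustrative, not the paper's design.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PAGES 1024   /* toy address space: 1024 pages of 4 KiB */

/* One PAT bit per page: 1 = accessible in performance mode. */
static uint8_t pat_bits[NUM_PAGES];

/* Consulted after the TLB lookup, before the write proceeds further. */
static bool pat_check_write(uint64_t vaddr, bool perf_mode)
{
    uint64_t page = (vaddr >> 12) % NUM_PAGES;
    if (perf_mode && pat_bits[page] == 0)
        return false;            /* block: reliable-mode-only page */
    return true;
}

int main(void)
{
    pat_bits[1] = 1;             /* page 1 opened to performance mode */
    printf("%d %d\n",
           pat_check_write(0x1000, true),    /* page 1, perf mode: allowed */
           pat_check_write(0x2000, true));   /* page 2, perf mode: blocked */
    return 0;
}
```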
Solutions…
• MMM-TP
  • Use all available cores to run additional threads
  • Protecting registers is complicated
    • Maintain dirty bits in the mute core's cache (sketch below)
  • For scheduling, use the previously demonstrated virtualization technique
  • Low overhead of transition
  • High throughput gain
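A very rough sketch of the dirty-bit bookkeeping; in the paper these bits live in the mute core's cache in hardware, so everything here (granularity, names, the flush walk) is an assumption for illustration only. The idea it captures: mark a line dirty on write, then at a mode transition only the dirty lines need attention.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

#define CACHE_LINES 256
#define LINE_SHIFT  6          /* 64-byte lines */

static bool dirty[CACHE_LINES];

static void on_write(uint64_t addr)
{
    dirty[(addr >> LINE_SHIFT) % CACHE_LINES] = true;
}

/* At a performance->reliable transition, walk only the dirty lines. */
static unsigned lines_to_flush(void)
{
    unsigned n = 0;
    for (unsigned i = 0; i < CACHE_LINES; i++)
        if (dirty[i]) { n++; dirty[i] = false; }
    return n;
}

int main(void)
{
    on_write(0x40);            /* dirties line 1 */
    on_write(0x44);            /* same line, still only one flush */
    printf("%u lines to flush\n", lines_to_flush());   /* prints 1 */
    return 0;
}
```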
Handling Hard Faults (Powell et al. '09)
• Well understood for regular structures (e.g., caches)
• What if a core is defective?
  • Disable the core -> costs performance
  • Spare core -> costs die area
• Is there something between 0 and 1?
  • Core salvaging -> disable only the defective parts of the core
Microarchitectural Redundancy
• How much redundancy already exists inside a core to tolerate a fault?
• Look at instruction types and structure replication
Core Salvaging
• "Sometimes there is something between 0 and 1"
• The CPU only needs to support critical operations natively
  • e.g., load, store, …
• Other operations can be outsourced (sketch below)
  • Either to another core
  • Or swap threads if that core is busy
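A minimal sketch of the core-salvaging idea, with loud caveats: the real mechanism is microarchitectural, the "remote core" here is just a function call, and the local-emulation fallback stands in for the slide's actual fallback of swapping the thread to a capable core. `exec_div` and friends are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool divider_broken = true;   /* hard fault in this core's divider */

/* Stand-in for shipping the op to a healthy neighbor core. */
static uint64_t remote_div(uint64_t a, uint64_t b) { return a / b; }

/* Placeholder so the sketch runs when the neighbor is busy; the slide's
 * real fallback migrates the thread rather than emulating locally. */
static uint64_t emulated_div(uint64_t a, uint64_t b)
{
    uint64_t q = 0;
    while (a >= b) { a -= b; q++; }
    return q;
}

static uint64_t exec_div(uint64_t a, uint64_t b, bool neighbor_busy)
{
    if (!divider_broken)
        return a / b;                          /* normal fast path */
    return neighbor_busy ? emulated_div(a, b)  /* fallback path */
                         : remote_div(a, b);   /* outsource the op */
}

int main(void)
{
    printf("%llu\n", (unsigned long long)exec_div(100, 7, false));  /* 14 */
    return 0;
}
```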
Is Salvaging Good?
• Pro
  • Uses mostly existing CPU features and requires only minor changes/extensions
• Contra
  • Paper focuses on current and near-future processors
    • With 100s of cores, do a few fewer cores really matter? (the authors even have a table showing this)
  • Paper analyzes only a single defective core
  • What's the worth of a 7.134-core die?
Discussion/Research Questions
• Is prevention better than cure?
  • Wear-out/reliability awareness
• Are all cores homogeneous?
  • Remember Tuesday's paper on process variation!
• Where to handle the errors?
  • Hardware, software, firmware? Flexibility vs. performance?
• What about scalability?
  • How to manage 100s of partly defective cores?