Hardware Faults on CMPs: 15-740 In-Class Discussion, 10/15/2009. Lavanya Subramanian, B. Aditya Prakash, Sven Stork
What faults do we tolerate? • Manufacturing Faults • Hopefully caught during IC testing • Soft Errors • Caused by particle hits • One-time event • Intermittent Errors • First indicators of worse to come • Hard Errors • Render a piece of logic unusable
Where to handle/abstract? • Levels of fault tolerance/handling • Hardware • Software • Firmware • Wells et al. [2008] take the firmware route • Virtual CPU interface presented to software • A VCPU maps to one or more CPUs
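A minimal sketch of the VCPU idea, not the mechanism from Wells et al.: software only ever sees VCPU ids, while a hypothetical firmware layer decides which healthy physical core runs each VCPU in a time slice. The class name `FirmwareVcpuMap`, its methods, and the round-robin rotation are assumptions made for illustration.

```python
from collections import deque


class FirmwareVcpuMap:
    """Hypothetical firmware layer: software sees VCPUs, firmware picks cores."""

    def __init__(self, num_vcpus, num_cores):
        self.vcpus = deque(range(num_vcpus))   # rotation order of VCPUs
        self.healthy = set(range(num_cores))   # cores currently usable

    def mark_faulty(self, core):
        # Take a core out of service (e.g., during a burst of intermittent faults).
        self.healthy.discard(core)

    def mark_recovered(self, core):
        # Put the core back in service once the faults subside.
        self.healthy.add(core)

    def next_slice(self):
        """Return {vcpu: core} for the next time slice.

        When there are fewer healthy cores than VCPUs, the VCPU order is
        rotated (assumed policy) so that the lost capacity is shared.
        """
        cores = sorted(self.healthy)
        running = list(self.vcpus)[:len(cores)]
        mapping = dict(zip(running, cores))
        self.vcpus.rotate(-len(cores))
        return mapping
```

With 4 VCPUs and one of 4 cores marked faulty, each VCPU runs in 3 of every 4 slices under this policy, so no single VCPU pays the whole cost of the lost core.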
Intermittent Fault Tolerance • Conventional Techniques • Pause Execution • Spare Cores • Reconfiguration
Any potential issues with these? • Pause Execution • Unfair • Reconfiguration • Throw it to the software! Takes a long time • Spare Core • Too risk-averse, big overhead
Paper’s Approach • Thumbs Up • Not visible to software • Low area overhead • Fairness can be enforced • Thumbs Down • Works effectively only for intermittent faults
Mixed-Mode Reliability • Dual-Modular Redundancy • Run same thread on 2 cores • DMR can sometimes be a liability! • e.g., media software, streaming, etc. • Hence mixed mode is necessary • Idea is simple: use DMR for reliable applications, and turn off DMR for performance applications.
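The mixed-mode idea can be sketched in a few lines, under loud assumptions: "cores" are modeled as plain Python callables, and `execute`, `healthy_core`, `flaky_core`, and `DmrMismatch` are hypothetical names, not anything from the paper. In reliable mode the task runs on two cores and the results are compared; in performance mode it runs once and a fault goes undetected.

```python
class DmrMismatch(Exception):
    """Raised when the two redundant executions disagree."""


def healthy_core(task, arg):
    # Models a fault-free core: just runs the task.
    return task(arg)


def flaky_core(task, arg):
    # Models a soft error by flipping the low bit of an integer result.
    return task(arg) ^ 1


def execute(task, arg, reliable, core_a, core_b):
    """Run `task(arg)` with DMR if `reliable`, otherwise on core_a only."""
    result_a = core_a(task, arg)
    if not reliable:
        # Performance mode: no redundancy, core_b stays free for other work.
        return result_a
    result_b = core_b(task, arg)
    if result_a != result_b:
        raise DmrMismatch(f"cores disagree: {result_a} vs {result_b}")
    return result_a
```

The second core costs throughput in reliable mode, which is why an application that can live with an occasional wrong value may prefer performance mode.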
Challenges • Isolate memory and register state of reliable applications • System-software protection • In the face of scheduling high-performance apps • Low overhead of transitions from perf. to reliable mode • Transitions themselves have to be reliable!
Solutions • Protecting Memory • PAT: Protection Assistance Table • Extra bit for whether page can be accessed in perf. mode • PAB: cache of PAT entries • Verify writes after TLB but before accessing rest • Scheduling and Transitions • MMM-IPC • Idle redundant cores – saves on sync. delays • Protecting registers during transitions is easy
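The PAT/PAB check above can be sketched as follows. This is an illustration under assumptions, not the paper's hardware: the 4 KiB page size, the PAB capacity, the LRU replacement, the deny-by-default for pages missing from the PAT, and the rule that reliable-mode writes skip the check are all choices made here, and `ProtectionChecker` is a hypothetical name.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


class ProtectionChecker:
    """Sketch of a PAT lookup fronted by a small PAB cache."""

    def __init__(self, pat, pab_entries=4):
        self.pat = pat                  # physical page number -> perf-mode write allowed?
        self.pab = OrderedDict()        # small cache of PAT entries (assumed LRU)
        self.pab_entries = pab_entries
        self.pab_misses = 0

    def _lookup(self, page):
        if page in self.pab:
            self.pab.move_to_end(page)  # refresh LRU position on a hit
            return self.pab[page]
        self.pab_misses += 1
        allowed = self.pat.get(page, False)   # assumed: unknown pages are protected
        self.pab[page] = allowed
        if len(self.pab) > self.pab_entries:
            self.pab.popitem(last=False)      # evict least recently used entry
        return allowed

    def write_allowed(self, phys_addr, reliable_mode):
        # The check uses the physical address, i.e. it runs after translation.
        if reliable_mode:
            return True                       # assumed: only perf-mode writes are checked
        return self._lookup(phys_addr // PAGE_SIZE)
```

A write from a performance-mode thread to a page whose bit is clear would be rejected before it reaches memory, so a fault in an unprotected thread cannot corrupt a reliable application's pages.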
Solutions… • MMM-TP • Use all available cores to run additional threads • Protecting registers is complicated • Maintain dirty bits in the mute core’s cache • For scheduling, use the previously demonstrated virtualization technique • Low overhead of transition • High throughput gain
Handling hard faults (Powell et al. 09) • Well understood for regular structures (e.g., cache) • What if one core is defective? • Disable core -> costs performance • Spare core -> costs die area • Is there something between 0 and 1? • Core salvaging -> disable defective parts of core
Microarchitectural Redundancy • Or: how much redundancy is there inside a core to tolerate a fault? • Look at instructions and structure replication:
Core Salvaging • “Sometimes there is something between 0 and 1” • CPU needs to support only critical operations • E.g., load, store, … • Other operations can be outsourced • Either to another core • Or swap thread if busy
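The outsourcing decision on this slide can be sketched as a small placement function. Everything named here is hypothetical (`place_instruction`, the unit names, the action strings), and picking the lowest-numbered capable core is an assumed tie-break, not the policy from Powell et al.

```python
def place_instruction(unit, current_core, working_units, busy):
    """Decide where an instruction that needs `unit` should run.

    working_units: {core_id: set of functional units that still work}
    busy:          set of core ids currently running another thread
    Returns (core_id, action) with action in {"execute", "migrate", "swap"}.
    """
    if unit in working_units[current_core]:
        return current_core, "execute"        # salvaged core can handle it

    # Cores other than the current one whose copy of the unit still works.
    capable = [c for c in sorted(working_units)
               if c != current_core and unit in working_units[c]]
    if not capable:
        raise RuntimeError(f"no core can execute instructions needing {unit!r}")

    for core in capable:
        if core not in busy:
            return core, "migrate"            # outsource to an idle core
    return capable[0], "swap"                 # all capable cores busy: swap threads
```

For example, with core 0 missing its (hypothetical) `fp_mul` unit, an instruction needing `fp_mul` migrates to core 1 if it is idle and swaps threads with it otherwise.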
Is Salvaging good? • Pro • Uses mainly existing CPU features and requires only minor changes/extensions. • Contra • Paper focuses on short- to near-future processors • If you have 100s of cores, do a few fewer cores really matter? (the authors even have a table showing this) • Paper analyses only 1 defective core • What’s the worth of a 7.134 core die?
Discussion/Research Questions • Prevention better than cure? • Wear-out/reliability awareness • All cores are homogeneous? • Remember the paper on Tuesday: process variation • Where to handle the errors? • Hardware, software, firmware? Flexibility vs. perf.? • What about scalability? • How to manage 100s of partly defective cores?