
Hardware Faults on CMPs 15-740 In-Class Discussion 10/15/2009



Presentation Transcript


  1. Hardware Faults on CMPs • 15-740 In-Class Discussion • 10/15/2009 • Lavanya Subramanian, B. Aditya Prakash, Sven Stork

  2. What faults do we tolerate? • Manufacturing Faults • Hopefully caught during IC testing • Soft Errors • Caused by particle hits • One-time event • Intermittent Errors • First indicators of worse to come • Hard Errors • Render a piece of logic unusable

  3. Where to handle/abstract? • Levels of fault tolerance/handling • Hardware • Software • Firmware • Wells et al. [2008] take the firmware route • Virtual CPU (VCPU) interface presented to software • A VCPU maps to one or more CPUs
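The VCPU indirection on this slide can be illustrated with a small sketch. This is only an assumption-laden model, not the mechanism from Wells et al.: the class `VcpuMap` and its methods are hypothetical names, and each VCPU is mapped to a single physical core for simplicity (the slide notes a VCPU may map to more than one).

```python
class VcpuMap:
    """Hypothetical firmware table mapping virtual CPUs to physical cores."""

    def __init__(self, num_vcpus, physical_cores):
        self.healthy = set(physical_cores)
        self.mapping = {}  # vcpu id -> physical core id, or None if paused
        free = sorted(self.healthy)
        for vcpu in range(num_vcpus):
            self.mapping[vcpu] = free[vcpu] if vcpu < len(free) else None

    def mark_faulty(self, core):
        """Take a core offline and remap its VCPU to a free healthy core."""
        self.healthy.discard(core)
        in_use = {c for c in self.mapping.values() if c is not None}
        free = sorted(self.healthy - in_use)
        for vcpu, assigned in self.mapping.items():
            if assigned == core:
                # Software keeps seeing the same VCPU id; only the
                # physical core behind it changes (or it is paused).
                self.mapping[vcpu] = free.pop(0) if free else None


vmap = VcpuMap(num_vcpus=2, physical_cores=[0, 1, 2])
vmap.mark_faulty(1)  # VCPU 1 moves from core 1 to the unused core 2
```

Because software only names VCPUs, the remapping stays invisible above the firmware layer, which is the property the slide highlights.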

  4. Intermittent Fault Tolerance • Conventional Techniques • Pause Execution • Spare Cores • Reconfiguration

  5. Any potential issues with these? • Pause Execution • Unfair • Reconfiguration • Throws the problem to software; takes a long time • Spare Core • Too risk-averse, big overhead

  6. Paper’s Approach • Thumbs Up • Not visible to software • Low area overhead • Fairness can be enforced • Thumbs Down • Works effectively only for intermittent faults

  7. Mixed-Mode Reliability • Dual-Modular Redundancy (DMR) • Run the same thread on 2 cores • DMR can sometimes be a liability! • Media software, streaming, etc. • Hence mixed mode is necessary • Idea is simple: use DMR for applications that need reliability, and turn off DMR for applications that need performance.
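The mixed-mode idea can be sketched in a few lines. This is a toy software model under stated assumptions, not the paper's hardware design: cores are modeled as hypothetical Python callables, and `run_mixed_mode`, `good_core`, and `faulty_core` are illustrative names.

```python
def good_core(task, arg):
    """Model of a fault-free core: runs the task as written."""
    return task(arg)


def faulty_core(task, arg):
    """Model of a core with a fault that flips the lowest result bit."""
    return task(arg) ^ 1


def run_mixed_mode(task, arg, reliable, cores):
    """Run task(arg); in reliable mode, run it on two cores and compare."""
    result = cores[0](task, arg)
    if not reliable:
        return result  # performance mode: no redundant execution
    if len(cores) < 2:
        raise ValueError("reliable mode needs two cores")
    redundant = cores[1](task, arg)
    if result != redundant:
        raise RuntimeError("DMR mismatch: possible fault detected")
    return result


def square(x):
    return x * x


# Reliable mode pays for a second core but detects the disagreement.
value = run_mixed_mode(square, 3, reliable=True, cores=[good_core, good_core])
```

In performance mode the second core is free to run another thread, which is the throughput that always-on DMR gives up.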

  8. Challenges • Isolate memory and register state of reliable applications • System-software protection • In the face of scheduling high-performance apps • Low overhead of transitions from perf. to reliable mode • Transitions themselves have to be reliable!

  9. Solutions • Protecting Memory • PAT: Protection Assistance Table • Extra bit for whether a page can be accessed in perf. mode • PAB: cache of PAT entries • Verify writes after TLB but before accessing rest • Scheduling and Transitions • MMM-IPC • Idle redundant cores: saves on sync delays • Protecting registers during transitions is easy
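The write check described on this slide can be sketched as follows. This is a simplified illustration, assuming a 4 KiB page size and an unbounded PAB; the class `ProtectionTable` and its method names are hypothetical and do not come from the paper.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class ProtectionTable:
    """Toy model of a PAT (one bit per page) with a PAB cache in front."""

    def __init__(self):
        self.pat = {}  # page number -> True if writable in perf. mode
        self.pab = {}  # cached PAT entries (unbounded here for simplicity)

    def allow(self, page):
        """Mark a page as writable in performance mode."""
        self.pat[page] = True
        self.pab.pop(page, None)  # drop any stale cached entry

    def write_allowed(self, phys_addr, perf_mode):
        """Check a write using the physical address produced by the TLB."""
        if not perf_mode:
            return True  # reliable (DMR) mode is not restricted by the PAT bit
        page = phys_addr // PAGE_SIZE
        if page not in self.pab:
            # PAB miss: fetch the bit from the PAT, defaulting to "protected".
            self.pab[page] = self.pat.get(page, False)
        return self.pab[page]


table = ProtectionTable()
table.allow(2)
ok = table.write_allowed(2 * PAGE_SIZE + 8, perf_mode=True)  # page 2 is allowed
```

Defaulting to "protected" on a missing entry means an unreliable performance-mode thread cannot corrupt a reliable application's pages by accident.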

  10. Solutions… • MMM-TP • Use all available cores to run additional threads • Protecting registers is complicated • Maintain dirty bits in the mute core’s cache • For scheduling, use the previously demonstrated virtualization technique • Low overhead of transition • High throughput gain

  11. Handling hard faults (Powell et al. 09) • Well understood for regular structures (e.g., cache) • What if one core is defective? • Disable core -> costs performance • Spare core -> costs die area • Is there something between 0 and 1? • Core salvaging -> disable defective parts of core

  12. Microarchitectural Redundancy • Or: how much redundancy is there inside a core to tolerate a fault? • Look at instructions and structure replication:

  13. Core Salvaging • “Sometimes there is something between 0 and 1” • A CPU need only support critical operations • E.g., load, store, … • Other operations can be outsourced • Either to another core • Or swap threads if it is busy
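The placement decision on this slide can be sketched as a small function. This is an illustrative model only: the `cores` dictionary layout, the operation names, and `place_op` are assumptions made for the example, not details from Powell et al.

```python
def place_op(op, home, cores):
    """Decide where an operation runs on a chip with salvaged cores.

    Returns (action, core): "run" on the home core, "migrate" to an idle
    capable core, "swap" threads with a busy capable core, or
    ("unsupported", None) if no core can execute the operation.
    """
    if op in cores[home]["working_ops"]:
        return ("run", home)
    capable = [name for name, core in cores.items()
               if name != home and op in core["working_ops"]]
    if not capable:
        return ("unsupported", None)
    idle = [name for name in capable if not cores[name]["busy"]]
    if idle:
        return ("migrate", idle[0])
    return ("swap", capable[0])


# Hypothetical chip: core c0 has lost its divider but keeps critical ops.
cores = {
    "c0": {"working_ops": {"load", "store", "add"}, "busy": False},
    "c1": {"working_ops": {"load", "store", "add", "fdiv"}, "busy": True},
    "c2": {"working_ops": {"load", "store", "add", "fdiv"}, "busy": False},
}
decision = place_op("fdiv", "c0", cores)  # ("migrate", "c2")
```

The salvaged core still contributes throughput for the common operations, so the chip degrades gradually instead of losing a whole core.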

  14. Is Salvaging good? • Pro • Uses mainly existing CPU features and requires only minor changes/extensions. • Contra • Paper focuses on short-term to near-future processors • If you have 100s of cores, do a few fewer cores really matter? (the authors even have a table showing this) • Paper analyzes only one defective core • What’s the worth of a 7.134 core die?

  15. Discussion/Research Questions • Prevention better than cure? • Wear-out/reliability awareness • All cores are homogeneous? • Remember Tuesday's paper on process variation! • Where to handle the errors? • Hardware, software, firmware? Flexibility vs. perf.? • What about scalability? • How to manage 100s of partly defective cores?
