1 / 5

Introspective Fault Tolerance for Exascale Systems

Introspective Fault Tolerance for Exascale Systems. Rinku Gupta, Kamil Iskra , Kazutomo Yoshii, Pavan Balaji , Pete Beckman. Motivation. Datacenter: 10 9 threads. Exascale systems will have faults Power constraints, high-density silicon Number of hardware/software components

dori
Download Presentation

Introspective Fault Tolerance for Exascale Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introspective Fault Tolerance for Exascale Systems Rinku Gupta, KamilIskra, Kazutomo Yoshii, PavanBalaji, Pete Beckman

  2. Motivation Datacenter: 109 threads • Exascale systems will have faults • Power constraints, high-density silicon • Number of hardware/software components • Both hardware and software have a role to play • Hardware techniques • ECC checks, 2D error coding • Can get too expensive when bit rates increase (both cost and power) • Software techniques need to complement hardware resilience with clearly defined roles • Mechanisms are needed for lower-level hardware and operating system to interface with upper levels for end-to-end for resiliency and fault tolerance Rack: 104 - 105 threads Socket: 500 -5000threads Die: 100 -1000threads Core/tile: 1 -10 threads Image courtesy of Intel : SC’11 BOF on Resilience S/W on Exascale Computing

  3. Introspective Fault Tolerance • Current fault exchange models are too simplistic: • OS kills the application on a hard error • OS/hardware returns an error code saying something bad happened • Hardware/OS/low-level runtime automatically corrects errors and hides it from the application • The fundamental concept of introspective fault tolerance: multi-way communication mechanism between operating system, runtime systems and applications • Hardware/OS/runtime should continue to give information to applications (like they currently do) • Applications/runtime systems should also pass down information (or hints) to the low-level runtime/OS on what they can “get away with” • Tuning tradeoffs based on application characteristics (e.g., OS can turn off ECC checks for some application specified memory regions) • Tradeoffs based on power, performance and resiliency (e.g., lesser voltage means lesser power, but more faults)

  4. Challenges • Research focus for achieving this goal: • Understand what faults/system changes highly impact applications • Understand how to improve fault detection at OS- or system-level • What interfaces are required between operating system and upper-level software? • What techniques would allow upper level software to use information received from OS? • What mechanisms are needed in the OS to manipulate resilience, power and performance • Is hardware prepared for this?

  5. An Example Interface Allocate regular memory • Based on annotations and low-level interfaces/hooks Introspect soft ECC errors Allocate memory with hard error returns Introspect hard ECC errors Allocate unreliable memory Call routines for memory check • Application can query OS for soft/hard error information  decide whether to continue execution or migrate/terminate  better end-to-end fault tolerance

More Related