Autonomic Computing via Dynamic Self-Repair

Daniel J. Sorin • Department of Electrical & Computer Engineering • Duke University

A Computing Challenge for NASA • NASA relies on computers • NASA is much more demanding than most users • Must operate in harsh environments that cause hard faults • Must operate correctly for years • Must not require human to repair problems • Our goal • Designing autonomic computer systems • Permanent faults will occur and computer will handle them

But Isn’t This a Solved Problem? • We could just use TMR (triple modular redundancy) • [Figure: three CPUs feeding a voter that produces the output] • But too much power usage to be feasible • Especially for modern microprocessors
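For readers unfamiliar with TMR, the voter can be sketched in a few lines. This is a minimal illustration, not the design from the talk: the names `majority_vote` and `tmr` are hypothetical, each "CPU" is modeled as a plain Python function returning a non-negative integer, and the vote is taken bit by bit.

```python
from typing import Callable, Sequence


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value that
    at least two of the three replica outputs agree on."""
    return (a & b) | (a & c) | (b & c)


def tmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run the same computation on three replicas and vote on the result."""
    if len(replicas) != 3:
        raise ValueError("TMR needs exactly three replicas")
    outputs = [replica(x) for replica in replicas]
    return majority_vote(*outputs)


def good_cpu(x: int) -> int:
    return x + 1


def faulty_cpu(x: int) -> int:
    # Hypothetical hard fault: the output is stuck at zero.
    return 0


# One faulty replica is outvoted by the two good ones.
result = tmr([good_cpu, faulty_cpu, good_cpu], 41)
```

All three replicas run on every operation, which is where the power cost noted on the slide comes from.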

Key Observation • Computer hardware is already modular • Improves performance • Simplifies design and verification • Modularity exists at many levels • Multiple processors per chip (CMP) • Multiple thread contexts per processor • Multiple functional units (e.g., adders) per processor • Multiple 4-bit adders in 64-bit adder • Multiple 1-bit adders in 4-bit adder • Etc. We can leverage this modularity!

Modular Redundancy • If computer has N widgets, add extra widget(s) • Then provide: • Ability to detect errors • Ability to diagnose hard faults (that cause errors) • Ability to reconfigure and map in spare widget • Cost: 1 (or 2) extra widgets, i.e., an overhead of 1/N (or 2/N), versus 2N extra widgets (200% overhead) for TMR • Benefit: can sometimes even be better than TMR! • Simplistic example: • For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders) • Triplicating the entire processor (TMR) can only tolerate one hard fault (in an adder)
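The reconfiguration step and the cost comparison above can be sketched as follows. This is an illustrative model under stated assumptions: `SparePool`, `retire`, and `spare_overhead` are hypothetical names, widgets are identified by integer ids, and error detection and diagnosis are assumed to have already identified the faulty widget.

```python
class SparePool:
    """Maps logical slots to physical widgets, with spares held in reserve."""

    def __init__(self, n_active: int, n_spares: int) -> None:
        # Physical widgets 0..n_active-1 start in service; the rest are spares.
        self.slot_to_widget = list(range(n_active))
        self.free_spares = list(range(n_active, n_active + n_spares))

    def retire(self, widget: int) -> int:
        """Map a spare into the slot of a widget diagnosed as hard-faulty.

        Returns the id of the spare now serving that slot.
        """
        slot = self.slot_to_widget.index(widget)
        if not self.free_spares:
            raise RuntimeError("no spare left: fault cannot be tolerated")
        spare = self.free_spares.pop(0)
        self.slot_to_widget[slot] = spare
        return spare


def spare_overhead(n_active: int, n_spares: int) -> float:
    """Extra hardware as a fraction of the original N widgets."""
    return n_spares / n_active


# TMR adds two full extra copies, i.e. 2N extra widgets for N originals.
TMR_OVERHEAD = 2.0

# The slide's example: 8 adders plus 2 spares tolerates two adder faults.
pool = SparePool(n_active=8, n_spares=2)
pool.retire(3)
pool.retire(5)
```

With 8 adders and 2 spares the overhead is 2/8 = 25% of the adder hardware, compared with 200% for TMR, and all 8 logical adder slots remain served after two hard faults.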

HMR: Hierarchical Modular Redundancy • Provide modular redundancy at many levels • Processors, adders, multipliers, etc. • Engineering issues involved in HMR • Allocating resources • Managing costs

Allocating Resources • For given hardware budget, how to allocate it • Which level to allocate spares? • Better to have extra processor? • Or extra adders in each processor? • Or some combination of both? • How many spares at each level? • Can a spare be mapped in anywhere in system?

Managing Costs • Costs: extra modules, wires, and multiplexers • Example: 3-bit addition, with module = 1-bit adder • [Figure: four 1-bit adder modules connected through multiplexers to inputs A1-A3 and B1-B3 and outputs C1-C3]
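A rough software model of the 3-bit example may help show what the multiplexers buy. It assumes three bit positions served by four 1-bit adder modules (one of them a spare), with a per-position select value standing in for the mux control. The function names and the stuck-at-zero fault are hypothetical, and the carry is rippled for simplicity.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Fault-free 1-bit adder module: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def stuck_at_zero(a: int, b: int, cin: int) -> tuple[int, int]:
    """Hypothetical hard fault: both outputs are stuck at 0."""
    return 0, 0


def add3(a: int, b: int, modules, select) -> int:
    """3-bit addition where select[i] picks which physical module
    serves bit position i (the role played by the multiplexers)."""
    carry = 0
    result = 0
    for bit in range(3):
        module = modules[select[bit]]
        s, carry = module((a >> bit) & 1, (b >> bit) & 1, carry)
        result |= s << bit
    return result | (carry << 3)  # final carry becomes the fourth result bit


# Module 1 has a hard fault; module 3 is the spare.
modules = [full_adder, stuck_at_zero, full_adder, full_adder]

broken = add3(3, 3, modules, select=[0, 1, 2])    # faulty module in use
repaired = add3(3, 3, modules, select=[0, 3, 2])  # spare mapped into bit 1
```

Each extra entry that `select` can choose from corresponds to additional wires and a wider multiplexer in hardware, which is the cost this slide refers to.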

Current Research Thrust #1 • Explore modular redundancy within microprocessor • Add extra array entries • In reorder buffer (ROB), branch history table (BHT), etc. • Add extra functional units • Adders, multipliers, etc. • For error detection • Use “DIVA” or redundant threads • For hard fault diagnosis • Use threshold error counters • For reconfiguration • Use extra wires and multiplexers Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
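The threshold error counters used for hard fault diagnosis can be sketched as below. This is only an illustration of the general idea: the class name, the string unit ids, and the threshold value are assumptions, and the slides do not specify details such as how or when counters are reset.

```python
from collections import Counter


class ThresholdDiagnoser:
    """Counts detected errors per unit and flags a unit as hard-faulty
    once its count reaches a fixed threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.errors = Counter()
        self.hard_faulty = set()

    def record_error(self, unit: str) -> bool:
        """Attribute one detected error to `unit`.

        Returns True if the unit is now considered hard-faulty.
        """
        self.errors[unit] += 1
        if self.errors[unit] >= self.threshold:
            self.hard_faulty.add(unit)
        return unit in self.hard_faulty


diagnoser = ThresholdDiagnoser(threshold=3)
# A single error on one unit is treated as possibly transient.
diagnoser.record_error("mult0")
# Repeated errors on the same unit cross the threshold.
flags = [diagnoser.record_error("adder2") for _ in range(3)]
```

The intuition is that a transient error rarely recurs on the same unit, while a hard fault keeps producing errors there, so the count separates the two cases. A practical design would likely also clear or decay the counters over time; that is not modeled here.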

Current Research Thrust #2 • Explore modular redundancy within 64-bit adder • Start with 64-bit carry lookahead adder (CLA) • Hierarchy of 4-bit CLA modules • Add 2 extra modules • Detect errors as before • Diagnose with counters and pattern matching • Based on error counter values, can diagnose fault! • Reconfigure with clever multiplexing scheme

Conclusions and Future Work • Hierarchical Modular Redundancy can provide high reliability at relatively low cost • Future directions • Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic, etc.) • Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding) • High-level: HMR for chip multiprocessors

Acknowledgments Several collaborators on this work • Co-Investigator Prof. Sule Ozev (Duke ECE) • Fred Bower (Duke CS grad and IBM) • Mahmut Yilmaz (Duke ECE grad) • Derek Hower (Duke ECE undergrad)
