Autonomic Computing via Dynamic Self-Repair
Daniel J. Sorin
Department of Electrical & Computer Engineering, Duke University
A Computing Challenge for NASA
• NASA relies on computers
• NASA is much more demanding than most users
  • Must operate in harsh environments that cause hard faults
  • Must operate correctly for years
  • Must not require a human to repair problems
• Our goal
  • Designing autonomic computer systems
  • Permanent faults will occur and the computer will handle them
But Isn’t This a Solved Problem?
• We could just use TMR (triple modular redundancy)
[Figure: three CPUs feed a voter, which produces the output]
• But too much power usage to be feasible
  • Especially for modern microprocessors
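The TMR arrangement on this slide can be sketched in a few lines of Python. This is only an illustration of majority voting: the names `tmr_vote`, `good_adder`, and `faulty_adder` are hypothetical, and the stuck-at fault is an assumed example, not one taken from the talk.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the value that at least two of three replicas agree on."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def good_adder(a, b):
    return a + b


def faulty_adder(a, b):
    # Hypothetical hard fault: output bit 0 is stuck at 1.
    return (a + b) | 1


# Three replicas compute the same operation; one has a hard fault.
replicas = [good_adder, faulty_adder, good_adder]
result = tmr_vote([cpu(2, 2) for cpu in replicas])  # majority value is 4
```

The voter masks the single faulty replica, but all three copies run all the time, which is where the power cost noted on the slide comes from.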
Key Observation
• Computer hardware is already modular
  • Improves performance
  • Simplifies design and verification
• Modularity exists at many levels
  • Multiple processors per chip (CMP)
  • Multiple thread contexts per processor
  • Multiple functional units (e.g., adders) per processor
  • Multiple 4-bit adders in 64-bit adder
  • Multiple 1-bit adders in 4-bit adder
  • Etc.
We can leverage this modularity!
Modular Redundancy
• If computer has N widgets, add extra widget(s)
• Then provide:
  • Ability to detect errors
  • Ability to diagnose hard faults (that cause errors)
  • Ability to reconfigure and map in spare widget
• Cost: 1 (or 2) extra widgets, i.e. 1/N (or 2/N) overhead, instead of 2*N extra widgets (200% overhead) for TMR
• Benefit: can sometimes even be better than TMR!
• Simplistic example:
  • For processor with 8 adders, providing 2 more adders can tolerate 2 hard faults (in adders)
  • Triplicating the entire processor (TMR) can only tolerate one hard fault (in an adder)
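The cost and fault-tolerance comparison above can be checked with a small sketch. The `WidgetPool` class and `spare_overhead` function are hypothetical names introduced here, and the model assumes that error detection and diagnosis have already identified the faulty widget.

```python
def spare_overhead(n_widgets, n_spares):
    """Fractional hardware overhead of adding spares to N identical widgets."""
    return n_spares / n_widgets


TMR_OVERHEAD = 2.0  # two extra full copies of everything, i.e. 200%


class WidgetPool:
    """N active widgets plus a few spares that can be mapped in on demand."""

    def __init__(self, n_active, n_spares):
        self.active = list(range(n_active))  # physical IDs currently in use
        self.spares = list(range(n_active, n_active + n_spares))

    def report_hard_fault(self, widget_id):
        """Replace a diagnosed-faulty widget; return False if no spare is left."""
        slot = self.active.index(widget_id)
        if not self.spares:
            return False
        self.active[slot] = self.spares.pop(0)
        return True


# The slide's example: 8 adders plus 2 spare adders.
pool = WidgetPool(n_active=8, n_spares=2)
outcomes = [pool.report_hard_fault(w) for w in (3, 5, 0)]
overhead = spare_overhead(8, 2)
```

Here the first two adder faults are absorbed by the spares and the third is not, at 25% adder overhead, compared with 200% overhead for triplicating the whole processor.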
HMR: Hierarchical Modular Redundancy
• Provide modular redundancy at many levels
  • Processors, adders, multipliers, etc.
• Engineering issues involved in HMR
  • Allocating resources
  • Managing costs
Allocating Resources
• For given hardware budget, how to allocate it
• Which level to allocate spares?
  • Better to have extra processor?
  • Or extra adders in each processor?
  • Or some combination of both?
• How many spares at each level?
• Can a spare be mapped in anywhere in system?
Managing Costs
• Costs: extra modules, wires, and multiplexers
• Example: 3-bit addition, with module = 1-bit adder
[Figure: four 1-bit adder blocks with multiplexers routing inputs A1-A3 and B1-B3 and outputs C1-C3]
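A software model of the 3-bit example may make the role of the multiplexers clearer. This sketch assumes a ripple-carry arrangement with one spare 1-bit adder and models the multiplexers as a routing table. The class `SparedRippleAdder` and the stuck-at-zero fault are hypothetical illustrations, not the circuit from the slide.

```python
def full_adder(a, b, cin):
    """Healthy 1-bit adder module: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def stuck_at_zero(a, b, cin):
    """Hypothetical hard fault: both outputs stuck at 0."""
    return 0, 0


class SparedRippleAdder:
    def __init__(self, width=3, spares=1):
        self.modules = [full_adder] * (width + spares)  # physical modules
        self.route = list(range(width))  # route[bit] = physical module used
        self.free = list(range(width, width + spares))

    def inject_fault(self, phys):
        self.modules[phys] = stuck_at_zero

    def remap(self, phys):
        """Model the multiplexers switching a bit position to a spare module."""
        bit = self.route.index(phys)
        self.route[bit] = self.free.pop(0)

    def add(self, a, b):
        carry, total = 0, 0
        for bit, phys in enumerate(self.route):
            s, carry = self.modules[phys]((a >> bit) & 1, (b >> bit) & 1, carry)
            total |= s << bit
        return total | (carry << len(self.route))


adder = SparedRippleAdder()
before = adder.add(3, 5)   # healthy hardware
adder.inject_fault(1)
broken = adder.add(3, 5)   # module 1 corrupts bit 1 and its carry
adder.remap(1)
after = adder.add(3, 5)    # spare module now serves bit 1
```

The routing table stands in for the extra wires and multiplexers, which are the cost the slide asks the designer to manage: every position that can accept a spare needs its inputs and outputs steerable.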
Current Research Thrust #1
• Explore modular redundancy within microprocessor
• Add extra array entries
  • In reorder buffer (ROB), branch history table (BHT), etc.
• Add extra functional units
  • Adders, multipliers, etc.
• For error detection
  • Use “DIVA” or redundant threads
• For hard fault diagnosis
  • Use threshold error counters
• For reconfiguration
  • Use extra wires and multiplexers
Modular array entry design published in International Symposium on Dependable Systems and Networks, 2004
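One way to picture threshold error counters is sketched below. The slide does not give the details, so everything here is an assumption: one counter per unit, each detected error charged to every unit the erroneous instruction used, and an arbitrary threshold of 3. The class name `ThresholdDiagnoser` and the unit names are hypothetical.

```python
from collections import defaultdict


class ThresholdDiagnoser:
    """Flag a unit as hard-faulty once its error count reaches a threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = defaultdict(int)  # unit name -> errors charged so far

    def record_error(self, units_used):
        """Charge one detected error to each unit; return newly flagged units."""
        flagged = []
        for unit in units_used:
            self.counts[unit] += 1
            if self.counts[unit] == self.threshold:
                flagged.append(unit)
        return flagged


diag = ThresholdDiagnoser(threshold=3)
# Each tuple lists the units touched by an instruction that failed checking.
events = [("adder0", "rob5"), ("adder0", "rob7"), ("adder0", "rob2")]
diagnosed = [unit for used in events for unit in diag.record_error(used)]
```

In this example the errors keep implicating `adder0` while the ROB entries vary, so only `adder0` crosses the threshold. A transient error would add a single count and never trigger reconfiguration under this model.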
Current Research Thrust #2
• Explore modular redundancy within 64-bit adder
• Start with 64-bit carry lookahead adder (CLA)
  • Hierarchy of 4-bit CLA modules
  • Add 2 extra modules
• Detect errors as before
• Diagnose with counters and pattern matching
  • Based on error counter values, can diagnose fault!
• Reconfigure with clever multiplexing scheme
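The modular structure of the 64-bit adder can be illustrated with a behavioral sketch: sixteen 4-bit CLA modules, each exporting group generate and propagate signals that determine the carry into the next module. This is an assumed textbook-style model with hypothetical function names. It evaluates the lookahead equations sequentially rather than in parallel, and it leaves out the two spare modules and the multiplexing scheme, which the slide does not describe.

```python
MASK64 = (1 << 64) - 1


def cla4_signals(a, b):
    """Group generate (G) and propagate (P) for one 4-bit module."""
    g, p = a & b, a ^ b
    G = 0
    for i in range(4):
        # G ends up as the module's carry-out when its carry-in is 0.
        G = ((g >> i) & 1) | (((p >> i) & 1) & G)
    P = 1 if p == 0xF else 0  # a carry-in ripples through only if all bits propagate
    return G, P


def cla4_sum(a, b, cin):
    """4-bit sum of one module, given its carry-in."""
    g, p = a & b, a ^ b
    c, s = cin, 0
    for i in range(4):
        s |= (((p >> i) & 1) ^ c) << i
        c = ((g >> i) & 1) | (((p >> i) & 1) & c)
    return s


def add64(a, b):
    """64-bit wraparound addition built from sixteen 4-bit modules."""
    carry, total = 0, 0
    for slot in range(16):
        x = (a >> (4 * slot)) & 0xF
        y = (b >> (4 * slot)) & 0xF
        total |= cla4_sum(x, y, carry) << (4 * slot)
        G, P = cla4_signals(x, y)
        carry = G | (P & carry)  # carry into the next module
    return total
```

Because every module has the same interface (two 4-bit operands, a carry-in, a sum, and G/P outputs), any of them could in principle be replaced by an identical spare, which is the property the modular redundancy scheme relies on.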
Conclusions and Future Work
• Hierarchical Modular Redundancy can provide high reliability at relatively low cost
• Future directions
  • Low-level: modular designs of components besides just adders (e.g., multipliers, decoding logic, etc.)
  • Mid-level: modular designs of microprocessors that can tolerate loss of currently critical logic (e.g., decoding)
  • High-level: HMR for chip multiprocessors
Acknowledgments
Several collaborators on this work
• Co-Investigator Prof. Sule Ozev (Duke ECE)
• Fred Bower (Duke CS grad and IBM)
• Mahmut Yilmaz (Duke ECE grad)
• Derek Hower (Duke ECE undergrad)