Online Low-Cost Defect Tolerance Solutions for Microprocessor Designs
Thesis Proposal, Tue. Dec. 18th 2007
Kypros Constantinides
Advisor: Todd Austin
Department of Electrical Engineering and Computer Science, University of Michigan
Reliability Challenges of Technology Scaling
• Transient faults (due to natural radiation)
• Thermal runaway (increased heating, higher power dissipation, higher transistor leakage)
• Age-related wearout
• Electromigration
• Gate-oxide breakdown (TDDB)
• Parametric process variation (uncertainty in device & environment)
• Manufacturing defects (that escape testing and burn-in)
Reliability Challenges of Technology Scaling
[Figure: cost versus silicon process technology, with curves for product cost, cost per transistor, and reliability cost; annotation: further scaling is not profitable]
Reliability cost:
1) Cost of built-in defect tolerance mechanisms
2) Cost of R&D needed to develop reliable technologies
Suggested Approach
1) Build products out of unreliable components/technologies
2) Provide reliability through very low cost defect-tolerance techniques
Presentation Outline
• Previous Work – Traditional Techniques
• Preliminary Results
  • BulletProof – A Hardware-Based Defect Tolerance Technique
  • ACE Testing – A Software-Based Defect Tolerance Technique
• Future Work
• Timeline
Traditional Defect Tolerance Techniques
• Used in high-end, life-critical systems (e.g., aviation)
• Triple Modular Redundancy (voting scheme)
• N-Version Hardware
[Figure: 2-version hardware, in which a checker compares processor type A against processor type B; triple modular redundancy, in which three modules feed voting logic]
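The two schemes on this slide can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names `tmr_vote` and `dual_check` are hypothetical, and module outputs are modeled as plain integers.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority vote over three module outputs (triple modular redundancy).

    Returns (voted_value, index_of_disagreeing_module_or_None).
    Raises RuntimeError when no two outputs agree.
    """
    outputs = [a, b, c]
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None          # all modules agree
    if count == 2:
        # Exactly one module disagrees with the majority.
        faulty = next(i for i, out in enumerate(outputs) if out != value)
        return value, faulty
    raise RuntimeError("no majority: all three modules disagree")


def dual_check(a, b):
    """2-version checking: can detect a mismatch but cannot pick the correct value."""
    return a == b
```

The sketch shows the cost the later slides argue against: TMR masks a single faulty module but triples the hardware, while 2-version checking only detects the error.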
Examples of More Recent Research Approaches
• Processor Checking (DIVA – Austin, MICRO’99)
• Task Checking (Argus – Meixner, MICRO’07)
[Figure: processor checking pairs the main processor with a processor checker; task checking pairs the main processor with control-flow, data-flow, computation, and memory checkers]
Shortcomings of Existing Techniques
• Existing techniques continuously check for execution errors
• Redundant computation requires significant extra hardware – high area overhead
• Continuous checking consumes significant energy – pressure on power budget
• Suitable for high-end or life-critical systems, but too costly to employ in mainstream systems
Thesis Goal
• Thesis: Defect-tolerance techniques can provide the same level of reliability as traditional techniques, but at a much lower cost.
Goals:
• Area Cost
  • Ultra low-cost solution: < 5%
• Provided Reliability
  • ~99% of defects are detectable and recoverable
• Performance
  • Low runtime performance overhead (due to testing): < 10%
  • After recovery, the system still operates, in a degraded-performance mode: < 10%
BulletProof Pipeline - Overview (ASPLOS06, DATE07)
• Speculative state during checkpoint interval
• On-line distributed testing using checkers
[Figure: timeline with labels Checkpoint, Distributed Testing, Computation, and Checkpoint Interval]
BulletProof: Distributed Testing and Recovery
[Figure: pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with a local tester and checker attached to each component; one component is marked faulty (X)]
[Figure: timeline of computational epochs with labels Computation, Testing, Testing Complete, State Checkpoint, Fault Manifests, Fault Detected, Checkpoint Recovery, Reconfig, and No Testing]
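A rough Python sketch of the epoch flow the figure labels suggest: compute speculatively, test, then either checkpoint or roll back and reconfigure. It assumes that a failed test discards the whole epoch and disables the failing component before re-execution. The helper names (`run_with_epochs`, `apply_item`, `run_tests`) and the toy defect model are hypothetical, not the BulletProof implementation.

```python
import copy


def run_with_epochs(state, epochs, apply_item, run_tests):
    """Execute work in epochs, committing state only after tests pass.

    state      : dict holding the (toy) architectural state
    epochs     : list of lists of work items, one list per epoch
    apply_item : callable(state, item, disabled) that mutates state
    run_tests  : callable(disabled) -> set of components that fail testing
    """
    checkpoint = copy.deepcopy(state)
    disabled = set()
    log = []
    for epoch, items in enumerate(epochs):
        while True:
            for item in items:                  # speculative computation
                apply_item(state, item, disabled)
            failing = run_tests(disabled)       # testing for this epoch
            if not failing:
                checkpoint = copy.deepcopy(state)   # state checkpoint
                log.append(("commit", epoch))
                break
            # Fault detected: recover to the last checkpoint, reconfigure.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
            disabled |= failing
            log.append(("rollback", epoch, sorted(failing)))
    return state, disabled, log


# Toy defect model (an assumption): "alu1" corrupts results until disabled.
DEFECTIVE = {"alu1"}


def apply_item(state, item, disabled):
    corrupt = 1000 if DEFECTIVE - disabled else 0
    state["acc"] += item + corrupt


def run_tests(disabled):
    return DEFECTIVE - disabled
```

In this toy run the first epoch is executed twice: once with the defective unit (results discarded) and once after it has been disabled.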
Experimental Methodology – Baseline Architecture
• Baseline Architecture:
  • 5-stage 4-wide VLIW architecture, 32KB I-Cache, 32KB D-Cache
  • Embedded designs: Need high reliability with high cost sensitivity
• Circuit-Level Evaluation:
  • Prototype with a physical layout (TSMC 0.18um)
  • Accurate area overhead estimations
  • Accurate fault coverage area estimations
• Architecture-Level Evaluation:
  • Trimaran toolset & Dinero IV cache simulator
  • Average computational epoch size
  • Performance while in graceful degradation
• Benchmarks
  • SPECINT2000, MediaBench, MiBench
Design Defect Coverage
• Defect Coverage: total area of the design in which a defect can be detected and corrected
• Coverage per pipeline stage: IF 92.5%, ID 93.6%, EX 97.7%, MEM 92.6%, WB 92.7%
• Overall design defect coverage: 95.2%
Area Overhead Summary
Overhead as a percentage of design area (share of the total overhead in parentheses):
• EX: 11.09% (86%)
• RF: 1.26% (9.8%)
• ID: 0.22% (1.7%)
• IF: 0.07% (0.6%)
• WB: 0.06% (0.5%)
• L1 I-Cache: 0.08% (0.66%)
• L1 D-Cache: 0.08% (0.66%)
• Overall design area cost: 12.9%
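As a quick consistency check, the per-component figures can be summed and converted into shares of the total. The snippet below uses only the numbers on the slide and assumes the parenthesized values are each component's share of the total overhead. The recomputed shares of the smallest entries may differ slightly from the slide, presumably because the inputs are rounded.

```python
# Area overhead per component, as a percentage of total design area (from the slide).
area_overhead = {
    "EX": 11.09,
    "RF": 1.26,
    "ID": 0.22,
    "IF": 0.07,
    "WB": 0.06,
    "L1 I-Cache": 0.08,
    "L1 D-Cache": 0.08,
}

# Overall overhead is the sum of the component overheads.
total = sum(area_overhead.values())

# Each component's share of the overall overhead, in percent.
shares = {name: 100.0 * value / total for name, value in area_overhead.items()}

if __name__ == "__main__":
    print(f"total overhead: {total:.2f}%")
    for name, share in shares.items():
        print(f"{name:>10}: {share:.1f}% of the overhead")
```

The sum matches the 12.9% overall cost after rounding, and the execute stage dominates the overhead.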
BulletProof Summary
• Provided reliability: 95.2%
• Runtime performance overhead: < 1%
• Silicon area cost: 12.9%
• Trade off runtime performance to get lower area overhead and higher reliability
Software-Based Defect Detection (MICRO’07)
• Move the hardware checking overhead to software
• Firmware periodically stalls the processor and performs hardware checking
• Provide architectural support to the software checking routines
• Advantages over hardware-based techniques
  - Lower area overhead
  - Higher runtime flexibility
    - it can support multiple fault models
    - dynamic tuning of testing process
  - Easier to upgrade (software patches)
[Figure: firmware running hardware checking routines on the processor, with open questions on accessibility and controllability]
Access-Control Extensions (ACE) Framework
• Architectural support that enables software access to the processor state (ACE Hardware)
• Special instructions can access and control any part of the processor state (ACE Instructions)
• Firmware can periodically run directed hardware tests (ACE Firmware)
[Figure: layered stack. Software: applications, operating system, ACE firmware. ISA: with the ACE extension. Hardware: processor with ACE hardware reaching the processor state]
Accessing The Processor State (ACE Hardware)
• We leverage the existing full hold-scan chain infrastructure
• Full hold-scan chains are employed by most modern processors to improve/automate manufacturing testing
[Figure: scan state (shadow processor state) alongside the processor state]
Accessing The Processor State (ACE Hardware)
• ACE Instructions can move values from the architectural registers to the scan state and vice versa
• ACE Instructions can swap data between the scan state and the processor state
[Figure: register file connected through an ACE tree of ACE nodes to the scan state, which shadows the processor state]
Software-based Testing & Diagnosis (ACE Firmware)
• ATPG: automatic test pattern & response generation; test patterns and test responses are held in memory
• Step 1: Load test pattern into scan state
• Step 2: 3-cycle atomic test operation
  • Cycle 1: Swap scan state with processor state
  • Cycle 2: Test cycle
  • Cycle 3: Swap scan state with processor state
• Step 3: Validate test response
[Figure: test patterns flow from memory through the register file and ACE nodes into the scan state; after the test operation the scan state holds the test response, which is validated against the expected response]
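The three steps above can be sketched in Python as a toy model. This is an illustration under stated assumptions, not the ACE hardware or firmware: the class `AceCore`, its methods, and the two-bit half-adder logic are all hypothetical, and state is modeled as a list of bits.

```python
class AceCore:
    """Toy core: a processor state, a shadow scan state, and next-state logic."""

    def __init__(self, logic, processor_state):
        self.logic = logic                              # function: state -> next state
        self.processor_state = list(processor_state)
        self.scan_state = [0] * len(processor_state)

    def load_scan(self, pattern):
        self.scan_state = list(pattern)

    def swap(self):
        self.processor_state, self.scan_state = self.scan_state, self.processor_state

    def clock(self):
        self.processor_state = self.logic(self.processor_state)


def run_ace_test(core, pattern, expected):
    """Apply one test pattern; return True when the response matches."""
    core.load_scan(pattern)     # Step 1: load test pattern into scan state
    core.swap()                 # Cycle 1: pattern enters processor state
    core.clock()                # Cycle 2: test cycle through the logic
    core.swap()                 # Cycle 3: response to scan state, original state restored
    return core.scan_state == list(expected)   # Step 3: validate test response


def good_logic(state):
    """Half adder: [sum, carry] of the two state bits."""
    return [state[0] ^ state[1], state[0] & state[1]]


def faulty_logic(state):
    """Same logic with the carry output stuck at 0."""
    return [state[0] ^ state[1], 0]
```

Because the two swaps bracket the test cycle, the running program's state is back in place after the third cycle, which is why the operation can be treated as atomic.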
Timeline of Software-Based Testing
Software-based testing is coupled with a checkpointing and recovery mechanism
• Functional software test
  • Check if the core is capable of running ACE-based testing
  • Limited fault coverage: 60-70%
  • Very fast: < 1000 instructions
• Directed ACE-based testing
  • High-quality testing (ATPG patterns)
  • High fault coverage: ~99%
  • Runtime: < 1M instructions
[Figure: timeline with labels Checkpoint, Computation, Functional Test, ACE-based Test, and Checkpoint Interval]
Experimental Methodology
• OpenSPARC T1 CMP – based on Sun’s Niagara
• Synopsys Design Compiler to synthesize the OpenSPARC CMP
• Synopsys TetraMAX ATPG tool for test pattern generation
• RTL implementation of ACE framework to get area overhead
• Microarchitectural simulation to get performance overhead
  • SESC cycle-accurate simulator
  • Simulate a SPARC core enhanced with the ACE framework
  • Benchmarks from the SPEC CPU2000 suite
Preliminary Functional Testing
• Fault injection campaign on a gate-level netlist of a SPARC core
• Software functional test – 3 phases (~700 instructions):
  • Control flow check
  • Register access
  • Use all ISA instructions
• Functional testing coverage is low: ~62%
• Undetected faults do not affect the execution of ACE firmware
• Full coverage provided with further ACE-based testing
Full-chip Distributed ACE-based Testing
• Chip testing is distributed to the eight SPARC cores
• Testing for stuck-at and path-delay fault models
• Cores [0,1]: 312K test instructions, 99.6% coverage
• Cores [2,4]: 468K test instructions, 98.7% coverage
• Cores [3,5]: 405K test instructions, 98.8% coverage
• Cores [6,7]: 333K test instructions, 99.9% coverage
Performance Overhead of ACE-Based Testing
• Performance overhead depends on the fault model used to generate patterns
• ACE framework is flexible to support test patterns from different fault models
[Figure: SPEC CPU2000 average performance overhead at a 100M checkpoint interval, with fault models ordered toward higher-quality testing]
ACE Framework Area Overhead
• RTL implementation of ACE Framework in Verilog
• Explored several ACE tree configurations
• 8 ACE trees (1 per core) to cover OpenSPARC: ~230K ACE-accessible bits
• Area overhead: 0.7% for each tree, 5.8% for the ACE framework
ACE Testing Summary
• Provided reliability: ~99%
• Runtime performance overhead: 5-25%
• Silicon area cost: 5.8%
[Figure: summary triangle; the center label reads "BulletProof Pipeline"]
Contributions to Date - Acknowledgements
• BulletProof Pipeline (ASPLOS’06, DATE’07)
  • Todd Austin and Valeria Bertacco (project supervision)
  • Smitha Shyam and Sujay Phadke (ASPLOS’06)
    • Physical prototype implementation
    • Distributed Checkers
  • Mojtaba Mehrara and Mona Attariyan (DATE’07)
    • Added soft-error detection to BulletProof pipeline
    • Increased the fault coverage of the technique (protection for control logic)
• ACE Testing Framework (MICRO’07)
  • Todd Austin, Onur Mutlu and Valeria Bertacco (project supervision)
Overview of Future Research Directions
Online Low-cost Defect Tolerance Solutions:
• Online Defect Detection & Diagnosis (add value to already proposed techniques)
  - BulletProof Pipeline
  - ACE Testing
• Online System Repair
• Online System Recovery
  - Low overhead periodic checkpoint and recovery
  - Existing mechanisms: ReVive + ReViveI/O, SafetyNet
• Evaluation Infrastructure
  - Fault Injection Based Analysis Framework
Extend the ACE Framework to Other Applications
Overhead of ACE framework can be amortized by other applications:
• Online Defect Detection & Diagnosis
• Online Performance Monitoring
• Online Design Bug Detection
• Manufacturing Testing
• Post-silicon Debugging
[Figure: the ACE framework (ACE firmware, hardware accessibility & controllability) on the processor, surrounded by the applications above]
Flexible Event Monitoring Architecture
• Event monitoring requires real-time signal monitoring/processing
• Event monitoring hardware:
  • Bug signature checkers
  • Performance counters
• Supporting monitoring capabilities for all ~230K bits of OpenSPARC is very expensive: ~25-30% area overhead
[Figure: register file and ACE nodes connected to a programmable logic core]
Design Bugs - Preliminary Analysis
• Most bugs are in complex control logic
  • Memory subsystem (lsu)
  • Exception/interrupt control (tlu)
• Load/Store Unit (lsu) & Trap Logic Unit (tlu) account for 96% of the design bugs in the OpenSPARC core
• They account for only 49% of the core’s scan cells
Current Software-Based Fault Simulation Framework
A Monte Carlo-based fault simulation & analysis framework
• Fault simulation & analysis speed: ~10KHz
• Fault models are described by type, time, location, and duration
• Supported models: stuck-at, stuck-open, bridge, path-delay, transient (SEU)
• Design stimuli drive a fault-exposed model (gate-level netlist) and a golden model (no faults injected)
• The fault analyzer classifies each fault as: logic masked, timing masked, architecture masked, or error (fault manifests)
• Monte Carlo simulation loop: 1000x
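The loop described on this slide can be sketched on a toy scale in Python. This is a hedged illustration only: the three-input circuit, the node names, and the functions `simulate` and `monte_carlo` are hypothetical, only stuck-at faults are modeled, and the result is split into just "masked" and "error" because a purely combinational toy cannot separate logic, timing, and architectural masking.

```python
import random

# Nodes of a toy gate-level netlist: out = (a AND b) OR c
NODES = ("a", "b", "c", "n1", "out")


def simulate(inputs, fault=None):
    """Evaluate the netlist; fault is (node, stuck_value) or None for the golden model."""
    def drive(node, value):
        # A stuck-at fault overrides whatever value the node would carry.
        if fault is not None and fault[0] == node:
            return fault[1]
        return value

    a = drive("a", inputs["a"])
    b = drive("b", inputs["b"])
    c = drive("c", inputs["c"])
    n1 = drive("n1", a & b)
    return drive("out", n1 | c)


def monte_carlo(trials=1000, seed=0):
    """Inject one random stuck-at fault per trial and compare with the golden model."""
    rng = random.Random(seed)
    counts = {"masked": 0, "error": 0}
    for _ in range(trials):
        inputs = {name: rng.randint(0, 1) for name in ("a", "b", "c")}
        fault = (rng.choice(NODES), rng.randint(0, 1))   # location and type
        golden = simulate(inputs)                        # no fault injected
        exposed = simulate(inputs, fault)                # fault-exposed model
        counts["error" if exposed != golden else "masked"] += 1
    return counts
```

Calling `monte_carlo()` returns the number of masked and manifested faults over 1000 trials; the fixed seed only makes the toy run repeatable.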
Hardware Accelerated Fault Simulation
• Port the software-based fault simulation & analysis framework to the BEE2 hardware emulation platform
• In collaboration with Andrea Pellegrini and Dan Zhang
[Figure: BEE2 emulation board]
Online Repair Techniques
• Qualitatively evaluate the effectiveness of graceful degradation that exploits existing resource redundancy
• But different architectures have different degrees of resource redundancy
• For what defect rates is a given degree of resource redundancy adequate?
• Is graceful degradation enough? Do we need to spare? If yes, what to spare?
[Figure: example architectures with 2 cores, 8 cores, and 80 tiles]
Thesis Completion Timeline
[Figure: timeline from Jan’08 to May’09, marked every two months]
Planned items:
• Internship?
• ACE Framework Extensions (other applications)
• ACE Framework Journal Submission
• Exploration of Online Repair Techniques
• CrashTest Resiliency Framework on BEE2
• Thesis Defense
• Alternative Thesis Defense
Target venues: IEEE Transactions on Computers, MICRO’08 or ASPLOS’09, DAC’09, DSN’09
Thank You! Questions?