Online Design Bug Detection: RTL Analysis, Flexible Mechanisms, and Evaluation
Kypros Constantinides (University of Michigan), Onur Mutlu (Microsoft Research & Carnegie Mellon University), Todd Austin (University of Michigan)

Challenges of Correct Microprocessor Design
Design bugs: deviations from the product specifications
• Chip-multiprocessors
• New features:
  - 64-bit extensions
  - Virtualization
  - Power management
  - SSE3
[Chart: bug discovery rates of 3.5 bugs per month and 1.2 bugs per month*]
*Data compiled from Intel product specification update documents
More bugs appear as more complex and diverse resources are integrated into a single chip.

Why is Online Design Bug Detection Needed?
Cost of design bugs:
• Lower customer satisfaction
• Lower system performance
• Financial loss
• Expensive recalls
• Diminishing brand/company reputation
• System security: attacks exploit HW design bugs
Microprocessor companies rely on ad-hoc techniques that change the software and hardware configuration to work around design bugs.

Online Design Bug Detection and Avoidance
• Online design bug detection
  - Bug detection mechanism is updated by firmware with new design bugs
• Online system recovery
  - Recover system from design bug effects
  - Low-overhead periodic checkpoint and recovery
  - Existing mechanisms: ReVive + ReViveI/O, SafetyNet
• Bug avoidance techniques
  - Avoid the reoccurrence of the design bug
  - Existing mechanisms: scale down to safe mode, disable buggy part, hypervisor execution guidance
In this work we focus on online design bug detection.

Microprocessor Errata Documents
From the Intel Pentium 4 Specification Update Document:
R31. Interactions between the Instruction Translation Lookaside Buffer (ITLB) and the Instruction Streaming Buffer May Cause Unpredictable Software Behavior
Problem: Complex interactions within the instruction fetch/decode unit may make it possible for the processor to execute instructions from an internal streaming buffer containing stale or incorrect information.
Implication: When this erratum occurs, an incorrect instruction stream may be executed resulting in unpredictable software behavior.
Limitations:
- Provide only a high-level description of the design bug
- Hard to relate the design bug to the actual hardware implementation

Characterizing RTL Design Bugs
OpenSPARC T1 (Niagara), OpenSPARC core:
- RTL design bugs in Verilog code
- Fixed and documented in the code
Load Store Unit (LSU): 157 bugs
Trap Logic Unit (TLU): 139 bugs
Total of 296 bugs in the SPARC core
[Figure: SPARC core block diagram with MUL, EXU, IFU, Load Store Unit (LSU), MMU, and Trap Logic Unit (TLU)]
Example of an RTL design bug in Verilog code (tlu_ctl.v):
    assign intrpt_taken = rstint_taken | hwint_taken | sirint_taken;
    ...
    // modified for bug 3919
    // Buggy code:
    // assign trap_to_redmode = trp_lvl_at_maxtlless1 & ~intrpt_taken;
    // Corrected code:
    assign trap_to_redmode = trp_lvl_at_maxtlless1 &
                             ~(rstint_taken | sirint_taken);

Online Detection of Design Bugs
    assign intrpt_taken = rstint_taken | hwint_taken | sirint_taken;
    ...
    // modified for bug 3919
    // Buggy code:
    // assign trap_to_redmode = trp_lvl_at_maxtlless1 & ~intrpt_taken;
    // Corrected code:
    assign trap_to_redmode = trp_lvl_at_maxtlless1 &
                             ~(rstint_taken | sirint_taken);
[Figure: correct vs. buggy implementation. Flip-flops (D/Q, Clk) feed the combinational logic that drives trap_to_redmode.]
Input values that expose the design bug: trp_lvl_at_maxtlless1 = 1, rstint_taken = 0, hwint_taken = 1, sirint_taken = 0. With these values the correct implementation drives trap_to_redmode to 1, while the buggy implementation drives it to 0.
Monitoring these signals (flip-flops) can detect the bug occurrence.

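The triggering condition on this slide can be checked with a small behavioral model. The following Python sketch is not part of the original deck; it mirrors the two Verilog assignments for bug 3919 with 0/1 integers and enumerates every input combination on which the buggy and corrected logic disagree. The function names are illustrative assumptions.

```python
from itertools import product

def trap_to_redmode_buggy(trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken):
    # Buggy version: any taken interrupt (including hwint_taken) blocks red mode.
    intrpt_taken = rstint_taken | hwint_taken | sirint_taken
    return trp_lvl_at_maxtlless1 & (intrpt_taken ^ 1)

def trap_to_redmode_fixed(trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken):
    # Corrected version: hwint_taken no longer participates.
    return trp_lvl_at_maxtlless1 & ((rstint_taken | sirint_taken) ^ 1)

def exposing_inputs():
    # Input order: (trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken)
    return [bits for bits in product((0, 1), repeat=4)
            if trap_to_redmode_buggy(*bits) != trap_to_redmode_fixed(*bits)]

print(exposing_inputs())  # prints [(1, 0, 1, 0)]
```

The single disagreeing combination is the one shown in the figure, which is why watching these four signals is enough to flag this particular bug.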
Insights from RTL Design Bug Analysis
RTL analysis observations:
• ~20 signals need to be monitored per bug
• >1000 unique signals need to be monitored for all the bugs studied
• Each bug has ~7 source signals not monitored for any other bug, so the set of monitored signals expands with every new bug
• All bug source signals come from control flip-flops
• Monitoring data buffers or data registers will not provide significant benefit
Limitations of online bug detection techniques in the literature:
1. They monitor only a few hundred signals (~200-300)
2. Monitored signals are selected at design time

Flexible Bug Detection at the Flip-Flop Level
Monitor ALL control flip-flops in the design.
Each operating flip-flop is paired with a bug detection flip-flop that has a scan portion and a bug detection portion; the bug detection portion signals 0 on a match and 1 on a mismatch.
Bug signature encoding (loaded using field programmable scan chains):
  Monitor Enable   Detection Value   Meaning
  1                1                 FF needs to be 1 to expose bug
  1                0                 FF needs to be 0 to expose bug
  0                0                 FF is not a bug source signal
Flexible bug signature example: 1 X X 0 X X … X X X 0 X X

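As a rough illustration of the encoding on this slide, the Python sketch below models each bug detection flip-flop as a (monitor enable, detection value) pair and computes the per-flip-flop mismatch bit. It is a behavioral approximation under stated assumptions, not the authors' RTL: the string form of the signature and the function names are hypothetical.

```python
def encode_signature(signature):
    """Turn a signature string over '1', '0', 'X' into (monitor_enable, detection_value) pairs."""
    encoding = {"1": (1, 1), "0": (1, 0), "X": (0, 0)}
    return [encoding[symbol] for symbol in signature]

def mismatch_bits(state, encoded):
    """Per flip-flop: 0 = match (or not monitored), 1 = mismatch."""
    return [enable & (ff ^ value) for ff, (enable, value) in zip(state, encoded)]

def signature_matches(state, encoded):
    """The bug condition is present only if no monitored flip-flop mismatches."""
    return not any(mismatch_bits(state, encoded))

encoded = encode_signature("1XX0")
print(signature_matches([1, 0, 1, 0], encoded))  # monitored FFs hold 1 and 0
print(signature_matches([1, 1, 1, 1], encoded))  # last monitored FF holds 1, not 0
```

Because unmonitored flip-flops have their enable bit cleared, their values never contribute a mismatch, which is what lets firmware choose the monitored set after the chip has shipped.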
Distributed Global Bug Detection
- Flip-flop level: 64 control flip-flops grouped into eight 8-bit bug detection segments, each producing a match/mismatch signal (s)
- Checking tree: aggregates the segment signals; each table entry holds a flag, a bug ID, and a match-bitvector (X marks a don't care)
- Checking tree table entries are loaded at system startup by firmware
[Figure: example checking tree in which bug #9 and bug #12 are detected]

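The segment and table structure can be approximated in software as follows. This Python sketch makes several assumptions that the slide does not confirm: a segment reports 1 when every monitored flip-flop in it matches, a table entry's match-bitvector uses '1' for "this segment must match" and 'X' for "ignore", and a single table stands in for the whole checking tree. All names are hypothetical.

```python
SEGMENT_BITS = 8  # segment width taken from the slide

def segment_match_bits(state, signature):
    """One bit per segment: 1 if all monitored flip-flops in the segment match."""
    bits = []
    for start in range(0, len(state), SEGMENT_BITS):
        seg_state = state[start:start + SEGMENT_BITS]
        seg_sig = signature[start:start + SEGMENT_BITS]
        ok = all(s == "X" or int(s) == ff for ff, s in zip(seg_state, seg_sig))
        bits.append(1 if ok else 0)
    return bits

def detected_bugs(match_bits, table):
    """table: list of (bug_id, pattern); pattern has one '1' or 'X' per segment."""
    return [bug_id for bug_id, pattern in table
            if all(p == "X" or m == 1 for m, p in zip(match_bits, pattern))]

# Two segments (16 flip-flops) for brevity.
signature = "1XXXXXXX" + "XXXXXXX0"
state = [1, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 0, 0, 0, 1]
table = [(9, "1X"), (12, "11")]
bits = segment_match_bits(state, signature)
print(bits, detected_bugs(bits, table))
```

Splitting the check into per-segment bits keeps the comparison local to each group of flip-flops, so only one signal per segment has to travel to the table lookup.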
Detecting Multiple Design Bugs
- Design bug database: design bugs & triggering conditions
- Bug signatures #1, #2, …, #N are merged into one system bug signature
- The system bug signature is encoded and loaded into the bug detection flip-flops
Bug signature conflict: when one signature requires 0 and another requires 1 for the same signal, use "don't cares" (X) to resolve the conflict.
No false negatives, but false positive bug detections are possible.

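A minimal sketch of the merge rule, assuming signatures are strings over '1', '0', and 'X' and that any disagreement at a position (including a value against X) becomes X. This is a simplification written for illustration, not the authors' segment-aware merge procedure, and the function names are hypothetical.

```python
def merge_signatures(sig_a, sig_b):
    """Keep a position only where both signatures agree; otherwise use don't care."""
    return "".join(a if a == b else "X" for a, b in zip(sig_a, sig_b))

def matches(state, signature):
    """True if every monitored position of the signature equals the flip-flop value."""
    return all(s == "X" or int(s) == ff for ff, s in zip(state, signature))

merged = merge_signatures("10", "01")
print(merged)
# No false negatives: states that trigger either original signature still match.
print(matches([1, 0], merged), matches([0, 1], merged))
# False positive: this state triggers neither bug but matches the merged signature.
print(matches([1, 1], merged))
```

Relaxing conflicting positions to X can only widen the set of matching states, which is why coverage is preserved while spurious detections become possible.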
Online Tuning of Coverage/Performance Trade-Off
Adjust the set of design bugs being covered by dynamically updating the system bug signature.
Flowchart:
- Firmware loads the initial system bug signature
- Design bug detected (bug ID#): execution recovery & design bug avoidance
- False positive? If yes, update the log of the false positive rate of each bug (kept in physical memory)
- False positive rate > threshold? This check decides between removing the bug with the highest false positive rate and adding the bug with the lowest false positive rate

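The tuning loop might look like the following in software. This Python sketch rests on assumptions the slide leaves open: the false positive rate is taken as false positives divided by detections per bug, exceeding the threshold removes the worst covered bug, and otherwise the uncovered bug with the lowest logged rate is added back. The class and function names are hypothetical.

```python
from collections import defaultdict

class FalsePositiveLog:
    """Per-bug detection and false positive counters (assumed log layout)."""
    def __init__(self):
        self.detections = defaultdict(int)
        self.false_positives = defaultdict(int)

    def record(self, bug_id, was_false_positive):
        self.detections[bug_id] += 1
        if was_false_positive:
            self.false_positives[bug_id] += 1

    def rate(self, bug_id):
        total = self.detections[bug_id]
        return self.false_positives[bug_id] / total if total else 0.0

def tune(covered, uncovered, log, threshold):
    """Return updated (covered, uncovered) sets after one tuning step."""
    covered, uncovered = set(covered), set(uncovered)
    worst = max(sorted(covered), key=log.rate, default=None)
    if worst is not None and log.rate(worst) > threshold:
        covered.remove(worst)      # too many spurious recoveries: stop covering it
        uncovered.add(worst)
    elif uncovered:
        best = min(sorted(uncovered), key=log.rate)
        uncovered.remove(best)     # headroom available: extend coverage
        covered.add(best)
    return covered, uncovered

log = FalsePositiveLog()
for fp in (True, True, True, False):
    log.record(9, fp)
for fp in (False, False):
    log.record(12, fp)
print(tune({9, 12}, {7}, log, threshold=0.5))
```

After each step the system bug signature would be regenerated from the covered set and reloaded by firmware, trading bug coverage against the performance cost of unnecessary recoveries.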
Area Overhead and Design Bug Coverage
RTL prototype implementation:
- Synthesized with IBM 130nm process technology
- Covers the whole OpenSPARC T1 chip
- 39K control flip-flops monitored (15% of all flip-flops in OpenSPARC T1)
- Bug detection flip-flops have an area overhead of 3%
[Plot: coverage of critical design bugs in 10 commercial processors [Sarangi et al., MICRO'06] versus total area overhead (%), axis from 0 to 25. Annotations: ~65%; 80% coverage at 10% overhead.]

Power Consumption Overhead
OpenSPARC T1 power budget: 58W (IBM 130nm @ 1.2V)
Bug detection framework (3.5% power overhead):
- Segment checking tree (16 entries per node): 0.74W (1.3%)
- Field programmable framework: 0.35W (0.6%)
- 39K bug detection augmented flip-flops: 0.9W (1.5%)
Rest of the chip:
- Cores & L1 caches: 14.4W (24.7%)
- Leakage: 13.7W (23.5%)
- Wires & repeaters: 10.7W (18.4%)
- L2 cache: 9W (15.4%)
- I/O pads: 6.9W (12%)
- Misc. units (I/O bridge, DRAM ctrl, CTU): 0.9W (1.5%)
- Crossbar: 0.6W (1.1%)

Contributions
• RTL-level analysis of the design bugs of a commercial processor
  - Bugs have unique source signals that are hard to predict at design time
  - Monitored signals need to be selected in the field after bug discovery
  - Current techniques are not flexible enough: they select signals at design time
• Proposed a flexible online bug detection mechanism
  - Monitors all control flip-flops in OpenSPARC T1
  - Set of monitored signals can be selected in the field using firmware
  - RTL prototype: 80% bug coverage for 10% area overhead

Future Work - Evaluation Challenges
• Current infrastructure is insufficient to measure the false positive rate
  - Functional simulators: lack of RTL-level detail
  - RTL simulators: too slow to run applications
• Developing a hardware prototype of our framework on FPGA
  - Uncomment design bug fixes in the RTL code of OpenSPARC T1
  - Evaluate the effectiveness of our framework on real applications
  - Measure the false positive rate
  - Explore the trade-off between bug coverage and performance

Thank You! Questions?

Online Bug Detection & Avoidance: A Microprocessor Airbag
• Extra cost without any performance/utility benefits
• The microprocessor designers shouldn't rely on it
• No guarantee of success: doesn't cover all possible design bugs
Car airbags reduce fatalities by 8% when seat belts are worn.
• Objective: reduce the risk of serious implications when critical design bugs are discovered after product release

RTL Algorithmic Design Bugs
Design bug in Verilog code (lsu_qctl1.v):
    //bug4814 - change rrobin_picker1 to rrobin_picker2
    // Choose one among 4 loads.
    //lsu_rrobin_picker1 ld4_rrobin (
    //  .events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
    ...
    //.se(se),
    //.so()
    //);

    lsu_rrobin_picker2 ld4_rrobin (
      .events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
      ...
      .se(se),
      .so()
    );
- Algorithmic deviations from the design specifications
- Require major modifications to be fixed

RTL Timing Design Bugs
Design bug in Verilog code (lsu_qdp1.v):
    // Begin - Bug3487.
    ...
    dff #(48) ifu_std_d1 (
      .din (tlb_st_data[47:0]),
      .q   (lsu_ifu_stxa_data[47:0]),
      .clk (asi_data_clk),
      .se  (1'b0), .si (), .so ()
    );

    // select is now a stage earlier, which should be
    // fine as selects stay constant.
    //assign lsu_ifu_stxa_data[47:0] = tlb_st_data_d1[47:0] ;

    // End - Bug3487.
- Signals need to be latched a cycle earlier or later to keep correctness
- Addition or removal of flip-flops is the most common fix

RTL OpenSPARC T1 Design Bug Distribution
Load/Store Unit (LSU): 157 design bugs
Trap Logic Unit (TLU): 139 design bugs

Power Consumption Estimation Methodology
* A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher. A Power-Efficient High-Throughput 32-Thread SPARC Processor. In IEEE Journal of Solid-State Circuits, 42(1), 2006.

RTL Analysis Results

Merging Bug Signatures
[Figure: 4-bit bug detection segments. For each design bug (#1 and #2), bug signatures #1 and #2 are merged into an intermediate signature (cases labeled CASE 1 and CASE 2); intermediate signatures #1 and #2 are then merged into the system bug signature.]

High-Level Overview
1. Bug signature collection: generate the bug signatures based on the design bugs & triggering conditions
2. Merge bug signatures into the system bug signature
3. Firmware encodes and loads the system bug signature to the bug detection segments
4. Firmware loads the segment match detection table entries
5. Cycle-by-cycle online checking of the system state (flip-flops) for design bugs
6. Segment checking tree: aggregate the bug detection segment match/mismatch signals to a global bug detection signal
7. Design bug recovery handler: if the global bug detection signal flags a bug, system recovery is triggered

OpenSPARC T1 Data & Control Flip-Flops

Synergistic Online Bug & Defect Detection