BulletProof: A Defect-Tolerant CMP Switch Architecture
Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky†
‡ Advanced Computer Architecture Lab, University of Michigan
† Department of Electrical and Computer Engineering, University of Texas at Austin
Introduction
• Reliability is a critical aspect of any computer design
• System designers target very small failure rates
• Today, reliability targets are met by using fault-avoidance design techniques
  - use of conservative design margins
• For future process technologies, conservative design margins will no longer be enough to avoid system failures
  - need defect-tolerant design techniques
[Figure: transistor reliability versus transistor lifetime (years), now and in the future]
Reliable System Design Space
• Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components – “BulletProof”
[Figure: design-space chart with axes "type of defect" and "design feature", placing DMR, TMR, ECC (memory), Diva, Razor, cache-line swap-out, memory-array spares and BulletProof among mainstream, research-stage, high-end and specialized solutions]
CMP Switch Architecture
• Goal: a defect-tolerant CMP switch design
• Baseline switch architecture is provided by Li-Shiuan Peh
  - Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
  - Wormhole switch pipelined at the flit level (32-bit flits)
  - Dimension-order routing
• Specified in Verilog and synthesized to a gate-level netlist of ~9K logic gates and 1700 sequential elements
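The dimension-order routing named above can be sketched as follows. This is a minimal illustration, not the baseline switch's Verilog: the function names, the X-then-Y order and the tie-break toward the positive direction are assumptions made for the example.

```python
def torus_step(cur: int, dst: int, size: int) -> int:
    """Return -1, 0 or +1: the shorter direction around one ring of the torus."""
    if cur == dst:
        return 0
    forward = (dst - cur) % size          # hops needed in the + direction
    return 1 if forward <= size - forward else -1


def route(src, dst, width, height):
    """Dimension-order route (assumed X first, then Y); returns the nodes visited."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                    # finish the X dimension completely
        x = (x + torus_step(x, dst[0], width)) % width
        path.append((x, y))
    while y != dst[1]:                    # only then move in Y
        y = (y + torus_step(y, dst[1], height)) % height
        path.append((x, y))
    return path
```

For example, on a 4x4 torus `route((0, 0), (3, 1), 4, 4)` uses the wrap-around link in X and then takes one hop in Y.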
Soft Error (SEU) Vulnerability
• In earlier work we studied the vulnerability of the switch architecture to soft errors
  - Only 3.2% of faults eventually cause an error
• Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
• In this work we focus on solutions for in-field silicon defects
  - These solutions also provide soft-error tolerance to the design
Self-Repairing Systems
• Defect-tolerant self-repairing systems need to support:
  - Error Detection
  - System Diagnosis (locate the origin of the error)
  - System Repair
  - System Recovery
• Key idea:
  - error detection must be performance efficient: it continuously checks execution for errors
  - diagnosis, repair and recovery are not performance critical: they are invoked only when an error is detected (rare scenario)
  - for these rarely used steps, trade performance for more cost-efficient techniques
Traditional Defect-Tolerant Techniques
• Traditional techniques for designing defect-tolerant systems:
• Triple Modular Redundancy (TMR)
  - Forward recovery
  - Applicable to both combinational and sequential logic
  - Cannot tolerate more than one defective module
  - Area and power overhead ~3X
• Error Correction Codes (ECC)
  - Lower-overhead solution
  - Applicable only to state-holding structures and buses
[Figure: TMR with three modules (M) and a voter (V); ECC code word interleaving ECC bits and data bits as R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8]
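A minimal sketch of the TMR voting described above, assuming each module output is modeled as an integer bit vector; the function names are illustrative and do not come from the slides.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three module outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(outputs):
    """Vote over three module outputs and report which modules disagree with the vote."""
    a, b, c = outputs
    voted = majority_vote(a, b, c)
    disagreeing = [i for i, out in enumerate((a, b, c)) if out != voted]
    return voted, disagreeing
```

With one faulty module the vote masks the error, e.g. `tmr((0b1010, 0b1010, 0b0110))` returns the correct value and flags module 2. With two modules failing the same way, `tmr((0b1010, 0b0110, 0b0110))` votes for the wrong value, which is the single-defect limit listed on the slide.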
Error Detection: Low-Cost Domain-Specific Technique
• The synthesized netlist of the added components accounts for ~10% of the total switch area
• Provides error detection for both hard and soft errors
[Figure: switch block diagram; labels: Input Buffers, Routing Logic, Header CRC Checker, Buffer Checker, Cross-bar, Cross-bar Controller, ARB, CRC, CRC Checker, Error, FLIT]
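The slide does not say which CRC the checkers use. The sketch below assumes an 8-bit CRC with polynomial 0x07 over a 32-bit flit, purely to illustrate how a checker can flag a corrupted flit; `attach_crc` and `check_flit` are hypothetical helper names.

```python
CRC_POLY = 0x07  # assumed polynomial x^8 + x^2 + x + 1; not specified in the slides


def crc8(data: bytes) -> int:
    """Bitwise CRC-8, MSB first, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ CRC_POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit: int):
    """Sender side: compute the CRC that travels with a 32-bit flit."""
    return flit, crc8(flit.to_bytes(4, "big"))


def check_flit(flit: int, crc: int) -> bool:
    """Checker side: True if the flit still matches its CRC."""
    return crc8(flit.to_bytes(4, "big")) == crc
```

A flit that passes through a defective or upset datapath and has a single bit flipped fails `check_flit`, regardless of whether the cause is a hard defect or a soft error.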
Adding Defect Resiliency With Lower Cost
• Automatic Cluster Decomposition
  - Balanced recursive min-cut heuristic algorithm
• Input:
  - a) design’s gate-level netlist
  - b) number of partitions
• Output: a partitioned netlist
• Goal:
  - Balance partition sizes: smaller partitions give higher resilience
  - Minimize cut edges: reduces cost overhead and vulnerable logic
• Partitions can have both combinational and sequential logic
[Figure: example netlist with nodes A to J]
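As a rough illustration of balanced recursive min-cut partitioning, the sketch below bisects a graph into equal halves and greedily swaps node pairs while the cut shrinks, then recurses. It is a simple stand-in heuristic, not the algorithm used for BulletProof; it models the netlist as an undirected adjacency dictionary and assumes the number of partitions is a power of two.

```python
from itertools import product


def cut_size(graph, part_a, part_b):
    """Count edges with one endpoint in part_a and the other in part_b."""
    return sum(1 for u in part_a for v in graph[u] if v in part_b)


def bisect(graph, nodes):
    """Split nodes into two balanced halves, then swap pairs while the cut shrinks."""
    nodes = sorted(nodes)
    half = len(nodes) // 2
    a, b = set(nodes[:half]), set(nodes[half:])
    improved = True
    while improved:
        improved = False
        best = cut_size(graph, a, b)
        for u, v in product(sorted(a), sorted(b)):
            new_a, new_b = (a - {u}) | {v}, (b - {v}) | {u}
            if cut_size(graph, new_a, new_b) < best:
                a, b, improved = new_a, new_b, True
                break                      # restart the scan from the new split
    return a, b


def partition(graph, nodes, num_parts):
    """Recursive bisection; num_parts is assumed to be a power of two."""
    if num_parts == 1:
        return [set(nodes)]
    a, b = bisect(graph, nodes)
    half = num_parts // 2
    return partition(graph, a, half) + partition(graph, b, half)


# Two triangles (A, C, E) and (B, D, F) joined by the single edge E-B.
graph = {
    "A": {"C", "E"}, "C": {"A", "E"}, "E": {"A", "C", "B"},
    "B": {"D", "F", "E"}, "D": {"B", "F"}, "F": {"B", "D"},
}
parts = partition(graph, graph.keys(), 2)
```

On this small example the swaps recover the two triangles, leaving one cut edge. Each swap strictly reduces the cut, so the loop terminates, but like any local heuristic it can stop at a sub-optimal split on larger netlists.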
Partition Sparing – Silicon Protection Factor
• Partition sparing:
  - Only one spare is active for each partition of the switch
  - Replace voting logic with spare swapping logic
  - Lower power overhead
  - A defect is fatal if it hits the last spare of a partition or the spare swapping logic
• Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead
  - The number of defects in a design is proportional to the design’s area
  - Enables comparison of different defect-tolerant designs
[Figure: example netlist with nodes A to J and 1 extra spare per partition; annotations: "15.8X more defects tolerated" and "SPF – Defect Tolerance: 7.6X more defects tolerated per unit area"]
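The SPF definition can be explored with a small Monte Carlo sketch. The model is an assumption made for illustration only: all partition copies have equal area, each defect lands on a uniformly random copy, and the spare swapping logic is ignored. The function names are hypothetical.

```python
import random


def mean_defects_to_failure(num_partitions, spares, trials=2000, seed=0):
    """Average number of defects absorbed before some partition has no working copy."""
    rng = random.Random(seed)
    copies = spares + 1                          # original copy plus its spares
    total = 0
    for _ in range(trials):
        dead = [set() for _ in range(num_partitions)]
        defects = 0
        while True:
            defects += 1
            p = rng.randrange(num_partitions)    # partition hit by this defect
            dead[p].add(rng.randrange(copies))   # copy hit (may already be dead)
            if len(dead[p]) == copies:           # last working copy lost: fatal
                break
        total += defects
    return total / trials


def spf(mean_defects, area_overhead):
    """Silicon Protection Factor: mean defects to failure per unit of area overhead."""
    return mean_defects / area_overhead
```

Read the other way around, the figures quoted elsewhere in the deck for one spare per partition (Area = 2.3X, SPF = 7.6) imply roughly 7.6 × 2.3 ≈ 17.5 mean defects to failure.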
System Recovery
• Add a Recovery Pointer to each input buffer
• Recovery pointers advance 4 cycles after the input controller grants the requesting output channel
  - Guarantees that the flit is CRC checked
• On error detection:
  - All CRC checkers drop outgoing flits
  - Switch pipeline is flushed
  - Head pointers are set to recovery pointers
  - Restart execution
• Flits in the example input buffer:
  - a: Correctly routed flit
  - b, c: In the switch pipeline
  - d: Next flit to be routed
  - e: Last flit buffered
[Figure: input buffers holding flits a to e with Tail, Head and Recovery Head pointers; CRC checkers on the routed flits drive the Error Detection Signal to the switch recovery logic]
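The recovery-pointer mechanism can be sketched as a small software model. The class and method names are hypothetical, the tail pointer and buffer wrap-around are left out, and only the 4-cycle delay is taken from the slide.

```python
from collections import deque

CHECK_LATENCY = 4  # cycles from grant until the flit has been CRC checked (from the slide)


class InputBuffer:
    """Simplified input buffer with head and recovery pointers (tail pointer omitted)."""

    def __init__(self, flits):
        self.flits = list(flits)
        self.head = 0                 # next flit to enter the switch pipeline
        self.recovery = 0             # oldest flit not yet known to be CRC clean
        self.in_flight = deque()      # grant cycles of flits between recovery and head

    def grant(self, cycle):
        """Input controller grants the output channel: send the flit at head."""
        flit = self.flits[self.head]
        self.head += 1
        self.in_flight.append(cycle)
        return flit

    def tick(self, cycle):
        """Advance the recovery pointer past flits granted CHECK_LATENCY or more cycles ago."""
        while self.in_flight and cycle - self.in_flight[0] >= CHECK_LATENCY:
            self.in_flight.popleft()
            self.recovery += 1

    def on_error(self):
        """Error detected: flush the pipeline and replay from the recovery pointer."""
        self.head = self.recovery
        self.in_flight.clear()
```

Replaying the slide's example: flits a, b and c are granted in cycles 0 to 2; at cycle 4 only a has passed its CRC check, so an error at that point rolls the head pointer back to b.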
System Diagnosis and Repair
• Iterative trial-and-error technique
  - Recover to the last correct state of the switch
  - For partition i, swap in the spare for the current copy and restart execution
  - Error detected? Yes: increase i; if i < # partitions, repeat for the next partition, otherwise the defect is fatal
  - Error detected? No: continue execution
• Built-In-Self-Test (BIST)
  - For each partition, keep automatically generated test vectors in ROM
  - Apply test vectors to each partition through scan chains to locate the defective partition
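The iterative trial-and-error flow can be sketched as a loop over partitions. `replay_ok` is a hypothetical callback standing in for "recover to the last correct state, swap in the spare for partition i, and restart execution"; the sketch also assumes that a trial swap which does not clear the error is undone, which the slide does not state.

```python
def diagnose_and_repair(num_partitions, spares_left, replay_ok):
    """
    Try each partition in turn until replay runs without a detected error.

    spares_left[i] -- unused spares remaining for partition i
    replay_ok(i)   -- hypothetical callback: True if re-execution with partition i
                      swapped to a spare completes without an error being detected
    Returns the index of the repaired partition, or None for a fatal defect.
    """
    for i in range(num_partitions):
        if spares_left[i] == 0:
            continue                  # nothing to swap in for this partition
        if replay_ok(i):
            spares_left[i] -= 1       # keep the spare in place (assumed policy)
            return i
        # Error persists: assume the trial swap is reverted and move on to the next i.
    return None                       # all partitions tried: fatal defect
```

Because this path runs only after an error is detected, its cost (up to one replay per partition) does not affect normal-case performance.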
Exploring Defect-Tolerant CMP Switch Designs
• How do these techniques affect the system’s lifetime?
• Design points:
  - 12 partitions (cmps), TMR: Area = 3.04X, SPF = 1.54
  - 12 partitions (cmps), 2/5 spare input controllers, 1 spare per cmp. (rest), iterative replay: Area = 1.76X, SPF = 2.53
  - 206 partitions, 1 spare per partition, Built-In-Self-Test: Area = 3.16X, SPF = 5.54
  - 206 partitions, 1 spare per partition, iterative replay: Area = 2.3X, SPF = 7.6
  - 206 partitions, 2 spares per partition, iterative replay: Area = 3.4X, SPF = 11.1
[Figure: design-space plot separating Pareto-optimal from Pareto sub-optimal designs; directions labeled "cheaper designs" and "more robust designs"]
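A short sketch of how the Pareto-optimal designs can be identified from the five design points above, assuming lower area and higher SPF are both preferred. The short design names are labels invented for this example.

```python
# (label, area overhead, SPF) taken from the slide; labels are shorthand for this sketch.
designs = [
    ("TMR-12", 3.04, 1.54),
    ("partial-spares-12", 1.76, 2.53),
    ("BIST-206-1spare", 3.16, 5.54),
    ("replay-206-1spare", 2.3, 7.6),
    ("replay-206-2spares", 3.4, 11.1),
]


def pareto_front(points):
    """Keep designs that no other design beats on both area (lower) and SPF (higher)."""
    front = []
    for name, area, spf in points:
        dominated = any(
            a <= area and s >= spf and (a < area or s > spf)
            for _, a, s in points
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(designs)
```

Under this criterion the TMR design and the BIST design are each dominated by a cheaper design with a higher SPF, while the three iterative-replay designs remain on the front.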
“Bathtub Curve”: A Model for Semiconductor Hard Failures
• The lifetime failure rate of semiconductor systems follows what is known as the bathtub curve
• Trend for future process technologies:
  - Failure rate of the grace period gets larger
  - Breakdown period starts earlier in the system’s lifetime
[Figure: failure rate (FIT) versus time, showing the infant, grace and breakdown periods, with a curve for future process technologies]
System Lifetime – A Post-65nm Technology Case Scenario
[Figure: failure rate (FIT) axis with ticks from 12000 to 120000; curves labeled "TMR SPF=1.54", "3/5 spare IC, 1 spare rest SPF=3.01", "1 spare SPF=7.63" and "2 spares SPF=11.11"; annotation "1 defect every two years"]
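To relate the FIT axis to the "1 defect every two years" annotation, the sketch below converts between a mean interval between defects and a FIT rate. It uses the standard definition of FIT as failures per 10^9 device-hours, which the slides do not spell out, and assumes 365-day years.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day years


def fit_from_interval(years_between_failures: float) -> float:
    """FIT rate (failures per 1e9 hours) for one failure every N years."""
    return 1e9 / (years_between_failures * HOURS_PER_YEAR)


def years_from_fit(fit: float) -> float:
    """Mean years between failures for a given FIT rate."""
    return 1e9 / (fit * HOURS_PER_YEAR)
```

One defect every two years corresponds to roughly 57,000 FIT, which falls within the range of the plot's axis.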
Conclusions – Future Work
Conclusions
• Traditional mechanisms are insufficient for tolerating moderate numbers of defects
• Domain-specific techniques along with resource sparing, iterative diagnosis and reconfiguration are more effective
• Decomposing the design into modest-sized partitions is the most effective granularity at which to apply redundancy
Future Work
• Use of spare components based on component wear-out profiles
• Explore low-cost defect-tolerant techniques for microprocessors