Assessing SEU Vulnerability via Circuit-Level Timing Analysis

Kypros Constantinides ‡ Stephen Plaza ‡ Jason Blome ‡ Bin Zhang † Valeria Bertacco ‡ Scott Mahlke ‡ Todd Austin ‡ Michael Orshansky † ‡ Advanced Computer Architecture Lab † Department of Electrical and Computer Engineering



  1. Assessing SEU Vulnerability via Circuit-Level Timing Analysis • Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky† • ‡Advanced Computer Architecture Lab, University of Michigan • †Department of Electrical and Computer Engineering, University of Texas at Austin

  2. Introduction • There is growing concern about transient faults in combinational logic • Numerous techniques already exist for dealing with the effects of transient faults: • Error Correction Codes (ECC) • DIVA • Simultaneous Redundant Threading (SRT) • and many others • However, these techniques come at a cost in performance, power, die size and design time.

  3. Introduction • Designers have to trade off the reliability provided against the implementation cost • Inadequate soft-error protection: the design may be useless due to poor reliability • Excessive soft-error protection: the design is uncompetitive in cost and/or performance • To balance this trade-off, system designers need accurate soft-error rates (SERs) for their designs • The device community provides raw SERs for devices in current technologies and projections for devices in future technologies • However, architecture-level and circuit-level phenomena derate the raw SER • Accurately assessing a design's SER requires an analysis infrastructure with circuit-level detail

  4. In This Work… • We introduce a high-fidelity, high-performance simulation infrastructure for estimating soft-error rates • It asynchronously injects voltage pulses of various durations at the gate level • It accurately gauges detailed circuit phenomena to model: • fault introduction • fault propagation • and possible fault masking • It simulates fast enough to permit the examination of entire workloads on complex designs (thousands of gates)

  5. Soft Error Masking • Fortunately, not all transient faults cause an error • Circuit and architectural phenomena can prevent a fault from propagating to the design's output and causing an error • Logic masking • Timing masking • Electrical masking • Microarchitectural masking • Software masking

  6. Soft Error Masking • Logic Masking: the fault is blocked by a following gate whose output is completely determined by its other inputs • Timing Masking: the fault affects the input of a latch only during the period in which the latch is not sensitive to its input • Electrical Masking: the fault's pulse is attenuated by subsequent logic gates due to their electrical properties, and does not affect any latch's input • Microarchitectural Masking: the fault alters the value of at least one flip-flop, but the incorrect values are overwritten without being used in any computation that affects the design's output • Software Masking: the fault propagates to the design's output but is subsequently masked by software without affecting the application's correct execution
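The logic-masking and timing-masking definitions above can be illustrated with a minimal Python sketch. It is not taken from the presentation: the function names, the single-gate view of logic masking, and the model of the latch's sensitive period as a fixed window at the end of the clock cycle are all simplifying assumptions made here for illustration.

```python
def and_gate(a, b):
    """Two-input AND gate on 0/1 values."""
    return a & b


def is_logically_masked(gate, inputs, faulty_index):
    """True if flipping one input leaves the gate output unchanged.

    This is the single-gate case of logic masking: the output is
    completely determined by the gate's other inputs.
    """
    good = gate(*inputs)
    flipped = list(inputs)
    flipped[faulty_index] ^= 1  # the transient fault inverts this input
    return gate(*flipped) == good


def is_timing_masked(pulse_start, pulse_width, clock_period, window):
    """True if the pulse misses the latch's sensitive window.

    Assumption (hypothetical model): the latch samples its input only
    during the last `window` time units of the clock period.
    """
    pulse_end = pulse_start + pulse_width
    window_start = clock_period - window
    overlaps = pulse_start < clock_period and pulse_end > window_start
    return not overlaps


# A fault on input 0 of an AND gate is masked when the other input is 0.
print(is_logically_masked(and_gate, (1, 0), 0))
# A short pulse early in the cycle never reaches the sensitive window.
print(is_timing_masked(0.2, 0.2, clock_period=1.0, window=0.1))
```

Under this toy model a longer pulse is more likely to overlap the window, which is in line with the later observation that timing masking matters mainly for short pulses.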

  7. Simulation Infrastructure • Design Under Test: gate-level description of the design (netlist) • Fault-Exposed Model: subjected to fault injection • Golden Model: no faults injected • Model Stimuli: workload traces that exercise the design under test • Fault Generator: injects voltage pulses of various durations at any gate in the design and flips the value of any flip-flop in the design • faults are uniformly distributed in time, location and duration • Fault Analyzer: monitors manifested errors and tracks all the possible ways a fault can be masked
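The golden-model versus fault-exposed-model comparison can be sketched in a few lines of Python. This is a toy illustration, not the authors' infrastructure: the "design" is a hypothetical 4-bit register with write, read and idle operations, only flip-flop bit flips are modelled (no voltage pulses), and the sketch enumerates every (cycle, bit) fault site instead of sampling sites uniformly at random.

```python
def step(state, stimulus):
    """One clock cycle of a toy design: returns (next_state, output)."""
    op, data = stimulus
    if op == "write":
        return data, 0       # register is overwritten, nothing is output
    if op == "read":
        return state, state  # register value reaches the output
    return state, 0          # idle


def run(trace, fault=None):
    """Simulate a trace; `fault` is an optional (cycle, bit) flip."""
    state, outputs = 0, []
    for cycle, stimulus in enumerate(trace):
        if fault is not None and fault[0] == cycle:
            state ^= 1 << fault[1]  # flip one flip-flop before this cycle
        state, out = step(state, stimulus)
        outputs.append(out)
    return outputs


def classify(trace, fault):
    """Compare the fault-exposed run against the golden run."""
    return "masked" if run(trace, fault) == run(trace) else "error"


def error_rate(trace, bits=4):
    """Fraction of all single-bit faults that corrupt the output."""
    sites = [(c, b) for c in range(len(trace)) for b in range(bits)]
    errors = sum(classify(trace, site) == "error" for site in sites)
    return errors / len(sites)


trace = [("write", 5), ("idle", 0), ("write", 3), ("read", 0)]
print(classify(trace, (1, 0)))  # flipped value is overwritten before use
print(classify(trace, (3, 0)))  # flipped value is read out
print(error_rate(trace))
```

In this trace, a flip that lands before a write is overwritten without ever being used, which is the microarchitectural-masking case; only flips that land immediately before the read reach the output.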

  8. Statistical Model for Transient Faults • Pulse-based model for transient faults caused by energetic particle strikes • Faults injected into combinational logic are classified by their duration • 20%, 40%, 60%, 80% and 100% of the design's clock period • Faults injected into sequential elements flip their value • The arrival rate of each type of fault is modeled by a separate random variable • The mean inter-arrival time for each fault type is derived from previously published data and detailed SPICE simulations
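One way to picture "a separate random variable per fault type" is the following Python sketch. The slides do not state the distribution used, so the exponential inter-arrival times are an assumption made here, and the mean values in `HYPOTHETICAL_MEANS` are placeholders, not the published or SPICE-derived figures.

```python
import random

# Fault types: pulse widths as a fraction of the clock period, plus a
# flip-flop bit flip. The means below are placeholder values (in clock
# cycles) chosen only for illustration.
HYPOTHETICAL_MEANS = {
    "pulse_20": 50.0,
    "pulse_40": 80.0,
    "pulse_60": 120.0,
    "pulse_80": 200.0,
    "pulse_100": 300.0,
    "ff_flip": 150.0,
}


def sample_arrivals(mean_interarrival, horizon, rng):
    """Arrival times in [0, horizon) with exponential gaps (assumed)."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_interarrival)
        if t >= horizon:
            return times
        times.append(t)


def build_schedule(mean_by_type, horizon, seed=0):
    """Merge the independent per-type arrival streams into one schedule."""
    rng = random.Random(seed)
    events = []
    for fault_type, mean in mean_by_type.items():
        for t in sample_arrivals(mean, horizon, rng):
            events.append((t, fault_type))
    return sorted(events)


schedule = build_schedule(HYPOTHETICAL_MEANS, horizon=1000.0)
for time, fault_type in schedule[:5]:
    print(f"{time:8.2f}  {fault_type}")
```

Keeping one stream per fault type means each mean inter-arrival time can be set independently, matching the per-type random variables described above.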

  9. Design Under Test – CMP Switch • As the design under test, we chose a single-chip multiprocessor (CMP) interconnection switch (baseline provided by Li-Shiuan Peh) • Much less complex than a microprocessor, yet not too simplistic (it includes finite state machines, buffers, control logic, and buses) • Wormhole switch pipelined at the flit level • Specified in Verilog and synthesized to a gate-level netlist (~9K logic gates and 1,700 sequential elements) • Realistic workload • Communication traces derived from the TRIPS architecture

  10. Characterization per Fault Type • Chart labels: 51.7% logic masking, 42.9% μarch masking, 2.2% timing masking, 3.2% error • High microarchitectural masking • 95% of the faults that flip a flip-flop's value are masked • Timing masking is significant only for faults with short pulse durations • Logic masking increases as the fault's pulse duration decreases

  11. Derating Factor • Derating factor = 1 / error rate • e.g., a derating factor of 30 means that one out of every 30 injected faults causes an error (an error rate of 3.3%) • The average derating factor for realistic workloads is 31 (error rate: 3.2%) • A synthetic high-utilization workload leads to a derating factor of 12 (error rate: 8.3%)
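The derating arithmetic can be checked with a short Python sketch. The two error rates are the ones quoted on this slide; the function names and the raw fault rate in the last line are hypothetical and used only to show how a derating factor scales a raw rate.

```python
def derating_factor(error_rate):
    """Derating factor = 1 / error rate (error rate as a fraction)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error rate must be in (0, 1]")
    return 1.0 / error_rate


def derated_rate(raw_rate, error_rate):
    """Effective failure rate once masking is accounted for."""
    return raw_rate / derating_factor(error_rate)


print(round(derating_factor(0.032)))  # realistic workloads
print(round(derating_factor(0.083)))  # synthetic high-utilization workload

# Hypothetical raw rate of 1000 faults per unit time, for illustration only.
print(derated_rate(1000.0, 0.032))
```

The unrounded values are 31.25 and about 12.05, which round to the factors of 31 and 12 reported above.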

  12. Failure Rate Projections • Taking into account projections from ITRS and raw SER estimates for future process technologies, we make failure rate projections considering the transient-fault derating effects • Design architecture is kept intact for future process technologies • Two different designs: • one clocked with the projected clock frequencies for microprocessors • and one clocked with the projected clock frequencies for interconnection networks

  13. Transient-fault Vulnerability per Component • We observed that each switch component exhibits a different vulnerability to transient faults • Derating effects depend greatly on the component's characteristics • Most vulnerable component • Switch Arbiter (12.8% error) • 6% of the switch's area • Input Controllers • dominate the switch design • 86% of the switch's area • The switch's vulnerability matches that of the input controllers

  14. Effects of Multi-fault Strikes • A single strike may cause multiple faults on neighbouring gates or flip-flops • There is a lack of data on the frequency of such events and of models for multi-fault strikes on logic gates and flip-flops • We therefore assume that every strike causes multiple faults • extremely pessimistic • Even under this severe environment, the failure rates are relatively low

  15. Conclusions – Directions for Future Work • Conclusions • For complex designs there is significant fault masking, with derating factors as high as 30 • Soft-error derating effects depend strongly on the design's characteristics and utilization • Our observations suggest that the soft-error reliability threat might have been overstated by the computer architecture community • Designers need to evaluate their design's soft-error tolerance with detailed analysis tools that consider circuit-level derating effects, so that they can better trade off the protection provided against the implementation cost • Future Work • Study the soft-error derating effects for several designs with different levels of complexity and different characteristics • Enhance our simulation infrastructure so that it can simulate large, high-complexity systems (millions of gates) with short simulation runs

  16. Questions?
