Assessing SEU Vulnerability via Circuit-Level Timing Analysis • Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky† • ‡Advanced Computer Architecture Lab, University of Michigan • †Department of Electrical and Computer Engineering, University of Texas at Austin
Introduction • There is growing concern about transient faults in combinational logic • Numerous techniques already exist that deal with the effects of transient faults: • Error Correction Codes (ECC) • DIVA • Simultaneous and Redundantly Threaded processors (SRT) • and many others… • However, these techniques come at a cost in performance, power, die size and design time.
Introduction • Designers have to trade off the reliability provided against the implementation cost • Inadequate soft-error protection: the design may be useless due to poor reliability • Excessive soft-error protection: the design is uncompetitive in cost and/or performance • To balance this trade-off, system designers need accurate soft-error rates (SERs) for their designs • The device community provides raw SERs for devices of current technologies and projections for devices of future technologies • However, architecture-level and circuit-level phenomena derate the raw SER • Accurately assessing a design’s SER requires a detailed circuit-level analysis infrastructure
In This Work… • We introduce a high-fidelity, high-performance simulation infrastructure for estimating soft-error rates • It asynchronously injects voltage pulses of various durations at the gate level • It accurately captures detailed circuit phenomena to model: • fault introduction • fault propagation • and possible fault masking • It simulates fast enough to permit the examination of entire workloads on complex designs (thousands of gates)
Soft Error Masking • Fortunately, not all transient faults cause an error • Circuit and architectural phenomena can prevent a fault from propagating to the design’s output and causing an error • Logic masking • Timing masking • Electrical masking • Microarchitectural masking • Software masking
Soft Error Masking • Logic Masking: the fault is blocked by a following gate whose output is completely determined by its other inputs • Timing Masking: the fault affects the input of a latch only during the period in which the latch is not sensitive to its input • Electrical Masking: the fault’s pulse is attenuated by subsequent logic gates due to their electrical properties, and does not affect any latch’s input • Microarchitectural Masking: the fault alters the value of at least one flip-flop, but the incorrect values are overwritten without being used in any computation affecting the design’s output • Software Masking: the fault propagates to the design’s output but is subsequently masked by software without affecting the application’s correct execution
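The first two masking mechanisms can be illustrated with a minimal Python sketch. This is not the authors' tool: the single AND gate, the function names, and the idea of expressing the latch's sensitive window as a fraction of the clock period are assumptions made purely for illustration.

```python
def and_gate(a: int, b: int) -> int:
    """Two-input AND gate on single-bit values."""
    return a & b


def logic_masked(side_input: int) -> bool:
    """Return True if a glitch on one AND input is blocked by the other input.

    The glitch flips the first input from 1 to 0. When the side input is 0,
    the gate output is 0 either way, so the glitch cannot propagate.
    """
    clean = and_gate(1, side_input)
    glitched = and_gate(0, side_input)
    return clean == glitched


def timing_masked(pulse_start: float, pulse_width: float,
                  window_start: float, window_end: float) -> bool:
    """Return True if the pulse never overlaps the latch's sensitive window.

    All times are fractions of one clock period (hypothetical units).
    """
    pulse_end = pulse_start + pulse_width
    return pulse_end <= window_start or pulse_start >= window_end


if __name__ == "__main__":
    print(logic_masked(0), logic_masked(1))
    print(timing_masked(0.10, 0.20, 0.90, 1.00))
    print(timing_masked(0.85, 0.20, 0.90, 1.00))
```

A longer pulse is more likely to overlap the sensitive window, which is consistent with timing masking mattering mostly for short pulses.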
Simulation Infrastructure • Design Under Test: gate-level description of the design (netlist) • Fault-Exposed Model: subjected to fault injection • Golden Model: no faults injected • Model Stimuli: workload traces that exercise the design under test • Fault Generator: injects voltage pulses of various durations at any gate in the design and flips the value of any flip-flop in the design • faults are uniformly distributed in time, location and duration • Fault Analyzer: monitors manifested errors and tracks all the possible ways a fault can be masked
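The golden versus fault-exposed comparison can be sketched on a toy design. The single-bit register, the trace format, and the names `run` and `classify` below are hypothetical and stand in for the gate-level netlist and fault analyzer described on the slide; the sketch only shows the idea of running the same stimuli twice and comparing outputs.

```python
from typing import List, Optional, Tuple

# One stimulus per cycle: (write_enable, data_bit, read_enable). Hypothetical format.
Stimulus = Tuple[bool, int, bool]


def run(trace: List[Stimulus], flip_cycle: Optional[int] = None) -> List[Optional[int]]:
    """Simulate a one-bit register; optionally flip it at the start of flip_cycle."""
    reg = 0
    outputs: List[Optional[int]] = []
    for cycle, (write_enable, data, read_enable) in enumerate(trace):
        if cycle == flip_cycle:
            reg ^= 1  # injected fault: flip the stored bit
        # The read observes the value held before this cycle's write.
        outputs.append(reg if read_enable else None)
        if write_enable:
            reg = data
    return outputs


def classify(trace: List[Stimulus], flip_cycle: int) -> str:
    """Compare the fault-exposed run against the golden (fault-free) run."""
    golden = run(trace)
    faulty = run(trace, flip_cycle)
    return "error" if golden != faulty else "masked"


if __name__ == "__main__":
    trace = [(True, 1, False), (False, 0, False), (True, 0, False), (False, 0, True)]
    print(classify(trace, flip_cycle=1))  # corrupted value is overwritten before it is read
    print(classify(trace, flip_cycle=3))  # corrupted value is read out
```

In the first call the flipped bit is overwritten before any read, which is a small-scale instance of microarchitectural masking.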
Statistical Model for Transient Faults • Pulse-based model for transient faults caused by energetic particle strikes • Faults injected into combinational logic are classified by their duration • 20%, 40%, 60%, 80% and 100% of the design’s clock period • Faults injected into sequential elements flip their value • The arrival rate of each type of fault is modeled by a separate random variable • The mean inter-arrival time for each fault type is derived from previously published data and detailed SPICE simulations
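A fault generator along these lines might look like the sketch below. The slide only says that each fault type has its own random variable; the exponential inter-arrival distribution, the `Fault` record, and the mean inter-arrival values used in the example are assumptions for illustration, not figures from the work.

```python
import random
from dataclasses import dataclass
from typing import Dict, List

# Pulse durations as fractions of the clock period (the five classes on the slide).
DURATION_FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)


@dataclass
class Fault:
    time: float      # injection time, in clock periods
    gate: int        # index of the struck gate
    duration: float  # pulse duration, fraction of the clock period


def generate_faults(mean_interarrival: Dict[float, float], num_gates: int,
                    horizon: float, seed: int = 0) -> List[Fault]:
    """Draw an independent arrival stream per duration class (assumed exponential)."""
    rng = random.Random(seed)
    faults: List[Fault] = []
    for frac in DURATION_FRACTIONS:
        rate = 1.0 / mean_interarrival[frac]  # expovariate takes a rate, not a mean
        t = rng.expovariate(rate)
        while t < horizon:
            # Location is uniform over all gates.
            faults.append(Fault(t, rng.randrange(num_gates), frac))
            t += rng.expovariate(rate)
    return sorted(faults, key=lambda f: f.time)


if __name__ == "__main__":
    means = {frac: 10.0 for frac in DURATION_FRACTIONS}  # hypothetical values
    for fault in generate_faults(means, num_gates=9000, horizon=50.0)[:5]:
        print(fault)
```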
Design Under Test – CMP Switch • As the design under test we chose a chip multiprocessor (CMP) interconnection switch (baseline provided by Li-Shiuan Peh) • Much less complex than a microprocessor, yet not too simplistic (it includes finite state machines, buffers, control logic, and buses) • Wormhole switch pipelined at the flit level • Specified in Verilog and synthesized to a gate-level netlist (~9K logic gates and 1,700 sequential elements) • Realistic workload • Communication traces derived from the TRIPS architecture
Characterization per Fault Type • Chart labels: 51.7% logic masking, 2.2% timing masking, 42.9% μarch masking, 3.2% error • High microarchitectural masking • 95% of the faults that flip a flip-flop’s value are masked • Timing masking is significant only for faults with short pulse durations • Logic masking increases as the fault’s pulse duration decreases
Derating Factor • Derating factor = 1 / error rate • e.g. a derating factor of 30 means that one of every 30 injected faults will cause an error (an error rate of 3.3%) • The average derating factor for realistic workloads is 31 (error rate: 3.2%) • A synthetic high-utilization workload leads to a derating factor of 12 (error rate: 8.3%)
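The arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative only; the derating factors 30, 31 and 12 are the ones quoted on the slide.

```python
def derating_factor(injected_faults: int, errors: int) -> float:
    """Derating factor = injected faults / faults that caused an error."""
    if errors <= 0:
        raise ValueError("derating factor is undefined when no fault causes an error")
    return injected_faults / errors


def error_rate_percent(derating: float) -> float:
    """Error rate, in percent, as the reciprocal of the derating factor."""
    return 100.0 / derating


if __name__ == "__main__":
    for factor in (30, 31, 12):
        print(factor, round(error_rate_percent(factor), 1))
```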
Failure Rate Projections • Taking into account projections from ITRS and raw SER estimates for future process technologies, we make failure rate projections considering the transient-fault derating effects • Design architecture is kept intact for future process technologies • Two different designs: • one clocked with the projected clock frequencies for microprocessors • and one clocked with the projected clock frequencies for interconnection networks
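One simple way to read "considering the derating effects" is to divide the design's raw failure rate by the derating factor. The sketch below assumes exactly that relationship; the per-element raw rate and the function name are hypothetical placeholders, not ITRS figures or results from this work. Only the element counts (about 9,000 gates plus 1,700 sequential elements) and the derating factor of 31 come from the slides.

```python
def derated_failure_rate(raw_rate_per_element: float, num_elements: int,
                         derating: float) -> float:
    """Effective failure rate = (raw per-element rate * element count) / derating.

    Units follow raw_rate_per_element (for example, FIT per element).
    """
    if derating <= 0:
        raise ValueError("derating factor must be positive")
    return raw_rate_per_element * num_elements / derating


if __name__ == "__main__":
    elements = 9000 + 1700       # gates plus sequential elements, from the slides
    raw_rate = 1e-3              # hypothetical raw rate per element
    print(derated_failure_rate(raw_rate, elements, derating=31))
```

Scaling `raw_rate` per technology node while holding `elements` fixed mirrors the slide's statement that the design architecture is kept intact across process technologies.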
Transient-fault Vulnerability per Component • We observed that each switch component exhibited a different vulnerability to transient faults • Derating effects depend greatly on the component’s characteristics • Most vulnerable component • Switch Arbiter (12.8% error) • 6% of the switch’s area • Input Controllers • dominate the switch design • 86% of the switch’s area • The switch’s overall vulnerability matches that of the input controllers
Effects of Multi-fault Strikes • A single strike may cause multiple faults on neighbouring gates or flip-flops • There is a lack of data on the frequency of such events and of models for multi-fault strikes on logic gates and flip-flops • We assume that every strike causes multiple faults • This is extremely pessimistic • Even in this severe environment the failure rates are relatively low
Conclusions – Directions for Future Work • Conclusions • For complex designs there is significant fault masking, with derating factors as high as 30 • Soft-error derating effects depend heavily on the design’s characteristics and utilization • Our observations suggest that the soft-error reliability threat might have been overstated by the computer architecture community • Designers need to evaluate their design’s soft-error tolerance with detailed analysis tools that consider circuit-level derating effects, and to better trade off the protection provided against the implementation cost • Future Work • Study the soft-error derating effects for several designs with different levels of complexity and different characteristics • Enhance our simulation infrastructure so that it can simulate large, high-complexity systems (millions of gates) with short simulation runs