Rajaraman R, Jie Hu. An Architectural Framework for Evaluating the Impact of Soft Errors in Arithmetic Units
Overview • Introduction • Circuit level estimation (Qcritical) • Single bit adders • Four bit adders • Results and optimizations • Converting Qcritical to SER • Architectural simulations • Results and solutions • Conclusion and future work
Introduction • Data paths and combinational logic are inherently resistant to soft errors due to [Shivakumar ’02]: • Logical masking (effect stays the same as technology scales) • Electrical masking (effect reduces as technology scales) • Latching window masking (effect reduces as technology scales) • Their susceptibility is increasing as: • Pipeline depth increases • Devices scale
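As a rough illustration of logical masking, the sketch below flips one input of a two-input gate and counts how often the output stays unchanged. The helper name `masked_fraction` and the choice of gates are assumptions made for this example; they do not come from the slides.

```python
from itertools import product

def and_gate(a: int, b: int) -> int:
    return a & b

def xor_gate(a: int, b: int) -> int:
    return a ^ b

def masked_fraction(gate) -> float:
    """Fraction of single-input flips that leave the gate output unchanged."""
    masked = 0
    total = 0
    for a, b in product((0, 1), repeat=2):
        good = gate(a, b)
        # Flip input a, then input b, one at a time.
        for faulty in ((a ^ 1, b), (a, b ^ 1)):
            total += 1
            if gate(*faulty) == good:
                masked += 1
    return masked / total

if __name__ == "__main__":
    print("AND:", masked_fraction(and_gate))
    print("XOR:", masked_fraction(xor_gate))
```

An AND gate hides a flipped input whenever its other input is 0, while an XOR gate never hides one, so XOR-heavy logic such as adder sum paths gets little help from logical masking.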
Introduction • In this work we present: • Circuit level estimation of soft errors for: • Single bit adders • Four bit adders • A discussion of solutions based on concurrent error detection and other techniques • Architectural simulations (Jie Hu) • Architectural solutions for data path error detection and correction
Circuit level estimation • Qcritical is estimated with a flip-flop (FF) at the output • Here we estimate the Qcritical for: • Single bit adders • Mirror adder • Transmission gate based adder • Half adder based full adder • XOR based full adder • Four bit adders • Ripple carry adder • Carry skip adder • Prefix adder (Brent-Kung)
Single bit adders • [Figure: schematics of the mirror adder and the transmission gate adder, with the nodes evaluated for Qcritical marked]
Single bit adders • [Figure: schematics of the half adder based FA and the XOR based FA, with inputs A, B, Cin, outputs SUM, Cout, and the nodes evaluated for Qcritical marked]
Four bit adders • Ripple carry adder • A flip at the lowest FA cell needs a very high Qcritical to affect the MSB • But in the worst case it affects all sum bits • May often result in multi-bit errors
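The worst-case multi-bit behavior of the ripple carry adder can be sketched at the logic level. This is a behavioral model only: it ignores Qcritical and electrical effects, and the function names and the `flip_carry_out_of` parameter are assumptions introduced for illustration.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a: int, b: int, width: int = 4, flip_carry_out_of=None):
    """Add two width-bit values; optionally invert the carry out of one FA cell."""
    carry = 0
    total = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        if flip_carry_out_of == i:
            carry ^= 1  # injected upset on the carry leaving cell i
        total |= s << i
    return total, carry

def corrupted_sum_bits(a: int, b: int, cell: int, width: int = 4):
    """Indices of sum bits that differ from the fault-free result."""
    good, _ = ripple_carry_add(a, b, width)
    bad, _ = ripple_carry_add(a, b, width, flip_carry_out_of=cell)
    return [i for i in range(width) if ((good ^ bad) >> i) & 1]

if __name__ == "__main__":
    # Every higher cell propagates the wrong carry: all upper sum bits change.
    print(corrupted_sum_bits(0b0111, 0b0000, cell=0))
    # The next cell absorbs the wrong carry: only one sum bit changes.
    print(corrupted_sum_bits(0b0000, 0b0000, cell=0))
```

How far the error spreads depends on the operands: the flipped carry keeps rippling only through cells whose propagate signal (A xor B) is 1.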
Four bit adders • Carry skip (bypass) adder • Has a faster block propagate signal and logic, which will have lower Qcritical • But produces fewer multi-bit errors
Four bit adders • Brent-Kung adder • In the worst case, Qcritical for S3 might be lower than for the RCA but higher than for the CSA (carry skip adder) • The trade-off between multi-bit errors and Qcritical value could be studied for different prefix adder designs • [Figure: 4-bit Brent-Kung prefix adder with inputs (A0, B0) to (A3, B3) and outputs S0 to S3]
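Results • [Figure: results plot comparing the HA based FA, mirror, and TG based adders]
Results • [Figure: results plot comparing the mirror, HA based FA, and TG based adders]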
Optimization techniques • Concurrent error detection (CED) techniques will work well for these adder designs • [Mitra’00] proposes that design diversity results in more robust designs • Given the existing trade-offs among the various adder designs, diversity could be used to build a robust CED design • Other techniques include: • Arithmetic coding techniques such as carry checking/parity prediction adders [Nicolaidis’03] • Other redundancy techniques such as time redundancy [Nicolaidis’99]
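A simplified view of parity prediction, assuming a plain 4-bit adder: since each sum bit is A_i xor B_i xor C_i, the parity of the sum must equal parity(A) xor parity(B) xor parity(carries into each bit). The sketch below checks that identity. It is not the full scheme of [Nicolaidis’03]; in particular, it does not check the carries themselves, and all function names are hypothetical.

```python
def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def add_with_carries(a: int, b: int, width: int = 4):
    """Return (sum, carries); bit i of carries is the carry INTO bit i."""
    carry = 0
    total = 0
    carries = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carries |= carry << i
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return total, carries

def parity_check_ok(a: int, b: int, observed_sum: int, carries: int) -> bool:
    """Compare the observed sum parity against the predicted parity."""
    predicted = parity(a) ^ parity(b) ^ parity(carries)
    return parity(observed_sum) == predicted

if __name__ == "__main__":
    s, c = add_with_carries(5, 3)
    print(parity_check_ok(5, 3, s, c))            # fault-free sum
    print(parity_check_ok(5, 3, s ^ 0b0100, c))   # one sum bit flipped
```

A flip on a single sum bit changes the sum parity and is flagged. A flip on a carry changes both the sum and the carry vector consistently, so this simple check misses it, which is why a separate carry check is needed in practice.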
Converting Qcritical to SER • We know: • SER ∝ Nflux × CS × exp(−Qcritical / Qs) [Hazucha, 2000] • Nflux: neutron flux (difficult to find) • CS: cross-sectional area • Qcritical: critical charge necessary for a bit flip • Qs: charge collection efficiency (difficult to find) • Thus Qcritical is the easiest of these quantities to determine • Working on finding other metrics to find SER …
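Because Nflux and Qs are hard to obtain, one option is to compare designs by relative SER. The sketch below uses the form SER ∝ Nflux × CS × exp(−Qcritical / Qs) from the Hazucha and Svensson model; the numeric values in the usage lines are placeholders rather than measured data, and the function names are assumptions.

```python
import math

def relative_ser(q_critical: float, q_s: float,
                 cross_section: float = 1.0, n_flux: float = 1.0) -> float:
    """SER up to a constant factor: Nflux * CS * exp(-Qcritical / Qs)."""
    return n_flux * cross_section * math.exp(-q_critical / q_s)

def ser_ratio(q_a: float, q_b: float, q_s: float) -> float:
    """SER of node A over SER of node B, assuming equal flux and cross section."""
    return math.exp(-(q_a - q_b) / q_s)

if __name__ == "__main__":
    # Placeholder values in arbitrary but consistent charge units.
    q_s = 5.0
    print(ser_ratio(10.0, 20.0, q_s))
```

Under the equal-flux, equal-area assumption, the unknown Nflux and CS cancel in the ratio, so only Qs still needs to be estimated; a node with lower Qcritical always shows a higher relative SER.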
References (Circuits) • [Nicolaidis’99] Nicolaidis, M.; “Time redundancy based soft-error tolerance to rescue nanometer technologies”, Proceedings of 17th IEEE VLSI Test Symposium, 25-29 April 1999, Page(s): 86-94 • [Nicolaidis’03] Nicolaidis, M.; “Carry checking/parity prediction adders and ALUs”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume: 11, Issue: 1, Feb. 2003, Page(s): 121-128 • [Mitra’00] Mitra, S.; McCluskey, E.J.; “Which concurrent error detection scheme to choose?”, Proceedings of International Test Conference, 3-5 Oct. 2000, Page(s): 985-994 • [Shivakumar ’02] Shivakumar, P.; Kistler, M.; Keckler, S.W.; Burger, D.; Alvisi, L.; “Modeling the effect of technology trends on the soft error rate of combinational logic”, Proceedings of International Conference on Dependable Systems and Networks, 23-26 June 2002, Page(s): 389-398 • [Hazucha, 2000] Hazucha, P.; Svensson, C.; “Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate”, IEEE Transactions on Nuclear Science, Vol. 47, No. 6, Dec. 2000.
Evaluating the Impact of Soft Errors on the Processor Datapath: A Focus on Integer Functional Units
Motivation • Soft errors: a big reliability problem in processor design • Processor components are more susceptible to soft errors in new technologies • Cache structures can be well protected by parity, ECC, etc. • Combinational logic: time/space redundancy • Plenty of work on error detection/recovery [6][5][4] • How do soft errors in combinational logic affect the system? • Are there better, more cost-effective reliable designs?
A Focus on Functional Units • Integer functional units: have a wide range of impact on program execution • Conditions of branches • Addresses of data references • Addresses of function references • Floating-point functional units • Mostly for numerical operations • Less impact on program execution • Today’s focus: Int. ALU (Adder, Logic), Int. MULT/DIV
Error Injection Scheme • Error injection based on hardware: • Needs circuit details of the functional units • Different processors may use different design styles • Difficult to get the error-infected results at the architectural level • A more effective way: • Introduce soft errors at one of the operation's source operands • Restore the original source operand value if the result register number differs from that source register • Experimental scheme: • Only consider SEUs (single event upsets) • Simulate a maximum of 0.5 billion committed instructions
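The operand-level injection described above might look like the following sketch for an addition. The dictionary register file, the 32-bit width, and the function name `add_with_seu` are assumptions made for illustration; the simulator used in the study is not shown in the slides.

```python
import random

def add_with_seu(regs: dict, dst: str, src1: str, src2: str,
                 width: int = 32, rng=random) -> int:
    """Execute dst = src1 + src2 with a single bit flip in the src1 operand.

    Returns the index of the flipped bit.
    """
    mask = (1 << width) - 1
    original = regs[src1]
    operand2 = regs[src2]                 # read first, in case src2 == src1
    bit = rng.randrange(width)
    regs[src1] = original ^ (1 << bit)    # inject the single event upset
    result = (regs[src1] + operand2) & mask
    if dst != src1:
        regs[src1] = original             # restore the clean source value
    regs[dst] = result
    return bit

if __name__ == "__main__":
    regs = {"r1": 5, "r2": 7, "r3": 0}
    flipped = add_with_seu(regs, "r3", "r1", "r2", rng=random.Random(0))
    print(flipped, regs)
```

Restoring the source register keeps the fault confined to the result of the targeted operation, so the corrupted operand does not leak into later instructions that read the same register.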
Addition Operations • Inject soft errors into addition operations at a fixed interval (10,000 cycles) until program execution crashes • Error accumulation results in program crashes • Different applications have different resistance to errors • Additional experiment: a single error injected at different cycle times did not cause a crash
Addition: Uniform Error Rate • Introduce soft errors at different uniformly distributed probabilities • For all benchmarks, error rate of 0.0001 is the most sensitive point
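One way to read "uniform error rate" is that each executed operation is corrupted independently with the same probability. The sketch below selects injection points under that assumption; the function name, the seed handling, and the operation counts in the usage lines are hypothetical.

```python
import random

def injection_points(num_ops: int, error_rate: float, seed: int = 0) -> list:
    """Indices of operations chosen for injection, each with probability error_rate."""
    rng = random.Random(seed)
    return [i for i in range(num_ops) if rng.random() < error_rate]

if __name__ == "__main__":
    # Illustrative sweep over a few error rates on 100,000 operations.
    for rate in (0.01, 0.001, 0.0001):
        points = injection_points(100_000, rate)
        print(rate, len(points))
```

Fixing the seed makes a run repeatable, which helps when comparing how different benchmarks react to the same injection pattern.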
Logic: Uniform Error Rate • 175.vpr (FPGA placement and routing) is more sensitive to errors occurring during logic operations • 256.bzip2 (compression) can survive a large number of errors
ALU: Uniform Error Rate • The combined effect of errors in both addition and logic operations • All benchmarks except 175.vpr show exacerbated behavior
MULT/DIV: Uniform Error Rate • In general, programs can still survive errors in MULT/DIV operations, because these operations are fewer in number and less closely tied to program execution control
Conclusions and Ongoing Work • Conclusions: • Errors in different Int. operations have different impacts on program execution • Different programs behave differently under error injection • Control-intensive (lower IPB) applications are more sensitive to logic operation errors • Multiplication/division operations have less impact on program execution • Future work • More detailed characterization of program behavior under error impact • Modeling the soft error rate from Qcritical for arithmetic units… • Use the above information to develop some selective error protection/detection/recovery schemes…
References • [1] Ghani A. Kanawati, Nasser A. Kanawati, and Jacob A. Abraham. FERRARI: A Flexible Software-Based Fault and Error Injection System. IEEE Transactions on Computers, 44(2):248-260, February 1995. • [2] S. Mitra and E. J. McCluskey. Which concurrent error detection scheme to choose? In Proceedings of International Test Conference, pages 985-994, October 2000. • [4] Nahmsuk Oh, Subhasish Mitra, and Edward J. McCluskey. ED4I: Error Detection by Diverse Data and Duplicated Instructions. IEEE Transactions on Computers, 51(2):180-199, February 2002. • [5] Joydeep Ray, James C. Hoe, and Babak Falsafi. Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery. In Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001. • [6] E. Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the 29th Fault-Tolerant Computing Symposium, June 1999. • [8] J. F. Ziegler et al. IBM experiments in soft fails in computer electronics (1978-1994). IBM Journal of Research and Development, 40(1):3-18, 1996.