This paper discusses the importance of protecting data in HPC systems to reduce failures due to soft errors. It covers soft error definitions, rates, detection, recovery methods, and related works. Various redundancy techniques like ECC and TMR are explored, along with the impact on cost, performance, and power consumption. The study emphasizes the criticality of certain data over others and proposes a methodology for selective protection to optimize reliability, power efficiency, and performance in HPC environments.
SE-Aware HPC Extension: Selective Data Protection for Reducing Failures Due to Soft Errors • 7/20/2006 • Kyoungwoo Lee
Contents • Motivation • Previous Work • Current Work • Next Step
Soft Error • What is a soft error? • Why are soft errors important? • How do we recover from soft errors?
Definition of Soft Error • Soft Error (SE) • Transient Fault = Bit Flip = Single Event Upset (SEU) • A charged particle strikes an electronic circuit and changes the amount of charge stored at a sensitive node, thereby changing the logic state (e.g., ‘0’ to ‘1’ or vice versa) • Random, non-catastrophic, non-destructive, recoverable • Caused by Radiation • Neutrons • Alpha particles • High-energy cosmic rays • Solar particles • Source: Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005
Soft Error Rate • Soft Error Rate (SER) • FIT: Failures in Time, the number of failures per one billion device-hours • e.g.: 1,000 FIT per Mbit ≈ 114 years MTTF (Mean Time To Failure) • SER ∝ Nflux × CS × exp(−Qcritical / QS) • Nflux: intensity of the neutron flux • CS: area of the cross section of the node • QS: charge collection efficiency • Qcritical: minimum charge required for a cell to retain data; Qcritical = C × V, where C is capacitance and V is voltage
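The FIT-to-MTTF conversion and the exponential dependence on Qcritical above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, a year is taken as 365 days, and the capacitance, voltage, and QS values in the second example are arbitrary placeholders (not from the slides), chosen only to show the trend when Nflux and CS are held fixed.

```python
import math

HOURS_PER_BILLION = 1e9      # 1 FIT = 1 failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365    # assumption: 365-day year


def fit_to_mttf_years(fit: float) -> float:
    """Convert a failure rate in FIT to mean time to failure in years."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit / HOURS_PER_YEAR


def relative_ser(capacitance: float, voltage: float, q_s: float) -> float:
    """Unnormalized exp(-Qcritical / QS) term, with Qcritical = C * V.

    Nflux and CS are treated as constants, so only ratios of this
    value are meaningful.
    """
    q_critical = capacitance * voltage
    return math.exp(-q_critical / q_s)


# Slide example: 1,000 FIT per Mbit is about 114 years MTTF.
print(f"{fit_to_mttf_years(1000):.0f} years")

# Hypothetical values (arbitrary units): scaling V from 1.0 to 0.8
# with C = 1 and QS = 0.2 raises the SER term by a factor of e.
ratio = relative_ser(1.0, 0.8, 0.2) / relative_ser(1.0, 1.0, 0.2)
print(f"SER grows by a factor of {ratio:.2f}")
```

Because the model is only a proportionality, absolute SER values cannot be read from it; only the ratio between two operating points is meaningful.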
Importance of SE • Critical SE • High Integration and Density • e.g.: 1 GB memory at 1,000 FIT per Mbit → 8 × 10^6 FIT per memory → about 5 days MTTF • Technology Advancements • e.g.: 1,000 FIT per Mbit in 0.18 µm technology → 10,000 to 100,000 FIT per Mbit in 0.13 µm technology • Latitude and Altitude • e.g.: 10 to 100 times higher SER at flight altitude than at ground level • Voltage Scaling • e.g.: lower voltage decreases Qcritical, which increases SER exponentially
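The 1 GB example follows from scaling the per-Mbit rate by the memory size. The sketch below reproduces that arithmetic, assuming 1 GB = 8,192 Mbit (binary units) and that per-Mbit failure rates simply add; the function names are illustrative.

```python
MBITS_PER_GB = 8 * 1024  # assumption: 1 GB = 1024 MB = 8192 Mbit


def memory_fit(size_gb: float, fit_per_mbit: float) -> float:
    """Total FIT of a memory, assuming per-Mbit rates add linearly."""
    return size_gb * MBITS_PER_GB * fit_per_mbit


def mttf_days(total_fit: float) -> float:
    """Mean time to failure in days for a given total FIT."""
    return 1e9 / total_fit / 24


total = memory_fit(1, 1000)  # roughly 8 x 10^6 FIT
print(f"{total:.2e} FIT, MTTF of about {mttf_days(total):.1f} days")
```

The same per-Mbit rate that gives a century-scale MTTF for one Mbit gives an MTTF of days once a whole gigabyte is considered.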
SER Trend • [Figure: SER trends. Panels: A. DRAM, B. SRAM, C. Core Logic, D. Contributions in Processors] • Sources: Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005; S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, 2005
SE Detection and Recovery • Information Redundancy • e.g.: ECC (Error Correction Coding) and Parity • Hardware Redundancy • e.g.: TMR (Triple Modular Redundancy) • Temporal Redundancy • e.g.: Checkpointing and Recovery • Effects of Redundancy on Cost, Performance and Power • e.g.: ECC • Implemented with a Hamming code (250 nm) • Coding/decoding modules and extra bits • 1.45 ns for coding and 2.66 ns for decoding • 14.5 mW for coding and 26.3 mW for decoding • [Figure: ECC data path: coding, data plus extra bits, decoding] • Source: L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004
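To make information redundancy and hardware redundancy concrete, here is a minimal Python sketch of a Hamming(7,4) single-error-correcting code and a bitwise TMR majority voter. The slides do not state the word width of the Hamming code that was measured, so the (7,4) variant is an assumption chosen for brevity, and the function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data bits, syndrome)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


word = [1, 0, 1, 1]
codeword = hamming74_encode(word)
codeword[4] ^= 1  # emulate a soft error: flip one stored bit
recovered, position = hamming74_decode(codeword)
print(recovered == word, position)

# One of the three copies is corrupted; the vote restores the original word.
print(bin(tmr_vote(0b1011, 0b1111, 0b1011)))
```

The three parity bits for every four data bits correspond to the "extra bits" cost in the slide, and the XOR trees in the encoder and decoder are where the coding and decoding latency and power arise.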
Related Works • Reliability and Power Management • Dr. D. Mossé's group at Univ. of Pittsburgh • D. Zhu, R. Melhem, and D. Mossé, “The Effects of Energy Management on Reliability in Real-Time Embedded Systems,” Proc. of ICCAD, Nov. 2004. • D. Zhu, R. Melhem, D. Mossé, and E. Elnozahy, “Analysis of an Energy Efficient Optimistic TMR Scheme,” Proc. of ICPDS, Jul. 2004. • Dr. G. De Micheli's group at Stanford Univ. • K. Mihic, T. Simunic, and G. De Micheli, “Reliability and Power Management of Integrated Systems,” Proc. of EuroMicro Systems on Digital System Design, 2004. • T. Simunic, K. Mihic, and G. De Micheli, “Optimization of Reliability and Power Consumption in Systems on a Chip,” Proc. of PATMOS, 2005. • (Cache) Architecture • Dr. M. J. Irwin and Dr. N. Vijaykrishnan's group at PSU • L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004. • Dr. S. M. Reddy's group at Univ. of Iowa • Y. Cai, M. T. Schmitz, A. Ejlali, B. M. Al-Hashimi, and S. M. Reddy, “Cache Size Selection for Performance, Energy and Reliability of Time-Constrained Systems,” Proc. of ASP-DAC, 2006. • Soft Error and Core Logic • Intel • S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, pp. 43-51, 2005. • Dr. K. Roy's group at Purdue Univ. • A. Goel, S. Bhunia, H. Mahmoodi, and K. Roy, “Low-Overhead Design of Soft-Error-Tolerant Scan Flip-Flops with Enhanced-Scan Capability,” Proc. of ASP-DAC, 2006.
Overhead of ECC in Cache • Power: up to 22% power overhead • Performance: 95% access-time overhead • Area: more than 25% area overhead • [Figure: protected cache (coding, data plus extra bits, decoding) vs. unprotected cache (data only)]
Our Approach • HPC offers low performance overhead as well as high energy savings • Not all data are equally critical with respect to failures • e.g.: pixel data in video applications matter little for quality and reliability, while quantization values matter more • Provide reliability comparable to that of a protected cache with small power and performance overheads • [Figure: protected cache, unprotected cache, and selective data protection with HPC]
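A rough way to see why selective protection saves energy is to charge the ECC cost only to accesses that touch protected data. The sketch below is a toy model, not the evaluation method of this work. It takes the Hamming-code figures quoted earlier from Li et al. (14.5 mW and 1.45 ns for coding, 26.3 mW and 2.66 ns for decoding) and assumes that energy per operation is power times latency, that every write is encoded and every read decoded, and that protected data receive the same read/write mix as the rest. The access counts and the 10% protected fraction are hypothetical.

```python
# Energy per ECC operation: mW x ns = pJ.
E_CODE_PJ = 14.5 * 1.45     # energy per encoded write
E_DECODE_PJ = 26.3 * 2.66   # energy per decoded read


def ecc_energy_pj(reads: int, writes: int, protected_fraction: float) -> float:
    """ECC energy when only a fraction of accesses hit protected data."""
    if not 0.0 <= protected_fraction <= 1.0:
        raise ValueError("protected_fraction must be in [0, 1]")
    return protected_fraction * (writes * E_CODE_PJ + reads * E_DECODE_PJ)


# Hypothetical workload: 1,000,000 reads and 250,000 writes.
full = ecc_energy_pj(1_000_000, 250_000, 1.0)       # everything protected
selective = ecc_energy_pj(1_000_000, 250_000, 0.1)  # 10% of accesses protected
print(f"selective / full = {selective / full:.2f}")
```

In this model the ECC energy scales linearly with the protected fraction, so the savings depend entirely on how small the critical data set can be made without raising the failure rate.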
Previous Work • “Mitigating Soft Error Failures for Multimedia Applications using Selective Data Protection,” submitted to CASES 2006 • Place multimedia data in the SE-unprotected main cache and all other data in the SE-protected mini cache • Achieves failure rates comparable to those of an SE-protected-only cache, with significant reductions in energy, area, and performance overheads
Current Work • The main objective is to extend this idea to general applications • How to partition data into important and unimportant sets • Intensive simulation study • Cache Active Time: how long a page stays in the cache • The longer the cache active time, the more vulnerable the page is to soft errors • Cache Access: how many times a page is accessed • The more often a page is accessed, the more it affects the application, and the more likely an error in it results in a failure
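The two per-page metrics above could be collected from a cache trace roughly as follows. This is a sketch under stated assumptions: the trace format (tuples of time, page, and a `fill`, `access`, or `evict` event), the function names, and the threshold rule in `select_protected` are all hypothetical placeholders, since the slides do not define how the metrics are measured or combined.

```python
from collections import defaultdict


def page_metrics(trace, end_time):
    """Return {page: (cache_active_time, access_count)} from a trace.

    Assumes a well-formed trace: each 'evict' follows a 'fill' of that page.
    """
    active = defaultdict(int)
    accesses = defaultdict(int)
    resident_since = {}
    for time, page, event in trace:
        if event == "fill":
            resident_since[page] = time
        elif event == "access":
            accesses[page] += 1
        elif event == "evict":
            active[page] += time - resident_since.pop(page)
        else:
            raise ValueError(f"unknown event: {event}")
    # Pages still in the cache at the end stay active until end_time.
    for page, since in resident_since.items():
        active[page] += end_time - since
    pages = set(active) | set(accesses)
    return {p: (active[p], accesses[p]) for p in pages}


def select_protected(metrics, min_active, min_accesses):
    """Placeholder rule: protect pages that exceed either threshold."""
    return sorted(
        page for page, (act, acc) in metrics.items()
        if act >= min_active or acc >= min_accesses
    )


trace = [
    (0, "A", "fill"), (1, "A", "access"), (2, "B", "fill"),
    (3, "A", "access"), (4, "B", "access"), (5, "B", "evict"),
    (9, "A", "access"),
]
metrics = page_metrics(trace, end_time=10)
print(metrics["A"], metrics["B"])
print(select_protected(metrics, min_active=8, min_accesses=3))
```

The "either threshold" rule is only one possible way to turn the two metrics into a partition; any real choice of rule and thresholds would have to come out of the simulation study.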
Next Step • More Simulations • A Stronger Metric • One that combines Cache Active Time and Cache Access • Temporal and behavioral analysis to support it • Writing a paper for DATE 2007