
Selective Data Protection in High-Performance Computing for Reduced Soft Errors

This presentation discusses selectively protecting data in HPC to reduce failures due to soft errors. It covers the definition of soft errors, soft error rates, detection and recovery methods, and related work. It reviews redundancy techniques such as ECC and TMR and their impact on cost, performance, and power consumption. It argues that not all data are equally critical and proposes selective data protection to balance reliability, power, and performance.



Presentation Transcript


  1. SE-Aware HPC Extension: Selective Data Protection for Reducing Failures Due to Soft Errors • 7/20/2006 • Kyoungwoo Lee

  2. Contents • Motivation • Previous Work • Current Work • Next Step

  3. Soft Error • What is a soft error? • Why are soft errors important? • How to recover from soft errors?

  4. Definition of Soft Error • Soft Error (SE) • Transient Fault = Bit Flip = Single Event Upset (SEU) • A charged particle strikes an electronic circuit and changes the amount of charge stored at sensitive nodes, which affects the logic state (e.g., ‘0’ to ‘1’ or vice versa) • Random, non-catastrophic, non-destructive, recoverable • Caused by radiation • Neutrons • Alpha particles • High-energy cosmic rays • Solar particles • Source: Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005

  5. Soft Error Rate • Soft Error Rate (SER) • FIT: Failures in Time (number of failures per one billion hours of operation) • e.g., 1,000 FITs per Mbit ≈ 114 years MTTF (Mean Time To Failure) • SER ∝ N_flux × CS × exp(−Q_critical / Q_S) • N_flux: intensity of the neutron flux • CS: area of the cross section of the node • Q_S: charge collection efficiency • Q_critical: minimum charge required for a cell to retain data, Q_critical = C × V, where C is the capacitance and V is the voltage
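The proportionality on this slide can be explored numerically. The sketch below is only illustrative: the function names and the unitless parameter values are assumptions made for the example, not figures from the presentation. It evaluates the unnormalized model and shows how lowering V (and therefore Q_critical = C × V) raises the relative SER.

```python
import math

def q_critical(capacitance, voltage):
    """Minimum charge needed to retain data: Q_critical = C * V."""
    return capacitance * voltage

def relative_ser(n_flux, cross_section, q_crit, q_s):
    """Unnormalized SER from SER ∝ N_flux * CS * exp(-Q_critical / Q_S)."""
    return n_flux * cross_section * math.exp(-q_crit / q_s)

# Hypothetical, unitless values chosen only to show the trend.
C = 1.0
Q_S = 0.2
ser_high_v = relative_ser(1.0, 1.0, q_critical(C, 1.0), Q_S)
ser_low_v = relative_ser(1.0, 1.0, q_critical(C, 0.8), Q_S)

# Only the ratio is meaningful, because the model is a proportionality.
ratio = ser_low_v / ser_high_v
print(f"Relative SER increase when V drops from 1.0 to 0.8: {ratio:.2f}x")
```

With these assumed values the ratio equals exp((1.0 − 0.8) / 0.2) = e, which is the exponential dependence on voltage that the model implies.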

  6. Importance of SE • Critical SE • High Integration and Density • e.g., 1 GB memory with 1,000 FIT per Mbit → 8 × 10^6 FITs per memory → about 5 days MTTF • Technology Advancements • e.g., 1,000 FIT per Mbit in 0.18 µm technology → 10,000 to 100,000 FIT per Mbit in 0.13 µm technology • Latitude and Altitude • e.g., 10 to 100 times higher SER at flight altitude than at ground level • Voltage Scaling • e.g., lower voltage decreases Q_critical, which increases SER exponentially
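The FIT and MTTF figures quoted on the slides follow from simple arithmetic. As a minimal check, assuming 1 GB = 8 × 1024 Mbit and a 365-day year (the helper name `mttf_hours` is made up for this example):

```python
HOURS_PER_BILLION = 1e9      # FIT counts failures per 10^9 hours
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

def mttf_hours(fit):
    """Mean time to failure in hours for a given FIT rate."""
    return HOURS_PER_BILLION / fit

fit_per_mbit = 1_000

# One Mbit at 1,000 FIT: MTTF expressed in years.
years_per_mbit = mttf_hours(fit_per_mbit) / HOURS_PER_YEAR

# 1 GB = 8 * 1024 Mbit; FIT rates add up across independent bits.
memory_mbit = 8 * 1024
total_fit = fit_per_mbit * memory_mbit
days_per_gb = mttf_hours(total_fit) / HOURS_PER_DAY

print(f"1 Mbit: about {years_per_mbit:.0f} years MTTF")
print(f"1 GB: {total_fit:.1e} FIT, about {days_per_gb:.1f} days MTTF")
```

The point of the calculation is that a rate which looks negligible per Mbit becomes a failure every few days once it is multiplied across a gigabyte of memory.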

  7. SER Trend • [Figure panels: A. DRAM, B. SRAM, C. Core Logic, D. Contributions in Processors] • Sources: Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005; S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, 2005

  8. SE Detection and Recovery • Information Redundancy • e.g., ECC (Error-Correcting Code) and parity • Hardware Redundancy • e.g., TMR (Triple Modular Redundancy) • Temporal Redundancy • e.g., checkpointing and recovery • Effects of Redundancy on Cost, Performance, and Power • e.g., ECC • Implemented with a Hamming code (250 nm) • Requires coding/decoding modules and extra bits • 1.45 ns for coding and 2.66 ns for decoding • 14.5 mW for coding and 26.3 mW for decoding • [Figure: data and extra bits with coding and decoding modules] • Source: L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004
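To make the "coding", "decoding", and "extra bits" on this slide concrete, here is a minimal sketch of a Hamming(7,4) code, which adds three check bits to four data bits and corrects any single bit flip. It is a textbook illustration, not the cache ECC measured in the cited work; real cache ECC typically protects wider words, and the function names here are assumptions.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, syndrome); syndrome is the 1-based flipped position, or 0."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                      # simulate a soft error (bit flip) at position 5
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

Every read has to pass through the decoder and every write through the encoder, which is where the latency and power costs listed on the slide come from.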

  9. Related Work • Reliability and Power Management • Dr. D. Mossé group at Univ. of Pittsburgh • D. Zhu, R. Melhem, and D. Mossé, “The Effects of Energy Management on Reliability in Real-Time Embedded Systems,” Proc. of ICCAD, Nov. 2004. • D. Zhu, R. Melhem, D. Mossé, and E. Elnozahy, “Analysis of an Energy Efficient Optimistic TMR Scheme,” Proc. of ICPDS, Jul. 2004. • Dr. G. De Micheli group at Stanford Univ. • K. Mihic, T. Simunic, and G. De Micheli, “Reliability and Power Management of Integrated Systems,” Proc. of EuroMicro Systems on Digital System Design, 2004. • T. Simunic, K. Mihic, and G. De Micheli, “Optimization of Reliability and Power Consumption in Systems on a Chip,” Proc. of PATMOS, 2005. • (Cache) Architecture • Dr. M. J. Irwin and Dr. N. Vijaykrishnan group at PSU • L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, pp. 132-137, 2004. • Dr. S. M. Reddy group at Univ. of Iowa • Y. Cai, M. T. Schmitz, A. Ejlali, B. M. Al-Hashimi, and S. M. Reddy, “Cache Size Selection for Performance, Energy and Reliability of Time-Constrained Systems,” ASP-DAC, 2006. • Soft Error and Core Logic • Intel • S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, pp. 43-51, 2005. • Dr. K. Roy group at Purdue Univ. • A. Goel, S. Bhunia, H. Mahmoodi, and K. Roy, “Low-Overhead Design of Soft-Error-Tolerant Scan Flip-Flops with Enhanced-Scan Capability,” ASP-DAC, 2006.

  10. Overhead of ECC in Cache • Power: up to 22% power overhead • Performance: 95% access-time overhead • Area: more than 25% area overhead • [Figure: protected cache (data plus extra bits, with coding and decoding modules) versus unprotected cache (data only)]

  11. Our Approach • HPC offers low performance overhead as well as high energy savings • Not all data are equally critical for failures • e.g., pixel data in video applications matter less for quality and reliability, while quantization values matter more • Provide reliability comparable to a protected cache with small power and performance overheads • [Figure: protected cache, unprotected cache, and selective data protection with HPC]

  12. Previous Work • “Mitigating Soft Error Failures for Multimedia Applications using Selective Data Protection,” submitted to CASES 2006 • Place multimedia data in the SE-unprotected main cache and all other data in the SE-protected mini cache • Achieve failure rates comparable to those of a fully SE-protected cache, with significant reductions in energy, area, and performance overheads
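The placement rule on this slide can be sketched in a few lines. The slides do not say how pages are identified as multimedia data, so the `is_multimedia` flag, the `Page` type, and the cache labels below are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Page:
    address: int
    is_multimedia: bool   # hypothetical tag; the slides do not specify how it is set

def assign_cache(page):
    """Route multimedia pages to the unprotected main cache, everything else to the protected mini cache."""
    return "main_unprotected" if page.is_multimedia else "mini_protected"

pages = [
    Page(0x1000, True),    # e.g., a frame buffer
    Page(0x2000, False),   # e.g., control data or loop counters
    Page(0x3000, True),
]
placement = {hex(p.address): assign_cache(p) for p in pages}
print(placement)
```

The design bet is that a bit flip in multimedia data degrades output quality slightly, whereas a flip in the remaining data is more likely to cause a failure, so only the latter pays the ECC cost.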

  13. Current Work • The main objective is to extend this idea to general applications • How to partition data into important and unimportant sets • Intensive simulation study • Cache Active Time: how long the page stays in the cache • The longer the cache active time, the more vulnerable the page is to soft errors • Cache Access: how many times the page is accessed • The more often the page is accessed, the more it affects the application and the more likely an error in it results in a failure
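The two per-page metrics named on this slide could be collected from a cache simulation trace roughly as follows. This is a sketch under assumptions: the trace format of `(time, page, kind)` tuples with kinds `"load"`, `"access"`, and `"evict"` is invented for the example, and the slides do not describe the simulator's actual output or how the metrics are turned into a partition.

```python
from collections import defaultdict

def page_metrics(trace, end_time):
    """Return {page: (cache_active_time, cache_accesses)} from a hypothetical event trace."""
    loaded_at = {}                  # page -> time it entered the cache
    active = defaultdict(int)       # page -> total time resident in the cache
    accesses = defaultdict(int)     # page -> number of accesses
    for time, page, kind in trace:
        if kind == "load":
            loaded_at[page] = time
        elif kind == "access":
            accesses[page] += 1
        elif kind == "evict":
            active[page] += time - loaded_at.pop(page)
    # Pages still resident at the end of the trace count up to end_time.
    for page, start in loaded_at.items():
        active[page] += end_time - start
    pages = set(active) | set(accesses)
    return {p: (active[p], accesses[p]) for p in pages}

trace = [
    (0, "A", "load"), (1, "A", "access"), (2, "B", "load"),
    (3, "A", "access"), (5, "A", "evict"), (6, "B", "access"),
]
metrics = page_metrics(trace, end_time=10)
print(metrics)
```

In this toy trace, page A is resident for 5 time units with 2 accesses and page B for 8 time units with 1 access, so the two metrics rank the pages differently; how to combine them into a single criterion is left open here.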

  14. Preliminary Results

  15. Next Step • More simulations • A stronger metric • One that combines Cache Active Time and Cache Access • Temporal and behavioral analysis to support it • Writing a paper for DATE 2007
