Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report

Shubu Mukherjee, Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*
Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE International Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004
* Also, University of Michigan, Ann Arbor

Summary
• SECDED ECC (single error correction, double error detection)
  • commonly used in on-chip caches
  • interleaving converts spatial multi-bit errors to multiple single bit errors
• Scrubbing
  • periodically read cache blocks and correct all single bit errors
  • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
• Our conclusion: given a detected error target of 10 year MTTF
  • scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

Origin of Cosmic Rays
[Figure: shower of particles labeled p and n reaching the Earth's surface]
• Cosmic rays come from deep space

Impact of Neutron Strike on a Si Device
[Figure: neutron strike on a transistor device, showing source, drain, and the released + and - charges]
• Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
• Secondary source of upsets: alpha particles from packaging

Strike Changes State of a Single Bit
[Figure: a single stored bit, values 0 and 1]
• Example solution
  • Error correction codes (ECC) for single bit correction
  • Overhead = 7 bits for 64 bits of data
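The 7-bit figure can be checked with a short sketch. It assumes a Hamming-style code, where r check bits must be able to name every one of the data + r bit positions plus the "no error" case; the function names are illustrative and not from the slides.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC check bits plus one overall parity bit for double error detection."""
    return sec_check_bits(data_bits) + 1


print(sec_check_bits(64))     # 7
print(secded_check_bits(64))  # 8
```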

Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error
[Figure: two adjacent stored bits, values 0 and 1]
• Example solution
  • SECDED ECC (single error correction, double error detection)
    • 8 bits of code per 64 bits of data
  • Interleaving for the more general case …

Interleaving
[Figure: row of bits labeled X, +, / and 0 in rotation; X = covered with a single ECC code, + = covered with a different ECC code]
• Interleaving converts
  • spatial multi-bit error → multiple single bit errors
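A minimal sketch of the idea, assuming a simple modulo mapping in which physical bit i belongs to ECC codeword i mod ways. The mapping and the names are assumptions for illustration; real layouts differ.

```python
def codeword_of(bit_index: int, ways: int) -> int:
    """ECC codeword that protects a physical bit (assumed modulo mapping)."""
    return bit_index % ways


def errors_per_codeword(struck_bits, ways):
    """Count how many struck bits land in each ECC codeword."""
    counts = {}
    for bit in struck_bits:
        cw = codeword_of(bit, ways)
        counts[cw] = counts.get(cw, 0) + 1
    return counts


# One strike flips the physically adjacent bits 10 and 11.
print(errors_per_codeword([10, 11], ways=4))  # {2: 1, 3: 1}
print(errors_per_codeword([10, 11], ways=1))  # {0: 2}
```

With 4-way interleaving each codeword sees a single, correctable error; without interleaving the same strike produces a double bit error in one codeword.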

Two Separate Strikes on Different Bits: Temporal Double Bit Errors
[Figure: strikes at cycle 100 and cycle 1,000,000]
• SECDED ECC (single error correction, double error detection)
  • could detect the error, but cannot correct it
• If errors accumulate
  • a single bit correctable error becomes a double bit detectable error

Solutions for Temporal Double Bit Errors
• Natural Effects
  • whenever a processor reads a cache block, we can correct the single bit error
  • check for errors when cache blocks are replaced from the cache
• More Powerful ECC
  • SECDED ECC requires 8 bits per 64 bits
    • 7 bits for single bit correction
    • 8th bit for double bit detection
    • Overhead = 8/64 ≈ 13%
  • ECC with two bit correction requires 12 bits per 64 bits
    • Overhead = 12/64 ≈ 19%
• Scrubbing
  • periodically read memory and correct all single bit errors
  • prevents single bit errors from accumulating into temporal double bit errors
  • standard technique in main memories (DRAMs)
• Our calculations (later) will assume the worst case for soft errors
  • cache blocks don't get scrubbed naturally

Memory Hierarchy of a Processor
[Figure: CPU, L1 cache (kilobytes), L2 cache (megabytes), main memory (gigabytes)]
• Do we need to scrub on-chip caches?
  • depends on the size of these caches

Detected Unrecoverable Error (DUE)
• Interval-based
  • MTTF = Mean Time to Failure
  • E.g., goal = 10 years MTTF for application crash
    • Bossen, IRPS 2002
• Rate-based
  • FIT = Failure in Time = 1 failure in a billion hours
  • 10 year MTTF = 10^9 / (24 * 365 * 10) FIT = 11,415 FIT
• Hypothetical example
  • Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT, total of 210 FIT
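The conversion between the interval-based and rate-based views is a single division. A small sketch, using 24 * 365 hours per year as on the slide; the helper names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # same convention as the slide


def mttf_years_to_fit(mttf_years: float) -> float:
    """Failures per billion hours that correspond to a given MTTF in years."""
    return 1e9 / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """MTTF in years that corresponds to a given FIT rate."""
    return 1e9 / fit / HOURS_PER_YEAR


print(int(mttf_years_to_fit(10)))  # 11415
```

Because FIT rates of independent components add, a rate budget of this kind can be compared directly against the sum of per-structure FIT values.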

MTTF calculations: probabilities
[Figure: first strike, probability = Q / Q; second strike, probability = 1 / Q]
• 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
• Q = # quadwords in cache memory
• Pd[n] = probability that a sequence of n strikes causes n - 1 single bit errors, followed by a double bit error on the nth strike
• Pd[1] = 0
• Pd[2] = (Q/Q) * (1/Q) = 1 / Q

MTTF calculations: probabilities
[Figure: first strike, probability = Q / Q; second strike, probability = (Q-1) / Q; third strike, probability = 2 / Q]
• 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
• Q = # quadwords in cache memory
• Pd[n] = probability that a sequence of n strikes causes n - 1 single bit errors, followed by a double bit error on the nth strike
• Pd[3] = (Q/Q) * [ (Q-1)/Q ] * [ 2/Q ] = [ (Q-1)/Q ] * [ 2/Q ]

MTTF calculations: probabilities
• 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
• Q = # quadwords in cache memory
• Pd[n] = probability that a sequence of n strikes causes n - 1 single bit errors, followed by a double bit error on the nth strike
• Pd[1] = 0
• Pd[2] = 1 / Q
• Pd[3] = [ (Q-1)/Q ] * [ 2/Q ]
• Pd[4] = [ (Q-1)/Q ] * [ (Q-2)/Q ] * [ 3/Q ]
• …
• Pd[n] = [ (Q-1)/Q ] * [ (Q-2)/Q ] * [ (Q-3)/Q ] * … * [ (Q-n+2)/Q ] * [ (n-1)/Q ]
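The general term can be written as a few lines of code. This is a direct transcription of the Pd[n] product, assuming every strike lands on a uniformly random quadword; the function name is illustrative.

```python
def p_double(n: int, q: int) -> float:
    """Pd[n]: first n-1 strikes hit distinct quadwords, nth strike hits one of them."""
    if n < 2:
        return 0.0
    p = 1.0
    for i in range(1, n - 1):      # factors (Q-1)/Q ... (Q-n+2)/Q
        p *= (q - i) / q
    return p * (n - 1) / q         # nth strike lands on one of n-1 struck quadwords


q = 4
probs = [p_double(n, q) for n in range(1, q + 2)]
print(probs)       # [0.0, 0.25, 0.375, 0.28125, 0.09375]
print(sum(probs))  # 1.0
```

The probabilities sum to 1 over n = 2 … Q+1, since after Q distinct quadwords have been struck the next strike must produce a double bit error.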

MTTF calculations: Equation
• M = mean # of single bit errors to get a double bit error = expected value of the random variable with Pd[n] as its probability distribution function
• M can be easily generated using a computer program
• MTTF (double bit error) = M * MTTF (single bit error)
• For a 32 megabyte cache & FIT/bit = 0.001 [Normand 1996, Tosaka 1996]
  • MTTF (double bit error) = M * MTTF (single bit error) = 2567 * (1 / Cache FIT) = 2567 * (10^9 / (0.001 * 2^22 * 72 * 24 * 365)) = 970 years
• Saleh et al.'s 1990 closed-form equation
  • MTTF (double bit error) = [ 1 / (72 * f) ] * sqrt(π / (2Q)) = 970 years, f = FIT/bit
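A minimal sketch of such a program follows. It is not the authors' code: it accumulates n * Pd[n] with a running product, stops once the remaining probability mass falls below an assumed tolerance, and converts FIT/bit to failures per hour before applying the closed form. All names and the tolerance are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72  # 64 data bits + 8 SECDED bits


def mean_strikes_to_double(q: int, tol: float = 1e-18) -> float:
    """M = sum over n of n * Pd[n], truncated when the tail is negligible."""
    distinct = 1.0  # probability that the first n-1 strikes hit distinct quadwords
    mean = 0.0
    n = 2
    while distinct > tol and n <= q + 1:
        mean += n * distinct * (n - 1) / q   # n * Pd[n]
        distinct *= (q - (n - 1)) / q        # extend to n distinct strikes
        n += 1
    return mean


def mttf_double_years(cache_bytes: int, fit_per_bit: float) -> float:
    """M * MTTF(single bit error), in years."""
    q = cache_bytes // 8
    cache_fit = fit_per_bit * q * BITS_PER_QUADWORD
    mttf_single_years = 1e9 / cache_fit / HOURS_PER_YEAR
    return mean_strikes_to_double(q) * mttf_single_years


def closed_form_years(cache_bytes: int, fit_per_bit: float) -> float:
    """[1 / (72 f)] * sqrt(pi / (2Q)) with f in failures per hour per bit."""
    q = cache_bytes // 8
    f_per_hour = fit_per_bit * 1e-9
    hours = math.sqrt(math.pi / (2 * q)) / (BITS_PER_QUADWORD * f_per_hour)
    return hours / HOURS_PER_YEAR


cache = 32 * 2**20  # 32 megabytes
print(mean_strikes_to_double(cache // 8))
print(mttf_double_years(cache, 0.001))
print(closed_form_years(cache, 0.001))
```

Under these assumptions both routes should land close to the slide's figures of M = 2567 and 970 years.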

Temporal Double Bit MTTF variations with cache size
• FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
  • higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
• Temporal double bit error has a very small contribution to DUE rate
  • compared to a goal of 10 years DUE MTTF
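The trend with cache size can be tabulated from the Saleh et al. closed form, MTTF = [1 / (72 * f)] * sqrt(π / (2Q)). The sketch below is illustrative only: the list of cache sizes is an assumption chosen for the sweep, and f is converted from FIT/bit to failures per hour per bit.

```python
import math

HOURS_PER_YEAR = 24 * 365


def unscrubbed_mttf_years(cache_bytes: int, fit_per_bit: float) -> float:
    """Temporal double bit MTTF without scrubbing (closed form), in years."""
    q = cache_bytes // 8                 # quadwords of 64 data bits
    f_per_hour = fit_per_bit * 1e-9      # FIT/bit -> failures per hour per bit
    hours = math.sqrt(math.pi / (2 * q)) / (72 * f_per_hour)
    return hours / HOURS_PER_YEAR


MB, GB = 2**20, 2**30
for size, label in [(32 * MB, "32 MB"), (512 * MB, "512 MB"), (16 * GB, "16 GB")]:
    low = unscrubbed_mttf_years(size, 0.001)
    high = unscrubbed_mttf_years(size, 0.01)
    print(f"{label:>7}: {low:8.1f} years at 0.001 FIT/bit, {high:7.1f} years at 0.01 FIT/bit")
```

The MTTF falls with the square root of the cache size and linearly with FIT/bit, which is why only the largest caches at the high end of the FIT range approach a 10 year goal.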

MTTF with Scrubbing
[Figure: timeline divided into consecutive scrubbing intervals of length I]
• I = scrubbing interval, scrub at the end of each interval I
• N = # scrubbing intervals to reach MTTF = expected value of the random variable with probability distribution function (1 - pf)^N * pf, where pf = probability of a temporal double bit error at the end of an interval
• Assuming 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), scrub once a year (I = 1 year)
  • MTTF (double bit error) = N * I = 2281 * 1 = 2281 years
• Saleh et al.'s 1990 closed-form equation
  • MTTF (double bit error) = 2 / [ Q * I * (72 * f)^2 ] = 2341 years, f = FIT/bit
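A rough sketch of the scrubbing calculation, under assumptions that are not in the slides: strikes on each quadword within an interval are modeled as Poisson, pf is the probability that at least one quadword takes two or more strikes before the scrub, and the mean number of intervals is taken as 1 / pf. The authors' own program is not shown, so this is not expected to reproduce their 2281 exactly; it should sit near the closed-form value.

```python
import math

HOURS_PER_YEAR = 24 * 365


def scrubbed_mttf_years(cache_bytes: int, fit_per_bit: float, interval_years: float) -> float:
    """MTTF with scrubbing, assuming Poisson strikes per quadword (an assumption)."""
    q = cache_bytes // 8
    # expected strikes per quadword in one scrubbing interval
    mu = fit_per_bit * 1e-9 * 72 * interval_years * HOURS_PER_YEAR
    # log of P(every quadword takes at most one strike) = Q * (log(1 + mu) - mu)
    log_ok = q * (math.log1p(mu) - mu)
    pf = -math.expm1(log_ok)
    return interval_years / pf


def closed_form_scrubbed_years(cache_bytes: int, fit_per_bit: float, interval_years: float) -> float:
    """2 / [Q * I * (72 f)^2] with f per hour and I in hours, converted to years."""
    q = cache_bytes // 8
    f_per_hour = fit_per_bit * 1e-9
    interval_hours = interval_years * HOURS_PER_YEAR
    hours = 2 / (q * interval_hours * (72 * f_per_hour) ** 2)
    return hours / HOURS_PER_YEAR


cache = 16 * 2**30  # 16 GB
print(scrubbed_mttf_years(cache, 0.001, 1.0))
print(closed_form_scrubbed_years(cache, 0.001, 1.0))
```

Shortening the interval I raises the MTTF in inverse proportion, which is the lever scrubbing provides.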

Impact of Scrubbing on Temporal Double Bit MTTF (16 Gigabyte Cache)
• FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
  • higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
• For 16 gigabytes of cache, scrubbing can help
  • compared to a DUE MTTF goal of 10 years

Summary
• SECDED ECC (single error correction, double error detection)
  • commonly used in on-chip caches
  • interleaving converts spatial multi-bit errors to multiple single bit errors
• Scrubbing
  • periodically read cache blocks and correct all single bit errors
  • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
• Our conclusion: given a detected error target of 10 year MTTF
  • scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

BACKUPS

Raw soft error rate: 0.001 – 0.010 FIT/bit
• Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G. A. Woffinden, and S. A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” Symposium on VLSI Technology Digest of Technical Papers, 1996.
• Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
