Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report
Shubu Mukherjee, Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*
Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE International Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004
* Also, University of Michigan, Ann Arbor
Summary
• SECDED ECC (single error correction, double error detection)
  • commonly used in on-chip caches
  • interleaving converts spatial multi-bit errors to multiple single bit errors
• Scrubbing
  • periodically read cache blocks and correct all single bit errors
  • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
• Our conclusion: given a detected error target of 10 year MTTF
  • Scrubbing necessary only for very large caches (e.g., 100s of megabytes to gigabytes)
Origin of Cosmic Rays
[Figure: shower of protons (p) and neutrons (n) reaching Earth’s surface]
• Cosmic rays come from deep space
Impact of Neutron Strike on a Si Device
[Figure: transistor device with source and drain; a neutron strike releases electron (-) and hole (+) pairs]
• Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
• Secondary source of upsets: alpha particles from packaging
Strike Changes State of a Single Bit
[Figure: a stored bit flips between 0 and 1]
• Example Solution
  • Error correction codes (ECC) for single bit correction
  • Overhead = 7 bits for 64 bits of data
Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error
[Figure: two adjacent bits flipped by one strike]
• Example solution
  • SECDED ECC (single error correction, double error detection)
  • 8 bits of code per 64 bits of data
  • Interleaving for the more general case …
Interleaving
[Figure: interleaved bits marked /, X, +, 0. X = bits covered with a single ECC code; + = bits covered with a different ECC code]
• Interleaving converts a spatial multi-bit error into multiple single bit errors
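The interleaving idea can be illustrated with a small sketch. It assumes a simple modulo layout in which physically adjacent bits rotate through `ways` different ECC codewords; the function names and the 4-way factor are illustrative choices, not details taken from the slides.

```python
def codeword_of(physical_bit, ways):
    """Codeword that protects a physical bit position (assumed modulo layout)."""
    return physical_bit % ways


def errors_per_codeword(first_bit, burst_len, ways):
    """Count how many flipped bits each codeword sees when one strike
    flips `burst_len` physically adjacent bits starting at `first_bit`."""
    counts = {}
    for bit in range(first_bit, first_bit + burst_len):
        cw = codeword_of(bit, ways)
        counts[cw] = counts.get(cw, 0) + 1
    return counts


# A 3-bit spatial error with 4-way interleaving: every codeword sees at most
# one flipped bit, so each error is individually correctable.
print(errors_per_codeword(5, 3, 4))  # {1: 1, 2: 1, 3: 1}

# The same strike without interleaving lands entirely in one codeword.
print(errors_per_codeword(5, 3, 1))  # {0: 3}
```

Under this layout, any burst no longer than `ways` bits contributes at most one error to each codeword.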
Two Separate Strikes on Different Bits: Temporal Double Bit Errors
[Figure: one strike at cycle 100, a second strike at cycle 1,000,000]
• SECDED ECC (single error correction, double error detection)
  • can detect a double bit error, but cannot correct it
• if errors accumulate
  • a single bit correctable error becomes a double bit detectable error
Solutions for Temporal Double Bit Errors
• Natural Effects
  • whenever a processor reads a cache block, we can correct the single bit error
  • check for errors when cache blocks are replaced from the cache
• More Powerful ECC
  • SECDED ECC requires 8 bits per 64 bits
    • 7 bits for single bit correction
    • 8th bit for double bit detection
    • Overhead = 13%
  • ECC with two bit correction requires 12 bits per 64 bits
    • Overhead = 19%
• Scrubbing
  • Periodically read memory and correct all single bit errors
  • Prevents single bit errors from accumulating into temporal double bit errors
  • Standard technique in main memories (DRAMs)
• Our calculations (later) will assume the worst case for soft errors
  • cache blocks don’t get scrubbed naturally
Memory Hierarchy of a Processor
[Figure: CPU, L1 Cache (kilobytes), L2 Cache (megabytes), Main Memory (gigabytes)]
• Do we need to scrub on-chip caches?
  • depends on the size of these caches
Detected Unrecoverable Error (DUE)
• Interval-based
  • MTTF = Mean Time to Failure
  • E.g., goal = 10 years MTTF for application crash (Bossen, IRPS 2002)
• Rate-based
  • FIT = Failure in Time = 1 failure in a billion hours
  • 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415 FITs
• Hypothetical Example
  • Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT = Total of 220 FIT
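The conversion between the interval-based and rate-based metrics is simple arithmetic. The helper names below are illustrative and not from the slides; the 365-day year matches the slide's own calculation.

```python
HOURS_PER_YEAR = 24 * 365      # 365-day year, as used on the slide
FIT_HOURS = 1e9                # 1 FIT = 1 failure per billion hours


def mttf_years_to_fit(mttf_years):
    """Failure rate in FIT that corresponds to a given MTTF in years."""
    return FIT_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """MTTF in years that corresponds to a given failure rate in FIT."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


# A 10 year MTTF goal is a budget of about 11,415 FIT.
print(int(mttf_years_to_fit(10)))  # 11415
```

Because FIT is a rate, the contributions of independent structures add directly, which is why the budget is usually tracked in FIT rather than in MTTF.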
MTTF calculations: probabilities
[Figure: first strike, probability = Q/Q; second strike, probability = 1/Q]
• 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
• Q = # quadwords in cache memory
• Pd[n] = probability that a sequence of n strikes causes n – 1 single bit errors, followed by a double bit error on the nth strike
  • Pd[1] = 0
  • Pd[2] = (Q/Q) * (1/Q) = 1/Q
MTTF calculations: probabilities
[Figure: first strike, probability = Q/Q; second strike, probability = (Q-1)/Q; third strike, probability = 2/Q]
• 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
• Q = # quadwords in cache memory
• Pd[n] = probability that a sequence of n strikes causes n – 1 single bit errors, followed by a double bit error on the nth strike
  • Pd[3] = (Q/Q) * [ (Q-1)/Q ] * [ 2/Q ] = [ (Q-1)/Q ] * [ 2/Q ]
MTTF calculations: probabilities
• 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
• Q = # quadwords in cache memory
• Pd[n] = probability that a sequence of n strikes causes n – 1 single bit errors, followed by a double bit error on the nth strike
  • Pd[1] = 0
  • Pd[2] = 1 / Q
  • Pd[3] = [ (Q-1)/Q ] * [ 2/Q ]
  • Pd[4] = [ (Q-1)/Q ] * [ (Q-2)/Q ] * [ 3/Q ]
  • …
  • Pd[n] = [ (Q-1)/Q ] * [ (Q-2)/Q ] * [ (Q-3)/Q ] * … * [ (Q-n+2)/Q ] * [ (n-1)/Q ]
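The general term can be checked numerically. This is a minimal sketch of Pd[n] written directly from the product above; the function name `pd` and the small value of Q used in the check are illustrative choices.

```python
def pd(n, q):
    """Probability that the first n-1 strikes hit n-1 distinct quadwords
    and the nth strike lands on one of those already-struck quadwords."""
    if n < 2:
        return 0.0
    p = 1.0
    for i in range(1, n - 1):      # factors (Q-1)/Q ... (Q-n+2)/Q
        p *= (q - i) / q
    return p * (n - 1) / q         # nth strike hits one of n-1 struck quadwords


# With Q quadwords, a double bit error must occur by strike Q+1,
# so the probabilities over n = 1 .. Q+1 sum to 1.
Q = 8
total = sum(pd(n, Q) for n in range(1, Q + 2))
print(pd(2, Q), total)
```

The sum-to-one check is a quick way to confirm that the terms form a proper probability distribution over n.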
MTTF calculations: Equation
• M = mean # of single bit errors to get a double bit error = expected value of the random variable with Pd[n] as its probability distribution function
  • M can be easily generated using a computer program
• MTTF (double bit error) = M * MTTF (single bit error)
• For a 32 megabyte cache & FIT/bit = 0.001 [Normand 1996, Tosaka 1996]
  • MTTF (double bit error) = M * MTTF (single bit error) = 2567 * (1 / Cache FIT) = $2567 \times 10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times 365)$ = 970 years
• Saleh, et al.’s 1990 closed form equation
  • MTTF (double bit error) = $\frac{1}{72 f} \sqrt{\frac{\pi}{2Q}}$ = 970 years, f = FIT/bit
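The slide notes that M can be generated with a computer program; one possible sketch follows. It accumulates n * Pd[n] incrementally and compares the result with the closed form, taken here as (1 / (72 f)) * sqrt(pi / (2Q)) with f converted from FIT/bit to failures per hour. The function names and the truncation threshold are assumptions of this sketch, not the authors' program.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72             # 64 data bits + 8 SECDED bits


def mean_strikes_to_double_bit_error(q):
    """M = sum over n of n * Pd[n] for a cache of q quadwords."""
    mean = 0.0
    survive = 1.0                  # P(first n-1 strikes hit distinct quadwords)
    n = 2
    while survive > 1e-18 and n <= q + 1:
        mean += n * survive * (n - 1) / q
        survive *= (q - (n - 1)) / q
        n += 1
    return mean


def mttf_no_scrub_years(cache_bytes, fit_per_bit):
    """MTTF (double bit error) = M * MTTF (single bit error), in years."""
    q = cache_bytes // 8
    cache_fit = fit_per_bit * q * BITS_PER_QUADWORD
    return mean_strikes_to_double_bit_error(q) * 1e9 / (cache_fit * HOURS_PER_YEAR)


def mttf_closed_form_years(cache_bytes, fit_per_bit):
    """Closed form (1 / (72 f)) * sqrt(pi / (2Q)), converted to years."""
    q = cache_bytes // 8
    f_per_hour = fit_per_bit / 1e9
    hours = math.sqrt(math.pi / (2 * q)) / (BITS_PER_QUADWORD * f_per_hour)
    return hours / HOURS_PER_YEAR


cache = 32 * 2**20                 # 32 megabytes = 2^22 quadwords
print(round(mean_strikes_to_double_bit_error(cache // 8)))
print(round(mttf_no_scrub_years(cache, 0.001)),
      round(mttf_closed_form_years(cache, 0.001)))
```

Both routes land close to the 970 year figure for the 32 megabyte case, which suggests the closed form is a good stand-in for the exact sum at these cache sizes.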
Temporal Double Bit MTTF variations with cache size
• FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
  • higher at higher altitudes (e.g., 3-5x at 1.5km in Denver)
• Temporal double bit error makes a very small contribution to the DUE rate
  • compared to a goal of 10 years DUE MTTF
MTTF with Scrubbing
[Figure: timeline divided into scrubbing intervals of length I]
• I = scrubbing interval, scrub at the end of each interval I
• N = # scrubbing intervals to reach MTTF = expected value of the random variable with probability distribution function $(1 - p_f)^N \, p_f$, where $p_f$ = probability of a temporal double bit error at the end of an interval
• Assuming 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), scrub once a year (I = 1 year)
  • MTTF (double bit error) = N * I = 2281 * 1 = 2281 years
• Saleh, et al. 1990 closed form equation
  • MTTF (double bit error) = $\frac{2}{Q \cdot I \cdot (72 f)^2}$ = 2341 years, f = FIT/bit
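The closed form for the scrubbed case can be evaluated directly. This sketch covers only the closed form attributed to Saleh et al., with f converted from FIT/bit to failures per year so that the interval I can be given in years; it does not reproduce the exact calculation behind N = 2281. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72             # 64 data bits + 8 SECDED bits


def mttf_scrubbed_years(cache_bytes, fit_per_bit, interval_years):
    """Closed form 2 / (Q * I * (72 f)^2), with f in failures per bit-year."""
    q = cache_bytes // 8
    quadword_rate = BITS_PER_QUADWORD * fit_per_bit / 1e9 * HOURS_PER_YEAR
    return 2.0 / (q * interval_years * quadword_rate ** 2)


cache = 16 * 2**30                 # 16 gigabytes = 2^31 quadwords
print(round(mttf_scrubbed_years(cache, 0.001, 1.0)))   # 2341

# In this formula MTTF is inversely proportional to the scrubbing interval,
# so scrubbing twice as often doubles the MTTF.
print(round(mttf_scrubbed_years(cache, 0.001, 0.5)))
```

The quadratic dependence on the per-quadword rate also means that a 10x higher FIT/bit cuts the scrubbed MTTF by 100x at a fixed interval.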
Impact of Scrubbing on Temporal Double Bit MTTF
[Figure: temporal double bit MTTF chart for a 16 Gigabyte Cache]
• FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
  • higher at higher altitudes (e.g., 3-5x at 1.5km in Denver)
• For 16 gigabytes of cache, scrubbing can help
  • compared to a DUE MTTF goal of 10 years
Summary
• SECDED ECC (single error correction, double error detection)
  • commonly used in on-chip caches
  • interleaving converts spatial multi-bit errors to multiple single bit errors
• Scrubbing
  • periodically read cache blocks and correct all single bit errors
  • this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
• Our conclusion: given a detected error target of 10 year MTTF
  • Scrubbing necessary only for very large caches (e.g., 100s of megabytes to gigabytes)
Raw soft error rate: 0.001 – 0.010 FIT/bit
• Y. Tosaka, S. Satoh, K. Suzuki, T. Sugii, H. Ehara, G. A. Woffinden, and S. A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” Symposium on VLSI Technology, Digest of Technical Papers, 1996.
• E. Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.