Virtualized and Flexible ECC for Main Memory
Doe Hyun Yoon and Mattan Erez
Dept. of Electrical and Computer Engineering, The University of Texas at Austin
ASPLOS 2010
Memory Error Protection
• Applying ECC uniformly – ECC DIMMs
  • Simple and transparent to programmers
  • Error protection level is fixed – a design-time decision
• Chipkill-correct used in high-end servers
  • Constrains the memory module design space
  • Allows only x4 DRAMs – lower energy efficiency than x8 DRAMs
• Virtualized ECC – objectives
  • Provide flexible memory error protection
  • Relax the design constraints of chipkill
Virtualized ECC
• Two-tiered error protection
  • Tier-1 Error Code (T1EC): a simple error code for detection or light-weight correction
  • Tier-2 Error Code (T2EC): a strong error-correcting code
• Store T2EC within the memory namespace itself
  • OS manages T2EC
• Flexible memory error protection
  • Different T2EC for different data pages
  • Stronger protection for more important data
Virtualized ECC – Example
[Figure: virtual pages i, j, and k map to physical page frames i, j, and k; the OS additionally maps physical frames to ECC pages. Frame i has a low protection level; frame j maps to ECC page j, whose T2EC provides chipkill; frame k maps to ECC page k, whose larger T2EC provides double chipkill (high protection level). Data and T1EC reside together in the page frames.]
Observations on Memory Errors
• Per-system error rate is still low
  • Most of the time, we check for errors and find none
• Detecting errors is the common-case operation
  • Needs a low-latency, low-complexity error detection mechanism → T1EC
• Correcting errors is an uncommon-case operation
  • Correction can be complex and take a long time
  • But the error correction information still must be stored somewhere → virtualized T2EC
Uniform ECC
[Figure: conventional translation – the virtual address (VPN + offset) maps to a physical address (PFN + offset) in a page frame, and every memory location stores data together with its ECC.]
Virtualized ECC
[Figure: the virtual address (VPN + offset) maps to a physical address (PFN + offset) as usual; in addition, the OS manages a PFN-to-EPN (ECC page number) translation, producing an ECC address (EA = ECC page number + offset) that locates the frame's T2EC in an ECC page. The EA offset scales according to the T2EC size. Data and T1EC stay in the page frame; the T2EC lives in the ECC page.]
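To make the "scales according to the T2EC size" step concrete, here is a minimal sketch of the PA-to-EA arithmetic, assuming 4KB pages, 64B cache lines, and a per-frame ECC-page base address supplied by the OS translation structure; all names here are hypothetical, not from the paper.

```c
#include <stdint.h>

#define PAGE_BYTES 4096u
#define LINE_BYTES   64u  /* data granularity covered by one T2EC */

/* Hypothetical sketch: compute the ECC address (EA) of the T2EC that
 * protects the cache line containing physical address pa. ecc_page_base
 * is the EA of the ECC page the OS mapped to this frame; t2ec_bytes is
 * the per-line T2EC size (e.g., 2B in the ECC x4 configuration). */
static inline uint64_t pa_to_ea(uint64_t pa, uint64_t ecc_page_base,
                                uint32_t t2ec_bytes)
{
    uint64_t line_index = (pa % PAGE_BYTES) / LINE_BYTES;
    return ecc_page_base + line_index * t2ec_bytes;
}
```

Because the offset is scaled by t2ec_bytes, lower protection levels pack more T2ECs into each ECC page.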
Virtualized ECC Operation
• Read: fetch data and T1EC
  • No T2EC needed in most cases
• Write: update data, T1EC, and T2EC
  • ECC Address Translation Unit: fast PA-to-EA translation
  • Update only valid T2EC to DRAM – T2EC lines can be partially valid
  • T2ECs of consecutive data lines map to the same T2EC line
[Figure: example on a two-rank DRAM system – (1) a read of 0x00c0 returns data A and its T1EC; (2) the LLC writes back dirty line B0 at PA 0x0200; (3) the ECC Address Translation Unit translates PA 0x0200 to (4) EA 0x0540; (5) the T2EC line at 0x0540 is updated in DRAM. T2EC for Rank 0 data is stored in Rank 1 and vice versa.]
Penalty with V-ECC
• Increased data miss rate
  • T2EC lines in the LLC reduce the effective LLC size
• Increased traffic due to T2EC write-back
  • One-way write-back traffic
  • Not on the critical path
Chipkill-Correct
• Single Device-error Correct, Double Device-error Detect
  • Can tolerate one DRAM chip failure
  • Can detect a second DRAM chip failure
• Chipkill requires x4 DRAMs
  • x8 chipkill is impractical
  • But x8 DRAM is more energy efficient
Baseline x4 Chipkill
• Two x4 ECC DIMMs
  • 128-bit data + 16-bit ECC (redundancy overhead: 12.5%)
  • 4 check-symbol error code using 4-bit symbols
• Access granularity
  • 64B in DDR2 (min. burst 4 × 128 bits)
  • 128B in DDR3 (min. burst 8 × 128 bits)
[Figure: 36 x4 DRAM chips across two ECC DIMMs on a 144-bit wide data bus.]
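The granularity and overhead numbers above follow directly from the burst length and bus width; a quick sketch of the arithmetic:

```c
#include <stdio.h>

/* Access granularity = min burst length x data-bus width / 8 bits/byte;
 * redundancy overhead = ECC bits / data bits. */
int main(void)
{
    unsigned data_bits = 128, ecc_bits = 16;

    printf("overhead: %.1f%%\n", 100.0 * ecc_bits / data_bits); /* 12.5% */
    printf("DDR2: %u B\n", 4 * data_bits / 8);                  /* 64 B  */
    printf("DDR3: %u B\n", 8 * data_bits / 8);                  /* 128 B */
    return 0;
}
```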
x8 Chipkill
• x8 chipkill with the same access granularity
  • 152-bit wide data path: 128-bit data + 24-bit ECC
  • Redundancy overhead: 18.75%
• Needs a custom-designed DIMM
  • Significantly increases system cost
[Figure: 19 x8 DRAM chips on a 152-bit wide data bus – a non-standard module.]
x8 Chipkill w/ Standard DIMMs
• Increased access granularity
  • 128B in DDR2 (min. burst 4 × 256 bits)
  • 256B in DDR3 (min. burst 8 × 256 bits)
[Figure: 35 x8 DRAM chips (256-bit data + 24-bit ECC) on a 280-bit wide data bus.]
V-ECC for Chipkill
• Use a 3 check-symbol error code
  • Single Symbol-error Correct, Double Symbol-error Detect
• T1EC: 2 check symbols
  • Detects up to 2 symbol errors
• T2EC: the 3rd check symbol
• Combined T1EC/T2EC provides chipkill
V-ECC: ECC x4 Configuration
• Use an 8-bit symbol error code
  • 2 bursts out of a x4 DRAM form an 8-bit symbol
  • Modern DRAMs have a minimum burst of 4 or 8
• 1 x4 ECC DIMM + 1 x4 non-ECC DIMM
• Each DRAM access in DDR2 (burst 4)
  • 64B data, 4B T1EC
  • 2B T2EC is virtualized within the memory namespace
  • 32 T2ECs per 64B cache line
[Figure: 34 x4 DRAM chips on a 136-bit wide data bus; data and T1EC on the DIMMs, T2EC virtualized within memory.]
V-ECC: ECC x8 Configuration
• Use an 8-bit symbol error code
• 2 x8 ECC DIMMs
• Each DRAM access in DDR2 (burst 4)
  • 64B data, 8B T1EC
  • 4B T2EC is virtualized
  • 16 T2ECs per 64B cache line
[Figure: 18 x8 DRAM chips on a 144-bit wide data bus; data and T1EC on the DIMMs, T2EC virtualized within memory.]
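In both configurations many T2ECs pack into one cache line of the ECC page; a small sketch of the packing arithmetic, with the line size and T2EC sizes taken from the two slides above:

```c
#include <stdio.h>

/* T2ECs per cache line = line size / per-line T2EC size. */
int main(void)
{
    unsigned line_bytes = 64;
    printf("ECC x4: %u T2ECs per line\n", line_bytes / 2); /* 2B -> 32 */
    printf("ECC x8: %u T2ECs per line\n", line_bytes / 4); /* 4B -> 16 */
    return 0;
}
```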
Flexible Error Protection
• A single HW design with V-ECC can provide
  • Chipkill-detect, chipkill-correct, and double chipkill-correct
  • Using different T2EC for different pages
• Reliability – performance tradeoff
  • Maximize performance/power efficiency with chipkill-detect
  • Stronger protection at the cost of additional T2EC accesses
Simulator/Workload
• GEMS + DRAMsim
  • An out-of-order SPARC V9 core
  • Exclusive two-level cache hierarchy
  • DDR2 800MHz – 12.8GB/s (128-bit wide data path), 1 channel, 4 ranks
• Power model
  • WATTCH for processor power – scaled to 45nm
  • CACTI for cache power – 45nm
  • Micron model for DRAM power – commodity DRAMs
• Workloads
  • 12 data-intensive applications from SPEC CPU 2006 and PARSEC
  • Microbenchmarks: STREAM and GUPS
Normalized Execution Time
• Less than 1% penalty on average
• Performance penalty tracks spatial locality and the amount of write-back traffic
System Energy Efficiency
• Energy-delay product (EDP) gain
  • ECC x4: 1.1% on average
  • ECC x8: 12.0% on average
[Chart: normalized EDP per application; individual ECC x8 gains of 10%, 12%, 17%, and 20% are highlighted.]
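EDP here is energy times execution time; a one-line helper shows how such gains are computed (a sketch only – the exact energy accounting follows the power models on the previous slide):

```c
/* Energy-delay product: EDP = energy * delay.
 * Returns the relative EDP reduction vs. the baseline. */
static double edp_gain(double e_base, double d_base,
                       double e_new,  double d_new)
{
    return 1.0 - (e_new * d_new) / (e_base * d_base);
}
```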
Flexible Error Protection
[Chart: results for the three protection levels – chipkill-detect, chipkill-correct, and double chipkill-correct.]
Conclusion
• Virtualized ECC
  • Two-tiered error protection with a virtualized T2EC
• Improved system energy efficiency with chipkill
  • Reduces DRAM power consumption by 27%
  • Improves system EDP by 12%
  • Performance penalty: 1% on average
• Error protection even for non-ECC DIMMs
  • Can be used for GPU memory error protection
• Flexibility in error protection
  • Protection level adapts to user/system demand
  • Cost of error protection is proportional to the protection level
Virtualized ECC Operations
• DRAM read
  • Fetch data and T1EC – detect errors
  • No T2EC needed in most cases
• DRAM write-back
  • Update data, T1EC, and T2EC
  • Cache T2EC to exploit locality in T2EC accesses
  • Need to translate PA to EA
    • On-chip ECC address translation unit: a TLB-like structure for fast PA-to-EA translation
• Error correction
  • Need to read the T2EC, which may be in the LLC or in DRAM
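A compact sketch of these operations as control flow, using trivial stubs in place of the check-symbol math and the memory system; none of these helper names come from the paper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Trivial stubs standing in for the check-symbol codes and the memory
 * system; the real versions are the codes and structures in the paper. */
static bool t1ec_ok(const uint8_t *line, const uint8_t *t1ec)
{ (void)line; (void)t1ec; return true; /* pretend: no error detected */ }

static bool t2ec_correct(uint8_t *line, const uint8_t *t2ec)
{ (void)line; (void)t2ec; return true; /* pretend: corrected */ }

static const uint8_t *fetch_t2ec(uint64_t ea)   /* LLC first, then DRAM */
{ (void)ea; return NULL; }

static uint64_t lookup_ea(uint64_t pa)          /* ECC addr translation unit */
{ return pa; }

/* Read path: the common case touches only data + T1EC. */
static bool vecc_read(uint64_t pa, uint8_t *line, const uint8_t *t1ec)
{
    if (t1ec_ok(line, t1ec))
        return true;                 /* common case: no T2EC access */
    /* Uncommon case: translate PA to EA, fetch the T2EC, correct. */
    return t2ec_correct(line, fetch_t2ec(lookup_ea(pa)));
}

/* Write-back path: update data + T1EC, then the (cached) T2EC line. */
static void vecc_writeback(uint64_t pa, const uint8_t *line)
{
    (void)line;
    uint64_t ea = lookup_ea(pa);     /* TLB-like, cached translation */
    (void)ea;                        /* update the T2EC line at ea   */
}
```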
RECAP: V-ECC
• Two-tiered error protection
  • Uniform T1EC
  • Virtualized T2EC
• V-ECC for chipkill
  • ECC x4 configuration: saves 8 data pins
  • ECC x8 configuration: more energy efficient
• Flexible error protection
  • Different T2EC for different pages
  • Stronger protection for important data, no protection for unimportant data
Power Consumption
• DRAM power saving
  • ECC x4: 4.2%
  • ECC x8: 27.8%
• Total power saving
  • ECC x4: 2.1%
  • ECC x8: 13.2%
Caching T2EC
• T2EC occupancy in the LLC: less than 10% on average
• MPKI overhead: very small
• The higher the spatial locality, the smaller the impact on caching behavior
Traffic
• Traffic increase: less than 10% on average
  • Increased demand misses
  • T2EC traffic
• Spatial locality matters, as does the amount of write-back traffic
Virtualized ECC
• Uniform T1EC
  • Low-cost error detection or light-weight correction
• Virtualized T2EC
  • Corrects errors that T1EC detects but cannot correct
  • Cacheable and memory mapped
• A read accesses data and T1EC
  • No T2EC needed in most cases
  • Simpler common-case read operations
• A write updates data, T1EC, and T2EC
Flexible Error Protection
• ECC x8 DRAM configuration
• Stronger error protection at the cost of more T2EC accesses
  • The additional cost of double chipkill (relative to chipkill) is quite small
• Adaptation is at per-page granularity
What if BW is Limited?
• Half DRAM BW – 6.4GB/s
• Emulates a CMP, where bandwidth is scarcer
ECC for Non-ECC DIMMs
• Virtualize ECC in the memory namespace
• Not a two-tiered error protection scheme
  • No uniform ECC storage (for T1EC)
  • But we still call this ECC 'T2EC' to keep the notation consistent
• The virtualized T2EC both detects and corrects errors (see the sketch below)
  • Now a DRAM read also triggers a T2EC access
  • Increased T2EC traffic, increased T2EC occupancy, and a larger penalty
  • But we can detect and correct errors with non-ECC DIMMs
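With non-ECC DIMMs the read path changes: the sketch below, in the same hypothetical helper style as the earlier one, shows that every read now includes a T2EC fetch, which is where the extra traffic and LLC occupancy come from.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stubs, as in the earlier sketch (hypothetical names). */
static uint64_t lookup_ea(uint64_t pa) { return pa; }
static const uint8_t *fetch_t2ec(uint64_t ea) { (void)ea; return NULL; }
static bool t2ec_check_and_correct(uint8_t *line, const uint8_t *t2ec)
{ (void)line; (void)t2ec; return true; }

/* Non-ECC DIMM read path: there is no T1EC, so the virtualized T2EC is
 * fetched on EVERY read, to both detect and correct errors. */
static bool noecc_read(uint64_t pa, uint8_t *line)
{
    const uint8_t *t2ec = fetch_t2ec(lookup_ea(pa));
    return t2ec_check_and_correct(line, t2ec);
}
```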
[Figure: non-ECC DIMM example on a two-rank system – every read and write-back (steps 1–8) triggers a PA-to-EA translation and a companion T2EC access (e.g., PA 0x0180 → EA 0x0550, PA 0x00c0 → EA 0x0510). T2EC for Rank 0 data is stored in Rank 1 and vice versa.]
DIMM Configurations
• Use a 2 check-symbol error code
  • Can detect and correct up to 1 symbol error
  • No 2-symbol error detection
  • Weaker protection than chipkill, but better than nothing
• DIMM configurations
  • Can even use x16 DRAMs (far more energy efficient than x4 DRAMs)
Performance and Energy Efficiency
• More performance degradation (compared to ECC DIMMs)
  • Every read accesses T2EC
  • More T2EC traffic → more T2EC occupancy in the LLC
• Energy efficiency is sometimes better
  • x16 DRAMs save a lot of DRAM power
  • Performance degradation is low when spatial locality is good
Flexible Error Protection
• A page can have different T2EC sizes
• The error protection level of a page can be
  • No protection
  • 1 chip-kill detect
  • 1 chip-kill correct (but cannot detect a second chip-kill)
  • 2 chip-kill correct
• Penalty is proportional to the protection level
  • T2EC size per 64B cache line grows with the level
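One way to picture the per-page choice is a protection-level field that selects the T2EC size; the sketch below uses illustrative sizes, since the slide only states that the penalty is proportional to the level.

```c
#include <stdint.h>

/* Per-page protection level selecting the T2EC size per 64B cache line
 * (non-ECC DIMM case, where the T2EC must detect as well as correct).
 * All sizes here are illustrative assumptions, not numbers from the paper. */
enum prot_level {
    PROT_NONE,       /* no protection                            */
    PROT_DETECT,     /* 1 chip-kill detect                       */
    PROT_CORRECT,    /* 1 chip-kill correct (no 2-chip detect)   */
    PROT_CORRECT2    /* 2 chip-kill correct                      */
};

static uint32_t t2ec_bytes_per_line(enum prot_level p)
{
    static const uint32_t bytes[] = { 0, 2, 4, 8 };  /* assumed sizes */
    return bytes[p];
}
```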
[Chart: results for the non-ECC x8 and non-ECC x16 configurations.]
OS Manages T2EC
• PA-to-EA translation structure
• T2EC storage
  • Only dirty pages require T2EC (with ECC DIMMs)
    • Can use copy-on-write T2EC allocation
  • Every data page needs T2EC in the non-ECC implementation
  • Free the T2EC when a data page is freed or evicted
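A minimal sketch of this OS-side bookkeeping, assuming a per-frame record and hypothetical allocator hooks; with ECC DIMMs, T2EC allocation can be deferred copy-on-write style until the page is first dirtied.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-frame record kept by the OS. */
struct frame_info {
    uint64_t ecc_page_base;  /* EA of this frame's T2EC storage, 0 = none */
    bool     dirty;
};

static uint64_t alloc_ecc_page(void) { return 0x0500; } /* stub allocator */
static void free_ecc_page(uint64_t ea) { (void)ea; }

/* ECC-DIMM case: allocate T2EC lazily, on the first write to the page
 * (copy-on-write style); clean pages need no T2EC since a backing copy
 * exists elsewhere. */
static void on_first_write(struct frame_info *f)
{
    if (!f->dirty) {
        f->dirty = true;
        if (f->ecc_page_base == 0)
            f->ecc_page_base = alloc_ecc_page();
    }
}

/* Free the T2EC whenever the data page itself is freed or evicted. */
static void on_page_free(struct frame_info *f)
{
    if (f->ecc_page_base != 0) {
        free_ecc_page(f->ecc_page_base);
        f->ecc_page_base = 0;
    }
}
```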
PA to EA Translation
• Every write-back (with ECC DIMMs) or read/write (with non-ECC DIMMs) needs to access T2EC
• Translation is similar to VA-to-PA translation
• OS manages a single translation structure
Example Translation
[Figure: a three-level ECC page table walk – fields of the physical address index the Level 1, 2, and 3 tables starting from a base register; the leaf entry gives the ECC page number, and the page offset, shifted according to log2(T2EC size), gives the ECC page offset; together they form the ECC address (EA).]
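A rough sketch of the walk in the figure, assuming a three-level radix table indexed by physical-address fields, 4KB pages, and a final offset scaled by the T2EC size; the field widths and helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u   /* 4KB pages                      */
#define LINE_SHIFT  6u   /* 64B cache lines                */
#define L_BITS      9u   /* index bits per level (assumed) */

/* Stub for a memory read of a table entry (next-level base or leaf). */
static uint64_t read_entry(uint64_t table_base, uint64_t index)
{ return table_base + index; /* placeholder */ }

/* Three-level walk: each entry points at the next table; the leaf holds
 * the ECC page number (EPN). The data-page offset is then scaled by the
 * T2EC size: (offset / 64B line) * 2^log2_t2ec.
 * Assumes ECC pages are also 4KB. */
static uint64_t walk_ecc_table(uint64_t root, uint64_t pa, unsigned log2_t2ec)
{
    uint64_t mask = (1u << L_BITS) - 1;
    uint64_t t1  = read_entry(root, (pa >> (PAGE_SHIFT + 2 * L_BITS)) & mask);
    uint64_t t2  = read_entry(t1,   (pa >> (PAGE_SHIFT + L_BITS)) & mask);
    uint64_t epn = read_entry(t2,   (pa >> PAGE_SHIFT) & mask);

    uint64_t ecc_off = (pa & ((1u << PAGE_SHIFT) - 1))
                       >> (LINE_SHIFT - log2_t2ec);  /* scale by T2EC size */
    return (epn << PAGE_SHIFT) | ecc_off;
}
```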
Accelerating Translation
• ECC address translation unit
  • Caches PA-to-EA translations, like a TLB
  • Hierarchical caching – 2 levels
    • 1st level kept consistent with the TLB
    • 2nd level acts as a victim cache
• Read-triggered translation
  • 100% hit rate; the L1 EA cache is consistent with the TLB
  • Occurs only with non-ECC DIMMs
• Write-triggered translation
  • Likely to hit; the L2 EA cache can be relatively large
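A sketch of the two-level lookup described above, with the L1 EA cache holding the TLB-resident pages and the L2 acting as a victim cache; the structure sizes follow the configuration on the "Possible Impacts" slide, while the code names and the direct-mapped L2 are simplifying assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-level EA cache: L1 mirrors the TLB's resident pages,
 * L2 is a larger victim cache for translations evicted from L1. */
struct ea_entry { uint64_t pfn; uint64_t ea_base; bool valid; };

#define L1_ENTRIES   16u    /* fully associative, TLB-consistent         */
#define L2_ENTRIES 4096u    /* 8-way in the evaluation; direct-mapped here */

static struct ea_entry l1[L1_ENTRIES], l2[L2_ENTRIES];

static bool lookup_ea_cache(uint64_t pfn, uint64_t *ea_base)
{
    for (unsigned i = 0; i < L1_ENTRIES; i++)      /* L1: FA search */
        if (l1[i].valid && l1[i].pfn == pfn) {
            *ea_base = l1[i].ea_base;
            return true;
        }
    struct ea_entry *e = &l2[pfn % L2_ENTRIES];    /* L2: victim cache */
    if (e->valid && e->pfn == pfn) {
        *ea_base = e->ea_base;
        return true;                               /* promotion to L1 omitted */
    }
    return false;  /* miss: fall back to the in-memory ECC table walk */
}
```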
ECC Address Translation Unit
[Figure: the TLB and the ECC address translation unit side by side – consistency is maintained between the TLB and the L1 EA cache; a PA looks up the 2-level EA cache (L1, then L2) under the control logic, with an EA MSHR handling external EA translations on a miss; the output is the EA.]
Possible Impacts
• TLB miss penalty
  • VA-to-PA translation, then PA-to-EA translation
  • Seems negligible – the evaluation already assumed a doubled TLB miss penalty
  • Design alternative: translate VA to EA directly
    • Requires a per-process translation structure
    • But potentially less impact on TLB miss penalty
• EA cache misses per 1000 instructions
  • Configuration: 16-entry fully associative L1 EA cache, 4K-entry 8-way L2 EA cache
  • ~3 in omnetpp and canneal, ~12 in GUPS, less than 1 in the other apps
• Things might get messy with a software TLB handler