250 likes | 522 Views
Efficient Scrub Mechanisms for Error-Prone Emerging Memories. Manu Awasthi ǂ , Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran ‡ , Viji Srinivasan ‡ ǂ Micron, ⁺ University of Utah, ‡ IBM Research. Executive Summary.
E N D
Efficient Scrub Mechanisms for Error-Prone Emerging Memories Manu Awasthiǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran‡, Viji Srinivasan‡ ǂMicron, ⁺University of Utah, ‡IBM Research
Executive Summary • Future memory hierarchies will likely include NVMs • Focus of this work: Phase Change Memory (PCM) • Multi level cells in PCM appear imminent • A number of proposals exist to handle hard errors and lifetime issues of PCM devices • Resistance Drift is a lesser explored phenomenon • Will become primary cause of “soft errors” -- Need to explore holistic solutions to counter drift
Phase Change Memory - MLC (01) (11) (10) (00) Crystalline Amorphous Resistance (110) (101) (100) (011) (010) (001) (111) (000) • Chalcogenide material can exist in crystalline or amorphous states • The material can also be programmed into intermediate states • Leads to many intermediate states, paving way for Multi Level Cells (MLCs)
What is Resistance Drift? Time 11 10 01 00 ERROR!! Tn B T0 A Resistance
Resistance Drift - Issues Rdrift(t) = R0х (t)α • Programmed resistance drifts according to power law equation - • R0, α usually follow a Gaussian distribution • Time to drift (error) depends on • Programmed resistance (R0), and • Drift Coefficient (α) • Is highly unpredictable!!
Resistance Drift - How it happens ERROR!! 11 10 01 00 Number of Cells R0 R0 Rt Rt • Median case cell • Typical R0 • Typical α • Worst case cell • High R0 • High α Scrub rate will be dictated by the Worst Case R0 and Worst Case α
Resistance Drift Data (11) (01) (10) (00)
Inadequacy of Device Level Solutions • Precise writes –> write – compare – write • Can be effective, requires multiple iterations • Increases write time, total write energy, decreases lifetime • Correcting values during read not effective • Coding techniques • Can help, but cannot eliminate the problem by themselves
Naïve Solution : Scrubbing Refresh should be reactionary NOT precautionary! • Drift resets with every cell reprogram (write) • Leverage existing error correction mechanisms, e.g., ECC - has its own drawbacks • A Full Refresh/Scrub (read-compare-write) is extremely costly in PCM • Each PCM write takes 100 - 1000ns • Writing to a 2-bit cell may consume as much as 1.6nJ • Requires 600 refreshes in parallel
Solution Outline • A three pronged approach • Incorporate stronger ECC codes • Incorporate low-cost error detection mechanisms • Adapt scrub algorithms that are conservative and adaptive
Additional ECC support • With stronger ECC support, scrub operations can be made infrequent • If E errors can be corrected and E+1 detected, then scrub operation can be postponed till the Eth error • Can provide support for hard error tolerance as well • We advocate ECC-8 (E-8) • Correct eight errors, detect nine • Leads to 14.25% storage overhead, for 512 bits
LARDD Read Line Check for Errors True Errors < E After N cycles False Scrub Line • Light Array Reads for Drift Detection • Support for E Error-correcting, E+1 error detecting codes assumed • Lines are read periodically and checked for correctness • Only after the number of errors reaches a threshold, scrubbingis performed • Drift detection has to be simple, on-chip
Read Pipeline with LARDD & Parity (11) (01) (10) (00) 15 15 • Simple error detection mechanism to bypass complex BCH circuit • 256 cell line divided up into eight 32-cell fields • Count number of drift prone states, store as parity information • For every LARDD, check if parity has changed • If true, invoke BCH circuitry
Headroom Read Line Check for Errors True Errors < E-h After N cycles False Scrub Line • Headroom-h scheme – scrub is triggered if E-h errors are detected • Decreases probabilityof errors slipping through • Increases frequency of full scrub and hence decreases life time • Presents trade-off between Hard and Soft errors
Headroom Gradual Read Line After N cycles Check for Errors True Errors <= E-h-g False Double LARDD rate • Start with a small LARDD frequency • Increase frequency as drift-based errors increase • Decreases energy overhead • Might cause slight increase in error rate
Adaptive Scrub • Dynamic events can affect reliability • Temperature increases can increase α and decrease drift time • Cell lifetime/wearout is also an issue • Soft error rate depends on prevalence of drift prone states • Hard errors show up with aging cells • Start with headroom h policy, get rid of headroom when at least E-h-1 hard errors • Can increase LARDD frequency when total number of hard errors increases preset threshold
Results Stronger ECC + LARDD support leads to decrease in unrecoverable errors
Results - Headroom Schemes • Compared to ECC-1, for 512 s LARDD interval • ECC-8-headroom-3 leads to 99.6% reduction in error rate, 35.4% reduction in scrub energy • ECC-8-headroom-3-gradual-1 leads to 96.5% reduction in error rate, 37.8% reduction in scrub energy
Device Level Solution – Non Uniform Banding Before Mean R0 11 10 01 00 After Resistance
Comparing with Device Level Solutions ECC-8-headroom-3 provides for least number of uncorrectable errors
Conclusions • Resistance drift will exacerbate with MLC scaling • Naïve solutions based on ECC support are costly for PCM • Increased write energy, decreased lifetimes • Holistic solutions need to be explored to counter drift at device, architectural and system levels • 39% reduction in energy, 4x less errors, 102x increase in lifetime