Pay-As-You-Go Storage-Efficient Hard Error Correction
Moinuddin K. Qureshi, ECE, Georgia Tech
Research done while at: IBM T. J. Watson Research Center, New York
MICRO 2011, Dec 6, 2011
Introduction
• PCM is a scalable technology. Device state is changed by heating.
• Over time, write operations break the heater and the cell gets stuck.
• Reported write endurance: 10-100 million writes per cell.
• With good wear leveling, it is still possible to have 8+ years of lifetime.
PAY-AS-YOU-GO, MICRO-2011
Not All Cells Are Created Equal
• Variability in lifetime due to process variation: weak vs. strong cells
• Weak cells fail much earlier and greatly reduce system lifetime
• Lifetime is usually modeled as Gaussian with SDEV of 10-30% of the mean
• We use SDEV = 20% of the mean
• P(5 SDEV from mean) ≈ 10^-6
• For a 1GB memory bank, 8K bits fail at time 0, and more fail as we write!
PCM needs a significant amount of error correction to handle variability
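The arithmetic on this slide can be checked with a short sketch. It multiplies the number of bits in a 1GB bank by the slide's rounded tail probability, and also prints the exact one-sided Gaussian tail for comparison. The function name and the reading of 1GB as 2^30 bytes are assumptions made for illustration; the exact tail is somewhat smaller than 10^-6, so the slide's value is best read as an order-of-magnitude estimate.

```python
import math

def lower_tail(k_sdev):
    """P(X < mean - k*SDEV) for a Gaussian, via the complementary error function."""
    return 0.5 * math.erfc(k_sdev / math.sqrt(2))

BITS_PER_BANK = 8 * 2**30      # assumption: 1GB = 2^30 bytes
SLIDE_TAIL_PROB = 1e-6         # rounded value quoted on the slide

# With SDEV = 20% of the mean, 5 SDEV below the mean is zero endurance,
# so cells in this tail are the ones that fail at time 0.
exact_tail = lower_tail(5)
weak_bits_slide = BITS_PER_BANK * SLIDE_TAIL_PROB
weak_bits_exact = BITS_PER_BANK * exact_tail

print(f"exact one-sided tail at 5 SDEV: {exact_tail:.2e}")
print(f"weak bits using the slide's 1e-6: {weak_bits_slide:.0f}")
print(f"weak bits using the exact tail: {weak_bits_exact:.0f}")
```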
Write Efficient Code
• Traditional ECC codes are write intensive, which causes more wear
• Endurance-related (hard) faults are identified with a checker read
• Write-efficient code: Error Correcting Pointers [ISCA’10]
[Figure: an ECP entry for a 512-bit cache line (bits 0 … 511): a 9-bit pointer plus a 1-bit data field D]
• ECP needs 10 bits per entry. Handling multiple faults needs 1 Full bit
• For correcting N errors, ECP needs (10N+1) bits
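A minimal sketch of the ECP idea, assuming each entry is a (pointer, replacement bit) pair and that a read simply substitutes the replacement bit at the pointed-to position. The function names and the list-of-bits representation are illustrative, not taken from the talk.

```python
LINE_BITS = 512
POINTER_BITS = 9                # 2**9 == 512 addressable bit positions

def ecp_storage_bits(n_entries):
    """Per-line ECP cost from the slide: 10 bits per entry plus 1 Full bit."""
    return (POINTER_BITS + 1) * n_entries + 1

def ecp_read(raw_bits, entries):
    """Apply ECP entries to a raw line read.

    raw_bits: list of 0/1 values as read from the (possibly faulty) cells.
    entries:  list of (pointer, replacement_bit) pairs, one per known stuck cell.
    """
    corrected = list(raw_bits)
    for pointer, replacement in entries:
        corrected[pointer] = replacement   # use the stored bit instead of the stuck cell
    return corrected

# Hypothetical example: cell 3 is stuck at 0, but the written data had a 1 there.
intended = [1] * LINE_BITS
raw = list(intended)
raw[3] = 0
entries = [(3, 1)]

assert ecp_read(raw, entries) == intended
print(ecp_storage_bits(6))      # bits per line for ECP-6
```

Because only the pointer and replacement bit are updated when a new fault is found, the correction metadata is not rewritten on every data write, which is the write-efficiency argument made on the slide.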
Expensive to Correct Many Errors
[Figure: baseline system lifetime in years (0 to 7) for NoECP and ECP-1 through ECP-6]
• To get 6+ years of lifetime, we need to correct six errors per line
• Storage: 61 bits/line (about 12%, 1GB for 8GB), which is expensive
• Unlike ECC in current DRAM chips, this overhead is not optional
Goal: Reduce storage significantly (3X-6X) while retaining lifetime
Motivation
• Key insight: very few lines have a large number of errors
[Figure: utilization of error correction entries per line]
• Uniformly allocating error correction entries is inefficient (by ~20X)
• We do not need to pay for error correction of each line upfront
Pay-As-You-Go: Give error correction entries in proportion to errors
Outline
• Introduction & Motivation
• PAYG Design
• Results
• Even More Storage Efficiency
• Related Work
• Summary
Naïve Design for PAYG
• Given that 73% of lines have no error, why not give ECP-6 only on error?
[Figure: a 64B memory line with an OFB field; the Global Error Correction (GEC) Pool organized as sets by ways (number of GEC entries per set); each GEC entry holds V, TAG, and ECP-N]
• GEC Pool structure: set associative vs. fully associative (impractical)
Three Key Problems
1. Set associative structure is inefficient (by ~8X for 8-way)
2. If we allocate six ECP entries to each GEC entry, most error correction entries still remain unused
3. Given that >25% of lines are likely to have at least one error, the latency impact of GEC is significant
Inefficiency of Set Associative GEC
• There are tens or hundreds of thousands of sets
• Any set could overflow
• How many entries are used before one set overflows? (Buckets-and-Balls)
• An 8-way GEC is only 12% full when one set overflows, so we need 8X entries
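The buckets-and-balls question can be explored with a small Monte Carlo sketch. It assumes each new entry lands in a uniformly random set and stops the first time an entry arrives at a set that is already full. The set count below is deliberately smaller than in the talk to keep the run short, and the seed and function name are arbitrary, so the printed fill level is only indicative; it shrinks as the number of sets grows.

```python
import random

def fill_at_first_overflow(num_sets, ways, rng):
    """Throw entries at random sets; return table occupancy at the first overflow."""
    counts = [0] * num_sets
    placed = 0
    while True:
        s = rng.randrange(num_sets)
        if counts[s] == ways:              # this set is already full: overflow
            return placed / (num_sets * ways)
        counts[s] += 1
        placed += 1

rng = random.Random(2011)                  # fixed seed so runs are repeatable
trials = [fill_at_first_overflow(16 * 1024, 8, rng) for _ in range(5)]
avg_fill = sum(trials) / len(trials)
print(f"average fill at first overflow: {avg_fill:.1%}")
```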
Scalable Structure for GEC Pool
[Figure: Set Associative Table (SAT) and Global Collision Table (GCT); labels: GEC Entry, OFB, PTR, GCT-HEAD, "taken by some other set"]
• PTR is two-way replicated
• “Hash-Table With Chaining” structure for flexibility & low latency
Scalable Structure for GEC Pool
• A Global Collision Table (GCT) with half as many sets as the SAT is sufficient
• Let's say we want to store N entries
• Proposed GEC structure has latency similar to a Set Associative Table while needing 5X fewer entries
Solving Other Two Problems
2. Fine-Grained Allocation for effectively utilizing ECP entries
• Each GEC entry has only ECP-1
• Each line can have multiple GEC entries
• We guarantee that all entries are in the same set (of SAT/GCT)
• A faulty line can get more than ECP-6 as well
3. Local Error Correction (LEC) for low latency in the common case
• Each line has a dedicated ECP-1 (handles 95% of lines)
• Ensures that extra accesses (GEC) are needed for only a few lines
PAYG: Tying it All Together
• PAYG performs on-demand allocation of error correction entries
• PAYG has 3 levels: LEC is the first line of defense (lowers latency), SAT is the second, and GCT is the third (flexible)
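The three levels can be sketched as a small data structure. This is a simplified model under stated assumptions, not the hardware design from the talk: the full line address is used as the tag, GCT-HEAD is modeled as a simple counter that hands out the next unclaimed GCT set, PTR replication is omitted, and, as a simplification, entries for one line may spill from one set into the next set in its chain. All class and method names are hypothetical.

```python
class GECSet:
    """One set of GEC entries; the same layout is used for SAT and GCT sets."""
    def __init__(self, ways):
        self.ways = ways
        self.entries = []    # (tag, fault_pos, replacement_bit): one ECP-1 each
        self.ofb = False     # set once a GCT set has been chained to this set
        self.ptr = None      # index of the chained GCT set

class PAYG:
    def __init__(self, sat_sets, gct_sets, ways):
        self.lec = {}                                   # line -> (fault_pos, bit)
        self.sat = [GECSet(ways) for _ in range(sat_sets)]
        self.gct = [GECSet(ways) for _ in range(gct_sets)]
        self.gct_head = 0                               # next unclaimed GCT set

    def _chain(self, line):
        """Yield the SAT set for this line, then any chained GCT sets."""
        s = self.sat[line % len(self.sat)]
        yield s
        while s.ofb:
            s = self.gct[s.ptr]
            yield s

    def record_fault(self, line, pos, bit):
        """Allocate one ECP-1 entry on demand; False means the pool is exhausted."""
        if line not in self.lec:                        # level 1: LEC
            self.lec[line] = (pos, bit)
            return True
        last = None
        for s in self._chain(line):                     # level 2: SAT, level 3: GCT
            if len(s.entries) < s.ways:
                s.entries.append((line, pos, bit))
                return True
            last = s
        if self.gct_head == len(self.gct):              # no GCT set left to claim
            return False
        last.ofb, last.ptr = True, self.gct_head        # chain a fresh GCT set
        self.gct[self.gct_head].entries.append((line, pos, bit))
        self.gct_head += 1
        return True

    def corrections(self, line):
        """Collect every (fault_pos, bit) pair that applies to this line."""
        out = [self.lec[line]] if line in self.lec else []
        for s in self._chain(line):
            out += [(p, b) for tag, p, b in s.entries if tag == line]
        return out

# Hypothetical usage: one line accumulates five faults.
payg = PAYG(sat_sets=4, gct_sets=2, ways=2)
for pos in range(5):
    payg.record_fault(line=8, pos=pos, bit=1)
print(payg.corrections(8))
```

In this model the first fault costs no extra lookup because it is served by the LEC, and only lines with more faults walk the SAT and GCT chain.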
Evaluation Settings
Assumptions:
1. Mean writes of 32 million, SDEV = 20%, no correlation
2. Perfect wear leveling: all lines get the same number of writes
3. Writes are converted into write-read to detect faults
Configuration:
• PCM bank of 1GB with 64B lines, so 16 million lines per bank
• Write latency of 1 microsecond
• At 100% write traffic, lifetime is 18 years (if zero variance)
Figure of Merit:
• Uniform ECP-6 gets 35% of ideal lifetime, so 6.5 years
• We report lifetime with respect to Uniform ECP-6
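The 18-year figure can be reproduced with a back-of-the-envelope sketch. It assumes that writes to a bank are serviced one at a time, that every line wears out at exactly the mean endurance (zero variance), and that "32 Million" means 32 × 2^20; these readings are assumptions, not statements from the talk.

```python
LINES_PER_BANK = 2**30 // 64        # 1GB bank with 64B lines
MEAN_WRITES = 32 * 2**20            # assumption: "32 Million" read as 32 * 2^20
WRITE_LATENCY_S = 1e-6              # 1 microsecond per write
SECONDS_PER_YEAR = 365 * 24 * 3600

# Total writes the bank can absorb, issued back to back at 100% write traffic.
ideal_years = LINES_PER_BANK * MEAN_WRITES * WRITE_LATENCY_S / SECONDS_PER_YEAR

print(f"lines per bank: {LINES_PER_BANK}")
print(f"ideal lifetime: {ideal_years:.1f} years")
```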
Importance of Scalable GEC Pool
[Figure: NoFGA-NoGCT vs. number of SAT sets, and NoFGA-wGCT vs. number of GCT sets (SAT sets = 128K); total sets 128K + 64K = 192K]
• Proposed structure reduces storage overhead of GEC by more than 5X
Importance of Fine-Grained Allocation
• Fine-Grained Allocation improves the effectiveness of PAYG
Importance of LEC
• We can get higher lifetime by increasing GEC size, but we still need LEC
[Figure: annotated at 5 years]
• Without LEC, latency impact is significant. With LEC, not so much
• For the first 5 years, PAYG incurs on average 1 extra access for < 0.4% of accesses
Storage Overhead
• Total storage overhead to protect 1GB reduces from 122MB to 39MB, down 83MB
• PAYG provides lifetime similar to ECP-8 at 3.1X less storage than ECP-6
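These totals follow from per-line bit counts, as the sketch below shows. It assumes that MB means 2^20 bytes and that the 19.5 bits/line figure quoted later in the talk is the PAYG average per line; the helper name is illustrative.

```python
LINES_PER_BANK = 2**30 // 64        # 1GB bank with 64B lines
BITS_PER_MB = 8 * 2**20             # assumption: MB taken as 2^20 bytes

def overhead_mb(bits_per_line):
    """Total error correction storage for one bank, in MB."""
    return bits_per_line * LINES_PER_BANK / BITS_PER_MB

ecp6_mb = overhead_mb(10 * 6 + 1)   # uniform ECP-6: 61 bits per line
payg_mb = overhead_mb(19.5)         # PAYG average per line, as quoted in the talk

print(f"ECP-6: {ecp6_mb:.0f}MB, PAYG: {payg_mb:.0f}MB")
print(f"saving: {ecp6_mb - payg_mb:.0f}MB, ratio: {ecp6_mb / payg_mb:.1f}X")
```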
Efficient Single Bit Correction
• LEC is responsible for most of the storage overhead (13 bits out of 19.5 bits)
• Need efficient schemes for single-bit hard faults: Alternate Data Retry (ADR)
• ADR: Mask a hard fault by storing data in either normal or inverted form
[Figure: data with a stuck-at-0 (SA-0) cell, stored in normal and inverted (INV) form]
• ADR needs only 1 bit to mask a single stuck-at fault (caveat: double write)
• Reduce storage overhead of PAYG by using ADR instead of ECP-1 in LEC
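A minimal sketch of ADR on a line with one stuck cell. It assumes the stuck cell is modeled as always reading back a fixed value, that a read-back check decides whether a retry is needed, and that the inversion flag itself is stored reliably. The function names are illustrative.

```python
def cell_write(bits, stuck_pos, stuck_val):
    """Model a line whose cell at stuck_pos always holds stuck_val."""
    out = list(bits)
    out[stuck_pos] = stuck_val
    return out

def adr_store(data, stuck_pos, stuck_val):
    """Write data; on a read-back mismatch, retry with inverted data (second write)."""
    stored = cell_write(data, stuck_pos, stuck_val)
    if stored == data:
        return stored, 0                     # normal form worked, INV = 0
    inverted = [1 - b for b in data]
    stored = cell_write(inverted, stuck_pos, stuck_val)
    return stored, 1                         # inverted form masks the fault, INV = 1

def adr_load(stored, inv):
    """Undo the inversion on read if the INV bit is set."""
    return [1 - b for b in stored] if inv else list(stored)

# Hypothetical example: bit 2 is stuck at 0 and the data wants a 1 there.
data = [1, 0, 1, 1]
stored, inv = adr_store(data, stuck_pos=2, stuck_val=0)
assert adr_load(stored, inv) == data
print(stored, inv)
```

The retry path is the "double write" caveat from the slide: when the first attempt conflicts with the stuck value, the line is written a second time in inverted form.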
Comparisons
• It is hard to scale ADR to multiple faults
• SAFER [MICRO’10] partitions lines with multiple faults into single-bit faults
• SAFER needs 55 bits/line and provides a lifetime of ~ECP-6
• PAYG with heterogeneous error correction reduces storage by 6X
Non Uniform Error Correction
Variable Strength ECC (VS-ECC) by Alameldeen+ ISCA’11
• Proposed for cache reliability at low voltages
• Each way has ECC-4 for one quarter of ways, allocated based on testing
• Difference: Cache line disabling works. Only set associative structure.
Layered ECP by Schechter+ ISCA’10
• ECP-1 for each line, and some ECP entries for each page
• In essence, this is a set-associative GEC with ECP-1 in LEC
• Difference: Set associative GEC requires 5X more entries (inefficient)
Line Sparing with FREE-p by Yoon+ HPCA’11
• A faulty line is remapped to a spare area using an embedded pointer
• Sparing needs 1 good line for 1 uncorrectable fault
• Difference: PAYG is much more storage efficient than sparing
FREE-p: Sparing vs. Correction
• For 1 extra error bit, PAYG needs a 20-bit GEC entry, while FREE-p needs 512 bits
• PAYG is more effective than line sparing with FREE-p
Summary
• PCM: limited endurance; variability across cells reduces lifetime
• Need to correct many (six) errors per line
• Uniform allocation is expensive and inefficient (only 0.3 out of 6 used)
• Pay-As-You-Go (PAYG): Allocate error correction entries on demand
• PAYG has LEC + GEC Pool (Set Associative Table + Global Collision Table)
• Provides 1.13X lifetime compared to ECP-6 at 3.1X lower overhead
• Heterogeneous scheme (ADR for LEC) reduces storage by 6X
• PAYG is useful for efficient hard-error correction in other technologies too