Pay-As-You-Go Storage-Efficient Hard Error Correction


Presentation Transcript


  1. Pay-As-You-Go Storage-Efficient Hard Error Correction. Moinuddin K. Qureshi, ECE, Georgia Tech. Research done while at IBM T. J. Watson Research Center, New York. MICRO 2011, Dec 6, 2011.

  2. Introduction. PCM is a scalable technology. Device state is changed by heating. Over time, write operations break the heater → the cell gets stuck. Reported write endurance: 10-100 million writes/cell. With good wear leveling, a lifetime of 8+ years is still possible. PAY-AS-YOU-GO, MICRO-2011

  3. Not All Cells Are Created Equal • Variability in lifetime due to process variation: weak vs. strong cells • Weak cells fail much earlier → greatly reduce system lifetime • Lifetime is usually modeled as Gaussian with SDEV of 10-30% of mean • We use SDEV = 20% of mean • P(5 SDEV from mean) ≈ 10^-6 • For a 1GB memory bank, about 8K bits fail at time 0, and more as we write! PCM needs a significant amount of error correction to handle variability.
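The arithmetic behind the "8K bits at time 0" figure can be sketched as follows. This is an illustration, not the author's calculation: the function name `lower_tail` and the constants are assumptions for the example. With SDEV at 20% of the mean, a cell 5 SDEV below the mean has zero endurance, so the fraction of such cells times the bank size gives the number of cells dead on arrival. The exact one-sided Gaussian tail is somewhat smaller than the slide's rounded 10^-6, so the slide's count is best read as an order-of-magnitude estimate.

```python
import math

def lower_tail(k_sdev: float) -> float:
    """P(X < mean - k*sdev) for a Gaussian, via the complementary error function."""
    return 0.5 * math.erfc(k_sdev / math.sqrt(2))

BITS_PER_BANK = 8 * 2**30      # 1 GB bank expressed in bits
SDEV_FRACTION = 0.20           # SDEV = 20% of mean (from the slide)

# Endurance at 5 SDEV below the mean, as a fraction of the mean endurance.
endurance_at_5_sdev = 1 - 5 * SDEV_FRACTION

exact_tail = lower_tail(5)                  # exact one-sided tail probability
cells_slide = BITS_PER_BANK * 1e-6          # using the slide's rounded 1e-6
cells_exact = BITS_PER_BANK * exact_tail    # using the exact one-sided tail

print(f"endurance at -5 SDEV: {endurance_at_5_sdev:.2f} x mean")
print(f"dead cells with P=1e-6: {cells_slide:.0f}")
print(f"dead cells with exact tail {exact_tail:.2e}: {cells_exact:.0f}")
```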

  4. Write Efficient Code. Traditional ECC codes are write intensive → more wear. Endurance-related (hard) faults are identified with a checker read. Write-efficient code: Error Correcting Pointers (ECP) [ISCA’10]. [Figure: a 512-bit cache line (bits 0 to 511) with a failed cell marked X, and one ECP entry made of a 9-bit pointer plus a 1-bit replacement value (D).] ECP needs 10 bits per entry. Handling multiple faults needs 1 Full bit. For correcting N errors, ECP needs (10N+1) bits.
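A minimal sketch of how a pointer-plus-replacement-bit entry can mask stuck cells, assuming the fault positions are already known from the checker read. The names `EcpEntry`, `write_line`, and `read_line` are hypothetical, and the Full bit and entry allocation are not modeled.

```python
from dataclasses import dataclass

@dataclass
class EcpEntry:
    pointer: int       # index of the failed cell (9 bits for a 512-bit line)
    replacement: int   # 1-bit value that stands in for the failed cell

def write_line(data, stuck, entries):
    """Return the raw cell contents after a write.

    data:    list of 0/1 values the line should hold
    stuck:   dict mapping a cell index to the value it is stuck at
    entries: ECP entries for this line; their replacement bits are updated
    """
    raw = list(data)
    for pos, val in stuck.items():
        raw[pos] = val                       # a stuck cell ignores the write
    for entry in entries:
        entry.replacement = data[entry.pointer]
    return raw

def read_line(raw, entries):
    """Return corrected data: replacement bits override the pointed-to cells."""
    bits = list(raw)
    for entry in entries:
        bits[entry.pointer] = entry.replacement
    return bits

data = [1, 0, 1, 1, 0, 0, 1, 0]              # short line for illustration
stuck = {3: 0}                               # cell 3 is stuck at 0
entries = [EcpEntry(pointer=3, replacement=0)]
raw = write_line(data, stuck, entries)
recovered = read_line(raw, entries)
```

Only the replacement bit changes on a write, which is why the scheme adds little extra wear compared with recomputing a full ECC codeword.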

  5. Expensive to Correct Many Errors. [Figure: baseline system lifetime (years, 0 to 7) for NoECP and ECP-1 through ECP-6.] To get 6+ years of lifetime, we need to correct six errors per line. Storage: 61 bits/line (about 12%, i.e. 1GB for 8GB) → expensive. Unlike ECC in current DRAM chips, this overhead is not optional. Goal: reduce storage significantly (3X-6X) while retaining lifetime.
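The storage figures on this slide follow from the (10N+1) formula. A short sketch of the arithmetic, where the helper name `ecp_bits` is an assumption for the example:

```python
LINE_BITS = 512  # 64B line

def ecp_bits(n: int) -> int:
    """Bits per line for ECP-N: 10 bits per entry plus 1 Full bit."""
    return 10 * n + 1 if n > 0 else 0

for n in range(7):
    bits = ecp_bits(n)
    print(f"ECP-{n}: {bits:2d} bits/line, {100 * bits / LINE_BITS:.1f}% overhead")

overhead_ecp6 = ecp_bits(6) / LINE_BITS     # fraction of data capacity
extra_gb_for_8gb = 8 * overhead_ecp6        # extra storage for 8GB of data
```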

  6. Motivation. Key insight: very few lines have a large number of errors. [Figure: utilization of error correction entries per line.] Uniformly allocating error correction entries is inefficient (by ~20X). We do not need to pay for error correction of each line upfront. Pay-As-You-Go: give error correction entries in proportion to errors.

  7. Outline • Introduction & Motivation • PAYG Design • Results • Even More Storage Efficiency • Related Work • Summary

  8. Naïve Design for PAYG. Given that 73% of lines have no error, why not give ECP-6 only on error? [Figure: each 64B memory line carries an OFB; the Global Error Correction (GEC) pool is organized as sets × ways (number of GEC entries per set), and each GEC entry holds V, TAG, and ECP-N.] GEC pool structure: set associative vs. fully associative (impractical).

  9. Three Key Problems. 1. A set-associative structure is inefficient (by ~8X for 8-way). 2. If we allocate six ECP entries per GEC entry, most error correction entries still remain unused. 3. Given that >25% of lines are likely to have at least one error, the latency impact of GEC is significant.

  10. Inefficiency of Set Associative GEC. There are tens or hundreds of thousands of sets → any set could overflow. How many entries are used before one set overflows? Buckets-and-balls: an 8-way GEC is only 12% full when one set overflows → need 8X entries.
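The buckets-and-balls question can be explored with a small Monte Carlo sketch. It assumes faulty lines map to sets uniformly at random and that each needs exactly one entry. The function name and parameters are hypothetical, and the simplified model is not expected to reproduce the slide's exact 12% figure.

```python
import random

def fill_at_first_overflow(num_sets: int, ways: int, seed: int = 0) -> float:
    """Throw entries into random sets until one set overflows.

    Returns the fraction of the table that is occupied at that moment.
    """
    rng = random.Random(seed)
    occupancy = [0] * num_sets
    placed = 0
    while True:
        s = rng.randrange(num_sets)
        if occupancy[s] == ways:            # this set is already full: overflow
            return placed / (num_sets * ways)
        occupancy[s] += 1
        placed += 1

fraction = fill_at_first_overflow(num_sets=128 * 1024, ways=8)
print(f"table is {100 * fraction:.1f}% full at first overflow")
```

The more sets there are, the earlier the first overflow tends to happen relative to total capacity, which is the source of the inefficiency the slide describes.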

  11. Scalable Structure for GEC Pool. [Figure: a Set Associative Table (SAT) of GEC entries whose sets carry OFB and PTR fields linking into the Global Collision Table (GCT), with a GCT-HEAD; one GCT set is marked as taken by some other set. *PTR is two-way replicated.] “Hash-Table With Chaining” structure for flexibility & low latency.

  12. Scalable Structure for GEC Pool. A Global Collision Table (GCT) with half as many sets as the SAT is sufficient. Let's say we want to store N entries: the proposed GEC structure has latency similar to a Set Associative Table while needing 5X fewer entries.
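A rough sketch of a "hash table with chaining" pool in the spirit of the SAT and GCT described above. Several details are assumptions rather than the paper's design: sets are indexed by line address modulo the number of SAT sets, GCT-HEAD is modeled as a simple counter of the next unclaimed GCT set, a non-empty pointer stands in for the OFB, and the two-way replication of PTR is not modeled. The class and method names are hypothetical.

```python
class _Set:
    def __init__(self):
        self.entries = []   # list of (line_addr, payload) pairs
        self.ptr = None     # index of the chained GCT set, if any

class GecPool:
    def __init__(self, sat_sets: int, ways: int):
        self.ways = ways
        self.sat = [_Set() for _ in range(sat_sets)]
        self.gct = [_Set() for _ in range(sat_sets // 2)]  # half as many sets
        self.gct_head = 0   # next GCT set not yet claimed by any chain

    def insert(self, line_addr: int, payload) -> bool:
        """Store one entry; returns False when the pool cannot hold it."""
        s = self.sat[line_addr % len(self.sat)]
        while True:
            if len(s.entries) < self.ways:
                s.entries.append((line_addr, payload))
                return True
            if s.ptr is None:                    # set is full: extend the chain
                if self.gct_head == len(self.gct):
                    return False                 # no unclaimed GCT set left
                s.ptr = self.gct_head
                self.gct_head += 1
            s = self.gct[s.ptr]

    def lookup(self, line_addr: int):
        """Return (payloads for this line, number of sets accessed)."""
        s = self.sat[line_addr % len(self.sat)]
        found, accesses = [], 1
        while True:
            found += [p for a, p in s.entries if a == line_addr]
            if s.ptr is None:
                return found, accesses
            s = self.gct[s.ptr]
            accesses += 1

pool = GecPool(sat_sets=4, ways=2)
results = [pool.insert(addr, f"e{addr}") for addr in (0, 4, 8, 12, 16)]
hit = pool.lookup(8)
```

In this model a chained GCT set belongs to exactly one SAT set, so a lookup only ever walks the chain of its own set; most lookups stop at the SAT, and only overflowing sets pay for extra accesses.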

  13. Solving Other Two Problems • 2. Fine-Grained Allocation for effectively utilizing ECP entries • Each GEC entry has only ECP-1 • Each line can have multiple GEC entries • We guarantee that all entries are in the same set of (SAT/GCT) • A faulty line can get more than ECP-6 as well • 3. Local Error Correction (LEC) for low latency in the common case • Each line has a dedicated ECP-1 (handles 95% of lines) • Ensures extra accesses (GEC) are needed for only a few lines

  14. PAYG: Tying it All Together. PAYG performs on-demand allocation of error correction entries. PAYG has 3 levels: LEC is the first line of defense (lowers latency), SAT is the second, and GCT is the third (flexible).
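The on-demand policy can be illustrated with a simplified model, assuming one ECP-1 per line in the LEC and ECP-1 sized entries in the GEC pool. For brevity the SAT and GCT are collapsed into a single counter of free entries, which is a simplification and not the paper's structure. The class `Payg` and its methods are hypothetical names.

```python
class Payg:
    def __init__(self, gec_capacity: int):
        self.lec = {}                  # line -> bit position covered by its LEC
        self.gec = {}                  # line -> bit positions covered by GEC entries
        self.gec_free = gec_capacity   # free ECP-1 entries left in the GEC pool

    def on_new_fault(self, line: int, bit_pos: int) -> str:
        """Allocate correction for a newly detected hard fault."""
        if line not in self.lec:       # first fault in a line: local ECP-1
            self.lec[line] = bit_pos
            return "LEC"
        if self.gec_free > 0:          # later faults: one GEC entry each
            self.gec_free -= 1
            self.gec.setdefault(line, []).append(bit_pos)
            return "GEC"
        return "FAIL"                  # pool exhausted: fault is uncorrectable

    def needs_gec_access(self, line: int) -> bool:
        """Only lines with faults beyond the LEC pay for extra accesses."""
        return bool(self.gec.get(line))

payg = Payg(gec_capacity=2)
outcomes = [payg.on_new_fault(7, pos) for pos in (3, 9, 100, 200)]
other = payg.on_new_fault(8, 1)
```

A line is not capped at a fixed number of corrections here: it keeps receiving entries as long as the shared pool has room, which matches the point that a faulty line can get more than ECP-6.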

  15. Outline • Introduction & Motivation • PAYG Design • Results • Even More Storage Efficiency • Related Work • Summary

  16. Evaluation Settings. Assumptions: 1. Mean writes 32 million, SDEV = 20%, no correlation. 2. Perfect wear leveling: all lines get the same number of writes. 3. Each write is converted into a write followed by a read to detect faults. Configuration: PCM bank of 1GB with 64B lines, so 16 million lines per bank. Write latency of 1 microsecond. At 100% write traffic, lifetime is 18 years (if zero variance). Figure of merit: uniform ECP-6 gets 35% of ideal lifetime, so 6.5 years. We report lifetime with respect to uniform ECP-6.
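The 18-year ideal lifetime can be checked with a quick calculation, assuming writes to the bank are serialized at one write per microsecond and every line absorbs the mean number of writes. Reading "32 million" and "16 million" as powers of two is an assumption made here; with decimal millions for the write count the result is closer to 17 years.

```python
LINES_PER_BANK = 2**30 // 64          # 1GB bank with 64B lines = 2^24 lines
MEAN_WRITES = 32 * 2**20              # assumption: "32 million" as 32 * 2^20
WRITE_LATENCY_S = 1e-6                # 1 microsecond per write
SECONDS_PER_YEAR = 365 * 24 * 3600

# Total time to wear out every line when writes are issued back to back.
total_seconds = LINES_PER_BANK * MEAN_WRITES * WRITE_LATENCY_S
ideal_years = total_seconds / SECONDS_PER_YEAR

print(f"ideal lifetime at 100% write traffic: {ideal_years:.1f} years")
```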

  17. Importance of Scalable GEC Pool. [Figure: NoFGA-NoGCT vs. NoFGA-wGCT, plotted against the number of SAT sets and the number of GCT sets (with SAT sets = 128K); total sets 128K + 64K = 192K.] The proposed structure reduces the storage overhead of GEC by more than 5X.

  18. Importance of Fine-Grained Alloc. Fine-Grained Allocation improves the effectiveness of PAYG.

  19. Importance of LEC. We can get higher lifetime by increasing GEC size, but we still need LEC. Without LEC, the latency impact is significant; with LEC, not so much. For the first 5 years, PAYG incurs on average 1 extra access for < 0.4% of accesses.

  20. Storage Overhead. Total storage overhead to protect 1GB reduces from 122MB to 39MB (down 83MB). PAYG provides lifetime similar to ECP-8 at 3.1X less storage than ECP-6.
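The 122MB and 39MB totals are consistent with the per-line figures quoted elsewhere in the talk: 61 bits/line for ECP-6, and 19.5 bits/line for PAYG (the latter appears on the slide about efficient single bit correction). A sketch of the conversion, assuming 16M lines per 1GB bank and binary megabytes; the helper name is hypothetical:

```python
LINES_PER_BANK = 2**24                 # 1GB bank with 64B lines

def overhead_mib(bits_per_line: float) -> float:
    """Total error-correction storage for one bank, in MiB."""
    return bits_per_line * LINES_PER_BANK / 8 / 2**20

ecp6_mib = overhead_mib(61)            # uniform ECP-6
payg_mib = overhead_mib(19.5)          # PAYG average bits per line
ratio = ecp6_mib / payg_mib

print(f"ECP-6: {ecp6_mib:.0f} MiB, PAYG: {payg_mib:.0f} MiB, ratio {ratio:.2f}X")
```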

  21. Outline • Introduction & Motivation • PAYG Design • Results • Even More Storage Efficiency • Related Work • Summary

  22. Efficient Single Bit Correction. LEC is responsible for most of the storage overhead (13 bits out of 19.5 bits). We need efficient schemes for single-bit hard faults → Alternate Data Retry (ADR). ADR: mask a hard fault by storing data in either normal or inverted form. [Figure: data written over a stuck-at-0 (SA-0) cell, in normal form and in inverted (INV) form.] ADR needs only 1 bit to mask a single stuck-at fault (caveat: double write). Reduce the storage overhead of PAYG by using ADR instead of ECP-1 in LEC.
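A minimal sketch of the ADR idea for one stuck cell. For simplicity it assumes the stuck position and value are known up front and picks the right form directly; the "double write" caveat on the slide suggests real hardware would instead write, verify, and retry with inverted data. The function names are hypothetical.

```python
def adr_write(data, stuck_pos, stuck_val):
    """Store data in normal or inverted form so the stuck cell reads correctly.

    Returns (stored cell contents, inversion flag).
    """
    # Invert only if the stuck cell disagrees with the bit we want to store.
    inv = int(data[stuck_pos] != stuck_val)
    stored = [b ^ inv for b in data]
    stored[stuck_pos] = stuck_val      # the cell holds its stuck value regardless
    return stored, inv

def adr_read(stored, inv):
    """Undo the inversion, if any, to recover the original data."""
    return [b ^ inv for b in stored]

# Cell 0 is stuck at 0 but the data wants a 1 there: the inverted form is used.
stored_a, inv_a = adr_write([1, 0, 1, 1], stuck_pos=0, stuck_val=0)
# Here the data already matches the stuck value: the normal form is used.
stored_b, inv_b = adr_write([0, 1, 1, 0], stuck_pos=0, stuck_val=0)
```

A single flag bit per line is enough for one stuck cell, but two stuck cells can demand opposite forms, which is why this does not extend directly to multiple faults.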

  23. Comparisons. It is hard to scale ADR to multiple faults. SAFER [MICRO’10] partitions lines with multiple faults into single-bit faults. SAFER needs 55 bits/line and gets lifetime ~ECP-6. PAYG with heterogeneous error correction reduces storage by 6X.

  24. Outline • Introduction & Motivation • PAYG Design • Results • Even More Storage Efficiency • Related Work • Summary

  25. Non Uniform Error Correction • Variable Strength ECC (VS-ECC) by Alameldeen+ ISCA’11 • Proposed for cache reliability at low voltages • Each way has ECC-4 for one quarter of ways, allocated based on testing • Difference: Cache line disabling works. Only set associative structure. • Layered ECP by Schechter+ ISCA’10 • ECP-1 for each line, and some ECP entries for each page • In essence, this is a set-associative GEC with ECP-1 in LEC • Difference: Set associative GEC requires 5X more entries (inefficient) • Line Sparing with FREE-p by Hyun+ HPCA’11 • A faulty line is remapped to a spare area using embedded pointer • Sparing needs 1 good line for 1 uncorrectable fault • Difference: PAYG is much more storage efficient than sparing

  26. FREE-p: Sparing vs. Correction. For 1 extra error bit, PAYG needs a 20-bit GEC entry, while FREE-p needs 512 bits. PAYG is more effective than line sparing with FREE-p.

  27. Outline • Introduction & Motivation • PAYG Design • Results • Even More Storage Efficiency • Related Work • Summary

  28. Summary. PCM: limited endurance; variability across cells reduces lifetime. Need to correct many (six) errors per line. Uniform allocation is expensive and inefficient (only 0.3 out of 6 used). Pay-As-You-Go (PAYG): allocate error correction entries on demand. PAYG has LEC + GEC pool (Set Associative Table + Global Collision Table). Provides 1.13X lifetime compared to ECP-6 at 3.1X lower overhead. Heterogeneous scheme (ADR for LEC) reduces storage by 6X. PAYG is useful for efficient hard-error correction in other technologies too.
