Data Dependent Sparing to Manage Better-Than-Bad Blocks *

Rakan Maddah (1), Sangyeun Cho (2, 1), and Rami Melhem (1). (1) Computer Science Department, University of Pittsburgh; (2) Memory Division, Samsung Electronics Co. *Published in IEEE CAL. Manuscript accessible at: http://people.cs.pitt.edu/~rmaddah/cal.pdf

Introduction • Bad block management is a vital technique for memories subject to relatively low write endurance • NAND Flash can sustain 10^3 to 10^5 program/erase cycles • Phase Change Memory (PCM) can sustain 10^6 to 10^8 set/reset cycles • Bad block: a block with a number of defective cells that can result in more errors than the error correction code can correct • The common practice is to replace a bad block with a good spare block after the first write failure • 20% sparing is typical for server products

Motivation • Reconsider the bad block management technique • PCM as well as NAND flash exhibit a stuck-at fault model • A failed cell gets stuck permanently at either 0 or 1 • A stuck-at cell can still be read but not reprogrammed • Failures in the context of the stuck-at fault model are data dependent!

Data-Dependent Failures • A write to a storage block whose number of faults exceeds the capability of the error correction code does not necessarily fail! [Figure: physical state of a faulty block, with three write requests and the errors remaining after each write]

Data-Dependent Failures • Example: with an ECC of capability 2, only 1 write out of the 3 fails [Figure: physical state of a faulty block, with three write requests and the errors remaining after each write]
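The data dependence of stuck-at failures can be illustrated with a minimal sketch. The block contents, stuck positions, and function names below are hypothetical and are not taken from the slide's figure; the sketch only assumes that a stuck cell reads back wrong exactly when the requested bit differs from the stuck value.

```python
def errors_after_write(data, stuck_mask, stuck_values):
    """Count bits that read back wrong after writing `data` to a block.

    All three arguments are ints used as bit vectors: a 1 in stuck_mask
    marks a stuck cell, and the same bit position in stuck_values holds
    the value that cell is stuck at.
    """
    # A stuck cell is wrong only when the requested bit differs from its stuck value.
    mismatches = (data ^ stuck_values) & stuck_mask
    return bin(mismatches).count("1")


def write_succeeds(data, stuck_mask, stuck_values, ecc_capability):
    """A write succeeds when the ECC can correct every mismatching stuck cell."""
    return errors_after_write(data, stuck_mask, stuck_values) <= ecc_capability


# Hypothetical 8-bit block with three stuck cells (bits 0, 3 and 6).
STUCK_MASK = 0b01001001
STUCK_VALUES = 0b01000001  # bit 6 stuck at 1, bit 3 stuck at 0, bit 0 stuck at 1

for request in (0b01000001, 0b00001001, 0b00001000):
    n = errors_after_write(request, STUCK_MASK, STUCK_VALUES)
    ok = write_succeeds(request, STUCK_MASK, STUCK_VALUES, ecc_capability=2)
    print(f"{request:08b}: {n} error(s), success={ok}")
```

With an ECC capability of 2, the block holds three faults, yet only the request that disagrees with all three stuck cells fails.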

Block Write Failure • Block write failure probability vs. number of faults within a 4KB storage block, when an error correction mechanism covers up to 20 errors [Figure: block write failure probability plotted against the number of faults]
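One way to reason about such a curve is a simple binomial model. This is a sketch under an assumption not stated on the slide: written data is uniformly random, so each stuck cell disagrees with its requested bit independently with probability 1/2. The function name is hypothetical.

```python
from math import comb


def write_failure_probability(faults, ecc_capability):
    """Probability that more than `ecc_capability` stuck cells mismatch the data.

    Assumes each of the `faults` stuck cells mismatches independently with
    probability 1/2 (uniformly random data), which is a modeling assumption.
    """
    if faults <= ecc_capability:
        return 0.0  # the ECC covers every fault, whatever the data is
    failing_patterns = sum(
        comb(faults, k) for k in range(ecc_capability + 1, faults + 1)
    )
    return failing_patterns / 2 ** faults


for faults in (20, 21, 30, 41, 60):
    p = write_failure_probability(faults, ecc_capability=20)
    print(f"{faults} faults: P(write failure) = {p:.3g}")
```

Under this model a block with 21 faults fails only when all 21 stuck cells mismatch at once, which is why the first write failure can be a poor signal that the block is worn out.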

Block Classification • Classify storage blocks into three categories: • Good: a block with no write failures • Better-Than-Bad: a block with rare write failures • Bad: a block with frequent write failures • More lifetime can still be squeezed from better-than-bad blocks! • Observation: retiring a block after the first write failure is overly conservative

Data Dependent Sparing • Delay block retirement • Temporarily borrow a spare block after a write failure • Attempt a later write request on the original (faulty) block and reclaim the spare block if the write succeeds • Retire a block when frequent write failures start to occur, i.e., when a better-than-bad block becomes bad [Figure: write requests directed to primary storage blocks and spare blocks, with later writes returning to the primary storage blocks]

Execution Flow • Write a block, then perform read verification • Write successful? • Yes: reclaim the assigned spare, if any • No: keep track of “goodness” and check whether failure frequency > threshold • If yes: retire the block and replace it with a spare • If no: obtain a spare to write to; do not retire the block
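The flow above can be sketched as a small write handler. The names `Block` and `handle_write` are hypothetical, and the sketch assumes that failure frequency is measured as failed writes divided by total writes over the block's history; the slides do not define how the frequency is computed.

```python
class Block:
    """Bookkeeping for one primary storage block (hypothetical structure)."""

    def __init__(self):
        self.writes = 0      # total write attempts seen so far
        self.failures = 0    # attempts that failed read verification
        self.spare = None    # spare currently standing in for this block
        self.retired = False


def handle_write(block, write_ok, spare_pool, threshold=0.10):
    """Apply the execution flow to one write whose verification result is write_ok."""
    block.writes += 1
    if write_ok:
        # Success on the original block: give back a borrowed spare, if any.
        if block.spare is not None:
            spare_pool.append(block.spare)
            block.spare = None
        return "written"

    # Failure: the data must go to a spare either way.
    block.failures += 1
    if block.spare is None:
        if not spare_pool:
            raise RuntimeError("no spare blocks left")
        block.spare = spare_pool.pop()

    # Assumed frequency definition: failures over all writes to this block.
    if block.failures / block.writes > threshold:
        block.retired = True  # the spare now replaces the block permanently
        return "retired"
    return "borrowed"  # keep the block in service and retry it on later writes
```

A caller would pass the outcome of read verification as `write_ok`; a real controller might measure frequency over a recent window rather than the whole history.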

Alternative Design Strategies • Spare Allocation Strategies • Temp-Sparing: a healthy spare block temporarily substitutes for a better-than-bad block • Role-Exchange: a spare block permanently replaces the failing block, which is added to the pool of spare blocks • Block Mapping Strategies • If temp-sparing is adopted, then keep a table that stores pointers to spare blocks • If role-exchange is adopted, then update the address remapping table in the SSD • Determining Block “Goodness” Strategies • A counter per better-than-bad block • A global data structure that approximates individual counters, e.g., a counting Bloom filter [Figure: primary storage and spare storage]
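As an illustration of the last option, a counting Bloom filter can stand in for per-block failure counters. This is a generic sketch, not the authors' design: the class name, table size, hash count, and the use of SHA-256 are all assumptions chosen for a self-contained example.

```python
import hashlib


class CountingBloomFilter:
    """Shared counter array that approximates per-block failure counts."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, block_id):
        # Derive num_hashes counter positions from the block identifier.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{block_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def record_failure(self, block_id):
        for idx in self._indexes(block_id):
            self.counters[idx] += 1

    def estimated_failures(self, block_id):
        # Collisions can only inflate counters, so the minimum is an upper
        # bound on the true count that is never below it.
        return min(self.counters[idx] for idx in self._indexes(block_id))


cbf = CountingBloomFilter()
for _ in range(3):
    cbf.record_failure(block_id=42)
print(cbf.estimated_failures(42))
```

The estimate never undercounts, so in this sketch a block could be retired slightly early because of collisions but never late.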

Evaluation • Monte Carlo Simulation • Simulation of 2000 storage blocks of size 4KB each • Assign a lifetime to each storage cell drawn from a Gaussian distribution • PCM: mean 10^8 and stdev 25×10^6 • NAND Flash: mean 8.27×10^5 and stdev 2.48×10^5 • Assume perfect wear leveling • Protect each storage block with a BCH code of capability n
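A minimal sketch of the sampling step for a single PCM block follows. It is not the authors' simulator: the function names are hypothetical, negative samples are clamped to zero as an assumption, and every write is assumed to wear every cell of the block equally.

```python
import random

BLOCK_BITS = 4 * 1024 * 8  # one cell per bit of a 4KB block (assumed)


def sample_cell_lifetimes(rng, num_cells, mean, stdev):
    """Draw one Gaussian lifetime (in writes) per cell, clamped at zero."""
    return [max(0.0, rng.gauss(mean, stdev)) for _ in range(num_cells)]


def writes_until_faults_exceed(lifetimes, ecc_capability):
    """Write count at which the (ecc_capability + 1)-th cell wears out.

    Before this point the block cannot fail a write, because the ECC
    covers every fault regardless of the data being written.
    """
    return sorted(lifetimes)[ecc_capability]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pcm_block = sample_cell_lifetimes(rng, BLOCK_BITS, mean=1e8, stdev=25e6)
first_risky_write = writes_until_faults_exceed(pcm_block, ecc_capability=20)
print(f"first write that can fail: about {first_risky_write:.3g} writes")
```

Past this point, whether a given write fails depends on the data, which is the regime data dependent sparing tries to exploit.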

Lifetime Improvement • Lifetime of PCM blocks with BCH-20 and a 10% failure frequency threshold. “DD” denotes data dependent sparing and “SS” static sparing. [Figure: lifetime of DD(PCM) vs. SS(PCM); annotated values: 18.1% and 78%]

Sensitivity to Over-Provisioning • Lifetime increase achieved by data dependent sparing at various levels of over-provisioning compared with static sparing with 20% over-provisioning

Sensitivity of BCH Capability • Lifetime increase achieved by data dependent sparing relative to static sparing for various BCH code capabilities

Sparing Overhead Reduction • Required over-provisioning for data dependent sparing to match static sparing lifetime

Conclusion • Data Dependent Sparing is a new bad block management technique • Introduces the concept of better-than-bad blocks • Delays the retirement of blocks by keeping better-than-bad blocks engaged in write operations • Data Dependent Sparing can be used either to extend the lifetime of storage devices or to achieve a target lifetime with fewer spares

Thank You!
