Understanding the Robustness of SSDs under Power Fault

Understanding the Robustness of SSDs under Power Fault 서동화 dhdh0113@gmail.com

Contents • Introduction • Background • Testing Framework • Experimental result

Introduction • Flash-based solid-state disks (SSDs) • A “truly revolutionary and disruptive” technology. • Greater performance. • Lower power draw. • The behavior of flash memory in adverse conditions has only been studied at the component level. • Given the opaque and confidential nature of FTLs, the behavior of full devices in unusual conditions is still a mystery to the public. • This paper considers the behavior of SSDs under power fault. • Although loss of power seems like an easy fault to prevent, recent experience shows that a simple loss of power is still a distressingly frequent occurrence.

Introduction • Power fault cases • HOSTING • Jul. 2012 “… human error was responsible for a data center POWER OUTAGE…” • Amazon • Jun. 2012 “Amazon Data Center LOSES POWER During Storm..” • Amazon • May 2010 “Car Crash Triggers Amazon POWER OUTAGE…” • iWeb • 2010 “About 3,000 servers at Montreal web host iWeb experienced an OUTAGE …” • And so on…

Background • NAND Flash Low-Level Details • The floating gate inside a NAND flash cell is susceptible to a variety of faults that may cause data corruption. • Write endurance • Program disturb • Read disturb • Aging

Background • NAND Flash Low-Level Details • [Figures: erase, write, and read operations]

Reference • Write disturb • Program disturb <Characterizing Flash Memory: Anomalies, Observations, and Applications>

Reference • Read disturb <Characterizing Flash Memory: Anomalies, Observations, and Applications>

Background • SSD High-Level Concerns • SSDs use firmware called the “FTL” (flash translation layer) to make the device appear as if it can do update-in-place. • The primary responsibility of an FTL is to maintain a mapping between logical and physical addresses. • Remapping tables are typically stored in a volatile write-back cache. • Due to cost considerations, manufacturers typically attempt to minimize the size of the write-back cache as well as the capacitor backing it. • Loss of power during program operations can make the flash cells more susceptible to other faults. • Erase operations are also susceptible to power loss, since they take much longer to complete than program operations.
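The role of the remapping table can be illustrated with a toy model. This is a minimal sketch, not how any real FTL is implemented: the class `ToyFTL` and its methods are hypothetical, flash is modeled as an append-only list, and the capacitor-backed cache is assumed to lose everything at power loss.

```python
class ToyFTL:
    """Toy flash translation layer: out-of-place writes plus a volatile mapping cache."""

    def __init__(self):
        self.flash = []          # append-only physical pages (persistent)
        self.persisted_map = {}  # mapping table as last saved to flash
        self.cached_map = {}     # mapping table held in the volatile write-back cache

    def write(self, lba, data):
        # Flash pages cannot be overwritten in place, so program a fresh page
        # and redirect the logical address to it.
        self.flash.append(data)
        self.cached_map[lba] = len(self.flash) - 1

    def read(self, lba):
        return self.flash[self.cached_map[lba]]

    def flush_map(self):
        # Persist the mapping table (assumed here to happen only on request).
        self.persisted_map = dict(self.cached_map)

    def power_fault(self):
        # The volatile cache is lost; only the persisted table survives.
        self.cached_map = dict(self.persisted_map)


ftl = ToyFTL()
ftl.write(7, b"old")
ftl.flush_map()
ftl.write(7, b"new")   # looks like an in-place update to the host
before = ftl.read(7)
ftl.power_fault()      # the mapping update for b"new" was never persisted
after = ftl.read(7)
print(before, after)
```

In this model the host reads `b"new"` before the fault and `b"old"` after it, even though the new data was physically programmed: losing a mapping update is enough to make a completed write disappear.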

Testing Framework • Types of failures • Bit Corruption • Metadata Corruption • Dead Device • [Figure: device contents before and after power fault]

Testing Framework • Types of failures • Shorn Writes • Flying Writes • [Figure: device contents before and after power fault]

Testing Framework • Types of failures • Bit corruption • Half-programmed flash cells are susceptible to bit errors. • Flying writes • Due to corruption and missing updates in the FTL’s remapping tables. • Shorn writes • Because single operations may be internally remapped to multiple flash chips to improve throughput. • Metadata corruption • Because an FTL is a complex piece of software and corruption of its internal state could be problematic. • Unserializable writes • Due to the high degree of parallelism inside an SSD.
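To make the difference between a shorn write and bit corruption concrete, the sketch below compares a block read back after the fault with its contents before and after the interrupted write. It is only an illustration: the function `classify_block`, the 512-byte sector size, and the four labels are assumptions, and all three buffers are assumed to have the same length.

```python
SECTOR = 512  # assumed sector size in bytes


def classify_block(old: bytes, new: bytes, observed: bytes) -> str:
    """Classify a block read after the fault against its old and new contents."""
    if observed == new:
        return "intact"
    if observed == old:
        return "write missing"
    offsets = range(0, len(new), SECTOR)
    # A shorn write leaves every sector either fully old or fully new.
    if all(observed[i:i + SECTOR] in (old[i:i + SECTOR], new[i:i + SECTOR])
           for i in offsets):
        return "shorn write"
    return "bit corruption"


old = b"A" * 4096
new = b"B" * 4096
shorn = new[:1536] + old[1536:]              # only the first 3 of 8 sectors updated
flipped = bytes([new[0] ^ 0x01]) + new[1:]   # one flipped bit in the first byte

results = [classify_block(old, new, x) for x in (new, old, shorn, flipped)]
print(results)
```

A flying write would not show up in this comparison, because the data is well formed but sits at the wrong address; detecting it requires the record itself to say where it belongs.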

Testing Framework • Types of failures • Local consistency • Most of the faults can be detected using local-only data. • Either a record is correct or it is not. • Global consistency • Unserializability is a more complex property. • Whether the result of a workload is serializable depends not only on individual records, but on how they can fit into a total order of all the operations.

Testing Framework • Detecting local failures • In order to detect local failures, we need to write records that can be checked for consistency.
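A self-checking record could look like the sketch below. The layout (CRC32 checksum, block number, creation timestamp, payload) and the names `make_record` and `check_record` are assumptions for illustration, not the exact format used in the paper; zero padding is used only to keep the example short.

```python
import struct
import zlib

HEADER = struct.Struct("<IQQ")  # checksum, block number, timestamp (assumed layout)
BLOCK_SIZE = 4096


def make_record(block_no: int, timestamp: int, payload: bytes) -> bytes:
    """Build one block-sized record whose checksum covers everything after it."""
    body = struct.pack("<QQ", block_no, timestamp) + payload
    body = body.ljust(BLOCK_SIZE - 4, b"\0")   # zero padding, for brevity only
    return struct.pack("<I", zlib.crc32(body)) + body


def check_record(raw: bytes, read_from_block: int) -> str:
    """Check a record using only the data in the record and its read address."""
    checksum, block_no, _timestamp = HEADER.unpack_from(raw)
    if zlib.crc32(raw[4:]) != checksum:
        return "corrupt"      # bit corruption or a shorn write
    if block_no != read_from_block:
        return "misplaced"    # valid record at the wrong address (flying write)
    return "ok"


rec = make_record(42, 1000, b"hello")
damaged = rec[:100] + bytes([rec[100] ^ 0xFF]) + rec[101:]
print(check_record(rec, 42), check_record(rec, 43), check_record(damaged, 42))
```

Because the block number is stored inside the checksummed record, a record that passes the checksum but is found at a different address can be told apart from one that was damaged.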

Testing Framework • Dealing with complex FTLs • Naive padding • Random number padding • Pad with copies of the header • Advanced FTL’s compression • In order to avoid such compression, we further perform randomization on the regular record format
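Why naive padding is a problem can be seen by compressing the two kinds of record. The sketch below uses `zlib` as a stand-in for whatever compression an FTL might apply, and the helper `padded` with its seeded generator is a hypothetical simplification of the randomization step; storing the seed would let a checker regenerate the same bytes.

```python
import random
import zlib


def padded(header: bytes, size: int, seed=None) -> bytes:
    """Pad a header to `size` bytes with zeros, or with seeded pseudo-random bytes."""
    fill = size - len(header)
    if seed is None:
        return header + b"\0" * fill                  # naive padding
    rng = random.Random(seed)
    return header + bytes(rng.getrandbits(8) for _ in range(fill))


naive = padded(b"hdr", 4096)
randomized = padded(b"hdr", 4096, seed=1234)

# Zero padding shrinks to a few dozen bytes; random padding does not shrink.
print(len(zlib.compress(naive)), len(zlib.compress(randomized)))
```

If a device compressed the naive record, a 4 KB write would touch far less flash than intended, so the test would exercise the drive much less than the workload suggests.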

Testing Framework • Detecting global failures • Unserializability is not a property of a single record and thus cannot be tested with fairly local information. • During a power fault, we expect that some FTLs may fail to persist outstanding writes to the flash, or may lose mapping table updates. • We call such misordered or missing operations unserializable writes.

Testing Framework • Detecting global failures • To detect unserializability, we need information about the completion time of each write. • We make use of the time when the records were created.
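One way to use creation timestamps is sketched below, under stated assumptions: the host keeps a log of writes that were acknowledged before the fault, every record carries its creation timestamp, and the function name `find_unserializable` and the data shapes are hypothetical. A block is flagged when the record found after the fault is older than the latest acknowledged write to that block.

```python
def find_unserializable(acked_writes, on_device):
    """Return (block, expected_ts, found_ts) for blocks holding a stale record.

    acked_writes: iterable of (block_no, timestamp) acknowledged before the fault.
    on_device:    dict mapping block_no to the timestamp found after the fault.
    """
    latest_acked = {}
    for block_no, ts in acked_writes:
        latest_acked[block_no] = max(ts, latest_acked.get(block_no, ts))

    stale = []
    for block_no, expected_ts in sorted(latest_acked.items()):
        found_ts = on_device.get(block_no)
        if found_ts is None:
            continue  # no valid record; left to the local consistency checks
        if found_ts < expected_ts:
            stale.append((block_no, expected_ts, found_ts))
    return stale


acked = [(1, 100), (2, 110), (1, 120), (3, 130)]
found = {1: 100, 2: 110, 3: 130}   # block 1 still holds the record from time 100
stale = find_unserializable(acked, found)
print(stale)
```

A record newer than the latest acknowledged write is not flagged, since a write that was still in flight at the moment of the fault may legitimately have reached the flash.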

Testing Framework • Applying workloads • Random writes • Concurrent sequential writes • Single-threaded sequential writes • Power fault injection • Putting it together

Experimental result • Experimental Environment • We selected fifteen representative SSDs from five different vendors. • For comparison purposes, we also evaluated two traditional hard drives. • The SSDs and the hard drives are used as raw devices. • No file system is created on the devices. • We use synchronized I/O, which means each write operation does not return until its data is flushed to the device. • The buffer cache is bypassed. • Scenarios • Power fault during concurrent random writes. • Power fault during concurrent sequential writes. • Power fault during single-threaded sequential writes.
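On a POSIX system, synchronized writes can be requested with the `O_SYNC` open flag. The sketch below is an assumption about how such a write could be issued, not the authors' test harness: the helper `sync_write` is hypothetical, it runs against a temporary file rather than a raw device, and it omits `O_DIRECT`, which would additionally require aligned buffers.

```python
import os
import tempfile


def sync_write(path: str, offset: int, data: bytes) -> int:
    """Write `data` at `offset`; with O_SYNC the call returns after the data is flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_SYNC)
    try:
        return os.pwrite(fd, data, offset)
    finally:
        os.close(fd)


# Demonstration on a temporary file standing in for the device under test.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

written = sync_write(path, 0, b"x" * 4096)
with open(path, "rb") as f:
    readback = f.read()
os.unlink(path)
print(written, len(readback))
```

Only writes that have returned under these semantics can be counted as completed when the device contents are checked after a power fault.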

Experimental result • Overall Results • We found that 13 out of 15 devices exhibit failure. • In SSD#3, about one third of data was lost due to one third of the device becoming inaccessible. • In SSD#1, all of its data was lost. What the hell …

Experimental result • Bit corruption • One common way to deal with bit errors is using ECC. • The number of chip-level bit errors under power failure could exceed the correction capability of ECC. • Shorn writes • This shows that shorn writes are not a rare failure mode under power fault. • Subpage programming

Experimental result • Unserializable writes • No relationship between the number of serialization errors and an SSD’s unit price stands out except for the fact that the most expensive SLC … • Scenario • 1) uncompleted program 2) FTL 3) old record

Experimental result • Metadata corruption • After 8 injected power faults, only 69.5% of all the records can be retrieved from SSD#3. • This corruption makes 30.5% of the flash memory space unavailable. • We assume corruption of metadata. • Dead device • After 136 injected power faults, SSD#1 became completely useless. • All of the data stored on it was lost. • Loss of metadata • Power spike during power loss
