Storage Failures
Dvir Olansky
Advanced Topics in Storage Systems, Spring 2013

Papers covered (both from FAST '08):
• L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau: An Analysis of Data Corruption in the Storage Stack.
• W. Jiang, C. Hu, A. Kanevsky and Y. Zhou: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics.
Outline
• Problem Addressed
• Main Findings
• Storage System Architecture
• Results
• Conclusions and Implications
Problem Addressed
Both papers study storage failures from a whole-system perspective:
• Silent data corruptions.
• Failures of storage subsystem components other than the disks themselves.
• Statistical properties of storage system failures.
Main Findings
• Disk failures account for only 20-55% of storage subsystem failures.
• Storage failures are not independent of one another.
• Storage failures show strong spatial and temporal locality.
Nearline vs. Enterprise Disks
• Enterprise (ES) disks: Fibre Channel interface; used in low-end, mid-range and high-end systems.
• Nearline (NL) disks: ATA interface (mostly SATA).
Corruption Detection Mechanisms
• Design goal: the storage system must never knowingly propagate corrupt data to the user.
• A data integrity segment is stored alongside each file system block; on every read it is used to detect corruption before data reaches the user.
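To make this concrete, here is a minimal sketch of per-block integrity checking in Python. The CRC32 checksum, the 4 KB block size, and the segment layout are illustrative assumptions, not NetApp's actual on-disk format, which records richer identity information.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical file system block size


def make_integrity_segment(block: bytes, block_num: int) -> dict:
    """Build a data integrity segment at write time: a checksum of the
    block's contents plus the block's identity."""
    return {"checksum": zlib.crc32(block), "block_num": block_num}


def verify_block(block: bytes, block_num: int, seg: dict) -> bool:
    """Verify a block at read time. Both checks must pass before the
    block may be returned to the user."""
    if zlib.crc32(block) != seg["checksum"]:
        return False  # bit-level corruption: checksum mismatch
    if block_num != seg["block_num"]:
        return False  # lost or misdirected write: identity mismatch
    return True
```

When verification fails, the system can reconstruct the block from RAID redundancy instead of returning it, which is how the never-propagate-corrupt-data guarantee is met.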
The NetApp AutoSupport Database
• A built-in, low-overhead mechanism that logs important system events to a central repository.
• Covers over 1.5 million disks in about 39,000 storage systems over a period of more than 40 months: an unprecedented sample size.
• Both papers draw their data from this database.
Results
[Figure: corruption results, shown separately for enterprise and nearline disks.]
Results
• Corrupt ES disks develop many more checksum mismatches than corrupt NL disks, even though ES disks are less likely to become corrupt in the first place.
Results
• Checksum mismatches within the same disk are not independent: a disk that has developed one mismatch is far more likely to develop further mismatches.
Results
• Much of the observed spatial locality is due to consecutive disk blocks developing corruption.
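One way to quantify this, sketched below: for each disk's list of mismatched block numbers, measure what fraction of mismatches have another mismatch within a small radius. The function name, the radius parameter, and the example data are all hypothetical.

```python
def fraction_with_close_neighbor(mismatch_blocks, radius=1):
    """Fraction of checksum-mismatch blocks on one disk that have
    another mismatch within `radius` blocks; radius=1 isolates the
    consecutive-block case highlighted above."""
    blocks = sorted(set(mismatch_blocks))
    if len(blocks) < 2:
        return 0.0
    close = sum(
        1 for i, b in enumerate(blocks)
        if (i > 0 and b - blocks[i - 1] <= radius)
        or (i + 1 < len(blocks) and blocks[i + 1] - b <= radius)
    )
    return close / len(blocks)


# Three consecutive corrupt blocks plus one isolated one:
print(fraction_with_close_neighbor([100, 101, 102, 9000]))  # 0.75
```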
Results
[Figure: observed results compared against a theoretical model.]
Conclusions and Implications
• Employ redundancy mechanisms that tolerate failures of storage system components other than disks.
• E.g., physical interconnect multipathing reduces the annual failure rate (AFR) by 30-40%.
Conclusions and Implications
• Because corruption is spatially local, redundant data structures should be stored physically distant from each other on disk.
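As an illustrative placement rule only (the papers state the principle, not this mechanism), a redundant copy could be placed half a disk away from its primary:

```python
def replica_block(primary_block: int, disk_blocks: int) -> int:
    """Place the redundant copy half the disk away from the primary,
    so a spatially local burst of corruption is unlikely to hit both."""
    return (primary_block + disk_blocks // 2) % disk_blocks
```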
Conclusions and Implications
• Temporal and spatial locality can be leveraged for smarter scrubbing (a sketch follows below):
• Trigger a scrub before its next scheduled time, when the probability of corruption is high.
• Selectively scrub the area of the disk that is likely to be affected.
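A minimal sketch of such a policy, assuming the function name and the 10,000-block window are free design choices:

```python
def plan_selective_scrub(mismatch_block: int, disk_blocks: int,
                         window: int = 10_000) -> range:
    """On a detected checksum mismatch, return the neighborhood of
    blocks to scrub immediately. Spatial locality says nearby blocks
    are the most likely to also be corrupt; temporal locality says
    now is the time to look, not at the next scheduled pass."""
    start = max(0, mismatch_block - window)
    end = min(disk_blocks, mismatch_block + window + 1)
    return range(start, end)
```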
Conclusions and Implications
• Replacing an ES disk on the first detection of corruption makes sense, since corrupt ES disks tend to develop many more mismatches.
• Replacement cost is unlikely to be a major factor, since the probability of a first corruption is low.