Storage Failures
Dvir Olansky
Advanced Topics in Storage Systems, Spring 2013

Papers covered (both from FAST '08):
• L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau: An Analysis of Data Corruption in the Storage Stack.
• W. Jiang, C. Hu, A. Kanevsky and Y. Zhou: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics.
Outline
• Problem Addressed
• Main Findings
• Storage System Architecture
• Results
• Conclusions and Implications
Problem Addressed
Both papers study storage failures from a whole-system perspective:
• Silent data corruptions.
• Failures of storage subsystem components other than the disks themselves.
• Statistical properties of storage system failures.
Main Findings
• Disk failures account for only 20-55% of storage subsystem failures.
• Storage failures are not independent of one another.
• Storage failures show strong spatial and temporal locality.
Nearline vs. Enterprise Disks
• Enterprise (ES) disks: Fibre Channel interface; used in low-end, mid-range and high-end systems.
• Nearline (NL) disks: ATA interface (mostly SATA).
Corruption Detection Mechanisms
• Design goal: the storage system must never knowingly propagate corrupt data to the user.
• A data integrity segment is stored alongside each file system block; on every read it is used to detect corruption before data reaches the user.
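To make this concrete, here is a minimal sketch of per-block integrity checking in Python. The CRC32 checksum, the 4 KB block size, and the segment layout are illustrative assumptions, not NetApp's actual on-disk format, which records richer identity information.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical file system block size


def make_integrity_segment(block: bytes, block_num: int) -> dict:
    """Build a data integrity segment at write time: a checksum of the
    block's contents plus the block's identity."""
    return {"checksum": zlib.crc32(block), "block_num": block_num}


def verify_block(block: bytes, block_num: int, seg: dict) -> bool:
    """Verify a block at read time. Both checks must pass before the
    block may be returned to the user."""
    if zlib.crc32(block) != seg["checksum"]:
        return False  # bit-level corruption: checksum mismatch
    if block_num != seg["block_num"]:
        return False  # lost or misdirected write: identity mismatch
    return True
```

When verification fails, the system can reconstruct the block from RAID redundancy instead of returning it, which is how the never-propagate-corrupt-data guarantee is met.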
The NetApp AutoSupport Database
• A built-in, low-overhead mechanism that logs important system events to a central repository.
• Covers over 1.5 million disks in about 39,000 storage systems over a period of more than 40 months: an unprecedented sample size.
• Both papers draw their data from this database.
Results
[Figure: corruption results, shown separately for enterprise and nearline disks.]
Results
• Corrupt ES disks develop many more checksum mismatches than corrupt NL disks, even though ES disks are less likely to become corrupt in the first place.
Results
• Checksum mismatches within the same disk are not independent: a disk that has developed one mismatch is far more likely to develop further mismatches.
Results
• Much of the observed spatial locality is due to consecutive disk blocks developing corruption.
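One way to quantify this, sketched below: for each disk's list of mismatched block numbers, measure what fraction of mismatches have another mismatch within a small radius. The function name, the radius parameter, and the example data are all hypothetical.

```python
def fraction_with_close_neighbor(mismatch_blocks, radius=1):
    """Fraction of checksum-mismatch blocks on one disk that have
    another mismatch within `radius` blocks; radius=1 isolates the
    consecutive-block case highlighted above."""
    blocks = sorted(set(mismatch_blocks))
    if len(blocks) < 2:
        return 0.0
    close = sum(
        1 for i, b in enumerate(blocks)
        if (i > 0 and b - blocks[i - 1] <= radius)
        or (i + 1 < len(blocks) and blocks[i + 1] - b <= radius)
    )
    return close / len(blocks)


# Three consecutive corrupt blocks plus one isolated one:
print(fraction_with_close_neighbor([100, 101, 102, 9000]))  # 0.75
```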
Results
[Figure: observed results compared against a theoretical model.]
Conclusions and Implications
• Employ redundancy mechanisms that tolerate failures of storage system components other than disks.
• E.g., physical interconnect multipathing reduces the annual failure rate (AFR) by 30-40%.
Conclusions and Implications
• Because corruption is spatially local, redundant data structures should be stored physically distant from each other on disk.
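As an illustrative placement rule only (the papers state the principle, not this mechanism), a redundant copy could be placed half a disk away from its primary:

```python
def replica_block(primary_block: int, disk_blocks: int) -> int:
    """Place the redundant copy half the disk away from the primary,
    so a spatially local burst of corruption is unlikely to hit both."""
    return (primary_block + disk_blocks // 2) % disk_blocks
```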
Conclusions and Implications
• Temporal and spatial locality can be leveraged for smarter scrubbing (a sketch follows below):
• Trigger a scrub before its next scheduled time, when the probability of corruption is high.
• Selectively scrub the area of the disk that is likely to be affected.
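A minimal sketch of such a policy, assuming the function name and the 10,000-block window are free design choices:

```python
def plan_selective_scrub(mismatch_block: int, disk_blocks: int,
                         window: int = 10_000) -> range:
    """On a detected checksum mismatch, return the neighborhood of
    blocks to scrub immediately. Spatial locality says nearby blocks
    are the most likely to also be corrupt; temporal locality says
    now is the time to look, not at the next scheduled pass."""
    start = max(0, mismatch_block - window)
    end = min(disk_blocks, mismatch_block + window + 1)
    return range(start, end)
```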
Conclusions and Implications
• Replacing an ES disk on the first detection of corruption makes sense, since corrupt ES disks tend to develop many more mismatches.
• Replacement cost is unlikely to be a major factor, since the probability of a first corruption is low.