
FAST ‘08:

Storage Failures. Dvir Olansky. FAST ‘08: L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau: An Analysis of Data Corruption in the Storage Stack.


Presentation Transcript


  1. Storage Failures. Dvir Olansky. FAST ‘08: L. N. Bairavasundaram, G. R. Goodson, B. Schroeder, A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau: An Analysis of Data Corruption in the Storage Stack. W. Jiang, C. Hu, A. Kanevsky, and Y. Zhou: Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics. Advanced Topics in Storage Systems, Spring 2013

  2. Outline • Problem Addressed • Main Findings • Storage System Architecture • Results • Conclusions and Implications

  3. Problem Addressed Storage failures from a system perspective: • Silent Data Corruptions. • Failures of storage system components besides disks. • Statistical properties of storage system failures.

  4. Main Findings • Disk failures contribute to only 20-55% of storage system failures. • Storage failures are not independent. • Storage failures show strong spatial and temporal locality.

  5. Storage System Architecture

  6. Storage System Architecture

  7. Nearline vs. Enterprise Disks • Enterprise Disks – Fibre Channel interface disks. • Low-end, Mid-range, High-end. • Nearline Disks – ATA interface (mostly SATA).

  8. Corruption Detection Mechanisms • Storage system does not knowingly propagate corrupt data to the user under any circumstance. • Data Integrity Segments in each File System block.
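The per-block integrity check described on this slide can be sketched as follows. This is only an illustration, not the storage system's actual implementation: the 4096-byte block size, the dictionary layout of the integrity segment, the use of CRC-32 as the checksum, the stored block number, and the function names are all assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed file system block size in bytes


def make_integrity_segment(data: bytes, block_number: int) -> dict:
    """Build a hypothetical integrity segment for one file system block."""
    # CRC-32 is a stand-in; the real checksum algorithm is not given on the slide.
    return {"checksum": zlib.crc32(data), "block_number": block_number}


def verify_block(data: bytes, segment: dict, expected_block_number: int) -> str:
    """Classify a read as 'ok', 'checksum_mismatch', or 'identity_discrepancy'."""
    if zlib.crc32(data) != segment["checksum"]:
        # The data does not match the checksum stored when it was written.
        return "checksum_mismatch"
    if segment["block_number"] != expected_block_number:
        # The data is internally consistent but belongs to a different block.
        return "identity_discrepancy"
    return "ok"


if __name__ == "__main__":
    block = bytes(BLOCK_SIZE)
    segment = make_integrity_segment(block, block_number=7)
    print(verify_block(block, segment, expected_block_number=7))
    flipped = b"\x01" + block[1:]
    print(verify_block(flipped, segment, expected_block_number=7))
```

A read that fails either check would be repaired from redundancy rather than returned to the user, which is how the first bullet's guarantee can be upheld.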

  9. NetApp AutoSupport Database • Built-in, low-overhead mechanism to log important system events to a central repository. • Over 1.5M disks included in about 39,000 storage systems for a period of over 40 months. • Unprecedented sample size. • Both papers rely on this database.

  10. Results [figure with panels: Enterprise, Nearline]

  11. Results

  12. Results • Corrupt enterprise (ES) disks develop many more checksum mismatches than corrupt nearline (NL) disks.

  13. Results • Checksum mismatches within the same disk are not independent:

  14. Results

  15. Results

  16. Results • Much of the observed spatial locality is due to consecutive disk blocks developing corruption.
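One way to quantify this kind of spatial locality is sketched below, assuming corruption is recorded as a list of block numbers per disk. The function names, the radius parameter, and the sample data are hypothetical and are not taken from the papers.

```python
from bisect import bisect_left, bisect_right


def fraction_with_neighbor(corrupt_blocks, radius):
    """Fraction of corrupt blocks that have another corrupt block within `radius` blocks."""
    blocks = sorted(set(corrupt_blocks))
    if not blocks:
        return 0.0
    hits = 0
    for b in blocks:
        lo = bisect_left(blocks, b - radius)
        hi = bisect_right(blocks, b + radius)
        if hi - lo > 1:  # the window holds more than block b itself
            hits += 1
    return hits / len(blocks)


def consecutive_runs(corrupt_blocks):
    """Group corrupt block numbers into (first, last) runs of consecutive blocks."""
    runs = []
    for b in sorted(set(corrupt_blocks)):
        if runs and b == runs[-1][1] + 1:
            runs[-1][1] = b  # extend the current run
        else:
            runs.append([b, b])  # start a new run
    return [tuple(r) for r in runs]


if __name__ == "__main__":
    sample = [100, 101, 102, 5000]  # made-up block numbers for illustration
    print(fraction_with_neighbor(sample, radius=1))
    print(consecutive_runs(sample))
```

Comparing the neighbor fraction before and after collapsing each run into a single event shows how much of the locality comes from consecutive blocks.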

  17. Results [figure; surviving label: Theoretical]

  18. Conclusions and Implications • Employ redundancy mechanisms to tolerate failures of all storage system components, not only disks. • For example, physical interconnect multipathing reduces the annualized failure rate (AFR) by 30-40%.
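As a reminder of how an AFR comparison like this is computed, here is a minimal sketch. The failure counts and exposure times are invented purely to illustrate the arithmetic and are not data from the papers.

```python
def annualized_failure_rate(failures: int, device_years: float) -> float:
    """Failures per device-year of exposure."""
    if device_years <= 0:
        raise ValueError("device_years must be positive")
    return failures / device_years


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from `baseline` to `improved`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical numbers: 1000 system-years observed in each configuration.
    single_path = annualized_failure_rate(failures=40, device_years=1000.0)
    dual_path = annualized_failure_rate(failures=26, device_years=1000.0)
    print(f"single-path AFR: {single_path:.3f}")
    print(f"dual-path AFR:   {dual_path:.3f}")
    print(f"reduction:       {relative_reduction(single_path, dual_path):.0%}")
```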

  19. Conclusions and Implications • Redundant data structures should be stored distant from each other.

  20. Conclusions and Implications • Temporal and spatial locality can be leveraged for smarter scrubbing. • Trigger a scrub before its next scheduled time when the probability of corruption is high. • Selectively scrub the area of the disk that is likely to be affected.
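The two scrubbing ideas above could be sketched as follows. This is a hypothetical policy, not one specified in the papers: the function names, the neighborhood radius, and the early-scrub delay are all assumptions that a real system would tune from its own locality measurements.

```python
def blocks_to_scrub(corrupt_block: int, radius: int, disk_blocks: int) -> range:
    """Selective scrub: the neighborhood around a newly detected corrupt block."""
    start = max(0, corrupt_block - radius)
    end = min(disk_blocks - 1, corrupt_block + radius)
    return range(start, end + 1)


def next_scrub_time(now: float, scheduled: float,
                    recent_corruption: bool, early_delay: float) -> float:
    """Temporal locality: pull the next scrub forward after a recent corruption."""
    if recent_corruption:
        # Never postpone a scrub that is already due sooner.
        return min(scheduled, now + early_delay)
    return scheduled


if __name__ == "__main__":
    region = blocks_to_scrub(corrupt_block=5, radius=10, disk_blocks=1000)
    print(region.start, region.stop - 1)
    print(next_scrub_time(now=100.0, scheduled=500.0,
                          recent_corruption=True, early_delay=50.0))
```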

  21. Conclusions and Implications • Replacing an enterprise (ES) disk on the first detection of corruption makes sense. • Replacement cost may not be a huge factor since the probability of the first corruption is low.

  22. Thank You
