
Availability in Globally Distributed Storage Systems

Availability in Globally Distributed Storage Systems. Presented by Ala'a Ibrahim. Authors: Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan.



Presentation Transcript


  1. Availability in Globally Distributed Storage Systems • Presented by Ala'a Ibrahim • Authors: Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan

  2. OUTLINE • Introduction • Disk failures • Correlated Failures • Fault Tolerance Mechanisms • Markov Model of Stripe Availability • Markov Model Findings • Conclusions

  3. Data Center

  4. Data Center Components Server Components Interconnects  Racks Cluster of Racks

  5. Data Center Components ALL THESE COMPONENTS CAN FAIL Server Components Interconnects  Racks Cluster of Racks

  6. Cell, Stripe and Chunk [Figure: two cells (CELL 1, CELL 2) and two GFS instances (GFS Instance 1, GFS Instance 2), each shown with Stripe 1 and Stripe 2 composed of chunks]

  7. Failure Sources • Failure sources • Hardware: disks, memory, etc. • Software: chunk server process • Network interconnect • Power distribution unit • Availability • Reasons for unavailability • Overloaded • Crash or restart • Hardware error • Automated repair processes

  8. Disk failures • Node restarts • Planned machine reboots • Unplanned machine reboots • Unknown

  9. Fault Tolerance Mechanisms • Replication (R = n) • ‘n’ identical chunks (‘n’ is the replication factor) are placed on storage nodes in different racks/cells/DCs • Erasure Coding (RS (n, m)) • ‘n’ distinct data blocks and ‘m’ code blocks • Can recover at most ‘m’ lost blocks from the remaining ‘n’ blocks

  10. Replication • 1 chunk, 5 replicas • Fast encoding / decoding • Very space inefficient

  11. Erasure Coding ‘n’ data blocks ‘m’ code blocks Encode ‘n + m’ blocks

  12. Erasure Coding ‘n’ data blocks ‘m’ code blocks Encode ‘n + m’ blocks

  13. Erasure Coding • Encode: ‘n’ data blocks become ‘n + m’ blocks (‘n’ data blocks and ‘m’ code blocks) • Decode: ‘n + m’ blocks back to ‘n’ data blocks • Highly space efficient • Slow encoding / decoding
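
The space trade-off between the two mechanisms can be made concrete with a little arithmetic. The sketch below is illustrative only: the `Scheme` class and the helper names are hypothetical, and the RS(6, 3) parameters are an assumed example rather than values taken from the slides. It computes how many blocks are stored per block of user data and how many lost blocks each scheme can tolerate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Hypothetical description of a redundancy scheme (not from the slides)."""
    name: str
    data_blocks: int   # blocks that hold original data
    total_blocks: int  # blocks actually written to storage

    @property
    def storage_overhead(self) -> float:
        # Blocks stored per block of user data.
        return self.total_blocks / self.data_blocks

    @property
    def tolerated_losses(self) -> int:
        # Blocks that may be lost while the data stays recoverable.
        return self.total_blocks - self.data_blocks


def replication(r: int) -> Scheme:
    # R = r: one chunk stored as r identical copies.
    return Scheme(f"R={r}", data_blocks=1, total_blocks=r)


def reed_solomon(n: int, m: int) -> Scheme:
    # RS(n, m): n data blocks plus m code blocks.
    return Scheme(f"RS({n},{m})", data_blocks=n, total_blocks=n + m)


if __name__ == "__main__":
    # R=5 matches the replication slide; RS(6, 3) is an assumed example.
    for scheme in (replication(5), reed_solomon(6, 3)):
        print(scheme.name, scheme.storage_overhead, scheme.tolerated_losses)
```

Under these assumptions, five-way replication stores five blocks per data block to survive four losses, while the assumed RS(6, 3) layout stores 1.5 blocks per data block to survive three.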

  14. Correlated Failures • Failure Domain • Set of machines that fail simultaneously from a common source of failure • Failure Burst • Sequence of node failures, each occurring within a time window ‘w’ of the next • Window ‘w’ = 120 s
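
The failure burst definition above translates directly into a grouping pass over failure timestamps. The following is a minimal sketch, assuming failures arrive as a plain list of timestamps in seconds; the function name `group_failure_bursts` and the sample data are hypothetical. A failure joins the current burst when it occurs within `window` seconds of the previous failure, otherwise it starts a new burst.

```python
def group_failure_bursts(timestamps, window=120.0):
    """Group failure times (seconds) into bursts.

    Consecutive failures belong to the same burst when the gap between
    them is at most `window` seconds (120 s on the slide).
    """
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= window:
            bursts[-1].append(t)   # within the window of the previous failure
        else:
            bursts.append([t])     # gap too large: start a new burst
    return bursts


if __name__ == "__main__":
    # Made-up failure times, for illustration only.
    sample = [0, 50, 160, 400, 410, 1000]
    print(group_failure_bursts(sample))
    # [[0, 50, 160], [400, 410], [1000]]
```

Note that a burst can span more than `window` seconds in total, since only the gap between neighbouring failures is checked.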

  15. Correlated Failures… Failure Burst (Window Size)

  16. Markov Model • Chunk placement policy • Cell Simulation • trace-based simulation • Priority queue

  17. Markov Chain
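
The slide gives no detail on the chain itself, so the following is only a simplified sketch of how a Markov chain can model stripe availability, not the authors' model. It assumes each state is the number of available chunks in a stripe, that chunks fail independently at a constant per-chunk rate `fail_rate`, that one chunk is recovered at a time at rate `recovery_rate`, and that the stripe becomes unavailable once fewer than `needed` chunks remain. All names and rates are hypothetical, and correlated failures are ignored.

```python
def mean_time_to_unavailability(total, needed, fail_rate, recovery_rate):
    """Expected time until fewer than `needed` chunks remain.

    Starts with all `total` chunks available. For each transient state s
    (needed <= s <= total) the expected time T_s satisfies
        (down + up) * T_s - down * T_{s-1} - up * T_{s+1} = 1
    with down = s * fail_rate, up = recovery_rate (0 when s == total),
    and T_{needed-1} = 0 for the absorbing "unavailable" state.
    """
    states = list(range(needed, total + 1))
    size = len(states)
    a = [[0.0] * size for _ in range(size)]
    b = [1.0] * size
    for i, s in enumerate(states):
        down = s * fail_rate
        up = recovery_rate if s < total else 0.0
        a[i][i] = down + up
        if i > 0:
            a[i][i - 1] = -down
        if i < size - 1:
            a[i][i + 1] = -up

    # The system is tridiagonal: eliminate the sub-diagonal ...
    for i in range(1, size):
        factor = a[i][i - 1] / a[i - 1][i - 1]
        a[i][i] -= factor * a[i - 1][i]
        b[i] -= factor * b[i - 1]
    # ... then back-substitute.
    x = [0.0] * size
    x[-1] = b[-1] / a[-1][-1]
    for i in range(size - 2, -1, -1):
        x[i] = (b[i] - a[i][i + 1] * x[i + 1]) / a[i][i]
    return x[-1]  # expected time starting from the fully available state


if __name__ == "__main__":
    # Made-up rates: compare a stripe with and without recovery.
    print(mean_time_to_unavailability(3, 1, fail_rate=0.01, recovery_rate=0.0))
    print(mean_time_to_unavailability(3, 1, fail_rate=0.01, recovery_rate=1.0))
```

Even this toy chain shows the qualitative point that a faster recovery rate lengthens the expected time before a stripe becomes unavailable.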

  18. Conclusion • The findings provide feedback for improving • Replication and encoding schemes • Recovery rate
