
Availability in Globally Distributed Storage Systems


Presentation Transcript


  1. Availability in Globally Distributed Storage Systems Derek Weitzel

  2. Failures in the System • Two major components in a node: Applications and System

  3. Failures in the System [Diagram: storage stacks compared — Application layer (Google: Bigtable; Nebraska: Cluster Scheduler), distributed file system (Google: GFS; Nebraska: Hadoop), local file systems, hard drives]

  4. Failures in the System • Similar systems at Nebraska [Same stack diagram: Google's Bigtable and GFS map to Nebraska's Cluster Scheduler and Hadoop]

  5. Failures in the System • Similar systems at Nebraska [Stack diagram, annotated: a failure at any layer will cause unavailability]

  6. Failures in the System • Similar systems at Nebraska [Stack diagram, annotated: a hard-drive failure could cause data loss; a failure at any other layer will cause unavailability]

  7. Unavailability: Defined • Data on a node is unreachable • Detection: periodic heartbeats are missing • Correction: lasts until the node comes back, or until the system recreates the data
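Below is a minimal sketch of the heartbeat-based detection this slide describes. The heartbeat interval, the missed-beat threshold, and all names (`HeartbeatMonitor`, `node-42`) are illustrative assumptions; the paper only says that unavailability is detected through missing periodic heartbeats.

```python
import time

HEARTBEAT_INTERVAL = 5          # seconds between expected heartbeats (assumed)
MISSED_BEFORE_UNAVAILABLE = 3   # missed beats before a node is declared down (assumed)

class HeartbeatMonitor:
    """Marks a node unavailable once several consecutive heartbeats go missing."""

    def __init__(self):
        self.last_seen = {}     # node id -> timestamp of the last heartbeat

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = time.time()

    def unavailable_nodes(self):
        deadline = HEARTBEAT_INTERVAL * MISSED_BEFORE_UNAVAILABLE
        now = time.time()
        return [n for n, t in self.last_seen.items() if now - t > deadline]

# Usage: nodes call record_heartbeat() periodically; the master polls
# unavailable_nodes() and queues re-replication for data stored on them.
monitor = HeartbeatMonitor()
monitor.record_heartbeat("node-42")
print(monitor.unavailable_nodes())   # [] while heartbeats keep arriving
```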

  8. Unavailability: Measured

  9. Unavailability: Measured [Plot annotation: replication starts]

  10. Unavailability: Measured • Question: after replication starts, why does it take so long to recover? [Plot annotation: replication starts]

  11. Node Availability [Plot annotation: storage software restart]

  12. Node Availability • Software is fast to restart [Plot annotation: storage software restart]

  13. Node Availability: Time [Plot annotation: planned reboots]

  14. Node Availability: Time • Node updates (planned reboots) cause the most downtime. [Plot annotation: planned reboots]

  15. MTTF for Components • Even though disk failure can cause data loss, node failure happens much more often • Conclusion: node failure is more important to system availability

  16. Correlated Failures • A large number of nodes failing in a burst can reduce the effectiveness of replication and encoding schemes • Losing nodes before re-replication can start can make data unavailable
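A toy simulation of why bursts hurt: the same ten node failures can destroy stripes when they land together, but destroy nothing when recovery runs between them. The cluster size, stripe count, and random placement here are made-up parameters, not the paper's methodology.

```python
import random

NODES, STRIPES, R = 100, 10_000, 3   # assumed cluster size, stripe count, replication

def place_stripes(rng):
    """Each stripe's R replicas live on R distinct, randomly chosen nodes."""
    return [set(rng.sample(range(NODES), R)) for _ in range(STRIPES)]

def lost(failed, stripes):
    """A stripe is lost only if ALL of its replicas are on failed nodes."""
    return sum(1 for replicas in stripes if replicas <= failed)

rng = random.Random(1)
stripes = place_stripes(rng)
burst = set(rng.sample(range(NODES), 10))

# Correlated burst: 10 nodes die together, before re-replication can run.
print("lost in correlated burst:", lost(burst, stripes))

# Independent failures: the same 10 nodes fail one at a time with full
# recovery in between -- one node can never hold all R replicas of a stripe.
print("lost with recovery between failures:",
      sum(lost({n}, stripes) for n in burst))
```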

  17. Correlated Failures

  18. Correlated Failures [Plot annotation: rolling reboots of the cluster]

  19. Correlated Failures [Plot annotation: "Oh s*!t, datacenter on fire!" (maybe not that bad)]

  20. Coping with Failure

  21. Coping with Failure [Diagram: encoding vs. replication]

  22. Coping with Failure [Table: encoding vs. replication, with MTTF figures of 27,000 years and 27.3 M years] • 3 replicas is standard in large clusters
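As a back-of-the-envelope illustration of the trade-off (not the paper's model, which is where the MTTF figures above come from), here is the unavailability probability of a stripe under independent node failures, comparing 3-way replication with an RS(9,6)-style code; the per-node unavailability probability `p` and the code parameters are assumptions.

```python
from math import comb

def replication_unavail(p, r=3):
    """A replicated stripe is unreadable only if all r replicas are down."""
    return p ** r

def rs_unavail(p, n=9, k=6):
    """An RS(n, k) stripe is readable while at least k of its n chunks survive."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(n - k + 1, n + 1))

p = 0.01   # assumed probability that any one node is unavailable
print(f"R=3 replication, 3.0x storage: {replication_unavail(p):.2e}")
print(f"RS(9,6) striping, 1.5x storage: {rs_unavail(p):.2e}")
```

At these assumed parameters the code reaches comparable availability at half the storage overhead, which is the efficiency argument behind the question on slide 37; correlated bursts erode that advantage.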

  23. Coping with Failure • Cell Replication (Datacenter Replication)

  24.–27. Cell Replication [Diagram, built up over four slides: Block A stored as two replicas in Cell 1 and two replicas in Cell 2]
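A sketch of the multi-cell placement idea from these slides: keep copies of each block in more than one cell, so that even losing a whole cell (datacenter) leaves the block readable. The cell layout, copy counts, and helper names are hypothetical.

```python
# Assumed layout: two cells with a few nodes each (hypothetical names).
CELLS = {"cell-1": ["n1", "n2", "n3"], "cell-2": ["n4", "n5", "n6"]}

def place_block(copies_per_cell=2):
    """Place copies_per_cell replicas of a block in every cell, matching the
    slides' picture of Block A stored twice in Cell 1 and twice in Cell 2."""
    return {cell: nodes[:copies_per_cell] for cell, nodes in CELLS.items()}

def available(placement, dead_cells=()):
    """The block stays readable as long as any surviving cell holds a copy."""
    return any(cell not in dead_cells for cell in placement)

layout = place_block()
print(layout)                                     # replicas in both cells
print(available(layout, dead_cells=("cell-1",)))  # True: cell-2 still has copies
```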

  28. Modeling Failures • We've seen the data; now let's model the behavior.

  29. Modeling Failures • A chunk of data can be in one of many states • Consider replication = 3 [State diagram: 3 → 2 → 1 → 0 replicas] • Lose a replica, and 2 are still available

  30. Modeling Failures • A chunk of data can be in one of many states • Consider replication = 3 [State diagram: 3 → 2 → 1 → 0 replicas, with a recovery transition back up] • 0 replicas = service unavailable

  31. Modeling Failures • Each loss of a replica happens with a known probability • The recovery rate is also known [State diagram: 3 → 2 → 1 → 0 replicas, with recovery] • 0 replicas = service unavailable

  32. Markov Model • ρ = recovery rate • λ = failure rate • s = number of block replicas • r = minimum replication
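Using the slide's parameters, here is a sketch of how an MTTF falls out of the Markov model. It handles only independent failures, via the standard hitting-time recurrence for a birth–death chain (the paper's full model also covers correlated failures), and the example rates are assumptions.

```python
def mttf(lam, rho, s=3, r=1):
    """Expected time for a stripe to drop from s live replicas below the
    minimum r (i.e., to reach state r-1 and become unavailable).

    State = number of live replicas.  A state with i replicas loses one at
    rate i*lam (independent per-replica failures) and regains one at rate
    rho.  h[i] = expected time to fall from state i to i-1; the MTTF is the
    sum of h[r..s], since the chain must cross every level on the way down.
    """
    h = {s: 1.0 / (s * lam)}             # no recovery transition above state s
    for i in range(s - 1, r - 1, -1):
        h[i] = (1.0 + rho * h[i + 1]) / (i * lam)
    return sum(h[i] for i in range(r, s + 1))

# Assumed rates, in events per year: each replica fails about once every
# two years; a lost replica is re-created in about one hour.
lam, rho = 0.5, 24 * 365.0
print(f"MTTF with s=3, r=1: {mttf(lam, rho):,.0f} years")
```

The enormous result is typical of independent-failure models; correlated bursts, as the earlier slides show, make real-world figures far lower.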

  33. Modeling Failures • Using Markov models, we can find the mean time to failure (MTTF):

  34. Modeling Failures • Using Markov models, we can find the MTTF: 402 years for Nebraska

  35. Modeling Failures • For Multi-Cell Implementations

  36. Paper Conclusions • Given the enormous amount of data from Google, the paper can say: • Failures are typically short • Node failures can happen in bursts and are not independent • In modern distributed file systems, a disk failure is effectively the same as a node failure • Built a Markov model of failures that can accurately reason about past and future availability

  37. My Conclusions • This paper contributed greatly by showing data from very-large-scale distributed file systems • If Reed–Solomon striping is so much more efficient, why isn't it used by Google? Hadoop? Facebook? • Complicated code? • Complicated administration?
