Dynamic Distributed Storage System Availability

Research on methods for sustaining availability in distributed storage systems across topology changes. The presentation evaluates replication strategies and their impact on system availability, and discusses design constraints, algorithms, and preliminary conclusions.


Presentation Transcript


  1. Elastically Replicated Information Services: Sustaining the Availability of Distributed Storage Across Dynamic Topological Changes • Jose Torres-Berrocal • Dr. Bienvenido Velez-Rivera • Research in Process • Sponsored by the Program for Research in Computing and Information Sciences and Engineering (PRECISE), NSF-EIA Grant 99-77071

  2. Research Objective • Develop a method or algorithm that dynamically sustains the availability of a distributed storage system above a desired threshold value while the topology changes.

  3. Availability Definition • Availability generally refers to the probability (P) that a system is operating correctly at any given moment. • State diagram: two states, Available (probability P) and Failed (probability 1 - P).

  4. Definition: Distributed Storage Cluster (DSC) • A distributed storage cluster (DSC) comprises two or more storage nodes that function in a coordinated fashion as a single storage system. • Figure: storage nodes 0 through N, each holding a data object (X0 through XN).

  5. Example of a DSC Failure • When a node fails, the objects it contains become unavailable. • Thus the system becomes unavailable. • Figure: DSC with no redundancy (nodes 1 and 2, objects X1 and X2); the failed node holds X2, so the system fails due to the missing object.
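
As a hedged worked example of the no-redundancy case: assume n nodes that fail independently, each available with probability P, and each holding a distinct object the system needs. The independence assumption and the numbers below are illustrative and are not taken from the slides. The system is then available only when every node is available:

```latex
A_{\text{sys}} = P^{\,n},
\qquad \text{e.g. } P = 0.99,\ n = 10 \;\Rightarrow\; A_{\text{sys}} = 0.99^{10} \approx 0.904
```

Even highly available nodes therefore yield a noticeably less available system once many of them must all be up at the same time.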

  6. Using Replication to Tolerate Failures on a DSC • This is what RAID systems do. • Figure: DSC with redundancy (label: 50%); replicas of objects X1 and X2 are spread over the nodes, so the object in the failed node is available at another node.
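
The effect of replication on availability can be checked exactly for small clusters. The following is a minimal sketch, assuming nodes fail independently with the same per-node availability `p`; the function name `system_availability` and the example placements are hypothetical and do not come from the presentation. It enumerates every up/down combination of nodes and sums the probability of those in which each object still has a live copy.

```python
from itertools import product


def system_availability(placement, num_nodes, p):
    """Exact probability that every object has at least one live replica.

    placement: list of sets; placement[i] holds the node ids storing object i.
    p: per-node availability (assumption: nodes fail independently).
    Cost grows as 2**num_nodes, so this only suits small clusters.
    """
    total = 0.0
    for state in product((True, False), repeat=num_nodes):
        # The system is up only if each object has a copy on some live node.
        if all(any(state[n] for n in nodes) for nodes in placement):
            prob = 1.0
            for up in state:
                prob *= p if up else (1.0 - p)
            total += prob
    return total


# Hypothetical two-node cluster with two objects.
no_redundancy = [{0}, {1}]        # one copy of each object
replicated = [{0, 1}, {0, 1}]     # each object copied to both nodes

print(system_availability(no_redundancy, 2, 0.9))
print(system_availability(replicated, 2, 0.9))
```

With one copy per object, both nodes must be up; with a copy on each node, the system survives any single node failure.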

  7. Storage Systems Must Adapt to Changes • Dynamic changes • Internet store • 24/7 operation • Unattended

  8. Availability as Nodes Are Added, Compared to the Desired Threshold • Adding nodes changes the topology. • The topology could change at any time, affecting availability. • Plot: availability A(t) versus # nodes; the desirable curve is g(#nodes) = near constant, above the threshold (minimal tolerable availability); the actual curve f(#nodes) = ? is the open question.

  9. Road Map • State the problem • Solution design constraints • Ongoing research • Previous work compliance • Preliminary conclusions

  10. Design Constraints for a Desirable Method • Distributed storage management • 24/7 operation • Minimal redundancy • Works with write-intensive as well as read-intensive contexts • Minimum human intervention • Manages dynamic incidental changes due to the addition of nodes

  11. Elastically Replicated Info Services: Research Methodology • Develop a mathematical model for a Distributed Storage Cluster (DSC) • Develop a simulator to derive system availability • Parameters: Mean Time to Failure (MTTF, provided by device manufacturers), object count, node count, redundancy, node utilization • Test alternative algorithms
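
The slides do not show the simulator itself, so the following is only a minimal Monte Carlo sketch of how system availability could be estimated from node count, object count, and redundancy. It assumes independent node failures and takes a per-node availability `p` directly as input; deriving `p` from a manufacturer's MTTF would need a further assumption (such as a repair time) that the presentation does not state. The name `simulate_availability` and the example placement are hypothetical.

```python
import random


def simulate_availability(placement, num_nodes, p, trials=10_000, seed=0):
    """Estimate the probability that every object has a live replica.

    placement: list of sets; placement[i] holds the node ids storing object i.
    p: per-node availability (assumption: independent, identical nodes).
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    available_trials = 0
    for _ in range(trials):
        # Draw an up/down state for every node in this trial.
        up = [rng.random() < p for _ in range(num_nodes)]
        if all(any(up[n] for n in nodes) for nodes in placement):
            available_trials += 1
    return available_trials / trials


# Hypothetical cluster: 3 nodes, 3 objects, 2 copies of each object.
placement = [{0, 1}, {1, 2}, {2, 0}]
print(simulate_availability(placement, num_nodes=3, p=0.95))
```

Sampling scales to cluster sizes where exhaustive enumeration of node states is impractical, at the cost of statistical error that shrinks as `trials` grows.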

  12. Math Model of a DSC • Figure (DSC math model): DSC with 9 nodes/disks (numbered 0 to 8) and 5 distinct objects (X0 to X4), some of them replicated on more than one node.

  13. Uniform Distribution Algorithm • DSC initial state. • DSC after adding one node. • DSC after adding the next node. • Keep adding nodes until #nodes = #objects.
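
The slide gives only figure captions, so the placement rule below is an assumption: one plausible reading of uniform distribution is that object copies are dealt round-robin over all current nodes and redistributed whenever a node is added. The helper names `uniform_placement` and `nodes_in_use` are hypothetical.

```python
def uniform_placement(num_objects, num_nodes, replicas=1):
    """Deal object copies round-robin over every node in the cluster.

    Returns a list of sets: entry i holds the node ids storing object i.
    Assumed rule, not the authors' exact algorithm.
    """
    if replicas > num_nodes:
        raise ValueError("cannot place more replicas than there are nodes")
    placement = []
    for obj in range(num_objects):
        first = obj * replicas
        # Consecutive slots modulo num_nodes keep an object's copies apart.
        placement.append({(first + k) % num_nodes for k in range(replicas)})
    return placement


def nodes_in_use(placement):
    """Number of distinct nodes that hold at least one object copy."""
    return len(set().union(*placement))


# Four objects, one copy each, as the cluster grows from 2 to 5 nodes.
for n in range(2, 6):
    print(n, nodes_in_use(uniform_placement(4, n)))
# prints:
# 2 2
# 3 3
# 4 4
# 5 4
```

Under this reading, the number of nodes the system depends on grows with the cluster until #nodes = #objects, after which extra nodes stay empty.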

  14. Centric Algorithm • DSC initial state. • The DSC keeps objects in their initial locations while nodes are added.
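
A minimal sketch of the centric behavior, under the assumption that objects are dealt round-robin over the initial nodes and never moved afterwards. The function names and the "fraction of nodes holding data" measure are hypothetical illustrations, not the presentation's definitions.

```python
def centric_placement(num_objects, initial_nodes, current_nodes, replicas=1):
    """Keep every object copy on the initial nodes, however many are added.

    Returns a list of sets: entry i holds the node ids storing object i.
    Nodes numbered initial_nodes..current_nodes-1 remain empty.
    """
    if current_nodes < initial_nodes:
        raise ValueError("current_nodes must be at least initial_nodes")
    if replicas > initial_nodes:
        raise ValueError("cannot place more replicas than initial nodes")
    placement = []
    for obj in range(num_objects):
        first = obj * replicas
        placement.append({(first + k) % initial_nodes for k in range(replicas)})
    return placement


def fraction_of_nodes_holding_data(placement, current_nodes):
    """Share of the cluster's nodes that store at least one object copy."""
    return len(set().union(*placement)) / current_nodes


# Four objects on a cluster that starts with 2 nodes and grows to 5.
for n in range(2, 6):
    layout = centric_placement(4, initial_nodes=2, current_nodes=n)
    print(n, layout, fraction_of_nodes_holding_data(layout, n))
```

The placement is identical at every cluster size, so the set of nodes the system depends on stays fixed while the share of nodes actually used shrinks.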

  15. Utilization vs. Availability Relationship • Figure labels: Maximum Utilization (U), Minimum Utilization (U), Uniform distribution, No Disk, Maximum Availability (A), Minimum Availability (A) • Plot: A and U versus #Nodes, with the relationship between them marked "?".

  16. Extreme Algorithm Results • Availability decreases even with the use of redundancy. • Availability decreases rapidly as nodes are added when using Uniform distribution. • Plot: Uniform distribution algorithm.

  17. DSC Hybrid Model – Redundancy Calculation • 6 out of 10 copies • 10 original objects • Figure: DSC matrix visualization, hybrid distribution.

  18. DSC Hybrid Model – Utilization Factor Calculation • 4 out of 10 nodes • 2 out of 10 nodes • Figure: DSC matrix visualization, hybrid distribution.

  19. Hybrid Algorithm Results • The Down region utilization parameter affects availability more than the Up region parameter. • Even though availability decreases, the family of curves follows a similar trend with no significant change. • Plots: Up dist. variable with Down dist. constant; Up dist. constant with Down dist. variable.

  20. Hybrid and Extreme Algorithms Comparison • Hybrid falls between Centric and Uniform in both parameters. • Overall utilization decreases when using the Centric algorithm. • The Hybrid plot is for u-50 d-5 at 50% redundancy. • The Hybrid algorithm sustains availability longer than Uniform distribution.

  21. Current Methods to Comply With Design Constraints • Consensus Based • Cache • RAID • Data Trading

  22. Current methods compliance with design constraints

  23. Preliminary Conclusions • Availability decreases rapidly as nodes are added when the system uses a constant replication value and maximum usability. • An ERIS-type method is needed. • Utilization is the counterpart of availability: as utilization increases, availability decreases. • What makes the system vulnerable in terms of utilization is that the more places the objects can be located, the more opportunities there are to lose an object. • The region or group of nodes holding the fewest replicas is the predominant point of failure of the system (the chain breaks at its weakest link).
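
The weakest-link remark can be made precise with a short bound. Assume, as an illustration not stated in the slides, that nodes fail independently with per-node availability P and that object i has r_i replicas on distinct nodes, so object i is available with probability A_i = 1 - (1 - P)^{r_i}. Because the system needs every object, its availability cannot exceed that of the least replicated object:

```latex
A_{\text{sys}}
  = \Pr\Big[\,\bigcap_i \{\text{object } i \text{ available}\}\Big]
  \;\le\; \min_i A_i
  \;=\; 1 - (1 - P)^{\min_i r_i}
```

For example, with P = 0.9 and a single object stored only once, the system availability is at most 0.9 no matter how many replicas the other objects have.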

  24. Current Methods Characteristics: Pre-Dynamic Methods • Fit characteristics: distributed storage; controlled redundancy • Partial-fit characteristics: works with write-intensive as well as read-intensive contexts (depends on parameters preconfigured according to a priori studies) • Unfit characteristics: 24/7 operation (operation has to stop to allow changes to the preconfigured parameters); does not manage dynamic incidental changes to any number of nodes; not fully automatic

  25. Consensus Based Characteristics

  26. Network Cache Method Characteristics • Figure: cache diagram with nodes labeled Node 3, Node 20, and Node 21 and the repeated item label 9.

  27. RAID Characteristics

  28. Data Trading Characteristics • Figure: nodes labeled Node 3, Node 6, and Node 8 holding data items labeled A, B, C, and D.

  29. Simulator Validation • Plot: theoretical vs. simulator calibration curves.
