Research on developing methods to sustain availability in distributed storage systems across topology changes. Evaluating replication strategies and their impact on system availability. Constraints, algorithms, and preliminary conclusions are discussed.
Elastically Replicated Information Services: Sustaining the Availability of Distributed Storage Across Dynamic Topological Changes. Jose Torres-Berrocal, Dr. Bienvenido Velez-Rivera. Research in process. Sponsored by the Program for Research in Computing and Information Sciences and Engineering (PRECISE), NSF-EIA Grant 99-77071.
Research Objective: Develop a method or algorithm that dynamically keeps the availability of a distributed storage system above a desired threshold value while the topology changes.
Availability Definition • Availability generally refers to the probability (P) that a system is operating correctly at any given moment. [State diagram: two states, Available (probability P) and Failed (probability 1 - P)]
Definition: Distributed Storage Cluster (DSC) • A distributed storage cluster (DSC) comprises two or more storage nodes that function in a coordinated fashion as a single storage system. [Figure: storage nodes 0 through N holding data objects X0 through XN]
Example of a DSC Failure • When a node fails, the objects it contains become unavailable • Thus the SYSTEM becomes unavailable [Figure: DSC with no redundancy; nodes 1 and 2 hold objects X1 and X2; the failed node holds X2, so the system fails due to the missing object]
Using Replication to Tolerate Failures on a DSC • This is what RAIDs do [Figure: DSC with redundancy (labeled 50%); object X2 in the failed node is available as a replica at another node]
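The effect of replication on availability can be sketched numerically. The snippet below is a minimal illustration, assuming that nodes fail independently and that every node is up with the same probability `p`. These assumptions and the function names are introduced here for illustration and do not come from the original slides.

```python
def object_availability(p: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is reachable.

    Assumes each copy sits on a distinct node and nodes fail independently.
    """
    return 1.0 - (1.0 - p) ** replicas


def system_availability_no_redundancy(p: float, nodes: int) -> float:
    """Every node holds a unique object, so all `nodes` must be up at once."""
    return p ** nodes


if __name__ == "__main__":
    p = 0.99  # hypothetical per-node availability
    print("one copy:        ", object_availability(p, 1))
    print("two copies:      ", object_availability(p, 2))
    print("3 nodes, no red.:", system_availability_no_redundancy(p, 3))
```

Under these assumptions a second copy moves a single object from `p` to `1 - (1 - p)**2`, while a cluster without redundancy gets less available with every node it depends on.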
Storage Systems Must Adapt to Changes • Dynamic changes • Internet store • 24/7 operation • Unattended
Availability as Nodes Are Added, Compared to the Desired Threshold • Adding nodes changes the topology. • The topology could change at any time, affecting availability. [Plot: availability A(t) vs. number of nodes; desirable curve g(#nodes) = near constant, above the threshold (minimal tolerable availability); actual curve f(#nodes) = ?]
Road Map • State the problem • Solution design constraints • Ongoing research • Previous work compliance • Preliminary conclusions
Design Constraints for a Desirable Method • Distributed storage management • 24/7 operation • Minimal redundancy • Works in write-intensive as well as read-intensive contexts • Minimum human intervention • Manages dynamic incidental changes due to the addition of nodes
Elastically Replicated Info Services: Research Methodology • Develop a mathematical model for a Distributed Storage Cluster (DSC) • Develop a simulator to derive system availability • Parameters: Mean Time to Failure (MTTF, provided by device manufacturers); object count; node count; redundancy; node utilization • Test alternative algorithms
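A simulator of the kind listed in the methodology could look roughly like the following Monte Carlo sketch. It is not the authors' simulator. It assumes independent node failures, and it converts MTTF into a per-node up probability with an assumed mean time to repair (`mttr_hours`), a parameter that the slides do not mention. All function and parameter names are hypothetical.

```python
import random


def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one node; MTTR is an assumed extra input."""
    return mttf_hours / (mttf_hours + mttr_hours)


def simulate_availability(placement, node_count, p_up, trials=20000, seed=0):
    """Estimate the probability that every object keeps a replica on a live node.

    placement: list of sets; placement[i] holds the node indices storing object i.
    """
    rng = random.Random(seed)
    available = 0
    for _ in range(trials):
        # Draw an independent up/down state for each node.
        up = [rng.random() < p_up for _ in range(node_count)]
        # The system counts as available only if no object is lost.
        if all(any(up[n] for n in nodes) for nodes in placement):
            available += 1
    return available / trials


if __name__ == "__main__":
    p = node_availability(mttf_hours=99.0, mttr_hours=1.0)
    # Hypothetical layout: 3 objects, 2 replicas each, on 4 nodes.
    layout = [{0, 1}, {1, 2}, {2, 3}]
    print(simulate_availability(layout, node_count=4, p_up=p))
```

Object count, node count, and redundancy enter through `placement`, so alternative placement algorithms can be compared by feeding their layouts to the same estimator.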
Math Model of a DSC [Figure: DSC math model; a DSC with 9 nodes/disks (0 through 8) and 5 distinct objects (X0 through X4), some of them replicated]
Uniform Distribution Algorithm [Figure panels: DSC initial state; DSC after adding one node; DSC after adding the next node] • Keep adding nodes until #nodes = #objects.
Centric Algorithm [Figure: DSC initial state] • The DSC keeps the object locations of the initial state while nodes are added.
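The two extreme placement policies can be compared on a toy cluster. The sketch below is one possible reading of them, not the authors' implementation: "uniform" spreads replicas round-robin over all current nodes, and "centric" keeps every replica on the initial nodes. It assumes independent node failures with a common up probability, and every name in it is hypothetical.

```python
from itertools import product


def uniform_placement(object_count, node_count, replicas):
    """Spread replicas round-robin over ALL current nodes (requires replicas <= node_count)."""
    placement = []
    for obj in range(object_count):
        start = (obj * replicas) % node_count
        placement.append({(start + k) % node_count for k in range(replicas)})
    return placement


def centric_placement(object_count, initial_nodes, replicas):
    """Keep every replica on the initial nodes, however many nodes are added later."""
    return uniform_placement(object_count, initial_nodes, replicas)


def exact_availability(placement, node_count, p_up):
    """Sum the probability of each up/down state in which all objects stay reachable."""
    total = 0.0
    for state in product([True, False], repeat=node_count):
        if all(any(state[n] for n in nodes) for nodes in placement):
            ups = sum(state)
            total += p_up ** ups * (1.0 - p_up) ** (node_count - ups)
    return total


if __name__ == "__main__":
    objects, replicas, p = 4, 2, 0.9  # hypothetical parameters
    for nodes in (2, 4, 8):
        uni = exact_availability(uniform_placement(objects, nodes, replicas), nodes, p)
        cen = exact_availability(centric_placement(objects, 2, replicas), nodes, p)
        print(nodes, "nodes: uniform", round(uni, 4), "centric", round(cen, 4))
```

The state enumeration is exponential in the node count, so it only suits small examples. In this toy setting the uniform layout depends on more nodes as the cluster grows, while the centric layout leaves the added nodes unused.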
Utilization vs. Availability Relationship [Plot: availability (A) and utilization (U) vs. #nodes, with the relationship between them marked "?"; labels: Maximum Utilization, Minimum Utilization, Uniform distribution, No Disk, Maximum Availability, Minimum Availability]
Extreme Algorithm Results • Availability decreases even with the use of redundancy • Availability decreases rapidly as nodes are added when using Uniform distribution [Plot: Uniform distribution algorithm]
DSC Hybrid Model – Redundancy Calculation • 10 original objects • 6 out of 10 copies [Figure: DSC matrix visualization, hybrid distribution]
DSC Hybrid Model – Utilization Factor Calculation • 4 out of 10 nodes • 2 out of 10 nodes [Figure: DSC matrix visualization, hybrid distribution]
Hybrid Algorithm Results • The Down region utilization parameter affects availability more than the Up region parameter • Even though availability decreases, the family of curves follows a similar trend with no significant change [Plots: Up dist. variable and Down dist. constant; Up dist. constant and Down dist. variable]
Hybrid and Extreme Algorithms Comparison • Hybrid falls between Centric and Uniform in both parameters • Overall utilization decreases when using the Centric algorithm • The Hybrid algorithm sustains availability longer than Uniform distribution [Plot note: the Hybrid plot is for u-50 d-5 at 50% redundancy]
Current Methods to Comply With Design Constraints • Consensus Based • Cache • RAID • Data Trading
Preliminary Conclusions • Availability decreases rapidly as nodes are added when the system uses a constant replication value and maximum utilization • An ERIS-type method is needed. • System utilization is the counterpart of availability: as utilization increases, availability decreases. • What makes the system vulnerable in terms of utilization is that the more places where objects can be located, the more opportunities there are to lose an object. • The region or group of nodes with the fewest replicas is the predominant point of failure of the system (the chain breaks at the weakest link).
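The weakest-link conclusion can be illustrated with a short bound. This is a sketch under simplifying assumptions that the slides do not state: nodes fail independently, each node is up with probability $p$, and object $i$ has $r_i$ replicas on distinct nodes. The symbols $p$, $r_i$, $r_{\min}$, and $A_{\text{sys}}$ are introduced here for illustration.

```latex
% The system is available only if every object is reachable, so its
% availability cannot exceed that of the least-replicated object:
A_{\text{sys}} \;\le\; \min_i \bigl[\, 1 - (1-p)^{r_i} \,\bigr]
             \;=\; 1 - (1-p)^{r_{\min}},
\qquad r_{\min} = \min_i r_i .

% If, in addition, the objects occupy pairwise disjoint sets of nodes,
% the object failures are independent and
A_{\text{sys}} \;=\; \prod_i \bigl[\, 1 - (1-p)^{r_i} \,\bigr].
```

Every factor of the product is below 1, so spreading a fixed number of replicas per object over more disjoint node groups lowers $A_{\text{sys}}$, which is consistent with the first and fourth conclusions above.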
Current Methods Characteristics • Pre-dynamic methods • Fit characteristics: distributed storage; controlled redundancy • Partial-fit characteristics: works in write-intensive as well as read-intensive contexts (depends on parameters preconfigured according to a priori studies) • Unfit characteristics: 24/7 operation (operation has to stop to allow changes to the preconfigured parameters); does not manage dynamic incidental changes to any number of nodes; not fully automatic
Network Cache Method Characteristics [Figure: network of nodes (Node 3, Node 20, Node 21) with multiple copies labeled 9]
Data Trading Characteristics [Figure: nodes 3, 6, and 8 holding replicas labeled A, B, C, and D]
Simulator Validation [Plot: theoretical vs. simulator calibration curves]