Dynamic Distributed Storage System Availability

Elastically Replicated Information Services: Sustaining the Availability of Distributed Storage Across Dynamic Topological Changes. Jose Torres-Berrocal and Dr. Bienvenido Velez-Rivera. Research in process. Sponsored by the Program for Research in Computing and Information Sciences and Engineering (PRECISE), NSF-EIA Grant 99-77071.

Research Objective: Develop a method or algorithm that dynamically sustains the availability of a distributed storage system above a desired threshold value while the topology changes.

Availability Definition • Availability generally refers to the probability (P) that a system is operating correctly at any given moment. [State diagram: two states, Available (probability P) and Failed (probability 1 - P).]

Definition: Distributed Storage Cluster (DSC) • A distributed storage cluster (DSC) comprises two or more storage nodes that function in a coordinated fashion as a single storage system. [Figure: storage nodes 0 to N holding data objects X0 to XN.]

Example of DSC Failures • When a node fails, the objects it contains become unavailable. • Thus the system becomes unavailable. [Figure: DSC with no redundancy, nodes 1 and 2 holding objects X1 and X2. Labels: Failed Node; System fails due to missing object.]

Using Replication to Tolerate Failures on a DSC • This is what RAIDs do. [Figure: DSC with redundancy (50%), holding objects X1 and X2 and their replicas. Labels: Object Replicas; Failed Node; Object in failed node available at another node.]
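The contrast between the two DSC slides (no redundancy versus replication) can be made concrete with a small calculation. The following is a minimal sketch, not the authors' model: it assumes nodes fail independently with the same per-node availability, and the function name `system_availability` and the example placements are hypothetical.

```python
from itertools import product


def system_availability(placement, node_count, p_node):
    """Exact probability that every object has at least one live replica.

    placement  -- list with one set of node indices per object (assumed layout)
    node_count -- number of nodes in the cluster
    p_node     -- probability that a single node is up (assumed independent)
    """
    total = 0.0
    # Enumerate every up/down combination; practical for small clusters only.
    for states in product((True, False), repeat=node_count):
        prob = 1.0
        for is_up in states:
            prob *= p_node if is_up else 1.0 - p_node
        up_nodes = {n for n, is_up in enumerate(states) if is_up}
        # The system is available only if no object is missing.
        if all(replicas & up_nodes for replicas in placement):
            total += prob
    return total


no_redundancy = [{0}, {1}]        # X1 on node 0, X2 on node 1
with_replicas = [{0, 1}, {0, 1}]  # both nodes hold X1 and X2
print(system_availability(no_redundancy, 2, 0.9))
print(system_availability(with_replicas, 2, 0.9))
```

With a per-node availability of 0.9, the unreplicated layout comes out at about 0.81 and the replicated layout at about 0.99, which is the effect the two figures illustrate.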

Storage Systems Must Adapt to Changes • Dynamic changes • Internet store • 24/7 operation • Unattended

Availability as Nodes Are Added, Compared to the Desired Threshold • Adding nodes changes the topology. • Topology changes can occur at any time, affecting availability. [Plot: availability A(t) versus # nodes. Desirable: g(#nodes) = near constant, above the threshold (minimal tolerable availability). Actual: f(#nodes) = ?]

Road Map • State the problem • Solution design constraints • Ongoing research • Previous work compliance • Preliminary conclusions

Design Constraints for Method desirability • Distributed Storage Management • 24/7 operation • Minimal Redundancy • Works with Write intensive as well as Read intensive contexts • Minimum human intervention • Manage dynamic incidental changes due to the addition of nodes

Elastically Replicated Info Services: Research Methodology
• Develop a mathematical model for a Distributed Storage Cluster (DSC)
• Develop a simulator to derive system availability
• Parameters
  – Mean Time to Failure (MTTF), provided by device manufacturers
  – Object count
  – Node count
  – Redundancy
  – Node utilization
• Test alternative algorithms
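A simulator of the kind listed above could be sketched as a Monte Carlo estimate. This is an illustration under stated assumptions, not the simulator used in the research: the name `simulate_availability` is hypothetical, nodes are assumed to fail independently, and the per-node availability `p_node` is taken as a direct input because the slides do not say how it is derived from MTTF.

```python
import random


def simulate_availability(placement, node_count, p_node, trials=20000, seed=0):
    """Estimate the fraction of trials in which no object is missing.

    placement  -- one collection of node indices per object (assumed layout)
    node_count -- number of nodes in the cluster
    p_node     -- assumed probability that a node is up in a given trial
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    available = 0
    for _ in range(trials):
        # Draw an independent up/down state for every node.
        up = [rng.random() < p_node for _ in range(node_count)]
        # The system counts as available if each object has a live replica.
        if all(any(up[n] for n in replicas) for replicas in placement):
            available += 1
    return available / trials


estimate = simulate_availability([[0], [1]], node_count=2, p_node=0.9)
print(estimate)
```

For the two-node, two-object layout without replicas, the estimate should land close to the analytical value of 0.9 × 0.9, which is one way to calibrate such a simulator against theory.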

Math Model of a DSC [Figure: DSC math model. A DSC with 9 nodes/disks (0 to 8) and 5 distinct objects (X0 to X4), some of them replicated.]

Uniform Distribution algorithm • Uniform distribution. • DSC initial state. • DSC after adding one node. • DSC after adding next node. • Keep adding nodes until #nodes = #objects.
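The slides do not give the exact placement rule, so the following is only a sketch of one way a uniform distribution could behave as nodes are added. The round-robin rule, the function name `uniform_placement`, and the choice of two replicas per object are all assumptions made for illustration.

```python
def uniform_placement(object_count, replicas, node_count):
    """Spread every copy over all current nodes (hypothetical round-robin rule).

    Returns one set of node indices per object; replicas of the same object
    always land on different nodes.
    """
    if replicas > node_count:
        raise ValueError("need at least one node per replica")
    return [
        {(obj + r) % node_count for r in range(replicas)}
        for obj in range(object_count)
    ]


# Re-spread 5 objects each time a node is added, until #nodes = #objects.
for node_count in range(2, 6):
    print(node_count, uniform_placement(5, 2, node_count))
```

Under this rule every node is used at every step, so utilization stays at its maximum while the objects end up in more and more distinct locations.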

Centric algorithm • Centric. • DSC initial state. • The DSC keeps objects at their initial locations while nodes are added.
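As a contrast to spreading objects over new nodes, a centric policy can be sketched as a layout that is computed once and never changes. The code below is an assumption-laden illustration: the names `centric_placement` and `utilization`, the round-robin initial layout, and the definition of utilization as the fraction of nodes holding at least one object are hypothetical, not taken from the slides.

```python
def centric_placement(object_count, replicas, initial_nodes):
    """Place objects on the initial nodes only (hypothetical initial layout)."""
    return [
        {(obj + r) % initial_nodes for r in range(replicas)}
        for obj in range(object_count)
    ]


def utilization(placement, node_count):
    """Fraction of nodes that hold at least one object (assumed definition)."""
    used = set().union(*placement)
    return len(used) / node_count


# The layout is fixed at the initial state; only the node count grows.
placement = centric_placement(5, 2, initial_nodes=2)
for node_count in range(2, 6):
    print(node_count, utilization(placement, node_count))
```

Because the added nodes stay empty, utilization under this definition falls as the cluster grows, while the set of nodes whose failure can cost an object stays the same.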

Utilization vs. Availability Relationship [Figure: availability (A) and utilization (U) versus #nodes, each ranging from minimum to maximum. Labels: Uniform distribution; No Disk. The relationship between A and U is marked with a question mark.]

Extreme Algorithm Results • Availability decreases even with the use of redundancy. • Availability decreases rapidly as nodes are added when using uniform distribution. [Plot: Uniform distribution algorithm.]

DSC Hybrid Model – Redundancy Calculation [Figure: DSC matrix visualization, hybrid distribution. Labels: 10 original objects; 6 out of 10 copies.]

DSC Hybrid Model – Utilization Factor Calculation [Figure: DSC matrix visualization, hybrid distribution. Labels: 4 out of 10 nodes; 2 out of 10 nodes.]

Hybrid Algorithm Results • The Down region utilization parameter affects availability more than the Up region parameter. [Plot: Up dist. variable and Down dist. constant.] • Even though availability decreases, the family of curves follows a similar trend with no significant change. [Plot: Up dist. constant and Down dist. variable.]

Hybrid and Extreme Algorithms Comparison • Hybrid falls between Centric and Uniform in both parameters. • Overall utilization decreases when using the Centric algorithm. • The Hybrid plot is for u-50 d-5 at 50% redundancy. • The Hybrid algorithm sustains availability longer than Uniform distribution.

Current Methods to Comply With Design Constraints • Consensus Based • Cache • RAID • Data Trading

Current methods compliance with design constraints

Preliminary Conclusions
• Availability decreases rapidly as nodes are added when the system uses a constant replication value and maximum utilization.
• An ERIS-type method is needed.
• System utilization is the counterpart of availability: as utilization increases, availability decreases.
• What makes the system vulnerable in terms of utilization is that the more places objects can be located, the more opportunities there are to lose an object.
• The region or group of nodes holding the fewest replicas is the predominant point of failure of the system (the chain breaks at its weakest link).
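The last two conclusions can be illustrated with a simple closed-form calculation. This is a sketch under strong assumptions that are not in the slides: nodes fail independently with the same availability, and objects are grouped into disjoint sets of nodes so that the groups fail independently. The name `availability_by_group` is hypothetical.

```python
from math import prod


def availability_by_group(p_node, replica_counts):
    """System availability for disjoint replica groups (assumed model).

    replica_counts -- one entry per group, giving how many nodes replicate
                      the objects of that group. A group survives if at
                      least one of its nodes is up; the system needs all
                      groups to survive.
    """
    return prod(1.0 - (1.0 - p_node) ** r for r in replica_counts)


concentrated = availability_by_group(0.9, [2])            # one pair of nodes
spread = availability_by_group(0.9, [2] * 5)              # five disjoint pairs
weak_link = availability_by_group(0.9, [1, 3, 3, 3, 3])   # one unreplicated group
print(concentrated, spread, weak_link)
```

With the replication level held constant, spreading objects over five pairs gives a lower availability than keeping them on one pair. In the third layout the single unreplicated group caps availability near 0.9 even though the other groups carry three replicas each.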

Current Methods Characteristics
• Pre-dynamic methods
• Fit characteristics
  – Distributed storage
  – Controlled redundancy
• Partial fit characteristics
  – Works with write-intensive as well as read-intensive contexts: depends on preconfigured parameters set according to a priori studies
• Unfit characteristics
  – 24/7 operation: has to stop operation to allow changes to preconfigured parameters
  – Does not manage dynamic incidental changes to any number of nodes
  – Not fully automatic

Consensus Based Characteristics

Network Cache Method Characteristics [Figure: Node 3, Node 20 and Node 21 holding copies labeled 9.]

RAID Characteristics

Data Trading Characteristics [Figure: Node 3, Node 6 and Node 8 holding copies of objects A, B, C and D.]

Simulator Validation [Plot: theoretical vs. simulator calibration curves.]
