A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes

Yunfeng Zhu1, Patrick P. C. Lee2, Liping Xiang1, Yinlong Xu1, Lingling Gao1 • 1University of Science and Technology of China • 2The Chinese University of Hong Kong • DSN’12

Fault Tolerance • Fault tolerance becomes more challenging in modern distributed storage systems • Increase in scale • Usage of inexpensive but less reliable storage nodes • Fault tolerance is ensured by introducing redundancy across storage nodes • Replication • Erasure codes (e.g., Reed-Solomon codes) • (Figure: replication stores A A A and B B B; an erasure code stores A, B, A+B, A+2B)
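The A, B, A+B, A+2B layout on this slide can be tried out with a small sketch. It assumes plain integer arithmetic purely for illustration (practical Reed-Solomon codes work over finite fields), and the names `COEFFS`, `encode`, and `decode` are hypothetical.

```python
from itertools import combinations

# Each stored fragment is ca*A + cb*B, using the coefficients shown on the slide.
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]

def encode(a, b):
    """Return the four fragments A, B, A+B, A+2B."""
    return [ca * a + cb * b for ca, cb in COEFFS]

def decode(i, vi, j, vj):
    """Recover (A, B) from fragments i and j by solving a 2x2 linear system."""
    (a1, b1), (a2, b2) = COEFFS[i], COEFFS[j]
    det = a1 * b2 - a2 * b1            # non-zero for every pair of fragments
    a = (vi * b2 - vj * b1) // det     # Cramer's rule; the division is exact
    b = (a1 * vj - a2 * vi) // det
    return a, b

fragments = encode(42, 7)
for i, j in combinations(range(4), 2):
    # Any two surviving fragments are enough, so two failures are tolerated.
    assert decode(i, fragments[i], j, fragments[j]) == (42, 7)
```

Replication would need three copies of each of A and B to survive two failures; this layout stores four fragments in total.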

XOR-Based Erasure Codes • Encoding/decoding involve XOR operations only • Low computational overhead • Different redundancy levels • 2-fault tolerant: RDP, EVENODD, X-Code • 3-fault tolerant: STAR • General-fault tolerant: Cauchy Reed-Solomon (CRS)

Failure Recovery • Recovering from node failures is necessary • Preserve the required redundancy level • Avoid data unavailability • Single-node failure recovery • Single-node failure occurs more frequently than a concurrent multi-node failure

Example: Recovery in RDP • An RDP code example with 8 nodes (node 0 to node 7) • (Figure: each node j stores symbols d0,j to d5,j; ⊕ marks the XOR parity relations) • Let’s say node 0 fails. How do we recover node 0?

Conventional Recovery • Idea: use only row parity sets. Recover each lost data symbol (i.e., data chunk) independently • (Figure: symbols read across node 0 to node 7) • Read symbols: 36 • Different metrics can be used to measure the efficiency of a recovery scheme • Then how do we recover node 0 efficiently?
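A minimal sketch of row-only recovery for the 8-node example (p = 7), assuming one byte per symbol and random data. The diagonal parity column is left out because conventional recovery never reads it; `recover_by_rows` is a hypothetical helper name.

```python
import random
from functools import reduce
from operator import xor

def recover_by_rows(stripe):
    """stripe[i] holds row i over columns 0..p-1 (data plus row parity).
    Rebuilds column 0 and counts the symbols read from surviving nodes."""
    rebuilt, reads = [], 0
    for row in stripe:
        survivors = row[1:]                  # every symbol in the row except the lost one
        rebuilt.append(reduce(xor, survivors))
        reads += len(survivors)
    return rebuilt, reads

p = 7
rng = random.Random(0)
data = [[rng.randrange(256) for _ in range(p - 1)] for _ in range(p - 1)]
stripe = [row + [reduce(xor, row)] for row in data]   # append the row parity symbol
lost = [row[0] for row in stripe]                     # contents of failed node 0

rebuilt, reads = recover_by_rows(stripe)
print(rebuilt == lost, reads)   # True 36
```

Each of the 6 lost symbols needs the 6 other symbols of its row, which gives the 36 reads quoted on the slide.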

Minimize Number of Read Symbols • Idea: use a combination of row and diagonal parity sets to maximize overlapping symbols [Xiang, ToS’11] • (Figure: symbols read across node 0 to node 7) • Read symbols: 27 • Improvement: 25%
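The 36 versus 27 figures can be reproduced by brute force. The sketch below assumes the standard RDP layout with node 0 failed: symbol (r, c) lies on diagonal (r + c) mod p, row p-1 is imaginary, and column p holds the diagonal parity. The function name `symbols_read` is hypothetical.

```python
from itertools import product

def symbols_read(x, p):
    """Set of (row, column) symbols read to rebuild node 0.
    x[i] == 0: lost symbol i is rebuilt from its row parity set.
    x[i] == 1: lost symbol i is rebuilt from diagonal parity set i."""
    reads = set()
    for i in range(p - 1):
        if x[i] == 0:
            reads.update((i, c) for c in range(1, p))
        else:
            for c in range(1, p):
                r = (i - c) % p
                if r != p - 1:           # skip the imaginary all-zero row
                    reads.add((r, c))
            reads.add((i, p))            # the diagonal parity symbol itself
    return reads

p = 7
counts = {x: len(symbols_read(x, p)) for x in product((0, 1), repeat=p - 1)}
print(counts[(0,) * (p - 1)])   # 36, rows only
print(min(counts.values()))     # 27, best mix of rows and diagonals
```

Symbols shared between a row parity set and a diagonal parity set are read once, which is where the saving comes from.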

Need A New Metric? • A modern storage system is naturally composed of heterogeneous types of storage nodes • System upgrades • New node addition • A heterogeneous environment • (Figure: a proxy and a new node connected to node 0 through node 7; link bandwidths shown: 109, 68, 26, 110, 113, 86, 110, and 10 Mbps) • Need a new efficient failure recovery solution for heterogeneous environments!

Related Work • Hybrid recovery • Minimize number of read symbols for RAID-6 XOR-based erasure codes • e.g., RDP [Xiang, ToS’11], EVENODD [Wang, Globecom’10] • Enumeration recovery [Khan, FAST’12] • Enumerate all recovery possibilities to achieve optimal recovery for general XOR-based erasure codes • Greedy recovery [Zhu, MSST’12] • Efficient search of recovery solutions for general XOR-based erasure codes • Regenerating codes [Dimakis, ToIT’10] • Nodes encode data during recovery • Minimize recovery bandwidth • Heterogeneous case considered in [Li, Infocom’10], but requires node encoding and collaboration

Challenges • How to enable efficient failure recovery for heterogeneous settings? • Minimizing # of read symbols → homogeneous settings • Performance bottlenecked by poorly performing nodes • How to quickly find the recovery strategy? • Minimizing # of read symbols → deterministic metric • Minimizing general cost → non-deterministic metric • Recovery decision typically can’t be pre-determined

Our Contributions • Target two RAID-6 codes: RDP and EVENODD • XOR-based encoding operations • Goals: • Minimize search time • Minimize recovery cost Cost-based single-node failure recovery for heterogeneous distributed storage systems

Our Contributions • Formulate an optimization problem for single-node failure recovery in heterogeneous settings • Propose a cost-based heterogeneous recovery (CHR) algorithm • Narrow down search space • Suitable for online recovery • Implement and experiment on a heterogeneous networked storage testbed

Model Formulation • Our formulation: nodes v0, v1, ..., vp, of which node vk is the failed node • Weight: node vi has weight wi (w0, w1, ..., wp) • Download distribution: yi symbols are downloaded from surviving node vi (y0, y1, ..., yp) • Minimizing total recovery cost: the weighted sum of downloads, $\sum_{i \ne k} w_i y_i$

Physical Meanings

Solving the Model • Important: which symbols are fetched from surviving nodes must follow the inherent rules of the specific coding scheme • To solve the model, we introduce the recovery sequence (x0, x1, …, xp-2, 0) • xi = 0: di,k is recovered from its row parity set • xi = 1: di,k is recovered from its diagonal parity set • 1) Each recovery sequence represents a feasible recovery solution; 2) the download distribution can be represented by the recovery sequence • An example with 6 nodes (node 0 to node 5), each storing symbols d0,j to d3,j: • recovery sequence: (0, 0, 1, 1, 0) • download distribution: (3, 2, 2, 3, 2)
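The mapping from a recovery sequence to a download distribution can be sketched as follows. It assumes node 0 is the failed node and the standard RDP layout (symbol (r, c) lies on diagonal (r + c) mod p, column p is the diagonal parity node); `download_distribution` is a hypothetical name.

```python
def download_distribution(x):
    """Map a recovery sequence x = (x_0, ..., x_{p-2}, 0) to (y_1, ..., y_p),
    the number of symbols fetched from each surviving node."""
    p = len(x)
    reads = set()
    for i in range(p - 1):
        if x[i] == 0:                        # rebuild d_{i,0} from row parity set i
            reads.update((i, c) for c in range(1, p))
        else:                                # rebuild d_{i,0} from diagonal parity set i
            for c in range(1, p):
                r = (i - c) % p
                if r != p - 1:               # row p-1 is imaginary
                    reads.add((r, c))
            reads.add((i, p))
    y = [0] * p
    for _, c in reads:                       # each distinct symbol is fetched once
        y[c - 1] += 1
    return tuple(y)

print(download_distribution((0, 0, 1, 1, 0)))   # (3, 2, 2, 3, 2)
```

For the slide's sequence (0, 0, 1, 1, 0) this yields (3, 2, 2, 3, 2), 12 symbols in total instead of the 16 that row-only recovery would read for p = 5.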

Solving the Model (2) • Step 1: use the recovery sequence to represent downloads • Step 2: narrow down the search space by only considering min-read recovery sequences (i.e., those that download the minimum number of symbols during recovery) • Step 3: reformulate the model as Minimize

Expensive Enumeration • Challenge: too many min-read recovery sequences to enumerate, even after we narrow down the search space • Observation: many min-read recovery sequences return the same download distribution

Optimize Enumeration Process • Two conditions under which different recovery sequences have the same download distribution: • Shift condition: (0, 0, 0, 1, 1, 1, 0) → (0, 0, 1, 1, 1, 0, 0) → (0, 1, 1, 1, 0, 0, 0) → (1, 1, 1, 0, 0, 0, 0) … • Reverse condition: (0, 0, 0, 1, 1, 1, 0) → (0, 1, 1, 1, 0, 0, 0) • Key idea: not all recovery sequences need to be enumerated (details in the paper)
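The two conditions can be checked numerically on the slide's example. This sketch assumes node 0 failed and the standard RDP layout, and it treats a shift as a cyclic rotation that keeps the trailing placeholder bit at 0; the precise conditions are in the paper, and `distribution` and `valid_shifts` are hypothetical names.

```python
def distribution(x):
    """Download distribution (y_1, ..., y_p) for recovery sequence x, node 0 failed."""
    p = len(x)
    reads = set()
    for i in range(p - 1):
        if x[i]:
            reads |= {((i - c) % p, c) for c in range(1, p) if (i - c) % p != p - 1}
            reads.add((i, p))
        else:
            reads |= {(i, c) for c in range(1, p)}
    return tuple(sum(1 for _, c in reads if c == col) for col in range(1, p + 1))

def valid_shifts(x):
    """Cyclic left shifts of x whose last element is still 0."""
    rotations = (x[s:] + x[:s] for s in range(len(x)))
    return [y for y in rotations if y[-1] == 0]

x = (0, 0, 0, 1, 1, 1, 0)
for y in valid_shifts(x) + [x[::-1]]:
    print(y, distribution(y))   # every line shows (5, 4, 3, 3, 4, 5, 3)
```

All four shifted sequences and the reversed one give the same distribution, so evaluating the cost of one of them covers the rest.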

Cost-based Heterogeneous Recovery (CHR) Algorithm: Intuition • Step 1: initialize a bitmap to track all possible min-read recovery sequences R • Step 2: compute recovery cost of R. • Step 3: mark all shifted and reverse sequences of R as being enumerated • Step 4: switch to another R; return the one with minimum cost
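The four steps can be sketched as below. This is an illustration, not the authors' implementation. It assumes node 0 failed, takes min-read sequences to be those with exactly (p-1)/2 diagonal recoveries (the hybrid-recovery result), uses cost = sum of w_i * y_i, and uses hypothetical bandwidth values; all function names are hypothetical as well.

```python
from itertools import combinations

def download_distribution(x):
    p = len(x)
    reads = set()
    for i in range(p - 1):
        if x[i]:
            reads |= {((i - c) % p, c) for c in range(1, p) if (i - c) % p != p - 1}
            reads.add((i, p))
        else:
            reads |= {(i, c) for c in range(1, p)}
    return tuple(sum(1 for _, c in reads if c == col) for col in range(1, p + 1))

def bitmap_index(x):
    """Position of x in the bitmap, with x_0 as the least significant bit."""
    return sum(bit << i for i, bit in enumerate(x[:-1]))

def min_read_sequences(p):
    for ones in combinations(range(p - 1), (p - 1) // 2):
        yield tuple(1 if i in ones else 0 for i in range(p - 1)) + (0,)

def cost(weights, x):
    return sum(w * y for w, y in zip(weights, download_distribution(x)))

def chr_search(weights):
    """weights[c-1] is the weight of surviving node c, for c = 1..p."""
    p = len(weights)
    enumerated = [False] * (1 << (p - 1))            # Step 1: bitmap of sequences
    best_seq, best_cost, evaluated = None, float("inf"), 0
    for x in min_read_sequences(p):                  # Step 4: move to the next R
        if enumerated[bitmap_index(x)]:
            continue
        evaluated += 1
        c = cost(weights, x)                         # Step 2: recovery cost of R
        if c < best_cost:
            best_seq, best_cost = x, c
        for s in range(p):                           # Step 3: mark shifts and reverses
            shifted = x[s:] + x[:s]
            for z in (shifted, shifted[::-1]):
                if z[-1] == 0:
                    enumerated[bitmap_index(z)] = True
    return best_seq, best_cost, evaluated

bandwidth_mbps = [10, 110, 86, 109, 110, 68, 113]    # hypothetical links for nodes 1..7
weights = [1.0 / b for b in bandwidth_mbps]
best_seq, best_cost, evaluated = chr_search(weights)
print(best_seq, round(best_cost, 4), evaluated)
```

For p = 7 there are 20 min-read sequences under this assumption, but only 4 cost evaluations are needed because each evaluation marks its whole group of shifted and reversed sequences.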

Example • (Figure: the heterogeneous setting with a proxy, a new node, and node 0 through node 7; link bandwidths shown: 109, 68, 26, 110, 113, 86, 110, and 10 Mbps) • Download distribution over nodes 1 to 7 with our proposed CHR algorithm: 3 5 4 4 5 3 3 • Download distribution over nodes 1 to 7 with the hybrid approach [Xiang, ToS’11]: 5 4 3 3 4 5 3

Recovery Cost Comparison • CHR approach vs. hybrid approach: recovery cost reduced by 25.89% • CHR approach vs. conventional approach: recovery cost reduced by 40.91%

Simulation Studies (1): Traverse Efficiency • Evaluate the computational time of CHR • CHR reduces the traverse time of the naive approach by over 90% as p increases!

Simulation Studies (2): Robustness Efficiency • Evaluate whether CHR achieves the global optimum among all feasible recovery sequences • CHR has a very high probability (over 93%) of hitting the globally optimal recovery cost!

Simulation Studies (3): Recovery Efficiency • Evaluate via 100 runs for each p the recovery efficiency of CHR in a heterogeneous storage environment • CHR can reduce recovery cost by up to 50% over the conventional approach • CHR can reduce recovery cost by up to 30% over the hybrid approach

Experiments • Experiments on a networked storage testbed • Conventional vs. Hybrid vs. CHR • Default chunk size = 1MB • Communication via ATA over Ethernet (AoE) • Consider two codes: RDP and EVENODD • Only RDP results shown in this talk • Recovery operation: • Read chunks from surviving nodes • Reconstruct lost chunks • Write reconstructed chunks to a new node • (Figure: storage nodes connected through a Gigabit switch to the recovery process)

Experiments • Two types of Ethernet interface cards are installed in the physical storage devices • 100Mbps → set weight = 1/(100Mbps) • 1Gbps → set weight = 1/(1Gbps) • (Table: configuration for RDP code)
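A small sketch of how such weights turn a download distribution into a cost, assuming cost = sum of w_i * y_i. The link assignment below is hypothetical (the testbed configuration table is not reproduced here); the two distributions are the ones from the earlier 8-node example.

```python
def weights_from_links(link_mbps):
    """Weight of each surviving node = 1 / link speed, as on the slide."""
    return [1.0 / b for b in link_mbps]

def recovery_cost(weights, downloads):
    """Weighted sum of the symbols downloaded from each surviving node."""
    return sum(w * y for w, y in zip(weights, downloads))

# Hypothetical setup: nodes 1 and 6 have 100Mbps cards, the others 1Gbps.
links = [100, 1000, 1000, 1000, 1000, 100, 1000]
w = weights_from_links(links)

heavy_on_slow = (5, 4, 3, 3, 4, 5, 3)   # reads 5 symbols from each slow node
light_on_slow = (3, 5, 4, 4, 5, 3, 3)   # reads 3 symbols from each slow node

print(round(recovery_cost(w, heavy_on_slow), 3))   # 0.117
print(round(recovery_cost(w, light_on_slow), 3))   # 0.081
```

Both distributions read 27 symbols, yet the second is cheaper because it shifts downloads away from the two slow links.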

Different Number of Storage Nodes • Total recovery time for RDP • CHR improves conventional by 21-31% • CHR improves hybrid by 15-20%

Different Chunk Size • Total recovery time for RDP (p = 11) • CHR improves conventional by 18-26% • CHR improves hybrid by 14-19%

Different Failed Nodes • Total recovery time for RDP (p = 11) • CHR still outperforms conventional and hybrid

Conclusions • Address single-node failure recovery in RAID-6 coded heterogeneous storage systems • Formulate a computation-efficient optimization model • Propose a cost-based heterogeneous recovery algorithm • Validate the effectiveness of the CHR algorithm through extensive simulations and testbed experiments • Future work: • Different cost formulations • Extension for general XOR-based erasure codes • Degraded reads • Source code: • http://ansrlab.cse.cuhk.edu.hk/software/chr/

Backup

Cost-based Heterogeneous Recovery (CHR) Algorithm Notation: Algorithm:

Example • (Figure: the heterogeneous setting with a proxy, a new node, and node 0 through node 7; link bandwidths shown: 109, 68, 26, 110, 113, 86, 110, and 10 Mbps; download distribution shown for nodes 1 to 7: 3 5 4 4 5 3 3) • Step 1: Initialize F[0..63] with 0-bits, R = {1110000}, the recovery cost C = MAX_VALUE • Step 2: F[7]=1, mark R’s shifted and reverse recovery sequences: F[56]=F[28]=F[14]=1; calculate the recovery cost for R, C will be 0.7353α; R*, C* will be updated by R, C • Step 3: Get the next min-read recovery sequence R and go to Step 2 • Step 4: Finally, we can find that R* = {1010100} and C* = 0.5449α
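The bitmap indices in Step 2 can be reproduced with a short sketch. It assumes x0 is the least significant bit of the index (inferred from F[7] for R = {1110000}) and that shifts are cyclic rotations that keep the last element at 0; `bitmap_index` and `equivalent_indices` are hypothetical names.

```python
def bitmap_index(x):
    """Index of sequence x in F, with x_0 as the least significant bit."""
    return sum(bit << i for i, bit in enumerate(x[:-1]))

def equivalent_indices(x):
    """Indices of x and of its shifted and reversed sequences that end in 0."""
    found = set()
    for s in range(len(x)):
        shifted = x[s:] + x[:s]
        for z in (shifted, shifted[::-1]):
            if z[-1] == 0:
                found.add(bitmap_index(z))
    return sorted(found)

R = (1, 1, 1, 0, 0, 0, 0)
print(bitmap_index(R))          # 7
print(equivalent_indices(R))    # [7, 14, 28, 56]
```

After R is evaluated, these four entries of F are set, so the sequences at indices 14, 28, and 56 are skipped later.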

Recovery Cost Comparison • CHR approach vs. hybrid approach: recovery cost reduced by 25.89% • CHR approach vs. conventional approach: recovery cost reduced by 40.91% • (Figure: download distribution shown for nodes 1 to 7: 5 4 3 3 4 5 3)

Different Number of Storage Nodes • Consider the overall performance of the complete recovery operation for EVENODD

Different Chunk Size • Evaluate the impact of chunk size for EVENODD on the recovery time performance

Different Failed Nodes • Evaluate the recovery time performance for EVENODD when the failed node is in a different column
