
Cooperative regenerating codes for distributed storage systems

Kenneth Shum (joint work with Yuchong Hu), 22nd July 2011.


Presentation Transcript


  1. Cooperative regenerating codes for distributed storage systems • Kenneth Shum (joint work with Yuchong Hu) • 22nd July 2011

  2. Multiple node failures • Large-scale storage system • Google data center, example from Kannan’s talk. • 800000 servers, fail rate = 4% per year • Repair in 2 days • Mean number of failed servers in 2 days = 175. • The lazy-repair policy in TotalRecall • A repair process is triggered only after the number of failed nodes has reached a certain threshold. kshum
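
The figure of 175 failed servers follows from the numbers on this slide. A minimal sketch of the arithmetic, assuming failures are spread evenly over a 365-day year (the function name and the 365-day convention are illustrative assumptions, not from the talk):

```python
def mean_failures(servers, annual_failure_rate, window_days, days_per_year=365):
    """Expected number of failures in a window, assuming a uniform failure rate."""
    return servers * annual_failure_rate * window_days / days_per_year

# Numbers quoted on the slide: 800000 servers, 4% per year, 2-day repair window.
failed = mean_failures(servers=800_000, annual_failure_rate=0.04, window_days=2)
print(f"about {failed:.0f} failed servers per 2-day repair window")
```

With that many nodes failing inside a single repair window, repairing several nodes at once is the normal case rather than a corner case.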

  3. Jointly repair multiple failures • [Figure labels: storage nodes, newcomers, data exchange.] • Can we further reduce the repair-bandwidth? • Hu et al. (JSAC, Feb 2010)

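  4. Distributed storage (erasure coding) • Wu, Dimakis ISIT09 • File: A1, A2, B1, B2 • Four storage nodes: (A1, A2), (B1, B2), (A1+B1, 2A2+B2), (2A1+B1, A2+B2) • [Figure: a data collector connects to storage nodes to retrieve the file.]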
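
One way to check that the four-node example lets a data collector rebuild the file from any two nodes is a rank test on the stored coefficient vectors. The sketch below reads the nodes as (A1, A2), (B1, B2), (A1+B1, 2A2+B2) and (2A1+B1, A2+B2); the field GF(5) and all helper names are assumptions made for illustration, since the slide does not name a field.

```python
from itertools import combinations

P = 5  # assumed prime field size; the slide does not specify the field

# Each stored packet as a coefficient vector over (A1, A2, B1, B2).
NODES = [
    [(1, 0, 0, 0), (0, 1, 0, 0)],  # A1, A2
    [(0, 0, 1, 0), (0, 0, 0, 1)],  # B1, B2
    [(1, 0, 1, 0), (0, 2, 0, 1)],  # A1+B1, 2A2+B2
    [(2, 0, 1, 0), (0, 1, 0, 1)],  # 2A1+B1, A2+B2
]

def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p), by Gauss-Jordan elimination."""
    rows = [[x % p for x in r] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse (Python 3.8+)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                factor = rows[i][col]
                rows[i] = [(x - factor * y) % p for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# The file has 4 packets, so any 2 nodes must supply 4 independent combinations.
any_two_suffice = all(
    rank_mod_p(NODES[i] + NODES[j]) == 4 for i, j in combinations(range(4), 2)
)
print(any_two_suffice)
```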

  5. Naive Repair • [Figure: a newcomer replaces the node storing A1, A2 and downloads two packets from each of two surviving nodes, one of them the node storing B1, B2.] • 4 packets required.

  6. Repair with “code alignment” • The newcomer replacing the node (A1, A2) downloads one packet from each surviving node: B1+B2, A1+2A2+B1+B2 and 2A1+A2+B1+B2. • 3 packets required. • Subtract B1+B2 from the other two packets and solve: P1 = A1+2A2, P2 = 2A1+A2
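
The three-packet repair can be traced numerically. The following is a sketch under the assumption that the arithmetic is done in GF(5); the slide does not name the field, and the system P1 = A1+2A2, P2 = 2A1+A2 has determinant -3, so characteristic 3 would not work. The function name is hypothetical.

```python
P = 5  # assumed prime field size; primes 2 and 3 would break this example

def aligned_repair(a1, a2, b1, b2):
    """Rebuild (A1, A2) from one packet sent by each surviving node."""
    # Packets sent by the three surviving nodes (each is a sum of its two packets).
    s_b = (b1 + b2) % P                      # from the node storing B1, B2
    s_3 = ((a1 + b1) + (2 * a2 + b2)) % P    # from the node storing A1+B1, 2A2+B2
    s_4 = ((2 * a1 + b1) + (a2 + b2)) % P    # from the node storing 2A1+B1, A2+B2
    # The interference B1+B2 is aligned, so one subtraction removes it from both.
    p1 = (s_3 - s_b) % P                     # P1 = A1 + 2A2
    p2 = (s_4 - s_b) % P                     # P2 = 2A1 + A2
    # Solve the 2x2 system by Cramer's rule; det = 1*1 - 2*2 = -3.
    det_inv = pow(-3 % P, -1, P)
    rec_a1 = (p1 - 2 * p2) * det_inv % P
    rec_a2 = (p2 - 2 * p1) * det_inv % P
    return rec_a1, rec_a2

print(aligned_repair(a1=3, a2=4, b1=1, b2=2))
```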

  7. Multiple failures, separate repair • [Figure: two nodes have failed; each newcomer downloads 2 packets from each of the two surviving nodes.] • 8 packets in total • 4 packets per newcomer

  8. Multiple failures, cooperative repair (I) • [Figure: the newcomers replacing (B1, B2) and (2A1+B1, A2+B2) are repaired jointly. Link labels: A1, A2; A1+B1, 2A2+B2; B1, B2.] • 6 packets in total • 3 packets per newcomer

  9. Multiple failures, cooperative repair (II) • One newcomer downloads A1 and A1+B1; the other downloads A2 and 2A2+B2. • The newcomers then exchange B2 and 2A1+B1. • 6 packets in total • 3 packets per newcomer
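
The six-packet schedule can be replayed step by step. This sketch reads the link labels of the figure as: the newcomer for (B1, B2) fetches A1 and A1+B1, the newcomer for (2A1+B1, A2+B2) fetches A2 and 2A2+B2, and the two then swap 2A1+B1 and B2. The field GF(5) and the function name are assumptions for illustration.

```python
P = 5  # assumed prime field size; the slides do not specify the field

def cooperative_repair(a1, a2, b1, b2):
    """Jointly rebuild the nodes (B1, B2) and (2A1+B1, A2+B2)."""
    node_a = (a1 % P, a2 % P)                      # survivor storing A1, A2
    node_c = ((a1 + b1) % P, (2 * a2 + b2) % P)    # survivor storing A1+B1, 2A2+B2

    # Phase 1: each newcomer downloads one packet from each survivor (4 packets).
    x_a1, x_sum = node_a[0], node_c[0]             # newcomer X gets A1 and A1+B1
    y_a2, y_sum = node_a[1], node_c[1]             # newcomer Y gets A2 and 2A2+B2
    x_b1 = (x_sum - x_a1) % P                      # X decodes B1
    y_b2 = (y_sum - 2 * y_a2) % P                  # Y decodes B2

    # Phase 2: the newcomers exchange one packet each (2 packets).
    x_to_y = (2 * x_a1 + x_b1) % P                 # X sends 2A1+B1
    y_to_x = y_b2                                  # Y sends B2

    new_x = (x_b1, y_to_x)                         # rebuilt (B1, B2)
    new_y = (x_to_y, (y_a2 + y_b2) % P)            # rebuilt (2A1+B1, A2+B2)
    return new_x, new_y, 4 + 2

print(cooperative_repair(a1=3, a2=4, b1=1, b2=2))
```

Separate repair of the same two nodes costs 8 packets, so the exchange phase saves 2 packets here.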

  10. Outline of the talk • Is it optimal in terms of repair-bandwidth? • What is the tradeoff between storage and repair-bandwidth for cooperative repair? • Can we achieve the Pareto-optimal operating points on the tradeoff curve by linear network coding? • Exact repair • Functional repair

  11. Information flow graph • [Figure: a source S feeds the storage nodes, drawn as vertex pairs In1/Out1 to In5/Out5; the newcomers are drawn as In6/Mid6/Out6 and In7/Mid7/Out7; a data collector connects to Out vertices. Edge capacities lost in extraction.]

  12. Is this regenerating code optimal? • [Figure: the cooperative repair (II) scheme of slide 9.] • 6 packets in total • 3 packets per newcomer

  13. First cut • [Figure: information flow graph with storage nodes 1 to 4, newcomers 6 and 7 and a data collector; edge labels lost in extraction.] • B ≤ 4β1

  14. Second cut • [Figure: information flow graph with In, Mid and Out vertices for nodes 1 to 4 and a data collector; edge labels lost in extraction.] • B ≤ 2 + β1 + β2

  15. A linear programming problem • Minimize 2β1 + β2 (repair bandwidth) • Subject to 4 ≤ 4β1, 4 ≤ 2 + β1 + β2, and β1, β2 ≥ 0 • [Figure: plot in the (β1, β2) plane.] • At least 3 packets
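
The bound of 3 packets can be reproduced by solving this small linear program. The sketch below writes the two unknowns as b1 and b2 (taken here to be the traffic on a link from a surviving node and on a link between newcomers, which is an interpretation of the slide), rewrites the constraints as b1 ≥ 1 and b1 + b2 ≥ 2, and enumerates the vertices of the feasible region. Vertex enumeration is a stand-in chosen for self-containment, not necessarily how the talk solved it.

```python
from fractions import Fraction
from itertools import combinations

# Constraints in the form p*b1 + q*b2 >= rhs, with file size B = 4 and storage 2.
CONSTRAINTS = [
    ((4, 0), 4),  # first cut:  4 <= 4*b1
    ((1, 1), 2),  # second cut: 4 <= 2 + b1 + b2
    ((1, 0), 0),  # b1 >= 0
    ((0, 1), 0),  # b2 >= 0
]
COST = (2, 1)     # repair bandwidth per newcomer: 2*b1 + b2

def solve_lp():
    """Return (minimum cost, b1, b2) by checking every vertex of the feasible region."""
    best = None
    for ((a, b), r1), ((c, d), r2) in combinations(CONSTRAINTS, 2):
        det = a * d - b * c
        if det == 0:
            continue  # parallel boundaries, no vertex
        # Intersection of the two boundary lines by Cramer's rule.
        x = Fraction(r1 * d - b * r2, det)
        y = Fraction(a * r2 - c * r1, det)
        if all(p * x + q * y >= r for (p, q), r in CONSTRAINTS):
            value = COST[0] * x + COST[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best

print(solve_lp())
```

The minimum is attained at b1 = b2 = 1, which is the traffic pattern of the six-packet cooperative scheme: 2 packets from survivors plus 1 from the other newcomer.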

  16. Non-homogeneous download traffic • [Figure: information flow graph in which the download links carry a, b, c and d packets.] • B ≤ a + b + c + d

  17. Non-homogeneous traffic • [Figure: information flow graph with link traffic e, f, g, h, i, j.] • B ≤ 2 + f + j

  18. Non-homogeneous traffic • [Figure: as on slide 17.] • B ≤ 2 + f + j • B ≤ 2 + h + i

  19. Non-homogeneous traffic • [Figure: as on slide 17.] • B ≤ 2 + f + j • B ≤ 2 + h + i • B ≤ 2 + e + j

  20. Non-homogeneous traffic • [Figure: as on slide 17.] • B ≤ 2 + f + j • B ≤ 2 + h + i • B ≤ 2 + e + j • B ≤ 2 + g + i

  21. The same LP problem • Minimize [objective lost in extraction] • Subject to [constraints lost in extraction] • At least 3 packets

  22. TRADEOFF BETWEEN STORAGE AND REPAIR-BANDWIDTH

  23. Storage vs Repair-bandwidth • (S., ICC 2011; Kermarrec, Le Scouarnec and Straub, Netcod 2011.) • File size = 420, d = 8, k = 4 • [Plot: storage versus repair-bandwidth curves for one-by-one repair and for repairing 3 newcomers jointly.]

  24. Fair comparison? • Repair degree = 8 • One-by-one repair: number of connections per newcomer = 8 (all to surviving nodes) • Cooperative repair: number of connections per newcomer = 8+2 (8 to surviving nodes, 2 to the other newcomers)

  25. MBCR and MSCR • Minimum bandwidth cooperative repair (MBCR) • Minimum storage cooperative repair (MSCR) • [Plot legend: one-by-one repair; cooperative repair.]

  26. How much can we improve? • File size = 2275, d = 30, k = 5 • [Plot: curves for one-by-one repair and for repairing 10 newcomers jointly.] • When d is large, joint repair does not have a significant advantage over one-by-one repair.

  27. How much can we improve? • File size = 616, d = 8, k = 4 • [Plot: curves for one-by-one repair and for repairing 10 newcomers jointly.] • Repair-bandwidth reduction is more prominent when d is not so large.

  28. AN EXPLICIT CONSTRUCTION FOR MINIMUM-BANDWIDTH COOPERATIVE REPAIR

  29. An explicit construction for MBCR (S., Hu, ISIT 2011.) • Require d = k, r = n – d • B = 8 information packets • n = 4 nodes • Each node stores 5 packets. • Repair r = 2 failures simultaneously • No. of connections for each DC = k = 2 • No. of helpers for each failed node = d = 2 • Minimum repair-bandwidth • Storage per node

  30. Min-Bandwidth point • [Plot legend: one-by-one repair; repairing 2 new nodes cooperatively.]

  31. Data Distribution • 8 data packets: A, B, C, D, E, F, G, H • Node contents: (A, B, C, D, F+G), (C, D, E, F, H+A), (E, F, G, H, B+C), (G, H, A, B, D+E) • 5 packets per node: 4 systematic, 1 parity-check (+ denotes XOR)

  32. Data collection • The data collector takes A, B, C, D from the node storing A, B, C, D, F+G and E, F, G, H from the node storing E, F, G, H, B+C. • It obtains A, B, C, D, E, F, G, H directly.

  33. Data collection • The data collector takes A, B, C, F+G from the node storing A, B, C, D, F+G and D, E, F, H+A from the node storing C, D, E, F, H+A. • The resulting 8×8 system in A, B, C, D, E, F, G, H is triangular, hence full-rank.
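
The two data-collection slides cover two particular pairs of nodes. A quick way to confirm that every pair works is to compute the GF(2) rank of the ten packets held by each pair. In this sketch each packet is a bitmask over the eight data packets; the helper names are hypothetical.

```python
from itertools import combinations

PACKETS = "ABCDEFGH"

def vec(*names):
    """Bitmask of an XOR combination of data packets, e.g. vec('F', 'G') is F+G."""
    mask = 0
    for name in names:
        mask ^= 1 << PACKETS.index(name)
    return mask

# Node contents as listed on the data-distribution slide.
NODES = [
    [vec("A"), vec("B"), vec("C"), vec("D"), vec("F", "G")],
    [vec("C"), vec("D"), vec("E"), vec("F"), vec("H", "A")],
    [vec("E"), vec("F"), vec("G"), vec("H"), vec("B", "C")],
    [vec("G"), vec("H"), vec("A"), vec("B"), vec("D", "E")],
]

def rank_gf2(rows):
    """Rank over GF(2) of vectors given as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row with that leading bit
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]  # cancel the leading bit and keep reducing
    return len(pivots)

# A data collector with k = 2 connections needs rank 8 from every pair of nodes.
pair_ranks = {
    (i, j): rank_gf2(NODES[i] + NODES[j]) for i, j in combinations(range(4), 2)
}
print(pair_ranks)
```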

  34. Exact Repair • How to repair? • [Figure: two of the four nodes are regenerated exactly; link labels include A, B, C, D, F+G, B+C, E, F, G, H.] • Total repair-bandwidth = 10

  35. Exact Repair • How to repair? • [Figure: two of the four nodes are regenerated exactly; link labels include C, D, E, F, H+A, D+E, F+G, G, H, B+C.] • Total repair-bandwidth = 10
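
The link labels of the two repair figures did not survive in this transcript, so the sketch below uses assumed schedules, not ones read off the slides. Each newcomer receives 2 packets from each of the 2 helpers and 1 packet from the other newcomer, which gives the stated total of 10. Packets are modelled as 16-bit integers and + as XOR; the function names are hypothetical.

```python
import random

rng = random.Random(2011)
A, B, C, D, E, F, G, H = (rng.getrandbits(16) for _ in range(8))

NODES = [
    (A, B, C, D, F ^ G),
    (C, D, E, F, H ^ A),
    (E, F, G, H, B ^ C),
    (G, H, A, B, D ^ E),
]

def repair_second_and_fourth(nodes):
    """Assumed schedule when the 2nd and 4th nodes fail (helpers: 1st and 3rd)."""
    first, third = nodes[0], nodes[2]
    c, d = first[2], first[3]      # to the 2nd newcomer from the 1st node
    e, f = third[0], third[1]      # to the 2nd newcomer from the 3rd node
    a, b = first[0], first[1]      # to the 4th newcomer from the 1st node
    g, h = third[2], third[3]      # to the 4th newcomer from the 3rd node
    # Exchange: the 4th newcomer sends H+A, the 2nd newcomer sends D+E.
    new_second = (c, d, e, f, h ^ a)
    new_fourth = (g, h, a, b, d ^ e)
    return new_second, new_fourth, 4 + 4 + 2

def repair_third_and_fourth(nodes):
    """Assumed schedule when the 3rd and 4th nodes fail (helpers: 1st and 2nd)."""
    first, second = nodes[0], nodes[1]
    e, f = second[2], second[3]                            # to the 3rd newcomer
    f_plus_g, b_plus_c = first[4], first[1] ^ first[2]     # helper combines B+C
    g = f_plus_g ^ f
    a, b = first[0], first[1]                              # to the 4th newcomer
    h_plus_a, d_plus_e = second[4], second[1] ^ second[2]  # helper combines D+E
    h = h_plus_a ^ a
    # Exchange: the 3rd newcomer sends G, the 4th newcomer sends H.
    new_third = (e, f, g, h, b_plus_c)
    new_fourth = (g, h, a, b, d_plus_e)
    return new_third, new_fourth, 4 + 4 + 2

print(repair_second_and_fourth(NODES)[2], repair_third_and_fourth(NODES)[2])
```

In the second case the systematic copies of G and H are both lost, so the helpers must send combinations and the newcomers must decode before exchanging.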

  36. Min-Bandwidth point • [Plot legend: one-by-one repair; repairing 2 new nodes cooperatively.]

  37. AN EXPLICIT CONSTRUCTION FOR MINIMUM-STORAGE COOPERATIVE REPAIR

  38. An explicit construction for MSCR (S., ICC 2011.) • Require d = k • Minimum repair-bandwidth • Storage per node • B = 6 information packets • n nodes • Each node stores 2 packets. • Repair r = 2 failures simultaneously • No. of connections for each DC = k = 3 • No. of helpers for each failed node = d = 3

  39. The min-storage point • k = 3, d = 3, r = 2, B = 6 • Storage cost per node = 2 • Repair bandwidth per node = 4 • [Plot legend: non-cooperative; cooperative.]

  40. Data retrieval • MDS code with dimension k = 3 • [Figure: the source data is encoded into 2 codewords, whose symbols are spread over the storage nodes; a data collector downloads from storage nodes and decodes.]

  41. Repair: phase 1 • [Figure: two storage nodes are lost; each newcomer downloads codeword symbols from the surviving storage nodes and decodes.]

  42. Repair: phase 2 • [Figure: each newcomer re-encodes what it decoded, and the newcomers exchange the re-encoded symbols.] • Repair bandwidth per node = 8/2 = 4
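
The two repair phases can be sketched with a concrete MDS code. Below, a Reed-Solomon code over GF(11) with n = 5 nodes stands in for the MDS code of dimension k = 3; the field, the number of nodes, the evaluation points and all function names are assumptions for illustration. Each node stores one symbol of each of two codewords, and two lost nodes are rebuilt with 8 packets of traffic in total.

```python
P = 11                      # assumed prime field size
POINTS = [1, 2, 3, 4, 5]    # assumed evaluation point of each of n = 5 nodes

def evaluate(coeffs, x):
    """Evaluate a polynomial (coefficients low degree first) at x, mod P."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result

def interpolate_at(samples, x):
    """Lagrange-interpolate through (xi, yi) samples and evaluate at x, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(samples):
        num, den = 1, 1
        for j, (xj, _) in enumerate(samples):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def store(data):
    """Split B = 6 symbols into two messages of length 3; node j keeps one symbol of each codeword."""
    f1, f2 = data[:3], data[3:]
    return [(evaluate(f1, x), evaluate(f2, x)) for x in POINTS]

def cooperative_repair(nodes, lost, helpers):
    """Rebuild the two nodes in `lost` from the three nodes in `helpers`."""
    u, v = lost
    # Phase 1: newcomer u downloads first-codeword symbols, newcomer v
    # second-codeword symbols, one from each helper; each decodes its codeword.
    cw1 = [(POINTS[h], nodes[h][0]) for h in helpers]
    cw2 = [(POINTS[h], nodes[h][1]) for h in helpers]
    traffic = 2 * len(helpers)
    # Phase 2: each newcomer re-encodes for both lost positions, keeps its own
    # symbol and sends the other one across (1 packet each way).
    u_first, v_first = interpolate_at(cw1, POINTS[u]), interpolate_at(cw1, POINTS[v])
    u_second, v_second = interpolate_at(cw2, POINTS[u]), interpolate_at(cw2, POINTS[v])
    traffic += 2
    return {u: (u_first, u_second), v: (v_first, v_second)}, traffic

nodes = store([3, 1, 4, 1, 5, 9])
repaired, traffic = cooperative_repair(nodes, lost=(3, 4), helpers=(0, 1, 2))
print(repaired, traffic)
```

Each newcomer receives 3 packets from helpers and 1 from the other newcomer, matching the stated repair bandwidth of 8/2 = 4 per node.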

  43. The construction is optimal • k = 3, d = 3, r = 2, B = 6 • Storage cost per node = 2 • Repair bandwidth per node = 4 • [Plot legend: non-cooperative; cooperative.]

  44. EXISTENCE OF COOPERATIVE REGENERATING CODES UNDER FUNCTIONAL REPAIR

  45. Existence of optimal linear regenerating codes in general (S., Hu, Netcod 2011.) • Sustainable storage system • Will it work after arbitrarily many repairs? • Technical difficulty: the information flow graph is unbounded. • Can we work over a fixed finite field, for an unlimited number of regenerations? • Yes, if we can construct an exact regenerating code. • The answer is also “yes” for cooperative functional repair in general.

  46. Trellis structure • [Figure: trellis with stage 0, stage 1, stage 2, and so on.] • Message vector m (row vector) • T0 is the “transfer matrix” in stage 0, T1 is the “transfer matrix” in stage 1, T2 is the “transfer matrix” in stage 2 • The vectors after stages 0, 1 and 2 are mT0, mT0T1 and mT0T1T2

  47. Flow in information flow graph • [Figure: information flow graph from the source S through In, Mid and Out vertices of nodes 1 to 4 to the data collector DC, annotated with flow values on the edges; the values did not survive extraction in usable form.] • The cut-set bound says that the cut capacity is at least 8. Can we construct a flow with value 8?

  48. Cross-sectional flow pattern • [Figure: the flow of slide 47 with the flow values across each cross-section of the graph marked; the values did not survive extraction in usable form.]

  49. A recursive construction of flow • [Figure: the segment of the information flow graph between stage s and stage s+1, with cross-section flow values labelled g1, g2, g3, g4 and h1, h2, h3, h4.] • Identify a set of cross-section flow patterns, say H. • For any cross-section flow pattern (h1, h2, h3, h4) in H at stage s+1, we can find a flow in this segment of the graph such that (g1, g2, g3, g4) is also in H. • Each pattern corresponds to a submatrix of the transfer matrix. • By the Schwartz-Zippel lemma, we can find the local encoding vectors so that all such determinants are non-zero, if the finite field is sufficiently large.

  50. Summary • Multiple node failures in medium-scale to large-scale storage systems • Formulation as a linear program • Functional repair: a linear regenerating code over a fixed finite field that matches the cut-set bound on repair-bandwidth exists. • Exact repair: two families of explicit code constructions • Minimum-bandwidth point: d = k, r = n – d • Minimum-storage point: d = k, r arbitrary
