Paris Dauphine University, CERIA Lab., 4th October 2004
Contribution to the Design & Implementation of the Highly Available Scalable and Distributed Data Structure: LH*RS
Rim Moussa, Rim.Moussa@dauphine.fr, http://ceria.dauphine.fr/rim/rim.html
Thesis Supervisor: Pr. Witold Litwin
Examiners: Pr. Thomas J.E. Schwarz, Pr. Toré Risch
Jury President: Pr. Gérard Lévy
Thesis Presentation in Computer Science, Distributed Databases
Outline
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experimentations
6. LH*RS File Creation
7. Bucket Recovery
8. Parity Bucket Creation
9. Conclusion & Future Work
R. Moussa, U. Paris Dauphine
Facts …
• Volume of information grows by 30% per year
• Technology
  >> Network infrastructure: Gilder's Law, bandwidth triples every year.
  >> Evolution of PC storage & computing capacities: Moore's Law, these capacities double every 18 months.
• Bottleneck of disk accesses & CPUs
Need for Distributed Data Storage Systems
SDDSs: LH*, RP* … High Throughput
Facts …
• Multicomputers
  >> Modular architecture
  >> Good price/performance tradeoff
• Frequent & costly failures
  >> Statistic published by Contingency Planning Research in 1996: for a brokerage application, one hour of service interruption costs $6.45 million.
Need for Distributed & Highly-Available Data Storage Systems
State of the Art
Data Replication
(+) Good response time, mirrors are functional
(-) High storage overhead (n for n replicas)
Parity Calculation
• Criteria to evaluate erasure-resilient codes:
  >> Encoding rate (parity volume / data volume)
  >> Update penalty (parity volumes)
  >> Group size used for data reconstruction
  >> Encoding & decoding complexity
  >> Recovery capabilities
Parity Schemes
1-Available Schemes
• XOR parity calculation: RAID technology (levels 3, 4, 5…) [PGK88], SDDS LH*g [L96] …
k-Available Schemes
• Binary linear codes [H94]: tolerate max. 3 failures
• Array codes: EVENODD [B94], X-code [XB99], RDP [C+04]: tolerate max. 2 failures
• Reed-Solomon codes: IDA [R89], RAID X [W91], FEC [B95], Tutorial [P97], LH*RS [LS00, ML02, MS04, LMS04]: tolerate k failures (k > 3) …
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
  >> LH*RS?
  >> SDDSs?
  >> Reed-Solomon Codes?
  >> Encoding/Decoding Optimizations
4. LH*RS Manager
5. Experimentations
LH*RS?
LH*RS [LS00]
• Scalability & High Throughput
  >> LH*: Scalable & Distributed Data Structure
  >> Distribution using Linear Hashing (LH*LH [KLR96]); LH*LH Manager [B00]
• High Availability
  >> Parity calculation using Reed-Solomon codes [RS63]
SDDSs Principles (1)
(1) Dynamic File Growth
[Diagram: clients, the coordinator and the data buckets connected by the network. Message labels: Insertions, OVERLOADED, You Split, Record Transfer.]
SDDSs Principles (2)
(2) No Centralized Directory Access
[Diagram: each client holds a file image and addresses the data buckets over the network. Message labels: Query, Query Forward, Image Adjustment Message.]
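The query, forward and image-adjustment messages named on this slide can be illustrated with a minimal sketch of LH* addressing. It follows the usual textbook description of LH* rather than anything shown in the slides: the hash family `h`, the function names and the (level, split pointer) image are assumptions made for the example.

```python
def h(level, key):
    """Linear-hashing function h_level(key) = key mod 2**level."""
    return key % (2 ** level)


def client_address(key, i_img, n_img):
    """Bucket address computed from the client's (possibly outdated) image (i', n')."""
    a = h(i_img, key)
    if a < n_img:
        a = h(i_img + 1, key)
    return a


def server_route(key, a, j):
    """Bucket `a` of level `j` checks a received key.

    Returns `a` if the key belongs here, else the bucket to forward the query to.
    """
    target = h(j, key)
    if target != a:
        lower = h(j - 1, key)
        if a < lower < target:
            target = lower
    return target


def adjust_image(i_img, n_img, a, j):
    """Client-side update when an image adjustment message (a, j) arrives."""
    if j > i_img:
        i_img, n_img = j - 1, a + 1
        if n_img >= 2 ** i_img:
            i_img, n_img = i_img + 1, 0
    return i_img, n_img


# Hypothetical file of 5 buckets: level 2, split pointer 1.
bucket_level = {0: 3, 1: 2, 2: 2, 3: 2, 4: 3}
key = 7
first = client_address(key, 0, 0)                 # brand-new client, empty image
hop = server_route(key, first, bucket_level[first])
final = server_route(key, hop, bucket_level[hop])
image = adjust_image(0, 0, first, bucket_level[first])
```

No directory is consulted at any point: a client with a stale image still reaches the right bucket through forwarding, and the adjustment message brings its image closer to the real file state.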
Reed-Solomon Codes
• Encoding: from m data symbols, compute the parity symbols (n − m of them for a codeword of n symbols)
• Data representation: Galois Field
  >> Field of finite size q
  >> Closure property: addition, subtraction, multiplication, division
  >> In GF(2^w): (1) addition is XOR; (2) multiplication uses the tables gflog and antigflog:
     e1 * e2 = antigflog[ gflog[e1] + gflog[e2] ]
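As an illustration of the table-based multiplication, here is a minimal sketch for GF(2^8). The primitive polynomial 0x11D and the doubled `antigflog` table (which avoids a modulo on the summed logarithms) are assumptions chosen for the example; the slides do not say which polynomial or table layout the thesis uses.

```python
W = 8
SIZE = 1 << W            # 256 field elements
PRIM_POLY = 0x11D        # x^8 + x^4 + x^3 + x^2 + 1 (assumed, a common choice)

gflog = [0] * SIZE               # gflog[0] is never used: log(0) is undefined
antigflog = [0] * (2 * SIZE)     # doubled so index sums need no modulo

x = 1
for power in range(SIZE - 1):
    antigflog[power] = x
    gflog[x] = power
    x <<= 1
    if x & SIZE:                 # degree-8 overflow: reduce by the polynomial
        x ^= PRIM_POLY
for power in range(SIZE - 1, 2 * SIZE):
    antigflog[power] = antigflog[power - (SIZE - 1)]


def gf_add(e1, e2):
    """Addition (and subtraction) in GF(2^w) is a bitwise XOR."""
    return e1 ^ e2


def gf_mul(e1, e2):
    """e1 * e2 = antigflog[gflog[e1] + gflog[e2]], with 0 handled apart."""
    if e1 == 0 or e2 == 0:
        return 0
    return antigflog[gflog[e1] + gflog[e2]]


def gf_div(e1, e2):
    """Division by a non-zero element: subtract the logarithms."""
    if e1 == 0:
        return 0
    return antigflog[gflog[e1] - gflog[e2] + (SIZE - 1)]
```

A multiplication costs three table lookups and one integer addition, which is why the later slides look for ways to pre-compute some of the logarithms.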
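RS Encoding
(1) Systematic encoding with the matrix (I_m | P), where P is the m × (n − m) parity matrix of coefficients C_{i,j}:
\[
(S_1 \; S_2 \; \cdots \; S_m)
\begin{pmatrix}
1 & 0 & \cdots & 0 & C_{1,1} & \cdots & C_{1,j} & \cdots & C_{1,n-m} \\
0 & 1 & \cdots & 0 & C_{2,1} & \cdots & C_{2,j} & \cdots & C_{2,n-m} \\
\vdots & & \ddots & \vdots & \vdots & & \vdots & & \vdots \\
0 & 0 & \cdots & 1 & C_{m,1} & \cdots & C_{m,j} & \cdots & C_{m,n-m}
\end{pmatrix}
= (S_1 \; \cdots \; S_m \; P_1 \; \cdots \; P_j \; \cdots \; P_{n-m})
\]
(2) Any m columns are linearly independent.
Each parity symbol costs m GF multiplications and m − 1 GF XORs:
\[
P_j = (S_1 \cdot C_{1,j}) \oplus (S_2 \cdot C_{2,j}) \oplus \cdots \oplus (S_m \cdot C_{m,j})
\]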
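The encoding rule for a parity symbol P_j can be sketched as follows in GF(2^8). The Cauchy construction used to fill the parity matrix is an assumption made so that the example has a valid matrix to work with; it is not the matrix of the thesis, and the function names are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8) by shift-and-add (no tables, for brevity)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse of a non-zero element: a^254 in GF(2^8)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def cauchy_parity_matrix(m, k):
    """Assumed parity matrix: C[i][j] = 1 / (x_i + y_j) with x_i = i, y_j = m + j."""
    return [[gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]


def encode(data_symbols, C):
    """P_j = XOR over i of (S_i * C[i][j]): m multiplications, m - 1 XORs."""
    m, k = len(C), len(C[0])
    parity = []
    for j in range(k):
        p = 0
        for i in range(m):
            p ^= gf_mul(data_symbols[i], C[i][j])
        parity.append(p)
    return parity


C = cauchy_parity_matrix(4, 2)                  # m = 4 data symbols, 2 parity symbols
parity = encode([0x12, 0x34, 0x56, 0x78], C)
```

Because the encoding is systematic, the data symbols are stored unchanged and only the `parity` list has to be computed and shipped to the parity buckets.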
RS Decoding
• H: the m columns of (I_m | P) that correspond to the m OK symbols among S_1 … S_m, P_1 … P_{n-m}.
• A Gauss transformation yields H^{-1}, and
\[
[S_1 \; S_2 \; S_3 \; S_4 \; \cdots \; S_m] = [\,m \text{ OK symbols}\,] \cdot H^{-1}
\]
• Optimized decoding: multiply the m OK symbols only by the columns of H^{-1} that correspond to lost symbols.
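The decoding steps can be sketched end to end in GF(2^8): build H from the columns of the surviving symbols, invert it by Gauss-Jordan elimination, then multiply the OK symbols by only the columns of H^{-1} that belong to lost symbols. The Cauchy-based generator matrix and all function names are assumptions made for this example, not the thesis implementation.

```python
def gf_mul(a, b, poly=0x11D):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse of a non-zero element: a^254."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def generator(m, k):
    """(I_m | P) with an assumed Cauchy parity matrix P."""
    return [[1 if c == i else 0 for c in range(m)]
            + [gf_inv(i ^ (m + j)) for j in range(k)] for i in range(m)]


def encode(data, G):
    """Codeword = data row vector times G."""
    word = []
    for c in range(len(G[0])):
        sym = 0
        for i, s in enumerate(data):
            sym ^= gf_mul(s, G[i][c])
        word.append(sym)
    return word


def invert(H):
    """Gauss-Jordan inversion over GF(2^8); H is assumed invertible."""
    n = len(H)
    A = [row[:] + [1 if r == c else 0 for c in range(n)] for r, row in enumerate(H)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        scale = gf_inv(A[col][col])
        A[col] = [gf_mul(v, scale) for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                factor = A[r][col]
                A[r] = [v ^ gf_mul(factor, w) for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]


def recover(G, ok_cols, ok_symbols, lost):
    """Rebuild only the lost data symbols from m surviving symbols."""
    m = len(G)
    H = [[G[i][c] for c in ok_cols] for i in range(m)]
    H_inv = invert(H)
    recovered = {}
    for l in lost:                       # one column of H^-1 per lost symbol
        sym = 0
        for t in range(m):
            sym ^= gf_mul(ok_symbols[t], H_inv[t][l])
        recovered[l] = sym
    return recovered


G = generator(4, 2)
data = [0x12, 0x34, 0x56, 0x78]
word = encode(data, G)
ok_cols = [0, 2, 4, 5]                   # data symbols 1 and 3 are lost
recovered = recover(G, ok_cols, [word[c] for c in ok_cols], lost=[1, 3])
```

Restricting the product to the lost columns means that recovering f symbols costs f × m multiplications per rank instead of m × m.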
Optimizations: Galois Field
• GF(2^8): 1 symbol = 1 byte
• GF(2^16): 1 symbol = 2 bytes
(+) GF(2^16) halves the number of symbols compared with GF(2^8), and therefore the number of operations in the GF.
(-) Size of the multiplication tables: 0.768 KB for GF(2^8), 393.216 KB for GF(2^16) (512 × 0.768)
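The two table sizes can be reproduced with a short calculation. The layout assumed here, three arrays of 2^w entries with w/8 bytes per entry (for instance one log table and a doubled antilog table), is a guess that happens to match the quoted figures; the slide does not state how the tables are organised.

```python
def table_bytes(w, arrays=3):
    """Total size in bytes of `arrays` tables of 2**w entries, w/8 bytes each."""
    return arrays * (2 ** w) * (w // 8)


size_gf8 = table_bytes(8)      # bytes for GF(2^8)
size_gf16 = table_bytes(16)    # bytes for GF(2^16)
ratio = size_gf16 // size_gf8
```

Under that assumption the GF(2^16) tables hold 256 times more entries, each twice as wide, which gives the factor of 512.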
Optimizations (2): Parity Matrix
Parity matrix coefficients (hexadecimal):
0001 0001 0001 …
0001 eb9b 2284 …
0001 2284 9e74 …
0001 9e44 d7f1 …
…
• 1st row of 1s: any update from the 1st DB is processed with XOR calculation only; performance gain of 4% (case of PB creation, m = 4)
• 1st column of 1s: the 1st PB is encoded with XOR calculation only; gain in encoding & decoding
Optimizations (3): GF Multiplication
Goal: reduce the complexity of GF multiplication, e1 * e2 = antigflog[ gflog[e1] + gflog[e2] ]
• Encoding: log pre-calculation of the coefficients of the P matrix; improvement of 3.5%
• Decoding: log pre-calculation of the coefficients of the H^-1 matrix and of the OK symbols vector; improvement of 4% to 8% depending on the #buckets to recover
Parity matrix with its coefficients stored as logarithms (hexadecimal):
0000 0000 0000 …
0000 5ab5 e267 …
0000 e267 0dce …
0000 784d 2b66 …
…
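The log pre-calculation can be sketched as follows: the logarithm of each matrix coefficient is looked up once, so each later multiplication needs one table lookup fewer. The example uses GF(2^8) with the assumed polynomial 0x11D, and `precompute_logs` and `mul_by_log` are hypothetical names.

```python
W, POLY = 8, 0x11D
ORDER = (1 << W) - 1                 # 255 non-zero elements

gflog = [0] * (ORDER + 1)
antigflog = [0] * (2 * ORDER)        # doubled: summed logs need no modulo
x = 1
for p in range(ORDER):
    antigflog[p] = antigflog[p + ORDER] = x
    gflog[x] = p
    x <<= 1
    if x > ORDER:
        x ^= POLY


def gf_mul(e1, e2):
    """Plain multiplication: two log lookups and one antilog lookup."""
    if e1 == 0 or e2 == 0:
        return 0
    return antigflog[gflog[e1] + gflog[e2]]


def precompute_logs(matrix):
    """Replace every (non-zero) coefficient by its logarithm, once."""
    return [[gflog[c] for c in row] for row in matrix]


def mul_by_log(symbol, log_coef):
    """Multiply a symbol by a coefficient given by its pre-computed log."""
    if symbol == 0:
        return 0
    return antigflog[gflog[symbol] + log_coef]


coefficients = [[1, 1], [1, 2]]      # toy matrix with a first row and column of 1s
log_matrix = precompute_logs(coefficients)
```

Since log(1) = 0, a first row and a first column of 1s become zeros in the log matrix, which is consistent with the 0000 entries of the matrix shown on the slide.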
LH*RS: Parity Groups
• Grouping concept
  >> m: #data buckets
  >> k: #parity buckets
• Data buckets store records of the form (Key; Data); an inserted record receives a rank r in its data bucket.
• Parity buckets store records of the form (Rank; [Key-list]; Parity).
A k-available group survives the failure of k buckets.
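The two record layouts can be sketched as a small data structure. The example models only a first parity bucket whose parity is a plain XOR of the data records sharing a rank; the class names, the fixed group size `M` and the zero-padding of short payloads are assumptions made for illustration.

```python
from dataclasses import dataclass, field

M = 4  # data buckets per parity group (assumed value)


@dataclass
class ParityRecord:
    """(Rank; [Key-list]; Parity): one record per rank in a parity bucket."""
    rank: int
    keys: list = field(default_factory=lambda: [None] * M)
    parity: bytearray = field(default_factory=bytearray)


def xor_into(parity, payload):
    """XOR `payload` into `parity` in place, growing parity with zeros if needed."""
    if len(parity) < len(payload):
        parity.extend(bytes(len(payload) - len(parity)))
    for pos, byte in enumerate(payload):
        parity[pos] ^= byte


class FirstParityBucket:
    """Parity bucket whose coefficients are all 1, so encoding is XOR only."""

    def __init__(self):
        self.records = {}          # rank -> ParityRecord

    def on_insert(self, bucket_pos, rank, key, payload):
        """A data bucket at position `bucket_pos` inserted `key` at `rank`."""
        record = self.records.setdefault(rank, ParityRecord(rank))
        record.keys[bucket_pos] = key
        xor_into(record.parity, payload)


pb = FirstParityBucket()
pb.on_insert(bucket_pos=0, rank=0, key=10, payload=b"ab")
pb.on_insert(bucket_pos=1, rank=0, key=11, payload=b"cd")
```

The key list tells the parity bucket which data records contribute to a parity record, so a lost record can be found and rebuilt from the others that share its rank.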
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
  >> Communication
  >> Overall Architecture
5. Experimentations
6. File Creation
7. Bucket Recovery
…
Communication: UDP
• Individual operations (insert, update, delete, search)
• Record recovery
• Control messages
Chosen for performance.
Communication: TCP/IP
Transfer of large buffers:
• New parity buckets
• Transfer of parity updates & records (bucket split)
• Bucket recovery
Chosen for performance & reliability.
Communication: "Multicast"
• Looking for new data/parity buckets
Multipoint communication.
Architecture
Enhancements to the SDDS2000 architecture:
(1) TCP/IP connection handler
• TCP/IP connections are passive OPEN, RFC 793 [ISI81], TCP/IP under Win2K Server OS [MB00]
• Recovery of 1 bucket (3.125 MB): 6.7 s with SDDS2000 (before), 2.6 s with SDDS2000-TCP, an improvement of 60% (hardware config.: 733 MHz CPUs, 100 Mbps network)
(2) Flow Control & Message Acknowledgement (FCMA)
• Principle of "sending credit & message conservation until delivery" [J88, GRS97, D01]
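The FCMA principle can be illustrated with a minimal sender-side sketch: a message is sent only while credit remains, and it is kept until its acknowledgement arrives so that it can be retransmitted. The class, its method names and the `transmit` callback are hypothetical; the slides do not describe the actual SDDS2000 interfaces.

```python
from collections import OrderedDict


class CreditSender:
    """Sender that honours a sending credit and conserves unacknowledged messages."""

    def __init__(self, credit):
        self.credit = credit                 # max messages in flight
        self.unacked = OrderedDict()         # msg_id -> payload, kept until ACK
        self.next_id = 0

    def can_send(self):
        return len(self.unacked) < self.credit

    def send(self, payload, transmit):
        """Send if credit allows; return the message id, or None if blocked."""
        if not self.can_send():
            return None
        msg_id = self.next_id
        self.next_id += 1
        self.unacked[msg_id] = payload       # conserve until delivery is confirmed
        transmit(msg_id, payload)
        return msg_id

    def on_ack(self, msg_id):
        """Delivery confirmed: release the message and free one credit."""
        self.unacked.pop(msg_id, None)

    def resend_pending(self, transmit):
        """Retransmit everything still waiting for an ACK, oldest first."""
        for msg_id, payload in self.unacked.items():
            transmit(msg_id, payload)


wire = []                                    # stand-in for the network
sender = CreditSender(credit=2)
sender.send(b"insert k1", lambda i, p: wire.append((i, p)))
```

The credit bounds the receiver's buffering needs, while the retained copies give datagram-based messages the delivery guarantee that UDP alone does not provide.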
Architecture (2)
(3) Dynamic IP addressing structure
• Before: pre-defined and static table of IP addresses
• Now: new servers (data or parity) are tagged using multicast
[Diagram: the coordinator, the created buckets, the multicast group of blank data buckets and the multicast group of blank parity buckets.]
Architecture (3)
[Diagram of the components of a server on the network:]
• TCP listening thread, TCP/IP port
• UDP listening thread, UDP listening port, UDP sending port
• Multicast listening thread, multicast listening port, multicast working thread, message queue
• Pool of working threads, messages queue
• ACK management threads, ACK structure: free zones and messages waiting for ACK (unacknowledged messages)
Experimentation
• Performance evaluation
  >> CPU time
  >> Communication time
• Experimental environment
  >> 5 machines (Pentium IV: 1.8 GHz, RAM: 512 MB)
  >> Ethernet network, 1 Gbps
  >> O.S.: Win2K Server
  >> Tested configuration: 1 client, a group of 4 data buckets, k parity buckets (k = 0, 1, 2, 3)
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experimentations
6. File Creation
  >> Parity Update
  >> Performance
7. Bucket Recovery
8. Parity Bucket Creation
File Creation
• Client operations: propagation of data record inserts/updates/deletes to the parity buckets
  >> Update: send only the Δ-record
  >> Deletes: management of free ranks within data buckets
• Data bucket split (N1: #remaining records, N2: #leaving records)
  >> Parity group of the splitting data bucket: N1 + N2 deletes + N1 inserts
  >> Parity group of the new data bucket: N2 inserts
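The idea of sending only a Δ-record can be sketched for the XOR-coded parity bucket: the data bucket ships the XOR of the old and new record, and the parity bucket folds it into its parity without seeing the other records. Treating an insert as an update from an empty record, and a delete as an update to an empty record, is an assumption of this sketch, as are the function names. For the other parity buckets the Δ-record would additionally be multiplied by a GF coefficient, which is omitted here.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two payloads, the shorter one zero-padded."""
    size = max(len(a), len(b))
    a, b = a.ljust(size, b"\0"), b.ljust(size, b"\0")
    return bytes(x ^ y for x, y in zip(a, b))


def delta_record(old, new):
    """Δ-record sent by a data bucket: old payload XOR new payload."""
    return xor_bytes(old, new)


def apply_to_parity(parity, delta):
    """Parity bucket side (coefficient 1): fold the Δ-record into the parity."""
    return xor_bytes(parity, delta)


rec_a, rec_b = b"alpha", b"bravo"
parity = b""
parity = apply_to_parity(parity, delta_record(b"", rec_a))   # insert of rec_a
parity = apply_to_parity(parity, delta_record(b"", rec_b))   # insert of rec_b

new_a = b"ALPHA"
parity = apply_to_parity(parity, delta_record(rec_a, new_a))  # update of rec_a
```

With this convention the parity bucket runs the same code path for inserts, updates and deletes.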
Performances: Config.
• Max bucket size = 10 000 records
• File of 25 000 records
• 1 record = 104 bytes
• No difference between GF(2^8) and GF(2^16) (we don't wait for ACKs between DBs and PBs)
Performances: Client Window = 1
• From k = 0 to k = 1: performance degradation of 20%
• From k = 1 to k = 2: performance degradation of 8%
Performances: Client Window = 5
• From k = 0 to k = 1: performance degradation of 37%
• From k = 1 to k = 2: performance degradation of 10%
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experimentations
6. File Creation
7. Bucket Recovery
  >> Scenario
  >> Performances
8. Parity Bucket Creation
Scenario
Failure detection: the coordinator asks the parity buckets and the data buckets "Are you Alive?".
Scenario (2)
Waiting for responses …: the parity buckets and data buckets that are alive answer "OK" to the coordinator.
Scenario (3)
Searching for spare buckets …: the coordinator sends "Wanna be Spare?" to the multicast group of blank data buckets.
Scenario (4)
Waiting for replies …: buckets of the multicast group of blank data buckets answer "I would" to the coordinator. Each volunteer then:
• launches UDP listening, TCP listening and its working threads;
• waits for confirmation;
• cancels everything if the time-out elapses.
Scenario (5)
Spare selection: messages exchanged between the coordinator and the multicast group of blank data buckets: "You are Hired", "Confirmed", "Cancellation".
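The spare-bucket handshake of scenarios (3) to (5) can be sketched as a small state machine for a blank bucket. Who sends which message, the first-come selection policy and every name below are assumptions made for illustration; the slides only give the message labels and the time-out rule.

```python
from enum import Enum


class State(Enum):
    BLANK = "blank"
    WAITING = "waiting for confirmation"
    HIRED = "hired"


class BlankBucket:
    """A blank bucket listening on the multicast group."""

    def __init__(self, name, timeout):
        self.name, self.timeout = name, timeout
        self.state, self.deadline = State.BLANK, None

    def on_request(self, now):
        """Coordinator multicast 'Wanna be spare?': volunteer and start waiting."""
        if self.state is not State.BLANK:
            return None
        self.state, self.deadline = State.WAITING, now + self.timeout
        return "I would"

    def on_hired(self, now):
        """Coordinator picked this bucket; accept only before the time-out."""
        if self.state is State.WAITING and now <= self.deadline:
            self.state = State.HIRED
            return "Confirmed"
        return None

    def on_cancel_or_tick(self, now, cancelled=False):
        """Cancellation message or elapsed time-out: cancel everything."""
        if self.state is State.WAITING and (cancelled or now > self.deadline):
            self.state, self.deadline = State.BLANK, None


def select_spares(volunteers, needed):
    """Assumed policy: hire the first `needed` volunteers, cancel the rest."""
    return volunteers[:needed], volunteers[needed:]


buckets = [BlankBucket(name, timeout=5) for name in ("b1", "b2", "b3")]
volunteers = [b for b in buckets if b.on_request(now=0) == "I would"]
hired, cancelled = select_spares(volunteers, needed=1)
replies = [b.on_hired(now=1) for b in hired]
```

The time-out protects volunteers against a coordinator that never answers: a bucket that hears nothing simply returns to the blank pool.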
Scenario (6)
Recovery manager selection: the coordinator sends "Recover Failed Buckets" to a parity bucket, which becomes the recovery manager.
Scenario (7)
Query phase: the recovery manager asks the buckets participating in the recovery (parity buckets and data buckets): "Send me records of rank in [r, r+slice-1]". The spare buckets are waiting.
Scenario (8)
Reconstruction phase: the buckets participating in the recovery return the requested buffers to the recovery manager. The decoding phase runs in parallel with the query phase, and the recovered slices are sent to the spare buckets.
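Recovery per slice can be sketched for the simplest case, one lost data bucket rebuilt from the XOR parity. The loop asks for the ranks [r, r+slice-1], decodes them, and moves to the next slice. The in-memory lists standing in for buckets and the function names are assumptions, and the sketch runs the two phases one after the other, whereas the slide overlaps decoding with the query phase.

```python
def slices(bucket_size, slice_size):
    """Yield the inclusive rank ranges [r, r + slice - 1] covering a bucket."""
    for first in range(0, bucket_size, slice_size):
        yield first, min(first + slice_size, bucket_size) - 1


def recover_xor(surviving, bucket_size, slice_size):
    """Rebuild one lost bucket from the surviving buckets of its group.

    `surviving` holds one list of symbols per surviving bucket (the XOR
    parity bucket included), indexed by rank.
    """
    recovered = []
    for first, last in slices(bucket_size, slice_size):
        # Query phase: fetch the requested ranks from every participant.
        buffers = [bucket[first:last + 1] for bucket in surviving]
        # Decoding phase: XOR the symbols that share a rank.
        for column in zip(*buffers):
            value = 0
            for symbol in column:
                value ^= symbol
            recovered.append(value)
    return recovered


# Toy group: 4 data buckets of 10 one-byte records and their XOR parity.
data = [[(7 * i + 3 * r) % 251 for r in range(10)] for i in range(4)]
parity = [data[0][r] ^ data[1][r] ^ data[2][r] ^ data[3][r] for r in range(10)]
rebuilt = recover_xor([data[0], data[1], data[3], parity], bucket_size=10, slice_size=3)
```

The slice size bounds the memory needed on the recovery manager, which is what lets recovery adapt to the capacities of the machine running it.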
Performances: Config.
• File info
  >> File of 125 000 records, record size = 100 bytes
  >> Bucket size = 31 250 records, i.e. 3.125 MB
  >> Group of 4 data buckets (m = 4), k-available with k = 1, 2, 3
• Decoding
  >> GF(2^16)
  >> RS+ decoding (RS + log pre-calculation of H^-1 and of the OK symbols vector)
• Recovery per slice (adaptive to the PCs' storage & computing capacities)
Performances: 1 DB XOR
[Chart: recovery time as a function of the slice size, from 4% to 100% of a bucket's content.]
Total time is almost constant: 0.58 s.
Performances: 1 DB RS
[Chart: recovery time as a function of the slice size, from 4% to 100% of a bucket's content.]
Total time is almost constant: 0.67 s.
Performances: XOR vs. RS
• Time to recover 1 DB with XOR: 0.58 s
• Time to recover 1 DB with RS: 0.67 s
XOR in GF(2^16) achieves a gain of 13% in total time (and 30% in CPU time).
Performances: 2 DBs
[Chart: recovery time as a function of the slice size, from 4% to 100% of a bucket's content.]
Total time is almost constant: 0.9 s.
Performances: 3 DBs
[Chart: recovery time as a function of the slice size, from 4% to 100% of a bucket's content.]
Total time is almost constant: 1.23 s.
Performances: Summary
Time to recover f buckets < f × time to recover 1 bucket
• The query phase is factorized.
• The extra cost is the decoding time and the time to send the recovered buffers.
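The figures quoted on the preceding performance slides can be checked against the summary with a few lines of arithmetic. Reading 0.9 s and 1.23 s as the RS recovery times for 2 and 3 data buckets is an assumption based on the order of the slides.

```python
# Times in seconds, as quoted on the performance slides.
xor_time_1db = 0.58
rs_time = {1: 0.67, 2: 0.9, 3: 1.23}   # buckets recovered -> total time (assumed mapping)

# Gain of XOR over RS when recovering one data bucket.
xor_gain = (rs_time[1] - xor_time_1db) / rs_time[1]

# Recovering f buckets at once versus f separate single-bucket recoveries.
ratio_to_naive = {f: rs_time[f] / (f * rs_time[1]) for f in (2, 3)}
```

`xor_gain` comes out at about 13%, matching the slide, and both ratios are below 1 because the query phase is paid once however many buckets are rebuilt.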
Performances: GF(2^8)
• XOR in GF(2^8) improves decoding performance by 60% compared with RS in GF(2^8).
• RS/RS+ decoding in GF(2^16) achieves a gain of 50% compared with decoding in GF(2^8).
Outline…
1. Issue
2. State of the Art
3. LH*RS Scheme
4. LH*RS Manager
5. Experimentations
6. File Creation
7. Bucket Recovery
8. Parity Bucket Creation
  >> Scenario
  >> Performances
Scenario
Searching for a new parity bucket: the coordinator sends "Wanna Join Group g?" to the multicast group of blank parity buckets.
Scenario (2)
Waiting for replies …: buckets of the multicast group of blank parity buckets answer "I Would" to the coordinator. Each volunteer then:
• launches UDP listening, TCP listening and its working threads;
• waits for confirmation;
• cancels everything if the time-out elapses.