CS 347: Parallel and Distributed Data Management Notes02: Distributed DB Design

CS 347: Parallel and DistributedData ManagementNotes02: Distributed DB Design Hector Garcia-Molina Notes 02

Chapter 5 Ozsu & Valduriez Distributed DB Design Top-down approach: - have DB… - how to split and allocate the sites Multi-DBs (or bottom-up): no design issues! Notes 02

Two issues in DDB design: • Fragmentation • Allocation Note: issues not independent, but will cover separately Notes 02

Example Employee relation E (#,name,loc,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where loc=Sa where loc=Sb and… and ... Notes 02

Motivation: Two sites: Sa, Sb Qa  Qb Sa Sb Example Employee relation E (#,name,loc,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where loc=Sa where loc=Sb and… and ... Notes 02

It does not take a rocket scientist to figure out fragmentation... Notes 02

5 Joe Sa 10 # NM Loc Sal E 7 Sally Sb 25 8 Tom Sa 15 .. .. F # NM Loc Sal # NM Loc Sal 5 Joe Sa 10 7 Sally Sb 25 .. 8 Tom Sa 15 .. At Sb At Sa Notes 02

F = { F1, F2 } F1 =  loc=SaE F2 =  loc=SbE Notes 02

F = { F1, F2 } F1 =  loc=SaE F2 =  loc=SbE  called primary horizontal fragmentation Notes 02

Fragmentation • Horizontal Primary depends on local attributes R Derived depends on foreign relation • Vertical R Notes 02

Fragmentation • Horizontal Primary depends on local attributes R Derived depends on foreign relation • Vertical R Fragmentation also called Sharding Notes 02

Three common horizontal partitioning techniques • Round robin • Hash partitioning • Range partitioning Notes 02

Round robin R D0 D1 D2 t1 t1 t2 t2 t3 t3 t4 t4 ... t5 • Evenly distributes data • Good for scanning full relation • Not good for point or range queries Notes 02

Hash partitioning R D0 D1 D2 t1h(k1)=2 t1 t2h(k2)=0 t2 t3h(k3)=0 t3 t4h(k4)=1 t4 ... • Good for point queries on key; also for joins • Not good for range queries; point queries not on key • If hash function good, even distribution Notes 02

Range partitioning R D0 D1 D2 t1: A=5 t1 t2: A=8 t2 t3: A=2 t3 t4: A=3 t4 ... • Good for some range queries on A • Need to select good vector: else unbalance  data skew  execution skew partitioning vector 4 7 V0 V1 Notes 02

Which are good fragmentations? Example: F = { F1, F2 } F1 = sal<10 E F2 = sal>20 E Notes 02

Which are good fragmentations? Example: F = { F1, F2 } F1 = sal<10 E F2 = sal>20 E  Problem: Some tuples lost! Notes 02

Which are good fragmentations? Second example: F = { F3, F4 } F3 = sal<10 E F4 = sal>5 E Notes 02

Which are good fragmentations? Second example: F = { F3, F4 } F3 = sal<10 E F4 = sal>5 E  Tuples with 5 < sal < 10 are duplicated... Notes 02

Prefer to deal with replication explicitly Example: F = { F5, F6, F7 } F5 = sal  5 E F6 = 5< sal <10 EF7 = sal  10 E  Then replicate F6 if convenient (part of allocation problem) Notes 02

Desired properties for horizontal fragmentation R F ={ F1, F2, … } (1) Completeness t  R,  Fi  Fsuch that t Fi Notes 02

(2) Disjointness t  Fi,  Fj such that tFj, i  j, Fi, FjF (3) Reconstruction - ignore Notes 02

How do we get completeness and disjointness? (1) Check it “manually”! e.g., F1 =  sal<10 E ; F2 =  sal10 E Notes 02

How do we get completeness and disjointness? (2) “Automatically” generate fragments with these properties Desired simple predicates  Fragments Notes 02

Example of generation • Say queries use predicates: A<10, A>5, Loc = SA, Loc = SB • Next: - generate “minterm” predicates - eliminate useless ones Notes 02

Minterm predicates (part I) (1) A<10  A>5  Loc=SA Loc=SB (2) A<10  A>5  Loc=SA ¬(Loc=SB) (3) A<10  A>5  ¬(Loc=SA)  Loc=SB (4) A<10  A>5  ¬(Loc=SA)  ¬(Loc=SB) (5) A<10  ¬(A>5)  Loc=SA Loc=SB (6) A<10  ¬(A>5)  Loc=SA ¬(Loc=SB) (7) A<10  ¬(A>5)  ¬(Loc=SA)  Loc=SB (8) A<10  ¬(A>5)  ¬(Loc=SA)  ¬(Loc=SB) Notes 02

5 < A < 10 A  5 Minterm predicates (part I) (1) A<10  A>5  Loc=SA Loc=SB (2) A<10  A>5  Loc=SA ¬(Loc=SB) (3) A<10  A>5  ¬(Loc=SA)  Loc=SB (4) A<10  A>5  ¬(Loc=SA)  ¬(Loc=SB) (5) A<10  ¬(A>5)  Loc=SA Loc=SB (6) A<10  ¬(A>5)  Loc=SA ¬(Loc=SB) (7) A<10  ¬(A>5)  ¬(Loc=SA)  Loc=SB (8) A<10  ¬(A>5)  ¬(Loc=SA)  ¬(Loc=SB) Notes 02

Minterm predicates (part II) (9) ¬(A<10)  A>5  Loc=SA Loc=SB (10) ¬(A<10)  A>5  Loc=SA¬(Loc=SB) (11) ¬(A<10)  A>5 ¬(Loc=SA)  Loc=SB (12) ¬(A<10)  A>5 ¬(Loc=SA) ¬(Loc=SB) (13) ¬(A<10) ¬(A>5)  Loc=SA Loc=SB (14) ¬(A<10) ¬(A>5)  Loc=SA¬(Loc=SB) (15) ¬(A<10) ¬(A>5) ¬(Loc=SA)  Loc=SB (16) ¬(A<10) ¬(A>5) ¬(Loc=SA) ¬(Loc=SB) Notes 02

A  10 Minterm predicates (part II) (9) ¬(A<10)  A>5  Loc=SA Loc=SB (10) ¬(A<10)  A>5  Loc=SA¬(Loc=SB) (11) ¬(A<10)  A>5 ¬(Loc=SA)  Loc=SB (12) ¬(A<10)  A>5 ¬(Loc=SA) ¬(Loc=SB) (13) ¬(A<10) ¬(A>5)  Loc=SA Loc=SB (14) ¬(A<10) ¬(A>5)  Loc=SA¬(Loc=SB) (15) ¬(A<10) ¬(A>5) ¬(Loc=SA)  Loc=SB (16) ¬(A<10) ¬(A>5) ¬(Loc=SA) ¬(Loc=SB) Notes 02

Final fragments: F2: 5 < A < 10 Loc=SA F3: 5 < A < 10  Loc=SB F6: A  5  Loc=SA F7: A  5 Loc=SB F10: A  10  Loc=SA F11: A  10 Loc=SB Notes 02

Note: elimination of useless fragments depends on application semantics: e.g.: if LOC could be  SA,  SB, we need to add fragments F4: 5 <A <10 Loc  SA Loc  SB F8: A  5 Loc  SA Loc  SB F12: A  10  Loc  SA Loc  SB Notes 02

Why does this work? Predicates: p1  p2  p3  p4 p1  p2  p3  ¬p4 ¬ p1  ¬p2  ¬p3  ¬p4 ... Notes 02

(1) Completeness: Take t  R pi(t) must be T or F! Say p1(t) =T p2(t) = T p3(t) =F p4(t) =F Then t is in fragment with predicate p1  p2  ¬p3  ¬p4 Notes 02

(2) Disjointness Say t  Fragment p1 p2  ¬p3  ¬p4 Then: p1(t) = T, p2(t) = T, p3(t) = F, p4(t)= F  t cannot be in any other fragment! Notes 02

Summary • Given simple predicates Pr= { p1, p2,.. pm } minterm predicates are M={m | m = pk*, 1  k m } where pk* is pk or is ¬ pk pkPr • Fragments m R for all m  M are • complete and disjoint Notes 02

Another Desired Fragmentation Property:Match Access Patterns data A try to place in same fragment frequently accessed together data B data C Notes 02

Return to example: E(#, NM, LOC, SAL,…) Common queries: Qa: select * Qb: select * from E from E where LOC=Sa where LOC=Sb and … and ... Notes 02

Three choices: (1) Pr = { } F1 ={ E } (2) Pr = {LOC=Sa, LOC=Sb} F2={  loc=Sa E,  loc=Sb E } (3) Pr = {LOC=Sa, LOC=Sb, Sal<10} F3={ loc=Sa  sal<10 E, loc=Sa  sal10 E, loc=Sb  sal<10E, loc=Sb  sal10 E } Notes 02

In other words: Qa: Select … loc = Sa ... Loc=Sa  sal < 10 Qb: Select … loc = Sb ... Loc=Sa  sal  10 F3 F1 F2 Loc=Sb  sal < 10 Loc=Sb  sal  10 Notes 02

F2 is good… (not F1 , F3 ) In other words: Qa: Select … loc = Sa ... Loc=Sa  sal < 10 Qb: Select … loc = Sb ... Loc=Sa  sal  10 F3 F1 F2 Loc=Sb  sal < 10 Loc=Sb  sal  10 Notes 02

Derived horizontal fragmentation Example: E(#, NM, SAL, LOC)F={ E1, E2} by LOC J(#, DES,…) Common query for project: [Given employee name, list projects (s)he works in] Notes 02

E2 E1 (at Sa) (at Sb) J Notes 02

J1 = J E1 J2 = J E2 E2 E1 (at Sa) (at Sb) J2 J1 Notes 02

Derived horizontal fragmentation F could be primary or derived R, F = { F1, F2, ... Fn}  S, D = {D1, D2, …Dn} where Di =S Fi Convention: R is owner S is member Notes 02

Checking completeness and disjointness of derived fragmentation Example: Say J is:  But no #= 33in E1 nor in E2! This J tuple will not be in J1 nor J2 Fragmentation not complete Notes 02

To get completeness Need to enforce referential integrity constraint: join attr(#) of member relation  joint attr(#) of owner relation Notes 02

Example: E2 E1 Fragmentation is not disjoint! J J1 J2 Notes 02

To get disjointness Join attribute(#) should be key of owner relation Notes 02

Summary: horizontal fragmentation • Type: primary, derived • Properties: completeness, disjointness Notes 02

CS 347: Parallel and Distributed Data Management Notes02: Distributed DB Design