ICS 214B: Transaction Processing and Distributed Data Management

ICS 214B: Transaction Processing and Distributed Data Management Lecture 8: Distributed Database Systems Professor Chen Li

Application SQL Front End Query Processor Transaction Proc. File Access So far: Centralized DB systems P Software: M ... • Simplifications: • single front end • one place to keep locks • if processor fails, system fails, ... Notes 08

Next: distributed database systems • Multiple processors ( + memories) • Heterogeneity and autonomy of “components” Notes 08

Why do we need Distributed Databases? • Example: Big Corp. has offices in London, New York, and Hong Kong. • Employee data: • EMP(ENO, NAME, TITLE, SALARY, …) • Where should the employee data table reside? Notes 08

Big Corp. Data Access Pattern • Mostly, employee data is managed at the office where the employee works • E.g., payroll, benefits, hire and fire • Periodically, Big Corp needs consolidated access to employee data • E.g., Big Corp. changes benefit plans and that affects all employees. • E.g., Annual bonus depends on global net profit. Notes 08

Internet New York Payroll app London Payroll app EMP London New York Hong Kong Payroll app Problem: NY and HK payroll apps run very slowly! Hong Kong Notes 08

Internet New York Payroll app London Payroll app London Emp NY Emp London New York Hong Kong Payroll app Much better!! Hong Kong HK Emp Notes 08

Internet New York Payroll app London Payroll app Annual Bonus app London Emp NY Emp London New York Hong Kong Payroll app Distribution provides opportunities for parallel execution Hong Kong HK Emp Notes 08

Internet New York Payroll app London Payroll app Annual Bonus app London Emp NY Emp London New York Hong Kong Payroll app Hong Kong HK Emp Notes 08

Internet New York Payroll app London Payroll app Annual Bonus app Lon, NY Emp NY, HK Emp London New York Hong Kong Payroll app Replication improves availability Hong Kong HK, Lon Emp Notes 08

Heterogeneity and Autonomy Application Stock ticker tape Files RDBMS Portfolio History of dividends, ratios,... Notes 08

We will study data management with multiple processors and possible autonomy, heterogeneity • Impact on: • Data organization • Query processing • Access structures • Concurrency control • Recovery Notes 08

We also study transaction monitors • Coordinate transaction execution • Multiple DBMSs • High performance • Have workflow facilities • Manage communications with client “terminals” Notes 08

DB architectures (1) Shared memory P P P ... M Notes 08

P P P M DB architectures (2) Shared disk ... M M ... Notes 08

P P P M M M DB architectures (3) Shared nothing ... Notes 08

... ... P P P P P P M M DB architectures (4) Hybrid example – Hierarchical or Clustered Notes 08

Issues for selecting architecture • Reliability • Scalability • Geographic distribution of data • Data “clusters” • Performance • Cost Notes 08

Parallel or distributed DB system? • More similarities than differences! Notes 08

Typically, parallel DBs: • Fast interconnect • Homogeneous software • High performance is goal • Transparency is goal Notes 08

Typically, distributed DBs: • Geographically distributed • Data sharing is goal (may run into heterogeneity, autonomy) • Disconnected operation possible Notes 08

Distributed Database Challenges • Distributed Database Design • Deciding what data goes where • Depends on data access patterns of major applications • Two subproblems: • Fragmentation: partition tables into fragments • Allocation: allocate fragments to nodes Notes 08

Distributed Database Challenges • Distributed Query Processing • Centralized query plan goal: minimize number of disk I/Os • Additional factors in distributed scenario: • Communication costs • Opportunity for parallelism • Space of possible query plans is much larger! Notes 08

Distributed Database Challenges • Distributed Concurrency Control • Transactions span nodes • Must be globally serializable • Two main approaches: • Locking • Timestamps • Distributed Deadlock Management • Multiple data copies – need to be kept in sync when updates occur Notes 08

Distributed Database Challenges • Reliability of Distributed Databases • Centralized database failure model: • processor fails • Distributed database failure model: • One or more processors may fail • Network may fail • Network may be partitioned • Data must be kept in sync Notes 08

 To illustrate synchronization problems: “Two Generals” Problem Notes 08

The one general problem (Trivial!)  Battlefield G Troops Notes 08

The two general problem: <-------------------------------> Blue army Red army Enemy Blue G Red G messengers Notes 08

Rules: • Blue and red army must attack at same time • Blue and red generals synchronize through messengers • Messengers can be lost Notes 08

Distributed Database Challenges • Heterogeneity Application Stock ticker tape Files RDBMS Portfolio History of dividends, ratios,... Notes 08

Distributed Database Challenges • Autonomy Example: unable to get statistics for query optimization Example: blue general may have mind of his (or her) own! Notes 08

Next Topic • Distributed DB Design Notes 08

Distributed DB Design Top-down approach: - have DB… - how to split and allocate the sites Bottom-up approach: - multi-database (possibly heterogeneous, autonomous) - no design issues! Notes 08

Two issues in DDB design: • Fragmentation • Allocation Note: issues not independent, but will cover separately Notes 08

Motivation: Two sites: Sa, Sb Qa  Qb Sa Sb Employee relation E (#,name,loc,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where loc=Sa where loc=Sb and… and ... Notes 08

5 Joe Sa 10 # NM Loc Sal E 7 Sally Sb 25 8 Tom Sa 15 .. .. F # NM Loc Sal # NM Loc Sal 5 Joe Sa 10 7 Sally Sb 25 .. 8 Tom Sa 15 .. At Sb At Sa Notes 08

F = { F1, F2 } F1 = loc=Sa(E) F2 = loc=Sb(E)  called primary horizontal fragmentation Notes 08

Fragmentation • Horizontal Primary depends on local attributes R Derived depends on foreign relation • Vertical R Notes 08

Used mostly in parallel dbs Used in parallel dbs and distributed dbs Three common horizontal fragmentation techniques • Round robin • Hash partitioning • Range partitioning Notes 08

Round robin R D0 D1 D2 t1 t1 t2 t2 t3 t3 t4 t4 ... t5 • Evenly distributes data • Good for scanning full relation • Not good for point or range queries • Not suitable for databases distributed over WAN Notes 08

Hash partitioning R D0 D1 D2 t1h(k1)=2 t1 t2h(k2)=0 t2 t3h(k3)=0 t3 t4h(k4)=1 t4 ... • Good for point queries on key; also for joins on key • Not good for range queries; point queries not on key • If hash function good, even distribution • Not suitable for databases distributed over a WAN Notes 08

Range partitioning partitioning vector R D0 D1 D2 t1: A=5 t1 t2: A=8 t2 t3: A=2 t3 t4: A=3 t4 ... 4 7 V0 V1 • Good for point queries on A; also for joins on A • Good for some range queries on A • Need to select good vector: else unbalanced • data skew, execution skew Notes 08

Which are good fragmentations? Example: F = { F1, F2 } F1 = sal<10 E F2 = sal>20 E  Problem: Some tuples lost! Notes 08

Which are good fragmentations? Second example: F = { F3, F4 } F3 = sal<10 E F4 = sal>5 E  Tuples with 5 < sal < 10 are duplicated... Notes 08

Better design Example: F = { F5, F6, F7 } F5 = sal  5 E F6 = 5<sal<10 EF7 = sal  10 E  Then replicate F6 if convenient (part of allocation problem) Notes 08

Desired properties for fragmentation R F = {F1, F2, …, Fn} • Completeness • For every data item x  R,  FiFsuch that xFi • Disjointness • xFi,  Fj such that xFj, i  j • Reconstruction • There is function g such that R = g(F1, F2, …, Fn) Notes 08

Fi FiF Desired properties for horizontal fragmentation R F = {F1, F2, …, Fn} • Completeness • For every tuple tR,  FiFsuch that tFi • Disjointness • tFi,  Fj such that tFj, i  j • Reconstruction – can safely ignore • Completeness  R = Notes 08

How do we get completeness and disjointness? (1) Check it “manually”! e.g., F1 = sal<10 E ; F2 = sal10 E Notes 08

How do we get completeness and disjointness? (2) “Automatically” generate fragments with these properties • Horizontal fragments are defined by selection predicates • Generate a set of selection predicates with the desired properties Notes 08

Example of generation • Say queries use predicates: A<10, A>5, Loc = SA, Loc = SB • Next: - generate “minterm” predicates - eliminate useless ones • Given simple predicates Pr= { p1, p2,.. pn } • minterm predicates are of the form • p1*  p2*  …  pn* • where pk* is pk or is ¬pk Notes 08

ICS 214B: Transaction Processing and Distributed Data Management