560 likes | 720 Views
ICS 214B: Transaction Processing and Distributed Data Management. Lecture 8: Distributed Database Systems Professor Chen Li. Application SQL Front End Query Processor Transaction Proc. File Access. So far: Centralized DB systems. P. Software:. M. Simplifications:
E N D
ICS 214B: Transaction Processing and Distributed Data Management Lecture 8: Distributed Database Systems Professor Chen Li
Application SQL Front End Query Processor Transaction Proc. File Access So far: Centralized DB systems P Software: M ... • Simplifications: • single front end • one place to keep locks • if processor fails, system fails, ... Notes 08
Next: distributed database systems • Multiple processors ( + memories) • Heterogeneity and autonomy of “components” Notes 08
Why do we need Distributed Databases? • Example: Big Corp. has offices in London, New York, and Hong Kong. • Employee data: • EMP(ENO, NAME, TITLE, SALARY, …) • Where should the employee data table reside? Notes 08
Big Corp. Data Access Pattern • Mostly, employee data is managed at the office where the employee works • E.g., payroll, benefits, hire and fire • Periodically, Big Corp needs consolidated access to employee data • E.g., Big Corp. changes benefit plans and that affects all employees. • E.g., Annual bonus depends on global net profit. Notes 08
Internet New York Payroll app London Payroll app EMP London New York Hong Kong Payroll app Problem: NY and HK payroll apps run very slowly! Hong Kong Notes 08
Internet New York Payroll app London Payroll app London Emp NY Emp London New York Hong Kong Payroll app Much better!! Hong Kong HK Emp Notes 08
Internet New York Payroll app London Payroll app Annual Bonus app London Emp NY Emp London New York Hong Kong Payroll app Distribution provides opportunities for parallel execution Hong Kong HK Emp Notes 08
Internet New York Payroll app London Payroll app Annual Bonus app London Emp NY Emp London New York Hong Kong Payroll app Hong Kong HK Emp Notes 08
Internet New York Payroll app London Payroll app Annual Bonus app Lon, NY Emp NY, HK Emp London New York Hong Kong Payroll app Replication improves availability Hong Kong HK, Lon Emp Notes 08
Heterogeneity and Autonomy Application Stock ticker tape Files RDBMS Portfolio History of dividends, ratios,... Notes 08
We will study data management with multiple processors and possible autonomy, heterogeneity • Impact on: • Data organization • Query processing • Access structures • Concurrency control • Recovery Notes 08
We also study transaction monitors • Coordinate transaction execution • Multiple DBMSs • High performance • Have workflow facilities • Manage communications with client “terminals” Notes 08
DB architectures (1) Shared memory P P P ... M Notes 08
P P P M DB architectures (2) Shared disk ... M M ... Notes 08
P P P M M M DB architectures (3) Shared nothing ... Notes 08
... ... P P P P P P M M DB architectures (4) Hybrid example – Hierarchical or Clustered Notes 08
Issues for selecting architecture • Reliability • Scalability • Geographic distribution of data • Data “clusters” • Performance • Cost Notes 08
Parallel or distributed DB system? • More similarities than differences! Notes 08
Typically, parallel DBs: • Fast interconnect • Homogeneous software • High performance is goal • Transparency is goal Notes 08
Typically, distributed DBs: • Geographically distributed • Data sharing is goal (may run into heterogeneity, autonomy) • Disconnected operation possible Notes 08
Distributed Database Challenges • Distributed Database Design • Deciding what data goes where • Depends on data access patterns of major applications • Two subproblems: • Fragmentation: partition tables into fragments • Allocation: allocate fragments to nodes Notes 08
Distributed Database Challenges • Distributed Query Processing • Centralized query plan goal: minimize number of disk I/Os • Additional factors in distributed scenario: • Communication costs • Opportunity for parallelism • Space of possible query plans is much larger! Notes 08
Distributed Database Challenges • Distributed Concurrency Control • Transactions span nodes • Must be globally serializable • Two main approaches: • Locking • Timestamps • Distributed Deadlock Management • Multiple data copies – need to be kept in sync when updates occur Notes 08
Distributed Database Challenges • Reliability of Distributed Databases • Centralized database failure model: • processor fails • Distributed database failure model: • One or more processors may fail • Network may fail • Network may be partitioned • Data must be kept in sync Notes 08
To illustrate synchronization problems: “Two Generals” Problem Notes 08
The one general problem (Trivial!) Battlefield G Troops Notes 08
The two general problem: <-------------------------------> Blue army Red army Enemy Blue G Red G messengers Notes 08
Rules: • Blue and red army must attack at same time • Blue and red generals synchronize through messengers • Messengers can be lost Notes 08
Distributed Database Challenges • Heterogeneity Application Stock ticker tape Files RDBMS Portfolio History of dividends, ratios,... Notes 08
Distributed Database Challenges • Autonomy Example: unable to get statistics for query optimization Example: blue general may have mind of his (or her) own! Notes 08
Next Topic • Distributed DB Design Notes 08
Distributed DB Design Top-down approach: - have DB… - how to split and allocate the sites Bottom-up approach: - multi-database (possibly heterogeneous, autonomous) - no design issues! Notes 08
Two issues in DDB design: • Fragmentation • Allocation Note: issues not independent, but will cover separately Notes 08
Motivation: Two sites: Sa, Sb Qa Qb Sa Sb Employee relation E (#,name,loc,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where loc=Sa where loc=Sb and… and ... Notes 08
5 Joe Sa 10 # NM Loc Sal E 7 Sally Sb 25 8 Tom Sa 15 .. .. F # NM Loc Sal # NM Loc Sal 5 Joe Sa 10 7 Sally Sb 25 .. 8 Tom Sa 15 .. At Sb At Sa Notes 08
F = { F1, F2 } F1 = loc=Sa(E) F2 = loc=Sb(E) called primary horizontal fragmentation Notes 08
Fragmentation • Horizontal Primary depends on local attributes R Derived depends on foreign relation • Vertical R Notes 08
Used mostly in parallel dbs Used in parallel dbs and distributed dbs Three common horizontal fragmentation techniques • Round robin • Hash partitioning • Range partitioning Notes 08
Round robin R D0 D1 D2 t1 t1 t2 t2 t3 t3 t4 t4 ... t5 • Evenly distributes data • Good for scanning full relation • Not good for point or range queries • Not suitable for databases distributed over WAN Notes 08
Hash partitioning R D0 D1 D2 t1h(k1)=2 t1 t2h(k2)=0 t2 t3h(k3)=0 t3 t4h(k4)=1 t4 ... • Good for point queries on key; also for joins on key • Not good for range queries; point queries not on key • If hash function good, even distribution • Not suitable for databases distributed over a WAN Notes 08
Range partitioning partitioning vector R D0 D1 D2 t1: A=5 t1 t2: A=8 t2 t3: A=2 t3 t4: A=3 t4 ... 4 7 V0 V1 • Good for point queries on A; also for joins on A • Good for some range queries on A • Need to select good vector: else unbalanced • data skew, execution skew Notes 08
Which are good fragmentations? Example: F = { F1, F2 } F1 = sal<10 E F2 = sal>20 E Problem: Some tuples lost! Notes 08
Which are good fragmentations? Second example: F = { F3, F4 } F3 = sal<10 E F4 = sal>5 E Tuples with 5 < sal < 10 are duplicated... Notes 08
Better design Example: F = { F5, F6, F7 } F5 = sal 5 E F6 = 5<sal<10 EF7 = sal 10 E Then replicate F6 if convenient (part of allocation problem) Notes 08
Desired properties for fragmentation R F = {F1, F2, …, Fn} • Completeness • For every data item x R, FiFsuch that xFi • Disjointness • xFi, Fj such that xFj, i j • Reconstruction • There is function g such that R = g(F1, F2, …, Fn) Notes 08
Fi FiF Desired properties for horizontal fragmentation R F = {F1, F2, …, Fn} • Completeness • For every tuple tR, FiFsuch that tFi • Disjointness • tFi, Fj such that tFj, i j • Reconstruction – can safely ignore • Completeness R = Notes 08
How do we get completeness and disjointness? (1) Check it “manually”! e.g., F1 = sal<10 E ; F2 = sal10 E Notes 08
How do we get completeness and disjointness? (2) “Automatically” generate fragments with these properties • Horizontal fragments are defined by selection predicates • Generate a set of selection predicates with the desired properties Notes 08
Example of generation • Say queries use predicates: A<10, A>5, Loc = SA, Loc = SB • Next: - generate “minterm” predicates - eliminate useless ones • Given simple predicates Pr= { p1, p2,.. pn } • minterm predicates are of the form • p1* p2* … pn* • where pk* is pk or is ¬pk Notes 08