1 / 55

Principles of Data Management V - Distributed Databases

This course delves into the principles of distributed databases, covering data management, design issues, query optimization, and transaction processing. Topics include relational and distributed models, normalization, query processing, and advanced distributed systems concepts.

cbose
Download Presentation

Principles of Data Management V - Distributed Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Temple University – CIS Dept.CIS661 – Principles of Data Management V. Megalooikonomou Distributed Databases (based on slides by Silberchatz,Korth, and Sudarshan and slides by C. Faloutsos at CMU )

  2. General Overview - rel. model • Relational model - SQL • Functional Dependencies & Normalization • Physical Design; Indexing • Query optimization • Transaction processing • Advanced topics • Distributed Databases

  3. Problem – definition centralized DB: CHICAGO LA NY

  4. Problem – definition Distr. DB: DB stored in many places ... connected NY LA

  5. Problem – definition now: connect to LA; exec sql select * from EMP; ... connect to NY; exec sql select * from EMPLOYEE; ... NY LA EMPLOYEE DBMS2 EMP DBMS1

  6. Problem – definition ideally: connect to distr-LA; exec sql select * from EMPL; LA NY EMPLOYEE D-DBMS D-DBMS EMP DBMS2 DBMS1

  7. Pros + Cons • Pros • Data sharing • reliability & availability • speed up of query processing • Cons • software development cost • more bugs • may increase processing overhead (msg)

  8. Overview • Problem – motivation • Design issues • Query optimization – semijoins • transactions (recovery, conc. control)

  9. Design of Distr. DBMS what are our choices of storing a table?

  10. Design of Distr. DBMS • replication • fragmentation (horizontal; vertical; hybrid) • both

  11. Design of Distr. DBMS vertical fragm. horiz. fragm.

  12. Transparency & autonomy Issues/goals: • naming and local autonomy • replication and fragmentation transp. • location transparency i.e.:

  13. Problem – definition ideally: connect to distr-LA; exec sql select * from EMPL; LA NY EMPLOYEE D-DBMS D-DBMS EMP DBMS2 DBMS1

  14. Overview • Problem – motivation • Design issues • Query optimization – semijoins • transactions (recovery, conc. control)

  15. Distributed Query processing • issues (additional to centralized q-opt) • cost of transmission • parallelism / overlap of delays • (cpu, disk, #bytes-transmitted, #messages-trasmitted) • minimize elapsed time? • or minimize resource consumption?

  16. Distr. Q-opt – semijoins S2 SUPPLIER SHIPMENT S1 S3 SUPPLIER Join SHIPMENT = ?

  17. semijoins • choice of plans? • plan #1: ship SHIP -> S1; join; ship -> S3 • plan #2: ship SHIP->S3; ship SUP->S3; join • ... • others?

  18. Distr. Q-opt – semijoins S2 SUPPLIER SHIPMENT S1 S3 SUPPLIER Join SHIPMENT = ?

  19. Semijoins • Idea: reduce the tables before shipping SUPPLIER SHIPMENT S1 S3 SUPPLIER Join SHIPMENT = ?

  20. Semijoins • How to do the reduction, cheaply? • Eg., reduce ‘SHIPMENT’:

  21. Semijoins • Idea: reduce the tables before shipping SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 S3 SUPPLIER Join SHIPMENT = ?

  22. Semijoins • Formally: • SHIPMENT’ = SHIPMENT SUPPLIER • express semijoin w/ rel. algebra

  23. Semijoins • Formally: • SHIPMENT’ = SHIPMENT SUPPLIER • express semijoin w/ rel. algebra

  24. Semijoins – eg: • suppose each attr. is 4 bytes • Q: transmission cost (#bytes) for semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER

  25. Semijoins • Idea: reduce the tables before shipping SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 S3 4 bytes SUPPLIER Join SHIPMENT = ?

  26. Semijoins – eg: • suppose each attr. is 4 bytes • Q: transmission cost (#bytes) for semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER • A: 4*4 bytes

  27. Semijoins – eg: • suppose each attr. is 4 bytes • Q1: give a plan, with semijoin(s) • Q2: estimate its cost (#bytes shipped)

  28. Semijoins – eg: • A1: • reduce SHIPMENT to SHIPMENT’ • SHIPMENT’ -> S3 • SUPPLIER -> S3 • do join @ S3 • Q2: cost?

  29. 4 bytes 4 bytes 4 bytes 4 bytes Semijoins SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 S3

  30. Semijoins – eg: • A2: • 4*4 bytes - reduce SHIPMENT to SHIPMENT’ • 3*8 bytes - SHIPMENT’ -> S3 • 4*8 bytes - SUPPLIER -> S3 • 0 bytes - do join @ S3 72 bytes TOTAL

  31. Other plans?

  32. Other plans? P2: • reduce SHIPMENT to SHIPMENT’ • reduce SUPPLIER to SUPPLIER’ • SHIPMENT’ -> S3 • SUPPLIER’ -> S3

  33. Other plans? P3: • reduce SUPPLIER to SUPPLIER’ • SUPPLIER’ -> S2 • do join @ S2 • ship results -> S3

  34. A brilliant idea: two-way semijoins • (not in book, not in final exam) • reduce both relations with one more exchange: [Kang, ’86] • ship back the list of keys that didn’t match • CAN NOT LOSE! (why?) • further improvement: • or the list of ones that matched – whatever is shorter!

  35. Two-way Semijoins S2 SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 (s5,s11) S3

  36. Overview • Problem – motivation • Design issues • Query optimization – semijoins • transactions (recovery, conc. control)

  37. Transactions – recovery • Problem: eg., a transaction moves $100 from NY -> $50 to LA, $50 to Chicago • 3 sub-transactions, on 3 systems • how to guarantee atomicity (all-or-none)? • Observation: additional types of failures (links, servers, delays, time-outs ....)

  38. Transactions – recovery • Problem: eg., a transaction moves $100 from NY -> $50 to LA, $50 to Chicago

  39. Distributed recovery CHICAGO How? T1,2: +$50 NY LA NY T1,3: +$50 T1,1:-$100

  40. Distributed recovery CHICAGO Step1: choose coordinator T1,2: +$50 NY LA NY T1,3: +$50 T1,1:-$100

  41. Distributed recovery • Step 2: execute a protocol, eg., “2 phase commit” when a transaction T completes execution (i.e., when all sites at which T has executed inform the transaction coordinator Ci that T has completed) Ci starts the 2PC protocol ->

  42. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit time

  43. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y Y time

  44. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y Y commit time

  45. 2 phase commit (eg., failure) T1,3 T1,1 (coord.) T1,2 prepare to commit time

  46. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y N time

  47. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y N abort time

  48. Distributed recovery • Many, many additional details (what if the coordinator fails? what if a link fails? etc) • and many other solutions (eg., 3-phase commit)

  49. Overview • Problem – motivation • Design issues • Query optimization – semijoins • transactions (recovery, conc. control)

  50. Distributed conc. control • also more complicated: • distributed deadlocks!

More Related