1 / 59

Temple University – CIS Dept. CIS616– Principles of Data Management

Explore design issues, query processing, and transparent autonomy in distributed databases. Discuss replication, fragmentation, and distributed query processing techniques. Understand semijoins and distributed query optimization.

maness
Download Presentation

Temple University – CIS Dept. CIS616– Principles of Data Management

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Temple University – CIS Dept.CIS616– Principles of Data Management V. Megalooikonomou Distributed Databases (based on notes by Silberchatz,Korth, and Sudarshan and notes by C. Faloutsos at CMU)

  2. Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)

  3. Problem – definition centralized DB: CHICAGO LA NY

  4. Problem – definition distributed DB: DB stored in many places (cites) ... connected NY LA

  5. Problem – definition distributed DB: stored in many connected cites DB typically geographically separated separately administered transactions are differentiated: Local Global possibly different DBMSs, DB schemas (heterogeneous) NY LA

  6. Problem – definition now: connect to LA; exec sql select * from EMP; ... connect to NY; exec sql select * from EMPLOYEE; ... NY LA EMPLOYEE DBMS2 EMP DBMS1

  7. Problem – definition ideally: connect to D-DBMS; exec sql select * from EMPL; LA NY EMPLOYEE D-DBMS D-DBMS EMP DBMS2 DBMS1

  8. Pros + Cons ?

  9. Pros + Cons • Pros • data sharing • reliability & availability • autonomy (local) • speed up of query processing • Cons • software development cost • more bugs • increased processing overhead (msg)

  10. Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)

  11. Design of Distr. DBMS • Homogeneous distr. DBs • Identical DBMS • Same DB Schema • Aware of one another • Agree to cooperate in transaction processing • Heterogeneous distr. DBs • Different DBMS • Different DB Schema • May not be aware of one another • May provide limited facilities for cooperation in transaction processing

  12. Design of Distr. DBMS what are our choices of storing a table?

  13. Design of Distr. DBMS • replication (several copies of a table at different sites) • fragmentation (horizontal; vertical; hybrid) • or both…

  14. Design of Distr. DBMS • Replication: a copy of a relation is stored in two or more sites • Pros and cons • Availability • Increased parallelism (possible minimization of movement of data among sites) • Increased overhead on update (replicas should be consistent)

  15. Design of Distr. DBMS vertical fragm. Fragmentation: horiz. fragm. • keep tuples/attributes at the sites where they are used the most • ensure that the table can be reconstructed

  16. Transparency & autonomy Issues/goals: • naming and local autonomy • replication transparency • fragmentation transparency • location transparency i.e.:

  17. Problem – definition ideally: connect to D-DBMS; exec sql select * from EMPL; LA NY EMPLOYEE D-DBMS D-DBMS EMP DBMS2 DBMS1

  18. Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)

  19. Distributed Query processing • issues (additional to centralized q-opt) • cost of transmission • parallelism / overlap of delays • (cpu, disk, #bytes-transmitted, #messages-transmitted) • minimize elapsed time? • or minimize resource consumption?

  20. Distributed Query processing S2 SUPPLIER SHIPMENT S1 S3 SUPPLIER Join SHIPMENT = ?

  21. semijoins • choice of plans? • plan #1: ship SHIP -> S1; join; ship -> S3 • plan #2: ship SHIP->S3; ship SUP->S3; join • ... • others?

  22. Distr. Q-opt – semijoins S2 SUPPLIER SHIPMENT S1 S3 SUPPLIER Join SHIPMENT = ?

  23. Semijoins • Idea: reduce the tables before shipping SUPPLIER SHIPMENT S1 S3 SUPPLIER Join SHIPMENT = ?

  24. Semijoins • How to do the reduction, cheaply? • E.g., reduce ‘SHIPMENT’:

  25. Semijoins • Idea: reduce the tables before shipping SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 S3 SUPPLIER Join SHIPMENT = ?

  26. Semijoins • Formally: • SHIPMENT’ = SHIPMENT SUPPLIER • express semijoin w/ rel. algebra

  27. Semijoins • Formally: • SHIPMENT’ = SHIPMENT SUPPLIER • express semijoin w/ rel. algebra

  28. Semijoins – e.g.: • suppose each attr. is 4 bytes • Q: transmission cost (#bytes) for semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER

  29. Semijoins • Idea: reduce the tables before shipping SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 S3 4 bytes SUPPLIER Join SHIPMENT = ?

  30. Semijoins – e.g.: • suppose each attr. is 4 bytes • Q: transmission cost (#bytes) for semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER • A: 4*4 bytes

  31. Semijoins – e.g.: • suppose each attr. is 4 bytes • Q1: give a plan, with semijoin(s) • Q2: estimate its cost (#bytes shipped)

  32. Semijoins – e.g.: • A1: • reduce SHIPMENT to SHIPMENT’ • SHIPMENT’ -> S3 • SUPPLIER -> S3 • do join @ S3 • Q2: cost?

  33. 4 bytes 4 bytes 4 bytes 4 bytes Semijoins SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 S3

  34. Semijoins – e.g.: • A2: • 4*4 bytes - reduce SHIPMENT to SHIPMENT’ • 3*8 bytes - SHIPMENT’ -> S3 • 4*8 bytes - SUPPLIER -> S3 • 0 bytes - do join @ S3 72 bytes TOTAL

  35. Other plans?

  36. Other plans? P2: • reduce SHIPMENT to SHIPMENT’ • reduce SUPPLIER to SUPPLIER’ • SHIPMENT’ -> S3 • SUPPLIER’ -> S3

  37. Other plans? P3: • reduce SUPPLIER to SUPPLIER’ • SUPPLIER’ -> S2 • do join @ S2 • ship results -> S3

  38. A brilliant idea: two-way semijoins • (not in book, not in final exam) • reduce both relations with one more exchange: [Kang, ’86] • ship back the list of keys that didn’t match • CAN NOT LOSE! (why?) • further improvement: • or the list of ones that matched – whatever is shorter!

  39. Two-way Semijoins S2 SUPPLIER SHIPMENT (s1,s2,s5,s11) S1 (s5,s11) S3

  40. Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions(recovery, conc. control)

  41. Transactions – recovery • Problem: e.g., a transaction moves $100 from NY  $50 to LA, $50 to Chicago • 3 sub-transactions, on 3 systems • how to guarantee atomicity (all-or-none)? • Observation: additional types of failures (links, servers, delays, time-outs ....)

  42. Transactions – recovery • Problem: e.g., a transaction moves $100 from NY -> $50 to LA, $50 to Chicago

  43. Distributed recovery CHICAGO How? T1,2: +$50 NY LA NY T1,3: +$50 T1,1:-$100

  44. Distributed recovery CHICAGO Step1: choose coordinator T1,2: +$50 NY LA NY T1,3: +$50 T1,1:-$100

  45. Distributed recovery • Step 2: execute a commit protocol, e.g., “2 phase commit” when a transaction T completes execution (i.e., when all sites at which T has executed inform the transaction coordinator Ci that T has completed) Ci starts the 2PC protocol ->

  46. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit time

  47. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y Y time

  48. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y Y commit time

  49. 2 phase commit (e.g., failure) T1,3 T1,1 (coord.) T1,2 prepare to commit time

  50. 2 phase commit T1,3 T1,1 (coord.) T1,2 prepare to commit Y N time

More Related