Explore design issues, query processing, and transparent autonomy in distributed databases. Discuss replication, fragmentation, and distributed query processing techniques. Understand semijoins and distributed query optimization.
Temple University – CIS Dept. CIS 616 – Principles of Data Management. V. Megalooikonomou. Distributed Databases (based on notes by Silberschatz, Korth, and Sudarshan and notes by C. Faloutsos at CMU)
Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)
Problem – definition. Centralized DB: [diagram: a single, centralized DB accessed from CHICAGO, LA, and NY]
Problem – definition. Distributed DB: DB stored in many places (sites), connected. [diagram: sites NY, LA, ...]
Problem – definition. Distributed DB: stored in many connected sites; typically geographically separated; separately administered; transactions are differentiated into local and global; possibly different DBMSs and DB schemas (heterogeneous). [diagram: sites NY, LA]
Problem – definition. Now: connect to LA; exec sql select * from EMP; ... connect to NY; exec sql select * from EMPLOYEE; ... [diagram: EMP under DBMS1 at LA, EMPLOYEE under DBMS2 at NY]
Problem – definition. Ideally: connect to D-DBMS; exec sql select * from EMPL; [diagram: a D-DBMS layer at each site hides EMP/DBMS1 at LA and EMPLOYEE/DBMS2 at NY]
Pros + Cons • Pros • data sharing • reliability & availability • autonomy (local) • speed up of query processing • Cons • software development cost • more bugs • increased processing overhead (msg)
Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)
Design of Distr. DBMS • Homogeneous distr. DBs • Identical DBMS • Same DB Schema • Aware of one another • Agree to cooperate in transaction processing • Heterogeneous distr. DBs • Different DBMS • Different DB Schema • May not be aware of one another • May provide limited facilities for cooperation in transaction processing
Design of Distr. DBMS what are our choices for storing a table?
Design of Distr. DBMS • replication (several copies of a table at different sites) • fragmentation (horizontal; vertical; hybrid) • or both…
Design of Distr. DBMS • Replication: a copy of a relation is stored in two or more sites • Pros and cons • Availability • Increased parallelism (possible minimization of movement of data among sites) • Increased overhead on update (replicas should be consistent)
Design of Distr. DBMS. Fragmentation (horizontal fragm. / vertical fragm.): • keep tuples/attributes at the sites where they are used the most • ensure that the table can be reconstructed
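A minimal SQL sketch of the two fragmentation styles, assuming a hypothetical EMPLOYEE(emp_id, name, city, salary) schema; the fragment predicates, table names, and site placement are illustrative, not part of the slides:

```sql
-- Horizontal fragmentation: each site keeps the tuples it uses most
-- (hypothetical fragments, stored at the NY and LA sites respectively).
CREATE TABLE employee_ny AS SELECT * FROM employee WHERE city = 'NY';
CREATE TABLE employee_la AS SELECT * FROM employee WHERE city = 'LA';

-- Reconstruction: the union of the fragments gives back the table.
SELECT * FROM employee_ny
UNION ALL
SELECT * FROM employee_la;

-- Vertical fragmentation: split the attributes, repeating the key.
CREATE TABLE employee_names   AS SELECT emp_id, name   FROM employee;
CREATE TABLE employee_payroll AS SELECT emp_id, salary FROM employee;

-- Reconstruction: join the vertical fragments on the key.
SELECT n.emp_id, n.name, p.salary
FROM employee_names n JOIN employee_payroll p ON n.emp_id = p.emp_id;
```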
Transparency & autonomy Issues/goals: • naming and local autonomy • replication transparency • fragmentation transparency • location transparency i.e.:
Problem – definition. Ideally: connect to D-DBMS; exec sql select * from EMPL; [diagram: a D-DBMS layer at each site hides EMP/DBMS1 at LA and EMPLOYEE/DBMS2 at NY]
Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)
Distributed Query processing • issues (additional to centralized q-opt) • cost of transmission • parallelism / overlap of delays • (cpu, disk, #bytes-transmitted, #messages-transmitted) • minimize elapsed time? • or minimize resource consumption?
Distributed Query processing [diagram: SUPPLIER stored at site S1, SHIPMENT at site S2; SUPPLIER Join SHIPMENT = ? (result needed at S3)]
semijoins • choice of plans? • plan #1: ship SHIP -> S1; join; ship -> S3 • plan #2: ship SHIP->S3; ship SUP->S3; join • ... • others?
Distr. Q-opt – semijoins [diagram: SUPPLIER at S1, SHIPMENT at S2; SUPPLIER Join SHIPMENT = ? (result needed at S3)]
Semijoins • Idea: reduce the tables before shipping [diagram: SUPPLIER at S1, SHIPMENT at S2; SUPPLIER Join SHIPMENT = ? at S3]
Semijoins • How to do the reduction, cheaply? • E.g., reduce ‘SHIPMENT’:
Semijoins • Idea: reduce the tables before shipping [diagram: S1 ships the list of SUPPLIER keys (s1,s2,s5,s11) to S2, which uses it to reduce SHIPMENT; SUPPLIER Join SHIPMENT = ? at S3]
Semijoins • Formally: • SHIPMENT’ = SHIPMENT ⋉ SUPPLIER • in rel. algebra: SHIPMENT ⋉ SUPPLIER = Π_{attrs(SHIPMENT)}(SHIPMENT ⋈ SUPPLIER), i.e., the SHIPMENT tuples that join with at least one SUPPLIER tuple
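A hedged SQL sketch of this reduction, assuming illustrative schemas SUPPLIER(s_id, ...) at S1 and SHIPMENT(s_id, ...) at S2 with s_id as the join attribute; the table and column names are assumptions, not from the slides:

```sql
-- At S1: project the join column of SUPPLIER; only this small key
-- list (e.g., s1, s2, s5, s11) is shipped to S2.
SELECT DISTINCT s_id FROM supplier;

-- At S2: SHIPMENT' = SHIPMENT semijoin SUPPLIER, i.e. the SHIPMENT
-- tuples whose s_id appears in the shipped key list (held here in a
-- temporary table supplier_keys).
SELECT sh.*
FROM shipment AS sh
WHERE sh.s_id IN (SELECT s_id FROM supplier_keys);
```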
Semijoins – e.g.: • suppose each attr. is 4 bytes • Q: transmission cost (#bytes) for semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER
Semijoins • Idea: reduce the tables before shipping [diagram: S1 ships the SUPPLIER keys (s1,s2,s5,s11), 4 bytes each, to S2 to reduce SHIPMENT; SUPPLIER Join SHIPMENT = ? at S3]
Semijoins – e.g.: • suppose each attr. is 4 bytes • Q: transmission cost (#bytes) for semijoin SHIPMENT’ = SHIPMENT semijoin SUPPLIER • A: 4*4 = 16 bytes (the 4 SUPPLIER keys, 4 bytes each)
Semijoins – e.g.: • suppose each attr. is 4 bytes • Q1: give a plan, with semijoin(s) • Q2: estimate its cost (#bytes shipped)
Semijoins – e.g.: • A1: • reduce SHIPMENT to SHIPMENT’ • SHIPMENT’ -> S3 • SUPPLIER -> S3 • do join @ S3 • Q2: cost?
Semijoins [diagram: SUPPLIER at S1, SHIPMENT at S2, shipped key list (s1,s2,s5,s11); each attribute is 4 bytes]
Semijoins – e.g.: • A2: • 4*4 = 16 bytes - reduce SHIPMENT to SHIPMENT’ (ship the 4 SUPPLIER keys to S2) • 3*8 = 24 bytes - SHIPMENT’ -> S3 • 4*8 = 32 bytes - SUPPLIER -> S3 • 0 bytes - do join @ S3 • 16 + 24 + 32 = 72 bytes TOTAL
Other plans? P2: • reduce SHIPMENT to SHIPMENT’ • reduce SUPPLIER to SUPPLIER’ • SHIPMENT’ -> S3 • SUPPLIER’ -> S3
Other plans? P3: • reduce SUPPLIER to SUPPLIER’ • SUPPLIER’ -> S2 • do join @ S2 • ship results -> S3
A brilliant idea: two-way semijoins • (not in book, not in final exam) • reduce both relations with one more exchange: [Kang, ’86] • ship back the list of keys that didn’t match • CAN NOT LOSE! (why?) • further improvement: • or the list of ones that matched – whatever is shorter!
Two-way Semijoins [diagram: S1 ships SUPPLIER keys (s1,s2,s5,s11) to S2; S2 ships back the shorter key list (s5,s11), which S1 uses to reduce SUPPLIER; result needed at S3]
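A hedged SQL sketch of the backward step of the two-way semijoin, continuing the previous example; the table and column names (shipment_reduced, supplier_keys, returned_keys, s_id) are again illustrative:

```sql
-- At S2, after computing SHIPMENT' = SHIPMENT semijoin SUPPLIER:
-- the keys that matched ...
SELECT DISTINCT s_id FROM shipment_reduced;
-- ... or the received keys that did not match; ship back the shorter list.
SELECT s_id FROM supplier_keys
WHERE s_id NOT IN (SELECT s_id FROM shipment_reduced);

-- At S1, the returned list (held here in returned_keys) reduces SUPPLIER too.
-- If the matched keys were shipped back:
SELECT * FROM supplier WHERE s_id IN (SELECT s_id FROM returned_keys);
-- If the non-matching keys were shipped back, use NOT IN instead.
```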
Overview • Problem – motivation • Design issues • Query processing – semijoins • transactions (recovery, conc. control)
Transactions – recovery • Problem: e.g., a transaction moves $100 from NY: $50 to LA, $50 to Chicago • 3 sub-transactions, on 3 systems • how to guarantee atomicity (all-or-none)? • Observation: additional types of failures (links, servers, delays, time-outs, ...)
Transactions – recovery • Problem: e.g., a transaction moves $100 from NY: $50 to LA, $50 to Chicago
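A minimal SQL sketch of the three sub-transactions, assuming a hypothetical accounts(acct_id, balance) table at each site; the account ids are made up for illustration:

```sql
-- T1,1 at the NY site: withdraw the full amount.
UPDATE accounts SET balance = balance - 100 WHERE acct_id = 'NY-1';

-- T1,2 at the LA site: deposit half.
UPDATE accounts SET balance = balance + 50 WHERE acct_id = 'LA-1';

-- T1,3 at the Chicago site: deposit the other half.
UPDATE accounts SET balance = balance + 50 WHERE acct_id = 'CHI-1';

-- Atomicity requires that either all three commit or none does,
-- which is what the commit protocol (2PC, below) must guarantee.
```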
Distributed recovery. How? [diagram: T1,1: -$100 at NY; T1,2: +$50 and T1,3: +$50 at LA and CHICAGO]
Distributed recovery. Step 1: choose a coordinator. [diagram: T1,1: -$100 at NY; T1,2: +$50 and T1,3: +$50 at LA and CHICAGO]
Distributed recovery • Step 2: execute a commit protocol, e.g., "2 phase commit" (2PC): when a transaction T completes execution (i.e., when all sites at which T has executed inform the transaction coordinator Ci that T has completed), Ci starts the 2PC protocol ->
2 phase commit [timeline: coordinator T1,1 sends "prepare to commit" to T1,2 and T1,3]
2 phase commit [timeline: T1,2 and T1,3 each reply Y]
2 phase commit [timeline: both votes are Y, so the coordinator sends "commit"]
2 phase commit (e.g., failure) [timeline: coordinator T1,1 sends "prepare to commit" to T1,2 and T1,3]
2 phase commit [timeline: one participant replies Y, the other N, so the coordinator must abort the transaction at all sites]
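A hedged sketch of the SQL a coordinator could issue at each participant site, assuming a PostgreSQL-style two-phase-commit interface (PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED); the global transaction id 't1', the accounts table, and the account id are hypothetical, and the two phase-2 statements are alternatives:

```sql
-- Phase 1, at one participant (e.g., the LA site): run the local
-- sub-transaction and ask the site to "prepare to commit".
BEGIN;
UPDATE accounts SET balance = balance + 50 WHERE acct_id = 'LA-1';
PREPARE TRANSACTION 't1';   -- the site votes Y iff this succeeds

-- Phase 2, if every participant voted Y: the coordinator commits everywhere.
COMMIT PREPARED 't1';

-- Phase 2, if any participant voted N (or timed out): abort everywhere.
ROLLBACK PREPARED 't1';
```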