
Distributed Databases (DDBs)



  1. Distributed Databases (DDBs) Chap. 25

  2. Distributed Databases • Distributed Systems goal: • to offer local DB autonomy at geographically distributed locations • Multiple CPUs – each has its own DBMS, but the data is distributed

  3. Advantages of DDBs • Distributed nature of some DB applications (e.g., bank branches) • Increased reliability and availability – if a site fails, data replicated at more than one site is still accessible • Data sharing combined with local control • Improved performance – smaller DBs exist at each site • Easier expansion

  4. Disadvantages: increased complexity. Additional functions are needed to: • distinguish global vs. local queries • access remote sites, transmit queries and data • keep track of data distribution and replication • choose execution strategies when data is at more than one site • decide which copy to access • maintain consistency of copies • recover from site crashes

  5. Architectures – parallel vs distributed • Distributed Systems goal:  to offer local DB autonomy at geographically distributed locations • Parallel Systems goal:  to construct a faster centralized computer • Improve performance through parallelization • Distribution of data governed by performance • Processing, I/O simultaneously

  6. Parallel DBSs • Shared-memory multiprocessor • goal: get N times as much work done with N CPUs accessing the same data • MIMD, SIMD – equal access to the same data; massively parallel • Parallel shared-nothing • data split among CPUs; each CPU has its own memory; the work of a transaction is divided; CPUs communicate over high-speed networks • LANs – homogeneous machines • a CPU + memory unit is called a site

  7. Distributed DBSs (DDBS) • Distributed DB – shared nothing • lower communication rates • WAN • heterogeneous machines • Homogeneous DDBS • same DBMS at every site • Heterogeneous DDBS • different DBMSs – need ODBC and standard SQL

  8. Heterogeneous distributed DBSs (HDDBs) • Data is distributed and each site has its own DBMS • Oracle at one site, DB2 at another, etc. • need ODBC and standard SQL • usually a transaction manager is responsible for cooperation among sites • must coordinate distributed transactions • need data conversion and access to data at other sites

  9. Federated DB - FDBS • a federated DB is a multidatabase that is autonomous • a collection of cooperating DBSs that are heterogeneous • preexisting DBs form the new database • each DB specifies an import/export schema (view) • keeps a partial view of the total schema • each DB has its own local users, local transactions, and DBA • appears centralized to local autonomous users • appears distributed to global users


  11. FDBSs - Data Integration • Fully-integrated, logical composite of all constituent databases • Potential problems: • incompatible data types or query syntax • semantically equivalent but differently named parts of schemas • Use views for data integration • Five-level schema architecture • component schema – translates a local schema into the common database model

  12. DDBs • Client-Server • Client sends request for service (strict – fixed roles) • 3-tier architecture • P2P • Every site can act as server to store part of DB and as client to request service

  13. DDBS • Issues in DDBS in slides that follow

  14. Data fragments • Can distribute a whole relation at a site, or • Data fragments • logical units of the DB assigned for storage at various sites • horizontal fragmentation – a subset of the tuples in the relation (a select) • vertical fragmentation – keeps only certain attributes of the relation (a project); each fragment needs the PK
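  For instance (a made-up EMPLOYEE relation, in the spirit of the company DB of slide 17): a horizontal fragment EMP4 = σ_Dno=4(EMPLOYEE) keeps only department 4's tuples, while vertical fragments π_Ssn,Name(EMPLOYEE) and π_Ssn,Salary,Dno(EMPLOYEE) each repeat the key Ssn so the original relation can be rebuilt by joining them.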

  15. Fragments cont’d • horizontal fragments: • complete – the fragments’ conditions together cover every tuple • disjoint – each tuple is a member of only 1 fragment (e.g., the condition salary < 5000 and dno=4) • vertical fragments: • complete if L1 ∪ L2 ∪ ... ∪ Ln = attributes of R • disjoint except for the key if Li ∩ Lj = PK(R) for i ≠ j
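  As a quick check with a made-up EMPLOYEE(Ssn, Name, Salary, Dno): L1 = {Ssn, Name} and L2 = {Ssn, Salary, Dno} give L1 ∪ L2 = all attributes of EMPLOYEE (complete) and L1 ∩ L2 = {Ssn} = PK(EMPLOYEE) (disjoint except for the key).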

  16. Replication • Full vs. partial replication • Which copy to access? • Improves performance for global queries, but updates are a problem • Must ensure consistency of replicated copies of data

  17. Example replication/fragmentation • Example of fragments for company DB:     site 1 - company headquarters gets entire DB     site 2, 3 – horizontal fragments based on dept. no.

  18. To process a query • Must use a data dictionary that includes info on the data distribution among servers • Parse the user query • decompose it into independent site queries • send each site query to the appropriate server site • each site processes its local query and sends the result to the result site • the result site combines the results of the subqueries • Ensure atomicity

  19. Additional Issues in DDBSs • Distributed transaction if data distributed • Transaction manager (TM) knows where data is stored • TM decides which site starts local components of transaction • Distributed query processing

  20. Distributed Query Processing • If the query is read-only, no 2PC is needed • If horizontally fragmented: • decompose the query into subqueries that can be processed in parallel • one site (the result site) accepts all the results and puts them together • Must also handle ORDER BY, COUNT, etc.

  21. Query Parallelism • Decompose a query into parts that can be executed in parallel at several sites • Intra-query parallelism • If shared-nothing & horizontally fragmented: SELECT name, phone FROM account WHERE age > 65 • Decompose into K different queries, one per fragment • The result site accepts all the partial results and puts them together (ORDER BY, COUNT) – see the sketch below
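  A minimal sketch of this decompose/merge pattern in Python, with in-memory lists standing in for the horizontal fragments of account at three sites (all names and data hypothetical):

      from concurrent.futures import ThreadPoolExecutor

      # Toy horizontal fragments of `account`, one per site.
      FRAGMENTS = {
          "S1": [{"name": "Ann", "phone": "555-0101", "age": 70}],
          "S2": [{"name": "Bob", "phone": "555-0102", "age": 40}],
          "S3": [{"name": "Cal", "phone": "555-0103", "age": 81}],
      }

      def query_fragment(site):
          # Stands in for shipping the subquery
          # "SELECT name, phone FROM account WHERE age > 65" to one site.
          return [{"name": r["name"], "phone": r["phone"]}
                  for r in FRAGMENTS[site] if r["age"] > 65]

      def parallel_select(sites):
          # Run the same subquery at every fragment site in parallel;
          # the result site merges the partial answers.
          with ThreadPoolExecutor(max_workers=len(sites)) as pool:
              parts = list(pool.map(query_fragment, sites))
          rows = [row for part in parts for row in part]
          rows.sort(key=lambda r: r["name"])   # ORDER BY redone at the result site
          return rows

      print(parallel_select(["S1", "S2", "S3"]))   # Ann and Cal qualify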

  22. Query Parallelism • What if there is a join and the table is fragmented? • A difficult problem if the table fragments are at different sites • Must collect all values of the join attribute at one site • Then broadcast the join-attribute values only to the relevant sites • if site 1 holds values 1–10 and site 2 holds 11–20, values are sent only to those sites • Matching tuples are returned and the join is performed • Example: A ⋈ B (similar to a semi-join ⋉), with fragments A1, A2, A3 and B1, B2, B3 each at a different site: • A1, A2, A3 are sent to S1 • S1 sends the FK values to the B1, B2, B3 sites • the B1, B2, B3 sites send back the tuples whose PK matches A1, A2, A3 • the join is performed at S1

  23. Distributed Query Processing • If the data is not fragmented • process whatever operations you can locally (select, project) • A query can involve joins between data at multiple sites • Must transfer data over the network • Transfer of data is the highest cost • Algorithms try to minimize the amount of data transferred (finding the optimum is NP-hard)

  24. Distributed Query Processing • Assume R1 ⋈ R2 → Site 3 • with R1 at S1 and R2 at S2 • Can do: • R1 → S3 and R2 → S3, do the join at S3 • R1 → S2, execute the join at S2, send the result to S3 • R2 → S1, execute the join at S1, send the result to S3

  25. Semi-join ⋉ • Send the values of the join attribute from Si to Sj • At Sj, select the tuples that match the keys sent, and send those tuples back to Si • Perform the join at Si – see the sketch below
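  A minimal sketch of those three steps in Python, with toy in-memory relations standing in for the data at the two sites (all names and tuples hypothetical):

      # Relation at Si: Department(dnumber, dname, mgrssn)
      departments = [
          {"dnumber": 1, "dname": "HQ",       "mgrssn": "111"},
          {"dnumber": 4, "dname": "Research", "mgrssn": "444"},
      ]
      # Relation at Sj: Employee(ssn, lname)
      employees = [
          {"ssn": "111", "lname": "Wong"},
          {"ssn": "222", "lname": "Zelaya"},
          {"ssn": "444", "lname": "Wallace"},
      ]

      # Step 1 (Si -> Sj): ship only the join-attribute values.
      join_keys = {d["mgrssn"] for d in departments}

      # Step 2 (Sj -> Si): ship back only the matching tuples.
      matching = [e for e in employees if e["ssn"] in join_keys]

      # Step 3 (at Si): perform the actual join locally.
      result = [{**d, **e} for d in departments for e in matching
                if d["mgrssn"] == e["ssn"]]
      print(result)   # HQ/Wong and Research/Wallace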

  26. Example using Semi-join π_dname,lname(Department ⋈_mgrssn=ssn Employee) Assume Department is R1 (at S1) and Employee is R2 (at S2): 1) At S2: E′ ← π_ssn(Employee) // project the join attribute; send E′ to S1 2) At S1: E″ ← π_mgrssn,dname(Department ⋈_mgrssn=ssn E′) // join with Department and keep only the needed attributes; send E″ to S2 3) At S2: π_dname,lname(E″ ⋈_mgrssn=ssn Employee) // join with Employee and keep the final attributes; send the result to the result site S3

  27. Distributed Query Processing • Hevner & Yao • Cost: • time to compute results • time to transmit data • Local data requires only processing time • Assume transmission time is the most costly • Goal is to minimize transmission time

  28. Distributed Query Processing • Assume you have: R3 = R1 ⋈_att R2 • R1 is at S1, R2 at S2, and the result relation R3 is sent to the result site S • Amount of data to transmit: • R1 is 1,000 KB • R2 is 250 KB • R3 is 50 KB • For a semi-join: • π_att(R1) is 100 KB • π_att(R2) is 50 KB • π_att(R1) ⋈ R2 is 10 KB • π_att(R2) ⋈ R1 is 25 KB

  29. Distributed Query Processing • Show all the possibilities for the previous problem and determine the best distributed query processing strategy – one enumeration is sketched below
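  A sketch of the arithmetic, counting only the bytes shipped and using the sizes from slide 28 (the 50 KB result R3 must reach the result site S in every case):
  • ship R1 and R2 to S: 1,000 + 250 = 1,250 KB
  • ship R1 to S2, join there, ship R3 to S: 1,000 + 50 = 1,050 KB
  • ship R2 to S1, join there, ship R3 to S: 250 + 50 = 300 KB
  • semi-join via π_att(R1): 100 (to S2) + 10 (matching R2 tuples back to S1) + 50 (R3 to S) = 160 KB
  • semi-join via π_att(R2): 50 (to S1) + 25 (matching R1 tuples back to S2) + 50 (R3 to S) = 125 KB – the cheapest of these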

  30. Why are checkpoints/recovery important? • Checkpoints/recovery • HPC – IBM used different checkpoint/recovery schemes • Recovering from a crash took several days because different schemes were used with no prior knowledge of one another • Needs further study

  31. Concurrency Control and Recovery in Distributed Databases • Distributed transactions inherit ACID properties • problems: • failure at individual sites and communication links • distributed deadlock (usually timeout) • distributed commit (use 2PC) • concurrency control with multiple copies of data

  32. Distributed concurrency control using locking • Centralized vs. distributed solutions • Definition: distinguished copy • Locks associated with this copy

  33. Distributed Locking - Primary site technique • The distinguished copy of all data items is at this site • All locks are kept at this site; all lock requests are sent here • The site acts as coordinator for all DB items • For a write lock, the DBMS must update all copies • With a read lock from the primary site, the transaction can access its local copy • disadv: • all locking requests go to the same site – a bottleneck • if the primary site fails, all transactions must be aborted and restarted

  34. Distributed Locking - Primary site with backup • all lock information is kept at both sites, so failure is less of a problem • disadv: • slower – 2 sites to communicate with

  35. Distributed Locking - Primary copy technique • distributes responsibility: distinguished copies of different items are at different sites • (a new coordinator can also be chosen after a failure – an election in which a site proposes itself as the new coordinator site)

  36. Distributed Locking - Voting • Voting – no distinguished copy • a lock request is sent to all sites holding a copy of the data • if the lock is granted by a majority of the copies, hold the lock and inform all sites with copies that the lock is granted • if a majority is not obtained, time out and send a cancel request to all sites • disadv: lots of message traffic • the majority test is sketched below
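  A one-function sketch of the majority test (names hypothetical):

      def lock_granted(votes_for, num_copies):
          # The lock is held only when more than half of the sites
          # holding a copy have granted the request.
          return votes_for > num_copies // 2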

  37. Distributed Commit - Chap. 23.6 • if a transaction updates data at different sites, when do we commit? • the difficult part – commit together so that we have atomicity • states of basic transactions: • active – can enter either of the other 2 states • committed – cannot leave this state • aborted – cannot leave this state

  38. States • [State diagram: active → committed, active → aborted]

  39. Commit problems • To commit a distributed transaction we need a coordinator (originator) site (S1 in our example) • Given a transaction T with components T1 and T2 • where T1 executes at site S1 • and T2 executes at site S2 a) if S1 commits locally and then sends a commit message to S2 • BUT if S2 crashes, T2 aborts before completing its subtask – not atomic b) if both T1 and T2 are ready to commit and S1 sends a message to commit • BUT if T2 crashes during the commit and can’t recover, not atomic

  40. To solve the problem • How to solve these problems? • Need an additional state: the ready state • when T is done and ready to commit, it goes to the ready state • from the ready state, a transaction can go to either the commit or the abort state • place a prepare record in the log buffer, forced to the log file – the transaction can then be reconstructed

  41. States • [State diagram: active → ready to commit → committed or aborted]

  42. Use Two-Phase Commit (2PC) • Used by commercial DBSs to achieve a coordinated commit of distributed transactions • Need a coordinator – usually a site that is participating in the transaction • Only needed if the transaction performs updates

  43. 2PC • Phase 1 • When each subquery concludes, it signals the coordinator • If the coordinator doesn’t hear from everyone (time out), it sends abort • Else, having received done from all, the coordinator sends messages to all sites asking them to prepare their site transactions • Each site writes its changes to its log file • if the log write is successful • send “ready to commit” and enter the prepared state • else, if it fails or the site cannot commit for other reasons • send “cannot commit”

  44. 2PC • Phase 2 • If the coordinator does not receive “ready to commit” from all sites, it sends abort • Otherwise all prepare requests drew successful responses, so the coordinator sends commit • Sites record the commit in their logs and update the DB • Message sequence: subquery concluded → ready to commit (prepared state) → commit → commit recorded • Both phases are sketched below
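  A minimal sketch of both phases from the coordinator's side, with toy in-memory participants; the class and helper names are hypothetical, and a real implementation would add timeouts and force-writes to stable storage:

      class Participant:
          # Toy site: prepare() stands in for force-writing a prepare
          # record to the local log before voting.
          def __init__(self, name, can_commit=True):
              self.name, self.can_commit, self.log = name, can_commit, []

          def prepare(self):
              self.log.append("prepare")
              return "ready" if self.can_commit else "no"

          def commit(self):
              self.log.append("commit")   # record commit, apply updates

          def abort(self):
              self.log.append("abort")    # undo local changes

      def two_phase_commit(sites):
          # Phase 1: ask every site to prepare (a timeout counts as "no").
          votes = [s.prepare() for s in sites]
          # Phase 2: commit only if every site voted ready, else global abort.
          decision = "commit" if all(v == "ready" for v in votes) else "abort"
          for s in sites:
              s.commit() if decision == "commit" else s.abort()
          return decision

      print(two_phase_commit([Participant("S1"), Participant("S2")]))         # commit
      print(two_phase_commit([Participant("S1"), Participant("S2", False)]))  # abort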

  45. Solved by 2PC? a) S1 commits locally and sends commit to S2 • BUT if S2 crashes, T2 aborts and the transaction is not atomic • With 2PC: • if T1 completes the transaction • T1 is prepared • if T2 crashes • T2 is unsuccessful • T1 can still abort from the prepared state

  46. Solved by 2PC? b) both T1 and T2 are ready to commit and S1 sends the message to commit • BUT if T2 crashes before the commit completes, the transaction is not atomic • With 2PC: • if both T1 and T2 successfully complete the transaction • T1 and T2 are both prepared • T1 is able to commit; if T2 crashes during the commit, it can recover its prepared state from the log and commit
