Distributed Databases • the logical next step for geographically dispersed organisations • the goal is to provide location transparency • starting point = a set of decentralised DBs located in different places, each developed for the specific information needs of its own site • Aim: to integrate these decentralised DBs into a coherent distributed database (DDB)
Advantages of Distributed DBs: • Increased reliability of systems and availability of data • Local control preserved • Modular growth possible at each site and at new sites • Optimised communication costs • Faster response times
Control in normal DBs • transaction control: the ability of the DBMS to ensure the successful completion of transactions • commit transactions • roll back to a previous state • concurrency control: the ability of the DBMS to arbitrate between concurrent uses of data: • simultaneous access • simultaneous update • deletion
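A minimal illustration of commit and roll-back in a single-site DBMS, sketched with Python's built-in sqlite3 module (the account table and the transfer itself are invented for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")   # throwaway single-site database
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
    conn.commit()

    try:
        # a transfer is one transaction: both updates succeed together or not at all
        conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
        conn.commit()        # commit: the changes become permanent
    except sqlite3.Error:
        conn.rollback()      # roll-back: the database returns to its previous state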
Control in Distributed DBs • Different portions of the overall database reside at different locations • these portions are controlled by different processors, sometimes running different DBMSs • a common schema means queries can involve any portion of the DB, residing at any location
Options for Distributed DBs • Issue of physical design (data structure) • performance of the DB (response time...) depends upon good design • There are a number of options: • data replication • horizontal partitioning • vertical partitioning • combinations of the above
Data replication • store a separate copy of the full tables at each location • if a copy is stored at every site: Full Replication • Advantages: • reliability • fast response • Disadvantages: • storage requirements • complexity and cost of updating
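A toy sketch of why updates are costly under full replication: every site holding a copy must apply the same change before the update is complete (site names and table contents are invented):

    # each site keeps its own full copy of the customer table
    replicas = {
        "london":   {1: {"name": "Ada", "branch": "london"}},
        "paris":    {1: {"name": "Ada", "branch": "london"}},
        "new_york": {1: {"name": "Ada", "branch": "london"}},
    }

    def update_everywhere(customer_id, field, value):
        # the update is only finished once every replica has applied it
        for site, copy in replicas.items():
            copy[customer_id][field] = value   # in reality: one message and one remote write per site

    update_everywhere(1, "branch", "paris")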
Horizontal partitioning • some of the rows of the tables are stored in one location; others are stored at other locations • eg: customers banking out of a particular branch are stored at that branch • Advantages: • efficiency • local optimisation • security • Disadvantages: • inconsistent access speed • backup vulnerability
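A sketch of the idea, assuming customer rows are fragmented by the branch they bank with (table contents and site names are invented):

    customers = [
        {"id": 1, "name": "Ada",   "branch": "london"},
        {"id": 2, "name": "Grace", "branch": "paris"},
        {"id": 3, "name": "Alan",  "branch": "london"},
    ]

    # horizontal partitioning: whole rows go to the site of their own branch
    fragments = {}
    for row in customers:
        fragments.setdefault(row["branch"], []).append(row)

    # the full table is simply the union of the fragments
    full_table = [row for fragment in fragments.values() for row in fragment]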
Vertical partitioning • some columns are projected into base relations at different sites • all fragments share a common key domain, so the full table can be reconstructed by joining them • Advantages: • tailor-made support for functional areas • the same advantages as horizontal partitioning • Disadvantages: • some queries might be very slow • users must understand some design issues
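A sketch using the same invented customer table: each site keeps the shared key plus only the columns it needs, and the original relation is rebuilt by joining the fragments on that key:

    customer = {
        1: {"name": "Ada",   "credit_limit": 5000, "segment": "A"},
        2: {"name": "Grace", "credit_limit": 8000, "segment": "B"},
    }

    # vertical partitioning: each site projects the columns it uses, keyed by customer id
    finance_site   = {k: {"credit_limit": v["credit_limit"]} for k, v in customer.items()}
    marketing_site = {k: {"segment": v["segment"]} for k, v in customer.items()}
    head_office    = {k: {"name": v["name"]} for k, v in customer.items()}

    # reconstruction: join the fragments on the shared key
    rebuilt = {k: {**head_office[k], **finance_site[k], **marketing_site[k]} for k in customer}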
Combinations of the three methods • most of the time, companies will use a combination of these methods • each method is efficient in certain situations and meets particular security requirements • eg: local customers, information originating at a certain site, shared processes that require the same data at all sites • the design issue is to identify the optimal distribution: data placed at the sites where it is used most
Distributed DBMS • additional roles to play in the case of a distributed DB • determine where the data to be retrieved is located • translate the request into the language used by the local DBMS • deal with the normal data management functions, security matters, locking, query optimisation...
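A minimal sketch of the first two roles, using an invented data-location directory that maps each table to the site holding it:

    # invented directory: which site holds which table
    directory = {"supplier": "detroit", "shipment": "detroit", "part": "chicago"}

    def route(table, predicate):
        # role 1: determine where the data to be retrieved is located
        site = directory[table]
        # role 2: translate the request into something the local DBMS understands
        local_query = f"SELECT * FROM {table} WHERE {predicate}"
        return site, local_query

    print(route("part", "color = 'Red'"))   # ('chicago', "SELECT * FROM part WHERE color = 'Red'")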
Heterogeneous Distributed DBMS • a different DBMS running at each site • a master DBMS controlling the interactions amongst the parts • not practical today (compatibility) • more often, each DBMS follows the same data architecture
Problems with global transactions • DBMSs can be radically different - relational versus network, for instance • only some state-of-the-art commercial products have translating capabilities • one alternative solution is to put some essential data, and the directory of data locations, on a central server • real distributed DBMSs solve these problems for the users with the help of the network operating system (NOS)
Commit Protocol • ensures the integrity of the data in update operations • a well-defined procedure based on the exchange of messages (“ok” or “not ok”) • each global transaction must either be committed at every site or aborted at every site • Two-phase commit: • the site originating the transaction sends requests to all sites involved in the update • all sites attempt to process their part of the transaction without committing the data (temp files) • they notify the first site whether this succeeded (OK) or not • if the first site collects an OK from every site, it sends the order to commit the data; otherwise it orders all sites to abort
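A schematic sketch of the two phases from the originating (coordinator) site's point of view; the Participant class stands in for a site's local DBMS and is invented for the illustration (a real protocol also needs logging and timeout handling):

    class Participant:
        # stand-in for one site's local DBMS
        def __init__(self, name, will_succeed=True):
            self.name, self.will_succeed = name, will_succeed
        def prepare(self):          # try the update in a temporary area, report ok / not ok
            return self.will_succeed
        def commit(self):
            print(f"{self.name}: commit")
        def abort(self):
            print(f"{self.name}: abort")

    def two_phase_commit(sites):
        votes = [s.prepare() for s in sites]   # phase 1: every site attempts its part without committing
        if all(votes):                         # phase 2: commit only if every site replied OK
            for s in sites:
                s.commit()
            return "committed"
        for s in sites:                        # any "not ok" aborts the global transaction everywhere
            s.abort()
        return "aborted"

    print(two_phase_commit([Participant("detroit"), Participant("chicago")]))          # committed
    print(two_phase_commit([Participant("detroit"), Participant("chicago", False)]))   # aborted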
Timestamping • an alternative to locking (which carries the possibility of deadlocks) • ensures that transactions are processed in serial order, so locking is not needed • all updated records carry the timestamp of the transaction that modified them • if a new transaction attempts to update a record carrying an earlier timestamp, the update is allowed • if the record carries a later timestamp, update access is denied; the transaction is re-stamped and restarted
Example: • record in the DB carries timestamp 168 • record update stamped 170: OK, the updated record now carries 170 • record update stamped 165: denied (the record already carries a later stamp); the transaction is re-stamped 170 and restarted (ie: do it again), and the updated record carries 170 • +++: costly deadlock situations are avoided • ----: transactions may sometimes be restarted even though they did not conflict with previous ones
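A sketch of the update rule under these assumptions (the record, the transactions and the global counter used for re-stamping are all invented for the example; real schemes also timestamp reads):

    next_stamp = 200   # invented global counter used to re-stamp restarted transactions

    def try_update(record, transaction):
        global next_stamp
        if transaction["ts"] > record["ts"]:
            record["ts"] = transaction["ts"]    # record carries an earlier stamp: update allowed
            return "ok"
        next_stamp += 1
        transaction["ts"] = next_stamp          # record carries a later stamp: deny, re-stamp ...
        return "denied, re-stamped, restarted"  # ... and run the transaction again from the start

    record = {"ts": 168}
    print(try_update(record, {"ts": 170}))   # ok     -> record now carries 170
    print(try_update(record, {"ts": 165}))   # denied -> transaction re-stamped (201) and restarted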
Effect of design on speed • how to design fast queries • a simple example with two sites in a relational DB: • supplier (Supplier#, ..., City): 10,000 records stored in Detroit • part (Part#, ..., Color): 100,000 records stored in Chicago • shipment (Supplier#, ..., Part#): 1,000,000 records stored in Detroit • each record is 100 characters long, and there are 10 red parts • data transmission is 10,000 characters/second, with a 1-second delay in any communication • data processing time is negligible • Write the SQL statement • Imagine how the query can be carried out between the two sites
SQL statement
    select supplier.supplier#
    from supplier, part, shipment
    where supplier.city = 'Cleveland'
      and supplier.supplier# = shipment.supplier#
      and shipment.part# = part.part#
      and part.color = 'Red'
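A rough, back-of-the-envelope comparison of two ways to execute this query, using only the figures given above (a sketch of the reasoning, not a full optimisation; the two strategies are illustrative):

    CHARS_PER_RECORD = 100
    RATE = 10_000      # characters transmitted per second
    DELAY = 1          # fixed delay per communication, in seconds

    def transfer_seconds(n_records):
        return DELAY + n_records * CHARS_PER_RECORD / RATE

    # strategy 1: ship the whole part table (100,000 records) from Chicago to Detroit,
    # then perform the entire join in Detroit
    print(transfer_seconds(100_000))   # 1001.0 seconds - roughly 17 minutes

    # strategy 2: select the 10 red parts in Chicago first and ship only those to Detroit,
    # where supplier and shipment already reside
    print(transfer_seconds(10))        # 1.1 seconds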
Conclusions • Reasonably easy to optimise query with two tables • Very complex with more than two (try with 30!) • Rules: • Queries must be broken down into components isolated at different sites (minimise communication time and traffic) • Determine which site has the potential to yield FEWER selected records • Move preliminary results to site where rest of the work can be performed (ie: try to move as few records as possible)