Distributed Databases John Ortiz
Distributed Databases • Distributed Database (DDB) is a collection of interrelated databases interconnected by a computer network • Distributed Database Management System (DDBMS) is software which manages a distributed database • World Wide Web technology does not yet constitute a DDB by our definition
Advantages of a DDB • Supports various levels of transparency • Distribution (network) transparency • Degree to which user is unaware of the networked nature of the DB • Replication transparency • Degree to which user is unaware of copies of the DB • Fragmentation transparency • Degree to which user is unaware the DB is broken into pieces
Advantages of a DDB • Increased Reliability and Availability • Reliability – probability a system is running at a particular point in time • Availability – probability a system is continuously available during a time interval
Advantages of a DDB • Improved Performance • Supports data localization – data is kept near where it is most often used to reduce the effects of network delay • Easier Expansion • Adding more data, increasing DB size, and adding resources are easier • Reduced Operation Costs (when considering a mainframe system) • Cheaper to add workstations than a new mainframe computer
Advantages of a DDB • No Single Point of Failure • When one computer fails, others can take its place
Disadvantages of a DDB • Significant increase in complexity • Normalization, query optimization, security, transaction processing, concurrency control, crash recovery, etc. ALL become much more difficult to handle • Increased storage requirements • Since multiple copies of various portions of the DB exist, more storage space is required
Data Fragmentation • Fragmentation is the division of the database into pieces stored at different sites • Horizontal Fragmentation – a subset of tuples in a particular relation • the result of a query which SELECTS some tuples, but not others produces a horizontal “fragment” • In a DDB, the output from the previous query may be stored as a separate DB at a separate site • Requires a UNION to recombine information
Data Fragmentation • Vertical Fragmentation – a subset of attributes of a particular relation • The result of a query which PROJECTS certain, specific attributes • Requires an outer join (or an outer union) to recombine information • Hybrid Fragmentation – can you guess? • Includes both horizontal and vertical fragmentation • Complete fragmentation simply means all tuples/attributes are in the result
Data Fragmentation • A fragmentation schema is a definition of the set of fragments that includes all attributes and tuples sufficient to reconstruct the DB • An allocation schema describes which fragments are at what sites
Data Replication • Replication is the creation of copies of the DB • A DDB may be fully replicated (a copy of the entire DB is made at each site) • Why would you want to make a full copy of a DDB? • A DDB may have no replication (each fragment is stored at one and only one site) • Naturally, a DDB may be partially replicated • A replication schema is a description of what pieces are copied at which sites
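The fragmentation ideas above can be illustrated with a minimal Python sketch. The EMPLOYEE rows, the attribute names, and the helper functions are all hypothetical and are not taken from the slides. The sketch also assumes that every vertical fragment keeps the key (ssn), so a plain key-match join is enough here to stand in for the outer join the slides mention.

```python
# Hypothetical EMPLOYEE relation; rows and attribute names are illustrative only.
EMPLOYEE = [
    {"ssn": 1, "name": "Ann", "dept": 5, "salary": 50000},
    {"ssn": 2, "name": "Bob", "dept": 4, "salary": 42000},
    {"ssn": 3, "name": "Cy", "dept": 5, "salary": 61000},
]

def horizontal_fragment(relation, predicate):
    """SELECT: keep whole tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def vertical_fragment(relation, key, attributes):
    """PROJECT: keep the key plus the chosen attributes of every tuple."""
    return [{a: row[a] for a in (key, *attributes)} for row in relation]

def union(*fragments):
    """Recombine horizontal fragments (assumed disjoint in this sketch)."""
    return [row for fragment in fragments for row in fragment]

def join_on_key(left, right, key):
    """Recombine vertical fragments by matching on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

def by_ssn(rows):
    """Sort rows by key so relations can be compared regardless of order."""
    return sorted(rows, key=lambda r: r["ssn"])

# Two horizontal fragments that together cover every tuple.
dept5 = horizontal_fragment(EMPLOYEE, lambda r: r["dept"] == 5)
others = horizontal_fragment(EMPLOYEE, lambda r: r["dept"] != 5)

# Two vertical fragments that together cover every attribute.
names = vertical_fragment(EMPLOYEE, "ssn", ["name", "dept"])
pay = vertical_fragment(EMPLOYEE, "ssn", ["salary"])

rebuilt_h = by_ssn(union(dept5, others))
rebuilt_v = by_ssn(join_on_key(names, pay, "ssn"))
```

Both rebuilt_h and rebuilt_v contain the same tuples as the original relation, which is the property a fragmentation schema has to guarantee.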
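One way to picture a replication schema is as a mapping from each fragment to the set of sites that hold a copy. The sketch below is only an illustration under that assumption; the fragment names, the site names, and the classify function are hypothetical.

```python
# Hypothetical site names; a replication schema maps fragment -> sites holding a copy.
SITES = {"site1", "site2", "site3"}

def classify(replication_schema, sites):
    """Return 'full', 'none', or 'partial' for a fragment -> sites mapping."""
    copies = [set(holders) for holders in replication_schema.values()]
    if all(holders == sites for holders in copies):
        return "full"      # every fragment is stored at every site
    if all(len(holders) == 1 for holders in copies):
        return "none"      # each fragment is stored at exactly one site
    return "partial"       # anything in between

full = {"F1": {"site1", "site2", "site3"}, "F2": {"site1", "site2", "site3"}}
none = {"F1": {"site1"}, "F2": {"site2"}}
partial = {"F1": {"site1", "site2"}, "F2": {"site3"}}

kinds = [classify(schema, SITES) for schema in (full, none, partial)]
```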
Data Replication • Replication creates new consistency and redundancy problems • Every piece of data that is replicated is redundant, and therefore subject to becoming inconsistent • These copies may be updated separately, which causes inconsistency • How much inconsistency is acceptable?
Synchronization • Synchronization is the process of updating the individual replicas • Since pieces are stored in different places, the DDB must periodically be made consistent • Synchronization can be expensive in terms of network resources and time • It is not simply copying one replica to another – the most recent updates on both copies being synchronized must be accounted for • P.775 - 778 in the text has an example of a DDB
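To see why synchronization is more than copying one replica over another, consider the rough sketch below. It assumes every record carries a last-update timestamp and that the newer update wins; that policy, the record layout, and the example addresses are all assumptions made for illustration, not something the slides specify. Deletions are not handled.

```python
def synchronize(replica_a, replica_b):
    """Merge two replicas; each maps key -> (value, last_update_time).

    For every key, the copy with the later timestamp is kept, so updates
    made on either side survive the merge.
    """
    merged = dict(replica_a)
    for key, (value, stamp) in replica_b.items():
        if key not in merged or stamp > merged[key][1]:
            merged[key] = (value, stamp)
    return merged

# Hypothetical address-book replicas at two sites.
base_x = {"ann": ("ann@x.example", 10), "bob": ("bob@x.example", 12)}
base_y = {"ann": ("ann@y.example", 15), "cy": ("cy@y.example", 11)}

merged = synchronize(base_x, base_y)
```

Copying base_x over base_y would lose the newer entry for ann and the new entry for cy; copying the other way would lose bob. The merge keeps all three.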
US Air Force Email • We have noted in the past that there are many types of databases such as spreadsheets, address books, and even documents (such as MS Word) • Consider the AF with approximately 500,000 people who all have email addresses and need to communicate • They have constructed a global email address book and make use of replication • The AF is divided into levels: global, command, base
US Air Force Email • Initially the bases were each set up with email and interconnected via the network • However, you had to know the email address of anyone at a different base • Eventually, each command (a group of related bases) set up an address book consisting of all the bases • Each base maintains a complete replica of the entire command's address book • Why not just a piece?
US Air Force Email • The DB is synchronized each night • So, when someone moves, their email address is removed from the local copy • All the other bases will still have that “old” email address until the next day, at which point the DDB is consistent again • I believe that now the entire AF address book is available at each base • Not sure how often it is synchronized, perhaps weekly
US Air Force Email • Search for an email address is quick since a local copy is kept • This reduces network traffic considerably compared with everyone having to search a centralized DB for email addresses
Query Processing in DDB • When we looked at query processing before, the largest delay was with the disk • Now, that same concept is extended to include network delay – which can be much longer • Suppose the EMPLOYEE DB (10,000 records, 100 bytes each) is at site 1, and the DEPARTMENT DB (100 records, 35 bytes each) is at site 2 • YOU are at site 3 • Assume result is 400,000 bytes
Query Processing in DDB • SELECT E_Name • FROM EMPLOYEE • WHERE DeptNum = 5 • There are 3 strategies: • 1) Txfr both DBs to site 3 to perform the query • (1,003,500 bytes txfr’d) • 2) Txfr EMPLOYEE to site 2, perform the query, txfr result to site 3 (1,400,000 bytes txfr’d) • 3) Txfr DEPARTMENT to site 1, perform the query, txfr result to site 3 (403,500 bytes)
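The byte counts for the three strategies follow directly from the relation sizes given earlier (10,000 × 100 bytes for EMPLOYEE, 100 × 35 bytes for DEPARTMENT, and a 400,000-byte result). The short script below just redoes that arithmetic; the variable names are illustrative.

```python
# Sizes taken from the slides.
EMPLOYEE_BYTES = 10_000 * 100      # 1,000,000 bytes at site 1
DEPARTMENT_BYTES = 100 * 35        # 3,500 bytes at site 2
RESULT_BYTES = 400_000             # assumed size of the query result

# Bytes moved across the network under each strategy.
strategies = {
    "1: ship both relations to site 3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "2: ship EMPLOYEE to site 2, result to site 3": EMPLOYEE_BYTES + RESULT_BYTES,
    "3: ship DEPARTMENT to site 1, result to site 3": DEPARTMENT_BYTES + RESULT_BYTES,
}

# The cheapest strategy is the one that transfers the fewest bytes.
best = min(strategies, key=strategies.get)

for name, cost in strategies.items():
    print(f"{name}: {cost:,} bytes")
```

Strategy 3 moves the least data because it ships only the small relation and the result.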
Query Processing using Semijoin • Rather than sending the entire set of records to be joined, we could just send the joining attribute(s) • Then the join is performed, and the join attributes, as well as the projected attributes, can be transferred to the requesting site • The semijoin is symbolized as: R ⋉ S • NOTE: R ⋉ S ≠ S ⋉ R • Substantially reduces amount of data txfr’d
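A semijoin can be sketched in a few lines of Python. The relations, attribute names, and the semijoin helper below are hypothetical; the point is only that one site ships the join-attribute values rather than whole records, and that the operation is not commutative.

```python
def semijoin(r, s_join_values, attr):
    """R semijoin S: the tuples of R whose join attribute also appears in S."""
    return [row for row in r if row[attr] in s_join_values]

# Hypothetical relations held at two different sites.
EMPLOYEE = [
    {"name": "Ann", "dept": 5},
    {"name": "Bob", "dept": 4},
    {"name": "Cy", "dept": 9},
]
DEPARTMENT = [
    {"dept": 5, "dname": "Research"},
    {"dept": 4, "dname": "Admin"},
    {"dept": 1, "dname": "HQ"},
]

# Step 1: the DEPARTMENT site sends only its join-attribute values.
dept_nums = {d["dept"] for d in DEPARTMENT}

# Step 2: the EMPLOYEE site keeps just the tuples that will find a match.
emp_reduced = semijoin(EMPLOYEE, dept_nums, "dept")

# The other direction yields DEPARTMENT tuples instead, a different result.
emp_nums = {e["dept"] for e in EMPLOYEE}
dept_reduced = semijoin(DEPARTMENT, emp_nums, "dept")
```

Only emp_reduced (and only the attributes the query needs) then has to travel to the requesting site.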
Concurrency Control and Recovery • Dealing with multiple copies • Failure of individual sites • Failure of network • Distributed commit is more complicated • Deadlock is more difficult to detect and prevent • A number of techniques have been proposed to deal with these problems
Distinguished Copy • The locks for a data item are associated with the distinguished copy • There are several distinguished copy variations: • Primary site (with backup) • One site is the chosen one and coordinates locking activities (centralized locking) • Primary copy • Various fragments at different sites are chosen as the distinguished copy – this distributes the locking problem
Distributed Recovery • Very complex • Suppose that X sends a request to Y – there may be a number of reasons the request was not granted • Message was never delivered • Site Y is down • Site Y sent a response but the response was not delivered
Summary • Re-read the first 23 slides! • Advantages/Disadvantages of a DDB • The 3 Transparencies: network, replication, fragmentation • Fragmentation • Replication and Synchronization • Query Processing in a DDB • Semijoin • Concurrency Control and Recovery
Primary Site Technique
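The primary copy variation can be sketched as a lookup from each data item to the site holding its distinguished copy, with one lock table per site. Everything named below (LockManager, PRIMARY_COPY_SITE, the item and site names) is hypothetical, and the sketch assumes exclusive locks only, with no queuing, backup site, or failure handling.

```python
class LockManager:
    """Exclusive locks for the items whose distinguished copy lives at one site."""

    def __init__(self):
        self.holders = {}  # item -> transaction currently holding the lock

    def acquire(self, item, txn):
        """Grant the lock unless a different transaction already holds it."""
        if self.holders.get(item, txn) != txn:
            return False
        self.holders[item] = txn
        return True

    def release(self, item, txn):
        """Release the lock if this transaction is the holder."""
        if self.holders.get(item) == txn:
            del self.holders[item]

# Primary copy: different items have their distinguished copy at different sites.
# Under the primary site variation, every item would map to the same site instead.
PRIMARY_COPY_SITE = {"EMP_frag1": "site1", "DEPT": "site2"}
managers = {site: LockManager() for site in set(PRIMARY_COPY_SITE.values())}

def lock(item, txn):
    """Route the lock request to the site holding the item's distinguished copy."""
    return managers[PRIMARY_COPY_SITE[item]].acquire(item, txn)

def unlock(item, txn):
    managers[PRIMARY_COPY_SITE[item]].release(item, txn)

first = lock("DEPT", "T1")        # granted at site2
second = lock("DEPT", "T2")       # refused, T1 holds the lock
third = lock("EMP_frag1", "T2")   # granted at site1, a different lock table
unlock("DEPT", "T1")
fourth = lock("DEPT", "T2")       # granted now that T1 has released
```

Because lock requests for different items go to different sites, no single site has to coordinate all locking.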