Distributed Databases

Distributed Databases • Business needs for distributed databases • Introduction to distributed databases • Subscriber / Publisher Model • Snapshots • Transactional Replication • Merge Replication • Dissimilar Databases • Implementing Distributed DB • Design Implications • Advantages & Disadvantages

Business Needs for Distributed Databases • The concept of a central database to handle all of the organization’s needs has several potential limitations • Geographically dispersed organization requires extensive database traffic • Large organization creates congestion at the server • Large volumes of data must be moved across the network • The entire organization can be vulnerable to a problem with a single server • Data communications interruptions can disrupt the entire organization’s operations

Business Needs for Distributed Databases (cont.) • Central database limitations (cont.) • Dissimilar operating units create differing data access needs • Local units require autonomy over the design and implementation of DB systems • Information sharing across the organization still requires connectivity • Local unit DB designers will not be allowed to design against the entire DB

Business Needs for Distributed Databases (cont.) • Central database limitations (cont.) • Mergers and acquisitions create ad-hoc integration of dissimilar DB systems • Different business units may have fully developed DB and applications on dissimilar platforms, DBMS, etc. • The organization still requires information sharing for organizational effectiveness • Rewriting the whole system in a single DB is impractical (or may take time to implement)

Distributed Databases • Distributed Databases are characterized by decisions made regarding: • Distribution of data schema • All nodes share same schema or not • Update rights on objects (especially table data) • Latency / concurrency requirements • Commonality of DBMS

Subscriber/Publisher Model • A susbcriber / publisher model is often used to describe database updates • Nodes allowed to change data & objects are publishers • Nodes needing to be aware of changes are subscribers • Decisions are made on methods for making subscribers aware of changes and of getting changes to them • Near real time • On demand • Batch • On schedule

Snapshots • Distribution of databases (except in connecting existing databases) usually start with a snapshot of all or part of a DB • Copy of structures, data, SP, triggers, etc. • The snapshot is distributed to all nodes • May be different snapshots to different nodes

A Scenario • Corporate HQ isthe central site • Regional HQ or even ‘retail’ locations are Remote sites • Remote sites executefrequent transactions • Q: What data isneeded in each locationfor the organization’s business needs?

Transactional Replication • In transactional replication aseach transaction is executed on any node it is ‘published’ to all subscribing nodes which also execute the transaction • Data integrity rules are checked at each node • Violation of a data integrity rule at any node can roll back the transaction at all nodes • Data is kept relatively current at all nodes

Transactional Replication (cont.) • Application (“business”) needs control urgency and frequency of updates • Some data is read only at some nodes • Price schedule might be set centrally and only read locally • Sales transactions are probably executed locally and rolled up centrally

Transactional Replication • When is Transactional Replication appropriate? • Higher interaction between actions at nodes (easier to cause conflicts with out of date data) • Decision making requires updated information • Frequent changes can cause concurrency problems • Connectivity is not an issue • Detected problems can result in near-real time rollbacks

Merge Replication • In Merge Replication subscribers may receive a partition of the data • Certain rows • Only customers or employees in their region • Certain columns • Employee contact info but not salary info • Subscribers may add, update, or delete rows to which they have write access • Changes are committed (published) to the subcribers in a batch (merged back into the subscriber DB)

Merge Replication (cont.) • System is able to detect whenremote site copy of data haschanged (including newrecords) • Changed data is marked forupdating in central copyduring merge

Merge Replication (cont) • When is merge replication appropriate? • Few chances for node operations to create conflicts • Highly autonomous activities • Different lines of business • Infrequent changes requiring immediate awareness by all subscribers • Physical connectivity issues • May create more complex problems when a conflict does occur • Rolling back already committed transactions

Dissimilar Databases • Distributed DB nodes may be dissimilar on two dimensions • DB architecture (table structure, field data types/names, etc.) • DBMS and OS (may not even be relational data) • “Messages” sent between nodes to inform them of updates must be translated somewhere • Imposes new layers of complexity for connectivity • SQL Server provides support for this process • Many third party products for logical integration

Implementing DB Distribution • SQL Server comes with a wealth of distributed DB management tools • Specify publication schedules, rights, update frequencies, etc. • Manage conflicts when they occur and notify clients • Perform translations between DBMS • Perform translations between structures

Design Implications • Some DB designs may change when the DB is replicated • Relationships may not be enforced in remote nodes because matching parent rules may not exist • GUID attributes may be needed for PKs since independently generated Identity attributes could conflict when rolled up • Triggers or constraints may be different • May violate locally but be OK globally • Vice-versa

Database Distribution Advantages & Tradeoffs • Key advantages of distributed DB • Increased reliability • Local access and control • Modular growth • Lower communication costs • Faster response What are the mechanisms thatgive rise to theseadvantages?

Database Distribution Advantages & Tradeoffs (cont.) • Disadvantages of distributed DB • Software cost & complexity • Keeping data current • Maintaining data integrity • Integrating multiple sites and applications • Processing overhead • Data integrity • Slow response from poor design

Distributed DBMS (cont.) • Distributed DBMS attempts to achieve “Location Transparency” • User or application will not need to know that the query is going to multiple nodes • User has one integrated DB schema • Distributed DBMS performs all network operations • Also seek to achieve “Replication Transparency” • Replication operations are performed automatically • Manages multiple updates against different copies of replicated data

Distributed Databases