280 likes | 302 Views
Distributed Databases (DDB). Introduction to DDB Three DDB Exercises General Electric J.C. Penney Stores UPS Integrating Autonomous Systems Replication Replication Exercise Implementing Replication. Distributed Databases—Introduction. What is a Distributed Database (DDB)?
E N D
Distributed Databases (DDB) • Introduction to DDB • Three DDB Exercises • General Electric • J.C. Penney Stores • UPS • Integrating Autonomous Systems • Replication • Replication Exercise • Implementing Replication
Distributed Databases—Introduction • What is a Distributed Database (DDB)? • How many ways can a database be distributed?(And for how many reasons?) ? ?
Distributed Databases—Introduction (cont.) • A DDB runs on more than one computer (duhhhh) • All or part of the data is distributed • Servers may have unique data or copies of data also on other servers • Some data may be on clients (not servers) for strictly local use • Servers may be right next to each other (same room or even rack) or around the world • DDB may share a strong schema (design) that applies to all instances or may have a logical schema that integrates disparate autonomous DB
Distributed Databases—Introduction (cont.) • Reasons for distributing data • Workload distribution for scalability • Getting data closer to users in a distributed organization • Autonomous operating elements of a diverse organization have different data needs • Reporting needs to central HQ • Coordination needs
Distributed Databases—Introduction (cont.) • Three main types/reasons of/for distributing data • Diverse distributed organizational divisions have unique data needs with some requirements for centralized coordination and/or reporting • Homogeneous distributed organization requires geographically available data but strong needs for centralized management and control • Homogeneous organization requires workload distribution to support scalability • Any combination of these circumstances can exist
Exercises • Three Distributed System Design Exercises • General Electric • J.C. Penney • UPS • Consider • Corporate control • Transaction processing • Reporting • Data integrity • Data warehousing
Exercise #1—The General Electric Case • Consider the GE product line: Light bulbs, large and small consumer appliances, power generation equipment, jet engines, financial services, railroad locomotives, healthcare equipmenthttp://www.ge.com/products_services/index.html • Assume that each division has the databases and systems needed for its internal operations • What are examples of data needs for GE HQ and what technical capabilities must exist to satisfy the needs? • Are there any needs for cross-division data communication? If so, identify technical needs.
Exercise #2—The J.C. Penney Case • Consider the operations of the J.C. Penney retail department stores only. More than 1,000 stores throughout US & Puerto Rico. • What are examples of data needs for JCP HQ and what technical capabilities must exist to satisfy the needs? • Are there unique local needs?
Exercise #3—The UPS Case • Consider UPS customer and package tracking systems characterized by millions of daily transactions, package tracking as a competitive tool, worldwide locations, and disconnected (from the network) transactions at the driver delivery transaction. • What transactions (if any) are local? • What special data sharing must take place? • What is the implication of the disconnected drivers?
Integrating Autonomous Systems • Consider the GE example with additional conditions • Diverse nodes have different DBMS software • Similar tables in different nodes have different structures • Different column structures • Different data types for same/similar columns • Different data integrity requirements • Will most cross-division DB events be: • Queries/retrievals? • Inserts/updates?
Integrating Autonomous Systems (cont.) • Data Warehousing—the simple solution • When corporate needs can be satisfied with a DW solution we can use existing tools • Ideally suited for disparate data sources • Transforming disparate data • Decisions include • Frequency of updates • Granularity • Dimension tables applicable to unique division data
Integrating Autonomous Systems (cont.) • Querying (and Updating)—the Not-So-Simple Solution • The integration solution includes • Local DB Management Component (LDBMC) • Manages the local system • Global Data Dictionary (GDD) • An integrated logical data model for the entire DDB with location and structure information • Distributed DB Management Component (DDBMS) • Provides the interface to the global system for local users
Integrating Autonomous Systems (cont.) • Distributed DB Management Component (DDBMS) • Provides logical DB schema interface or data dictionary • A one-source view of the entire DDB • Provides location transparency • Query Processor (more to come) • Network-wide concurrency control • More applicable to distributed homogeneous systems with integrated operations • Query translation for heterogeneous systems (more)
Integrating Autonomous Systems (cont.) • DDBMS Query Processor • Determines location of data needed to process query • For one-source queries: Sends query to appropriate local or remote location and returns results • For multi-source queries: • Parses query into sub-queries against individual locations • Receives individual results • Assembles a result set from the pieces • May include translation of column elements in heterogeneous systems
Integrating Autonomous Systems (cont.) • A wide variety of commercial products are available for managing distributed autonomous systems • Think of the integration packages available in our Data Warehouse loading packages • Difference is that translations act in near real time • Product an integrated schema with query rules
Replication • Replication consists of • A single physical source or master database • Replicated data in distributed locations to support transactions, reporting, and control • Tools for distributing data • Tools for integrating changes to data • Typically has (mostly) homogeneous data structures and DBMS at all locations • Not all data at every location
Replication (cont.) • Design choices consist of • What data is distributed to which location • Filter by row • Filter by column • What update rights are available at each location • How frequently data must be synchronized • How data is to be synchronized (what replication model is used) • Data integrity rules at different levels • Handling data integrity violations
Replication—The Publisher/Subscriber Model • Replication is spoken of using print publication terms • Publisher—the system node that produces data for consumption by other nodes • Subscriber—the node that receives data • Article—a database object (table, view, stored procedure, etc.) that is to be published • Publication—a collection of articles sent and received as one logical unit • Nodes may be both publishers and subscribers
Snapshot Replication • Snapshot Replication is what the name implies—a complete copy of data sent to a subscriber • Often used as an initial copy of data forming the basis for other kinds of replication • May be read-only data sent to a subscriber • Product list with prices • Authorized vendors • Out-of-synch data may be acceptable for the period between replications • Data may change infrequently
Transactional Replication • Data is changed at the publisher and propagated to subscribers as changes occur • E.g., Units in stock at a warehouse • Customer bank balances • Data is usually treated as read-only at subscriber but need not be • Decisions regarding frequency of update messages are required • Balance workload and network traffic with latency • Consider the costs of latency
Merge Replication • Typically implemented in a Client-Server environment • Client (subscriber) makes changes to data that are published back to server (publisher) • Multiple subscribers may be capable of changing the same data • Data changes may be made offline • Update frequency also balances workload and network traffic with latency • Subscriber changes may be legal at the client but create data integrity violations when updated to the server
Replication (cont.) • All three types of replication may be implemented in same application • Snapshot for base-level data with transactional replication for updates • Snapshot for initial data, merge replication by a subscriber, transactional replication to publish changes to other subscribers • Clients may easily have data not replicated to the Server • Employee shift schedules • Detailed transaction data—only summaries published
Replication Exercise • Consider the University Registration DB we used in data warehousing (next slide) • Assume a state-wide multi-branch university system with a System Central Administration and campuses • Assume homogeneous DBMS at each location • Identify: • Subscription model for each table • Update rights on each table • Update frequency for each table • How much latency will be allowed in each table
Replication—Data Integrity • Many data integrity rules in SQL Server have "Not for replication" options available • Foreign key • Identity attribute values • Triggers • Constraints • If you do not publish Suppliers data to individual stores you would turn off referential integrity checking from Products to Suppliers
Replication—Data Integrity (cont.) • Publisher (server) may have to detect when a subscriber (client) makes a transaction that is legal to the client but illegal to the server • Two sales representatives sell the same product that creates a backorder situation • Two purchasers make purchase on same account that exceeds credit limit • Violation potential must be identified and incorporated into replication strategy
Implementing Replication • SQL Server has a series of Replication Agents that automate replication • Identify publishers and subscribers • Build subscriptions, articles, and publications • Produce snapshots • Implement transaction replication according to rules • Implement merge replication notifications • Manage data integrity rules • Similar tools available in other systems