Distributed Databases -> Data Distribution


Distributed Data Storage

Assume the relational data model.
• Replication: the system maintains multiple copies of data, stored at different sites, for faster retrieval and fault tolerance.
• Fragmentation: a relation is partitioned into several fragments stored at distinct sites.
• Replication and fragmentation can be combined: a relation is partitioned into several fragments, and the system maintains several identical replicas of each fragment.

Data Replication

• A relation or fragment of a relation is replicated if it is stored redundantly at two or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the entire database.

Data Replication (Cont.)

Advantages of Replication
• Availability: failure of a site containing relation r does not make r unavailable if replicas exist.
• Parallelism: queries on r may be processed by several nodes in parallel.
• Reduced data transfer: relation r is available locally at each site containing a replica of r.

Disadvantages of Replication
• Increased cost of updates: each replica of relation r must be updated.
• Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.
  - One solution: choose one copy as the primary copy and apply concurrency control operations to the primary copy.
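The primary-copy idea can be illustrated with a small sketch. This is a toy model, not a full protocol: the class name `ReplicatedRelation`, the site names S1 to S3, and the choice to propagate every update synchronously to all replicas are assumptions made for the example.

```python
class ReplicatedRelation:
    """Toy model of one relation replicated at several sites (hypothetical)."""

    def __init__(self, sites, primary):
        if primary not in sites:
            raise ValueError("primary must be one of the replica sites")
        self.primary = primary
        # One independent copy of the data per site.
        self.copies = {site: {} for site in sites}

    def read(self, site, key):
        # Reads are served from the local replica, so no data transfer is needed.
        return self.copies[site].get(key)

    def write(self, key, value):
        # Updates are applied at the primary copy first ...
        self.copies[self.primary][key] = value
        # ... and then at every other replica (the increased cost of updates).
        for site, copy in self.copies.items():
            if site != self.primary:
                copy[key] = value


account = ReplicatedRelation(sites=["S1", "S2", "S3"], primary="S1")
account.write("A-305", 500)
print(account.read("S3", "A-305"))
```

Routing every write through one copy gives updates a single point at which they are ordered, at the price of one extra write per replica.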

Data Fragmentation

Division of relation r into fragments r1, r2, …, rn which contain sufficient information to reconstruct relation r.
• Horizontal fragmentation: each tuple of r is assigned to one or more fragments.
• Vertical fragmentation: the schema for relation r is split into several smaller schemas.
  - All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property.
  - A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.

Example: relation account with the following schema:
Account = (branch_name, account_number, balance)

Horizontal Fragmentation of account Relation

account1 = σ_{branch_name = "Hillside"}(account)

branch_name   account_number   balance
Hillside      A-305            500
Hillside      A-226            336
Hillside      A-155            62

account2 = σ_{branch_name = "Valleyview"}(account)

branch_name   account_number   balance
Valleyview    A-177            205
Valleyview    A-402            10000
Valleyview    A-408            1123
Valleyview    A-639            750
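The two selections above can be reproduced in a few lines of Python. This is only a sketch: the list-of-dictionaries representation and the helper name `select` are assumptions for illustration, while the tuples are the ones shown in the example.

```python
# The account relation, one dictionary per tuple.
account = [
    {"branch_name": "Hillside", "account_number": "A-305", "balance": 500},
    {"branch_name": "Hillside", "account_number": "A-226", "balance": 336},
    {"branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
    {"branch_name": "Valleyview", "account_number": "A-402", "balance": 10000},
    {"branch_name": "Hillside", "account_number": "A-155", "balance": 62},
    {"branch_name": "Valleyview", "account_number": "A-408", "balance": 1123},
    {"branch_name": "Valleyview", "account_number": "A-639", "balance": 750},
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# One horizontal fragment per branch.
account1 = select(account, lambda t: t["branch_name"] == "Hillside")
account2 = select(account, lambda t: t["branch_name"] == "Valleyview")


def by_account_number(t):
    return t["account_number"]


# Reconstruction: the union of the fragments gives back the relation.
reconstructed = sorted(account1 + account2, key=by_account_number)
print(reconstructed == sorted(account, key=by_account_number))
```

Because the two predicates are mutually exclusive, plain list concatenation behaves like a union here; overlapping fragments would need duplicate elimination.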

Vertical Fragmentation of employee_info Relation

deposit1 = Π_{branch_name, customer_name, tuple_id}(employee_info)

branch_name   customer_name   tuple_id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7

deposit2 = Π_{account_number, balance, tuple_id}(employee_info)

account_number   balance   tuple_id
A-305            500       1
A-226            336       2
A-177            205       3
A-402            10000     4
A-155            62        5
A-408            1123      6
A-639            750       7
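A short sketch shows how the tuple_id attribute makes the vertical fragments joinable. The helper names `project` and `join_on` and the dictionary representation are assumptions for illustration; the data is taken from the example.

```python
employee_info = [
    {"tuple_id": 1, "branch_name": "Hillside", "customer_name": "Lowman",
     "account_number": "A-305", "balance": 500},
    {"tuple_id": 2, "branch_name": "Hillside", "customer_name": "Camp",
     "account_number": "A-226", "balance": 336},
    {"tuple_id": 3, "branch_name": "Valleyview", "customer_name": "Camp",
     "account_number": "A-177", "balance": 205},
    {"tuple_id": 4, "branch_name": "Valleyview", "customer_name": "Kahn",
     "account_number": "A-402", "balance": 10000},
    {"tuple_id": 5, "branch_name": "Hillside", "customer_name": "Kahn",
     "account_number": "A-155", "balance": 62},
    {"tuple_id": 6, "branch_name": "Valleyview", "customer_name": "Kahn",
     "account_number": "A-408", "balance": 1123},
    {"tuple_id": 7, "branch_name": "Valleyview", "customer_name": "Green",
     "account_number": "A-639", "balance": 750},
]


def project(relation, attrs):
    """Projection onto attrs. No duplicate elimination is needed here
    because every projection keeps the unique tuple_id."""
    return [{a: t[a] for a in attrs} for t in relation]


def join_on(left, right, attr):
    """Equi-join of two fragments on a shared key attribute."""
    index = {t[attr]: t for t in right}
    return [{**l, **index[l[attr]]} for l in left if l[attr] in index]


deposit1 = project(employee_info, ["branch_name", "customer_name", "tuple_id"])
deposit2 = project(employee_info, ["account_number", "balance", "tuple_id"])

# Joining on tuple_id reconstructs the original relation without loss.
reconstructed = join_on(deposit1, deposit2, "tuple_id")
print(reconstructed == employee_info)
```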

Correctness of Fragmentation

• Completeness: a decomposition of relation R into fragments R1, R2, ..., Rn is complete if and only if each data item in R can also be found in some Ri.
• Reconstruction: if relation R is decomposed into fragments R1, R2, ..., Rn, then there should exist some relational operator ∇ such that R = ∇_{1≤i≤n} Ri.
• Disjointness: if relation R is decomposed into fragments R1, R2, ..., Rn, and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j).
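For horizontal fragments, the three rules can be checked mechanically. The sketch below assumes that a data item is a whole tuple, that relations are Python sets of tuples, and that the reconstruction operator ∇ is set union; the function names are hypothetical.

```python
def union_all(fragments):
    """Union of all fragments; plays the role of the operator ∇ here."""
    result = set()
    for fragment in fragments:
        result |= fragment
    return result


def is_complete(relation, fragments):
    # Every tuple of R appears in at least one fragment.
    return relation <= union_all(fragments)


def is_reconstructible(relation, fragments):
    # The union of the fragments is exactly R.
    return union_all(fragments) == relation


def is_disjoint(fragments):
    # No tuple appears in more than one fragment.
    seen = set()
    for fragment in fragments:
        if seen & fragment:
            return False
        seen |= fragment
    return True


account1 = {("Hillside", "A-305", 500), ("Hillside", "A-226", 336),
            ("Hillside", "A-155", 62)}
account2 = {("Valleyview", "A-177", 205), ("Valleyview", "A-402", 10000),
            ("Valleyview", "A-408", 1123), ("Valleyview", "A-639", 750)}
account = account1 | account2

fragments = [account1, account2]
print(is_complete(account, fragments),
      is_reconstructible(account, fragments),
      is_disjoint(fragments))
```

For vertical fragments the same checks would be phrased over attributes rather than tuples, with a join as the reconstruction operator and the shared key exempt from the disjointness rule.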

Advantages of Fragmentation

Horizontal:
• allows parallel processing on fragments of a relation
• allows a relation to be split so that tuples are located where they are most frequently accessed

Vertical:
• allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed
• tuple-id attribute allows efficient joining of vertical fragments
• allows parallel processing on a relation

Vertical and horizontal fragmentation can be mixed: fragments may be successively fragmented to an arbitrary depth.

Data Allocation

Four alternative strategies regarding the placement of data:
• Centralized: a single database and DBMS stored at one site, with users distributed across the network.
• Partitioned: the database is partitioned into disjoint fragments, each fragment assigned to one site.
• Complete replication: a complete copy of the database is maintained at each site.
• Selective replication: a combination of partitioning, replication, and centralization.

[Table: Comparison of strategies]

Transparencies in a DDBMS

Transparency hides implementation details from users. The overall objective is to make the DDBMS appear to the user like a centralised DBMS, although full transparency is not a universally accepted objective.

Four main types:
• Distribution transparency
• Transaction transparency
• Performance transparency
• DBMS transparency (only applicable to heterogeneous DDBMSs)

1. Distribution Transparency

Distribution transparency allows the user to perceive the database as a single, logical entity. If the DDBMS exhibits distribution transparency, the user does not need to know how the data is distributed:
• Fragmentation transparency: the user does not need to know that the data is fragmented.
• Location transparency: the user does not need to know the location of data items. If the user must specify both the fragments and their locations, this is called local mapping transparency.
• Replication transparency: the user is unaware of the replication of fragments.
• Naming transparency: each item in a DDB must have a unique name.
  - One solution: create a central name server. This has drawbacks: loss of some local autonomy, the central site may become a bottleneck, and low availability if the central site fails.
  - Alternative solution: prefix each object with the identifier of the site that created it, and also identify each fragment and each of its copies. Each site can then use aliases for these names.
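The alternative naming scheme can be sketched as follows. The dotted name format, the `F`/`C` prefixes for fragment and copy numbers, and the alias table are all assumptions chosen for illustration, not a format prescribed by any particular DDBMS.

```python
def global_name(site, relation, fragment, copy):
    """Build a system-wide unique name from the creator site, the relation,
    the fragment number, and the copy number (hypothetical format)."""
    return f"{site}.{relation}.F{fragment}.C{copy}"


# Each site keeps its own alias table, so local users can refer to an
# object by a short name without a central name server.
aliases_at_s2 = {
    "local_account": global_name("S1", "account", 3, 2),
}


def resolve(alias_table, name):
    """Translate an alias to its global name; unknown names pass through."""
    return alias_table.get(name, name)


print(resolve(aliases_at_s2, "local_account"))
```

Because the creator site is part of the name, two sites can create objects independently without risking a name clash.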

2. Transaction Transparency

Transaction transparency ensures that all distributed transactions maintain the distributed database's integrity and consistency.
• A distributed transaction accesses data stored at more than one location.
• Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
• The DDBMS must ensure the indivisibility of both the global transaction and each of its subtransactions.
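The division of a global transaction into one subtransaction per site can be pictured with a small grouping step. The catalog `fragment_site`, the fragment names, and the operation strings are hypothetical; a real DDBMS would derive them from its allocation schema.

```python
from collections import defaultdict

# Hypothetical catalog: the site at which each fragment is stored.
fragment_site = {"account1": "Hillside", "account2": "Valleyview"}

# A global transaction as a list of (fragment, operation) pairs.
global_transaction = [
    ("account1", "debit A-305 by 100"),
    ("account2", "credit A-177 by 100"),
    ("account1", "read A-226"),
]


def split_by_site(operations, catalog):
    """Group operations into one subtransaction per site to be accessed."""
    subtransactions = defaultdict(list)
    for fragment, operation in operations:
        subtransactions[catalog[fragment]].append(operation)
    return dict(subtransactions)


subtransactions = split_by_site(global_transaction, fragment_site)
for site, operations in subtransactions.items():
    print(site, operations)
```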

2. Transaction Transparency (Cont.)

• Concurrency transparency: all transactions must execute independently and be logically consistent with the results that would be obtained if the transactions were executed one at a time, in some arbitrary serial order.
  - Replication makes concurrency more complex.
• Failure transparency: the DDBMS must ensure the atomicity and durability of the global transaction.
  - This means ensuring that the subtransactions of a global transaction either all commit or all abort.
• Classification of transactions: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
  - Remote request
  - Remote unit of work
  - Distributed unit of work
  - Distributed request
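The all-commit-or-all-abort requirement of failure transparency can be sketched as a voting round in the spirit of a two-phase commit, a protocol the slides do not name. The class `SubTransaction`, the `can_commit` flag, and the function `run_global` are assumptions for illustration; timeouts, logging, and recovery are deliberately left out.

```python
class SubTransaction:
    """Toy subtransaction running at one site (hypothetical)."""

    def __init__(self, site, can_commit=True):
        self.site = site
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Vote: True means this site is able to commit its part.
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_global(subtransactions):
    """Commit the global transaction only if every site votes yes."""
    votes = [sub.prepare() for sub in subtransactions]
    if all(votes):
        for sub in subtransactions:
            sub.commit()
        return "committed"
    # A single "no" vote aborts every subtransaction.
    for sub in subtransactions:
        sub.abort()
    return "aborted"


ok = [SubTransaction("S1"), SubTransaction("S2")]
failing = [SubTransaction("S1"), SubTransaction("S3", can_commit=False)]
print(run_global(ok), run_global(failing))
```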

3. Performance Transparency

• The DDBMS must:
  - suffer no performance degradation due to the distributed architecture;
  - determine the most cost-effective strategy to execute a request.
• The Distributed Query Processor (DQP) maps a data request into an ordered sequence of operations on local databases.
  - It must consider the fragmentation, replication, and allocation schemas.
• The DQP has to decide:
  - which fragment to access
  - which copy of a fragment to use
  - which location to use.
• It produces an execution strategy optimized with respect to some cost function.
• Typically, the costs associated with a distributed request include I/O cost, CPU cost, and communication cost.
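One of the DQP's decisions, choosing which copy of a fragment to use, can be sketched with a simple additive cost function. The `Copy` class, the site names, and all cost figures are invented for illustration; a real optimizer would estimate these values from statistics and would usually weight the three components differently.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """One replica of a fragment with estimated access costs (hypothetical)."""
    site: str
    io_cost: float
    cpu_cost: float
    comm_cost: float


def total_cost(copy):
    # Assumed cost function: plain sum of I/O, CPU, and communication cost.
    return copy.io_cost + copy.cpu_cost + copy.comm_cost


def choose_copy(copies):
    """Pick the replica with the lowest estimated total cost."""
    return min(copies, key=total_cost)


copies = [
    Copy(site="S1", io_cost=4.0, cpu_cost=1.0, comm_cost=10.0),  # remote
    Copy(site="S2", io_cost=6.0, cpu_cost=1.5, comm_cost=0.0),   # local
]
best = choose_copy(copies)
print(best.site, total_cost(best))
```

In this made-up example the local copy wins despite its higher I/O cost, because it avoids the communication cost entirely.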
