Distributed Databases • Parallel databases • Architecture • Query evaluation • Query optimization • Distributed databases • Architectures • Data storage • Catalog management • Query processing • Transactions
Introduction • A parallel database system is designed to improve performance through parallelism • Loading data, building indexes, evaluating queries • Data may be stored in a distributed way, but solely for performance reasons • A distributed database system is physically stored across several sites • Each site is managed by an independent DBMS • Distribution is motivated by local ownership and availability, as well as by performance
Motivation • How long does it take to scan a 1 terabyte table at 10MB/s? • 1,099,511,627,776 bytes = 1,024^4 or 2^40 bytes • 10MB = 10,485,760 bytes • 1,099,511,627,776 / 10,485,760 = 104,858 seconds • 104,858 / (60 * 60 * 24) ≈ 1.2 days! • Using 1,000 processors in parallel the time can be reduced to about 105 seconds, i.e. under two minutes
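The slide's arithmetic can be reproduced in a few lines of Python; the table size, scan rate, and processor count below are simply the figures quoted above.

```python
# Back-of-the-envelope scan-time calculation from the slide.
TABLE_BYTES = 2 ** 40        # 1 TB = 1,024^4 = 2^40 bytes
SCAN_RATE = 10 * 2 ** 20     # 10 MB/s = 10,485,760 bytes/s

serial_seconds = TABLE_BYTES / SCAN_RATE
print(f"serial scan: {serial_seconds:,.0f} s = {serial_seconds / 86_400:.1f} days")

parallel_seconds = serial_seconds / 1_000   # 1,000 processors scanning in parallel
print(f"parallel scan: {parallel_seconds:.0f} s = {parallel_seconds / 60:.1f} minutes")
```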
Coarse-Grain and Fine-Grain • A coarse-grain parallel machine consists of a small number of processors • Most current high-end computers • A fine-grain parallel machine uses thousands of smaller processors • Also referred to as a massively parallel machine
Performance • Both throughput and response time can be improved by parallelism • Throughput – the number of tasks completed in a given time • Processing many small tasks in parallel increases throughput • Response time – the time it takes to complete a single task • Subtasks of a large transaction can be performed in parallel, reducing response time
Speed-Up and Scale-Up • Speed-up • More resources means less time for a given amount of data • Scale-up • If resources increase in proportion to the increase in data size, time is constant • [Charts: throughput vs. degree of parallelism and response time vs. degree of parallelism, each shown against its ideal curve]
Parallel Database Architecture • Where possible a parallel database should carry out evaluation steps in parallel • There are many opportunities for parallelism in a relational database • There are three main parallel DBMS architectures • Shared nothing • Shared memory • Shared disk
Shared Memory Architecture • Multiple CPUs attached to an interconnection network • Accessing a common region of main memory • Similar to a conventional system • Good for moderate parallelism • Communication overhead is low • OS services control the CPUs • Interference increases with size • As CPUs are added memory contention becomes a bottleneck • Adding more CPUs eventually slows the system down • [Diagram: processors (P) and disks (D) attached to an interconnection network and a global shared memory]
Shared Disk Architecture • Each CPU has private memory and direct access to data • Through the interconnection network • Good for moderate parallelism • Suffers from interference in the interconnection network • Which acts as a bottleneck • Not a good solution for a large scale parallel system • [Diagram: processors (P), each with private memory (M), sharing the disks (D) through an interconnection network]
Shared Nothing Architecture • Each CPU has local memory and disk space • No two CPUs access the same storage area • All CPU communication is through the network • Increases complexity • Linear speed-up • Operation time decreases in proportion to the increase in CPUs • Linear scale-up • Performance is maintained if the increase in CPUs is proportional to the increase in data • [Diagram: processors (P), each with its own memory (M) and disk (D), connected only by the interconnection network]
Parallel Query Evaluation • A relational query execution plan is a tree, or graph, of relational algebra operators • Operators in a query tree can be executed in parallel • If one operator consumes the output of another, there is pipelined parallelism • Otherwise the operators can be evaluated independently • An operator blocks if it does not produce any output until it has consumed all its inputs • Pipelined parallelism is limited by blocking operators
Single Operator Evaluation • Individual operators can be evaluated in a parallel way by partitioning input data • In data-partitioned parallel evaluation the input data is partitioned, and worked on in parallel • The results are then combined • Tables are horizontally partitioned • Different rows are assigned to different processors
Data Partitioning • Partition using a round-robin algorithm • Partition using hashing • Partition using ranges of field values
Round-Robin • Partition using a round-robin algorithm • Assign record i to processor i mod n • Similar to RAID systems • Suitable for evaluating queries that access the entire table • Less efficient for queries that access ranges of values and queries on equality
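A minimal sketch of round-robin partitioning; representing the table as a Python list and the processors as sub-lists is purely illustrative.

```python
def round_robin_partition(records, n):
    """Assign record i to processor i mod n."""
    partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        partitions[i % n].append(record)
    return partitions

# Example: 10 records spread evenly over 3 processors.
print(round_robin_partition(list(range(10)), 3))
```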
Hash Partitioning • Partition using hashing • A hash function based on selected attributes is applied to each record to determine its processor • The data remains evenly distributed as the table grows, or shrinks, over time • Good for equality selections • Only the one matching partition needs to be searched, leaving the other disks free • Also useful for sequential scans if the partitioning attributes form a candidate key
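A sketch of hash partitioning on a chosen attribute, under the same list-of-dicts assumption as above; the sample employees table and the choice of MD5 as the hash function are illustrative.

```python
import hashlib

def hash_partition(records, key, n):
    """Route each record to the processor chosen by hashing the partitioning attribute."""
    partitions = [[] for _ in range(n)]
    for record in records:
        digest = hashlib.md5(str(record[key]).encode()).hexdigest()
        partitions[int(digest, 16) % n].append(record)
    return partitions

employees = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}, {"id": 3, "city": "Oslo"}]
parts = hash_partition(employees, "city", 4)
# An equality selection such as city = 'Oslo' only needs to visit one partition.
```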
Range Partitioning • Partition using ranges of field values • Ranges are chosen from the sort key values; each range should contain the same number of records • Each disk contains one range • If a range is too large it can lead to data skew • Skew can lead to the processors with large partitions becoming bottlenecks • Good for equality selections, and range selections
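A sketch of range partitioning given a list of split points on the sort key; the names table and the split points "H" and "P" are illustrative.

```python
import bisect

def range_partition(records, key, boundaries):
    """boundaries = sorted split points; a record goes to the partition whose range holds its key."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record[key])].append(record)
    return partitions

names = [{"name": n} for n in ["Adams", "Brown", "Chen", "Novak", "Olsen", "Zhang"]]
parts = range_partition(names, "name", ["H", "P"])   # three ranges: < H, H..P, > P
# A range selection such as name < 'Noober' only touches the first two partitions.
```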
Data Skew • Both hash and range partitioning may result in data skew • Where some partitions are larger or smaller than others • Skew can dramatically reduce the speed-up obtained from parallelism • In range partitioning skew can be reduced by using histograms • A histogram records the number of records in each range of attribute values, and is used to derive evenly sized partitions
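One way to derive even ranges, as the slide suggests, is to build an equi-depth picture of a sample of the sort-key values and use its quantiles as split points. A minimal sketch, assuming the sample fits in memory:

```python
def equi_depth_boundaries(sample_keys, n_partitions):
    """Pick n_partitions - 1 split points so each range holds roughly the same number of keys."""
    keys = sorted(sample_keys)
    step = len(keys) / n_partitions
    return [keys[int(step * i)] for i in range(1, n_partitions)]

sample = [3, 7, 7, 8, 15, 21, 22, 40, 41, 90, 91, 95]
print(equi_depth_boundaries(sample, 4))   # [8, 22, 90] -> four ranges of ~3 keys each
```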
Parallel Evaluation Code • Parallel data streams are used to provide data for relational operators • The streams can come from different disks, or • Output of other operators • Streams are merged or split • Merged to provide the inputs for a relational operator • Split as needed to parallelize processing • These operations can buffer data, and should be able to halt operators that provide their input data • A parallel evaluation plan consists of a network of relational, merge, and split operators
Types of Parallelism • Interquery parallelism • Different queries or transactions execute in parallel • Throughput is increased but response time is not • Easy to support in a shared-memory system • Intraquery parallelism • Executing a single query in parallel to speed up large queries • Which in turn can entail either intraoperation or interoperation parallelism, or both
Parallel Operations • Scanning and loading • Pages can be read in parallel while scanning a relation • The results can be merged • If hash or range partitioning is used selections can be directed to the relevant processors • Sorting • Joins
Sorting • The simplest sort method is for each processor to sort its portion of the table • Then merge the sorted records • The merging phase may limit the amount of parallelism • A better method is to first redistribute the records over the processors using range partitioning • Using the sort attributes • Each processor sorts its set of records • The sets of sorted records are then retrieved in order • To make the partitions even the data in the processors can be sampled
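A single-process simulation of the redistribution-based parallel sort described above: range partition on the sort key, sort each partition locally (this is the step that would run in parallel), then read the partitions back in range order. The helper names and split points are illustrative.

```python
import bisect

def parallel_sort(records, key, boundaries):
    # Step 1: redistribute records over "processors" by range of the sort key.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        partitions[bisect.bisect_right(boundaries, r[key])].append(r)
    # Step 2: each processor sorts its own partition (would happen in parallel).
    for p in partitions:
        p.sort(key=lambda r: r[key])
    # Step 3: concatenating the partitions in range order yields the global sort.
    return [r for p in partitions for r in p]

rows = [{"name": n} for n in ["Olsen", "Adams", "Zhang", "Chen", "Novak", "Brown"]]
print(parallel_sort(rows, "name", ["H", "P"]))
```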
Joins • Join algorithms can be parallelized • Parallelization is most effective for hash or sort-merge joins • Parallel hash join is widely used • The process for parallel hash join is • First partition the two tables across the processors using the same hash function • Join the records locally, using any join algorithm • Merge the results of the local joins, the union of these results is the join of the two tables
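A single-process sketch of the three parallel hash-join steps listed above; each "processor" is simulated by a list index and the local join is a simple in-memory build-and-probe hash join. The employee/department tables are illustrative.

```python
def parallel_hash_join(r_records, s_records, r_key, s_key, n):
    # Step 1: partition both tables with the same hash function on the join attribute.
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for r in r_records:
        r_parts[hash(r[r_key]) % n].append(r)
    for s in s_records:
        s_parts[hash(s[s_key]) % n].append(s)

    # Step 2: join locally at each processor (build a hash table on R, probe with S).
    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        build = {}
        for r in r_part:
            build.setdefault(r[r_key], []).append(r)
        for s in s_part:
            for r in build.get(s[s_key], []):
                result.append({**r, **s})

    # Step 3: the union of the local results is the join of the two tables.
    return result

emps = [{"eid": 1, "dept": 10}, {"eid": 2, "dept": 20}]
depts = [{"dept": 10, "dname": "Sales"}, {"dept": 30, "dname": "HR"}]
print(parallel_hash_join(emps, depts, "dept", "dept", 4))
```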
Improved Parallel Hash Join • If tables are very large parallel hash join may have a high cost at each processor • If each partition is large, multiple passes will be required for the local joins • An alternative approach is to use all processors for each partition • Partition the tables using h1 • Each partition of the smaller relation should fit into the combined memory of the processors • Process each partition using all processors • Use h2 to determine which processor to send records to
Joins on Inequalities • Partitioning is not suitable for joins on inequalities • Such as R ⋈R.a < S.b S • Since any record in R could join with any record in S • Fragment and replicate joins can be used instead • In an asymmetric fragment and replicate join • One of the relations is partitioned • The other relation is replicated across all partitions
Fragment and Replicate • Each relation can be both fragmented and replicated • Into m fragments of R and n of S • However m * n processors are required • This works with any join condition • When partitioning is not possible • [Diagram: an m × n grid of processors P0,0 … Pm-1,n-1; fragment Ri of R is replicated across row i and fragment Sj of S across column j]
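A sketch of the asymmetric variant from the previous slide: R is fragmented (round-robin here), all of S is replicated alongside every fragment, and each "processor" runs a local join with an arbitrary condition. The R.a < S.b condition follows the inequality example above; everything else is illustrative.

```python
def asymmetric_fragment_and_replicate_join(r_records, s_records, n, condition):
    # Fragment R round-robin; replicate all of S alongside every fragment.
    r_parts = [r_records[i::n] for i in range(n)]
    result = []
    for r_part in r_parts:                 # each iteration stands in for one processor
        for r in r_part:
            for s in s_records:            # the full copy of S is available locally
                if condition(r, s):
                    result.append((r, s))
    return result

R = [{"a": 1}, {"a": 5}, {"a": 9}]
S = [{"b": 4}, {"b": 10}]
print(asymmetric_fragment_and_replicate_join(R, S, 2, lambda r, s: r["a"] < s["b"]))
```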
Other Operations • Selection – the table may be partitioned on the selection attribute • If not, it can be scanned in parallel • Duplicate elimination – use parallel sorting • Projection – can be performed by scanning • Aggregation – partition the table on the grouping attribute • If records do have to be transferred between processors it may be possible to send just partial results • The final result can then be calculated from the partial results • e.g. sum
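A sketch of the partial-aggregation idea for SUM: each processor aggregates only its own partition, the small partial results are shipped, and a coordinator combines them. All of the names and the sample data are illustrative.

```python
from collections import defaultdict

def local_partial_sums(partition, group_key, value_key):
    """Runs at each processor: aggregate only the locally stored records."""
    partial = defaultdict(int)
    for record in partition:
        partial[record[group_key]] += record[value_key]
    return partial

def combine_partials(partials):
    """Runs at the coordinator: for SUM, partial results can simply be added."""
    total = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            total[group] += value
    return dict(total)

partitions = [
    [{"dept": 10, "sal": 100}, {"dept": 20, "sal": 50}],   # processor 1
    [{"dept": 10, "sal": 70}],                              # processor 2
]
print(combine_partials(local_partial_sums(p, "dept", "sal") for p in partitions))
```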
Costs of Parallelism • Using parallel processors reduces the time to perform an operation • Possibly to as little as 1/n * original cost • Where n is the number of processors • However there are also additional costs • Start-up costs for initiating the operation • Skew which may reduce the speed-up • Contention for resources resulting in delays • Cost of assembling the final result
Parallel Query Optimization • As well as parallelizing individual operators, different operators can be processed in parallel • Different processors perform different operations • Result of one operator can be pipelined into another • Note that sorting and the hash-join partitioning block pipelines • Multiple independent operations can be executed concurrently • Using bushy, rather than left-deep, join trees
Parallel vs. Serial Plans • The best serial plan may not be the best parallel plan • Also note that parallelization introduces further complexity into query optimization • Consider a table partitioned across two nodes, with a local secondary index on name • Node 1 contains names between A and M • Node 2 contains names between N and Z • Consider the selection: name < "Noober" • Node 1 should scan its partition, but • Node 2 should use the name index
Parallel System Design • In a large-scale parallel system the chances of failure increase • Such systems should be designed to operate even if a processor or disk fails • Data can be replicated across multiple processors • Failed processors or disks are tracked • And requests are re-routed to the backup
Summary • Architecture • Shared-memory is easy, but costly and does not scale well • Shared-nothing is cheap and scales well, but is harder to implement • Both intraoperation, and interoperation parallelism are possible • Most relational algebra operations can be performed in parallel • How the data is partitioned across processors is very important
Introduction • A distributed database is motivated by a number of factors • Increased availability • If a site containing a table goes down, the table may still be available if a copy is maintained at another site • Distributed access to data • An organization may have branches in several cities • Access patterns are typically affected by locality • Analysis of distributed data • Distributed systems must support integrated access
Ideal • Data is stored at several sites • Each site is managed by an independent DBMS • The system should make the fact that data is distributed transparent to the user • Distributed Data Independence • Users should not need to know where the data is located • Queries that access several sites should be optimized • Distributed Transaction Atomicity • Users should be able to write transactions that access several sites, in the same way as local transactions
Reality • Users may have to be aware of where data is located • Distributed data independence and distributed transaction atomicity may not be supported • These properties may be hard to support efficiently • Sites may be connected by a slow long-distance network • Consider a global system • Administrative overheads for viewing data as a single unified collection may be prohibitively expensive
Distributed vs. Parallel • Distributed and shared-nothing parallel systems appear similar • In practice these are often very different since distributed DBs are typically • Geographically separated • Separately administered • Have slower interconnections • May have both local and global transactions
Types of Distributed Database • Homogeneous • Data is distributed but every site runs the same DBMS software • Heterogeneous, or multidatabase • Different sites run different DBMSs, and the sites are connected to enable access to data • Require standards for gateway protocols • A gateway protocol is an API that allows external applications access to the database • e.g. ODBC and JDBC • Gateways add a layer of processing, and may not be able to entirely mask differences between servers
Distributed DBMS Architecture • Client-Server • Collaborating Server • Middleware
Client-Server Systems • One or more client processes and one or more server processes • A client process sends a query to any one server process • Clients are responsible for UI • Servers manage data and execute transactions • A popular architecture • Relatively simple to implement • Servers do not have to deal with user-interactions • Users can run a GUI on clients • Communication between client and server should be as set-oriented as possible • e.g. stored procedures vs. cursors
Collaborating Server Systems • Client-server systems do not allow a single query to access multiple servers as this would require • Breaking the query into subqueries to be executed at different sites, and merging the answers to the subqueries • To do this the client would have to be overly complex • In a collaborating server system the distinction between clients and servers is eliminated • A collection of DB servers, each able to run transactions against local data • When a query is received that requires data from other servers the server generates appropriate subqueries
Middleware Systems • Designed to allow a single query to access multiple servers, but • Without requiring all servers to be capable of managing multi-site query execution • Often used to integrate legacy systems • Requires one database server (the middleware) capable of managing multi-server queries • Other servers only handle local queries and transactions • The special server coordinates queries and transactions • The middleware server typically doesn’t maintain any data
Storing Distributed Data • In a distributed system tables are stored across several sites • Accessing a table stored elsewhere incurs message-passing costs • A single table may be replicated or fragmented across several sites • Fragments are stored at the sites where they are most often accessed • Several replicas of a table may be stored at different sites • Fragmentation and replication can be combined
Fragmentation • Fragmentation consists of breaking a table into smaller tables, or fragments • The fragments are stored instead of the original table • Possibly at different sites • Fragmentation can either be vertical or horizontal • [Diagram: a table split horizontally into subsets of rows and vertically into subsets of columns]
Managing Fragmentation • Records that belong to a horizontal fragment are usually identified by a selection query • e.g. all the records that relate to a particular city, achieving locality, reducing communication costs • A horizontally fragmented table can be recreated by computing the union of the fragments • Fragments are usually required to be disjoint • Records belonging to a vertical fragment are identified by a projection query • The collection of vertical fragments must be a lossless-join decomposition • A unique tuple ID is often assigned to records
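A minimal sketch of the two kinds of fragmentation and their reconstruction, using a plain list of dicts with an explicit tuple ID; the sample table, the city predicate, and the column split are illustrative.

```python
table = [
    {"tid": 1, "name": "Ada", "city": "Oslo",   "salary": 100},
    {"tid": 2, "name": "Bo",  "city": "Bergen", "salary": 200},
]

# Horizontal fragments: defined by selection predicates; since the fragments are
# disjoint, the original table is recreated by taking their union.
oslo  = [r for r in table if r["city"] == "Oslo"]
other = [r for r in table if r["city"] != "Oslo"]
assert sorted(oslo + other, key=lambda r: r["tid"]) == table

# Vertical fragments: defined by projections that keep the tuple ID, so the
# decomposition is lossless and the table can be rebuilt by joining on tid.
names   = [{"tid": r["tid"], "name": r["name"]} for r in table]
payroll = [{"tid": r["tid"], "city": r["city"], "salary": r["salary"]} for r in table]
rebuilt = [{**n, **p} for n in names for p in payroll if n["tid"] == p["tid"]]
assert sorted(rebuilt, key=lambda r: r["tid"]) == table
```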
Replication • Replication entails storing several copies of a table or of table fragments for • Increased availability of data, which protects against • Failure of individual sites, and • Failure of communication links • Faster query evaluation • Queries can execute faster by using a local copy of a table • There are two kinds of replication, synchronous, and asynchronous • These differ in how replicas are kept current when the table is modified
Managing Distributed Catalogs • Distributing data across sites adds complexity • It is important to track where replicated or fragmented tables are stored • Each replica or fragment must be uniquely named • Naming should be performed locally • A global relation name consists of {birth site, local name} • The birth site is the site where the table was created • A site catalog records fragments and replicas at a site, and tracks replicas of tables created at the site • To locate a table, look up its birth site catalog • The birth site never changes, even if the table is moved
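A sketch of the naming scheme described above: a global relation name is a (birth site, local name) pair, and locating a relation means consulting the catalog at its birth site, which records where the replicas or fragments currently live. The site names and catalog layout are illustrative.

```python
# Each site keeps a catalog of the relations born there and where their replicas live now.
site_catalogs = {
    "site_A": {"employees": ["site_A", "site_C"]},   # created at A, replicated to C
    "site_B": {"projects":  ["site_B"]},
}

def locate(global_name):
    birth_site, local_name = global_name           # global name = (birth site, local name)
    return site_catalogs[birth_site][local_name]   # birth site never changes, even if data moves

print(locate(("site_A", "employees")))   # -> ['site_A', 'site_C']
```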