Parallel & Distributed Database Systems

Parallel & Distributed Database Systems Reid Exley November 3, 2005

Overview • Parallel Databases • Background • Architectures • Applications • Distributed Databases • Background • Characteristics • Applications

Parallel Databases

What is the PROBLEM? • Performance • Mainframes expensive • I/O bottleneck (speed(disk) < speed(RAM) < speed(cpu))

What is a parallel database? • Definition: a storage and retrieval method that seeks to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries

Why parallel databases? • Relational model affords it • SQL can easily be ported to parallel processing • Hardware trends • Decreasing costs of servers • Increasing speed of hardware • Complex queries needed to be solved • Data warehouse • Search engines

3 Three Architectures • Shared-memory system • Shared-disk system • Shared-nothing system

Shared-memory system • Multiple CPUs attached to an interconnection network and can access a common region of main memory • Closest to a conventional machine DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

What’s wrong with it? • Memory contention becomes a bottleneck as the # of CPUs increases

Shared-disk system • Each CPU has a private memory and direct access to all disks through an interconnection network DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

What’s wrong with it? • Disks over the network become the bottleneck • INTERFERENCE: as more CPUs are added, existing CPUs are slowed down because of the increased contention for memory accesses and network bandwidth

Shared-nothing system • Each CPU has local main memory and disk space, but no two CPUs can access the same storage area • All communication between CPUs is through a network connection • Considered the best parallel architecture DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

Disadvantage: • Requires more extensive reorganization of the DBMS code • Load Balancing complex

Advantage: • Provides linear speed-up: Twice as much hardware can perform the task in half the elapsed time • Provides linear scale-up: Twice as much hardware can perform twice as large a task in the same elapsed time DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

How is parallelism achieved? • Pipelining • Data Partitioning

Pipelining • The output of the second operator is worked on by the first operator as soon as it is generated Module A Module B Module C Time = 0, 1, 2 Module D Time = 1, 2 Module E Time = 2

Pipelining Challenges • Blocking – an operator is blocking if it does not output until it has consumed all inputs • Sorting & Aggregation • Blocking kills pipeline parallelism • Relational pipelines are rarely very long • Execution cost of one operation is sometimes much larger than others (skew)

Data Partitioning • Spreads I/O and computation among processors

Partitioning Methods • Range • Hash • Round Robin

Range • Rows are sorted, and n ranges are chosen for the sort key values, so each range contains roughly the same # of rows • Rows in range i are assigned to processor i • Considerations: • Good for equijoins, range queries, group by • May need to sample data from disks to get even distribution (disk skew)

Hash • Hash function is applied to fields of a row to determine its processor • Considerations: • Good for equijoins • Good for accessing a subset of a relation

Round-robin • If there are n processors, the ith tuple is assigned to processor i mod n • Considerations: • Good if accessing the entire table or relation • Good for load balance

Node 1 Node 2 R2: R1: S1 S2 Node 3 Node 4 Parallel Join Example

Real-life Application • Search engines • Google

Open Problems • Database query optimizers do not consider all possible plans • min{cost = data transmission + local processing} • Highly skewed value distributions • Hybrid Architectures • Support of higher functionality such as rules and objects

Distributed Databases

What is the PROBLEM? • Availability • Redundancy • Local Ownership • Data needs are often local • Accessing Distributed Data • Data at multiple sites must be accessible

What is it? • A collection of multiple, logically interrelated databases distributed over a computer network

Characteristics • Distributed Data Independence • User should not have to know where data is located • Distributed Transaction Atomicity • Database should appear to be one local database • Atomic: All or nothing transactions • Data is Logically Related • Most DDB are relational

Data Independence

Logical View

Physical View

Dimensions • Autonomy • How independent are each of the DBs • Distribution • How many DBs are there • Heterogeneity • How different are the data models, query languages, interfaces, and transaction management protocols

Architectures • Client-Server • Collaborating Server • Middleware

Client-server • Client sends query to single site • All query processing done at one server Client Server Query

DB DB DB DB DB Collaborating Server • A collection of db servers that cooperatively execute transactions • A server that generates appropriate subqueries to be executed by other servers and puts the results together when data is distributed Query

Query DB DB Middleware DB DB Result Middleware • Coordinates the execution of queries and transactions across one or more independent database servers; • Does not contain any data itself

Fragmentation • Breaking a relation into smaller relations or fragments and storing the fragments possibly at different sites • Horizontal fragmentation (rows) • Vertical fragmentation (columns) • Hybrid?

Replication • Storing several copies of a relation or fragments • Why? • Increased availability of data (redundancy) • Faster Query Evaluation (local is faster) • Necessary for Data Warehousing

Replication: Two Methods • Synchronous • All copies are synchronize before commit • Asynchronous • Modified copies are only periodically updated • More common • Sacrifices data independence for efficiency • 2 Types: Primary Site or P2P Q. When would you use each method?

Primary Site • One copy of a relation is designated as the primary or master • Secondary copies are made from primary are not directly updated • Publisher & Subscriber

Peer-to-peer • More than one copy can be master • Changes to a master copy must be propagated to other copies • If two master copies are changed in a conflicting manner, it must be resolved (Site 1: Joe’s age changed to 35; Site 2 to 36) • Used best in situations that don’t produce conflicts • Each master site owns a disjoint fragment • Updating rights own by only one master at a time

Distributed Transactions • Distributed Concurrency Control • Distributed Recovery

Distributed Concurrency Control • How do we manage locks for objects across all sites? • Centralized: One site does all locking • Vulnerable to single-site failure • Primary Copy: All locking for an object done at the primary copy for this object • Reading requires access to locking site as well as site where the object is stored • Fully Distributed: Locking for a copy done at site where copy is stored • Locks at all sites while writing an object

Distributed Recovery • Two new issues: • Link and remote system failures • Must ensure atomicity of sub-transactions • Solution: Two-phase commit • Coordinator (originator) asks for a vote (yes or abort) from subordinates (other DBs) • If unanimous yes, then coordinator tells all the others to commit • Coordinator waits for “ack” from all subordinates then commits and ends transaction log

Popular Support • Oracle • Grid Databases/Computing • http://www.oracle.com/technologies/grid/index.html • Mysql • Cluster DBMS • http://www.mysql.com/products/database/cluster/ • DB2, MSSQL Server, etc.

Applications • Manufacturing – especially at multiple locations • Military command and control • Airlines • Hotel Chains • (Any organization which has a decentralized structure)

Open Problems • Increasing performance of locking protocols • Speculative Locking – waiting transactions execute on both the before and after states of preceding transactions prior to a commit (Krishna & Kitsuregawa, 2004)

Parallel & Distributed Database Systems