Parallel Databases


Ideal Parallel Systems
Two key properties:
• Linear Speedup: Twice as much hardware can perform the task in half the elapsed time (i.e., speedup = number of processors).
• Linear Scaleup: Twice as much hardware can perform twice as large a task in the same elapsed time (i.e., scaleup = 1).
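The two properties can be checked from measured elapsed times. The sketch below assumes the usual ratio definitions (old elapsed time divided by new elapsed time); the function names and the timings are hypothetical and only illustrate the arithmetic.

```python
def speedup(small_system_time: float, big_system_time: float) -> float:
    """Same task on both systems: elapsed time on the small system
    divided by elapsed time on the big system."""
    return small_system_time / big_system_time


def scaleup(small_job_on_small_system: float, big_job_on_big_system: float) -> float:
    """The job grows with the hardware: elapsed time of the small job on the
    small system divided by elapsed time of the big job on the big system."""
    return small_job_on_small_system / big_job_on_big_system


# Hypothetical measurements, in seconds.
# Doubling the hardware halves the elapsed time: speedup 2 on 2x hardware is linear.
linear_speedup = speedup(120.0, 60.0)

# Doubling both hardware and problem size leaves the elapsed time unchanged.
linear_scaleup = scaleup(100.0, 100.0)

# A sub-linear case: 2x hardware, but the task only drops from 120 s to 80 s.
sublinear_speedup = speedup(120.0, 80.0)
```

A speedup below the hardware ratio, or a scaleup below 1, indicates that one of the barriers on the next slide is at work.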

Barriers to Parallelism
• Startup: The time needed to start a parallel operation (thread creation/connection overhead) may dominate the actual computation time.
• Interference: When accessing shared resources, each new process slows down the others (hot spot problem).
• Skew: The response time of a set of parallel processes is the time of the slowest one.
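A rough way to see the effect of startup and skew is a toy cost model: response time is the startup cost plus the time of the slowest process. This is a deliberately simplified sketch that ignores interference; the function name and all numbers are hypothetical.

```python
from typing import Sequence


def response_time(startup: float, process_times: Sequence[float]) -> float:
    """Toy model: the parallel operation finishes when its slowest process does."""
    return startup + max(process_times)


# 100 seconds of total work spread over 4 processes, 1 second of startup.
balanced = response_time(1.0, [25.0, 25.0, 25.0, 25.0])
skewed = response_time(1.0, [10.0, 10.0, 10.0, 70.0])

# A tiny task: 0.4 seconds of work, where startup dominates the computation.
tiny = response_time(1.0, [0.1, 0.1, 0.1, 0.1])
```

In this model the skewed run takes 71 seconds although the total work is the same as in the balanced run, which takes 26 seconds.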

The Challenges
• The ideal database machine has:
  • A single infinitely fast processor.
  • An infinitely large memory with infinite bandwidth.
Unfortunately, technology is not delivering such machines.
• The challenges are:
  • To build an infinitely fast processor out of infinitely many processors of finite speed.
  • To build an infinitely large memory with infinitely many storage units of finite speed.

Why Parallel Databases?
• High-performance, low-cost commodity components have recently become available.
• Microprocessor-based systems are much cheaper than traditional mainframes.
• Widespread adoption of the relational data model.
• The relational data model is ideally suited to parallel execution.
• Terabyte online databases are becoming common as the price of online storage decreases.
• It is difficult to build mainframes powerful enough to meet the I/O demands of large relational databases.

Hardware Architecture
[Figure: Shared-Everything (SE) and Shared-Disk (SD) architectures, each showing processors P0 ... Pn, memory, disks, and a communication network.]
Example systems:
• nCUBE/2
• Original Digital VAX cluster
• Sun Fire 3800-6800-15000 (72)
• IBM pSeries 610 - 690 (32)
• IBM 3090 series
• Digital VAX
• Sequent Symmetry

Shared-Nothing Architecture
[Figure: Shared-Nothing (SN) architecture. Each processing node (PN) P0 ... Pn has its own memory and disk, and the nodes are connected by a communication network.]
Consensus: The Shared-Nothing architecture is the most scalable for supporting very large databases.
Example systems:
• IBM SP/2
• Teradata DBC/1012
• Tandem

IBM RS/6000 SP
• It allows for up to 8,192 individual processors to be combined and managed as a single system.
• Processors are packaged in shared memory nodes of up to 16 processors each.
• IBM's well-planned roadmap for the SP allows customers to start small and scale up to larger, more powerful systems.
• This may entail adding nodes without having to replace existing hardware, ensuring long-term investment protection as operating needs grow.

Parallel Database Servers
• Tandem NonStop SQL
• Informix: Online 7.0 supports the SE environment with Informix Parallel Data Query (PDQ). Version 8.0 supports SN computers.
• Oracle: two products, Parallel Server and Parallel Query Option (PQO).
• Sybase: Navigation Server.
• AT&T Global Information Solutions (GIS).
• IBM DB2 Parallel Edition: Supports the IBM SP2 SN multiprocessor.

Process Structure for PDB
[Figure: process structure taking SQL as input and returning results.]
• Query optimization
• Query scheduling
• Data placement
• SE, SD, SN architectures

Parallelism in Relational Data Model
[Figure: SCAN operators over Table A and Table B feed a JOIN, whose output feeds an INSERT into C.]
• Pipeline Parallelism: If one operator sends its output to another, the two operators can execute in parallel.
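A minimal sketch of a SCAN, JOIN, INSERT pipeline, assuming each operator runs in its own thread and passes tuples downstream through a bounded queue. The operator functions, the `END` sentinel, and the sample tables are all hypothetical. As a simplification, Table A is loaded directly into the join's hash table rather than streamed. In CPython the threads mostly interleave rather than run on separate cores, so this shows the dataflow structure, not a performance gain.

```python
import queue
import threading

END = object()  # sentinel that marks the end of a tuple stream


def scan(table, out_q):
    """SCAN: emit tuples one at a time so the consumer can start immediately."""
    for row in table:
        out_q.put(row)
    out_q.put(END)


def hash_join(build_table, in_q, out_q):
    """JOIN: probe a hash table on the key as tuples arrive from the scan."""
    index = {}
    for key, value in build_table:
        index.setdefault(key, []).append(value)
    while (row := in_q.get()) is not END:
        key, value = row
        for match in index.get(key, []):
            out_q.put((key, match, value))
    out_q.put(END)


def insert(in_q, target):
    """INSERT: append joined tuples to the result table as they arrive."""
    while (row := in_q.get()) is not END:
        target.append(row)


table_a = [(1, "a1"), (2, "a2")]
table_b = [(1, "b1"), (2, "b2"), (3, "b3")]
table_c = []

scan_to_join = queue.Queue(maxsize=2)
join_to_insert = queue.Queue(maxsize=2)
stages = [
    threading.Thread(target=scan, args=(table_b, scan_to_join)),
    threading.Thread(target=hash_join, args=(table_a, scan_to_join, join_to_insert)),
    threading.Thread(target=insert, args=(join_to_insert, table_c)),
]
for stage in stages:
    stage.start()
for stage in stages:
    stage.join()
```

The bounded queues keep a fast producer from running far ahead of its consumer, so all three operators stay active at the same time.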

[Figure: partitions of tables A and B are scanned and joined, and parallel INSERT operators write the output partitions C0, C1, C2.]
• Partitioned Parallelism: By taking the large relational operators and partitioning their inputs and outputs, it is possible to turn one big job into many concurrent independent little ones.
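A minimal sketch of partitioned parallelism for a join, assuming integer join keys and a simple `key % n` partitioning function so that matching tuples of A and B always land in the same partition. The helper names and sample data are hypothetical; a thread pool stands in for the processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(table, n):
    """Split a table into n partitions on its first field (integer keys assumed)."""
    parts = [[] for _ in range(n)]
    for row in table:
        parts[row[0] % n].append(row)
    return parts


def local_join(a_part, b_part):
    """Join one partition of A with the matching partition of B."""
    index = {}
    for key, value in a_part:
        index.setdefault(key, []).append(value)
    return [(key, a, b) for key, b in b_part for a in index.get(key, [])]


N = 3
table_a = [(k, f"a{k}") for k in range(6)]
table_b = [(k, f"b{k}") for k in range(9)]

a_parts = partition(table_a, N)
b_parts = partition(table_b, N)

# Each (A_i, B_i) pair is an independent little job producing C_i.
with ThreadPoolExecutor(max_workers=N) as pool:
    c_parts = list(pool.map(local_join, a_parts, b_parts))
```

Because both inputs are partitioned on the join key, no partition ever needs tuples from another, which is what makes the little jobs independent.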

Data Partitioning Strategies
• There are two problems for the SN architecture:
  • The degree of parallelism is determined by the physical layout of the data across the PNs.
  • Its performance is very sensitive to skew in the data distribution.
• Partitioned data is the key to partitioned execution:
  • Round-Robin
  • Hash Partitioning
  • Range Partitioning

Round-Robin Partitioning
[Figure: records distributed across disks D0, D1, D2, D3.]
• It maps the ith tuple to disk i mod n, where n is the number of disks.
• Advantage: It is simple.
• Disadvantage: It does not support associative search.
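The mapping rule can be sketched in a few lines. The function name and the sample records are hypothetical.

```python
def round_robin_partition(tuples, n_disks):
    """Send the i-th tuple to disk i mod n_disks."""
    disks = [[] for _ in range(n_disks)]
    for i, row in enumerate(tuples):
        disks[i % n_disks].append(row)
    return disks


records = list(range(10))
disks = round_robin_partition(records, 4)
# disks == [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The placement depends only on arrival order, not on any attribute value, so a lookup by value has to visit every disk.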

Hash Partitioning
[Figure: a hash function distributing records across disks D0, D1, D2, D3.]
• It maps each tuple to a disk location based on a hash function.
• Advantage: Associative access to the tuples with a specific attribute value can be directed to a single disk.
• Disadvantage: It tends to randomize data rather than cluster it.
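A minimal sketch of hash partitioning and a single-disk lookup. The choice of CRC-32 as the hash function is an assumption made so the placement is stable across runs; the helper names and the student records are hypothetical.

```python
import zlib


def disk_for(key, n_disks):
    """Map an attribute value to a disk with a deterministic hash (CRC-32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_disks


def hash_partition(records, key_of, n_disks):
    """Place every record on the disk chosen by hashing its partitioning attribute."""
    disks = [[] for _ in range(n_disks)]
    for record in records:
        disks[disk_for(key_of(record), n_disks)].append(record)
    return disks


def lookup(disks, key, key_of):
    """Associative access: only the one disk that can hold the key is searched."""
    target = disks[disk_for(key, len(disks))]
    return [record for record in target if key_of(record) == key]


students = [("Ann", "CS"), ("Bob", "Math"), ("Cho", "CS"), ("Dee", "Physics")]
by_major = lambda record: record[1]

disks = hash_partition(students, by_major, 4)
cs_students = lookup(disks, "CS", by_major)
```

A range predicate on the same attribute gets no such help, because neighbouring values hash to unrelated disks.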

Range Partitioning
[Figure: ranges A~F, G~L, M~R, and S~Z mapped to disks D0, D1, D2, and D3.]
• It maps contiguous attribute ranges of a relation to various disks.
• Advantage: It is good for associative search and clustering data.
• Disadvantage: It risks execution skew in which all the execution occurs in one partition.
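A minimal sketch using the four letter ranges from the figure (A~F, G~L, M~R, S~Z on D0 to D3), assuming tuples are partitioned on the first letter of a name. The helper names and sample names are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the second, third, and fourth ranges: G~L, M~R, S~Z.
SPLIT_POINTS = ["G", "M", "S"]


def disk_for_name(name):
    """Disk index (0..3) of the range containing the name's first letter."""
    return bisect_right(SPLIT_POINTS, name[0].upper())


def disks_for_range(low, high):
    """Disks that a range query low..high (inclusive) has to visit."""
    return list(range(disk_for_name(low), disk_for_name(high) + 1))


narrow_query = disks_for_range("A", "C")   # touches a single disk
wide_query = disks_for_range("E", "N")     # touches three neighbouring disks

# Execution skew: if every queried name starts with "S", all work lands on D3.
skewed = {disk_for_name(n) for n in ["Sato", "Silva", "Smith", "Suzuki"]}
```

The same clustering that lets the narrow query touch one disk is what concentrates the skewed workload on one disk.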

Horizontal Data Partitioning

Problems for Horizontal Partitioning
• Query 1: Retrieve the names of students who have a GPA better than 2.0.
  → Only P2 and P3 can participate.
• In a multi-user environment, the system can effectively use all the remaining PNs for other queries (generally not achievable).
• Query 2: Retrieve the names of students who major in Computer Science.
  → The whole file must be searched.
• This problem cannot be easily addressed.

To Address the Problem
• The relation is horizontally partitioned and distributed across the PNs. Locally, each partition is organized as a grid file.
• The relation is partitioned using multiple attributes. Locally, each partition can be organized as a grid file (the approach investigated in most research).

Multidimensional Data Partitioning
[Figure: a grid over Age (25 to 60) and Salary in K (20 to 90), showing the regions accessed by a 1-attribute query and a 2-attribute query.]

Advantage of MDP
• Degree of parallelism is maximized (using as many processing nodes as possible).
• Search space is minimized (searching only relevant data blocks).
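The search-space point can be illustrated with a small grid over Age and Salary. The sketch assumes the cell boundaries fall on the axis ticks of the figure (Age 25 to 60 in steps of 5, Salary 20K to 90K in steps of 10K); those boundaries, the helper names, and the sample queries are assumptions, not taken from the slides.

```python
from bisect import bisect_right

AGE_BOUNDS = [25, 30, 35, 40, 45, 50, 55, 60]       # assumed split points
SALARY_BOUNDS = [20, 30, 40, 50, 60, 70, 80, 90]    # assumed split points, in K


def interval(bounds, value):
    """Index of the grid interval containing value (0 is below the first bound)."""
    return bisect_right(bounds, value)


def cell_of(age, salary_k):
    """Grid cell (age interval, salary interval) that stores a tuple."""
    return (interval(AGE_BOUNDS, age), interval(SALARY_BOUNDS, salary_k))


def cells_for_query(age_range=None, salary_range=None):
    """Cells a range query must search; None leaves that attribute unconstrained."""
    def span(bounds, rng):
        if rng is None:
            return range(len(bounds) + 1)
        return range(interval(bounds, rng[0]), interval(bounds, rng[1]) + 1)

    return [(a, s) for a in span(AGE_BOUNDS, age_range)
            for s in span(SALARY_BOUNDS, salary_range)]


all_cells = cells_for_query()                                   # full scan: 81 cells
one_attribute = cells_for_query(age_range=(30, 34))             # one band: 9 cells
two_attribute = cells_for_query(age_range=(30, 34), salary_range=(40, 59))
```

Each additional constrained attribute shrinks the set of cells to search. How those cells are spread over the PNs determines the degree of parallelism.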

Query Types
• Query Shape: The shape of the data sub-space accessed by a range query.
• Square Query: The query shape is a square.
• Row Query: The query shape is a rectangle containing a number of rows.
• Column Query: The query shape is a rectangle containing a number of columns.

Disk Modulo (DM) Allocation
[Figure: a 16×16 DM example.]

Disk Modulo
• Advantage: optimal for row and column queries.
• Disadvantage: poor for square queries.
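The slides do not spell out the DM mapping; the sketch below assumes the usual definition, in which grid cell (i, j) is stored on disk (i + j) mod N. It measures a query by the largest number of its cells that fall on one disk, since that disk finishes last. The function names and the 16×16 grid with N = 16 are illustrative.

```python
from collections import Counter
from math import ceil


def dm_disk(i, j, n_disks):
    """Disk Modulo: cell (i, j) goes to disk (i + j) mod n_disks."""
    return (i + j) % n_disks


def max_load(rows, cols, n_disks):
    """Largest number of the query's cells stored on any single disk."""
    load = Counter(dm_disk(i, j, n_disks) for i in rows for j in cols)
    return max(load.values())


def optimal_load(n_cells, n_disks):
    """Best possible: the cells spread as evenly as the disk count allows."""
    return ceil(n_cells / n_disks)


N = 16
row_query = max_load(range(3, 4), range(16), N)       # one full row: 16 cells
column_query = max_load(range(16), range(5, 6), N)    # one full column: 16 cells
square_query = max_load(range(4), range(4), N)        # a 4x4 square: 16 cells
```

Under this assumed mapping the row and column queries place one cell on each disk, which matches `optimal_load(16, N)`. The 4×4 square puts four cells on one disk, because all cells on an anti-diagonal share the same value of i + j.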

Hilbert Curve Allocation Method (HCAM)
[Figure: a Hilbert curve and a 16×16 HCAM example.]

HCAM
• HCAM is based on the idea of space filling curves.
• A space filling curve visits all points in a k-dimensional space grid exactly once and never crosses itself.
• Advantages: good for square range queries.
• Disadvantages: poor for row and column queries.
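A minimal sketch of the idea, assuming cells are numbered by their position along a 2-D Hilbert curve and then dealt to disks round-robin (position mod N). The index routine is the common bit-manipulation conversion from (x, y) to curve position; the function names, the 16×16 grid, and N = 16 are illustrative.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve over an n x n grid
    (n must be a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is in standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hcam_disk(x, y, n_disks, grid=16):
    """Assign cells to disks round-robin along the curve."""
    return hilbert_index(grid, x, y) % n_disks


N = 16
# Disks used by the 4x4 square in the corner of the grid.
square_disks = {hcam_disk(x, y, N) for x in range(4) for y in range(4)}
# Number of cells stored on each disk over the whole 16x16 grid.
cells_per_disk = [
    sum(1 for x in range(16) for y in range(16) if hcam_disk(x, y, N) == disk)
    for disk in range(N)
]
```

Cells that are close on the curve are close in the grid, so a compact square tends to cover a run of consecutive curve positions and therefore many different disks. A long thin row or column cuts across the curve and gets no such guarantee.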

General Multidimensional Data Allocation (GeMDA)
[Figure: a 16×16 2-D GeMDA example.]

2-D GeMDA
• Regular Rows: Circular left shift   positions.
• Check Rows: Circular left shift  +1 positions.
• Number of check rows: GCD( , N) - 1
• Advantages: optimal for row, column, and small square range queries (|Q| <  2).
N is the number of PNs.

3-D GeMDA

Mapping Function For GeMDA

Optimality Comparison
