Parallel Database System: The Future of High Performance Database Systems

Presented by: Suresh Babu L

Outline
• Why Parallel Databases?
• Scale-Up and Speed-Up
• Parallel DB Architectures
• Parallel Data Flow
• Data Partitioning
• Parallelism with Relational Operators
• The State of the Art

Why Parallel Databases?
[Figure: Edgar F. Codd]

Parallel Access to Data
• At 10 MB/s, scanning 1 Terabyte takes 1.2 days.
• With 1,000 x parallelism (10 GB/s aggregate bandwidth), the same 1 Terabyte scan takes 100 seconds.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
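The scan times on this slide follow from simple division. The sketch below reproduces them, assuming decimal units (1 Terabyte = 10^12 bytes) and perfectly linear speedup across 1,000 devices; the function name `scan_seconds` is made up for illustration.

```python
# Back-of-the-envelope scan times, assuming decimal units (1 TB = 10**12 bytes).
TERABYTE = 10**12
MB_PER_S = 10**6


def scan_seconds(data_bytes, bandwidth_bytes_per_s, degree=1):
    """Time to scan the data when `degree` devices each deliver the given bandwidth.

    Assumes ideal (linear) speedup: no startup, interference, or skew.
    """
    return data_bytes / (bandwidth_bytes_per_s * degree)


serial = scan_seconds(TERABYTE, 10 * MB_PER_S)           # one device at 10 MB/s
parallel = scan_seconds(TERABYTE, 10 * MB_PER_S, 1000)   # 1,000 devices, 10 GB/s in total

print(f"serial scan:   {serial:,.0f} s ({serial / 86400:.1f} days)")
print(f"parallel scan: {parallel:,.0f} s")
```

The serial figure is 100,000 seconds, which is where the "1.2 days" on the slide comes from.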

Parallel DBMS: Intro
• Pipeline parallelism: the output of one sequential program feeds the input of the next (Any Sequential Program → Any Sequential Program).
• Partition parallelism: outputs split N ways, inputs merge M ways, so copies of the same sequential program each work on part of the data.

Pipelined and Partitioned Parallelism
• Both are natural in a DBMS!
• Pipeline parallelism: Scan → Sort → Merge
• Partitioned data allows partitioned parallelism.
[Figure: five Source Data partitions, each feeding its own Scan → Sort, with one Merge combining the sorted streams]
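The Scan → Sort → Merge data flow can be sketched in a few lines. This is only an illustration of the shape of the plan: everything runs in a single thread here, whereas a parallel DBMS would place each Scan → Sort pipeline on its own processor and disc. The function names and sample data are assumptions, not part of the slides.

```python
import heapq


def scan(partition):
    """Scan operator: yield the tuples of one data partition, one at a time."""
    for row in partition:
        yield row


def sort_op(rows):
    """Sort operator: consumes its whole input before it can emit anything."""
    return iter(sorted(rows))


def parallel_sort(partitions):
    """Build one Scan -> Sort pipeline per partition, then merge the sorted streams."""
    sorted_streams = [sort_op(scan(p)) for p in partitions]
    return heapq.merge(*sorted_streams)  # ordered merge of already-sorted inputs


partitions = [[7, 2, 9], [4, 1], [8, 3, 6, 5]]  # hypothetical source data
result = list(parallel_sort(partitions))
print(result)
```

Because each partition is sorted independently, the only sequential step left is the final merge.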

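Scale-Up and Speed-Up
• Speedup: more hardware runs the same job in less time.
• Scale-up: more hardware runs a proportionally bigger job in the same time.
[Figure: workloads of 100GB and 1TB]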
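Both metrics are commonly expressed as ratios of elapsed times. The sketch below uses those usual definitions; the timings are invented purely to show the arithmetic and are not measurements from any system.

```python
def speedup(small_system_time, big_system_time):
    """Same problem on both systems. Linear speedup on N times the hardware gives N."""
    return small_system_time / big_system_time


def scaleup(small_system_small_problem_time, big_system_big_problem_time):
    """Problem and system grow by the same factor. Linear scale-up gives 1.0."""
    return small_system_small_problem_time / big_system_big_problem_time


# Hypothetical timings in seconds, for illustration only.
s = speedup(1200, 150)   # same job on 1 node vs. 10 nodes
c = scaleup(120, 150)    # 100GB on 1 node vs. 1TB on 10 nodes

print(f"speedup:  {s:.1f} (linear would be 10)")
print(f"scale-up: {c:.1f} (linear would be 1.0)")
```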

Barriers to Achieving Linear Speedup and Scaleup
• Three factors: Startup, Interference, Skew
[Figure: a bad speedup curve, plotted against processors & discs]

Architectures for Parallel DBs
• Shared memory: IBM/370, Sequent, SGI, Sun
• Shared disks: VMScluster, Sysplex

Architectures for Parallel DBs (contd.)
• Shared nothing: Tandem, Teradata, SP2

Architectures (contd.)
• Shared Nothing
    • Teradata: 400 nodes
    • 80x12 nodes
    • Tandem: 110 nodes
    • IBM / SP2 / DB2: 128 nodes
    • Informix/SP2: 100 nodes
    • ATT & Sybase: 8x14 nodes
• Shared Disk
    • Oracle: 170 nodes
    • Rdb: 24 nodes
• Shared Memory
    • Informix: 9 nodes
    • RedBrick: ? nodes

Parallel Data Flow and Relational Systems
[Figure: four Source Data partitions, each feeding its own Scan → Sort, with one Merge combining the sorted streams]

Data Partitioning
• Three main techniques:
    • Round robin
    • Hash partitioning
    • Range partitioning

Round Robin Partitioning
[Figure: tuples distributed across partitions P1, P2, …, Pn]
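A minimal sketch of round-robin partitioning, assuming the usual rule that the i-th tuple goes to partition i mod n. The function name and sample rows are illustrative.

```python
def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn: row i goes to partition i mod n."""
    partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        partitions[i % n].append(row)
    return partitions


parts = round_robin_partition(list("abcdefg"), 3)
print(parts)
```

The partitions differ in size by at most one row, which suits full sequential scans, but a lookup by key value has to visit every partition.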

Hash Partitioning
[Figure: tuples distributed across partitions P1, P2, …, Pn]
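A minimal sketch of hash partitioning: a hash of the partitioning attribute picks the partition, so equal keys always land together. `zlib.crc32` is used here only because it is deterministic across runs; a real system would use its own hash function. The function name, `key` parameter, and sample rows are assumptions.

```python
import zlib


def hash_partition(rows, n, key=lambda row: row):
    """Place each row in partition hash(key) mod n."""
    partitions = [[] for _ in range(n)]
    for row in rows:
        # crc32 of the key's repr: a stand-in for the system's hash function.
        h = zlib.crc32(repr(key(row)).encode("utf-8"))
        partitions[h % n].append(row)
    return partitions


rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
parts = hash_partition(rows, 3, key=lambda r: r[0])
print(parts)
```

An exact-match lookup on the partitioning attribute only needs to visit one partition.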

Range Partitioning
[Figure: partitions P1, P2, …, Pn holding the key ranges a…c, d…g, …, w…z]
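A minimal sketch of range partitioning, assuming each partition except the last is described by an inclusive upper bound on the key. The bounds below mimic the a…c and d…g ranges from the slide, with a third partition taking everything after g; names and sample keys are illustrative.

```python
import bisect


def range_partition(rows, upper_bounds, key=lambda row: row):
    """upper_bounds: sorted, inclusive upper bound for every partition but the last."""
    partitions = [[] for _ in range(len(upper_bounds) + 1)]
    for row in rows:
        # bisect_left finds the first bound >= key, i.e. the partition whose range covers it.
        partitions[bisect.bisect_left(upper_bounds, key(row))].append(row)
    return partitions


parts = range_partition(["d", "a", "w", "c", "g", "z"], upper_bounds=["c", "g"])
print(parts)
```

Keeping nearby keys together helps range queries, at the cost of possible skew if the key values are unevenly distributed.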

Parallelism with Relational Operators
• Two basic operations:
    • Merge
    • Split

Merge Operation

Split Operation
• Split: used to partition or replicate the stream produced by a relational operator
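The two operations can be sketched as plain functions over streams of tuples. This is an illustration only: `split`, `replicate`, `merge`, and the `route` parameter are made-up names, and the merge shown simply concatenates its inputs, whereas a real merge operator would interleave tuples as they arrive from parallel producers.

```python
from itertools import chain


def split(stream, n, route):
    """Split (partitioning form): send each tuple to output route(tuple) mod n."""
    outputs = [[] for _ in range(n)]
    for t in stream:
        outputs[route(t) % n].append(t)
    return outputs


def replicate(stream, n):
    """Split (replicating form): every output receives a full copy of the stream."""
    rows = list(stream)
    return [list(rows) for _ in range(n)]


def merge(streams):
    """Merge: combine several input streams into one sequential stream."""
    return chain.from_iterable(streams)


outputs = split(range(6), 2, route=lambda t: t)
merged = list(merge(outputs))
copies = replicate([1, 2], 3)
print(outputs, merged, copies)
```

With these two operators placed around it, an unchanged sequential operator can run as many parallel copies, each seeing only its own input stream.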

Example of Parallelizing Relational Operators
[Figure: SCAN A and SCAN B feed a JOIN, whose output is INSERTed into C]

Example (contd.)

The State of the Art
• Teradata
• Tandem NonStop SQL
• Gamma
• The Super Database Computer
• Bubba

Specialized Parallel Relational Operators
• Algorithms for traditional relational operators are rewritten to improve their parallel execution and to better handle data and execution skew.
• Look at join:
    • Sort-merge
    • Hash join
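As one illustration of a join written with parallel execution in mind, here is a sketch of a partitioned hash join: both inputs are hash-partitioned on the join key, and bucket i of one input only ever needs to be joined with bucket i of the other. It runs the buckets one after another in a single thread; the function names and sample relations are assumptions, not taken from any of the systems listed.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Build/probe hash join on one pair of buckets."""
    table = defaultdict(list)
    for l in left:                      # build phase
        table[left_key(l)].append(l)
    # probe phase: look up each right tuple's key in the build table
    return [(l, r) for r in right for l in table.get(right_key(r), [])]


def partitioned_hash_join(left, right, left_key, right_key, n):
    """Hash-partition both inputs on the join key, then join matching buckets."""
    left_parts = [[] for _ in range(n)]
    right_parts = [[] for _ in range(n)]
    for l in left:
        left_parts[hash(left_key(l)) % n].append(l)
    for r in right:
        right_parts[hash(right_key(r)) % n].append(r)
    result = []
    for lp, rp in zip(left_parts, right_parts):  # each pair could run on its own node
        result.extend(hash_join(lp, rp, left_key, right_key))
    return result


A = [(1, "a"), (2, "b"), (3, "c")]               # hypothetical relations
B = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
first = lambda t: t[0]
joined = sorted(partitioned_hash_join(A, B, first, first, n=2))
print(joined)
```

If many tuples share one key value, a single bucket does most of the work, which is the skew problem the slide refers to.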

CONCLUSION

THANK YOU QUESTIONS ?
