Parallel DBMS The need for high performance DB systems
Objectives • Parallel DBMS – what is it and why? • A look at Oracle’s approach to Parallel DB • Reading, David DeWitt and Jim Gray, Parallel Database Systems: The future of high performance database systems • J:\bit\IT7309\ParallelDB\p85-dewitt.pdf
Rough timeline • Ted Codd’s paper, early 70s • System R, Ingres, mid-late 70s • Oracle, IBM’s DB2, early 80s • Development of distributed then parallel (clustered) DB technology late 80s, early 90s - today • The key – declarative languages and data independence.
The problem (example) • How do we handle massive volumes of data (for example in data warehouses) and large numbers of users? This is increasingly the trend. • CPUs process data in nanoseconds • Disk drives deliver data in milliseconds
Why we need parallel • We have databases that hold colossal amounts of data, in the order of 10^15 bytes. • Amazon 42 Terabytes • World Data Centre for Climate 220 Terabytes • We have data applications that need to process data at very high speeds. • Scalability – as databases grow we need a cost effective way to scale up our systems to meet that growth. • High availability – zero downtime. • Single processors are simply not up to the task
Benefits • Improves Throughput. INTERQUERY PARALLELISM It is possible to process a number of distinct transactions in parallel with each other. • Improves Response Time. INTRAQUERY PARALLELISM It is possible to process ‘sub-tasks’ of an individual transaction in parallel with each other.
More benefits • Higher throughput • Fault tolerance • Price/performance • 24 x 7 availability – always available, never down
Parallelism’s goals Speed-up. As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor: 10 seconds to scan a DB of 10,000 records using 1 CPU 1 second to scan a DB of 10,000 records using 10 CPUs
Parallelism’s goals Scale-up. As you multiply resources, the size of a task that can be executed in a given time should be increased by the same factor. 1 second to scan a DB of 1,000 records using 1 CPU 1 second to scan a DB of 10,000 records using 10 CPUs
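Put as simple ratios (a brief restatement of the definitions in the DeWitt and Gray paper): Speed-up = elapsed time on the small system / elapsed time on the larger system, for the same size of problem. Scale-up = elapsed time of the small problem on the small system / elapsed time of the N-times-larger problem on the N-times-larger system. Linear speed-up means the first ratio reaches N when you add N times the resources; linear scale-up means the second ratio stays at 1.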
[Chart: Speed Up – number of transactions per second vs. number of CPUs (5, 10 and 16 CPUs), comparing linear speed-up (ideal) with sub-linear speed-up (tick values 1000/sec, 1600/sec, 2000/sec)]
[Chart: Scale Up – number of transactions per second vs. number of CPUs and database size (5 CPUs/1 GB vs. 10 CPUs/2 GB), comparing linear scale-up (ideal, 1000/sec) with sub-linear scale-up (900/sec)]
Architectures • SMP, Symmetric Multiprocessing – typically standalone machines with multiple processors, from 2 up to 64. All CPUs share the same memory, bus and I/O system, and a single copy of the OS drives all the CPUs. • MPP, Massively Parallel Processing – several interconnected nodes. Each node has its own CPU, memory, bus, disks and I/O system, and runs its own copy of the operating system. The number of nodes in an MPP system can vary from two all the way to several thousand.
Architectures • Clustered systems - A clustered system consists of several nodes loosely coupled using local area network (LAN) interconnection technology. Each of these nodes can be a single-processor machine or SMP machine. In a cluster, system software balances the workload among the nodes and provides for high availability.
Architectures • Non Uniform Memory Access (NUMA) – consists of several SMP systems that are interconnected in order to form a larger system. All of the memory in all of the SMP systems is connected together to form a single large memory space. NUMA systems run one copy of the operating system across all nodes.
3 architectures explained • See Oracle/DB2 Comparison in wikibooks for a good detailed explanation • Shared Disk • Shared Nothing • Shared Everything • Plenty of debate about which is best
Shared Nothing • The database is logically a single database • But the data is partitioned across the cluster – each node holds a chunk of the data, and read/write processes need to know where each chunk is • You need to decide how you will partition the data • Over time, as data volume grows, the data may become skewed, making the partitioning suboptimal and degrading performance • Data partitioning always involves a trade-off – partitioning one way may be good for certain workloads but poor for others, and changing the partitioning later is expensive and time consuming
Strategies cont. • Range • Based on ranges of a key value – date ranges are the most common • Hash • A hash function is applied to a key column, distributing the rows evenly; the best way to spread unorganised data that has no obvious partitioning key and is not historical • List • Explicit control based on discrete key values (such as states, regions, product categories etc.) • Oracle DDL for each strategy is sketched below
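As an illustrative sketch only (the tables, columns and partition names are invented for the example; exact options vary by Oracle version), the three strategies look roughly like this in Oracle DDL:

-- Range partitioning on a date column
CREATE TABLE sales (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2010 VALUES LESS THAN (TO_DATE('01-01-2011','DD-MM-YYYY')),
  PARTITION p2011 VALUES LESS THAN (TO_DATE('01-01-2012','DD-MM-YYYY'))
);

-- Hash partitioning to spread rows evenly when there is no natural range
CREATE TABLE customers (
  cust_id NUMBER,
  name    VARCHAR2(50)
)
PARTITION BY HASH (cust_id) PARTITIONS 4;

-- List partitioning for explicit control by discrete key values
CREATE TABLE orders_by_region (
  order_id NUMBER,
  region   VARCHAR2(20)
)
PARTITION BY LIST (region) (
  PARTITION p_north VALUES ('NORTH'),
  PARTITION p_south VALUES ('SOUTH')
);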
Shared Disk • All nodes in the cluster access a single disk farm (e.g. a RAID array) • No need for replication or partitioning, although RAID arrays usually stripe or replicate data – this is automated • A commonly implemented architecture
Shared Everything • Essentially a single logical database existing on a multicore environment. • Memory and Processing shared • Disk shared • A general extension of the standard environment • Used by Oracle and DB2
Parallel execution • Divide a (SQL) task among multiple processes in order to complete the task faster • From Oracle 8i onwards it is embedded in the RDBMS • Consider: • SQL> SELECT COUNT(*) FROM orders; • Without parallel processing, a single process scans the orders table and counts the number of rows
Parallel execution • On a four-processor machine • The orders table is split into four parts, a process is started on each CPU, and the four parts of the table are scanned simultaneously – see the sketch below
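A minimal sketch of requesting this explicitly (the degree of parallelism of 4 is illustrative; actual behaviour depends on the Oracle version and instance configuration):

SQL> SELECT /*+ PARALLEL(orders, 4) */ COUNT(*) FROM orders;

Alternatively, the table can be marked so the optimizer considers parallel scans by default:

SQL> ALTER TABLE orders PARALLEL 4;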
Case Studies • Oracle 11g Real Application Clusters (RAC) • First introduced with 9i • Clustered Parallel DB environment • Shared Cache (their terminology) • A variation on shared everything • Fault tolerance • Performance • Scalability • No single point of failure
Oracle RAC • Servers share storage • A redundant interconnect connecting the nodes provides failover • Support for up to 100 nodes – all must run the same OS and the same version of Oracle • Nodes can be SMP machines
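As a small illustration of the “many instances, one database” idea (a sketch only), the GV$ views – the cluster-wide versions of the V$ views – return one row per running instance:

SQL> SELECT inst_id, instance_name, host_name FROM gv$instance;

Applications still connect to what behaves as a single database, even though each node runs its own instance.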
Teradata • Develops database machines for very large databases (VLDB) • Specialises in data warehouse and analytic applications • Massively Parallel Processing (MPP) environment running a shared-nothing architecture (so, different to Oracle’s approach) • Runs on Windows and SUSE Linux • Parallel everything • Data warehousing support is built in
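To illustrate how the shared-nothing distribution is expressed (a sketch; the table and columns are invented), in Teradata the primary index chosen at table creation determines how rows are hashed across the parallel units (AMPs):

CREATE TABLE sales_fact (
  sale_id     INTEGER,
  customer_id INTEGER,
  amount      DECIMAL(10,2)
)
PRIMARY INDEX (customer_id);

Rows with the same customer_id hash to the same AMP, so choosing the primary index is effectively the data partitioning decision.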
Conclusions • Industry wants cheap, fast databases • Proprietary systems no longer seem valued • Shared memory seems to provide the highest performance but is not as scalable as shared-nothing environments • Shared nothing also provides more robust availability – better for larger environments • Shared everything is better for small to medium sized organisations • Shared nothing requires careful thought about data partitioning strategies