350 likes | 567 Views
Parallel DBMS. The need for high performance DB systems. Objectives. Parallel DBMS – what is it and why? A look at Oracle’s approach to Parallel DB Reading, David DeWitt and Jim Gray, Parallel Database Systems: The future of high performance database systems
E N D
Parallel DBMS The need for high performance DB systems
Objectives • Parallel DBMS – what is it and why? • A look at Oracle’s approach to Parallel DB • Reading, David DeWitt and Jim Gray, Parallel Database Systems: The future of high performance database systems • J:\bit\IT7309\ParallelDB \p85-dewitt.pdf
Rough timeline • Ted Codd’s paper, early 70s • System R, Ingres, mid-late 70s • Oracle, IBM’s DB2, early 80s • Development of distributed then parallel (clustered) DB technology late80s, early 90s - today • The key – declarative languages and data independence.
The problem (example) • How do we handle massive volumes of data (for example in Data Warehouses), large numbers of users – this is increasingly the trend. • CPUs process data in nanoseconds, • Disk drives deliver data in milliseconds
Why we need parallel • We have databases that hold a colossal amounts of data, in the order of 1015 bytes. • Amazon 42 Terabytes • World Data Centre for Climate 220 Terabytes • We have data applications that need to process data at very high speeds. • Scalability – as databases grow we need a cost effective way to scale up our systems to meet that growth. • High availability – zero downtime. • Single processors are simply not up to the task
Benefits • Improves Response Time. INTERQUERY PARALLELISM It is possible to process a number of distinct transactions in parallel with each other. • Improves Throughput. INTRAQUERY PARALLELISM It is possible to process ‘sub-tasks’ of an individual transaction in parallel with each other.
More benefits • Higher Throughput • Fault tolerance • Price/Performance • 24 x 7 • Always available • Never down
Parallelisms goals Speed-Up. As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor: 10 seconds to scan a DB of 10,000 records using 1 CPU 1 second to scan a DB of 10,000 records using 10 CPUs
Parallelisms goals Scale-up. As you multiply resources the size of a task that can be executed in a given time should be increased by the same factor. 1 second to scan a DB of 1,000 records using 1 CPU 1 second to scan a DB of 10,000 records using 10 CPUs
2000/Sec 1600/Sec Sub-linear speed-up 1000/Sec 16 CPUs 10 CPUs 5 CPUs Speed Up Linear speed-up (ideal) Number of transactions/second Number of CPUs
1000/Sec 900/Sec Sub-linear scale-up 10 CPUs 2 GB Database 5 CPUs 1 GB Database Scale Up Linear scale-up (ideal) Number of transactions/second Number of CPUs, Database size
Architectures • SMP, Symmetric multiprocessing – typically standalone machines with multiple processors, from 2 up to 64. A CPUs share the same memory, bus and I/O system. Single copy of the OS drives the CPUs • MPP, Massively Parallel Processing – several nodes connected Each node has its own CPU, memory, bus, disks, and I/O system. Each node runs its own copy of the operating system. The number of nodes in an MPP system can vary from two all the way to several thousand.
Architectures • Clustered systems - A clustered system consists of several nodes loosely coupled using local area network (LAN) interconnection technology. Each of these nodes can be a single-processor machine or SMP machine. In a cluster, system software balances the workload among the nodes and provides for high availability.
Architectures • Non Uniform Memory Access (NUMA) - consist of several SMP systems that are interconnected in order to form a larger system. All of the memory in all of the SMP systems are connected together to form a single large memory space. NUMA systems run one copy of the operating system across all nodes.
3 architectures explained • See Oracle/DB2 Comparison in wikibooks for a good detailed explanation • Shared Disk • Shared Nothing • Shared Everything • Plenty of debate about which is best
Shared Nothing • Database is logically a single database • But, data is partitioned in the cluster – nodes hold a chunk of the data. Read/Write process need to know where it is. • You need to decide how you will partition data. • Over time as data volume grows data may become skewed causing it to become suboptimal. Degradation in performance. • Data partitioning always involves a trade off – partitioning one way may be good for certain things. Expensive and time consuming.
Strategies cont. • Range • Based on ranges of data – date is most common. • Hash • A hashing algorithm is used that is then applied to the data distributing data evenly. Best way to group unorganised data that have no obvious key or is not historical. • List • Explicit control based on a key value (such as States, Regions, Product Categories etc.)
Shared Disk • Cluster all accessing a single disk farm. (RAID array). • No need for replication or partitioning although RAID arrays usually strip or replicate data – this is automated. • A commonly implemented architecture
Shared Everything • Essentially a single logical database existing on a multicore environment. • Memory and Processing shared • Disk shared • A general extension of the standard environment • Used by Oracle and DB2
Parallel execution • Divide a (SQL) task among multiple processes in order to complete the task faster • From Oracle 8i it is embedded into RDBMS • Consider ; • SQL> SELECT COUNT(*) FROM orders; • Without parallel processing, a single process scans the orders table and counts the number of rows
Parallel execution • On a four processor machine • Order table is split into four parts, a process is started on each CPU – and four parts of the table are scanned simultaneously
Case Studies • Oracle 11g Real Application Clusters (RAC) • First introduced with 9i • Clustered Parallel DB environment • Shared Cache (their terminology) • A variation on shared everything • Fault tolerance • Performance • Scalability • No single point of failure
Oracle RAC • Servers share storage • Redundant interconnect – connecting nodes provide failover • Support for up to 100 Nodes – must be same OS and version of Oracle • Nodes can be SMP machines
Teradata • Develops databases machines for VLDB • Specialises in Data Warehouse and Analytic applications • Massively Parallel Processing (MPP) environment running shared nothing architecture. (so, different to Oracle’s approach) • Runs on Windows and SUSE Linux • Parallel everything • Data Warehouse – built in.
Conclusions • Industry wants cheap fast database • Proprietary no longer seems valued • Shared memory seems to provide highest performance but not as scalable as shared nothing environments. • Shared nothing also provides more robust availability – better for larger environments • Shared everything better for small to med size orgs • Shared nothing – requires careful thought about data partitioning strategies.