
Parallel DBMS



  1. Parallel DBMS The need for high performance DB systems

  2. Objectives • Parallel DBMS – what is it and why? • A look at Oracle’s approach to Parallel DB • Reading: David DeWitt and Jim Gray, Parallel Database Systems: The Future of High Performance Database Systems • J:\bit\IT7309\ParallelDB\p85-dewitt.pdf

  3. Rough timeline • Ted Codd’s paper, early 70s • System R, Ingres, mid-late 70s • Oracle, IBM’s DB2, early 80s • Development of distributed, then parallel (clustered), DB technology: late 80s, early 90s to today • The key – declarative languages and data independence.

  4. The problem (example) • How do we handle massive volumes of data (for example in Data Warehouses) and large numbers of users? This is increasingly the trend. • CPUs process data in nanoseconds • Disk drives deliver data in milliseconds

  5. Why we need parallel • We have databases that hold colossal amounts of data, in the order of 10^15 bytes. • Amazon: 42 Terabytes • World Data Centre for Climate: 220 Terabytes • We have data applications that need to process data at very high speeds. • Scalability – as databases grow we need a cost-effective way to scale up our systems to meet that growth. • High availability – zero downtime. • Single processors are simply not up to the task

  6. Benefits • Improves Throughput. INTERQUERY PARALLELISM: a number of distinct transactions can be processed in parallel with each other. • Improves Response Time. INTRAQUERY PARALLELISM: ‘sub-tasks’ of an individual transaction can be processed in parallel with each other.
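
The interquery case above can be sketched in code. A toy illustration only, not how a real DBMS schedules work: two distinct queries run concurrently from separate threads, each with its own connection to the same SQLite database. The table name, columns and values are made up for the example.

```python
import os
import sqlite3
import tempfile
import threading

# Build a small throwaway database to query against.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(i, i * 1.5) for i in range(1000)])
con.commit()
con.close()

results = {}

def run_query(name, sql):
    # Each "transaction" gets its own connection, as separate sessions would.
    c = sqlite3.connect(path)
    results[name] = c.execute(sql).fetchone()[0]
    c.close()

# Two distinct queries processed in parallel with each other.
t1 = threading.Thread(target=run_query,
                      args=("count", "SELECT COUNT(*) FROM orders"))
t2 = threading.Thread(target=run_query,
                      args=("total", "SELECT SUM(amount) FROM orders"))
t1.start(); t2.start()
t1.join(); t2.join()

print(results["count"])  # 1000
print(results["total"])  # 749250.0
```

Intraquery parallelism, by contrast, would split one of those queries into sub-tasks (slide 26 returns to this).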

  7. More benefits • Higher Throughput • Fault tolerance • Price/Performance • 24 x 7 • Always available • Never down

  8. Parallelism’s goals Speed-up. As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor: 10 seconds to scan a DB of 10,000 records using 1 CPU 1 second to scan a DB of 10,000 records using 10 CPUs

  9. Parallelism’s goals Scale-up. As you multiply resources, the size of a task that can be executed in a given time should be increased by the same factor: 1 second to scan a DB of 1,000 records using 1 CPU 1 second to scan a DB of 10,000 records using 10 CPUs
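
The two goals can be written as simple ratios; this sketch just encodes the definitions and checks them against the slide numbers above:

```python
def speed_up(t_one_cpu, t_n_cpus):
    """Same task, more resources. Linear speed-up means this
    ratio equals the factor by which resources were multiplied."""
    return t_one_cpu / t_n_cpus

def scale_up(t_small_on_one, t_large_on_n):
    """N times the work on N times the resources. Linear
    scale-up means this ratio equals 1.0 (elapsed time unchanged)."""
    return t_small_on_one / t_large_on_n

# Speed-up: 10 s on 1 CPU vs 1 s on 10 CPUs for the same 10,000 records.
print(speed_up(10.0, 1.0))   # 10.0 -> linear with 10 CPUs

# Scale-up: 1 s for 1,000 records on 1 CPU vs 1 s for 10,000 on 10 CPUs.
print(scale_up(1.0, 1.0))    # 1.0 -> linear scale-up
```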

  10. [Chart: number of transactions/second vs number of CPUs, comparing linear speed-up (ideal) with sub-linear speed-up; data points 1000/sec at 5 CPUs, 1600/sec at 10 CPUs, 2000/sec at 16 CPUs]

  11. [Chart: number of transactions/second vs number of CPUs and database size, comparing linear scale-up (ideal) with sub-linear scale-up; data points 1000/sec at 5 CPUs with a 1 GB database, 900/sec at 10 CPUs with a 2 GB database]

  12. Architectures • SMP, Symmetric Multiprocessing – typically standalone machines with multiple processors, from 2 up to 64. All CPUs share the same memory, bus and I/O system; a single copy of the OS drives the CPUs. • MPP, Massively Parallel Processing – several nodes connected together. Each node has its own CPU, memory, bus, disks and I/O system, and runs its own copy of the operating system. The number of nodes in an MPP system can vary from two all the way to several thousand.

  13. Architectures • Clustered systems - A clustered system consists of several nodes loosely coupled using local area network (LAN) interconnection technology. Each of these nodes can be a single-processor machine or SMP machine. In a cluster, system software balances the workload among the nodes and provides for high availability.

  14. Architectures • Non-Uniform Memory Access (NUMA) – consists of several SMP systems that are interconnected in order to form a larger system. All of the memory in all of the SMP systems is connected together to form a single large memory space. NUMA systems run one copy of the operating system across all nodes.

  15. 3 architectures explained • See the Oracle/DB2 comparison on Wikibooks for a good, detailed explanation • Shared Disk • Shared Nothing • Shared Everything • Plenty of debate about which is best

  16. Shared Nothing

  17. Shared Nothing • The database is logically a single database • But the data is partitioned across the cluster – each node holds a chunk of the data, and read/write processes need to know where it is. • You need to decide how you will partition the data. • Over time, as data volumes grow, the distribution may become skewed and suboptimal, degrading performance. • Partitioning always involves a trade-off – partitioning one way may suit some queries and not others, and repartitioning is expensive and time consuming.

  18. Data Partitioning

  19. Some basic strategies

  20. Strategies cont. • Range • Based on ranges of data – date is the most common. • Hash • A hashing algorithm is applied to a key, distributing the data evenly. The best way to group unorganised data that has no obvious key and is not historical. • List • Explicit control based on a key value (such as States, Regions, Product Categories etc.)
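
The three strategies above amount to three different routing functions from a row to a partition number. A minimal sketch, in which the row layout, date boundaries and state-to-partition map are all invented for illustration:

```python
import hashlib

# Hypothetical order row: (order_id, state, order_date)
row = (1042, "TX", "2011-07-15")

def range_partition(order_date,
                    boundaries=("2010-01-01", "2011-01-01", "2012-01-01")):
    """Range: route by date; each partition covers one interval.
    ISO date strings compare correctly as plain strings."""
    for i, upper in enumerate(boundaries):
        if order_date < upper:
            return i
    return len(boundaries)  # overflow partition for the newest rows

def hash_partition(order_id, n_partitions=4):
    """Hash: spread keys evenly when there is no natural range."""
    digest = hashlib.md5(str(order_id).encode()).hexdigest()
    return int(digest, 16) % n_partitions

# List: explicit key -> partition control (values are made up).
LIST_MAP = {"TX": 0, "OK": 0, "NY": 1, "NJ": 1}

def list_partition(state):
    return LIST_MAP[state]

order_id, state, order_date = row
print(range_partition(order_date))  # 2 (falls in the 2011 interval)
print(hash_partition(order_id))     # some partition in 0..3, deterministic
print(list_partition("NY"))         # 1
```

The trade-off on slide 17 shows up directly here: `range_partition` makes date-range scans cheap but can skew load toward recent partitions, while `hash_partition` balances load but forces every range query to touch all partitions.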

  21. Shared Disk

  22. Shared Disk • All nodes in the cluster access a single disk farm (RAID array). • No need for replication or partitioning, although RAID arrays usually stripe or replicate data – this is automated. • A commonly implemented architecture

  23. Shared Everything

  24. Shared Everything • Essentially a single logical database existing on a multicore environment. • Memory and Processing shared • Disk shared • A general extension of the standard environment • Used by Oracle and DB2

  25. Parallel execution • Divide a (SQL) task among multiple processes in order to complete the task faster • From Oracle 8i onwards it is embedded in the RDBMS • Consider: • SQL> SELECT COUNT(*) FROM orders; • Without parallel processing, a single process scans the orders table and counts the number of rows

  26. Parallel execution • On a four-processor machine • The orders table is split into four parts; a process is started on each CPU, and the four parts of the table are scanned simultaneously
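
The split-scan-combine pattern above can be sketched outside a DBMS. This is not Oracle’s implementation, just the same idea: stand in for the orders table with a list, split it into chunks, count each chunk in a separate worker process, then sum the partial counts.

```python
from multiprocessing import Pool

def count_chunk(chunk):
    """Each worker's share of the scan: count its rows."""
    return len(chunk)

def parallel_count(rows, n_workers=4):
    # Split the "table" into roughly equal parts, one per worker.
    size = (len(rows) + n_workers - 1) // n_workers
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with Pool(n_workers) as pool:
        partial = pool.map(count_chunk, chunks)  # scans run concurrently
    return sum(partial)  # coordinator combines the partial results

if __name__ == "__main__":
    orders = list(range(10_000))   # stand-in for the orders table
    print(parallel_count(orders))  # 10000
```

The final `sum` plays the role of the coordinator process that merges the four partial counts into the single result the query returns.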

  27. Case Studies • Oracle 11g Real Application Clusters (RAC) • First introduced with 9i • Clustered Parallel DB environment • Shared Cache (their terminology) • A variation on shared everything • Fault tolerance • Performance • Scalability • No single point of failure

  28. Oracle RAC

  29. Oracle RAC • Servers share storage • Redundant interconnect linking the nodes provides failover • Support for up to 100 nodes – all must run the same OS and version of Oracle • Nodes can be SMP machines

  30. Teradata • Develops database machines for VLDBs (very large databases) • Specialises in Data Warehouse and Analytic applications • Massively Parallel Processing (MPP) environment running a shared nothing architecture (so, different to Oracle’s approach) • Runs on Windows and SUSE Linux • ‘Parallel everything’ • Data Warehouse – built in.

  31. Shared Nothing (Teradata)

  32. Teradata

  33. Teradata (architecture)

  34. Conclusions • Industry wants cheap, fast databases • Proprietary systems no longer seem valued • Shared memory seems to provide the highest performance but is not as scalable as shared nothing environments. • Shared nothing also provides more robust availability – better for larger environments • Shared everything is better for small to medium-sized organisations • Shared nothing requires careful thought about data partitioning strategies.
