
Parallel Databases



  1. Parallel Databases

  2. Ideal Parallel Systems
     Two key properties:
     • Linear Speedup: Twice as much hardware can perform the task in half the elapsed time (i.e., speedup = number of processors).
     • Linear Scaleup: Twice as much hardware can perform twice as large a task in the same elapsed time (i.e., scaleup = 1).
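The two ratios can be written down directly. The sketch below assumes the usual definitions (elapsed time on the small system divided by elapsed time on the larger one); the function names and the timings are hypothetical and only illustrate the ideal case described on the slide.

```python
def speedup(small_system_time: float, big_system_time: float) -> float:
    """Same task on both systems: how many times faster the larger system is."""
    return small_system_time / big_system_time


def scaleup(small_task_small_system_time: float,
            big_task_big_system_time: float) -> float:
    """Task and hardware grow by the same factor: 1.0 means elapsed time is unchanged."""
    return small_task_small_system_time / big_task_big_system_time


# Hypothetical timings (seconds) for a system with doubled hardware.
ideal_speedup = speedup(100.0, 50.0)    # same task finishes in half the time
ideal_scaleup = scaleup(100.0, 100.0)   # twice the task, same elapsed time
print(ideal_speedup, ideal_scaleup)
```

With doubled hardware, a speedup below 2 or a scaleup below 1 indicates that the system falls short of the linear ideal.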

  3. Barriers to Parallelism
     • Startup: The time needed to start a parallel operation (thread creation/connection overhead) may dominate the actual computation time.
     • Interference: When accessing shared resources, each new process slows down the others (hot spot problem).
     • Skew: The response time of a set of parallel processes is the time of the slowest one.
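A toy cost model makes the startup and skew barriers concrete. It assumes the response time is a fixed startup cost plus the time of the slowest process; the function name and all numbers are hypothetical, and interference is not modeled.

```python
from typing import Sequence


def parallel_response_time(startup: float, process_times: Sequence[float]) -> float:
    """Startup overhead plus the slowest process (the others finish earlier and wait)."""
    return startup + max(process_times)


# The same 100 units of work spread over four processes in two ways.
balanced = parallel_response_time(1.0, [25.0, 25.0, 25.0, 25.0])
skewed = parallel_response_time(1.0, [10.0, 10.0, 10.0, 70.0])
print(balanced, skewed)
```

Both runs do the same total work, but the skewed run is bounded by its 70-unit process, so adding more processes does not help unless the skew itself is reduced.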

  4. The Challenges
     • The ideal database machine has:
        • A single infinitely fast processor.
        • An infinitely large memory with infinite bandwidth.
       Unfortunately, technology is not delivering such machines.
     • The challenges are:
        • To build an infinitely fast processor out of infinitely many processors of finite speed.
        • To build an infinitely large memory with infinitely many storage units of finite speed.

  5. Why Parallel Databases?
     • High-performance, low-cost commodity components have recently become available.
     • Microprocessor-based systems are much cheaper than traditional mainframes.
     • Widespread adoption of the relational data model.
     • The relational data model is ideally suited to parallel execution.
     • Terabyte online databases are becoming common as the price of online storage decreases.
     • It is difficult to build mainframes powerful enough to meet the I/O demands of large relational databases.

  6. Hardware Architecture
     [Diagrams: Shared-Everything (SE) and Shared-Disk (SD) architectures, each showing processors P0, P1, ..., Pn, memory, and disks connected by a communication network]
     • nCUBE/2
     • Original Digital VAX cluster
     • Sun Fire 3800-6800-15000(72)
     • IBM pSeries 610 - 690(32)
     • IBM 3090 series
     • Digital VAX
     • Sequent Symmetry

  7. Shared-Nothing Architecture
     [Diagram: Shared-Nothing (SN) architecture, showing processing nodes (PN) P0, P1, ..., Pn, each with memory and disk, connected by a communication network]
     Consensus: The Shared-Nothing architecture is the most scalable for supporting very large databases.
     • IBM SP/2
     • Teradata DBC/1012
     • Tandem

  8. IBM RS/6000 SP
     • It allows for up to 8,192 individual processors to be combined and managed as a single system.
     • Processors are packaged in shared memory nodes of up to 16 processors each.
     • IBM's well-planned roadmap for the SP allows customers to start small and scale up to larger, more powerful systems.
     • This may entail adding nodes without having to replace existing hardware -- ensuring long-term investment protection as operating needs grow.

  9. Parallel Database Servers
     • Tandem NonStop SQL
     • Informix: Online 7.0 supported the SE environment with the Informix Parallel Data Query (PDQ). Its 8.0 version supports SN computers.
     • Oracle: two products, Parallel Server and Parallel Query Option (PQO).
     • Sybase: Navigation Server.
     • AT&T Global Information Solutions (GIS).
     • IBM DB2 Parallel Edition: Supports the IBM SP2 SN multiprocessor.

  10. Process Structure for PDB
      [Diagram labels: SQL, Results]
      • Query optimization
      • Query Scheduling
      • Data placement
      • SE, SD, SN architectures

  11. Parallelism in Relational Data Model
      [Diagram: SCAN operators on Table A and Table B, a JOIN, and an INSERT into C]
      • Pipeline Parallelism: If one operator sends its output to another, the two operators can execute in parallel.
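The operator chain named on the slide (SCAN, JOIN, INSERT) can be sketched with Python generators. This is only a model of the dataflow: generators run in a single thread, so they show how a downstream operator consumes tuples as the upstream operator produces them, not true concurrent execution. The hash join and the table contents are assumptions chosen for illustration.

```python
def scan(table):
    """SCAN: emit the rows of a table one at a time."""
    for row in table:
        yield row


def hash_join(left_rows, right_rows, key):
    """JOIN: build a hash index on the left input, then stream the right input through it."""
    index = {}
    for row in left_rows:                      # build phase consumes the left input
        index.setdefault(row[key], []).append(row)
    for row in right_rows:                     # probe phase emits matches as they are found
        for match in index.get(row[key], []):
            yield {**match, **row}


def insert(rows, target):
    """INSERT: append each incoming row to the target table."""
    for row in rows:
        target.append(row)
    return target


# Hypothetical tables A and B; C receives the join result.
table_a = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
table_b = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
table_c = insert(hash_join(scan(table_a), scan(table_b), "id"), [])
print(table_c)
```

Note that the build phase of the join must read its whole left input before emitting anything, so only the probe side is pipelined in this sketch.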

  12. Partitioned Parallelism
      [Diagram: SCAN, JOIN, and INSERT operators replicated over table partitions (A0, A1, A2; B0, B1; C0, C1, C2)]
      • Partitioned Parallelism: By taking the large relational operators and partitioning their inputs and outputs, it is possible to turn one big job into many concurrent independent little ones.
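A minimal sketch of the same idea, assuming a selection scan as the operator: the input is already split into partitions, and one worker scans each partition independently. The thread pool, function names, and data are hypothetical; in CPython, threads illustrate the structure rather than real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition, predicate):
    """One small independent job: filter a single partition."""
    return [row for row in partition if predicate(row)]


def parallel_scan(partitions, predicate):
    """Run one scan per partition concurrently and concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return [row for part in partial_results for row in part]


# Hypothetical relation stored as three partitions.
partitions = [[1, 8, 3], [10, 2], [7, 9]]
selected = parallel_scan(partitions, lambda value: value > 5)
print(selected)
```

Because the per-partition jobs share no state, the same structure applies when each partition lives on a different processing node.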

  13. Data Partitioning Strategies
      • There are two problems for the SN architecture:
         • The degree of parallelism is determined by the physical layout of the data across the PNs.
         • Its performance is very sensitive to skew in the data distribution.
      • Partitioned data is the key to partitioned execution:
         • Round-Robin
         • Hash Partitioning
         • Range Partitioning

  14. Round-Robin Partitioning
      [Diagram: records distributed over disks D0, D1, D2, D3]
      • It maps the ith tuple to disk i mod n, where n is the number of disks.
      • Advantage: It's simple.
      • Disadvantage: It does not support associative search.
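The mapping rule (tuple i goes to disk i mod n) is short enough to show in full. The function name and the sample data are hypothetical.

```python
def round_robin_partition(tuples, n_disks):
    """Place the i-th tuple on disk i mod n_disks."""
    disks = [[] for _ in range(n_disks)]
    for i, item in enumerate(tuples):
        disks[i % n_disks].append(item)
    return disks


# Ten records spread over four disks (D0..D3).
layout = round_robin_partition(list(range(10)), 4)
print(layout)
```

The disks stay balanced regardless of the data values, but because placement ignores those values, a lookup on any attribute has to search every disk.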

  15. Hash Partitioning
      [Diagram: a hash function distributing tuples over disks D0, D1, D2, D3]
      • It maps each tuple to a disk location based on a hash function.
      • Advantage: Associative access to the tuples with a specific attribute value can be directed to a single disk.
      • Disadvantage: It tends to randomize data rather than cluster it.
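A minimal sketch of hash partitioning on one attribute. The slide does not name a hash function; CRC-32 from the standard library is used here only because it is deterministic across runs. The function names and the sample relation are hypothetical.

```python
import zlib


def hash_disk(value, n_disks):
    """Disk number for an attribute value (CRC-32 is an arbitrary, deterministic choice)."""
    return zlib.crc32(str(value).encode("utf-8")) % n_disks


def hash_partition(tuples, key, n_disks):
    """Place each tuple on the disk selected by hashing its partitioning attribute."""
    disks = [[] for _ in range(n_disks)]
    for row in tuples:
        disks[hash_disk(row[key], n_disks)].append(row)
    return disks


def lookup(disks, key, value):
    """Exact-match search touches a single disk."""
    target = disks[hash_disk(value, len(disks))]
    return [row for row in target if row[key] == value]


students = [{"name": n} for n in ["Adams", "Baker", "Chen", "Diaz", "Evans", "Ford"]]
disks = hash_partition(students, "name", 4)
print(lookup(disks, "name", "Chen"))
```

An exact-match lookup visits one disk, while a range predicate on the same attribute gains nothing, since neighboring values hash to unrelated disks.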

  16. Range Partitioning
      [Diagram: disks D0, D1, D2, D3 holding the ranges A~F, G~L, M~R, S~Z]
      • It maps contiguous attribute ranges of a relation to various disks.
      • Advantage: It is good for associative search and clustering data.
      • Disadvantage: It risks execution skew in which all the execution occurs in one partition.
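Using the ranges from the slide (A~F, G~L, M~R, S~Z on D0 to D3), a range lookup reduces to a binary search over the upper bounds. Partitioning on the first letter of a name is an assumption made for this sketch, as are the function names.

```python
from bisect import bisect_left

# Upper bounds of the first three ranges; anything above "R" falls on the last disk.
UPPER_BOUNDS = ["F", "L", "R"]


def range_disk(name):
    """Disk holding a name, based on its first letter: A~F -> 0, G~L -> 1, M~R -> 2, S~Z -> 3."""
    return bisect_left(UPPER_BOUNDS, name[0].upper())


def disks_for_range(low, high):
    """Disks that a range query low..high must visit."""
    return list(range(range_disk(low), range_disk(high) + 1))


print(range_disk("Garcia"), range_disk("Zhang"))
print(disks_for_range("H", "N"))
```

A query confined to one range, such as names starting with H to K, runs entirely on one disk. That keeps the search small, but it is also the execution skew the slide warns about.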

  17. Horizontal Data Partitioning

  18. Problems for Horizontal Partitioning
      • Query 1: Retrieve the names of students who have a GPA better than 2.0. → Only P2 and P3 can participate.
         • In a multi-user environment, the system can effectively use all the remaining PNs for other queries (generally not achievable).
      • Query 2: Retrieve the names of students who major in Computer Science. → The whole file must be searched.
         • This problem cannot be easily addressed.

  19. To Address the Problem
      • The relation is horizontally partitioned and distributed across the PNs. Locally, each partition is organized as a grid file.
      • The relation is partitioned using multiple attributes. Locally, each partition can be organized as a grid file (the approach investigated in most research).

  20. Multidimensional Data Partitioning
      [Figure: grid over two attributes, Age (25 to 60) and Salary in K (20 to 90), showing a 1-attribute query and a 2-attribute query]

  21. Advantage of MDP
      • Degree of parallelism is maximized (using as many processing nodes as possible).
      • Search space is minimized (searching only relevant data blocks).

  22. Query Types
      • Query Shape: The shape of the data sub-space accessed by a range query.
      • Square Query: The query shape is a square.
      • Row Query: The query shape is a rectangle containing a number of rows.
      • Column Query: The query shape is a rectangle containing a number of columns.

  23. Disk Modulo (DM) Allocation
      [Figure: a 16×16 DM example]

  24. Disk Modulo
      • Advantage: optimal for row and column queries.
      • Disadvantage: poor for square queries.
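The slide's figure is not reproduced here, so the sketch below assumes the commonly cited two-dimensional Disk Modulo rule, disk = (row + column) mod N, on a 16×16 grid with a hypothetical 16 disks. It measures a query's cost as the largest number of grid cells any single disk has to read.

```python
from collections import Counter


def dm_disk(row, col, n_disks):
    """Disk Modulo rule (assumed form): cell (row, col) goes to disk (row + col) mod n_disks."""
    return (row + col) % n_disks


def max_cells_per_disk(rows, cols, n_disks):
    """Cost of a range query: the busiest disk's share of the cells it covers."""
    load = Counter(dm_disk(r, c, n_disks) for r in rows for c in cols)
    return max(load.values())


DISKS = 16
row_cost = max_cells_per_disk(range(3, 4), range(0, 16), DISKS)     # one full row
square_cost = max_cells_per_disk(range(0, 4), range(0, 4), DISKS)   # a 4x4 square
print(row_cost, square_cost)
```

Both queries cover 16 cells, so with 16 disks the best possible cost is 1. The row query reaches it, while the square query places all cells of each anti-diagonal on one disk.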

  25. Hilbert Curve Allocation Method (HCAM)
      [Figures: a Hilbert curve; a 16×16 HCAM example]

  26. HCAM
      • HCAM is based on the idea of space filling curves.
      • A space filling curve visits all points in a k-dimensional space grid exactly once and never crosses itself.
      • Advantages: good for square range queries.
      • Disadvantages: poor for row and column queries.
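A sketch of the idea, under two assumptions not stated on the slide: cells are numbered with the standard iterative Hilbert-curve index for a power-of-two grid, and disks are assigned round-robin along the curve (index mod N). The 16 disks and all function names are hypothetical.

```python
from collections import Counter


def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant so the curve stays connected
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hcam_disk(n, x, y, n_disks):
    """Assumed HCAM rule: walk the curve and deal cells to disks round-robin."""
    return hilbert_index(n, x, y) % n_disks


def max_cells_per_disk(n, xs, ys, n_disks):
    """Cost of a range query: the busiest disk's share of the cells it covers."""
    load = Counter(hcam_disk(n, x, y, n_disks) for x in xs for y in ys)
    return max(load.values())


GRID, DISKS = 16, 16
square_cost = max_cells_per_disk(GRID, range(0, 4), range(0, 4), DISKS)   # a 4x4 square
row_cost = max_cells_per_disk(GRID, range(0, 16), range(0, 1), DISKS)     # one full row
print(square_cost, row_cost)
```

The 4×4 square at the origin covers 16 consecutive curve positions, so each disk reads exactly one cell. The full row also covers 16 cells but puts more than one of them on at least one disk, which matches the trade-off listed on the slide.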

  27. General Multidimensional Data Allocation (GeMDA)
      [Figure: a 16×16 2-D GeMDA example]

  28. 2-D GeMDA
      • Regular Rows: Circular left shift   positions.
      • Check Rows: Circular left shift  +1 positions.
      • Number of check rows: GCD( , N) - 1
      Advantages: optimal for row, column, and small square range queries (|Q| <  2).
      N is the number of PNs.

  29. 3-D GeMDA

  30. Mapping Function For GeMDA

  31. Optimality Comparison
