1 / 51

Parallel & Distributed Database Systems

Parallel & Distributed Database Systems. Reid Exley November 3, 2005. Overview. Parallel Databases Background Architectures Applications Distributed Databases Background Characteristics Applications. Parallel Databases. What is the PROBLEM?. Performance Mainframes expensive

sabine
Download Presentation

Parallel & Distributed Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parallel & Distributed Database Systems Reid Exley November 3, 2005

  2. Overview • Parallel Databases • Background • Architectures • Applications • Distributed Databases • Background • Characteristics • Applications

  3. Parallel Databases

  4. What is the PROBLEM? • Performance • Mainframes expensive • I/O bottleneck (speed(disk) < speed(RAM) < speed(cpu))

  5. What is a parallel database? • Definition: a storage and retrieval method that seeks to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries

  6. Why parallel databases? • Relational model affords it • SQL can easily be ported to parallel processing • Hardware trends • Decreasing costs of servers • Increasing speed of hardware • Complex queries needed to be solved • Data warehouse • Search engines

  7. 3 Three Architectures • Shared-memory system • Shared-disk system • Shared-nothing system

  8. Shared-memory system • Multiple CPUs attached to an interconnection network and can access a common region of main memory • Closest to a conventional machine DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

  9. What’s wrong with it? • Memory contention becomes a bottleneck as the # of CPUs increases

  10. Shared-disk system • Each CPU has a private memory and direct access to all disks through an interconnection network DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

  11. What’s wrong with it? • Disks over the network become the bottleneck • INTERFERENCE: as more CPUs are added, existing CPUs are slowed down because of the increased contention for memory accesses and network bandwidth

  12. Shared-nothing system • Each CPU has local main memory and disk space, but no two CPUs can access the same storage area • All communication between CPUs is through a network connection • Considered the best parallel architecture DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

  13. Disadvantage: • Requires more extensive reorganization of the DBMS code • Load Balancing complex

  14. Advantage: • Provides linear speed-up: Twice as much hardware can perform the task in half the elapsed time • Provides linear scale-up: Twice as much hardware can perform twice as large a task in the same elapsed time DeWitt, J. D. & Gray, J. (1992). Parallel Database Systems: The Future of HighPerformance Database Processing. CACM.

  15. How is parallelism achieved? • Pipelining • Data Partitioning

  16. Pipelining • The output of the second operator is worked on by the first operator as soon as it is generated Module A Module B Module C Time = 0, 1, 2 Module D Time = 1, 2 Module E Time = 2

  17. Pipelining Challenges • Blocking – an operator is blocking if it does not output until it has consumed all inputs • Sorting & Aggregation • Blocking kills pipeline parallelism • Relational pipelines are rarely very long • Execution cost of one operation is sometimes much larger than others (skew)

  18. Data Partitioning • Spreads I/O and computation among processors

  19. Partitioning Methods • Range • Hash • Round Robin

  20. Range • Rows are sorted, and n ranges are chosen for the sort key values, so each range contains roughly the same # of rows • Rows in range i are assigned to processor i • Considerations: • Good for equijoins, range queries, group by • May need to sample data from disks to get even distribution (disk skew)

  21. Hash • Hash function is applied to fields of a row to determine its processor • Considerations: • Good for equijoins • Good for accessing a subset of a relation

  22. Round-robin • If there are n processors, the ith tuple is assigned to processor i mod n • Considerations: • Good if accessing the entire table or relation • Good for load balance

  23. Node 1 Node 2 R2: R1: S1 S2 Node 3 Node 4 Parallel Join Example

  24. Real-life Application • Search engines • Google

  25. Open Problems • Database query optimizers do not consider all possible plans • min{cost = data transmission + local processing} • Highly skewed value distributions • Hybrid Architectures • Support of higher functionality such as rules and objects

  26. Distributed Databases

  27. What is the PROBLEM? • Availability • Redundancy • Local Ownership • Data needs are often local • Accessing Distributed Data • Data at multiple sites must be accessible

  28. What is it? • A collection of multiple, logically interrelated databases distributed over a computer network

  29. Characteristics • Distributed Data Independence • User should not have to know where data is located • Distributed Transaction Atomicity • Database should appear to be one local database • Atomic: All or nothing transactions • Data is Logically Related • Most DDB are relational

  30. Data Independence

  31. Characteristics • Distributed Data Independence • User should not have to know where data is located • Distributed Transaction Atomicity • Database should appear to be one local database • Atomic: All or nothing transactions • Data is Logically Related • Most DDB are relational

  32. Logical View

  33. Physical View

  34. Characteristics • Distributed Data Independence • User should not have to know where data is located • Distributed Transaction Atomicity • Database should appear to be one local database • Atomic: All or nothing transactions • Data is Logically Related • Most DDB are relational

  35. Dimensions • Autonomy • How independent are each of the DBs • Distribution • How many DBs are there • Heterogeneity • How different are the data models, query languages, interfaces, and transaction management protocols

  36. Architectures • Client-Server • Collaborating Server • Middleware

  37. Client-server • Client sends query to single site • All query processing done at one server Client Server Query

  38. DB DB DB DB DB Collaborating Server • A collection of db servers that cooperatively execute transactions • A server that generates appropriate subqueries to be executed by other servers and puts the results together when data is distributed Query

  39. Query DB DB Middleware DB DB Result Middleware • Coordinates the execution of queries and transactions across one or more independent database servers; • Does not contain any data itself

  40. Fragmentation • Breaking a relation into smaller relations or fragments and storing the fragments possibly at different sites • Horizontal fragmentation (rows) • Vertical fragmentation (columns) • Hybrid?

  41. Replication • Storing several copies of a relation or fragments • Why? • Increased availability of data (redundancy) • Faster Query Evaluation (local is faster) • Necessary for Data Warehousing

  42. Replication: Two Methods • Synchronous • All copies are synchronize before commit • Asynchronous • Modified copies are only periodically updated • More common • Sacrifices data independence for efficiency • 2 Types: Primary Site or P2P Q. When would you use each method?

  43. Primary Site • One copy of a relation is designated as the primary or master • Secondary copies are made from primary are not directly updated • Publisher & Subscriber

  44. Peer-to-peer • More than one copy can be master • Changes to a master copy must be propagated to other copies • If two master copies are changed in a conflicting manner, it must be resolved (Site 1: Joe’s age changed to 35; Site 2 to 36) • Used best in situations that don’t produce conflicts • Each master site owns a disjoint fragment • Updating rights own by only one master at a time

  45. Distributed Transactions • Distributed Concurrency Control • Distributed Recovery

  46. Distributed Concurrency Control • How do we manage locks for objects across all sites? • Centralized: One site does all locking • Vulnerable to single-site failure • Primary Copy: All locking for an object done at the primary copy for this object • Reading requires access to locking site as well as site where the object is stored • Fully Distributed: Locking for a copy done at site where copy is stored • Locks at all sites while writing an object

  47. Distributed Recovery • Two new issues: • Link and remote system failures • Must ensure atomicity of sub-transactions • Solution: Two-phase commit • Coordinator (originator) asks for a vote (yes or abort) from subordinates (other DBs) • If unanimous yes, then coordinator tells all the others to commit • Coordinator waits for “ack” from all subordinates then commits and ends transaction log

  48. Popular Support • Oracle • Grid Databases/Computing • http://www.oracle.com/technologies/grid/index.html • Mysql • Cluster DBMS • http://www.mysql.com/products/database/cluster/ • DB2, MSSQL Server, etc.

  49. Applications • Manufacturing – especially at multiple locations • Military command and control • Airlines • Hotel Chains • (Any organization which has a decentralized structure)

  50. Open Problems • Increasing performance of locking protocols • Speculative Locking – waiting transactions execute on both the before and after states of preceding transactions prior to a commit (Krishna & Kitsuregawa, 2004)

More Related