22. Parallel, Distributed Access • Parallel vs Distributed DBMSs • Parallel DBMSs • Measuring success: scalability • Hardware Architectures • Types of parallelism, partitioning • Parallelizing operators: scan, select, sort, aggregate, join • Distributed DBMSs • Classifying distributed DBMSs • Heterogeneous & homogeneous, client-server & collaborating server, horizontal & vertical fragmentation • Distributed catalog management • Distributed query processing • General queries, joins, optimization • Replication • Synchronous: voting, read-any write-all • Asynchronous: Peer to Peer, Primary Site (Capture and Apply) • Distributed Locking, Deadlock Detection, Transactions
Parallel vs. Distributed DBMSs • A DBMS may be spread among the nodes of a network • Parallel DBMS • Nodes are under central control • Typically nodes are in one location • Motivated by performance • Distributed DBMS • Nodes are autonomous, may run different DBMSs • Nodes are physically distributed • Motivated by the need for remote access to data in spite of network failures
Motivation for Parallel DBMSs: What if one CPU is not enough? • For a large web application, one CPU is never enough • A large web application may need to scan a 1TB file or service 100s of customers at one time • At 50MB/second, scanning a 1TB file takes about 5.6 hours (20,000 seconds) • Using 1000 CPUs scanning 1000 disks, the same scan takes 20 seconds • Similarly, 100s of customers accessing a database can push response times into the minutes • Solution: Parallel DBMSs
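To sanity-check these numbers, here is a quick back-of-the-envelope calculation (a sketch only; the 50 MB/s per-disk scan rate and 1 TB file size are the figures assumed on the slide):

```python
# Back-of-the-envelope scan times, using the slide's figures.
FILE_MB = 1_000_000        # 1 TB expressed in MB (decimal units)
SCAN_RATE = 50             # MB/second per disk

single = FILE_MB / SCAN_RATE                  # one CPU scanning one disk
parallel = FILE_MB / (SCAN_RATE * 1000)       # 1000 CPUs scanning 1000 disks in parallel

print(f"1 CPU    : {single:,.0f} s (~{single / 3600:.1f} hours)")
print(f"1000 CPUs: {parallel:,.0f} s")
```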
DBMS: The || Success Story • DBMSs, and Information Retrieval, are the most (only?) successful applications of parallelism • Every major DBMS vendor has some || server • Large DBMS backends for web applications are all parallelized • Large IR applications are parallelized • Reasons for DBMS success: • Relational Algebra has few operators • Just parallelize each one and compose them • Bulk processing (= partition ||-ism) • Natural pipelining in operator trees
How do we measure the success of parallel algorithms and architectures? • Scalability • There are two kinds of scalability • Speedup: when the number of nodes grows by a factor of n, the same workload runs n times faster • Scaleup: when the number of nodes and the size of the database both grow by a factor of n, efficiency does not change • Efficiency can be measured in terms of • Elapsed time, or average time per transaction • Number of transactions per second (throughput)
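A minimal sketch of how the two metrics might be computed from elapsed-time measurements (the timing numbers below are hypothetical, chosen only to illustrate near-ideal speedup and scaleup):

```python
# Speedup:  same workload, n times the nodes -> ideally runs n times faster.
# Scaleup:  n times the workload on n times the nodes -> ideally the same elapsed time.

def speedup(elapsed_small_system: float, elapsed_big_system: float) -> float:
    """Ideal value is n when the big system has n times the nodes."""
    return elapsed_small_system / elapsed_big_system

def scaleup(elapsed_small_problem: float, elapsed_big_problem: float) -> float:
    """Ideal value is 1.0: the n-times-larger problem on n times the nodes takes the same time."""
    return elapsed_small_problem / elapsed_big_problem

# Hypothetical elapsed times (seconds): 1 node vs. 8 nodes.
print(f"speedup: {speedup(800.0, 110.0):.2f}x (ideal: 8x)")
print(f"scaleup: {scaleup(800.0, 850.0):.2f}  (ideal: 1.0)")
```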
Architecture Issue: Shared What? • Shared Memory (SMP): processors share memory and disks; easy to program, expensive to build, difficult to scale beyond ~30 processors (SUN, many others) • Shared Disk: nodes have private memory but share the disks; used by clusters • Shared Nothing: each node has its own processors, memory, and disks; hard to program, cheap to build, easy to scale to 1000s of processors (NCR Teradata, IBM SP) • In all three architectures, CLIENTS connect to the system as a whole
How would you parallelize this plan? • Assume Sailors, Reservations are already sorted on sid • Plan: a sort-merge join ⋈sid of Sailors and Reservations, with an on-the-fly selection rank<5 applied to the join output
Different Types of DBMS ||-ism • Pipeline parallelism: Inter-operator • each operator may run concurrently on a different node (may exploit pipelining or bushy plans) • Partition parallelism: Intra-operator • multiple nodes work to compute a given operation (scan, sort, join) • We’ll focus on intra-operator ||-ism • Challenge: How to partition the data!
Three Types of Data Partitioning • Range: tuples are assigned to nodes by key ranges (A...E, F...J, K...N, O...S, T...Z); good for equijoins, range queries, and group-by • Hash: tuples are assigned to nodes by a hash of the key; good for equijoins and group-by • Round Robin: tuples are dealt out to nodes in turn; good to spread load • Shared disk and shared memory are less sensitive to partitioning; shared nothing benefits more from "good" partitioning
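A minimal sketch of the three schemes (the node count, range boundaries, sample names, and the CRC32 stand-in for the hash function are all illustrative assumptions):

```python
import zlib
from itertools import count

NODES = 5
# Upper bounds corresponding to the ranges A...E, F...J, K...N, O...S, T...Z.
RANGE_UPPER_BOUNDS = ["E", "J", "N", "S", "Z"]

def range_partition(key: str) -> int:
    """Send the tuple to the first node whose upper bound covers its key."""
    first = key[:1].upper()
    for node, upper in enumerate(RANGE_UPPER_BOUNDS):
        if first <= upper:
            return node
    return NODES - 1

def hash_partition(key: str) -> int:
    """Send the tuple to the node chosen by a hash of the key."""
    return zlib.crc32(key.encode()) % NODES

_next = count()
def round_robin_partition(_key: str) -> int:
    """Ignore the key and deal tuples out to nodes in turn."""
    return next(_next) % NODES

for name in ["Dustin", "Lubber", "Rusty"]:
    print(name, range_partition(name), hash_partition(name), round_robin_partition(name))
```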
How to Parallelize the Scan Operator • Given a partitioned relation, what is the parallel version of the scan operator? • Scan each partition in parallel, then merge the results • What kind of partitioning will make the scan scalable? • Equal-sized partitions
Parallelizing Selection • What is the parallel version of the selection operator? • Range partition, range selection • Hash partition, range selection • Hash partition, equality selection • Round Robin partition • Which of the above is scalable for speedup? • Range partition, range selection • Hash partition, range selection • Hash partition, equality selection • Round Robin partition
Parallel Sorting: Two Algorithms • Distributive sort: Range partition the data, then sort each range • Is it scalable? Only if partitions are equal sized • How to choose partitioning vector? • Parallel External Sort-Merge: Sort the data at each node, then merge • Problem: how to merge in parallel • One solution: range partition the data • Is it scalable?
Parallel Sorting, ctd. • How to choose partitioning vector? Sampling • What if sampling is not possible? • For example, if input is another process, such as another operator in a query plan • In this case can use external sort-merge and set up a tree of merging steps • See US Patent #5,852,826
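A small sketch of the distributive-sort idea, using sampling to choose the partitioning vector; the "nodes" here are simulated with Python lists, and the sample size and node count are arbitrary illustrative choices (this is not the patented tree-of-merges approach):

```python
import random
from bisect import bisect_left

def choose_partition_vector(data, n_nodes, sample_size=100):
    """Sample the input and take evenly spaced quantiles as range boundaries."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = len(sample) // n_nodes
    return [sample[i * step] for i in range(1, n_nodes)]

def distributive_sort(data, n_nodes=4):
    bounds = choose_partition_vector(data, n_nodes)
    partitions = [[] for _ in range(n_nodes)]
    for x in data:                      # range-partition the data across the nodes
        partitions[bisect_left(bounds, x)].append(x)
    for p in partitions:                # each "node" sorts its own range independently
        p.sort()
    # Concatenating the sorted ranges in order yields the globally sorted result.
    return [x for p in partitions for x in p]

data = [random.randrange(10**6) for _ in range(10_000)]
assert distributive_sort(data) == sorted(data)
```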
Parallel Aggregates • Problem: Compute an aggregate function in parallel • For each aggregate function, need a decomposition: • sum(S) = Σi sum(s(i)) • count(S) = Σi count(s(i)) • avg(S) = ( Σi sum(s(i)) ) / ( Σi count(s(i)) ) • Is it scalable? • Sometimes it’s not so simple: median( ) • Similar for groups
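A minimal sketch of these decompositions, with the per-node partitions simulated as Python lists (the data values are invented):

```python
# Each node computes a partial aggregate over its partition s(i);
# a coordinator combines the partials.  AVG carries SUM and COUNT separately.
partitions = [[10, 20, 30], [5, 15], [40]]          # hypothetical per-node data

partials = [(sum(p), len(p)) for p in partitions]   # computed locally at each node

total_sum = sum(s for s, _ in partials)             # sum(S)   = sum over i of sum(s(i))
total_cnt = sum(c for _, c in partials)             # count(S) = sum over i of count(s(i))
avg = total_sum / total_cnt                         # avg(S)   = total sum / total count

print(total_sum, total_cnt, avg)                    # 120 6 20.0
# Note: median does not decompose this way -- knowing each node's local
# median is not enough to compute the global median.
```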
Parallel Hash Join • Recall Phase 1 of hash join: the original relations (R, then S) are read from disk through B main-memory buffers, and a hash function h scatters their tuples into B-1 output partitions written back to disk • Do Phase 1 in parallel. For R, then S: • Perform partitioning at each node • Send hash bucket i’s output to node i, bucket j’s to node j, etc. • Phase 2 is simpler • Perform the join of Ri and Si at each node i
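A minimal sketch of the two phases, with nodes simulated as lists and a modulus standing in for the hash function h (the relations R and S are invented):

```python
from collections import defaultdict

N_NODES = 3

def hash_node(key):
    return key % N_NODES            # stand-in for the hash function h

def partition(relation, key_index):
    """Phase 1 at each node: route every tuple to the node owning its hash bucket."""
    buckets = [[] for _ in range(N_NODES)]
    for tup in relation:
        buckets[hash_node(tup[key_index])].append(tup)
    return buckets

def parallel_hash_join(R, S):
    R_parts, S_parts = partition(R, 0), partition(S, 0)
    result = []
    for i in range(N_NODES):        # Phase 2: each node i joins Ri with Si locally
        build = defaultdict(list)
        for r in R_parts[i]:
            build[r[0]].append(r)
        for s in S_parts[i]:
            for r in build.get(s[0], []):
                result.append(r + s[1:])
    return result

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y"), (3, "z"), (4, "w")]
print(parallel_hash_join(R, S))     # [(3, 'c', 'y'), (3, 'c', 'z'), (1, 'a', 'x')]
```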
Dataflow Network for || Join: A ⋈ B • Good use of split/merge makes it easier to build parallel versions of sequential join code • (Figure: split and merge operators route tuples between Node I and Node J)
Complex Parallel Query Plans • Complex Queries: Inter-Operator parallelism • Pipelining between operators: • note that sort and phase 1 of hash-join block the pipeline! • Bushy Trees (figure: A ⋈ B computed at sites 1-4, R ⋈ S at sites 5-8, and the results combined at sites 1-8)
Distributed Databases • Data is stored at several sites, each managed by a DBMS that can run independently. • Distributed Data Independence: Users should not have to know where data is located (extends Physical and Logical Data Independence principles). • Distributed Transaction ACIDITY: Users should be able to write atomic and durable Xacts accessing multiple sites
Types of Distributed Databases • Homogeneous: Every site runs the same type of DBMS • Heterogeneous: Different sites run different DBMSs (different RDBMSs or even non-relational DBMSs), typically accessed through a gateway (figure: a gateway in front of DBMS1, DBMS2, DBMS3)
Network Architectures • Definitions • Client: requests a service • Server: provides a service • Client-server architecture: one of each • Fat client: applications run on the client • Thin client: applications run on the server • Business applications use thin clients • Fat clients are too hard to manage • Personal applications use fat clients • Does anyone use Google Docs? • 3-tier architecture: the middle tier runs middleware • Middleware includes business logic and DBMS coordination • The DBMS back end may involve multiple collaborating servers
Availability • Definition: the % of time the system can answer queries • Sometimes only local queries can be answered • What detracts from availability? • Node crashes • Network crashes • During a (perhaps partial) network failure, the DBMS at each node should continue to operate.
Storing Data –Fragmentation • Horizontal Fragmentation • Usually disjoint • Example • Motivation: More efficient to place data where queries are located, if possible • Vertical Fragmentation • Remember Normalization? • Not as common in DDBMSs as horizontal fragmentation
Storing Data - Replication • (Figure: fragments R1 and R2 stored at SITE A; R1 and R3 at SITE B, so R1 is replicated) • Disadvantages (redundancy!) • Wasted space • Possibly inconsistent data values • Advantages • Increased availability • Faster query evaluation • How do we keep track of all replicas of data?
Distributed Catalog Management • Catalog contains schema, authorization, statistics, and location of each relation/replica/fragment • Problem: where to store the catalog? • Example: I want to query sales. Where do I find it? Location info is in the catalog, but where is the catalog? • Solutions • Store the entire catalog at every site • Catalog updates are too expensive • Store the entire catalog at a single master site • Single point of failure, performance bottleneck • Store catalog info for all replicas and fragments of a relation at the birthsite of each relation • How do I find the birthsite of a relation?
Distributed Catalog Management (ctd.) • Intergalactic standard solution: • Name each relation with <local-name, birth-site> • E.g., <sales, Portland01> • Keep catalog of a relation at its birthplace • How many network I/Os are required to update a catalog? • If a data item’s location is changed, does the birth-site change? How many network I/Os are required?
Distributed Queries • SELECT AVG(S.age) FROM Sailors S WHERE S.rating > 3 AND S.rating < 7 • Horizontally Fragmented: tuples with rating < 5 at Shanghai, >= 5 at Tokyo • Must compute SUM(age), COUNT(age) at both sites • If WHERE contained just S.rating > 6, only one site would be needed • Vertically Fragmented: sid and rating at Shanghai, sname and age at Tokyo, tid at both • How to evaluate the query? • Replicated: Sailors copies at both sites • Choice of site based on local costs and network costs.
Distributed Joins • Sailors (500 pages) is stored at LONDON; Reserves (1000 pages) at PARIS • SELECT * FROM Sailors S, Reserves R WHERE S.sid = R.sid AND rank = 5 • Fetch as Needed, Page NL, Sailors as outer: • Cost: 500 D + 500 * 1000 (D+S)/10 • /10 because of the rank=5 condition • D is the cost to read/write a page; S is the cost to ship a page • S is very large, so the cost is essentially 50,000 S • Ship to One Site: ship Reserves to London • Cost: 1000 S + 4500 D (sort-merge join; cost = 3*(500+1000)) • Essentially 1000 S
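Plugging the slide's numbers into code makes the comparison concrete; D and S below are illustrative values only, and the conclusion relies only on S being much larger than D:

```python
# Page counts from the slide: Sailors = 500 pages (London), Reserves = 1000 pages (Paris).
# D = cost to read/write a page, S = cost to ship a page.  Assume S >> D.
D, S = 1, 1000          # illustrative values only

sailors, reserves = 500, 1000
selectivity = 0.10      # rank = 5 keeps ~10% of the Sailors pages used as the outer

fetch_as_needed = sailors * D + sailors * reserves * (D + S) * selectivity
ship_to_one_site = reserves * S + 3 * (sailors + reserves) * D   # ship Reserves, then sort-merge join

print(f"fetch as needed : {fetch_as_needed:,.0f}  (~50,000 S)")
print(f"ship to one site: {ship_to_one_site:,.0f}  (~1,000 S)")
```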
Semijoin Technique • At London, project σrank=5(Sailors) onto the join column sid and ship this projection to Paris. Cost: 5S • 10% for rank=5, 10% for the projection onto the join column • At Paris, join the Sailors projection with Reserves • The result is called the reduction of Reserves wrt Sailors • Ship the reduction of Reserves to London. Cost: 100S • At London, join Sailors with the reduction of Reserves • Total cost: 105S • Idea: trade the cost of computing and shipping the projection for the cost of shipping the full Reserves relation.
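A minimal sketch of the semijoin steps using Python sets; the Sailors and Reserves tuples are invented, and the page-count costs from the slide are not modeled here:

```python
# Semijoin sketch: reduce Reserves by the sids of rank-5 Sailors before shipping.
sailors  = [(22, "Dustin", 5), (31, "Lubber", 8), (58, "Rusty", 5)]   # (sid, name, rank) at London
reserves = [(22, 101), (31, 102), (58, 103), (64, 104)]               # (sid, bid) at Paris

# At London: select rank = 5, project onto the join column sid, ship the projection.
sids_to_ship = {sid for sid, _, rank in sailors if rank == 5}

# At Paris: "reduction of Reserves wrt Sailors" -- only matching tuples get shipped back.
reduction = [r for r in reserves if r[0] in sids_to_ship]

# At London: final join of the rank-5 Sailors with the reduction.
by_sid = {sid: (sid, name, rank) for sid, name, rank in sailors if rank == 5}
result = [by_sid[sid] + (bid,) for sid, bid in reduction]
print(result)     # [(22, 'Dustin', 5, 101), (58, 'Rusty', 5, 103)]
```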
Bloomjoin • At London • Create a 64K-bit vector V (one 8K page) • Select a hash function h from sids to [0, 64K-1] • For each sailor with rank = 5, set V[h(sid)] = true • Ship V to Paris. Cost: 1S • At Paris • If a reservation has V[h(sid)] = true, ship it to London • Cost: 100S • Perhaps a bit more if the hash function is not effective, but 64K is a pretty big bit vector • At London • Join Sailors with the reduced Reservations • Total cost: 101S.
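A minimal sketch of the bit-vector step, using a Python bytearray for V and zlib.crc32 as a stand-in hash (the tuples are invented; a real system would build and ship the vector as an 8K page):

```python
import zlib

BITS = 64 * 1024                     # 64K-bit vector = one 8K-byte page

def h(sid: int) -> int:
    """Stand-in hash from sids to [0, 64K-1]."""
    return zlib.crc32(str(sid).encode()) % BITS

sailors  = [(22, "Dustin", 5), (31, "Lubber", 8), (58, "Rusty", 5)]   # at London
reserves = [(22, 101), (31, 102), (58, 103), (64, 104)]               # at Paris

# At London: set a bit for every rank-5 sailor, then ship the vector (cost ~1 page).
V = bytearray(BITS // 8)
for sid, _, rank in sailors:
    if rank == 5:
        V[h(sid) // 8] |= 1 << (h(sid) % 8)

# At Paris: ship back only reservations whose bit is set (may include a few
# false positives if the hash collides, but never misses a match).
shipped = [r for r in reserves if V[h(r[0]) // 8] & (1 << (h(r[0]) % 8))]
print(shipped)                       # [(22, 101), (58, 103)] plus any false positives
```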
Distributed Query Optimization • Cost-based approach; consider plans, pick cheapest; similar to centralized optimization. • Difference 1: Communication costs must be considered. • Difference 2: Local site autonomy must be respected. • Difference 3: New distributed join methods. • Query site constructs global plan, with suggested local plans describing processing at each site. • If a site can improve suggested local plan, free to do so.
Updating Distributed Data • Synchronous Replication: All copies of a modified relation (fragment) must be updated before the modifying Xact commits. • Data distribution is made transparent to users. • Asynchronous Replication: Copies of a modified relation are only periodically updated; different copies may get out of synch in the meantime. • Users must be aware of data distribution.
Synchronous Replication • Read-any Write-all: Writes are slower and reads are faster, relative to Voting. • Most common approach to synchronous replication. • Voting: Xact must write a majority of copies to modify an object; must read enough copies to be sure of seeing at least one most recent copy. • E.g., 10 copies; 7 written for update; 4 copies read. • Each copy has version number. • Not attractive usually because reads are common. • Choice of technique determines which locks to set.
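The 10-copy example satisfies the usual quorum conditions; a minimal sketch of the check (the function name and the explicit rule that two write quorums must also overlap are standard quorum assumptions, stated here for completeness):

```python
def quorums_ok(n_copies: int, write_quorum: int, read_quorum: int) -> bool:
    """Any read must overlap any write (so a reader sees at least one most recent copy),
    and any two writes must overlap (so version numbers order the updates)."""
    return (read_quorum + write_quorum > n_copies) and (2 * write_quorum > n_copies)

print(quorums_ok(10, 7, 4))    # True  -- the slide's voting example
print(quorums_ok(10, 10, 1))   # True  -- read-any write-all as the special case
print(quorums_ok(10, 6, 4))    # False -- a read could miss the latest write
```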
Cost of Synchronous Replication • Before an update Xact can commit, it must obtain locks on all modified copies. • Sends lock requests to remote sites, and while waiting for the response, holds on to other locks! • If sites or links fail, Xact cannot commit until they are back up. • Even if there is no failure, committing must follow an expensive commit protocol with many msgs. • So the alternative of asynchronous replication is becoming widely used.
Asynchronous Replication • Allows a modifying Xact to commit before all copies have been changed (and readers nonetheless look at just one copy) • Users must be aware of which copy they are reading, and that copies may be out of sync for short periods of time • Two approaches: Primary Site and Peer-to-Peer replication • The difference lies in how many copies are "updatable" or "master copies"
Peer-to-Peer Replication • More than one of the copies of an object can be a master in this approach. • Changes to a master copy must be propagated to other copies somehow. • If two master copies are changed in a conflicting manner, this must be resolved. (e.g., Site 1: Joe’s age changed to 35; Site 2: to 36) • Best used when conflicts do not arise: • E.g., Each master site owns a disjoint fragment.
Primary Site Replication • Exactly one copy of a relation is designated the primary or master copy. Replicas at other sites cannot be directly updated. • The primary copy is published. • Other sites subscribe to (fragments of) this relation; these are secondary copies. • Main issue: How are changes to the primary copy propagated to the secondary copies? • Done in two steps. First, capture changes made by committed Xacts; then apply these changes.
Implementing the Capture Step • Log-Based Capture: The log (kept for recovery) is used to generate a Change Data Table (CDT). • If this is done when the log tail is written to disk, must somehow remove changes due to subsequently aborted Xacts. • Procedural Capture: A procedure that is automatically invoked does the capture; typically, just takes a snapshot. • Log-Based Capture is better (cheaper, faster) but relies on proprietary log details.
Implementing the Apply Step • The Apply process at the secondary site periodically obtains (a snapshot or) changes to the CDT table from the primary site, and updates the copy. • Period can be timer-based or user/application defined. • Replica can be a view over the modified relation! • If so, the replication consists of incrementally updating the materialized view as the relation changes. • Log-Based Capture plus continuous Apply minimizes delay in propagating changes. • Procedural Capture plus application-driven Apply is the most flexible way to process changes.
Distributed Locking • How do we manage locks for objects across many sites? • Centralized: One site does all locking. • Vulnerable to single site failure. • Primary Copy: All locking for an object done at the primary copy site for this object. • Reading requires access to locking site as well as site where the object (copy/fragment) is stored. • Fully Distributed: Locking for a copy is done at the site where the copy/fragment is stored. • Locks at all or many sites while writing an object.
Distributed Deadlock Detection • Each site maintains a local waits-for graph • A global deadlock might exist even if the local graphs contain no cycles (figure: at SITE A, T1 waits for T2; at SITE B, T2 waits for T1; the GLOBAL graph has the cycle T1 → T2 → T1) • Three solutions: Centralized (send all local graphs to one site); Hierarchical (organize sites into a hierarchy and send local graphs to the parent in the hierarchy); Timeout (abort an Xact if it waits too long).
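A minimal sketch of centralized detection: union the local waits-for graphs and run an ordinary cycle check (graphs are plain dicts; the site contents mirror the figure described above):

```python
def has_cycle(graph):
    """Standard DFS cycle detection on a waits-for graph {txn: [txns it waits for]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in graph}

    def visit(t):
        color[t] = GREY
        for u in graph.get(t, []):
            if color.get(u, WHITE) == GREY:
                return True
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(graph))

# Local graphs as on the slide: neither has a cycle on its own.
site_a = {"T1": ["T2"], "T2": []}
site_b = {"T2": ["T1"], "T1": []}

# Centralized detection: union the edges into one global graph.
global_graph = {}
for local in (site_a, site_b):
    for t, waits in local.items():
        global_graph.setdefault(t, []).extend(waits)

print(has_cycle(site_a), has_cycle(site_b))   # False False
print(has_cycle(global_graph))                # True  -- T1 -> T2 -> T1
```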
Distributed Transactions • Problem: Atomicity and Durability in DDBMSs • Example: • A+=100 at Paris, B-=100 at London • Each is called a subtransaction. • The query originates at a third site called the coordinator. • Who keeps the log? Each site. • How does each subtransaction know when to commit?
2-Phase Commit (2PC) Protocol • The coordinator sends a prepare message to each subtransaction • Each subtransaction responds with yes or no • Yes means it has force-written a prepare log record and is ready to commit or abort as the coordinator decides • The coordinator sends commit or abort to all subtransactions • Subtransactions send ack messages to the coordinator
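A minimal sketch of the message flow (log records are simulated as strings appended to a list; a real implementation force-writes them to stable storage at each step):

```python
class Subordinate:
    """A subtransaction site.  'ready' stands for being able to force-write a prepare record."""
    def __init__(self, name, ready=True):
        self.name, self.ready, self.log = name, ready, []

    def prepare(self):                     # phase 1: vote yes or no
        self.log.append("prepare" if self.ready else "abort")
        return "yes" if self.ready else "no"

    def decide(self, decision):            # phase 2: obey the coordinator, then ack
        self.log.append(decision)
        return "ack"

def two_phase_commit(subs):
    votes = [s.prepare() for s in subs]            # coordinator sends prepare, collects votes
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    acks = [s.decide(decision) for s in subs]      # coordinator sends the decision to everyone
    assert all(a == "ack" for a in acks)           # coordinator can forget the Xact after all acks
    return decision

print(two_phase_commit([Subordinate("Paris"), Subordinate("London")]))               # commit
print(two_phase_commit([Subordinate("Paris"), Subordinate("London", ready=False)]))  # abort
```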
2PC Issues • What if • only one subtransaction says no? • All subtransactions say yes, then one cannot commit? • All subtransactions say yes, then one crashes?
Learning Objectives • Definitions: parallel and distributed DBMSs, scalability, 3 parallel architectures, pipeline and partition parallelism, 3 types of data partitioning and their uses • How to parallelize operators and determine their scalability: scan, select, sort, aggregate, hash join • Definition of distributed data independence and acidity, network architectures, availability, fragmentation, replication • How is replication managed? • How are distributed queries processed? • How is distributed data updated? • How are distributed locks and deadlocks managed? • How are distributed transactions managed?