290 likes | 387 Views
Selectivity-Based Partitioning. Alkis Polyzotis UC Santa Cruz. Query Optimization. Integral component of declarative query processing Key problem: join ordering Most important (and most complex!) module of a DBMS. R 1 R 2 R 3 R 4. Parser. R 1 R 2 R 3 R 4. Optimizer.
E N D
Selectivity-Based Partitioning Alkis Polyzotis UC Santa Cruz
Query Optimization • Integral component of declarative query processing • Key problem: join ordering • Most important (and most complex!) module of a DBMS R1 R2 R3 R4 Parser R1 R2 R3 R4 Optimizer ( (R2 R3)R1) R4 Execution Engine
“Monolithic” Query Optimization • Output: a single join order based on join selectivities between tables Plan:(P E) D
Partition-Based Query Optimization • Output: multiple join orders based on selectivities between fragments of tables Plan: ( (P D2) E ) ( (E D1) P )
Selectivity-Based Partitioning • Divide-and-Union paradigm • Optimization problem and analysis • Partitioning algorithm • Experimental results
Roadmap • Preliminaries • Problem Definition • Partitioning Algorithm • Optimal Splits • Iterative Partitioning • Experimental Results • Conclusions
Data and Query Model • Chain-join queries • Example: R1 R2 R3 R4 • Relations may have optional selections • Relation Frequency matrix • Left-deep evaluation plans • Example: R3 R2 R4 R1 R1 R4 R3 R2
Problem Definition • Given: query Q, maximum partition count N • Goal: find partitioning of Q in nN partitions that minimizes query cost • On-the-fly partitioning vs. Off-line partitioning • Difficult optimization problem! • Determine the pivot relation • Determine the number of partitions • Compute a partitioning of the pivot • Determine the orderings of partitioned plans R1R21 R4 R3 R1 R2 R3 R4 R3R22 R1 R4
Query Cost Function • One possibility: optimizer’s cost model • Accurate cost estimation • Solution depends on low-level system details • Difficult to gain intuitions • Our approach: query cost = number of intermediate results • Simple function that admits analysis • Sound connections to realistic cost models (Cluet and Moerkotte, ICDT’95) Cost(R3 R2 R4 R1 ) = |R3 R2| + |R3 R2 R4|
Roadmap • Preliminaries • Problem Definition • Partitioning Algorithm • Optimal Splits • Iterative Partitioning • Experimental Results • Conclusions
Partitioning Algorithm - Overview • State space: partitioned join orders • Partitioning algorithm: • Explore a set of states • Compute optimal partitioning for each state • Return global optimum • Our approach: order joins then partition • Another possibility: partition then order joins
Distributing Tuples • Goal: Distribute tuples to minimize cost • Optimal distribution depends on: • Frequency matrices of other relations • Position (m,l)
Optimal Split Theorem • Distribute each value (m,l) independently • Place (m,l) in partition that minimizes g(L,T,m,l)
Partitioning Algorithm - Overview • State space: partitioned join orders • Partitioning algorithm: • Explore a set of states • Compute optimal partitioning for each state • Return global optimum
Search Algorithm • Exhaustive search is impractical [ Pivot, Leading orders, Trailing orders ] • Search heuristics: • Tighter search space: [ Pivot, Optimal Leading orders ] • Iterative Partitioning • Guided search by using lower bounds on cost of partitions
Encoding of State Space • State: [ Pivot , Optimal leading orders ] • Transition: insert relation in a leading order
R21 R21 R22 R22 Iterative Partitioning • Key idea: (Partition, Optimize)+ • Compute optimal split for leading/trailing orders • Optimize trailing orders for the current split • Theorem: query cost can only decrease • Idea extended to more detailed cost models Leading Trailing R3 R5 R4 R3 R4 R5 R1 R2 R3 R4 R1 R5 R5 R1
Search Algorithm • Initial states: single-relation leading orders • Search process: • Compute partitions with IP • Open more states with transition function • Transitions are guided by lower bound on cost function • Same lower bound can also prune states • Stopping criteria: • Search space is exhausted • Time budget is exhausted
System Integration Monolithic Partition-based Parser Parser Partitioner Optimizer Optimizer Execution Engine Execution Engine
Roadmap • Preliminaries • Problem Definition • Partitioning Algorithm • Optimal Splits • Iterative Partitioning • Experimental Results • Conclusions
Effect of Skew Synthetic Data
Execution Time Synthetic Data (Skew=1.5)
Varying Time Budget Synthetic Data (Skew=1.5)
Results on Real-Life Data SwissProt
Conclusions • Monolithic optimization Missed opportunities • Selectivity-Based Partitioning • Divide & Union approach • Multiple join orders per query • Join selectivity between relation fragments • Partitioning Algorithm • Iterative Partitioning • Experimental Results • Significant reduction of intermediate results
Future Work • Extension to multiple pivots • Partition-then-order optimization • Efficient execution of partitioned plans • Off-line workload-aware partitioning
Partitioning Model • General case: Multi-relation partitioning • Our approach: Single-relation partitioning R1R21 R4 R3 R1 R2 R3 R4 R31R22 R1 R4 R1R22R32 R4