240 likes | 430 Views
Query Optimization Strategies. Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 22, 2008. Administrivia. Thursday: Some initial suggestions for the project proposal Scheduling of the deadline for your midterm report The next assignment:
E N D
Query Optimization Strategies Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 22, 2008
Administrivia • Thursday: • Some initial suggestions for the project proposal • Scheduling of the deadline for your midterm report The next assignment: • Read Gray et al. (granularity of locks) and Kung and Robinson (optimistic concurrency control) papers • Review Kung and Robinson by next Tuesday • Consider how optimistic CC is or isn’t useful in Web-scale data management
Query Optimization • Challenge: pick the query execution plan that has minimum cost • Sources of cost: • Interactions with other work • Size of intermediate results • Choices of algorithms, access methods • Mismatch between I/O, CPU rates • Data properties – skew, order, placement
The General Model of Optimization Given an AST of a query: • Build a logical query plan (Tree of query algebraic operations) • Transform into “better” logical plan • Convert into a physical query plan (Includes strategies for executing operations)
Strategies Basically, search, over the space of possible plans • At least of exponential complexity in the number of operators • Hence, exhaustive search is generally not feasible What can you do? • Heuristics only – INGRES, Oracle until the mid-90s • Randomized, simulated annealing, … – many efforts in the mid-90s • Heuristics plus cost-based join enumeration – System-R • Stratified search: heuristics plus cost-based enumeration of joins and a few other operators – Starburst optimizers • Unified search: full cost-based search – EXODUS, Volcano, Cascades optimizers
What Are the Cost Estimating Factors? • Some notion of CPU, disk speeds, page sizes, buffer sizes, … • Cost model for every operator • Some information about tables and data • Sizes • Cardinalities • Number of unique values (from index) • Histograms, sketches, …
The System-R Approach: Heuristics with Cost-Based Join Enumeration • Make the following assumptions: • All predicates are evaluated as early as possible • All data is projected away as early as possible • Separately consider operations that produce intermediate state or are blocking: • Joins • Aggregation • Sorting • Correlation with a subquery (join, exists, …) • By choosing a join ordering, we’re automatically choosing where selections and projections are pushed – why is this so?
System-R Architecture • Breaks a query into its blocks, separately optimizes them • Nested loops join between them, if necessary • Within a block: focuses on joins (only a few kinds) in dynamic programming enumeration • Principle of optimality: best k-way join includes best (k-1)-way join • Use simple table statistics when available, based on indices; “magic numbers” where unavailable • Heuristics • Push “sargable” selects, projects as low as possible • Cartesian products only after joins • Left-linear trees only: n2n-1 cost-estimation operations • Grouping last • Extra “interesting orders” dimension Grouping, ordering, join attributes
Example • Schema: R(a,b), S(b,c), T(a,c), U(c,d) • SELECT d, AVG(c)FROM R,S,T,UWHERE R.a=S.b AND S.c=T.c AND R.a=T.a AND T.c=U.cGROUP BY dORDER BY d • In relational algebra: γd/AVG(c)(Πd(σa < 2 (R ⋈ S ⋈ T ⋈ U)))
Why We Need More than System-R • Cross-query-block optimizations • e.g., push a selection predicate from one block to another • Better statistics • More general kinds of optimizations • Optimization of aggregation operations with joins • Different cost and data models, e.g., OO, XML • Additional joins, e.g., “containment joins” • Can we build an extensible architecture for this? • Logical, physical, and logical-to-physical transformations • Enforcers • Alternative search strategies • Left-deep plans aren’t always optimal • Perhaps we can prune more efficiently
Optimizer Generators • Idea behind EXODUS and Starburst:Build a programming language for writing custom optimizers! • Rule-based or transformational language • Describes, using declarative template and conditions, an equivalent expression • EXODUS: compile an optimizer based on a set of rules • Starburst: run a set of rules in the interpreter; a client could customize the rules
Starburst Query Optimizer • Corona query processor vs. Core engine – follows RDS and RSS separation • Part of a very ambitious project • Hydrogen language highly orthogonal (unlike SQL), and supported many fancy OO concepts (e.g., inheritance, methods), recursion special constraints, etc. – much more powerful than SQL at the time • Some portions of Hydrogen and Corona made their way into DB2, later SQL standards • Two stages (stratified search): • Query rewrite/query graph model – a SQL-block-level, relational calculus-like representation of queries • Plan optimization – a System-R-style dynamic programming phase once query rewrite has completed
Starburst QGM Tries to encode relational calculus-like concepts: • Predicates between variables within each SQL block body • Variables can be • Distinguished (“set-builder” / “for-each” / “F”) • Existential (“∃”) • Universal (“∀”) • Returned values from each block (head) • Predicates across blocks
Starburst QGM Example SELECT partno, price, order_qty FROM quotations Q1 WHERE Q1.partno IN (SELECT partnoFROM inventory Q3WHERE Q3.onhand_qty < Q1.order_qtyAND Q3.type = ‘cpu’
Starburst Query Rewrite Focus: inter-block optimizations • Pushing predicates across views, pushing projections down • Magic sets rewritings • Simplification, transitivity Implemented through production rules • Condition – action rules selected by: • Sequence • Priority • Probability distribution • Search may stop after a “budget” • End product: logical plan(s), chosen via above constraints • Normally: set is singleton • Can also CHOOSE among set by invoking the cost-based plan optimizer
Query Rewrite Example • Convert subquery to join: IF OP1.type = Select Æ Q2.type = ‘9’ Æ (at each evaluation of the existential predicate at most one tuple of T2 satisfies the predicate) THEN Q2.type = ‘F’ • Merge operations: IF OP1.type = Select Æ OP2.type = Select Æ Q2.type = ‘F’ Æ (NOT (T1.distinct = false Æ OP2.eliminate_duplicate = true)) THEN merge OP2 into OP1; IF OP2.eliminate_duplicateTHEN OP1.eliminate_duplicate = true
Query Rewrite Example • Convert subquery to join: IF OP1.type = Select Æ Q2.type = ‘9’ Æ (at each evaluation of the existential predicate at most one tuple of T2 satisfies the predicate) THEN Q2.type = ‘F’ • Merge operations: IF OP1.type = Select Æ OP2.type = Select Æ Q2.type = ‘F’ Æ (NOT (T1.distinct = false Æ OP2.eliminate_duplicate = true)) THEN merge OP2 into OP1; IF OP2.eliminate_duplicateTHEN OP1.eliminate_duplicate = true
Starburst Plan Optimization • Separately optimizes each QGM operation (box) • Grammar of STrategy AlteRnatives (STARs) • Take high-level operations and turn them into LOw-LEvel Plan OPerations (LOLEPOPs) • JOIN, UNION, SCAN, SORT, SHIP, … • Tables and plans have properties • Relational (tables joined, columns accessed, predicates applied) • Operational (ordering, site) • Estimated (cost, cardinality) • GLUE operators: SORT, SHIP • Join enumerator tries alternative join sequences a la System-R • Can produce bushy trees • Can have rank/priority with each STAR
Starburst Pros and Cons • Pro: • Stratified search generally works well in practice – DB2 UDB has perhaps the best query optimizer out there • Interesting model of separating calculus-level and algebra-level optimizations • Generally provides fast performance • Con: • Interpreted rules were too slow – and no database user ever customized the engine! • Difficult to assign priorities to transformations • Some QGM transformations that were tried were difficult to assess without running many cost-based optimizations • Rules got out of control
Volcano Optimizer Generator • Part of a “database toolkit” approach to building systems • A set of libraries and languages for building databases with custom data models and functionalities (Volcano) (gcc) (MyDB plan) (MyQL)
Full Transformational Optimizer Model: 3 Generations • Try to unify the notion of logical-logical transformations and logical-physical transformations • No stratification as in Starburst – everything is transformations • Challenge: efficient search – need a lot of pruning • First version, EXODUS: used many heuristics, something called a MESH • Volcano: branch-and-bound pruning, recursion + memoization • Cascades: Basically, a cleaner, more object-oriented version of the Volcano engine • According to rumor, the basis of SQL Server’s optimizer
Example Rules • Physical operators: %operator 2 join %method 2 hash_join loops_join cartesian_product • Logical-logical transformations: join (1,2) ->’ join(2,1) • Logical-physical transformations: join (1,2) by hash_join (1,2) • Can get quite hairy: join 7 (join 8 (1,2), 3) <-> join 8(1, join 7 (2,3)){{#ifdef FORWARDif (NOT cover_predicate (OPERATOR_7 oper_argument, INPUT_2 oper_property, INPUT_3 oper_property)) REJECT …
Volcano: Efficient Search • Needs to enumerate all possible transformations without repeating • A top-down, memoized engine • Depth-first search – allows branch-and-bound pruning • Go from logical physical plan as fast as possible, then try alternative plans • FindBestPlan takes logical expression, physical properties, cost bound: • If already computed, return • Else compute set of possible “moves”: • Logical-logical rule • Compliant logical-physical rule • Enforcer • Insert logical expression into lookup table • Insert physical op, plan into separate lookup table • Return best plan and cost • Generic notions of properties and enforcers (e.g., location, SHIP), cost (an ADT)
Optimization Evaluation • So, which is best? • Heuristics plus join-based enumeration (System-R) • Stratified, calculus-then-algebraic (Starburst) • Con: QGM transformations are almost always heuristics-based • Pro: very succinct transformations at QGM level • Unified algebraic (Volcano/Cascades) • Con: many more rules need to be applied to get effect of QGM rewrites • Pro: unified, fully cost-based model