Query Optimization Strategies

Query Optimization Strategies Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems September 22, 2008

Administrivia • Thursday: • Some initial suggestions for the project proposal • Scheduling of the deadline for your midterm report The next assignment: • Read Gray et al. (granularity of locks) and Kung and Robinson (optimistic concurrency control) papers • Review Kung and Robinson by next Tuesday • Consider how optimistic CC is or isn’t useful in Web-scale data management

Query Optimization • Challenge: pick the query execution plan that has minimum cost • Sources of cost: • Interactions with other work • Size of intermediate results • Choices of algorithms, access methods • Mismatch between I/O, CPU rates • Data properties – skew, order, placement

The General Model of Optimization Given an AST of a query: • Build a logical query plan (Tree of query algebraic operations) • Transform into “better” logical plan • Convert into a physical query plan (Includes strategies for executing operations)

Strategies Basically, search, over the space of possible plans • At least of exponential complexity in the number of operators • Hence, exhaustive search is generally not feasible What can you do? • Heuristics only – INGRES, Oracle until the mid-90s • Randomized, simulated annealing, … – many efforts in the mid-90s • Heuristics plus cost-based join enumeration – System-R • Stratified search: heuristics plus cost-based enumeration of joins and a few other operators – Starburst optimizers • Unified search: full cost-based search – EXODUS, Volcano, Cascades optimizers

What Are the Cost Estimating Factors? • Some notion of CPU, disk speeds, page sizes, buffer sizes, … • Cost model for every operator • Some information about tables and data • Sizes • Cardinalities • Number of unique values (from index) • Histograms, sketches, …

The System-R Approach: Heuristics with Cost-Based Join Enumeration • Make the following assumptions: • All predicates are evaluated as early as possible • All data is projected away as early as possible • Separately consider operations that produce intermediate state or are blocking: • Joins • Aggregation • Sorting • Correlation with a subquery (join, exists, …) • By choosing a join ordering, we’re automatically choosing where selections and projections are pushed – why is this so?

System-R Architecture • Breaks a query into its blocks, separately optimizes them • Nested loops join between them, if necessary • Within a block: focuses on joins (only a few kinds) in dynamic programming enumeration • Principle of optimality: best k-way join includes best (k-1)-way join • Use simple table statistics when available, based on indices; “magic numbers” where unavailable • Heuristics • Push “sargable” selects, projects as low as possible • Cartesian products only after joins • Left-linear trees only: n2n-1 cost-estimation operations • Grouping last • Extra “interesting orders” dimension Grouping, ordering, join attributes

Example • Schema: R(a,b), S(b,c), T(a,c), U(c,d) • SELECT d, AVG(c)FROM R,S,T,UWHERE R.a=S.b AND S.c=T.c AND R.a=T.a AND T.c=U.cGROUP BY dORDER BY d • In relational algebra: γd/AVG(c)(Πd(σa < 2 (R ⋈ S ⋈ T ⋈ U)))

Why We Need More than System-R • Cross-query-block optimizations • e.g., push a selection predicate from one block to another • Better statistics • More general kinds of optimizations • Optimization of aggregation operations with joins • Different cost and data models, e.g., OO, XML • Additional joins, e.g., “containment joins” • Can we build an extensible architecture for this? • Logical, physical, and logical-to-physical transformations • Enforcers • Alternative search strategies • Left-deep plans aren’t always optimal • Perhaps we can prune more efficiently

Optimizer Generators • Idea behind EXODUS and Starburst:Build a programming language for writing custom optimizers! • Rule-based or transformational language • Describes, using declarative template and conditions, an equivalent expression • EXODUS: compile an optimizer based on a set of rules • Starburst: run a set of rules in the interpreter; a client could customize the rules

Starburst Query Optimizer • Corona query processor vs. Core engine – follows RDS and RSS separation • Part of a very ambitious project • Hydrogen language highly orthogonal (unlike SQL), and supported many fancy OO concepts (e.g., inheritance, methods), recursion special constraints, etc. – much more powerful than SQL at the time • Some portions of Hydrogen and Corona made their way into DB2, later SQL standards • Two stages (stratified search): • Query rewrite/query graph model – a SQL-block-level, relational calculus-like representation of queries • Plan optimization – a System-R-style dynamic programming phase once query rewrite has completed

Starburst QGM Tries to encode relational calculus-like concepts: • Predicates between variables within each SQL block body • Variables can be • Distinguished (“set-builder” / “for-each” / “F”) • Existential (“∃”) • Universal (“∀”) • Returned values from each block (head) • Predicates across blocks

Starburst QGM Example SELECT partno, price, order_qty FROM quotations Q1 WHERE Q1.partno IN (SELECT partnoFROM inventory Q3WHERE Q3.onhand_qty < Q1.order_qtyAND Q3.type = ‘cpu’

Starburst Query Rewrite Focus: inter-block optimizations • Pushing predicates across views, pushing projections down • Magic sets rewritings • Simplification, transitivity Implemented through production rules • Condition – action rules selected by: • Sequence • Priority • Probability distribution • Search may stop after a “budget” • End product: logical plan(s), chosen via above constraints • Normally: set is singleton • Can also CHOOSE among set by invoking the cost-based plan optimizer

Query Rewrite Example • Convert subquery to join: IF OP1.type = Select Æ Q2.type = ‘9’ Æ (at each evaluation of the existential predicate at most one tuple of T2 satisfies the predicate) THEN Q2.type = ‘F’ • Merge operations: IF OP1.type = Select Æ OP2.type = Select Æ Q2.type = ‘F’ Æ (NOT (T1.distinct = false Æ OP2.eliminate_duplicate = true)) THEN merge OP2 into OP1; IF OP2.eliminate_duplicateTHEN OP1.eliminate_duplicate = true

Starburst Plan Optimization • Separately optimizes each QGM operation (box) • Grammar of STrategy AlteRnatives (STARs) • Take high-level operations and turn them into LOw-LEvel Plan OPerations (LOLEPOPs) • JOIN, UNION, SCAN, SORT, SHIP, … • Tables and plans have properties • Relational (tables joined, columns accessed, predicates applied) • Operational (ordering, site) • Estimated (cost, cardinality) • GLUE operators: SORT, SHIP • Join enumerator tries alternative join sequences a la System-R • Can produce bushy trees • Can have rank/priority with each STAR

Starburst Pros and Cons • Pro: • Stratified search generally works well in practice – DB2 UDB has perhaps the best query optimizer out there • Interesting model of separating calculus-level and algebra-level optimizations • Generally provides fast performance • Con: • Interpreted rules were too slow – and no database user ever customized the engine! • Difficult to assign priorities to transformations • Some QGM transformations that were tried were difficult to assess without running many cost-based optimizations • Rules got out of control

Volcano Optimizer Generator • Part of a “database toolkit” approach to building systems • A set of libraries and languages for building databases with custom data models and functionalities (Volcano) (gcc) (MyDB plan) (MyQL)

Full Transformational Optimizer Model: 3 Generations • Try to unify the notion of logical-logical transformations and logical-physical transformations • No stratification as in Starburst – everything is transformations • Challenge: efficient search – need a lot of pruning • First version, EXODUS: used many heuristics, something called a MESH • Volcano: branch-and-bound pruning, recursion + memoization • Cascades: Basically, a cleaner, more object-oriented version of the Volcano engine • According to rumor, the basis of SQL Server’s optimizer

Example Rules • Physical operators: %operator 2 join %method 2 hash_join loops_join cartesian_product • Logical-logical transformations: join (1,2) ->’ join(2,1) • Logical-physical transformations: join (1,2) by hash_join (1,2) • Can get quite hairy: join 7 (join 8 (1,2), 3) <-> join 8(1, join 7 (2,3)){{#ifdef FORWARDif (NOT cover_predicate (OPERATOR_7 oper_argument, INPUT_2 oper_property, INPUT_3 oper_property)) REJECT …

Volcano: Efficient Search • Needs to enumerate all possible transformations without repeating • A top-down, memoized engine • Depth-first search – allows branch-and-bound pruning • Go from logical  physical plan as fast as possible, then try alternative plans • FindBestPlan takes logical expression, physical properties, cost bound: • If already computed, return • Else compute set of possible “moves”: • Logical-logical rule • Compliant logical-physical rule • Enforcer • Insert logical expression into lookup table • Insert physical op, plan into separate lookup table • Return best plan and cost • Generic notions of properties and enforcers (e.g., location, SHIP), cost (an ADT)

Optimization Evaluation • So, which is best? • Heuristics plus join-based enumeration (System-R) • Stratified, calculus-then-algebraic (Starburst) • Con: QGM transformations are almost always heuristics-based • Pro: very succinct transformations at QGM level • Unified algebraic (Volcano/Cascades) • Con: many more rules need to be applied to get effect of QGM rewrites • Pro: unified, fully cost-based model

Query Optimization Strategies

Query Optimization Strategies

Presentation Transcript

Query Optimization

Query Optimization Strategies

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization