Proactive Re-optimization

Proactive Re-optimization Joint work withShivnath Babu and David DeWitt Pedro Bizarro

What is the Problem? Sometimes database query optimizers choose execution plans that aresub-optimal by orders of magnitude

How the Problem Arises (1) • Statistics may not be up-to-date • Statistics may be missing • Missing statistics are estimated based on • other (possibly estimated) statistics • assumptions (independency, uniformity, etc) • default values Errors in Estimates

How the Problem Arises (2) • Errors on estimated sizes of intermediate tables grow exponentially [IC91] • Cost functions are not smooth Cost Memory Small errors may become big errors [IC91] Ioannidis and Christodoulakis. On the Propagation of Errors in the Size of Join Results. SIGMOD’91

Roadmap • Introduction • Re-Optimization (and its Limitations) • Proactive Re-Optimization • Bounding Boxes • Robust and Switchable Plans • Random Samples • Experimental Results • Conclusions

Re-optimization: Current ApproachE.g., [KD98, I+99, M+04] • Use conventional optimizers • Add check operators to plan to: • check for significant discrepancies between estimated and observed values • check when plan becomes sub-optimal • Execute and react approach • Trigger re-optimization if check fails [KD98] Kabra and DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. SIGMOD’98 [I+99] Ives, et al. An Adaptive Query Execution System for Data Integration. SIGMOD’99. [MR+04]Markl, et al. Robust Query Processing through Progressive Optimization. SIGMOD'04

Traditional optimizer chooses plan P: Re-optimizer chooses same plan P and adds checks INLJ INLJ CHECK T T INLJ INLJ CHECK R S R S Re-optimization: Current Approach(contd.)

Re-optimization’sMain Limitation • Optimizer picks plan unaware of possible future re-optimizations • I.e., optimization assumes no re-optimization • What can go wrong? Can we do better?

P1 Example query: σ(R) S P1: P2: Cost HHJ HHJ P2 σ(R) S σ(R) S Size of σ(R) Estimated size of σ(R) Re-optimization Limitations • Re-optimizing is expensive(could avoid it by using robust plans) • May lose partial work • If start on P1 and re-optimize to P2, will repeat scan on R P1 is risky! P2 is robust.

Re-optimize Re-optimize Re-optimize INLJ HHJ HHJ HHJ T T T σ(R) INLJ HHJ HHJ HHJ σ(R) σ(R) S S S S σ(R) T Re-optimization Limitations (2) • Limited information collected at run-time • Check operators only detect when to re-optimize • E.g: Too long to find a good plan

Proactive Re-Optimization in a Nutshell If DBMS knows re-optimization may happen: Try to avoid it! Pick robust (and switchable) plans Collect statistics for future re-optimization Plan for it!

Use of bounding boxes Use of robust plans and switchable plans Enhanced run-time statistics collection Building Blocks ofProactive Re-optimization Intervals around estimates to represent uncertainty Close to optimal in bounding box Set of plans, each close to optimal in part of bounding box To detect sub-optimality faster and to avoid re-optimization thrashing

1. Compute bounding boxes for estimates No, reoptimize Estimate within the bounding box? Yes, use robust or switchable plan 3. Execute query; Collectaccurate statistics estimates Run-time estimates Proactive Re-optimization Architecture Optimization QUERY 2. Use bounding boxes to pick robust or switchable plans CATALOG Execution

Bounding Boxes:Representing Uncertainty • Interval around estimate is: • wide if optimizer uncertain about estimate • narrow if optimizer certain about estimate • Uncertainty is measured from the way the statistic is estimated [KD98], e.g.: • Histogram -> very certain • Multiplication of selectivities -> uncertain • Default guess -> very uncertain • Etc. [KD98] Kabra and DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. SIGMOD’98

Query: σ(R) S Bounding Boxes:Representing Uncertainty • Interval around estimate is: • wide if optimizer uncertain about estimate • narrow if optimizer certain about estimate Estimated|S| Bounding box high est. low Estimated|σ(R)| low estimated high

Bounding Boxes:Plan Costing and Pruning • Costing - Computes three costs per each plan tree: (2-dim bounding box using cardinality estimates from sub-plans) • Pruning • For each join subset and interesting order find 3 plans: • BestLow  the plan with lowest CLow • BestEst  the plan with lowest CEst • BestHigh  the plan with lowest CHigh Cost? |S| CHigh, cost here CEst, cost here CLow, cost here HHJ high est. σ(R) low S |σ(R)| low estimated high

At the end of plan enumeration there are three seed plans: Four cases: The seeds are the same plan One of the seeds is robust A switchable plan can be created from them No single plan, not robust, not switchable Query: σ(R) S P1: P2: P1 Cost HHJ HHJ σ(R) S σ(R) S P2 Size of σ(R) Bounding box for σ(R) estimate low estimated high cost within 20% of best in all 3 points of bounding box Selecting Plans BestEst=P1 BestHigh=P2 BestLow=P1

E.g.: INLJ Hash2 Hash3 Index Seek on T Scan T Scan T Hash1 Hash1 Hash1 Scan R Scan S Scan R Scan S Scan R Scan S Switchable Plans • Goal: Avoid re-optimization but still run the best plan in bounding box • How: Define switchable plans to allow late binding decision • Plans are switchable if: • Have a different root operator • Have the same sub-plan as one of the inputs to the root • Have the same base table as other input

late binding decision INLJ Hash2 Hash3 Index Seek on T Scan T Scan T Hash1 Hash1 Hash1 Scan R Scan S Scan R Scan S Scan R Scan S Switchable Plans • Execute (part of) the common sub-plan • Collect run-time estimates • Instantiate the best seed plan for those estimates INLJ Hash2 Hash3 Index Seek on T Scan T Scan T ? ? ?

switch operator INLJ, Index seek on T ? Hash2, Scan T buffer operator Hash3, Scan T Implementation ofa Switchable Plan • Buffer tuples until a tuple random sample is obtained • Compute estimate and pass it up to switch operator • Switch operator instantiates correct operator • Minimal overhead Switchable Plan T Hash1 Scan R Scan S

Observing Statistics at Run-Time Uses • Detect when to re-optimize • Pick candidate switchable plan Goals • Must be efficient • Must be quick • Must be accurate

b c d f g i j Emit eos(30%) punctuation to parent operator The Idea:Random Sample Prefix • Prefix output of operators with random sample of their entire output Normal output without random sample prefix a a b c d e e f g h h i j Output with random sample prefix • Propagate sample prefixes bottom up • Implemented for file scan, indexed scan, nested-loops joins, hash join

NLJ … … ImplementingRandom Sample Prefixes • Samples of tables computed ahead of time: • For each table R, there is another table R_sample • Modified scan operator: • scan R_sample • emit eos • scan R skipping tuples in R_sample • Modified nested-loops join operator: • Pass eos from outer relation • True random sample of join if outer is FK side • See paper for hash join eos eos

Experimental Evaluation • Built within Predator DBMS • Implemented three optimizers: • Rio, our Proactive re-optimizer • Reactive re-optimizer • Traditional dynamic programming optimizer • Synthetic version of DMV dataset from IBM • Correlated attributes • More details in the paper Our implementation of [MR+04] [MR+04]Markl, et al. Robust Query Processing through Progressive Optimization. SIGMOD'04

Using Robust PlansQuery σ(A) C

Using Switchable PlansQuery σ(A) C

3-way joinQueryσ(A) C σ(O)

Query ComplexityErrors due to correlated attributes

Conclusions • Ever increasing data, queries, and system • Statistics will be uncertain • Optimizer mistakes will happen • Promising approach: Proactive re-optimization • Bounding boxes • Robust and switchable plans • Quick, efficient, accurate run-time stats collection • Future work: improve individual components

Acknowledgements Jennifer Widom for discussion and feedback Guy Lohman and Volker Markl for providing DMV data and workload generator

Thank you! Questions? Feedback? Check out our demo!

Traditional Re-Optimizer Proactive Re-Opt Switch HHJ HHJ σ(O) σ(O) σ(O) Switch HHJ HHJ σ(A) C σ(A) C σ(A) C Re-optimizes Re-optimizes Sub-optimal HHJ HHJ Optimal σ(O) HHJ HHJ σ(A) Sub-optimal σ(A) C C σ(O) 3-way Join: σ(A) C σ(O) Assume error in Estimate σ(A)!

Proactive Re-optimization