Proactive Re-optimization

Proactive Re-optimization Research by Shivnath Babu, Pedro Bizarro, David DeWitt Presented by Pedro Bizarro

Roadmap • Introduction • Re-Optimization (and its Limitations) • Proactive Re-Optimization • Bounding Boxes • Robust and Switchable Plans • Random Samples • Experimental Results • Conclusions

What is the Problem? Sometimes database optimizers choose query plans that are sub-optimal by orders of magnitude

How the Problem Arises (1) • Optimizer relies on statistics to cost plans However: • some statistics are not up-to-date and • most are not even in the catalog • Missing statistics are estimated using • other (possibly estimated) statistics • assumptions (independency, uniformity, etc) • default values Errors in Estimates

How the Problem Arises (2) • Well-know fact: errors in estimates for statistics in intermediate expressions grow exponentially • Also, cost functions are not smooth • Small errors in estimate may be big errors in cost • E.g.: Cost Memory Small errors become big errors

Re-optimization Overview[KD98, M+04] • Compute plan as normal • Add check operators to plan to: • check when plan becomes sub-optimal • check for significant discrepancies between estimated and observed values • Execute and react approach • Trigger re-optimization when checks fail • Else, proceed

Traditional optimizer chooses plan P Re-optimizer chooses same plan P and adds checks INLJ INLJ CHECK T T INLJ INLJ CHECK R S R S Choosing Plans:Optimization vs. Re-Optimization

Re-optimization’sMain Limitation • Optimizer picks plan unaware of possible future re-optimizations • I.e., optimization assumes no re-optimization • What can go wrong? Can we do better?

Example: Query: σ(R) S Plan P1: σ(R) HHJ S Plan P2: S HHJσ(R) P1 Cost P2 Size of σ(R) Estimated size of σ(R) Re-optimization Limitations • Re-optimization is expensive(could avoid it by using robust plans) • May lose partial work • If start on P1 and re-optimize to P2, will repeat scan on R P1 is risky! P2 is robust.

Re-optimize Re-optimize Re-optimize INLJ HHJ HHJ HHJ T T T σ(R) INLJ HHJ HHJ HHJ σ(R) σ(R) S S S S σ(R) T Re-optimization Limitations (2) • Limited information collected at run-time • Only detects when to re-optimize • E.g., detects that some selectivity > 5% • But, future re-optimization needs more information • E.g., what is the true value of selectivity? • May “thrash” without that knowledge

Proactive Re-Optimization in a Nutshell If DBMS knows re-optimization can happen: It can (proactively) pick plans that are better than picking a plan and then (reactively) re-optimizing if something goes wrong. It should try to avoid it It can (proactively) collect statistics needed for future re-optimization steps for the same query. It should plan for it

Building Blocks ofProactive Re-optimization • Use of bounding boxes (intervals around estimates) • Use of robust plans and switchable plans • Enhanced run-time statistics collection

1. Compute bounding boxes for estimates No, reoptimize Estimate within the bounding box? Yes, use robust or switchable plan 3. Execute query; Collectaccurate statistics estimates Run-time estimates Proactive Re-optimization Architecture Optimization QUERY 2. Use bounding boxes to pick robust or switchable plans CATALOG Execution

Bounding Boxes:Representing Uncertainty • Interval around estimate is: • wide if optimizer uncertain about estimate • narrow if optimizer certain about estimate • How is uncertainty measured? • From the way the statistic is estimated, e.g.: • Histogram -> very certain • Multiplication of selectivities -> uncertain • Default guess -> very uncertain • Etc.

Query: σ(R) S Bounding Boxes:Representing Uncertainty • Interval around estimate is: • wide if optimizer uncertain about estimate • narrow if optimizer certain about estimate |S| Bounding box high est. low |σ(R)| low estimated high

Bounding Boxes:Plan Costing and Pruning • Costing - Computes three costs per each plan tree: (2-dim bounding box using cardinality estimates from sub-plans) • Pruning • For each join subset and interesting order: • Finds 3 plans: BestLow, BestEst, and BestHigh • The plans with lowest CLow, CEst, and CHigh, respectively • All others are pruned Cost? |S| CHigh, cost here CEst, cost here CLow, cost here HHJ high est. σ(R) low S |σ(R)| low estimated high

At the end of plan enumeration there are three seed plans: Four cases: The seeds are the same plan One of the seeds is robust A switchable plan can be created from them No single plan, not robust, not switchable P1 Query: σ(R) S Plan P1: σ(R) HHJ S Plan P2: S HHJσ(R) Cost P2 Size of σ(R) Bounding box for σ(R) estimate low estimated high cost close to optimal in all 3 points of bounding box Selecting Plans BestEst=P1 BestHigh=P2 BestLow=P1

Not really switchable, but late binding decision Switchable Plans • Switchable Plan Goals: • Pick plan only after uncertainty resolved • Do no lose or repeat work • Plans are switchable if: • Have different root operator • Have the same sub-plan for deep input of root • Have the same base table as other input E.g.: INLJ Hash2 Hash3 Index Seek on T Scan T Scan T Hash1 Hash1 Hash1 Scan R Scan S Scan R Scan S Scan R Scan S

switch operator INLJ, Index seek on T ? Hash2, Scan T buffer operator Hash3, Scan T Implementation ofa Switchable Plan • Buffer tuples until a tuple random sample is obtained • Compute estimate and pass it up to switch operator • Switch operator instantiates correct operator • No work lost or repeated Switchable Plan T Hash1 Scan R Scan S

Observing Statistics at Run-Time • Goal 1: Detect when to re-optimize • Detect when outside bounding box • Goal 2: Must be cheap • To minimize impact on execution • Goal 3: Must be quick • To avoid spending time in wrong plan • Goal 4: Must be accurate • To avoid re-optimization thrashing

b c d f g i j Emit eos(30%) punctuation to parent operator The Idea:Random Sample Prefix • Prefix output of operators with random sample of entire output Normal output without random prefix a a b c d e e f g h h i j Output with random prefix • When eos punctuation arrives system can: • Compute estimate based on random sample • Determine if bounding box violation happened • Use revised estimate in future re-optimization

ImplementingRandom Sample Prefixes • Assume random samples of base relations • For each relation R, there is an R_sample • R_sample is a random sample of R • Modified scan operator: • scan R_sample • emit eos • scan R skipping tuples in R_sample • Similarly for indexed scan

NLJ Output R.S tuples and EOS … … Probe side Build side R sample S sample Memory: rest of R rest of S Disk: ImplementingRandom Sample Prefixes - Joins • Modified BNLJ and INLJ operators • Pass eos from outer relation • Ignore eos from inner relation • True random sample of join if outer is FK side • Hash join: eos eos Then, do second pass as in normal hash Read S into memory until EOS of S Read rest of S, partition, probe R, output tuples Read all of R, partition, probe S sample R’ S sample S’ R sample S sample rest of R rest of S R’’ S’’

Traditional Re-Optimizer Proactive Re-Opt Switch HHJ HHJ σ(O) σ(O) σ(O) Switch HHJ HHJ σ(A) C σ(A) C σ(A) C Re-optimizes Re-optimizes Sub-optimal HHJ HHJ Optimal σ(O) HHJ HHJ σ(A) Sub-optimal σ(A) C C σ(O) Wrap-up Example: σ(A)*C*σ(O) Assume error in Estimate σ(A)!

Experimental Evaluation Setup • Optimizers used: • TRAD – Traditional optimizer • VRO – Reactive re-optimizer • Rio – Our proactive re-optimizer • Built within Predator DBMS • DMV Dataset: Adapted from real dataset • Contains correlations between attributes Our implementation of [MR+04]

Using Robust Plans: Join σ(A)*CError in |σ(A)| estimate, 150MB per hash join

Using Switchable Plans: Join σ(A)*CError in |σ(A)| estimate, 50MB per hash join

3-way join: σ(A)*C*σ(O)Error in |σ(A)| estimate, 50MB per hash join

Query ComplexityErrors due to correlated attributesVRO-R is VRO optimizer plus random sampling

Conclusions • “Look around” with bounding boxes • Use robust plans • Use late binding switchable plans • Either avoid re-optimization • Or re-optimize quickly and with accurate estimates • Use tuple random samples • Overhead small: optimization time <1.7%

Acknowledgements… Jennifer Widom for discussion and feedback Guy Lohman and Volker Markl for providing DMV data and workload generator … and References [KD98] Kabra and DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. SIGMOD’98 [M+04] Markl, et al. Robust Query Processing through Progressive Optimization. SIGMOD'04.

Thank you! Questions? Feedback? Rio is a Pro-Active Re-Optimizer developed by Shivnath Babu, Pedro Bizarro, and David DeWitt

Proactive Re-optimization