1 / 35

Robust Query Processing through Progressive Optimization

Robust Query Processing through Progressive Optimization. Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic. SIGMOD 2004. Modified by S. Sudarshan from talk by Raja Agrawal. Motivation.

Download Presentation

Robust Query Processing through Progressive Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Robust Query Processing through Progressive Optimization Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic SIGMOD 2004 Modified by S. Sudarshan from talk by Raja Agrawal

  2. Motivation • Current optimizers depend heavily upon the cardinality estimations • What if there errors in those estimations? • Errors can occur due to … • Inaccurate statistics • Invalid assumptions (e.g. attribute independence)

  3. Progressive Query Optimization • Idea: lazily trigger reoptimization during execution if cardinality counts indicate current plan is suboptimal • introduces checkpoint (CHECK) operator to compare actual vs estimated cardinality • key idea: precompute cardinality ranges for which plan is optimal

  4. Evaluating a re-optimization scheme • Risk Vs Opportunity • Risk: • Extent to which re-optimization is not worthwhile • reoptimization chooses another bad plan • work redone • cardinality errors may even cancel, and fixing one may give an even worse plan! • Opportunity: • Refers to the aggressiveness • more CHECK operators..

  5. Background Risk • Redbrick • Star schema with fact table and multiple dimension tables • First apply selections on dimension tables • Then decide what plan to use • Kabra & DeWitt 98 (KD98) • Introduced idea of mid-query reoptimization • Allow partial results to be use like materialized views • But ad-hoc cardinality threshold, and only reoptimize fully materialized plans Opportunity

  6. Background Risk • Tukwila data integration system • optimizer may have no idea of statistics • interleave optimization and query execution • partial query plans • Fragment: fully pipelined tree with doubly pipelined hash join • Query Scrambling • reorder query to deal with delayed sources Opportunity

  7. Background Risk • Eddies (Telegraph) • Ingres/DEC Rdb: run multiple access methods competitively then choose • Parametric Query Optimization (PQO) • e.g. Cole and Graefe 94, Hulgeri and Sudarshan 02 • Choose from a set of plans, each optimal for selectivity range • POP: converse: find optimal cardinality range for a give plan Opportunity

  8. Example of Progressive Optimization in Action

  9. Progressive Query Optimization(POP)

  10. Architecture of POP

  11. Architecture of POP • CHECK operator to find if a plan is suboptimal • At optimization time, find out cardinality range (at CHECK location) for which plan is optimal • At run time, ensure cardinality within [l,u] • If violated, stop plan execution and reoptimize • Location of CHECKs • Re-optimize • taking observed cardinality into account, and • exploiting intermediate results where beneficial • Heuristic: limit number of reoptimizations (default: 3)

  12. Validity Ranges • Consider a plan edge e that flows rows into operator o, • let P be the subplan rooted at o. • The validity rangefor e is an upper and lower bound on the number of rows flowing through e, such that if the range is violated at runtime, we can guarantee P is suboptimal • Ad-hoc thresholds (proposed earlier) are a bad idea • E.g. even a 100x error on very small relation may not make a difference in optimal plan

  13. Finding Optimality Ranges • Plan Popt with root operator oopt is being compared with another plan Palt different only in the root operator oalt.

  14. Finding Optimality Ranges • Need to solve • cost(Palt , c) – cost(Popt , c) = 0 • where c is the cardinality on edge e • Cost functions can be complex/non-linear/non-continuous

  15. Newton-Raphson Iteration

  16. What does this achieve? • Detects suboptimality of the root operator where Popt and Palt share the same input edges. • Validity range might miss a cross-over point with a plan that uses a different join order (and hence has different input edges). • Two plans are structurally equivalent if they share the same set of edges • where an edge is defined by the set of rows flowing through it during query execution. • Allows different algorithms, and flipping inner/outer

  17. Optimality wrt structurally equivalent plans • Theorem: …. Suppose edges edges ei1 , ei2 , … , eik are seen to be “erroneous” wrt cardinality. Then the following statements are equivalent: • P is suboptimal with respect to another plan P' that has the same set of edges {e1 , e2 , … , em} • At least one of Pi1 , Pi2 , … , Pik is suboptimal given the cardinality errors in those edges in {e1 , e2 , … , em } that lie under them. • At least one of oi1 , oi2 , … , oik is a suboptimal operator given the cardinality errors in {e1 , e2 , … , em} that are in its input edges.

  18. Conservative detection of suboptimality • Suppose we “detect” suboptimality of (R Join S) Join T wrt estimated costs of (R Join T) Join S • During run time, we can never observe the cardinality of R Join T • We would be making an arbitrary guess as to the correlation of the predicates on the R and T tables • Best not to infer suboptimality wrt such estimates • However, reoptimization may result in a different join order

  19. Exploiting Intermediate Results • All the intermediate results are stored as temporary MVs • with cardinalities available to the optimizer • can be reused if it leads to a better plan • but not necessarily used, e.g. if join result is very large, and a different join order is preferred • must be reused if it has performed side-effects • Reoptimization done as part of same transaction

  20. Optional use of MV

  21. Variants of CHECK • Variants applicable in different cases, trade off risk for opportunity • Variants • Lazy checking • Lazy checking with eager materialization • Eager checking without compensation • Eager checking with buffering • Eager checking with deferred compensation

  22. Lazy Checking • Adding CHECKs above a materialization point (SORT, TEMP etc) • No results have been output yet • And materialized results can be re-used • very low overhead

  23. Lazy checking with eager materialization • Can insert materialization point if it does not exists already • Risk: overhead of materialization • Typically done only for outer input of indexed nested-loop join • low cost if outer is small (as estimated by optimizer) • and INL is in trouble anyway if outer is large

  24. Eager Checking • Lazy checking may be too late • e.g. if very bad join order chosen, with huge intermediate results • Idea: check even before entire result is materialized, and stop early • Problem: what if some results have already been output? • Compensation

  25. Eager Checking • EC without Compensation: • CHECK is pushed down the materialization point, into pipeline

  26. Eager Checking • EC with buffering • CHECK and buffer • output from buffer once sure about bound • e.g. [0,b), or [b,infinity] • else reoptimize • “delayed pipelining”

  27. EC with Deferred Compensation • Only SPJ queries • Identifier of all rows returned to the user are stored in a table S, which is used later in the new plan for anti-join with the new-result stream

  28. CHECK Placement

  29. CHECK Placement • LCEM and ECB – outer side of nested-loop join • LC – above materialization points • ECWC and ECDC – anywhere • Do not place CHECKs if • no alternative plan above CHECK • simple queries with low estimated cost

  30. Performance Analysis: Robustness • TPC-H Q10: Replace constant in selection on lineitem by parameter marker, so optimizer doesn’t know actual selectivity • 5 different optimal plans

  31. Risk Analysis • Analyze LC, LCEM, ECB • Can be reoptimized more than once • Conclusion: low overhead/risk

  32. Opportunity Analysis • Goal: how often does opportunity to reoptimize arise? • Introduce LC/LCEM/ECB checkpoints • But turn off reoptimization, and run same plan • Opportunity region for ECB: dotted line

  33. POP in (in)action • Real world workload (DMV data and queries) • Complex predicates leading to cardinality estimation errors • substring comparison, like, IN, ..

  34. POP in (in)action (contd.) • Re-optimization may result in the choice of worse plan due to: • Two estimation errors canceling out each other • Re-using intermediate results

  35. Conclusions • POP gives us a robust mechanism for re-optimization through inserting of CHECK (in its various flavors) • Higher opportunity at low risk

More Related