1 / 26

Lecture 9: Query Optimization

Lecture 9: Query Optimization . Sept. 26 2006 ChengXiang Zhai. Most slides are adapted from Kevin Chang’s lecture slides. DBMS Architecture. Today’s lecture. Past lectures. User/Web Forms/Applications/DBA. query. transaction. Query Parser. Transaction Manager. Query Rewriter.

zorina
Download Presentation

Lecture 9: Query Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 9: Query Optimization Sept. 26 2006 ChengXiang Zhai Most slides are adapted from Kevin Chang’s lecture slides

  2. DBMS Architecture Today’s lecture Past lectures User/Web Forms/Applications/DBA query transaction Query Parser Transaction Manager Query Rewriter Logging & Recovery Query Optimizer Lock Manager Query Executor Files & Access Methods Lock Tables Buffers Buffer Manager Main Memory Storage Manager Storage

  3. SQL Constructs • SECLECT (DISTINCT) <list of columns> • FROM <list of tables> • WHERE <list of Boolean Factors> • GROUP BY <list of columns> • HAVING <list of Boolean Factors> • ORDER BY <list of columns>

  4. SQL Semantics • Take Cartesian product of FROM tables • Project only those referenced columns • WHERE: apply all filters in WHERE • GROUP BY: form groups on results • HAVING: apply filter to groups • ORDER BY: make sure results in right order • DISTINCT: remove duplicates • Q: Is this “operational semantics” efficient? • Different plans: mainly different in the first three

  5. Optimization: Different Strategies • Optimal approach: • enumerate each possible plan • measure its performance by running it • pick the fastest one • Heuristics approach: • fixed heuristics all the way through plan construction • e.g.: always nested loop joins, indexed relation as inner • e.g.: order relations from smallest to biggest • How does Selinger/System R differ from them? • What are the main components in their approach?

  6. New Paradigm: Cost-based Optimization • Plan space: • what is the space of query plans? • Cost estimation: • how to estimate the cost, without executing each? • Search algorithm: • how to search the space, as guided by cost estimates

  7. Space of Query Plans What can you do differently? Parameters to tune: • Selections: • Joins:

  8. Space of Query Plans • Selections: • algorithms: sequential, index scan • ordering: why this will matter? • Joins: • algorithms: nested-loop, sort merge, hash • ordering • Ordering/Grouping: • can an “interesting order” be produced by join/selections? • algorithm: sorting, hash-based • They interleave with each other!

  9. Assumptions to Reduce the Space? • Projections: • pushed down to reduce # of columns • Selections: • Joins:

  10. Huge Space! Assumptions to Help Typical assumptions to help reduce the space: • Projections: • pushed down to reduce # of columns • Selections: • pushed down to reduce # of rows • Joins: • left-deep joins (what else can you do?) • avoid Cartesian products; delay it in the plan • Q: how to avoid Cartesian products? • May miss an optimal plan!

  11. Cost/Size Estimation • Accurate relatively: • goal is to compare plans, not to predict exact cost • more of an art than an exact science • Each operator: input size, cost, output size • estimate cost based on input size • estimate output size (for next operator) or selectivity • selectivity = ratio of output to input

  12. Cost Estimation: Selinger Style • Input: simple DB statistics • # of tuples & disk pages • # of distinct values per column • statistics updated periodically • Assumption of attribute/predicate independence • When no estimate available, use magic number • New/Alternative approaches: • sampling, histogram of DB

  13. Selectivity Factors: Point Selection • [column = value] • [dept = CS] • input size: NCARD(Student) = 200, ICARD(dept) = 10 • output size = ? selectivity = ?

  14. Selectivity Factors: Range Selection • What is assumed in these formulas? • column > value1: • F = (maxValue - value1) / rangeOfValue • column in value1:value2 • F = (value2 - value1) / rangeOfValue • column in (set of values): • F: as union of point selections

  15. Selectivity Factors: Join Predicates • column1 = column2: • when both columns are keys: • student.sid = employee.sid • one key, the other foreign key • student.sid = enrollment.sid • student: NCARD = 200, SID distinct values = 200 • enrollment: NCARD = 800, SID distinct values = 100 • 800 * (200/200) = 800 • none are keys: • user.sid = enrollement.sid • user: NCARD = 1000, SID distinct values = 200 • enrollment: NCARD = 800, SID distinct values = 100 • 800 * (1000/200) = 4000 • given formula covers all?

  16. Goal: Cost Estimate • System R cost model: • Sum of I/O and CPU • (# PAGE) + W * (RSI CALLS) • why RSI calls model workload of RSS?

  17. Search the Plan Space • Baseline: exhaustive search • enumerate all combinations, and compare their cost • Search method parameters: • plan tree development: • construction: bottom-up, top-down • modification: improve a somehow-constructed tree • algorithms: • heuristic selections: make choices based on heuristics • branch and bound: search bounded by the current best tree • hill climbing: find “nearby” plans with lowest cost • dynamic programming: construction by greedy selections • where does System-R approach fit in?

  18. Plan Search: System-R Style AKA: Selinger style optimization • Bottom-up • start from ground relations (in FROM) • work up the tree to form a plan • Dynamic programming • greedily prune subtrees that are obviously useless • but not necessarily local best at every step: ? why • Many other approaches…

  19. DP: Local Optimal -> Global Optimal? Think of optimizing (T1 QT2 Q T3) Q T4 Q T5… • Sometime it does hold • does plan (T1 QT2 Q T3) affect the rest of the plan? • Not always true • example? • but hopefully at least give good plans

  20. System R Search: Start from Relations • Base relations access: • find all plans for accessing each base relations • push down “sargable” arguments • ? example of non-sargable predicates? • choose good plans, discard bad ones • keep cheapest for unordered & each interesting order • Join ordering: • find all ways of joining a pair of base relations • choose good plans, discard bad ones

  21. WHERE Clause and Sargable Predicates • WHERE conditions: [dept = CS] OR ([dept = EE] AND [GPA > 3.5]) • Normal forms: • DNF: (C) OR (E AND G) • CNF: (C OR E) AND (C OR G) • Difference in using the CNF/DNF “factors”

  22. System R Search: Left-Deep Join Plans • Join ordering: • consider only left-deep trees: n! ordering for n tables • basis: • find all ways of joining a pair of base relations • choose good plans, discard bad ones • induction: until you have a full plan • k to k+1: given plans of k-relation join, add one more • prefer one with predicate: postpone Cartesian product • can throw away k-plans • Finally: grouping/ordering • using interesting order • additional sorting

  23. Nested Subqueries • Subqueries optimized separately • Correlation: order of evaluations • uncorrelated queries are like “constants” • correlated queries are like “function calls”

  24. Beyond System-R • Parallel/Distributed DB • User-defined predicates • Materialized views • Adaptive query optimization • Fuzzy queries

  25. What You Should Know • Query optimization is critical for any commercial DBMS • good/bad can be orders of magnitude • The 3 components of the System-R style optimization • Plan space definition • Cost estimation • Search algorithm • huge number of alternative, semantically equivalent plans • Ideal goal: • map a declarative query to the most efficient plan • Conventional wisdom: avoid bad plans • State of the art: • industry: most optimizers are System-R style • academic: always a core database research topic

  26. Carry Away Messages • Appropriate formulation of the problem leads to good solutions • Query optimization as a search problem • Messy problems always require a clean model • Make assumptions to make the problem tractable • An approximate solution could be good enough • Opens up opportunities for further improvement

More Related