1 / 19

Access Path Selection in a Relational Database Management System

Access Path Selection in a Relational Database Management System. Selinger et al. Query Optimization. Declarative query language relieves programmer of the burden to choose an access plan. Difference between good and bad access plans can be several orders of magnitude.

Download Presentation

Access Path Selection in a Relational Database Management System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Access Path Selection in a Relational Database Management System Selinger et al.

  2. Query Optimization • Declarative query language relieves programmer of the burden to choose an access plan. • Difference between good and bad access plans can be several orders of magnitude. • Problem: How do we find a good plan?

  3. How to choose a good plan? Divide into three problems: • What is the set of plans we consider? • How do we compare (cost) plans? • How do we choose a “good” plan from this set?

  4. Background: SQL SELECT {DISTINCT} <list of columns> FROM <list of tables> {WHERE <list of "Boolean Factors“(predicates in CNF)>} {GROUP BY <list of columns> {HAVING <list of Boolean Factors>}} {ORDER BY <list of columns>};

  5. Processing SQL Statements • Four steps: • Parsing • Optimization • Code generation • Execution

  6. SQL Query Types • Single block queries only • optimize nested sub-queries separately • Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries. • Rewrite to remove correlations where possible. (Tricky, more about this later.) • SPJ queries only in this paper

  7. Problem 1: Plan Space • Fixed set of individual access methods • Sequential & index (clustered/unclustered) scans • NL-join, (sort)-merge join, hash join • Sorting & hash-based grouping • Plan flows in a non-blocking fashion with get-next iterators

  8. Plan Space (Contd.) • Assumptions in System R: • Selections of “sargable” predicates are "pushed down" • Projections are "pushed down" • Single query blocks • Only left-deep plan trees (There are n! plans (not factoring in choice of join method)) • Avoid Cartesian products

  9. Problem 2: How to Cost a Plan • Estimation of input and intermediate cardinalities based on simple statistical models (e.g., uniform distribution assumption, attribute independence) • Estimation of costs for each operator based on statistics about input relations • Cost is weighted function between I/O and CPU(no distinction between random and sequential IO)

  10. Cost Estimation (Selinger) • Maintenance of simple statistics: • # of tuples & pages • # of values per column (only for indexed columns) • Assumption of attribute independence • When no estimate available, use magic number • These estimations are done periodically

  11. Cost Estimation (Today) • Sampling: so far only concrete results for base relations • Histograms: getting better. Common in industry, some interesting new research. • Controlling "error propagation"

  12. Problem 3: Choosing a Plan • Exhaustive search • Dynamic Programming (prunes suboptimal parts of the search space): System R • Top-down, transformative version of DP: Volcano, Cascades (used in MS SQL Server?) • Randomized search algorithms (e.g. Ioannidis & Kang) • Techniques from Operations Research • Etc.

  13. System R Approach • Recall: Only left-deep plan trees (n! different plans) • Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming

  14. Dynamic Programming Approach • Find all 1-table plans for each base relation • Try all ways of joining i-table plans saved so far with 1-table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order • Note: secondary join predicates are just like selections that can’t be pushed down • At the end, GROUP BY and ORDER BY • Use plan in interesting order, or add sort to cheapest unordered plan.

  15. Evaluation • Complexity of dynamic programming: about n2n-1, intermediate storage:  plans • “No-cartesian-products” rule can make a big difference for some queries. • DP only works up to 10-15 joins • Adding parameters to the search space makes things worse (e.g. expensive predicates, distribution, parallelism, etc.)

  16. Nested Queries • Subqueries optimized separately • Uncorrelated vs. correlated subqueries • Uncorrelated subqueries are basically constants to be computed once • Correlated subqueries are like function calls

  17. Query Rewrite in IBM DB2 Leung et al.

  18. Why Query Rewrite? • Problem: • Very complex queries automatically generated through tools with many levels of subqueries • Correlated subqueries • Selinger approach only works one block at a time • Main idea: Transform the query into a simpler, equivalent query

  19. SELECT P.pno FROM Parts P WHERE P.quantity = (SELECT COUNT(*) FROM Supply S WHERE S.pno = P.pno AND S.shipdate < 1-1-2000) INSERT INTO Temp (pnum, cnt) (SELECT P.pno, COUNT(*) FROM Supply S WHERE S.shipdate < 1-1-2000 GROUP BY P.pno) SELECT P.pno FROM Parts P, Temp T WHERE P.quantity = T.cnt AND P.pno = T.pno Query Rewrite is Tricky

More Related