190 likes | 265 Views
Access Path Selection in a Relational Database Management System. Selinger et al. Query Optimization. Declarative query language relieves programmer of the burden to choose an access plan. Difference between good and bad access plans can be several orders of magnitude.
E N D
Access Path Selection in a Relational Database Management System Selinger et al.
Query Optimization • Declarative query language relieves programmer of the burden to choose an access plan. • Difference between good and bad access plans can be several orders of magnitude. • Problem: How do we find a good plan?
How to choose a good plan? Divide into three problems: • What is the set of plans we consider? • How do we compare (cost) plans? • How do we choose a “good” plan from this set?
Background: SQL SELECT {DISTINCT} <list of columns> FROM <list of tables> {WHERE <list of "Boolean Factors“(predicates in CNF)>} {GROUP BY <list of columns> {HAVING <list of Boolean Factors>}} {ORDER BY <list of columns>};
Processing SQL Statements • Four steps: • Parsing • Optimization • Code generation • Execution
SQL Query Types • Single block queries only • optimize nested sub-queries separately • Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries. • Rewrite to remove correlations where possible. (Tricky, more about this later.) • SPJ queries only in this paper
Problem 1: Plan Space • Fixed set of individual access methods • Sequential & index (clustered/unclustered) scans • NL-join, (sort)-merge join, hash join • Sorting & hash-based grouping • Plan flows in a non-blocking fashion with get-next iterators
Plan Space (Contd.) • Assumptions in System R: • Selections of “sargable” predicates are "pushed down" • Projections are "pushed down" • Single query blocks • Only left-deep plan trees (There are n! plans (not factoring in choice of join method)) • Avoid Cartesian products
Problem 2: How to Cost a Plan • Estimation of input and intermediate cardinalities based on simple statistical models (e.g., uniform distribution assumption, attribute independence) • Estimation of costs for each operator based on statistics about input relations • Cost is weighted function between I/O and CPU(no distinction between random and sequential IO)
Cost Estimation (Selinger) • Maintenance of simple statistics: • # of tuples & pages • # of values per column (only for indexed columns) • Assumption of attribute independence • When no estimate available, use magic number • These estimations are done periodically
Cost Estimation (Today) • Sampling: so far only concrete results for base relations • Histograms: getting better. Common in industry, some interesting new research. • Controlling "error propagation"
Problem 3: Choosing a Plan • Exhaustive search • Dynamic Programming (prunes suboptimal parts of the search space): System R • Top-down, transformative version of DP: Volcano, Cascades (used in MS SQL Server?) • Randomized search algorithms (e.g. Ioannidis & Kang) • Techniques from Operations Research • Etc.
System R Approach • Recall: Only left-deep plan trees (n! different plans) • Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming
Dynamic Programming Approach • Find all 1-table plans for each base relation • Try all ways of joining i-table plans saved so far with 1-table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order • Note: secondary join predicates are just like selections that can’t be pushed down • At the end, GROUP BY and ORDER BY • Use plan in interesting order, or add sort to cheapest unordered plan.
Evaluation • Complexity of dynamic programming: about n2n-1, intermediate storage: plans • “No-cartesian-products” rule can make a big difference for some queries. • DP only works up to 10-15 joins • Adding parameters to the search space makes things worse (e.g. expensive predicates, distribution, parallelism, etc.)
Nested Queries • Subqueries optimized separately • Uncorrelated vs. correlated subqueries • Uncorrelated subqueries are basically constants to be computed once • Correlated subqueries are like function calls
Query Rewrite in IBM DB2 Leung et al.
Why Query Rewrite? • Problem: • Very complex queries automatically generated through tools with many levels of subqueries • Correlated subqueries • Selinger approach only works one block at a time • Main idea: Transform the query into a simpler, equivalent query
SELECT P.pno FROM Parts P WHERE P.quantity = (SELECT COUNT(*) FROM Supply S WHERE S.pno = P.pno AND S.shipdate < 1-1-2000) INSERT INTO Temp (pnum, cnt) (SELECT P.pno, COUNT(*) FROM Supply S WHERE S.shipdate < 1-1-2000 GROUP BY P.pno) SELECT P.pno FROM Parts P, Temp T WHERE P.quantity = T.cnt AND P.pno = T.pno Query Rewrite is Tricky