190 likes | 268 Views
Learn about optimizing database queries, comparing access plans, cost estimation, and plan selection in SQL query processing. Covers dynamic programming, correlated subqueries, and query rewrite techniques.
E N D
Access Path Selection in a Relational Database Management System Selinger et al.
Query Optimization • Declarative query language relieves programmer of the burden to choose an access plan. • Difference between good and bad access plans can be several orders of magnitude. • Problem: How do we find a good plan?
How to choose a good plan? Divide into three problems: • What is the set of plans we consider? • How do we compare (cost) plans? • How do we choose a “good” plan from this set?
Background: SQL SELECT {DISTINCT} <list of columns> FROM <list of tables> {WHERE <list of "Boolean Factors“(predicates in CNF)>} {GROUP BY <list of columns> {HAVING <list of Boolean Factors>}} {ORDER BY <list of columns>};
Processing SQL Statements • Four steps: • Parsing • Optimization • Code generation • Execution
SQL Query Types • Single block queries only • optimize nested sub-queries separately • Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries. • Rewrite to remove correlations where possible. (Tricky, more about this later.) • SPJ queries only in this paper
Problem 1: Plan Space • Fixed set of individual access methods • Sequential & index (clustered/unclustered) scans • NL-join, (sort)-merge join, hash join • Sorting & hash-based grouping • Plan flows in a non-blocking fashion with get-next iterators
Plan Space (Contd.) • Assumptions in System R: • Selections of “sargable” predicates are "pushed down" • Projections are "pushed down" • Single query blocks • Only left-deep plan trees (There are n! plans (not factoring in choice of join method)) • Avoid Cartesian products
Problem 2: How to Cost a Plan • Estimation of input and intermediate cardinalities based on simple statistical models (e.g., uniform distribution assumption, attribute independence) • Estimation of costs for each operator based on statistics about input relations • Cost is weighted function between I/O and CPU(no distinction between random and sequential IO)
Cost Estimation (Selinger) • Maintenance of simple statistics: • # of tuples & pages • # of values per column (only for indexed columns) • Assumption of attribute independence • When no estimate available, use magic number • These estimations are done periodically
Cost Estimation (Today) • Sampling: so far only concrete results for base relations • Histograms: getting better. Common in industry, some interesting new research. • Controlling "error propagation"
Problem 3: Choosing a Plan • Exhaustive search • Dynamic Programming (prunes suboptimal parts of the search space): System R • Top-down, transformative version of DP: Volcano, Cascades (used in MS SQL Server?) • Randomized search algorithms (e.g. Ioannidis & Kang) • Techniques from Operations Research • Etc.
System R Approach • Recall: Only left-deep plan trees (n! different plans) • Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming
Dynamic Programming Approach • Find all 1-table plans for each base relation • Try all ways of joining i-table plans saved so far with 1-table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order • Note: secondary join predicates are just like selections that can’t be pushed down • At the end, GROUP BY and ORDER BY • Use plan in interesting order, or add sort to cheapest unordered plan.
Evaluation • Complexity of dynamic programming: about n2n-1, intermediate storage: plans • “No-cartesian-products” rule can make a big difference for some queries. • DP only works up to 10-15 joins • Adding parameters to the search space makes things worse (e.g. expensive predicates, distribution, parallelism, etc.)
Nested Queries • Subqueries optimized separately • Uncorrelated vs. correlated subqueries • Uncorrelated subqueries are basically constants to be computed once • Correlated subqueries are like function calls
Query Rewrite in IBM DB2 Leung et al.
Why Query Rewrite? • Problem: • Very complex queries automatically generated through tools with many levels of subqueries • Correlated subqueries • Selinger approach only works one block at a time • Main idea: Transform the query into a simpler, equivalent query
SELECT P.pno FROM Parts P WHERE P.quantity = (SELECT COUNT(*) FROM Supply S WHERE S.pno = P.pno AND S.shipdate < 1-1-2000) INSERT INTO Temp (pnum, cnt) (SELECT P.pno, COUNT(*) FROM Supply S WHERE S.shipdate < 1-1-2000 GROUP BY P.pno) SELECT P.pno FROM Parts P, Temp T WHERE P.quantity = T.cnt AND P.pno = T.pno Query Rewrite is Tricky