280 likes | 288 Views
Explore query optimization algorithms in DBMS, comparing cost factors like resource usage, response time, and I/O. Learn how query plans are executed and generated efficiently.
E N D
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 15 – Query Optimization
Query Optimization • Read : • Chapter 12, sec 12.4 • Chapter 15 • SAC+79 Pages • Purpose: • Study different algorithms to optimize queries submitted to the DBMS Dr. Manuel Rodriguez Martinez
Introduction • SQL query gets translated into relational algebra expression • Relational algebra expression is represented as tree • This is what DBMS “understands” how to process • Expression becomes a plan once we identify access methods for each operator • Relational algebra expression might have an equivalent expression • Example: R(A,B,C) , S(A, D, F) • But, each expression might have different cost • How do we find the cheapest expression? Dr. Manuel Rodriguez Martinez
Relational DBMS Architecture Client API Client Query Parser Query Optimizer Relational Operators Execution Engine File and Access Methods Concurrency and Recovery Buffer Management Disk Space Management DB Dr. Manuel Rodriguez Martinez
Query Optimizer • Module in DBMS in charge of finding cheapest available plan to execute a query • Building one is not easy! • Optimizer searches for plans and compares then based on cost • Cost can be: • Resource usage • Response time • Power consumption • Number of I/Os • Network transmission cost Dr. Manuel Rodriguez Martinez
Query Plans • Query plan specifies the operations to be executed • Tree of operators • Each operator corresponds to a relational operator • Leaf nodes usually represent base tables A,B A,B A>2 T R S R S S S Dr. Manuel Rodriguez Martinez
Executing Query Plans • Plans generated by the optimizer are fed to the execution engine • Plans support iterator interface • Open – initialize the operator • Next – get next tuple from operator • Close – de-allocate resources from operator • Execution engine invokes each method • Invocation triggers cascade of calls • Each operator call the corresponding methods on child nodes • Example: open on join, causes call to open on outer table and call to open on inner table. Same for next and close. Dr. Manuel Rodriguez Martinez
Pipelined vs Materialized Execution • Pipelined • The output tuple from one operator immediately becomes input tuple to its parent operator in the tree • Materialized • The output tuples from one operator must be stored to disk first (into a temporary table) • Once the operator finishes, its parent operator can access the materialed tuples • Most execution engines use pipelined • Saving in I/O can be substantial! • Some operator cannot be pipelined • Sorting, projections with duplicate elimination • Query optimizer must be aware of this issue! Dr. Manuel Rodriguez Martinez
Generation of Query Plans • Optimizer generates query plan after a search finds the optimal one • According to some criteria • Search is a search by construction • Alterative plans are built and compared • Cheapest one is kept • Two major algorithms exist • Dynamic programming (SAC+79) • Exhaustive search of plan space • Finds the optimal • Randomized Algorithm • Random search of plan space • Quickly finds sub-optimal but good plan • Optimization philosophy find a good plan quick, avoid bad ones! Dr. Manuel Rodriguez Martinez
Left-Deep Join vs Bushy Plans • Query Optimizer generate two major types of plans • Left-deep plans • Bushy plans • Left-deep plans • Every join has a base table as the inner join table • Use in commercial systems (first in System R) • Good for dynamic programming • Good for optimizing resource usage • Bushy plans • Joins might have intermediate tables as input to the join • Good for randomized search • Use in research prototypes for distributed databases • Good for optimizing response time Dr. Manuel Rodriguez Martinez
Left-deep plans T U R S R S T Each join always has base table as inner table R S Dr. Manuel Rodriguez Martinez
Bushy Plans R S R S T U V R S T U Dr. Manuel Rodriguez Martinez
Left-Deep vs Bushy Plans • Bushy plans • Enable parallelism in operator evaluation • Operator can execute at different rates • Good in distributed environments • More complicate to build (harder optimizer) • Left-deep plans • Joins are run in sequence • Susceptible to bottleneck at some operator • Simpler to build (easier optimizer) • Good for single site systems • Everything runs on the same machine • Commercial DBMS systems use Left-deep plans Dr. Manuel Rodriguez Martinez
Cost of a plan • The cost of plan depends on the metric you wish to optimize • Resource usage (CPU + I/O + Network) • Cost is the sum of the resources used by each operator • Response time • Cost of the slowest path in the tree • Number of I/Os • Cost is the sum of the I/Os generated by each operator • Network cost • Cost is sum of cost in moving data between operators Dr. Manuel Rodriguez Martinez
Organization of an optimizer Query Parser SQL Query Parse Tree Catalog Manager Plan Generator Cost Estimator Query Optimizer Query Plan Catalog Query Execution Engine Dr. Manuel Rodriguez Martinez
Generating Alternatives • Relational equivalences are used by the optimizer to generate different operators that do the same • Selection equivalences: • Cascading selections • Commutative selections Dr. Manuel Rodriguez Martinez
Equivalence Rules • Projections • Cascading projections • Joins • Commutative rule • Associative rule Dr. Manuel Rodriguez Martinez
Equivalence Rules(2) • Commute selections and projections • Pushing selections • Decomposing selections Dr. Manuel Rodriguez Martinez
System R Optimizer • Based on left-deep plans and dynamic programming • Most commercial systems use a System R type of optimizer • Cost is based on resource usage • Cost = CPU Cost + I/O Cost • Given a plan P, cost of P is computed as • Cost(P) = operatorCost(P.root) + Cost(P.root.leftChild) + Cost(P.root.rightChild) Dr. Manuel Rodriguez Martinez
Estimating Cost of Operators • Key feature for this is selectivity factors, selectivity, and join costs • Example: • R has no index • |R| = 100,000, ||R|| = 5000 • S has un-clustered Index on Join attribute • |S| = 70,000, ||S|| = 2500 • What algorithm shall be use for join? • Chose between: BNLJ, INLJ, e GHJ with 20 B • What is the cost? R S Dr. Manuel Rodriguez Martinez
System R Search Algorithm • Idea: • Build every possible plan and keep track of • Cheapest plan (overall) • Cheapest plans that bring data in sorted order (called interesting orders) • Dynamic Programming (divide-and-conquer) • To find plan for n-way join you • First find singe table plans • Then find plans for all (n-1)-way join and find a plan to join missing table with an (n-1)-way join • Plan for smaller joins are saved on a table Dr. Manuel Rodriguez Martinez
System R Search Algorithm (2) • Process: • For an n-way join between tables R1, R2, …, Rn: • Find the access path to access each table • Plans access to get R1, R2, …, Rn • This includes application of selection and projections for each table • Find the access path to compute 2-way joins • 2-way joins for all possible pairs of tables • Find the access path to compute 3-way joins • Add a table to a 2-way join (forms all possible 3-way joins) • Find the access path to compute 4-way joins • Add a table to a 3-way join (forms all possible 4-way joins) • … • Find the access path to compute the n-way join • Add a table to a n-1 way join (forms all possible n-way joins) Dr. Manuel Rodriguez Martinez
System R Search Algorithm (2) Plan SystemROptimizer(R1, R2, …, Rn){ for (int i = 0; i < n; ++i){ // single table access paths optPlan(Ri) = selectPlan(Ri); } for (int i=2; I < n; ++i){ // join access paths, start with 2 tables, then 3, …, for all S {R1, R2, R3, …Rn} s.t. |S| == i { // S is the next set to join bestPlan = dummy plan with infinite cost; for all Rj, Sj s.t. S = Sj {Rj} { // Sj & Rj are pieces of S P = joinPlan(optPlan(Sj), optPlan(Rj)) if (cost(P) < cost(bestPlan)){ bestPlan = P; } } optPlan(Sj) = bestPlan; } } S = {R1, R2, …, Rn); return optPlan(S); } Dr. Manuel Rodriguez Martinez
Illustration • How does the algorithm works for this case • Tables • R - |R| = 100,000, ||R|| = 8,000 • S - |S| = 90,000, ||S|| = 6,000 • T - |T| = 120,000 ||T|| = 10,000 • U - |U| = 80,000 ||U|| = 4,000 • All tables are stored in heap files • DBMS has: Blocked nested loops join, Hash Join and 25 free buffers • What is the best plan for query: • RS T U Dr. Manuel Rodriguez Martinez
Illustration (2) • How does the algorithm works for this case • Tables • R - |R| = 100,000, ||R|| = 8,000 • R is stored on a clustered B+tree matching join attribute with T • S - |S| = 90,000, ||S|| = 6,000 • |R| is stored on • T - |T| = 120,000 ||T|| = 10,000 • DBMS has: Blocked nested loops join, Indexed-nested loops join, and Hash Join and 3 free buffers • What is the best plan for query: • RS T Dr. Manuel Rodriguez Martinez
Illustration (3) • How does the algorithm works for this case • Tables • R - |R| = 100,000, ||R|| = 8,000 • R is stored on a clustered B+tree matching join attribute with T • S - |S| = 90,000, ||S|| = 6,000 • |R| is stored on • T - |T| = 120,000 ||T|| = 10,000 • DBMS has: Blocked nested loops join, Indexed-nested loops join, and Hash Join and 25 free buffers • What is the best plan for query: • A>3 B = ‘NY’ (R)S T • If SFA>3 = .10 and SF B= ‘NY’ = 0.05 and A>3 matches index on R. Dr. Manuel Rodriguez Martinez
Issues with SystemR optimizer • Algorithm performs exhaustive search of left-deep plans • Dynamic Programming is ill-suited for optimization of response time • Principle of optimality is not observed • Difficult (but not impossible) to modify for bushy plans • Search space is huge • Need pruning techniques to cut on the number of plans stored • Do we need exhaustive search? • Optimal plan vs sub-optimal that is good and quick to find • Disaster avoidance – More important to avoid bad plans!!! Dr. Manuel Rodriguez Martinez
Alternative approaches • Randomized Query Optimization • Use randomized algorithms to build and search plans • Good for bushy plans • Rule-based Query Optimization • Use rules to guide the search and better prune space • Good to apply special cases and pruning • Parametric Query Optimization • Add run-time parameters to really capture the reality of the system • Multiple-query Optimization • Optimizer takes 2 or more queries at a time for optimization Dr. Manuel Rodriguez Martinez