1 / 65

CS411 Database Systems

CS411 Database Systems. 10: Query Processing (continued). Query Optimization. History. Patricia Selinger wrote two papers - DB security - Query optimization She has been a big factor in IBM’s success. Chapter 16: Optimization. Step 1: convert the SQL query to a logical plan

makoto
Download Presentation

CS411 Database Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS411Database Systems 10: Query Processing (continued)

  2. Query Optimization

  3. History Patricia Selinger wrote two papers - DB security - Query optimization She has been a big factor in IBM’s success

  4. Chapter 16: Optimization Step 1: convert the SQL query to a logical plan Step 2: find a better logical plan, find an associated physical plan

  5. SQL –> Logical Query Plans

  6. Basic SQL queries are easy to convert to logical plans SELECT a1, …, an FROM R1, …, Rk WHERE C Pa1,…,an(sC(R1 x R2 x … x Rk)) SELECT a1, …, an FROM R1, …, Rk WHERE C GROUP BY b1, …, bl Pa1,…,an(gb1, …, bm, aggs (sC(R1 x … x Rk)))

  7. Some nested queries can be flattened Selectdistinct product.name From product Where product.maker in (Select company.name From company where company.city=“Urbana”) Selectdistinct product.name From product, company Where product.maker = company.name AND company.city=“Urbana”

  8. Some nested queries are harder Selectdistinct x.name, x.maker From product x Where x.color= “blue” AND x.price >= ALL (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”)

  9. Selectdistinct x.name, x.maker From product x Where x.color= “blue” AND x.price < SOME (Select y.price From product y Where x.maker = y.maker AND y.color=“blue”) Let’s compute the complement first:

  10. Select distinct x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price Now we can flatten it: This returns exactly the products we don’t want, so…

  11. A set difference operator finishes the job (Select x.name, x.maker From product x Where x.color = “blue”) EXCEPT (Select x.name, x.maker From product x, product y Where x.color= “blue” AND x.maker = y.maker AND y.color=“blue” AND x.price < y.price)

  12. Now rewrite the logical plan to an equivalent but better one Will probably run faster Same query answer Heuristic: likely to result in cheaper plans Optimizer uses algebraic laws Equivalent Original Cost-based: estimate size and cost of intermediate results, search systematically for best plan

  13. Algebraic Laws

  14. Algebraic laws Commutative and Associative Laws R  S = S  R, R  (S  T) = (R  S)  T R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T Distributive Laws R ⋈ (S  T) = (R ⋈ S)  (R ⋈ T)

  15. Algebraic laws about selections sC AND D (R) = sC(sD(R)) = sC(R) ∩ sD(R) sC OR D (R) = sC(R) sD(R) sC(R  S) = sC(R) sC(S) sC(R ⋈ S) = sC(R) ⋈ S sC(R – S) = sC(R) – S sC(R ∩ S) = sC(R) ∩ S if C involves only attributes of R

  16. R(A,B,C,D) S(E,F,G) sF=3 (R ⋈D=E S) = sA=5 AND G=9 (R ⋈D=E S) =

  17. Algebraic laws for projection PM(R ⋈ S) = PN(PP(R) ⋈PQ(S)) where N, P, Q are appropriate subsets of attributes of M PM(PN(R)) = PM,N(R) R(A,B,C,D) S(E,F,G) PA,B,G(R ⋈D=E S) = P ? (P?(R) ⋈D=EP?(S))

  18. Algebraic laws for grouping and aggregation  (A, agg(B)(R)) = A, agg(B)(R) A, agg(B)((R)) = A, agg(B)(R), if agg is duplicate insensitive The book describes additional algebraic laws, but even the book doesn’t cover them all. SUM COUNT AVG MIN MAX

  19. Heuristics-based Optimization- or – Do projections and selections as early as possible

  20. Push selection predicates down the tree, to make intermediate results smaller pname pname sprice > 100 AND city = “Urbana” maker = name maker = name city = “Urbana” price > 100 Company Product Company Product (but may cause us to lose an important ordering of the tuples, if we use indexes)

  21. For each company with a product costing more than $100, find the max price of its products SELECT y.name, Max(x.price) FROM product x, company y WHERE x.maker=y.name and x.price > 100 GROUP BYy.name HAVING Max(x.price) > 100 SELECTy.name, Max(x.price) FROMproduct x, company y WHEREx.maker = y.name GROUP BY y.name HAVING Max(x.price) > 100 Advantage: the size of the join will be smaller. Requires transformation rules specific to the grouping/aggregation operators: won’t work if we replace MAX by MIN.

  22. Sometimes we should push selections up… Bargain view V1: categories with some price<20, and the cheapest price Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker=y.name

  23. Sometimes we should push selections up… Bargain view V1: categories with some price<20, and the cheapest price Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20 Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker=y.name

  24. … and then down. Bargain view V1: categories with some price<20, and the cheapest price Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20 Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 GroupBy x.category Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker=y.name AND V1.p < 20

  25. Cost-based Optimization

  26. The simplest version of cost-based estimation: generate all plans, estimate their costs, pick the cheapest Equivalent plans Too slow, so only look at the plans likely to be the fastest $ $ $ $ $ $ $ $ $ $ $ $ $ $

  27. Often: generate a partial plan, optimize it, then add another operator, … Top-down: the partial plan is a top fragment of the logical plan Bottom up: the partial plan is a bottom fragment of the logical plan

  28. We can use any of the search strategies that you learned about in your AI class Branch-and-bound: • Remember the cheapest complete plan P seen so far and its cost C • Stop generating partial plan once its cost is > C • If a cheaper complete plan is found, replace P, C Hill climbing: • Remember only the cheapest partial plan seen so far Dynamic programming: • Remember all the cheapest partial plans

  29. The order that relations are joined in has a huge impact on performance Given: query R1 ⋈ … ⋈Rn, function cost( ), find the best join tree for the query R3 R1 R2 R4 Plan = tree Partial plan = subtree

  30. Left deep join tree R4 R2 R5 R3 R1

  31. Right deep join tree R3 R1 R5 R2 R4

  32. Bushy join tree R3 R2 R4 R5 R1

  33. Dynamic programming is a good (bottom-up) way to choose join ordering Find the best plan for each subquery Q of {R1, …, Rn}: • {R1}, …, {Rn} • {R1, R2}, {R1, R3}, …, {Rn-1, Rn} • {R1, R2, R3}, {R1, R2, R4}, … • … • {R1, …, Rn} Output: • Size(Q) • A best plan Plan(Q) • Cost(Q)

  34. The ith step of the dynamic program For each Q ⊆ {R1, …, Rn} of size i do: • Compute Size(Q) (later…) • For every pair Q1, Q2 such that Q = Q1  Q2,compute cost(Plan(Q1) ⋈ Plan(Q2)) Cost(Q) = the smallest such cost Plan(Q) = the corresponding plan

  35. Our next example (over)simplifies: Cost of a scan = 0 Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + size(intermediate results) Intermediate result size: • 0, if not a join • Else something more realistic Cost(R5) = Cost(R7) = 0 Cost(R5 ⋈ R7) = 0 Size(R5) = Size(R7) = 0 Don’t do this at home

  36. R ⋈ S ⋈ T ⋈ U Number of tuples: R = 2000 S = 5000 T = 3000 U = 1000 Size estimate: size(A ⋈ B) = .01*size(A) *size(B) Unrealistic!

  37. R ⋈ S ⋈ T ⋈ U Number of tuples: R = 2000 S = 5000 T = 3000 U = 1000 Size estimate: size(A ⋈B) = .01*size(A) *size(B) Unrealistic!

  38. Join order options for RSTU • Cost of (RST)U = 60K + 0 + 3M + 0 • Cost of (RSU)T = 20K + 0 + 1M + 0 • Cost of (RTU)S = 20K + 0 + .6M + 0 • Cost of (STU)R = 30K + 0 + 1.5M + 0 • Cost of (RS)(TU) = 0 + 0 + 100K + 30K • Cost of (RT)(SU) = 0 + 0 + 60K + 50K • Cost of (RU)(TS) = 0 + 0 + 20K + 150K

  39. What if we don’t oversimplify? • More realistic size/cost estimations!! (next slides) • Use heuristics to reduce the search space • Consider only left linear trees • No trees with cartesian products: R(A,B) S(B,C) T(C,D) (R ⋈ T) ⋈ S has a cartesian product

  40. Summary of query optimization process so far Discussion in the book is very nice • Parse your query into tree form • Move selections as far down the tree as you can • Project out unwanted attributes as early as you can, when you have their tuples in memory anyway • Pick a good join order, based on the expected size of intermediate results • Pick an implementation for each operation in the tree

  41. Completing a Physical Query Plan

  42. Completing the Physical Query Plan • Choose algorithm to implement each operator Need to consider more than I/O cost: • How much memory do we have ? • Are the input operand(s) sorted ? • Decide for each intermediate result: • Materialize • Pipeline

  43. One option is to materialize intermediate results between operators HashTable Srepeat read(R, x) y  join(HashTable, x)write(V1, y) HashTable  T repeat read(V1, y) z  join(HashTable, y)write(V2, z) HashTable  Urepeat read(V2, z) u  join(HashTable, z)write(Answer, u) Cost = ? Memory = ? ⋈ V2 ⋈ U V1 ⋈ T R S

  44. pipeline The second option is to pipeline between operators HashTable1  SHashTable2  T HashTable3  Urepeat read(R, x) y  join(HashTable1, x) z  join(HashTable2, y) u  join(HashTable3, z)write(Answer, u) Cost = ? Memory = ? ⋈ ⋈ U ⋈ T R S

  45. Example 16.36 Logical plan: Main memory M = 101 blocks of space k blocks U(y,z) 10,000 blocks R(w,x) 5,000 blocks S(x,y) 10,000 blocks

  46. kblocks U(y,z) 10,000 blocks R(w,x) 5,000 blocks S(x,y) 10,000 blocks Example 16.36 Naive evaluation: 2 partitioned hash-joins, materialized (Make sure buckets fit in memory!) Cost 3B(R) + 3B(S) + 4k+ 3B(U) = 75000 + 4k

  47. kblocks U(y,z) 10,000 blocks R(w,x) 5,000 blocks S(x,y) 10,000 blocks Example 16.36 Smarter: • Step 1: hash R on x into 100 buckets, each of 50 blocks; to disk • Step 2: hash S on x into 100 buckets; to disk • Step 3: read each R bucket in memory (50 buffers at a time), join with S (1 buffer at a time); hash result on y into 50 buckets (50 buffers) -- here we pipeline Cost so far: 3B(R) + 3B(S)

  48. kblocks U(y,z) 10,000 blocks R(w,x) 5,000 blocks S(x,y) 10,000 blocks Example 16.36 Continuing: • How large are the 50 buckets on y? k/50 blocks each. • If k <= 50 then keep all 50 buckets in Step 3 in memory, then: • Step 4: read U from disk, hash on y and join in memory • Total cost: 3B(R) + 3B(S) + B(U) = 55,000

More Related