CS411 Database Systems. 10: Query Processing (continued): Query Optimization
History: Patricia Selinger wrote two foundational papers, one on DB security and one on query optimization. She has been a big factor in IBM’s success.
Chapter 16: Optimization. Step 1: convert the SQL query to a logical plan. Step 2: find a better, equivalent logical plan and an associated physical plan.
Basic SQL queries are easy to convert to logical plans:
SELECT a1, …, an FROM R1, …, Rk WHERE C   becomes   π_{a1,…,an}(σ_C(R1 × R2 × … × Rk))
SELECT a1, …, an FROM R1, …, Rk WHERE C GROUP BY b1, …, bm   becomes   π_{a1,…,an}(γ_{b1,…,bm, aggs}(σ_C(R1 × … × Rk)))
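To make the shape of these plans concrete, here is a minimal Python sketch (not from the slides) that builds the π(σ(×)) tree for the first template; the node classes and the example query are hypothetical names chosen only for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class Relation:        # a base table Ri
    name: str

@dataclass
class CrossProduct:    # R1 x R2 x ... x Rk
    inputs: List[object]

@dataclass
class Select:          # sigma_C
    condition: str
    child: object

@dataclass
class Project:         # pi_{a1,...,an}
    attrs: List[str]
    child: object

def basic_sfw_to_plan(attrs, relations, condition):
    """SELECT attrs FROM relations WHERE condition
       becomes  pi_attrs( sigma_condition( R1 x ... x Rk ) )."""
    return Project(attrs,
                   Select(condition,
                          CrossProduct([Relation(r) for r in relations])))

# Hypothetical example: SELECT pname FROM Product, Company WHERE maker = cname
print(basic_sfw_to_plan(["pname"], ["Product", "Company"], "maker = cname"))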
Some nested queries can be flattened.
Select distinct product.name From product Where product.maker in (Select company.name From company Where company.city = “Urbana”)
becomes the flat query:
Select distinct product.name From product, company Where product.maker = company.name AND company.city = “Urbana”
Some nested queries are harder.
Select distinct x.name, x.maker From product x Where x.color = “blue” AND x.price >= ALL (Select y.price From product y Where x.maker = y.maker AND y.color = “blue”)
Let’s compute the complement first:
Select distinct x.name, x.maker From product x Where x.color = “blue” AND x.price < SOME (Select y.price From product y Where x.maker = y.maker AND y.color = “blue”)
Now we can flatten it:
Select distinct x.name, x.maker From product x, product y Where x.color = “blue” AND x.maker = y.maker AND y.color = “blue” AND x.price < y.price
This returns exactly the products we don’t want, so…
A set difference operator finishes the job:
(Select x.name, x.maker From product x Where x.color = “blue”)
EXCEPT
(Select x.name, x.maker From product x, product y Where x.color = “blue” AND x.maker = y.maker AND y.color = “blue” AND x.price < y.price)
Now rewrite the logical plan to an equivalent but better one: it gives the same query answer and will probably run faster. The optimizer uses algebraic laws to turn the original plan into equivalent plans. Heuristic optimization: apply rewrites that are likely to result in cheaper plans. Cost-based optimization: estimate the size and cost of intermediate results and search systematically for the best plan.
Algebraic laws. Commutative and associative laws: R ∪ S = S ∪ R, R ∪ (S ∪ T) = (R ∪ S) ∪ T; R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T; R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T. Distributive law: R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)
Algebraic laws about selections: σ_{C AND D}(R) = σ_C(σ_D(R)) = σ_C(R) ∩ σ_D(R); σ_{C OR D}(R) = σ_C(R) ∪ σ_D(R); σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S); and, if C involves only attributes of R: σ_C(R ⋈ S) = σ_C(R) ⋈ S, σ_C(R – S) = σ_C(R) – S, σ_C(R ∩ S) = σ_C(R) ∩ S
Exercise: given R(A,B,C,D) and S(E,F,G), push the selections: σ_{F=3}(R ⋈_{D=E} S) = ?    σ_{A=5 AND G=9}(R ⋈_{D=E} S) = ?
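As an illustration of the law σ_C(R ⋈ S) = σ_C(R) ⋈ S, here is a minimal sketch (a hypothetical helper, not from the lecture) that decides, conjunct by conjunct, which side of a join a selection can be pushed to, based only on the attributes each conjunct mentions.

def split_selection(conjuncts, attrs_R, attrs_S):
    """conjuncts: list of (condition_text, set_of_attributes_used) pairs.
       Returns the conjuncts to push into R, into S, and those that must
       stay above the join because they mention attributes of both."""
    to_R, to_S, stay = [], [], []
    for cond, used in conjuncts:
        if used <= attrs_R:
            to_R.append(cond)
        elif used <= attrs_S:
            to_S.append(cond)
        else:
            stay.append(cond)
    return to_R, to_S, stay

# Hypothetical schemas R(a, b) and S(c, d), condition a = 5 AND d = 9 AND a = c:
print(split_selection(
    [("a = 5", {"a"}), ("d = 9", {"d"}), ("a = c", {"a", "c"})],
    attrs_R={"a", "b"}, attrs_S={"c", "d"}))
# -> (['a = 5'], ['d = 9'], ['a = c'])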
Algebraic laws for projection: π_M(R ⋈ S) = π_N(π_P(R) ⋈ π_Q(S)), where N, P, Q are appropriate attribute subsets (P and Q must keep M’s attributes from R and S plus the join attributes); π_M(π_N(R)) = π_M(R), provided M ⊆ N. Exercise: R(A,B,C,D), S(E,F,G): π_{A,B,G}(R ⋈_{D=E} S) = π_?(π_?(R) ⋈_{D=E} π_?(S))
Algebraic laws for grouping and aggregation: δ(γ_{A, agg(B)}(R)) = γ_{A, agg(B)}(R), since grouping already eliminates duplicates; γ_{A, agg(B)}(δ(R)) = γ_{A, agg(B)}(R), if agg is duplicate insensitive (MIN and MAX are; SUM, COUNT, and AVG are not). The book describes additional algebraic laws, but even the book doesn’t cover them all.
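What “duplicate insensitive” means, shown by a tiny illustrative check (not from the slides): removing duplicates before aggregating leaves MIN and MAX unchanged but changes SUM, COUNT, and AVG.

vals = [3, 3, 7]          # a bag with a duplicate
dedup = set(vals)         # the same values after duplicate elimination
print(min(vals) == min(dedup), max(vals) == max(dedup))   # True True
print(sum(vals) == sum(dedup), len(vals) == len(dedup))   # False False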
Heuristics-based optimization, in one sentence: do projections and selections as early as possible.
Push selection predicates down the tree, to make intermediate results smaller. Example: π_pname(σ_{price > 100 AND city = “Urbana”}(Product ⋈_{maker = name} Company)) becomes π_pname(σ_{price > 100}(Product) ⋈_{maker = name} σ_{city = “Urbana”}(Company)). (But this may cause us to lose an important ordering of the tuples, if we use indexes.)
For each company with a product costing more than $100, find the max price of its products.
SELECT y.name, Max(x.price) FROM product x, company y WHERE x.maker = y.name GROUP BY y.name HAVING Max(x.price) > 100
can be rewritten, pushing the condition below the grouping, as
SELECT y.name, Max(x.price) FROM product x, company y WHERE x.maker = y.name AND x.price > 100 GROUP BY y.name HAVING Max(x.price) > 100
Advantage: the size of the join will be smaller. Requires transformation rules specific to the grouping/aggregation operators: won’t work if we replace MAX by MIN.
Sometimes we should push selections up… Bargain view V1: categories with some price < 20, and the cheapest such price.
Query: Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price
Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 Group By x.category
Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker = y.name
Sometimes we should push selections up… Since V1.p is a minimum of prices that are all < 20, we can safely add the condition to the outer query:
Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20
(V1 and V2 are defined as on the previous slide.)
… and then down. Because the join equates V1.p with V2.price (which is x.price), the added condition V1.p < 20 can be pushed into V2:
Select V2.name, V2.price From V1, V2 Where V1.category = V2.category and V1.p = V2.price AND V1.p < 20
Create View V1 AS Select x.category, Min(x.price) AS p From product x Where x.price < 20 Group By x.category
Create View V2 AS Select y.name, x.category, x.price From product x, company y Where x.maker = y.name AND x.price < 20
The simplest version of cost-based estimation: generate all equivalent plans, estimate their costs, pick the cheapest. This is too slow in practice, so only look at the plans likely to be the fastest.
Often: generate a partial plan, optimize it, then add another operator, … Top-down: the partial plan is a top fragment of the logical plan. Bottom-up: the partial plan is a bottom fragment of the logical plan.
We can use any of the search strategies that you learned about in your AI class. Branch-and-bound: • Remember the cheapest complete plan P seen so far and its cost C • Stop extending a partial plan once its cost is > C • If a cheaper complete plan is found, replace P and C. Hill climbing: • Remember only the cheapest partial plan seen so far. Dynamic programming: • Remember all the cheapest partial plans. (A small branch-and-bound sketch follows below.)
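A minimal branch-and-bound sketch over left-deep join orders, assuming a cost function that can only grow as a partial order is extended; the cost model and cardinalities below are the simplified ones used later in this lecture, plugged in purely for illustration.

def branch_and_bound(relations, cost):
    """Enumerate left-deep join orders depth-first, pruning any partial
       order whose cost already reaches the best complete cost seen."""
    best_plan, best_cost = None, float("inf")

    def extend(order, remaining):
        nonlocal best_plan, best_cost
        c = cost(order)
        if c >= best_cost:                 # prune: cannot beat the best plan P
            return
        if not remaining:                  # cheaper complete plan: replace P, C
            best_plan, best_cost = order, c
            return
        for r in sorted(remaining):        # branch: append one more relation
            extend(order + [r], remaining - {r})

    for r in relations:
        extend([r], set(relations) - {r})
    return best_plan, best_cost

# Simplified cost from the upcoming example: sum of intermediate result sizes,
# with size(A join B) estimated as 0.01 * size(A) * size(B).
cards = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}
def left_deep_cost(order, sel=0.01):
    total, size = 0, 0
    for i, r in enumerate(order):
        size = cards[r] if i == 0 else sel * size * cards[r]
        if 0 < i < len(order) - 1:         # count only intermediate (non-final) joins
            total += size
    return total

print(branch_and_bound(list(cards), left_deep_cost))   # -> (['R', 'U', 'T', 'S'], 620000.0)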
The order in which relations are joined has a huge impact on performance. Given a query R1 ⋈ … ⋈ Rn and a cost function cost( ), find the best join tree for the query. (Figure: an example join tree over R1, R2, R3, R4.) Plan = tree; partial plan = subtree.
Left-deep join tree: each join’s right input is a base relation (figure over R1–R5).
Right-deep join tree: each join’s left input is a base relation (figure over R1–R5).
Bushy join tree: intermediate results may appear on both inputs of a join (figure over R1–R5).
Dynamic programming is a good (bottom-up) way to choose a join ordering. Find the best plan for each subquery Q of {R1, …, Rn}: • {R1}, …, {Rn} • {R1, R2}, {R1, R3}, …, {Rn-1, Rn} • {R1, R2, R3}, {R1, R2, R4}, … • … • {R1, …, Rn} For each Q, record: • Size(Q) • A best plan Plan(Q) • Cost(Q)
The ith step of the dynamic program. For each Q ⊆ {R1, …, Rn} of size i do: • Compute Size(Q) (later…) • For every pair Q1, Q2 such that Q = Q1 ∪ Q2, compute cost(Plan(Q1) ⋈ Plan(Q2)) • Cost(Q) = the smallest such cost • Plan(Q) = the corresponding plan. (A sketch of the full dynamic program follows below.)
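A minimal Python sketch of this dynamic program (not the book's code); the size estimate is the simplified size(A ⋈ B) = 0.01 * size(A) * size(B) rule used in the example that follows, and base-relation scans are costed at 0, as on the next slide.

from itertools import combinations
from math import prod

def best_join_plan(cards, sel=0.01):
    """cards: dict relation-name -> number of tuples.
       Returns (Plan(Q), Cost(Q)) for Q = all relations, using
       Cost(Q) = min over Q = Q1 u Q2 of Cost(Q1)+Cost(Q2)+Size(Q1)+Size(Q2),
       where the Size of a base relation counts as 0."""
    rels = sorted(cards)
    size, cost, plan = {}, {}, {}
    for r in rels:                                    # subqueries {R1}, ..., {Rn}
        q = frozenset([r])
        size[q], cost[q], plan[q] = 0, 0, r           # scans are "free" in this model
    for i in range(2, len(rels) + 1):                 # the i-th step
        for combo in combinations(rels, i):
            q = frozenset(combo)
            # Size(Q) under this model does not depend on the join order:
            size[q] = sel ** (i - 1) * prod(cards[r] for r in combo)
            best = None
            for j in range(1, i):                     # every pair Q1, Q2 with Q = Q1 u Q2
                for left in combinations(combo, j):
                    q1, q2 = frozenset(left), q - frozenset(left)
                    c = cost[q1] + cost[q2] + size[q1] + size[q2]
                    if best is None or c < best[0]:
                        best = (c, (plan[q1], plan[q2]))
            cost[q], plan[q] = best
    full = frozenset(rels)
    return plan[full], cost[full]

# The R, S, T, U example from the following slides:
print(best_join_plan({"R": 2000, "S": 5000, "T": 3000, "U": 1000}))
# -> ((('R', 'T'), ('S', 'U')), 110000.0): join R with T and S with U, then join the results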
Our next example (over)simplifies the cost model: the cost of a scan is 0, and Cost(P1 ⋈ P2) = Cost(P1) + Cost(P2) + the sizes of the intermediate inputs, where the size of an input counts as 0 if it is not a join (i.e., a base relation) and is estimated more realistically otherwise. So Cost(R5) = Cost(R7) = 0, Size(R5) = Size(R7) = 0 for costing purposes, and Cost(R5 ⋈ R7) = 0. Don’t do this at home.
R ⋈ S ⋈ T ⋈ U. Number of tuples: R = 2000, S = 5000, T = 3000, U = 1000. Size estimate: size(A ⋈ B) = 0.01 * size(A) * size(B). Unrealistic!
Join order options for R ⋈ S ⋈ T ⋈ U. Each cost below is Cost(group 1) + Cost(group 2) + Size(group 1) + Size(group 2), with base relations contributing 0: • Cost of (RST)U = 60K + 0 + 3M + 0 • Cost of (RSU)T = 20K + 0 + 1M + 0 • Cost of (RTU)S = 20K + 0 + .6M + 0 • Cost of (STU)R = 30K + 0 + 1.5M + 0 • Cost of (RS)(TU) = 0 + 0 + 100K + 30K • Cost of (RT)(SU) = 0 + 0 + 60K + 50K • Cost of (RU)(TS) = 0 + 0 + 20K + 150K
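A quick arithmetic check (illustrative only) of the numbers above, under the stated model: size(A ⋈ B) = 0.01 * size(A) * size(B), and each grouping pays the sizes of its intermediate results.

R, S, T, U = 2000, 5000, 3000, 1000
sel = 0.01
def j(a, b):                       # estimated size of a two-way join
    return sel * a * b

RS, RT, RU, ST, SU, TU = j(R, S), j(R, T), j(R, U), j(S, T), j(S, U), j(T, U)
costs = {
    "(RST)U": RT + j(RT, S),       # cheapest way to build {R,S,T} starts with R join T
    "(RSU)T": RU + j(RU, S),
    "(RTU)S": RU + j(RU, T),
    "(STU)R": TU + j(TU, S),
    "(RS)(TU)": RS + TU,
    "(RT)(SU)": RT + SU,
    "(RU)(TS)": RU + ST,
}
for name, c in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: {c:,.0f}")     # (RT)(SU) comes out cheapest at 110,000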
What if we don’t oversimplify? • Use more realistic size/cost estimations (next slides) • Use heuristics to reduce the search space: consider only left-deep (linear) trees, and no trees with cartesian products. Example: for R(A,B), S(B,C), T(C,D), the plan (R ⋈ T) ⋈ S has a cartesian product, because R and T share no attributes.
Summary of the query optimization process so far (the discussion in the book is very nice): • Parse your query into tree form • Move selections as far down the tree as you can • Project out unwanted attributes as early as you can, when you have their tuples in memory anyway • Pick a good join order, based on the expected size of intermediate results • Pick an implementation for each operation in the tree
Completing the Physical Query Plan. • Choose an algorithm to implement each operator; this needs to consider more than I/O cost: how much memory do we have? are the input operand(s) sorted? • Decide, for each intermediate result, whether to materialize it or pipeline it.
One option is to materialize intermediate results between operators. Plan: ((R ⋈ S) ⋈ T) ⋈ U, with materialized intermediates V1 = R ⋈ S and V2 = V1 ⋈ T.
HashTable ← S; repeat { read(R, x); y ← join(HashTable, x); write(V1, y) }
HashTable ← T; repeat { read(V1, y); z ← join(HashTable, y); write(V2, z) }
HashTable ← U; repeat { read(V2, z); u ← join(HashTable, z); write(Answer, u) }
Cost = ? Memory = ?
The second option is to pipeline between operators.
HashTable1 ← S; HashTable2 ← T; HashTable3 ← U
repeat { read(R, x); y ← join(HashTable1, x); z ← join(HashTable2, y); u ← join(HashTable3, z); write(Answer, u) }
Cost = ? Memory = ?
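A small Python sketch of the pipelined option (generators standing in for iterators; the join attributes a, b, c and the tiny relations are made up for illustration): no intermediate result V1 or V2 is ever written out, but the hash tables on S, T, and U must all fit in memory at once.

def hash_join(probe_rows, build_rows, probe_key, build_key):
    table = {}
    for r in build_rows:                      # build side: held in memory
        table.setdefault(r[build_key], []).append(r)
    for l in probe_rows:                      # probe side: streamed one tuple at a time
        for r in table.get(l[probe_key], []):
            yield {**l, **r}

def pipelined(R, S, T, U):
    """((R join S) join T) join U, with every join's output fed
       directly into the next join instead of being materialized."""
    v1 = hash_join(R, S, "a", "a")
    v2 = hash_join(v1, T, "b", "b")
    return hash_join(v2, U, "c", "c")

# Hypothetical toy data:
R = [{"a": 1}, {"a": 2}]
S = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
T = [{"b": 10, "c": 100}]
U = [{"c": 100, "d": "done"}]
for row in pipelined(R, S, T, U):
    print(row)                                # {'a': 1, 'b': 10, 'c': 100, 'd': 'done'}

Replacing the generators with explicit writes to temporary lists (or files) gives the materialized option above, trading memory for extra I/O on V1 and V2.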
Example 16.36. Logical plan: (R(w,x) ⋈ S(x,y)) ⋈ U(y,z). B(R) = 5,000 blocks, B(S) = 10,000 blocks, B(U) = 10,000 blocks; the intermediate result R ⋈ S occupies k blocks; main memory M = 101 blocks of space.
Example 16.36, naive evaluation: two partitioned hash-joins, materialized (make sure the buckets fit in memory!). Cost = 3B(R) + 3B(S) + 4k + 3B(U) = 75,000 + 4k.
Example 16.36, smarter: • Step 1: hash R on x into 100 buckets, each of 50 blocks; write them to disk • Step 2: hash S on x into 100 buckets; write them to disk • Step 3: read each R bucket into memory (50 buffers at a time), join it with the matching S bucket (1 buffer at a time), and hash the result on y into 50 buckets (50 buffers) -- here we pipeline. Cost so far: 3B(R) + 3B(S).
Example 16.36, continuing: • How large are the 50 buckets on y? k/50 blocks each. • If k <= 50, then keep all 50 buckets from Step 3 in memory, and then: • Step 4: read U from disk, hash it on y, and join in memory. • Total cost: 3B(R) + 3B(S) + B(U) = 55,000.
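A tiny arithmetic check (illustrative only) of the two costs as a function of k = B(R ⋈ S), using the block counts from the example.

B_R, B_S, B_U = 5_000, 10_000, 10_000

def naive_cost(k):
    # two materialized, partitioned hash-joins
    return 3 * B_R + 3 * B_S + 4 * k + 3 * B_U        # = 75,000 + 4k

def smarter_cost(k):
    # keep the 50 buckets on y in memory; only works while they fit
    assert k <= 50, "the 50 buckets on y no longer fit in memory"
    return 3 * B_R + 3 * B_S + B_U                    # = 55,000

for k in (10, 50):
    print(k, naive_cost(k), smarter_cost(k))          # e.g. k = 50: 75,200 vs 55,000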