CS 440 Database Management Systems Query Optimization
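DBMS Architecture • [Figure: layered architecture diagram, with a legend marking components covered in today's lecture and in past lectures.] • Users, web forms, applications, and the DBA submit queries and transactions. • Query path: Query Parser, Query Rewriter, Query Optimizer, Query Executor. • Transaction path: Transaction Manager, Logging & Recovery, Lock Manager (with Lock Tables). • Lower layers: Files & Access Methods, Buffer Manager (Buffers in Main Memory), Storage Manager (Storage).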
Query Rewriting/Optimization • Parse SQL into a logical query plan. • More than one logical plan! • Pick the fastest -> optimization • Convert the logical query plan to a physical query plan. • More than one physical plan! • Pick the fastest -> optimization
Logical Query Plan • A tree whose nodes are relational algebra operators. • SQL: SELECT A1, …, An FROM R1, …, Rk WHERE C • Algebra: π_{A1,…,An}(σ_C(R1 ⋈ R2 ⋈ … ⋈ Rk)) • Plan tree (root to leaves): π_{A1,…,An}, then σ_C, then the joins over R1, …, Rk
Logical Query Plan • A tree whose nodes are extended relational algebra operators. • SQL: SELECT A1, …, An, Func FROM R1, …, Rk WHERE C GROUP BY Ai, …, Aj • Algebra: π_{A1,…,An}(γ_{Ai,…,Aj, Func}(σ_C(R1 ⋈ … ⋈ Rk))) • Plan tree (root to leaves): π_{A1,…,An}, then γ_{Ai,…,Aj, Func}, then σ_C, then the joins over R1, …, Rk
SQL -> Logical Query Plan • Straightforward for simple SQL queries. • Map select-from-where to projection-join-selection • Subquery in where clause • Remove it!
Example: Removing Subqueries • Schema: StarsIn(title, year, starName), Star(name, address, gender, birthdate) • With subquery: SELECT title FROM StarsIn WHERE starName IN (SELECT name FROM Star WHERE birthdate LIKE '%1960') • Without subquery: SELECT title FROM StarsIn, Star WHERE starName = name AND birthdate LIKE '%1960' • More examples in 16.3.2
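As a rough check of the rewrite above, the following sketch loads a few rows into an in-memory SQLite database and runs both forms of the query. The sample rows are invented for illustration, and the equivalence shown here assumes that name is a key of Star (otherwise the join form could return extra duplicates).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StarsIn(title TEXT, year INT, starName TEXT);
    CREATE TABLE Star(name TEXT PRIMARY KEY, address TEXT,
                      gender TEXT, birthdate TEXT);
""")

# Hypothetical sample data, not taken from the lecture.
conn.executemany("INSERT INTO Star VALUES (?, ?, ?, ?)", [
    ("Star A", "addr 1", "F", "04/12/1960"),
    ("Star B", "addr 2", "M", "09/30/1955"),
])
conn.executemany("INSERT INTO StarsIn VALUES (?, ?, ?)", [
    ("Movie X", 1990, "Star A"),
    ("Movie Y", 1992, "Star B"),
    ("Movie Z", 1995, "Star A"),
])

# Original form: subquery in the WHERE clause.
with_subquery = conn.execute("""
    SELECT title FROM StarsIn
    WHERE starName IN (SELECT name FROM Star WHERE birthdate LIKE '%1960')
""").fetchall()

# Rewritten form: subquery replaced by a join.
with_join = conn.execute("""
    SELECT title FROM StarsIn, Star
    WHERE starName = name AND birthdate LIKE '%1960'
""").fetchall()

print(sorted(with_subquery))
print(sorted(with_join))
```

Both queries return the same titles on this sample, which is what the rewrite relies on.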
Query Rewriting • Different logical plans with the same meaning • Same outputs for all possible databases • Schema: Star(name, address, gender, birthdate) • Plan 1 (generally faster): σ_{gender='male' OR name='waltz'}(Star) • Plan 2: σ_{gender='male'}(Star) ∪ σ_{name='waltz'}(Star)
Algebraic Laws • The ways that we can rewrite a logical plan. • Associative and commutative laws • R ⋈ S = S ⋈ R, R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T • R ∪ S = S ∪ R, R ∪ (S ∪ T) = (R ∪ S) ∪ T • R ∩ S = S ∩ R, R ∩ (S ∩ T) = (R ∩ S) ∩ T • Distributive laws • R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)
Algebraic Laws for Selection • σ_{C AND C'}(R) = σ_C(σ_{C'}(R)) = σ_C(R) ∩ σ_{C'}(R) • σ_{C OR C'}(R) = σ_C(R) ∪ σ_{C'}(R) • σ_C(R ⋈ S) = σ_C(R) ⋈ S • C includes only attributes of R • What if it includes attributes of R and S? • σ_C(R − S) = σ_C(R) − S • σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S) • σ_C(R ∩ S) = σ_C(R) ∩ S
Example: Algebraic Laws for Selection • R(A, B, C, D), S(E, F) • σ_{F=1}(R ⋈_{A=E} S) = ? • σ_{B=2 AND F=3}(R ⋈_{A=E} S) = ?
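One way to work through the two questions above is to apply the selection laws and check the result on a small instance. The sketch below assumes the answers R ⋈_{A=E} σ_{F=1}(S) and σ_{B=2}(R) ⋈_{A=E} σ_{F=3}(S); the sample tuples and the helper names join and select are made up for illustration.

```python
# Hypothetical sample instances of R(A, B, C, D) and S(E, F).
R = [(1, 2, 0, 0), (2, 2, 5, 5), (3, 7, 1, 1)]
S = [(1, 1), (1, 3), (2, 3), (3, 3)]

def join(r, s):
    """R joined with S on A = E (A is column 0 of R, E is column 0 of S)."""
    return [x + y for x in r for y in s if x[0] == y[0]]

def select(rows, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if pred(row)]

# Joined tuples have the layout (A, B, C, D, E, F), so F is column 5.
# Question 1: selection on F only mentions S, so it can be pushed to S.
lhs1 = select(join(R, S), lambda t: t[5] == 1)
rhs1 = join(R, select(S, lambda s: s[1] == 1))

# Question 2: split the AND, then push B=2 to R and F=3 to S.
lhs2 = select(join(R, S), lambda t: t[1] == 2 and t[5] == 3)
rhs2 = join(select(R, lambda r: r[1] == 2),
            select(S, lambda s: s[1] == 3))

print(sorted(lhs1) == sorted(rhs1), sorted(lhs2) == sorted(rhs2))
```

Agreement on one instance is only a sanity check, not a proof; the equalities themselves follow from the laws on the previous slide.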
Algebraic Laws for Projection • π_M(π_N(R)) = π_{M ∩ N}(R) • π_M(R ⋈ S) = π_N(π_P(R) ⋈ π_Q(S)) • N, P, Q are appropriate attribute sets derived from M (P and Q must also keep the join attributes) • Example: R(A, B, C, D), S(E, F) • π_{B,C,F}(R ⋈_{A=E} S) = π_?(π_?(R) ⋈_{A=E} π_?(S))
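A possible answer to the example above is π_{B,C,F}(π_{A,B,C}(R) ⋈_{A=E} π_{E,F}(S)): the inner projections keep the output attributes plus the join attributes A and E. The sketch below checks this on invented data under bag semantics; project and join are hypothetical helpers.

```python
from collections import Counter

# Hypothetical sample instances of R(A, B, C, D) and S(E, F).
R = [(1, 2, 0, 9), (1, 2, 0, 8), (2, 4, 5, 7)]
S = [(1, 6), (2, 6), (3, 6)]

def project(rows, cols):
    """Bag projection: keep the listed column positions, keep duplicates."""
    return [tuple(row[c] for c in cols) for row in rows]

def join(r, s):
    """Join on the first column of each input (A = E)."""
    return [x + y for x in r for y in s if x[0] == y[0]]

# Direct plan: join first; layout is (A, B, C, D, E, F), so B, C, F = 1, 2, 5.
direct = project(join(R, S), [1, 2, 5])

# Pushed-down plan: project R to (A, B, C) and S to (E, F) before joining;
# the joined layout is then (A, B, C, E, F), so B, C, F = 1, 2, 4.
pushed = project(join(project(R, [0, 1, 2]), project(S, [0, 1])), [1, 2, 4])

print(Counter(direct) == Counter(pushed))
```

Dropping D early shrinks the tuples flowing into the join, while the final projection at the top still removes A and E.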
Push Selection Down • Uses σ_C(R ⋈ S) = σ_C(R) ⋈ S • If the selection is executed first, fewer tuples will be processed afterwards. • Pushing the selection down may lose a desired ordering if we use indexes. • Before: π_title(σ_{gender='male' AND year=1950}(Star ⋈_{starName=name} StarsIn)) • After: π_title(σ_{gender='male'}(Star) ⋈_{starName=name} σ_{year=1950}(StarsIn))
Push Selection Up and Down • Schema: StarsIn(title, year, starName), Movie(title, year, genre, producer) • Original: π_starName(StarsIn ⋈_{Movie.title=StarsIn.title AND Movie.year=StarsIn.year} σ_{year=1940}(Movie)) • Push the selection up: π_starName(σ_{year=1940}(StarsIn ⋈_{title=title AND year=year} Movie)) • Push it down both branches: π_starName(σ_{year=1940}(StarsIn) ⋈_{title=title AND year=year} σ_{year=1940}(Movie))
Push Projection Down • Schema: StarsIn(title, year, starName), Movie(title, year, genre, producer) • Before: π_title(Movie ⋈_{StarsIn.title=Movie.title} StarsIn) • After: π_title(π_title(Movie) ⋈_{StarsIn.title=Movie.title} π_title(StarsIn)) • We keep the projection at the top. • Less effective than pushing down selection.
Push Projection Down • It is not always possible. • Example: π_title(σ_{starName='Waltz' AND producer='Tarantino' AND year=2012}(Movie ⋈_{StarsIn.title=Movie.title} StarsIn)) • Be careful when there are set operations or aggregation functions in the tree!
Push Duplicate Elimination Down • Before: δ(σ_{year=1940}(Movie ⋈_{StarsIn.title=Movie.title} StarsIn)) • After: σ_{year=1940}(δ(Movie) ⋈_{StarsIn.title=Movie.title} δ(StarsIn)) • It is not always possible, e.g. with aggregation functions. • Read the book for algebraic laws of duplicate elimination.
Remove Unnecessary Duplicate Elimination • Original: δ(σ_{year>1940}(π_{title,year}(StarsIn) ∪_set π_{title,year}(Movie))) • Push δ below the selection: σ_{year>1940}(δ(π_{title,year}(StarsIn) ∪_set π_{title,year}(Movie))) • Remove δ, since the set union already eliminates duplicates: σ_{year>1940}(π_{title,year}(StarsIn) ∪_set π_{title,year}(Movie))
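The three plans on this slide can be compared on a tiny instance. In the sketch below, Python sets stand in for the set union, the rows are invented, and project, set_union, delta, and select_after_1940 are hypothetical helper names.

```python
# Hypothetical sample rows: StarsIn(title, year, starName),
# Movie(title, year, genre, producer).
stars_in = [("A", 1939, "x"), ("B", 1950, "y"), ("B", 1950, "z")]
movie = [("B", 1950, "drama", "p"), ("C", 1960, "comedy", "q")]

def project(rows, cols):
    """Bag projection onto the listed column positions."""
    return [tuple(row[c] for c in cols) for row in rows]

def set_union(r, s):
    """Set union: the result has no duplicates by construction."""
    return sorted(set(r) | set(s))

def delta(rows):
    """Duplicate elimination."""
    return sorted(set(rows))

def select_after_1940(rows):
    """Selection year > 1940 (year is column 1 after the projection)."""
    return [t for t in rows if t[1] > 1940]

union = set_union(project(stars_in, [0, 1]), project(movie, [0, 1]))

plan1 = delta(select_after_1940(union))   # δ on top
plan2 = select_after_1940(delta(union))   # δ pushed below the selection
plan3 = select_after_1940(union)          # δ removed entirely

print(plan1 == plan2 == plan3)
```

Because the input to δ is already duplicate-free, dropping it changes nothing except the work done.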
Query Optimization • 16.5
Optimization Strategies • Optimal approach • Enumerate each possible plan • Measure its performance by running it • Pick the fastest one • Heuristics approach • Use some fixed heuristics • e.g. always nested loop joins • e.g. order relations from smallest to largest • Selinger/ System R optimization
Cost-based Optimization • Plan Space • What is the space of query plans? • Cost estimation • How to estimate cost of each plan, without executing it? • Search Algorithms • How to search the space, as guided by cost estimates
Query Plans • Order of the associative and commutative operators • Join, union, … • The algorithms for operators • Hash-join, merge-join, … • Implicit operators • Full-scan, index-scan, … • Communication between operators • Pipelining versus materializing • “Interesting order”
Join Order • Many join algorithms are asymmetric • Nested loop • S: outer relation, R: inner relation • S’s tuples are examined against R’s • Index-based join • S has an index • R’s tuples are looked up in S’s index • Hash join • Left relation: build relation • Right relation: probe relation
Join Order • Many possible query plans (join trees) for R ⋈ S ⋈ T ⋈ U: • Left deep: ((R ⋈ S) ⋈ T) ⋈ U • Bushy: (R ⋈ S) ⋈ (T ⋈ U) • Right deep: R ⋈ (S ⋈ (T ⋈ U))
Join Ordering Problem • Given: R1 ⋈ R2 ⋈ … ⋈ Rn • We have a cost function that computes the cost of every join tree. • Find the most efficient join tree for the query. • Dynamic programming: try to optimize partial plans instead of the complete plan.
Dynamic Programming • Find the best plan for each subset of {R1, …, Rn} • Use the information to compute the best plan for larger subsets • Step 1: best plans for {R1}, {R2}, …, {Rn} • Step 2: best plans for {R1,R2}, {R1,R3}, …, {Rn-1, Rn} • … • Step n: best plan for {R1, …, Rn} • A bottom-up strategy
Dynamic Programming • For each subset S ⊆ {R1, …, Rn} calculate: • The estimated size of S: Size(S) • Plan(S): A best plan for S • Cost(S): The cost of Plan(S) • The cost of a plan is the size of intermediate relations in the plan. • We can use other types of costs.
Dynamic Programming • Step 1: For each {Ri}: • Size({Ri}) = B(Ri) • Plan({Ri}) = Ri • Cost({Ri}) = 0 • Step 2: For each {Ri, Rj}: • Size({Ri, Rj}) = estimate of the size of the join (more later) • Plan({Ri, Rj}) = Ri ⋈ Rj or Rj ⋈ Ri • Cost({Ri, Rj}) = 0 (no intermediate relations)
Dynamic Programming • Step i: For each S ⊆ {R1, …, Rn} of cardinality i do: • Compute Size(S) • For every S1, S2 s.t. S = S1 ∪ S2: C = Cost(S1) + Cost(S2) + Size(S1) + Size(S2) • If Si is a base table, ignore its size. • Cost(S) = the smallest C • Plan(S) = the corresponding plan • Return Plan({R1, …, Rn})
Dynamic Programming: Example • Cost(R ⋈ S) = 0 (no intermediate results) • Cost((R ⋈ S) ⋈ T) = Cost(R ⋈ S) + Cost(T) + Size(R ⋈ S) [+ Size(T)] = Size(R ⋈ S)
Dynamic Programming: Example • Relations: R, S, T, U • Number of tuples: 2000, 5000, 3000, 1000 • A toy size estimation method: • Size (A ⋈ B) = 0.01*T(A)*T(B)
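The bottom-up procedure can be sketched in a few lines for this toy example. The sketch assumes the toy size rule is applied once per join (so a join of k relations has size 0.01^(k-1) times the product of the tuple counts), takes the cost of a plan to be the total size of its intermediate relations, and considers bushy as well as left-deep trees. The names size and best_join_plan are hypothetical.

```python
from itertools import combinations
from math import prod

# Tuple counts from the example above.
TUPLES = {"R": 2000, "S": 5000, "T": 3000, "U": 1000}

def size(rels, tuples=TUPLES):
    """Toy estimate: product of tuple counts, scaled by 0.01 per join."""
    # Integer arithmetic avoids floating-point noise in the comparison.
    return prod(tuples[r] for r in rels) // 100 ** (len(rels) - 1)

def best_join_plan(tuples=TUPLES):
    names = sorted(tuples)
    cost, plan = {}, {}
    for r in names:                        # Step 1: single relations
        cost[frozenset([r])] = 0
        plan[frozenset([r])] = r
    for k in range(2, len(names) + 1):     # Step k: subsets of cardinality k
        for combo in combinations(names, k):
            s = frozenset(combo)
            best = None
            for i in range(1, k // 2 + 1):
                for left in combinations(combo, i):
                    s1 = frozenset(left)
                    s2 = s - s1
                    c = cost[s1] + cost[s2]
                    # Base tables are not intermediate results: ignore them.
                    c += size(s1, tuples) if len(s1) > 1 else 0
                    c += size(s2, tuples) if len(s2) > 1 else 0
                    if best is None or c < best[0]:
                        best = (c, f"({plan[s1]} ⋈ {plan[s2]})")
            cost[s], plan[s] = best
    full = frozenset(names)
    return cost[full], plan[full]

best_cost, best_plan = best_join_plan()
print(best_cost, best_plan)
```

Under these assumptions the cheapest plan joins R with T and S with U first, then joins the two results, for a total intermediate size of 60,000 + 50,000 = 110,000 tuples.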
Dynamic Programming: Reducing Search Space • Exponential computation! • Consider only left-deep join trees • Ignore join trees with Cartesian products • The size of a Cartesian product is generally much larger than that of a (natural) join. • Example: R(A,B), S(B,C), U(C,D) • (R ⋈ U) ⋈ S has a Cartesian product; pick (R ⋈ S) ⋈ U instead
Cost Estimation • Estimating the cost of a physical plan • The cost of a physical operator depends mainly on the size of its inputs • Sort-merge join for R ⋈ S: 3 B(R) + 3 B(S) • In many queries the inputs of operators are intermediate results • Sort-merge join for (R ⋈ S) ⋈ T: 3 B(T) + 3 B(R ⋈ S) + 3 B(R) + 3 B(S)
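To make the formula concrete, the sketch below plugs in block counts and adds up the cost of the two sort-merge joins. All block counts, including the size of the intermediate result R ⋈ S, are invented for illustration; in practice the intermediate size has to be estimated.

```python
def sort_merge_cost(b_left, b_right):
    """I/O cost of a sort-merge join, using the 3 B(left) + 3 B(right) rule."""
    return 3 * b_left + 3 * b_right

# Hypothetical block counts; B(R ⋈ S) is an assumed estimate.
B = {"R": 1000, "S": 500, "T": 2000, "R⋈S": 300}

inner = sort_merge_cost(B["R"], B["S"])      # cost of R ⋈ S
outer = sort_merge_cost(B["R⋈S"], B["T"])    # cost of (R ⋈ S) ⋈ T
total = inner + outer

print(inner, outer, total)
```

The total depends directly on B(R ⋈ S), which is why size estimation for intermediate results matters.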
Size Estimation: Base Tables • Cost factors in query execution • B(R): Number of blocks in R • T(R): Number of tuples in R • V(R,A): Number of distinct values of attribute A in R
Size Estimation: Base Tables • B(R) = T(R) / block size (in tuples per block) • T(R): scan the table • Update T(R) • Whenever the DBA asks • Periodically
Size Estimation: Base Tables • V(R,A): too much information • Use histogram, …
Size Estimation: Intermediate Results • Estimating the size of a projection • Projection does not eliminate duplicates: T(π_A(R)) = T(R)
Size Estimation: Selection • Equality: E = σ_{A=a}(R) • T(E) ranges from 0 to T(R) − V(R,A) + 1 • Consider its mean: T(E) = T(R)/V(R,A) • Inequality: E = σ_{A<a}(R) • T(E) ranges from 0 to T(R) • Rule of thumb: T(E) = T(R)/3 • Not equal: E = σ_{A<>a}(R) • Rule of thumb: T(E) = T(R)
Size Estimation: Selection • AND in the condition: • R(A,B), E = σ_{A=1 AND B<10}(R) • Multiply T(R) by one factor per condition: • 1/V(R,A) for equality • 1/3 for inequality • 1 for not equal • T(R) = 10,000, V(R,A) = 50, T(E) = ? • Similar method for OR: • Read example 16.22
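The multiplication rule can be applied mechanically to the question above. The sketch below uses exact fractions; the selectivity helper and its kind labels are hypothetical names, and the factors are the rules of thumb from the slide.

```python
from fractions import Fraction

def selectivity(kind, distinct=None):
    """Rule-of-thumb selectivity factor for one condition."""
    if kind == "eq":      # A = a: 1 / V(R, A)
        return Fraction(1, distinct)
    if kind == "ineq":    # A < a: 1/3
        return Fraction(1, 3)
    if kind == "neq":     # A <> a: 1
        return Fraction(1)
    raise ValueError(f"unknown condition kind: {kind}")

t_r = 10_000   # T(R)
v_r_a = 50     # V(R, A)

# E = σ_{A=1 AND B<10}(R): one equality factor and one inequality factor.
estimate = t_r * selectivity("eq", v_r_a) * selectivity("ineq")

print(estimate, float(estimate))
```

This gives T(E) = 10,000 × (1/50) × (1/3) = 200/3, or roughly 67 tuples.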
Size Estimation: Join • Natural join, R ⋈_A S • Simple cases • The sets of A values in R and S are disjoint: T(R ⋈_A S) = 0 • A is a key in S and a foreign key in R: T(R ⋈_A S) = T(R)