270 likes | 279 Views
This article explores methods for optimizing the performance of SQL queries by resolving and optimizing the SQL query execution process. It discusses how SQL queries are translated into relational algebra expressions and how to explore alternative forms of a query using logical query plans. The article also covers the implementation of relational algebra operators and physical query plans, including table-scan, index-scan, and sort-scan. It provides insights into computational models and iterators for implementing physical operators.
E N D
Query Execution Optimizing Performance
Resolving an SQL query Since our SQL queries are very high level, the query processor must do a lot of additional processing to supply all of the missing details. In practice, an SQL query is translated internally into a relational algebra expression. One advantage of using relational algebra is that it makes alternative forms of a query easier to explore. The different algebraic expressions for a query are called logical query plans. We will focus first on the methods for executing the operations of the relational algebra. Then we will focus on how transform logical query plans.
Preview • Parsing: read SQL, output relational algebra tree • Query rewrite: Transform tree to a form, which is more efficient to evaluate • Physical plan generation: select implementation for each operator in tree, and for passing results up the tree. • In this chapter we will focus on the implementation for each operator.
Relational Algebra recap RA union, intersection and difference correspond to UNION, INTERSECT, and EXCEPT in SQL Selection corresponds to the WHERE-clause in SQL Projection corresponds to SELECT-clause Product corresponds to FROM-clause Join’s corresponds to JOIN, NATURAL JOIN, and OUTER JOIN in the SQL2 standard Duplicate elimination corresponds to DISTINCT in SELECT-clause Grouping corresponds to GROUP BY Sorting corresponds to ORDER BY
A graphical picture πtitle, birthdate σyear=1996 AND gender = “F” AND starName = name × MovieStar StarsIn SELECT title, birthdate FROM MoviewStar, StarsIN WHERE year=1996 AND gender='F' AND starName=name; A query plan is a decomposition of the original SQL into an expression tree. What is the initial plan? Is it the best? 5
A better plan πtitle, birthdate starName = name σgender = “F” σyear=1996 MovieStar StarsIn 6
Relational algebra for real SQL Keep in mind the following fact: A relation in algebra is a set, whilea relation in SQL is probably a bag In short, a bag allows duplicates. Not surprisingly, this can effect the cost of related operations.
Bag union, intersection, and difference • Card(t,R) means the number of occurrences of tuple t in relation R • Card(t, RS) = Card(t,R) + Card(t,S) • Card(t,RS) = min{Card(t,R), Card(t,S)} • Card(t,R–S) = max{Card(t,R)–Card(t,S), 0} • Example: R= {A,B,B}, S = {C,A,B,C} • R S = {A,A,B,B,B,C,C} • R S = {A,B} • R – S = {B}
Physical operators • Physical query plans are built from physical operators. • In most cases, the physical operators are direct implementations of the relational algebra operators. • However, there are several other physical operators for various supporting tasks. • Table-scan (the most basic operation we want to perform in a physical query plan) • Index-scan (E.g. if we have an index on some relation R we can retrieve the blocks of R by using the index) • Sort-scan (takes a relation R and a specification of the attributes on which the sort is to be made, and produces R in sorted order)
Computational Model • When comparing algorithms for the same operations we do not consider the cost of writing the output. • Because the cost of writing the output on the disk depends on the size of the result, not on the way the result was computed. In other words, it is the same for any computational alternative. • Also, we can often pipeline the result to other operators when the result is constructed in main memory. So a final output phase may not even be required.
Iterators for Implementation of Physical Operators • This is a group of three functions that allow a consumer of the result of a physical operation to get the result one tuple at a time. • An iterator consists of three parts: • Open: Initializes data structures. Doesn’t return tuples • GetNext: Returns next tuple & adjusts the data structures • Close: Cleans up afterwards • We assume these to be overloaded names of methods.
Iterator for tablescan operator Open(R) { b := the first block of R; t := the first first tuple of block b; Found := TRUE; } GetNext(R) { IF (t is past the last tuple on block b) { increment b to the next block; IF (there is no next block) { Found := FALSE; RETURN; } ELSE /*b is a new block*/ t := first tuple on block b; } oldt := t; /*Now we are ready to return t and increment*/ increment t to the next tuple of b; RETURN oldt; } Close(R) {}
Iterator for Bag Union of R and S Open(R,S) { R.open(); CurRel := R; } GetNext(R,S) { IF (CurRel = R) { t := R.GetNext(); IF(Found) /*R is not exhausted*/ RETURN t; ELSE /*R is exhausted*/ { S.Open(); CurRel := S; } } /*Here we read from S*/ RETURN S.GetNext(); /*If s is exhausted Found will be set to FALSE by S.GetNext */ } Close(R,S) { R.Close(); S.Close() }
Iterator for sort-scan • In an iterator for sort-scan • Open has to do all of 2PMMS, except the merging • GetNext outputs the next tuple from the merging phase
Algorithms for implementing RA-operators • Classification of algorithms • Sort based methods • Hash based methods • Index based methods • Degree of difficultness of algorithms • One pass (when one relation can fit into main memory) • Two pass (when no relation can fit in main memory, but again the relations are not very extremely large) • Multi pass (when the relations are very extremely large) • Classification of operators • Tuple-at-a-time, unary operations(s, p) • Full-relation, unary operations (d, g) • Full-relation, binary operations (union, join,…)
Cost parameters In order to evaluate or estimate the cost of query resolution, we need to identify the relevant parameters. Typical cost parameters include: R = the relation on disk M = number of main memory buffers available (1buffer = 1block) B(R) = number of blocks of R T(R) = number of tuples of R V(R, a) = number of distinct values in column a of R V(R, L) = number of different tuples in R (where L is a list of attributes or columns) Simple cost estimate: Basic scan: B(R) 2PMMS: 3B(R) Recall that final output is not counted
One pass, tuple-at-a-time • Selection and projection • Cost = B(R) or T(R) (if the relation is not clustered) • Space requirement: M 1 block • Principle: • Read one block (or one tuple if the relation is not clustered) at a time • Filter in or out the tuples of this block.
One pass, unary full-relation operations • Duplicate elimination: for each tuple decide: • seen before: ignore • new: output • Principle: • It is the first time we have seen this tuple, in which case we copy it to the output. • We have seen the tuple before,in which case we must not output this tuple. • We need a Main Memory hash-table to be efficient. • Requirement:
One pass, unary full-relation operations Grouping: Accumulate the information on groups in main memory. • Need to keep one entry for each value of the grouping attributes, through a main memory search structure (hash table). • Then, we need for each group to keep an aggregated value (or values if the query asks for more than one aggregation). • For MIN/MAX we keep the min or max value seen so far for the group. • For COUNT aggregation we keep a counter which is incremented each time we encounter a tuple belonging to the group. • For SUM, we add the value if the tuple belongs to the group. • For AVG? MM requirement. • Typically, a (group) tuple will be smaller than a tuple of the input relation, • Typically, the group number will be smaller than the number of tuples in the input relation. This is their number: • How you would do an iterator for grouping?
One pass, binary operators • Requirement: min(B(R),B(S)) ≤M • Exception: bag union • Cost: B(R) + B(S) • Assume R is larger than S. How to perform the operations below: • Set union, set intersection, set difference • Bag intersection, bag difference • Cartesian product, natural join • All these operators require reading the smaller of the relations into main memory using there a search scheme (e.g. main memory hash table) for easy search and insertion.
Set Union • Let R and S be sets. • We read S into M-1 buffers of main memory. • All these tuples are also copied to the output. • We then read each block of R into the Mth buffer, one at a time. • For each tuple t of R we see if t is in S, and if not, we copy t to output.
Set Intersection • Let R and S be sets or bags. • The result will be set. • We read S into M-1 buffers of main memory. • We then read each block of R into the M-th buffer, one at a time. • For each tuple t of R we see if t is in S, and if so, we copy t to output. At the same time we delete t from S in Main Memory.
Set Difference • Let R and S be sets. • Since difference is not a commutative operator, we must distinguish between R-S and S-R. • Read S into M-1 buffers of main memory. • Then read each block of R into the Mth buffer, one at a time. • To compute R-S: • for each tuple t of R we see if t is not in S, and if so, we copy t to output. • To compute S-R: • for each tuple t of R we see if t is is in S, we delete t from S in such a case. At the end we output those tuples of S that remain.
Bag Intersection • Let R and S be bags. • Read S into M-1 buffers of main memory. • Also, associate with each tuple a count, which initially measures the number of times the tuple occurs in S. • Then read each block of R into the M-th buffer, one at a time. • For each tuple t of R we see if t is in S. If not we ignore it. • Otherwise, if the counter is more than zero, we output t and decrement the counter.
Bag Difference • We read S into M-1 buffers of main memory. • Also, we associate with each tuple a count, which initially measures the number of times the tuple occur in S. • We then read each block of R into the M-th buffer, one at a time. • To compute S-R: • for each tuple t of R we see if t is is in S, we decrement its counter. • At the end we output those tuples of S that remain with counter positive. • To compute R-S: • we may think of the counter c for tuple t as having c reasons to not output t. • Now, when we process a tuple of R we check to see if that tuple appears in S. If not we output t. • Otherwise, we check to see the counter c of t. If it is 0 we output t. • If not, we don’t output t, and we decrement c.
Product • We read S into M-1 buffers of main memory. No special structure is needed. • We then read each block of R into the M-th buffer, one at a time. And combine each tuple with all the tuples of S.
Natural Join • We read S into M-1 buffers of main memory and build a search structure where the search key is the shared attributes of R and S. • We then read each block of R into the M-th buffer, one at a time. For each tuple t of R we see if t is in S, and if so, we copy t to output.