Quick Review of Apr 22 material
• Sections 13.1 through 13.3 in text
• Query Processing: take an SQL query and:
  • parse/translate it into an internal representation
  • optimize it (choose an efficient form for the query)
  • evaluate it
• Metadata for query processing
• Operations (and their costs):
  • selection for equality (one particular value)
  • range selection
  • projection
  • all with and without indices
  • complex selections
Sorting?
• What has sorting to do with query processing?
  • SQL queries can specify that the output be sorted
  • several relational operations (such as joins) can be implemented very efficiently if the input data is sorted first
  • as a result, query processing is often concerned with sorting temporary (intermediate) and final results
• creating a secondary index on the active relation ("logical" sorting) isn't sufficient: sequential scans through the data via a secondary index are very inefficient. We often need to sort the data physically into order
Sorting
• We differentiate two types of sorting:
  • internal sorting: the entire relation fits in memory
  • external sorting: the relation is too large to fit in memory
• Internal sorting can use any of a large range of well-established sorting algorithms (e.g., Quicksort)
• In databases, the most commonly used method for external sorting is the sort-merge algorithm (based upon Mergesort)
Sort-merge Algorithm
• create-runs phase:
  • load M consecutive blocks of the relation into memory (M is the number of blocks that will fit easily in main memory)
  • use some internal sorting algorithm to sort the tuples in those M blocks
  • write the sorted run to disk
  • continue with the next M blocks, and so on, until finished
• merge-runs phase (assuming the number of runs, N, is less than M):
  • load the first block of each run into memory
  • take the lowest-valued tuple among all the runs and write it to an output buffer page
  • when the last tuple of a block has been consumed, read the next block from that run
  • when the output buffer page is full, write it to disk and start a new one
  • continue until all buffer pages are empty
Sort-merge Algorithm (2)
• merge-runs phase when N > M:
  • operate on M runs at a time, creating runs of length M², and continue in multiple passes of the merge operation
• Cost of sorting, where b(r) is the number of blocks occupied by relation r:
  • the runs phase does one read and one write on each block of r: cost 2b(r)
  • total number of runs: N = b(r)/M
  • number of passes in the merge operation: 1 if N < M; otherwise log_M-1(b(r)/M)
  • during each pass in the merge phase we read the whole relation and write it all out again: cost 2b(r) per pass
  • total cost (runs phase plus merge passes) is therefore 2b(r)(log_M-1(b(r)/M) + 1)
  • if only one merge pass is required (N < M), the cost is 4b(r)
  • if M >= b(r) then there is only one run (internal sorting) and the cost is b(r)
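The cost formula above can be turned into a small calculator. This is a sketch of the slide's formula only (block I/Os, with ceilings added where the slide implicitly rounds up); the function name is mine.

```python
import math

def sort_cost(b_r, M):
    """Estimated block I/Os for external sort-merge of a relation of b_r
    blocks with M memory blocks, following the slide's cost formula."""
    if M >= b_r:
        return b_r                       # one run: internal sort, read once
    n_runs = math.ceil(b_r / M)          # N = b(r)/M
    if n_runs < M:
        passes = 1                       # a single merge pass suffices
    else:
        passes = math.ceil(math.log(b_r / M, M - 1))
    # runs phase (2 b(r)) plus 2 b(r) per merge pass
    return 2 * b_r * (passes + 1)
```

For example, b(r) = 1000 blocks with M = 100 buffers gives 10 runs, one merge pass, and a total of 4b(r) = 4000 block I/Os, matching the one-pass case on the slide.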
Join Operation
• Join is perhaps the most important operation combining two relations
• Algorithms that compute the join efficiently are crucial to the optimization phase of query processing
• We will examine a number of algorithms for computing joins
• An important metric for estimating efficiency is the size of the result: as mentioned last class, the best strategy for complex (multi-relation) queries is to cut down the size of the intermediate results as quickly as possible
Join Operation: Size Estimation
Let A = R ∩ S be the common (joining) attribute.
• 0 <= size <= n(r)*n(s) (between 0 and the size of the cartesian product)
• If A is a key of R, then size <= n(s)
• If A is a key of R and a foreign key of S, then size = n(s)
• If A is not a key, then each value of A appears roughly n(s)/V(A,s) times in S, so the n(r) tuples of R produce: size <= n(r)*n(s)/V(A,s)
  • symmetrically, size <= n(s)*n(r)/V(A,r)
  • if the two values are different we use: size <= min{ n(r)*n(s)/V(A,s), n(s)*n(r)/V(A,r) }
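The case analysis above can be written as one small estimator. A sketch under the slide's assumptions; the function and flag names are mine, and V(A,r)/V(A,s) are the numbers of distinct A-values in each relation.

```python
def join_size_estimate(n_r, n_s, V_A_r=None, V_A_s=None,
                       a_is_key_of_r=False, a_is_fkey_of_s=False):
    """Upper bound on the size of R join S on common attribute A,
    following the slide's rules."""
    if a_is_key_of_r and a_is_fkey_of_s:
        return n_s                  # every S tuple matches exactly one R tuple
    if a_is_key_of_r:
        return n_s                  # at most one match per S tuple
    # non-key case: take the smaller of the two symmetric estimates
    return min(n_r * n_s // V_A_s, n_s * n_r // V_A_r)
```

For n(r) = 1000, n(s) = 500, V(A,r) = 100, V(A,s) = 50, the two symmetric estimates are 10000 and 5000, so the estimator returns 5000.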
Join Methods: Nested Loop
• Simplest algorithm to compute a join: nested for loops
• requires no indices
• tuple-oriented:
    for each tuple t1 in r do begin
      for each tuple t2 in s do begin
        join t1, t2 and append the result to the output
• block-oriented:
    for each block b1 in r do begin
      for each block b2 in s do begin
        join b1, b2 and append the result to the output
• reverse inner loop:
  • as above, but we alternate counting up and down in the inner loop. Why?
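The block-oriented variant above might look as follows in Python. A minimal sketch assuming blocks are lists of dict "tuples" and the join is an equijoin; the function and parameter names are mine.

```python
def block_nested_loop_join(r_blocks, s_blocks, attr_r, attr_s):
    """Block-oriented nested-loop equijoin: for each block of the outer
    relation R, rescan every block of the inner relation S."""
    out = []
    for b1 in r_blocks:                  # outer relation: each block read once
        for b2 in s_blocks:              # inner relation: rescanned per outer block
            for t1 in b1:                # join the two in-memory blocks
                for t2 in b2:
                    if t1[attr_r] == t2[attr_s]:
                        out.append({**t1, **t2})
    return out
```

Note how the two inner loops run entirely in memory: the I/O pattern is determined only by the two outer loops over blocks, which is why the block-oriented form is so much cheaper than the tuple-oriented one.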
Cost of Nested Loop Join
• Cost depends upon the number of buffers and the replacement strategy
• pin 1 block from the outer relation, k for the inner, with LRU:
  • cost: b(r) + b(r)*b(s) (assuming b(s) > k)
• pin 1 block from the outer relation, k for the inner, with MRU:
  • cost: b(r) + b(s) + (b(s)-(k-1))*(b(r)-1) = b(r)(2-k) + k - 1 + b(r)*b(s) (assuming b(s) > k)
• pin k blocks from the outer relation, 1 for the inner:
  • read k blocks from the outer (cost k)
  • for each block of s, join it with the k outer blocks (cost b(s))
  • repeat with the next k blocks of r until done (repeated b(r)/k times)
  • cost: (k + b(s)) * b(r)/k = b(r) + b(r)*b(s)/k
• which relation should be the outer one?
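The last (and cheapest) variant is easy to compute. A sketch of the slide's formula only, with a ceiling on the number of outer groups; the function name is mine.

```python
import math

def nested_loop_cost(b_r, b_s, k):
    """Block I/Os for nested-loop join pinning k buffer blocks for the
    outer relation r and 1 for the inner relation s:
    b(r) + b(s) * ceil(b(r)/k)."""
    return b_r + b_s * math.ceil(b_r / k)
```

Trying both orderings answers the slide's closing question: with b(r) = 100, b(s) = 50, k = 10, putting r outside costs 100 + 50*10 = 600, while putting s outside costs 50 + 100*5 = 550, so the smaller relation should be the outer one.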
Join Methods: Sort-Merge Join
• Two phases:
  • sort both relations on the join attributes
  • merge the sorted relations
    sort R on the joining attribute
    sort S on the joining attribute
    merge(sorted-R, sorted-S)
• cost with k buffers:
  • b(r)(2 log_k-1(b(r)/k) + 1) to sort R
  • b(s)(2 log_k-1(b(s)/k) + 1) to sort S
  • b(r) + b(s) to merge
  • total: b(r)(2 log_k-1(b(r)/k) + 1) + b(s)(2 log_k-1(b(s)/k) + 1) + b(r) + b(s)
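The merge phase deserves a sketch of its own, because equal join values on both sides must be joined cross-wise. This assumes both inputs are already sorted lists of dict "tuples"; the function and parameter names are mine.

```python
def merge_join(r_sorted, s_sorted, attr_r, attr_s):
    """Merge phase of sort-merge join. Both inputs must already be
    sorted on their join attribute; duplicates are handled on both sides."""
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        v_r, v_s = r_sorted[i][attr_r], s_sorted[j][attr_s]
        if v_r < v_s:
            i += 1                       # advance the side with the smaller value
        elif v_r > v_s:
            j += 1
        else:
            # find the group of equal values on each side...
            i2 = i
            while i2 < len(r_sorted) and r_sorted[i2][attr_r] == v_r:
                i2 += 1
            j2 = j
            while j2 < len(s_sorted) and s_sorted[j2][attr_s] == v_r:
                j2 += 1
            # ...and join the two groups cross-wise
            for t1 in r_sorted[i:i2]:
                for t2 in s_sorted[j:j2]:
                    out.append({**t1, **t2})
            i, j = i2, j2
    return out
```

Both inputs are scanned once, which is where the b(r) + b(s) merge cost in the formula above comes from.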
Join Methods: Hash Join
• Two phases:
  • hash both relations into hashed partitions
  • bucket-wise join: join tuples of the same partition only
    hash R on the joining attribute into H(R) buckets
    hash S on the joining attribute into H(S) buckets
    nested-loop join of corresponding buckets
• cost (assuming each pair of corresponding buckets fits in the buffers):
  • 2b(r) to hash R (read and write)
  • 2b(s) to hash S (as above)
  • b(r) + b(s) to join the buckets
  • total: 3(b(r) + b(s))
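Both phases can be sketched in a few lines. An in-memory toy, assuming relations are lists of dict "tuples": partitions live in dictionaries rather than on disk, and the per-bucket join builds a small hash index instead of a nested loop. Names are mine.

```python
from collections import defaultdict

def hash_join(r, s, attr_r, attr_s, n_buckets=8):
    """Partition both relations by hashing the join attribute, then join
    tuples of corresponding buckets only."""
    # partition phase: hash R and S into buckets on the joining attribute
    h_r, h_s = defaultdict(list), defaultdict(list)
    for t in r:
        h_r[hash(t[attr_r]) % n_buckets].append(t)
    for t in s:
        h_s[hash(t[attr_s]) % n_buckets].append(t)

    # bucket-wise join phase: only matching bucket pairs are compared
    out = []
    for b in range(n_buckets):
        index = defaultdict(list)        # in-memory index over the R bucket
        for t1 in h_r[b]:
            index[t1[attr_r]].append(t1)
        for t2 in h_s[b]:
            for t1 in index[t2[attr_s]]:
                out.append({**t1, **t2})
    return out
```

Tuples with different hash values can never join, which is why comparing only corresponding buckets is safe and gives the linear 3(b(r) + b(s)) cost.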
Join Methods: Indexed Join
• The inner relation has an index (clustering or not) on the joining attribute A:
    for each block b in R do begin
      for each tuple t in b do begin
        search the index on S with the value t.A of the joining attribute
        and join t with the resulting tuples of S
• cost = b(r) + n(r) * cost(select(S.A=c)), where cost(select(S.A=c)) is as described before for indexed selection
• What if R is sorted on A? (hint: use V(A,r) in the above)
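The loop above reduces to one index probe per outer tuple. A sketch assuming the index on S.A is modeled as a plain dict mapping each join value to its matching S tuples; names are mine.

```python
def indexed_join(r, s_index, attr_r):
    """Indexed nested-loop join: for each tuple of the outer relation R,
    probe an index on the inner relation's join attribute.
    s_index: join-attribute value -> list of matching S tuples."""
    out = []
    for t1 in r:
        # one index lookup per R tuple, in place of a scan of S
        for t2 in s_index.get(t1[attr_r], []):
            out.append({**t1, **t2})
    return out
```

If R is sorted on A, consecutive R tuples share A-values, so only V(A,r) distinct probes are needed rather than n(r), which is the point of the slide's hint.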
3-way Join
Suppose we want to compute R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• 1st method: pairwise.
  • first compute temp(A,B,C) = R(A,B) ⋈ S(B,C)
    • cost: b(r) + b(r)*b(s)
    • size of temp: b(temp) = (n(r)*n(s)/V(B,s)) / f(r+s), where f(r+s) is the blocking factor of the joined tuples
  • then compute the result temp(A,B,C) ⋈ T(C,D)
    • cost: b(t) + b(t)*b(temp)
• 2nd method: scan S and do simultaneous selections on R and T
  • cost = b(s) + b(s)*(b(r) + b(t))
  • if R and T are indexed we could do the selections through the indices:
    cost = b(s) + n(s)*[cost(select(R.B=S.B)) + cost(select(T.C=S.C))]
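The two plans can be compared numerically. A sketch of the two non-indexed cost formulas above; the function names are mine, and b_temp must be estimated separately via the size-estimation rules.

```python
def pairwise_cost(b_r, b_s, b_t, b_temp):
    """Nested-loop cost of (R join S) join T, given an estimate b_temp
    for the number of blocks of the intermediate result."""
    return (b_r + b_r * b_s) + (b_t + b_t * b_temp)

def simultaneous_cost(b_r, b_s, b_t):
    """Scan S once, probing all of R and T for each block of S."""
    return b_s + b_s * (b_r + b_t)
```

Which plan wins depends almost entirely on b_temp: for b(r) = 10, b(s) = 20, b(t) = 30 the pairwise plan costs 390 blocks when b_temp = 5 but the simultaneous plan costs 820, illustrating the earlier point that shrinking intermediate results quickly is what matters.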