Quick Review of Apr 22 material
• Sections 13.1 through 13.3 in text
• Query Processing: take an SQL query and:
  • parse/translate it into an internal representation
  • optimize it (choose an efficient form for the query)
  • evaluate it
• Metadata for query processing
• Operations (and their costs):
  • selection for equality (one particular value)
  • range selection
  • projection
  • all with and without indices
  • complex selections
Sorting?
• What has sorting to do with query processing?
  • SQL queries can specify that the output be sorted
  • several relational operations (such as joins) can be implemented very efficiently if the input data is sorted first
  • as a result, query processing is often concerned with sorting temporary (intermediate) and final results
• creating a secondary index on the active relation ("logical" sorting) isn't sufficient: sequential scans through the data via a secondary index are very inefficient. We often need to sort the data physically into order
Sorting
• We differentiate two types of sorting:
  • internal sorting: the entire relation fits in memory
  • external sorting: the relation is too large to fit in memory
• Internal sorting can use any of a large range of well-established sorting algorithms (e.g., Quicksort)
• In databases, the most commonly used method for external sorting is the sort-merge algorithm (based upon Mergesort)
Sort-merge Algorithm
• create-runs phase:
  • load M consecutive blocks of the relation into memory (M is the number of blocks that will fit easily in main memory)
  • use some internal sorting algorithm to sort the tuples in those M blocks
  • write the sorted run to disk
  • continue with the next M blocks, and so on, until finished
• merge-runs phase (assuming the number of runs, N, is less than M):
  • load the first block of each run into memory
  • take the lowest-valued tuple among all the runs and write it to an output buffer page
  • when the last tuple of a block has been consumed, read the next block from that run
  • when the output buffer page is full, write it to disk and start a new one
  • continue until all buffer pages are empty
Sort-merge Algorithm (2)
• merge-runs phase when N > M:
  • operate on M runs at a time, creating runs of length M², and continue in multiple passes of the merge operation
• Cost of sorting, where b(r) is the number of blocks occupied by relation r:
  • the runs phase does one read and one write on each block of r: cost 2b(r)
  • total number of runs: N = b(r)/M
  • number of passes in the merge operation: 1 if N < M; otherwise log_M-1(b(r)/M)
  • during each pass in the merge phase we read the whole relation and write it all out again: cost 2b(r) per pass
  • total cost (runs phase plus merge passes) is therefore 2b(r)(log_M-1(b(r)/M) + 1)
  • if only one merge pass is required (N < M), the cost is 4b(r)
  • if M >= b(r) then there is only one run (internal sorting) and the cost is b(r)
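The cost formula above can be turned into a small calculator. This is a sketch of the slide's formula only (block I/Os, with ceilings added where the slide implicitly rounds up); the function name is mine.

```python
import math

def sort_cost(b_r, M):
    """Estimated block I/Os for external sort-merge of a relation of b_r
    blocks with M memory blocks, following the slide's cost formula."""
    if M >= b_r:
        return b_r                       # one run: internal sort, read once
    n_runs = math.ceil(b_r / M)          # N = b(r)/M
    if n_runs < M:
        passes = 1                       # a single merge pass suffices
    else:
        passes = math.ceil(math.log(b_r / M, M - 1))
    # runs phase (2 b(r)) plus 2 b(r) per merge pass
    return 2 * b_r * (passes + 1)
```

For example, b(r) = 1000 blocks with M = 100 buffers gives 10 runs, one merge pass, and a total of 4b(r) = 4000 block I/Os, matching the one-pass case on the slide.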
Join Operation
• Join is perhaps the most important operation combining two relations
• Algorithms that compute the join efficiently are crucial to the optimization phase of query processing
• We will examine a number of algorithms for computing joins
• An important metric for estimating efficiency is the size of the result: as mentioned last class, the best strategy for complex (multi-relation) queries is to cut down the size of the intermediate results as quickly as possible
Join Operation: Size Estimation
Let A = R ∩ S be the common (joining) attribute.
• 0 <= size <= n(r)*n(s) (between 0 and the size of the cartesian product)
• If A is a key of R, then size <= n(s)
• If A is a key of R and a foreign key of S, then size = n(s)
• If A is not a key, then each value of A appears roughly n(s)/V(A,s) times in S, so the n(r) tuples of R produce: size <= n(r)*n(s)/V(A,s)
  • symmetrically, size <= n(s)*n(r)/V(A,r)
  • if the two values are different we use: size <= min{ n(r)*n(s)/V(A,s), n(s)*n(r)/V(A,r) }
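The case analysis above can be written as one small estimator. A sketch under the slide's assumptions; the function and flag names are mine, and V(A,r)/V(A,s) are the numbers of distinct A-values in each relation.

```python
def join_size_estimate(n_r, n_s, V_A_r=None, V_A_s=None,
                       a_is_key_of_r=False, a_is_fkey_of_s=False):
    """Upper bound on the size of R join S on common attribute A,
    following the slide's rules."""
    if a_is_key_of_r and a_is_fkey_of_s:
        return n_s                  # every S tuple matches exactly one R tuple
    if a_is_key_of_r:
        return n_s                  # at most one match per S tuple
    # non-key case: take the smaller of the two symmetric estimates
    return min(n_r * n_s // V_A_s, n_s * n_r // V_A_r)
```

For n(r) = 1000, n(s) = 500, V(A,r) = 100, V(A,s) = 50, the two symmetric estimates are 10000 and 5000, so the estimator returns 5000.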
Join Methods: Nested Loop
• Simplest algorithm to compute a join: nested for loops
• requires no indices
• tuple-oriented:
    for each tuple t1 in r do begin
      for each tuple t2 in s do begin
        join t1, t2 and append the result to the output
• block-oriented:
    for each block b1 in r do begin
      for each block b2 in s do begin
        join b1, b2 and append the result to the output
• reverse inner loop:
  • as above, but we alternate counting up and down in the inner loop. Why?
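The block-oriented variant above might look as follows in Python. A minimal sketch assuming blocks are lists of dict "tuples" and the join is an equijoin; the function and parameter names are mine.

```python
def block_nested_loop_join(r_blocks, s_blocks, attr_r, attr_s):
    """Block-oriented nested-loop equijoin: for each block of the outer
    relation R, rescan every block of the inner relation S."""
    out = []
    for b1 in r_blocks:                  # outer relation: each block read once
        for b2 in s_blocks:              # inner relation: rescanned per outer block
            for t1 in b1:                # join the two in-memory blocks
                for t2 in b2:
                    if t1[attr_r] == t2[attr_s]:
                        out.append({**t1, **t2})
    return out
```

Note how the two inner loops run entirely in memory: the I/O pattern is determined only by the two outer loops over blocks, which is why the block-oriented form is so much cheaper than the tuple-oriented one.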
Cost of Nested Loop Join
• Cost depends upon the number of buffers and the replacement strategy
• pin 1 block from the outer relation, k for the inner, with LRU:
  • cost: b(r) + b(r)*b(s) (assuming b(s) > k)
• pin 1 block from the outer relation, k for the inner, with MRU:
  • cost: b(r) + b(s) + (b(s)-(k-1))*(b(r)-1) = b(r)(2-k) + k - 1 + b(r)*b(s) (assuming b(s) > k)
• pin k blocks from the outer relation, 1 for the inner:
  • read k blocks from the outer (cost k)
  • for each block of s, join it with the k outer blocks (cost b(s))
  • repeat with the next k blocks of r until done (repeated b(r)/k times)
  • cost: (k + b(s)) * b(r)/k = b(r) + b(r)*b(s)/k
• which relation should be the outer one?
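The last (and cheapest) variant is easy to compute. A sketch of the slide's formula only, with a ceiling on the number of outer groups; the function name is mine.

```python
import math

def nested_loop_cost(b_r, b_s, k):
    """Block I/Os for nested-loop join pinning k buffer blocks for the
    outer relation r and 1 for the inner relation s:
    b(r) + b(s) * ceil(b(r)/k)."""
    return b_r + b_s * math.ceil(b_r / k)
```

Trying both orderings answers the slide's closing question: with b(r) = 100, b(s) = 50, k = 10, putting r outside costs 100 + 50*10 = 600, while putting s outside costs 50 + 100*5 = 550, so the smaller relation should be the outer one.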
Join Methods: Sort-Merge Join
• Two phases:
  • sort both relations on the join attributes
  • merge the sorted relations
    sort R on the joining attribute
    sort S on the joining attribute
    merge(sorted-R, sorted-S)
• cost with k buffers:
  • b(r)(2 log_k-1(b(r)/k) + 1) to sort R
  • b(s)(2 log_k-1(b(s)/k) + 1) to sort S
  • b(r) + b(s) to merge
  • total: b(r)(2 log_k-1(b(r)/k) + 1) + b(s)(2 log_k-1(b(s)/k) + 1) + b(r) + b(s)
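The merge phase deserves a sketch of its own, because equal join values on both sides must be joined cross-wise. This assumes both inputs are already sorted lists of dict "tuples"; the function and parameter names are mine.

```python
def merge_join(r_sorted, s_sorted, attr_r, attr_s):
    """Merge phase of sort-merge join. Both inputs must already be
    sorted on their join attribute; duplicates are handled on both sides."""
    out, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        v_r, v_s = r_sorted[i][attr_r], s_sorted[j][attr_s]
        if v_r < v_s:
            i += 1                       # advance the side with the smaller value
        elif v_r > v_s:
            j += 1
        else:
            # find the group of equal values on each side...
            i2 = i
            while i2 < len(r_sorted) and r_sorted[i2][attr_r] == v_r:
                i2 += 1
            j2 = j
            while j2 < len(s_sorted) and s_sorted[j2][attr_s] == v_r:
                j2 += 1
            # ...and join the two groups cross-wise
            for t1 in r_sorted[i:i2]:
                for t2 in s_sorted[j:j2]:
                    out.append({**t1, **t2})
            i, j = i2, j2
    return out
```

Both inputs are scanned once, which is where the b(r) + b(s) merge cost in the formula above comes from.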
Join Methods: Hash Join
• Two phases:
  • hash both relations into hashed partitions
  • bucket-wise join: join tuples of the same partition only
    hash R on the joining attribute into H(R) buckets
    hash S on the joining attribute into H(S) buckets
    nested-loop join of corresponding buckets
• cost (assuming each pair of corresponding buckets fits in the buffers):
  • 2b(r) to hash R (read and write)
  • 2b(s) to hash S (as above)
  • b(r) + b(s) to join the buckets
  • total: 3(b(r) + b(s))
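Both phases can be sketched in a few lines. An in-memory toy, assuming relations are lists of dict "tuples": partitions live in dictionaries rather than on disk, and the per-bucket join builds a small hash index instead of a nested loop. Names are mine.

```python
from collections import defaultdict

def hash_join(r, s, attr_r, attr_s, n_buckets=8):
    """Partition both relations by hashing the join attribute, then join
    tuples of corresponding buckets only."""
    # partition phase: hash R and S into buckets on the joining attribute
    h_r, h_s = defaultdict(list), defaultdict(list)
    for t in r:
        h_r[hash(t[attr_r]) % n_buckets].append(t)
    for t in s:
        h_s[hash(t[attr_s]) % n_buckets].append(t)

    # bucket-wise join phase: only matching bucket pairs are compared
    out = []
    for b in range(n_buckets):
        index = defaultdict(list)        # in-memory index over the R bucket
        for t1 in h_r[b]:
            index[t1[attr_r]].append(t1)
        for t2 in h_s[b]:
            for t1 in index[t2[attr_s]]:
                out.append({**t1, **t2})
    return out
```

Tuples with different hash values can never join, which is why comparing only corresponding buckets is safe and gives the linear 3(b(r) + b(s)) cost.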
Join Methods: Indexed Join
• The inner relation has an index (clustering or not) on the joining attribute A:
    for each block b in R do begin
      for each tuple t in b do begin
        search the index on S with the value t.A of the joining attribute
        and join t with the resulting tuples of S
• cost = b(r) + n(r) * cost(select(S.A=c)), where cost(select(S.A=c)) is as described before for indexed selection
• What if R is sorted on A? (hint: use V(A,r) in the above)
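The loop above reduces to one index probe per outer tuple. A sketch assuming the index on S.A is modeled as a plain dict mapping each join value to its matching S tuples; names are mine.

```python
def indexed_join(r, s_index, attr_r):
    """Indexed nested-loop join: for each tuple of the outer relation R,
    probe an index on the inner relation's join attribute.
    s_index: join-attribute value -> list of matching S tuples."""
    out = []
    for t1 in r:
        # one index lookup per R tuple, in place of a scan of S
        for t2 in s_index.get(t1[attr_r], []):
            out.append({**t1, **t2})
    return out
```

If R is sorted on A, consecutive R tuples share A-values, so only V(A,r) distinct probes are needed rather than n(r), which is the point of the slide's hint.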
3-way Join
Suppose we want to compute R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• 1st method: pairwise.
  • first compute temp(A,B,C) = R(A,B) ⋈ S(B,C)
    • cost: b(r) + b(r)*b(s)
    • size of temp: b(temp) = (n(r)*n(s)/V(B,s)) / f(r+s), where f(r+s) is the blocking factor of the joined tuples
  • then compute the result temp(A,B,C) ⋈ T(C,D)
    • cost: b(t) + b(t)*b(temp)
• 2nd method: scan S and do simultaneous selections on R and T
  • cost = b(s) + b(s)*(b(r) + b(t))
  • if R and T are indexed we could do the selections through the indices:
    cost = b(s) + n(s)*[cost(select(R.B=S.B)) + cost(select(T.C=S.C))]
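The two plans can be compared numerically. A sketch of the two non-indexed cost formulas above; the function names are mine, and b_temp must be estimated separately via the size-estimation rules.

```python
def pairwise_cost(b_r, b_s, b_t, b_temp):
    """Nested-loop cost of (R join S) join T, given an estimate b_temp
    for the number of blocks of the intermediate result."""
    return (b_r + b_r * b_s) + (b_t + b_t * b_temp)

def simultaneous_cost(b_r, b_s, b_t):
    """Scan S once, probing all of R and T for each block of S."""
    return b_s + b_s * (b_r + b_t)
```

Which plan wins depends almost entirely on b_temp: for b(r) = 10, b(s) = 20, b(t) = 30 the pairwise plan costs 390 blocks when b_temp = 5 but the simultaneous plan costs 820, illustrating the earlier point that shrinking intermediate results quickly is what matters.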