160 likes | 475 Views
24: Merge Join and Sorting. CS347. 2. Three Primary Classes of Join Algorithm . Block-Nested Loops (vs. Simple Nested Loops. ) Merge-Join Hash-Joins. 24: Merge Join and Sorting. CS347. 3. Simple Nested Loops. R join SFor each r ? RFor each s ? Sif pred(r,s) then output result// ig
E N D
1. 24: Merge Join and Sorting CS347 1 Merge Join and Sorting
Objectives:
Detailed relationship between disk and memory/buffer allocation.
External two phase sort.
2. 24: Merge Join and Sorting CS347 2 Three Primary Classes of Join Algorithm Block-Nested Loops (vs. Simple Nested Loops. )
Merge-Join
Hash-Joins
3. 24: Merge Join and Sorting CS347 3 Simple Nested Loops R join S
For each r ? R
For each s ? S
if pred(r,s) then output result
// ignores that data is on disk, read in blocks
4. 24: Merge Join and Sorting CS347 4 Block-Nest Loops - Assumptions Dedicate Most Memory to Outer-Loop
Consider that there are many tuples per page.
5. 24: Merge Join and Sorting CS347 5 Block Nested-Loops Detailed Join(relation R, relation S, buffer_space M){ // M in units of disk pages
Until done( table_scan(S,M-1)){ // Read M-1 pages of R
if equijoin then build index on portion of S in memory
Open(R);
Until done(table_scan(R, 1){
for each tuple r of R in memory{
“find joining tuples of S” // if equijoin use index
“output them” // if thetajoin another loop
}}}}
6. 24: Merge Join and Sorting CS347 6 What to Notice: Outer-loop iterates over “inner” relation
If equi-join, loop structure nested 3 deep
If theta-join, loop structure nested 4 deep
Call structure
not parameterized by theta/equi
theta predicate not passed in
7. 24: Merge Join and Sorting CS347 7 Number of I/O’s (cost) of Block-Nested Loops Run through the complete inner relation, S, for each chunk of R
If S fits in M-1 buffers ? B(R) + B(S)
If S does not fit
? B(S)/M-1 iteration
I/O Cost = (B(S)/M-1) * (M-1 + B(R))
8. 24: Merge Join and Sorting CS347 8 What About Writing the Result Book convention –
no cost in the operator for the result
in some cases, results are immediate input to another operator.
no I/O cost for output
introduce an explicit new operator
“Save intermediate result to disk”
9. 24: Merge Join and Sorting CS347 9 Merge-Join First we will consider equijoins on primary index keys.
create table customers (cid … primary key
… )
create table orders (oid …,
cid … primary key,
…)
10. 24: Merge Join and Sorting CS347 10 Merge-Join Pseudo CodeAssume block structure is buried in iterator abstraction Merge-Join(relation R, relation S, attr rjk, attr sjk ){
r = open(R, 1) // read a block of R
s = open(S, 1) // read a block of S
until( r = EOF or s = EOF) {
if (r.rjk = s.rjk) {output(r,s), r = next(r)}
if (r.rjk < s.rjk) {r = next(r)}
if (r.rjk > s.rjk) {s = next(s)}
}}
11. 24: Merge Join and Sorting CS347 11 Cost of Merge-Join
B(R) + B(S)
12. 24: Merge Join and Sorting CS347 12 What if an equijoin, but not on index key? Is there a secondary index on the join argument?
Do I want to use the secondary index?
13. 24: Merge Join and Sorting CS347 13 Two Phase External Sorting For all intents and purposes, sorting a file, f, bigger than available memory requires
2 * (| f | / B) I/O reads + 2 * (| f | / B) I/O writes
where | f | is the size of the file in bytes,
B is the number of bytes in a block
common notation: | f | / B = || f ||
14. 24: Merge Join and Sorting CS347 14 Merge-Sort (divide and conquer)that considers external storage Phase 1 == Divide
until done
read file until memory full
sort that portion
write result in its own file
Phase 2 == Merge
read 1 block of each file
until done
produce sorted block
write to disk
if a file’s block is exhaused, read another block
How large a file can we sort?
15. 24: Merge Join and Sorting CS347 15 Merge-Sort Cost – No Index 5(B(R) + B(S))
3x to sort
1x (did have to write the sorted results out)
1x for merge sort
May feel high – but it is very predictable
16. 24: Merge Join and Sorting CS347 16 Many Opportunities to be Clever what if 1 relation fits in memory?
sort is 1*B(R)
integrate merge-join with second phase of sort