1 / 32

Equality Join R X R.A=S.B S

Equality Join R X R.A=S.B S. Pr records per page. Ps records per page. M Pages . N Pages . :. :. Relation S. Relation R. Simple Nested Loops Join. Memory. For each tuple r in R do for each tuple s in S do if r.A == S.B then add <r, s> to result. 1 page for R.

galen
Download Presentation

Equality Join R X R.A=S.B S

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Equality Join R XR.A=S.B S Pr records per page Ps records per page M Pages N Pages : : Relation S Relation R

  2. Simple Nested Loops Join Memory For each tuple r in R do for each tuple s in S do if r.A == S.B then add <r, s> to result 1 page for R 1 page for S • Scan the outer relation R. • For each tuple r in R, scan the entire inner relation S. • Ignore CPU cost and cost for writing result to disk R has M pages with PR tuples per page. S has N pages with PS tuples per page. Nested loop join cost = M+M*PR*N. Cost to scan R. Output

  3. Exercise • What is the I/O cost for using a simple nested loops join? • Sailors: 1000 pages, 100 records/page, 50 bytes/record • Reserves: 500 pages, 80 records/page, 40 bytes/record • Ignore the cost of writing the results • Join selectivity factor: 0.1

  4. Exercise • What is the I/O cost for using a simple nested loops join? • Sailors: 1000 pages, 100 records/page, 50 bytes/record • Reserves: 500 pages, 80 records/page, 40 bytes/record • Ignore the cost of writing the results • Join selectivity factor: 0.1 Answer • If Sailors is the outer relation, • I/O cost = 1000 + 100,000*500 = 50,001,000 I/Os • If Reserves is the outer relation, • I/O cost = 500 + 40,000*1000 = 40,000,500 I/Os

  5. Block Nested Loops Join Memory • For each ri of B-2 pages of R do • For each sj of s in S do • if ri.a == sj.a then • add <ri, sj> to result B-2 pages for R R Bring bigger relation S one page at a time. 1 page for S Cost=M+ * N 1 page Number of blocks of R for each retrieval Output buffer B: Available memory in pages. Optimal if one of the relation can fit in the memory (M=B-2). Cost=M+N

  6. Exercise • What is the I/O cost for using a block nested loops join? • Sailors: 1000 pages, 100 records/page, 50 bytes/record • Reserves: 500 pages, 80 records/page, 40 bytes/record • Ignore the cost of writing the results • Memory available: 102 Blocks

  7. Exercise • What is the I/O cost for using a block nested loops join? • Sailors: 1000 pages, 100 records/page, 50 bytes/record • Reserves: 500 pages, 80 records/page, 40 bytes/record • Ignore the cost of writing the results • Memory available: 102 Blocks • Answer • Sailors is the outer relation • #times to bring in the entire Reserves • I/O cost = 500 + 1000*5=5500

  8. Indexed Nested Loops Join for each tuple r in R do for each tuple s in S do if ri.A == sj.B then add <r, s> to result Use index on the joining attribute of S. Cost=M+M*PR*(Cost of retrieving a matching tuple in S). Depend on the type of index and the number of matching tuples. cost to scan R

  9. What is the I/O cost for using an indexed nested loops join? • Ignore the cost of writing the results • Memory available: 102 Blocks • There is only a hash-based, dense index on sid using Alternative 2 • Answer • Sailors is the outer relation • I/O Cost = 500 + 40,000*(1.2+1)=88500 I/Os Exercise

  10. Original Relation Partitions OUTPUT 1 1 2 INPUT 2 hash function h . . . B-1 B-1 B main memory buffers Disk Disk Partitions of R & S Join Result Hash table for partition Ri (k <= B-2 pages) hash fn h2 h2 Output buffer Input buffer for Si B main memory buffers Disk Disk Grace Hash-Join • Partition Phase • Partition both relations using hash fn h: • R tuples in partition i will only match S tuples in partition i. Join (probing) Phase • Read in a partition of R, hash it using h2 (<> h!). • Scan matching partition of S to search for matching tuples

  11. Cost of Grace Join • Assumption: • Each partition fits in the B-2 pages • I/O cost for a read and a write is the same • Ignore the cost of writing the join results • Disk I/O Cost • Partitioning Phase: • I/O Cost: 2*M+2*N • Probing Phase • I/O Cost: M+N • Total Cost 3*(M+N) • Which one to use, block-nested loop or Grace join? • Block-nested loop : Cost=M+ * N

  12. Exercise • What is the I/O cost for using Grace hash join? • Ignore the cost of writing the results • Assuming that each partition fits in memory. • Answer • 3*(500+1000)=4500 I/Os

  13. Memory Requirement for Hash Join • Ideally, each partition fits in memory • To increase this chance, we need to minimize partition size, which means to maximize #partitions • Questions • What limits the number of partitions? • What is the minimum memory requirement? • Partition Phase • Probing phase

  14. Original Relation Partitions OUTPUT 1 1 2 INPUT 2 hash function h . . . B-1 B-1 B main memory buffers Disk Disk • Partition Phase • To partition R into K partitions, we need at least K output buffer and one input buffer • Given B buffer pages, the maximum number of partition is B-1 • Assume a uniform distribution, the size of each R partition is equal to M/(B-1)

  15. Partitions of R & S Join Result Hash table for partition Ri (k <= B-2 pages) hash fn h2 h2 Output buffer Input buffer for Si B main memory buffers Disk Disk • Probing phase • # pages for the in-memory hash table built during the probing phase is equal to f*M/(B-1) • f: fudge factor that captures the increase in the hash table size from the buffer size • Total number of pages • since one page for input buffer for S and another page for output buffer

  16. If memory is not enough to store a smaller partition • Divide a partition of R into sub-partitions using another hash function h3 • Divide a partition of S into sub-partitions using another hash function h3 • Sub-partition j of partition i in R only matches sub-partition j of partition i in S

  17. Sort-Merge Join (R S) i=j • Sort R and S on the join attribute, then scan them to do a ``merge’’ (on join col.), and output result tuples. • Advance scan of R until current tuple R.i >= current tuple S.j, then advance scan of S until current S.j >= current R.i; do this until current R.i = current S.j. • At this point, all R tuples with same value in R.i (current R partition) and all S tuples with same value in S.j (current S partiton) match; • output <r, s> for all pairs of such tuples. • Then resume scanning R and S. j i : : Both are sorted

  18. Example of Sort-Merge Join Reserves Sailors • I/O Cost for Sort-Merge Join • Cost of sorting: TBD • O(|R| log |R|)+O(|S| log |S|)?? • Cost of merging: M+N • Could be up to O(M+N) if the inner-relation has to be scanned multiple times (very unlikely)

  19. Sort-Merge Join • Attractive if one relation is already sorted on the join attribute or has a clustered index on the join attribute

  20. Exercise • What is the I/O cost for using a sort-merge join? • Ignore the cost of writing the results • Buffer pool: 100 pages • Answer • Sort Reserves in 2 passes:2*(1000+1000) =4000 • Sort Sailors in 2 passes: 2*(500+500)=2000 • Merge: (1000+500) =1500 • Total cost = 4000+2000+1500 = 7500

  21. Cost comparison for Exercise Example • Block nested loops join: 5500 • Index nested loops join using a hash-index: 88500 • Grace Hash Join: 4500 • Sort-Merge Join: 7500

  22. Why Sort? • Sort-merge join algorithm involves sorting. • Problem: sort 1Gb of data with 1Mb of RAM. • why not virtual memory?

  23. 2-Way Sort: Requires 3 Buffers • Pass 1: Read a page, sort it, write it. • only one buffer page is used • Pass 2, 3, …, etc.: • three buffer pages used. INPUT 1 OUTPUT INPUT 2 Main memory buffers Disk Disk

  24. Example: Sorting 4 pages INPUT 1 P1 OUTPUT P1 P2 P2 INPUT 2 P1 and P2 are sorted P1 and P2 are sorted individually INPUT 1 P3 OUTPUT P3 P4 P4 INPUT 2 P3 and P4 are sorted P3 and P4 are sorted individidually

  25. INPUT 1 P2 P1 OUTPUT P4 P3 P2 P1 INPUT 2 P4 P3 The data in 4 pages are sorted

  26. Two-Way External Merge Sort 6,2 2 Input file 3,4 9,4 8,7 5,6 3,1 PASS 0 1,3 2 1-page runs 3,4 2,6 4,9 7,8 5,6 • Each pass we read + write each page in file. • N pages in the file => the number of passes • So total cost is: • Idea:Divide and conquer: sort subfiles and merge PASS 1 4,7 1,3 2,3 2-page runs 8,9 5,6 2 4,6 PASS 2 2,3 4,4 1,2 4-page runs 6,7 3,5 6 8,9 PASS 3 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9

  27. Two-Phase Multi-Way Merge-Sort • More than 3 buffer pages. How can we utilize them? • Phase 1: • Fill all available main memory with blocks from the original relation to be sorted. • Sort the records in main memory using main memory sorting techniques. • Write the sorted records from main memory onto new blocks of secondary memory, forming one sorted sublist. (There may be any number of these sorted sublists, which we merge in the next phase). Sorted f1 MM Sorted f2 File to be sorted : Sorted fn n = N/M, where N is the file size and P is the main memory in pages

  28. Phase 2: Multiway Merge-Sort • Merge all the sorted sublists into a single sorted list. • Partition MM into n blocks • Load fi to block i Pointers to first unchosen records Select smallest unchosen key from the list for output Output buffer Input buffer, one for each sorted list Done in main memory

  29. Phase 2: Multiway Merge-Sort • Find the smallest key among the first remaining elements of all the lists. (This comparison is done in main memory and a linear search is sufficient. Better technique can be used.) • Move the smallest element to the first available position of the output block. • If the output block is full, write it to disk and reinitialize the same buffer in main memory to hold the next output block. • If the block from which the input smallest element was taken is now exhausted, read the next block from the same sorted sublist into the same buffer. • If no block remains, leave its buffer empty and do not consider elements from that list.

  30. Memory Requirement for Multi-Way Merge-Sort • Partitioning Phase: • Given M pages of main memory and a file of N pages of data, we can partition the file into N/M small files. Sorted f1 MM Sorted f2 File to be sorted : Sorted fn • Merge Phase: • At least M-1 pages of main memory are needed to merge N/M sorted lists, so we have M-1>=N/M, i.e., M(M-1)>=N. INPUT 1 INPUT 2 OUTPUT INPUT n Main memory buffers Disk Disk

  31. In-memory Sort: Quick Sort • pick one element in the array, which will be the pivot. • make one pass through the array, called a partition step, re-arranging the entries so that: • the pivot is in its proper place. • entries to the left of the pivot are smaller than the pivot • entries to its right are larger than the pivot • Detailed Steps: • Starting from left, find the item that is larger than the pivot, • Starting from right, find the item that is smaller than the pivot • Switch left and right • recursively apply quicksort to the part of the array that is to the left of the pivot, and to the part on its right. 1st pass: pivot = 18 18 5 6 19 3 1 5 6 11 21 33 23 left right O(n log n) on average, worst case is O(n2)

  32. Query Cost Estimation • Select: • No index • unsorted data • sorted data • Index • tree index • hash-based index • Join: R S • Simple nested loop • Block nested loop • Grace Hash • Sort-merge

More Related