1 / 35

Inspector Joins

This study delves into improving hash join performance on SMP systems by utilizing extracted information to optimize the join phase algorithms. The focus is on reducing CPU cache bottlenecks and selecting the most suitable join phase algorithm for specific input cases. By exploring the inner structure of hash joins and employing cache partitioning, I/O partitioning, and cache prefetching, this research aims to enhance the overall efficiency of hash join operations. The approach involves a novel Inspector Join algorithm that leverages information extracted during I/O partitioning to streamline the join process. Experimental results and conclusions are discussed to highlight the effectiveness of the proposed methodology.

Download Presentation

Inspector Joins

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Carnegie Mellon University 2 Intel Research Pittsburgh Inspector Joins 1 Shimin Chen 1 AnastassiaAilamaki 2 1,2 ToddC.Mowry PhillipB.Gibbons 1

  2. Exploiting Information about Data • Ability to improve query depends on information quality • General stats on relations are inadequate • May lead to incorrect decisions for specific queries • Especially true for join queries • Previous approaches exploiting dynamic information • Collecting information from previous queries • Multi-query optimization [Sellis’88] • Materialized views [Blakeley et al. 86] • Join indices [Valduriez’87] • Dynamic re-optimization of query plans [Kabra&DeWitt’98] [Markl et al. 04] This study exploits the inner structure of hash joins 2

  3. Join I/O Partitioning Extra information greatly helps phase 2 Exploiting Multi-Pass Structure of Hash Joins • Idea: • Examine the actual data in I/O partitioning phase • Extract useful information to improve join phase Inspection 3

  4. decide Using Extracted Information • Enable a new join phase algorithm • Reduce the primary performance bottleneck in hash joins i.e. Poor CPU cache performance • Optimized for multi-processor systems • Choose the most suitable join phase algorithm for special input cases Join Phase Simple Hash Join Inspection Cache Partitioning I/O Partitioning Cache Prefetching Extracted Information New Algorithm 4

  5. Outline • Motivation • Previous hash join algorithms • Hash join performance on SMP systems • Inspector join • Experimental results • Conclusions 5

  6. Hash Table GRACE Hash Join • I/O Partitioning Phase: • Divide input relations into partitions with a hash function Probe Build Over 70% execution time stalled on cache misses! • Join Phase: (simple hash join) • Build hash table, then probe hash table Probe Build • Random memory accesses cause poor CPU cache performance 6

  7. Cache-sized Partitions Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00] • Recursively produce cache-sized partitions after I/O partitioning • Avoid cache misses when joining cache-sized partitions • Overhead of re-partitioning Memory-sized Partitions Probe Build 7

  8. Hash Table Build Probe Cache Prefetching [Chen et al. 04] • Reduce impact of cache misses • Exploit available memory bandwidth • Overlap cache misses and computations • Insert cache prefetch instructions into code • Still incurs the same number of cache misses 8

  9. Outline • Motivation • Previous hash join algorithms • Hash join performance on SMP systems • Inspector join • Experimental results • Conclusions 9

  10. CPU CPU CPU CPU Cache Cache Cache Cache Shared bus Main Memory Build4 Build1 Build2 Build3 Probe2 Probe4 Probe3 Probe1 Hash Joins on SMP Systems • Previous studies mainly focus on uni-processors • Memory bandwidth is precious • Each processor joins a pair of partitions in join phase 10

  11. GRACECache partitioningCache prefetching Bandwidth-sharing Number of CPUs used Number of CPUs used Re-partition cost Previous Algorithms on SMP Systems Wall clock time Aggregate time on all CPUs • Join phase performance of joining a 500MB and a 2GB relations (details later in the talk) • Aggregate performance degrades dramatically over 4 CPUs  Reduce data movement (memory to memory, memory to cache) 11

  12. I/O Partitioning Join Sub-partition 0 Sub-partition 1 Sub-partition 2 Inspector Joins • Extracted information:summary of matching relationships • Every K contiguous pages in a build partition forms a sub-partition • Tells which sub-partition(s) every probe tuple matches Probe Partition Build Partition Summary of Matching Relationship 12

  13. I/O Partitioning Join CPU Cache Hash Table Cache-Stationary Join Phase • Recall cache partitioning: re-partition cost Copying cost Copying cost Build Partition Probe Partition • We want to achieve zero copying 13

  14. I/O Partitioning Join CPU Cache Hash Table Cache-Stationary Join Phase • Joins a sub-partition and its matching probe tuples • Sub-partition is small enough to fit in CPU cache • Cache prefetching for the remaining cache misses • Zero copying for generating recursive cache-sized partitions Sub-partition 0 Sub-partition 1 Sub-partition 2 Build Partition Probe Partition 14

  15. I/O Partitioning Join Probe Relation Build Relation Filters in I/O Partitioning • How to extract the summary efficiently? • Extend filter scheme in commercial hash joins • Conventional single-filter scheme • Represent all build join keys • Filter out probe tuples having no matches Filter Mem-sized Partitions Test Construct 15

  16. 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 Background: Bloom Filter • A bit vector • A key is hashed d (e.g. d=3) times and represented by d bits • Construct: for every build join key, set its 3 bits in vector • Test: given a probe join key, check if all its 3 bits are 1 • Discard the tuple if some bits are 0 • May have false positives Bit0=H0(key) Bit1=H1(key) Bit2=H2(key) Filter 16

  17. I/O Partitioning Join Sub0,0Sub0,1Sub0,2 Sub1,0Sub1,1Sub1,2 Sub2,0Sub2,1Sub2,2 Multi-Filter Scheme • Single filter: a probe tuple  entire build relation • Our goal: a probe tuple  sub-partitions • Construct a filter for every sub-partition • Replace a single large filter with multiple small filters Single Filter Multi-Filter Partition 0 Partition 1 Build Relation Partition 2 17

  18. I/O Partitioning Join Testing Multi-Filters When partitioning the probe relation • Test a probe tuple against all the filters of a partition • Tells which sub-partition(s) the tuple may have matches • Store summary of matching relationships in partitions Test Partition 0 Partition 1 Multi-Filter Probe Relation Partition 2 18

  19. Partition 0 0 0 1 S filters 1 1 1 0 1 1 Partition 1 Multi-Filter Probe Relation Partition 2 Minimizing Cache Misses for Testing Filters • Single filter scheme: • Compute 3 bit positions • Test 3 bits • Multi-filter scheme: if there are S sub-partitions in a partition • Compute 3 bit positions • Test the same 3 bits for every filter, altogether 3*S bits • May cause 3*S cache misses ! Test 19

  20. I/O Partitioning Join 0 1 0 0 1 1 1 1 1 Vertical Filters for Testing • Bits at the same position are contiguous in memory • 3 cache misses instead of 3*S cache misses! S filters Contiguous in memory Test Partition 0 Partition 1 Probe Relation Partition 2 • Horizontal  vertical conversion after partitioning build relation • Very small overhead in practice 20

  21. More Details in Paper • Moderate memory space requirement for filters • Summary information representation in intermediate partitions • Preprocessing for cache-stationary join phase • Prefetching for improving efficiency and robustness 21

  22. Outline • Motivation • Previous hash join algorithms • Hash join performance on SMP systems • Inspector join • Experimental results • Conclusions 22

  23. Experimental Setup • Relation schema: 4-byte join attribute + fixed length payload • No selection, no projection • 50MB memory per CPU available for the join phase • Same join algorithm run on every CPU joining different partitions • Detailed cycle-by-cycle simulations • A shared-bus SMP system with 1.5GHz processors • Memory hierarchy is based on Itanium 2 processor 23

  24. GRACECache prefetchingCache partitioningEnhanced cache partitioningInspector join Partition Phase Wall-Clock Time • I/O partitioning can take advantage of multiple CPUs • Cut input relations into equal-sized chunks • Partition one chunk on every CPU • Concatenate outputs from all CPUs • Enhanced cache partitioning: cache partitioning + advanced prefetching • Inspection incurs very small overhead • 500MB joins 2GB • 100B tuples, 4B keys • 50% probe tuples no matches • A build matches 2 probe tuples Number of CPUs used 24

  25. GRACE Cache prefetching Cache partitioning Enhanced cache partitioning Inspector join Join Phase Aggregate Time • Inspector join achieves significantly better performancewhen 8 or more CPUs are used • 1.7-2.1X speedups over cache prefetching • 1.6-2.0X speedups over enhanced cache partitioning • 500MB joins 2GB • 100B tuples, 4B keys • 50% probe tuples no matches • A build matches 2 probe tuples Number of CPUs used 25

  26. decide Results on Choosing Suitable Join Phase • Case #1: a large number of duplicate build join keys • Choose enhanced cache partitioning • When a probe tuple on average matches 4 or more sub-partitions • Case #2: nearly sorted input relations • Surprisingly: cache-stationary join is very good Join Phase Simple Hash Join Inspection Cache Partitioning I/O Partitioning Cache Prefetching Extracted Info Cache Stationary 26

  27. Conclusions • Exploit multi-pass structure for higher quality info about data • Achieve significantly better cache performance • 1.6X speedups over previous cache-friendly algorithms • When 8 or more CPUs are used • Choose most suitable algorithms for special input cases • Idea may be applicable to other multi-pass algorithms 27

  28. Thank You ! 28

  29. GRACECache prefetchingCache partitioningInspector join Partition Phase Wall-Clock Time • I/O partitioning can take advantage of multiple CPUs • Cut input relations into equal-sized chunks • Partition one chunk on every CPU • Concatenate outputs from all CPUs • Inspection incurs very small overhead • 500MB joins 2GB • 100B tuples, 4B keys • 50% probe tuples no matches • A build matches 2 probe tuples Number of CPUs used 29

  30. GRACE Cache prefetching Cache partitioning Inspector join Join Phase Aggregate Time • Inspector join achieves significantly better performancewhen 8 or more CPUs are used • 1.7-2.1X speedups over cache prefetching • 1.6-2.0X speedups over enhanced cache partitioning • 500MB joins 2GB • 100B tuples, 4B keys • 50% probe tuples no matches • A build matches 2 probe tuples Number of CPUs used 30

  31. Hash Table Build Probe CPU-Cache-Friendly Hash Joins • Recent studies focus on CPU cache performance • I/O partitioning gives good I/O performance • Random memory accesses cause poor CPU cache performance • Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00] • Recursively produce cache-sized partitions from memory-sized partitions • Avoid cache misses during join phase • Pay re-partitioning cost • Cache Prefetching [Chen et al. 04] • Exploit memory system parallelism • Use prefetches to overlap multiple cache misses and computations 31

  32. Example Special Input Cases • Example case #1: a large number of duplicate build join keys • Count the average number of sub-partitions a probe tuple matches • Must check the tuple against all possible sub-partitions • If too large, cache stationary join works poorly • Example case #2: nearly sorted input relations • A merge-based join phase might be better? A probe tuple Build Partition Sub-partition 0 Probe Partition Sub-partition 1 Sub-partition 2 32

  33. Varying Number of Duplicates per Build Join Key • Join phase aggregate performance • Choose enhanced cache part • When a probe tuple on average matches 4 or more sub-partitions 33

  34. Nearly Sorted Cases • Sort both input relations, then randomly move 0%-5% of tuples • Join phase aggregate performance • Surprisingly: cache-stationary join is very good • Even better than merge join when over 1% tuples are out-of-order 34

  35. Analyzing Nearly Sorted Case • Partitions are also nearly sorted • Probe tuples matching a sub-partition are almost contiguous • Similar memory behavior as merge join • No cost for sorting out-of-order tuples A probe tuple Build Partition Sub-partition 0 Probe Partition Sub-partition 1 Sub-partition 2 Nearly Sorted Nearly Sorted 35

More Related