This presentation examines how to improve hash join performance on SMP systems by using information extracted during the I/O partitioning phase to optimize the join phase. The focus is on reducing the CPU cache bottleneck and on selecting the most suitable join phase algorithm for special input cases. By exploiting the inner multi-pass structure of hash joins, and building on cache partitioning and cache prefetching, the proposed Inspector Join algorithm uses the information gathered while partitioning to streamline the join itself. Experimental results and conclusions highlight the effectiveness of the approach.
Inspector Joins
Shimin Chen¹, Anastassia Ailamaki¹, Phillip B. Gibbons², Todd C. Mowry¹,²
¹Carnegie Mellon University  ²Intel Research Pittsburgh
Exploiting Information about Data
• The ability to improve a query depends on the quality of the available information
• General statistics on relations are inadequate
  • May lead to incorrect decisions for specific queries
  • Especially true for join queries
• Previous approaches exploit dynamic information collected from previous queries:
  • Multi-query optimization [Sellis '88]
  • Materialized views [Blakeley et al. '86]
  • Join indices [Valduriez '87]
  • Dynamic re-optimization of query plans [Kabra & DeWitt '98] [Markl et al. '04]
• This study exploits the inner structure of hash joins
Exploiting the Multi-Pass Structure of Hash Joins
• Idea:
  • Examine the actual data during the I/O partitioning phase ("inspection")
  • Extract useful information to improve the join phase
• The extra information greatly helps phase 2, the join phase
Using Extracted Information
• Enable a new join phase algorithm
  • Reduces the primary performance bottleneck in hash joins: poor CPU cache performance
  • Optimized for multi-processor systems
• Choose the most suitable join phase algorithm for special input cases
[Diagram: the I/O partitioning (inspection) phase produces extracted information, which is used to decide among join phase algorithms: simple hash join, cache partitioning, cache prefetching, or the new algorithm]
Outline
• Motivation
• Previous hash join algorithms
• Hash join performance on SMP systems
• Inspector join
• Experimental results
• Conclusions
GRACE Hash Join
• I/O partitioning phase:
  • Divide both input relations (build and probe) into partitions with a hash function
• Join phase (simple hash join):
  • Build a hash table on the build partition, then probe it with the probe partition
• Random memory accesses cause poor CPU cache performance
  • Over 70% of execution time is stalled on cache misses!
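For concreteness, a minimal C++ sketch of the join phase described above, assuming tuples with 4-byte integer keys (the `Tuple` layout and function names are illustrative, not from the paper). The random accesses into the hash table are what miss in the cache:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

// Join one pair of co-partitioned inputs: build a hash table on the
// (smaller) build partition, then probe it with every probe tuple.
std::vector<std::pair<Tuple, Tuple>> simple_hash_join(
        const std::vector<Tuple>& build, const std::vector<Tuple>& probe) {
    std::unordered_multimap<uint32_t, Tuple> table;
    for (const Tuple& b : build)            // build phase
        table.emplace(b.key, b);
    std::vector<std::pair<Tuple, Tuple>> out;
    for (const Tuple& p : probe) {          // probe phase: random accesses
        auto [lo, hi] = table.equal_range(p.key);
        for (auto it = lo; it != hi; ++it)
            out.emplace_back(it->second, p);
    }
    return out;
}
```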
Cache Partitioning [Shatdal et al. '94] [Boncz et al. '99] [Manegold et al. '00]
• Recursively re-partition memory-sized partitions into cache-sized partitions after I/O partitioning
• Avoid cache misses when joining cache-sized partitions
• Pays the overhead of re-partitioning: every tuple is copied again
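A sketch of that re-partitioning step, using the same illustrative `Tuple` type; the extra pass that copies every tuple once more is exactly the overhead noted above (the hash on high key bits is a placeholder):

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

// Re-partition one memory-sized partition into cache-sized pieces by
// hashing on a further set of key bits. Every tuple is copied once
// more, which is the extra cost cache partitioning pays.
std::vector<std::vector<Tuple>> cache_partition(
        const std::vector<Tuple>& mem_partition, uint32_t fanout) {
    std::vector<std::vector<Tuple>> sub(fanout);
    for (const Tuple& t : mem_partition)
        sub[(t.key >> 16) % fanout].push_back(t);  // illustrative hash
    return sub;
}
```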
Cache Prefetching [Chen et al. '04]
• Reduce the impact of cache misses
  • Exploit available memory bandwidth
  • Overlap cache misses with computation
• Insert cache prefetch instructions into the join code
• Still incurs the same number of cache misses
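A simplified sketch of software prefetching in the probe loop, using the GCC/Clang `__builtin_prefetch` builtin. The prefetch distance `DIST` and the flat bucket-array layout are assumptions; the actual algorithms of [Chen et al. '04] use more elaborate group and software-pipelined prefetching:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

// While joining probe tuple i, issue a prefetch for the bucket that
// probe tuple i+DIST will touch, overlapping its cache miss with the
// work on the current tuple. The miss count is unchanged; only the
// stall time is hidden.
void probe_with_prefetch(const std::vector<Tuple>& probe,
                         const std::vector<Tuple>* buckets,  // hash table: array of buckets
                         uint32_t num_buckets) {
    constexpr size_t DIST = 8;  // prefetch distance, tuned to miss latency
    for (size_t i = 0; i < probe.size(); ++i) {
        if (i + DIST < probe.size()) {
            uint32_t h = probe[i + DIST].key % num_buckets;
            __builtin_prefetch(&buckets[h]);  // fetch future bucket early
        }
        uint32_t h = probe[i].key % num_buckets;
        for (const Tuple& b : buckets[h])
            if (b.key == probe[i].key) { /* emit join result */ }
    }
}
```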
Outline
• Motivation
• Previous hash join algorithms
• Hash join performance on SMP systems
• Inspector join
• Experimental results
• Conclusions
Hash Joins on SMP Systems
[Diagram: four CPUs, each with a private cache, connected by a shared bus to main memory; each CPU joins one build/probe partition pair]
• Previous studies mainly focus on uni-processors
• Memory bandwidth on the shared bus is precious
• In the join phase, each processor joins a pair of partitions
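A minimal sketch of this join-phase parallelism using `std::thread`, one worker per partition pair (the per-thread join body is reduced to a probe count for brevity). All workers share the memory bus, so each worker's cache misses draw on common bandwidth:

```cpp
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

// One thread per CPU, each joining its own build/probe partition pair.
void parallel_join(const std::vector<std::vector<Tuple>>& build_parts,
                   const std::vector<std::vector<Tuple>>& probe_parts) {
    std::vector<std::thread> workers;
    for (size_t p = 0; p < build_parts.size(); ++p) {
        workers.emplace_back([&, p] {
            std::unordered_multimap<uint32_t, uint32_t> table;
            for (const Tuple& b : build_parts[p]) table.emplace(b.key, b.payload);
            for (const Tuple& t : probe_parts[p]) (void)table.count(t.key);  // probe
        });
    }
    for (auto& w : workers) w.join();
}
```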
Previous Algorithms on SMP Systems
[Charts: join phase wall-clock time and aggregate time on all CPUs vs. number of CPUs used, for GRACE, cache partitioning, and cache prefetching; annotations mark the bandwidth-sharing and re-partition costs]
• Join phase performance of joining a 500MB and a 2GB relation (details later in the talk)
• Aggregate performance degrades dramatically beyond 4 CPUs
• Lesson: reduce data movement (memory to memory, memory to cache)
Inspector Joins
• Extracted information: a summary of matching relationships
  • Every K contiguous pages in a build partition form a sub-partition
  • The summary tells which sub-partition(s) every probe tuple matches
[Diagram: a build partition divided into sub-partitions 0-2; probe partition tuples annotated with the sub-partitions they match]
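One possible encoding of this summary, purely illustrative (the paper's actual representation inside the intermediate partitions differs; see "More Details in the Paper" below): a small bitmap per probe tuple, one bit per sub-partition of the corresponding build partition:

```cpp
#include <cstdint>
#include <vector>

// Bit s set means the probe tuple may match sub-partition s.
// Assumed encoding, supporting up to 32 sub-partitions per partition.
struct ProbeSummary {
    uint32_t subpart_mask;
};

// In the join phase, all probe tuples with bit s set are joined
// together against the one cache-resident sub-partition s.
std::vector<uint32_t> probes_for_subpartition(
        const std::vector<ProbeSummary>& summaries, uint32_t s) {
    std::vector<uint32_t> ids;
    for (uint32_t i = 0; i < summaries.size(); ++i)
        if (summaries[i].subpart_mask & (1u << s)) ids.push_back(i);
    return ids;
}
```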
Cache-Stationary Join Phase
• Recall cache partitioning: re-partitioning means paying a copying cost on both the build and the probe partitions
• We want to achieve zero copying
Cache-Stationary Join Phase
• Join a sub-partition with its matching probe tuples
  • The sub-partition is small enough to fit in the CPU cache
  • Cache prefetching hides the remaining cache misses
• Zero copying: no recursive cache-sized partitions are generated
[Diagram: a hash table built on one cache-resident sub-partition; the matching probe tuples are streamed against it]
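A sketch of one step of this join phase under the illustrative bitmap summary above. Note that build tuples are read in place from the partition, with no copying into new partitions:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

// Build a hash table on a single cache-sized sub-partition, then probe
// it only with the probe tuples whose summary says they may match it.
void join_subpartition(const std::vector<Tuple>& build_part,
                       size_t sub_begin, size_t sub_end,  // sub-partition bounds
                       const std::vector<Tuple>& probe_part,
                       const std::vector<uint32_t>& matching_probe_ids) {
    std::unordered_multimap<uint32_t, uint32_t> table;  // fits in cache
    for (size_t i = sub_begin; i < sub_end; ++i)
        table.emplace(build_part[i].key, build_part[i].payload);
    for (uint32_t id : matching_probe_ids) {
        auto [lo, hi] = table.equal_range(probe_part[id].key);
        for (auto it = lo; it != hi; ++it) { /* emit join result */ }
    }
}
```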
Filters in I/O Partitioning
• How do we extract the summary efficiently?
• Extend the filter scheme used in commercial hash joins
• Conventional single-filter scheme:
  • Construct a filter representing all build join keys
  • Test probe tuples against it, filtering out those with no matches
Background: Bloom Filter
• A bit vector; a key is hashed d times (e.g. d = 3) and represented by d bits: Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key)
• Construct: for every build join key, set its 3 bits in the vector
• Test: given a probe join key, check whether all of its 3 bits are 1
  • Discard the tuple if any bit is 0
  • May have false positives, but never false negatives
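A minimal d = 3 Bloom filter sketch over 32-bit join keys; the multiplicative hash functions are placeholders, not the hash functions used in the paper:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class BloomFilter {
    std::vector<bool> bits_;
    size_t pos(uint32_t key, uint32_t seed) const {
        return (key * (2654435761u ^ seed)) % bits_.size();  // illustrative hash
    }
public:
    explicit BloomFilter(size_t nbits) : bits_(nbits, false) {}
    void construct(uint32_t key) {              // set the key's 3 bits
        for (uint32_t s = 0; s < 3; ++s) bits_[pos(key, s)] = true;
    }
    bool test(uint32_t key) const {             // are all 3 bits set?
        for (uint32_t s = 0; s < 3; ++s)
            if (!bits_[pos(key, s)]) return false;  // definitely no match
        return true;   // possible match (may be a false positive)
    }
};
```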
Multi-Filter Scheme
• Single filter: maps a probe tuple to the entire build relation
• Our goal: map a probe tuple to individual sub-partitions
• Construct a filter for every sub-partition
  • Replace a single large filter with multiple small filters
[Diagram: the build relation split into partitions 0-2, each with sub-partitions (e.g. Sub0,0-Sub0,2); the single filter is replaced by one small filter per sub-partition]
Testing Multi-Filters
• When partitioning the probe relation:
  • Test each probe tuple against all the filters of its partition
  • This tells which sub-partition(s) the tuple may match
  • Store the summary of matching relationships in the intermediate partitions
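Testing a probe tuple against all S filters of its partition then reduces to OR-ing the per-filter results into the tuple's summary bitmap (reusing the `BloomFilter` sketch above; the bitmap encoding remains an assumption):

```cpp
#include <cstdint>
#include <vector>

// Assumes the BloomFilter class from the earlier sketch.
// Bit s of the result set => possible match in sub-partition s.
uint32_t test_multi_filters(const std::vector<BloomFilter>& filters,
                            uint32_t probe_key) {
    uint32_t mask = 0;
    for (uint32_t s = 0; s < filters.size(); ++s)
        if (filters[s].test(probe_key)) mask |= (1u << s);
    return mask;  // mask == 0: the tuple matches nothing and can be discarded
}
```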
Minimizing Cache Misses for Testing Filters
• Single-filter scheme: compute 3 bit positions, test 3 bits
• Multi-filter scheme, with S sub-partitions per partition:
  • Compute 3 bit positions
  • Test the same 3 bit positions in every filter, 3*S bits altogether
  • May cause 3*S cache misses!
Vertical Filters for Testing
• Store the S filters bit-sliced: bits at the same position across all filters are contiguous in memory
• 3 cache misses instead of 3*S cache misses!
• Horizontal-to-vertical conversion after partitioning the build relation
  • Very small overhead in practice
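A sketch of the bit-sliced layout, assuming S <= 32 so that one 32-bit word holds the same bit position of all S filters. AND-ing the three touched words directly yields the tuple's candidate sub-partition bitmap:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// slices[p] packs bit position p of ALL S filters into one word, so
// testing one probe key touches 3 words (3 cache misses) instead of
// 3 bits scattered across S separate filters (up to 3*S misses).
struct VerticalFilters {
    std::vector<uint32_t> slices;  // one 32-bit word per bit position
    uint32_t num_positions;        // size of each underlying filter

    // Bit s of the result is set iff all 3 of the key's bit positions
    // are set in filter s, i.e. sub-partition s is a candidate match.
    uint32_t test(uint32_t key) const {
        uint32_t mask = ~0u;
        for (uint32_t seed = 0; seed < 3; ++seed) {
            size_t p = (key * (2654435761u ^ seed)) % num_positions;
            mask &= slices[p];
        }
        return mask;  // summary bitmap for this probe key
    }
};
```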
More Details in the Paper
• Moderate memory space requirement for the filters
• Representation of the summary information in intermediate partitions
• Preprocessing for the cache-stationary join phase
• Prefetching for improved efficiency and robustness
Outline
• Motivation
• Previous hash join algorithms
• Hash join performance on SMP systems
• Inspector join
• Experimental results
• Conclusions
Experimental Setup
• Relation schema: 4-byte join attribute + fixed-length payload
• No selection, no projection
• 50MB of memory per CPU available for the join phase
• The same join algorithm runs on every CPU, joining different partitions
• Detailed cycle-by-cycle simulations
  • A shared-bus SMP system with 1.5GHz processors
  • Memory hierarchy based on the Itanium 2 processor
Partition Phase Wall-Clock Time
[Chart: partition phase wall-clock time vs. number of CPUs used, for GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, and inspector join]
• I/O partitioning can take advantage of multiple CPUs:
  • Cut the input relations into equal-sized chunks
  • Partition one chunk on every CPU
  • Concatenate the outputs from all CPUs
• Enhanced cache partitioning: cache partitioning + advanced prefetching
• Inspection incurs very small overhead
• Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples
Join Phase Aggregate Time
[Chart: join phase aggregate time on all CPUs vs. number of CPUs used, same five algorithms]
• Inspector join achieves significantly better performance when 8 or more CPUs are used:
  • 1.7-2.1X speedups over cache prefetching
  • 1.6-2.0X speedups over enhanced cache partitioning
• Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples
Results on Choosing a Suitable Join Phase
• Case #1: a large number of duplicate build join keys
  • Choose enhanced cache partitioning when a probe tuple matches 4 or more sub-partitions on average
• Case #2: nearly sorted input relations
  • Surprisingly, cache-stationary join is very good
[Diagram: the extracted information from inspection decides among join phase algorithms: simple hash join, cache partitioning, cache prefetching, or cache-stationary]
Conclusions
• Exploit the multi-pass structure of hash joins for higher-quality information about the data
• Achieve significantly better cache performance
  • 1.6X speedups over previous cache-friendly algorithms when 8 or more CPUs are used
• Choose the most suitable algorithm for special input cases
• The idea may be applicable to other multi-pass algorithms
Thank You!
CPU-Cache-Friendly Hash Joins
• Recent studies focus on CPU cache performance
  • I/O partitioning gives good I/O performance
  • But random memory accesses cause poor CPU cache performance
• Cache partitioning [Shatdal et al. '94] [Boncz et al. '99] [Manegold et al. '00]
  • Recursively produce cache-sized partitions from memory-sized partitions
  • Avoid cache misses during the join phase
  • Pay the re-partitioning cost
• Cache prefetching [Chen et al. '04]
  • Exploit memory system parallelism
  • Use prefetches to overlap multiple cache misses with computation
Example Special Input Cases
• Case #1: a large number of duplicate build join keys
  • Count the average number of sub-partitions a probe tuple matches
  • The tuple must be checked against all possible sub-partitions
  • If this number is too large, cache-stationary join works poorly
• Case #2: nearly sorted input relations
  • A merge-based join phase might be better?
[Diagram: a probe tuple matching multiple sub-partitions of a build partition]
Varying the Number of Duplicates per Build Join Key
[Chart: join phase aggregate performance]
• Choose enhanced cache partitioning when a probe tuple matches 4 or more sub-partitions on average
Nearly Sorted Cases
• Setup: sort both input relations, then randomly move 0%-5% of the tuples
[Chart: join phase aggregate performance]
• Surprisingly, cache-stationary join is very good
  • Even better than merge join when over 1% of tuples are out of order
Analyzing the Nearly Sorted Case
• The partitions are also nearly sorted
  • Probe tuples matching a sub-partition are almost contiguous
  • Memory behavior is similar to merge join
  • No cost is paid for sorting out-of-order tuples
[Diagram: nearly sorted build and probe partitions; a probe tuple's matches cluster in one sub-partition]