Inspector Joins By Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry VLDB 2005 Rammohan Narendula
Introduction • Query execution is typically I/O bound, so most research concentrates on using main memory to reduce the number of page faults and thus the number of disk I/Os • However, hash join is a special case: given sufficient I/O bandwidth and advanced I/O techniques (I/O prefetching), it becomes CPU bound • Goal: reduce the number of CPU cache misses
Exploiting Information about Data • The ability to improve a query depends on the quality of the available information • General statistics on relations are inadequate • May lead to incorrect decisions for specific queries • Especially true for join queries • Previous approaches exploiting dynamic information: • Collecting information from previous queries • Multi-query optimization [Sellis’88] • Materialized views [Blakeley et al.’86] • Join indices [Valduriez’87] • Dynamic re-optimization of query plans [Kabra&DeWitt’98] [Markl et al.’04] • This study exploits the inner structure of hash joins
Exploiting Multi-Pass Structure of Hash Joins • Idea: • Examine the actual data during the I/O partitioning phase (inspection) • Extract useful information to improve the join phase • The extra information gathered in phase 1 greatly helps phase 2
Using Extracted Information • Enable a new join phase algorithm • Reduces the primary performance bottleneck in hash joins, i.e., poor CPU cache performance • Optimized for multi-processor systems • Choose the most suitable join phase algorithm for special input cases (diagram: information extracted during the I/O partitioning/inspection phase drives the decision among simple hash join, cache partitioning, cache prefetching, and the new algorithm)
Outline • Motivation • Previous hash join algorithms • Hash join performance on SMP systems • Inspector join • Experimental results • Conclusions
GRACE Hash Join • I/O Partitioning Phase: • Divide both input relations into partitions with a hash function • Join Phase (simple hash join): • Build a hash table over each build partition, then probe it with the corresponding probe partition • Random memory accesses cause poor CPU cache performance • Over 70% of execution time is stalled on cache misses!
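As a concrete illustration of the two phases above, here is a minimal, self-contained sketch of a GRACE-style hash join in C. It is not the paper's implementation; the tuple layout, partition count NPART, hash functions, and bucket count HSIZE are illustrative assumptions, and allocations are leaked for brevity.

```c
/* Minimal GRACE hash join sketch: phase 1 scatters both relations into
 * partitions; phase 2 joins each partition pair with a hash table. */
#include <stdio.h>
#include <stdlib.h>

#define NPART 4          /* number of I/O partitions (assumed)     */
#define HSIZE 101        /* in-memory hash table buckets (assumed) */

typedef struct { int key; int payload; } Tuple;
typedef struct Node { Tuple t; struct Node *next; } Node;

static int part_of(int key)   { return (unsigned)key % NPART; }
static int bucket_of(int key) { return (unsigned)key % HSIZE; }

/* Phase 1: scatter a relation into NPART partitions. */
static void partition(const Tuple *rel, int n, Tuple **out, int *cnt) {
    for (int i = 0; i < n; i++) {
        int p = part_of(rel[i].key);
        out[p][cnt[p]++] = rel[i];
    }
}

int main(void) {
    Tuple build[] = {{1,10},{5,50},{9,90},{2,20}};
    Tuple probe[] = {{5,500},{2,200},{7,700},{5,501}};
    int nb = 4, np = 4;

    Tuple *bp[NPART], *pp[NPART];
    int bc[NPART] = {0}, pc[NPART] = {0};
    for (int p = 0; p < NPART; p++) {
        bp[p] = malloc(nb * sizeof(Tuple));
        pp[p] = malloc(np * sizeof(Tuple));
    }
    partition(build, nb, bp, bc);
    partition(probe, np, pp, pc);

    /* Phase 2: per partition, build a chained hash table on the build
     * side, then probe it with the probe side (simple hash join). */
    for (int p = 0; p < NPART; p++) {
        Node *tab[HSIZE] = {0};
        for (int i = 0; i < bc[p]; i++) {
            Node *n = malloc(sizeof(Node));
            n->t = bp[p][i];
            n->next = tab[bucket_of(n->t.key)];
            tab[bucket_of(n->t.key)] = n;
        }
        for (int i = 0; i < pc[p]; i++)
            for (Node *n = tab[bucket_of(pp[p][i].key)]; n; n = n->next)
                if (n->t.key == pp[p][i].key)
                    printf("match key=%d build=%d probe=%d\n",
                           n->t.key, n->t.payload, pp[p][i].payload);
    }
    return 0;
}
```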
Cache Partitioning • Recursively produce cache-sized partitions from the memory-sized partitions after I/O partitioning • Avoids cache misses when joining cache-sized partitions • Overhead: an extra re-partitioning pass over the data
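The re-partitioning overhead mentioned above is just one more scatter pass that touches every tuple. A minimal sketch, assuming a simple tuple layout and NSUB cache-sized pieces per memory-sized partition (both assumptions, not the paper's parameters):

```c
/* Sketch of the re-partitioning step in cache partitioning: a
 * memory-sized partition is split again, using different hash bits,
 * into pieces small enough to fit in the CPU cache. The extra pass
 * that copies every tuple is exactly the overhead noted above. */
#include <stdio.h>
#include <stdlib.h>

#define NSUB 4  /* cache-sized pieces per memory-sized partition (assumed) */

typedef struct { int key; int payload; } Tuple;

/* Use higher hash bits than phase 1 so sub-partitions nest cleanly. */
static int sub_of(int key) { return ((unsigned)key / 64) % NSUB; }

int main(void) {
    Tuple part[] = {{5,0},{70,1},{130,2},{200,3},{260,4},{330,5}};
    int n = 6;

    Tuple *sub[NSUB]; int cnt[NSUB] = {0};
    for (int s = 0; s < NSUB; s++) sub[s] = malloc(n * sizeof(Tuple));

    for (int i = 0; i < n; i++) {        /* one more copy of all tuples */
        int s = sub_of(part[i].key);
        sub[s][cnt[s]++] = part[i];
    }
    for (int s = 0; s < NSUB; s++)
        printf("cache-sized piece %d holds %d tuples\n", s, cnt[s]);
    return 0;
}
```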
Cache Prefetching • Reduces the impact of cache misses • Exploits available memory bandwidth • Overlaps cache misses with computation • Inserts cache prefetch instructions into the join code • Still incurs the same number of cache misses
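One common way to realize this idea (a sketch, not necessarily the exact scheme of [Chen et al.’04]) is to issue a prefetch for a later probe's bucket while working on the current one. The code below assumes GCC/Clang's __builtin_prefetch intrinsic, a direct-mapped table of keys, and an illustrative prefetch distance DIST:

```c
/* Prefetching during hash table probes: while probing tuple i we
 * prefetch the bucket that tuple i+DIST will touch, overlapping its
 * cache miss with current computation. */
#include <stdio.h>

#define HSIZE 1024
#define DIST  8            /* prefetch distance (assumed) */

typedef struct { int key; int payload; } Tuple;

int probe_all(const int *table_keys, const Tuple *probe, int n) {
    int matches = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n) {
            unsigned b = (unsigned)probe[i + DIST].key % HSIZE;
            __builtin_prefetch(&table_keys[b], 0 /*read*/, 1);
        }
        unsigned bb = (unsigned)probe[i].key % HSIZE;
        if (table_keys[bb] == probe[i].key) matches++;
    }
    return matches;
}

int main(void) {
    static int table_keys[HSIZE];
    Tuple probe[16];
    for (int i = 0; i < 16; i++) probe[i] = (Tuple){ i * 7, i };
    for (int i = 0; i < 16; i++) table_keys[(unsigned)(i * 7) % HSIZE] = i * 7;
    printf("%d matches\n", probe_all(table_keys, probe, 16));
    return 0;
}
```

Note that prefetching only hides miss latency; as the slide says, the number of misses is unchanged, which is why memory bandwidth becomes the limit on SMPs.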
Outline • Motivation • Previous hash join algorithms • Hash join performance on SMP systems • Inspector join • Experimental results • Conclusions
Hash Joins on SMP Systems • Previous studies mainly focus on uni-processors • Memory bandwidth is precious • It becomes the bottleneck in cache-prefetching techniques • Each processor joins one pair of build/probe partitions in the join phase (diagram: four CPUs, each with a private cache, share a bus to main memory holding the build/probe partition pairs)
Inspector Joins • Extracted information: a summary of matching relationships • Every K contiguous pages in a build partition form a sub-partition • The summary tells which sub-partition(s) every probe tuple matches
Cache-Stationary Join Phase • Recall cache partitioning: re-partitioning imposes a copying cost on both the build and probe partitions • We want to achieve zero copying
Cache-Stationary Join Phase • Joins each sub-partition with its matching probe tuples • A sub-partition is small enough to fit in the CPU cache • Cache prefetching covers the remaining cache misses • Zero copying: no recursive generation of cache-sized partitions is needed
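Below is a minimal sketch of this join phase under stated assumptions: the inspection phase has already produced, for every sub-partition s, an index list plist[s] of the probe tuples that can match it (hard-coded here; a counting-sort sketch for building such lists appears later), so each sub-partition's hash table stays cache-resident and no tuple is copied.

```c
/* Cache-stationary join sketch: one small hash table per sub-partition,
 * probed only by the probe tuples inspection mapped to it. */
#include <stdio.h>

#define NSUB  3
#define HSIZE 31

typedef struct { int key; int payload; } Tuple;

int main(void) {
    /* Build partition stored as NSUB contiguous sub-partitions. */
    Tuple sub[NSUB][2] = {{{1,10},{2,20}}, {{33,30},{34,40}}, {{65,50},{66,60}}};
    Tuple probe[] = {{2,200},{33,300},{66,600},{2,201}};
    /* plist[s]: indices of probe tuples matching sub-partition s
     * (assumed output of inspection); -1 terminates each list. */
    int plist[NSUB][4] = {{0,3,-1,-1},{1,-1,-1,-1},{2,-1,-1,-1}};

    for (int s = 0; s < NSUB; s++) {
        /* Hash table over one sub-partition: small enough to stay
         * resident in the CPU cache while it is being probed. */
        int slot[HSIZE]; for (int i = 0; i < HSIZE; i++) slot[i] = -1;
        for (int i = 0; i < 2; i++) slot[(unsigned)sub[s][i].key % HSIZE] = i;
        for (int j = 0; plist[s][j] != -1; j++) {
            Tuple p = probe[plist[s][j]];
            int i = slot[(unsigned)p.key % HSIZE];
            if (i >= 0 && sub[s][i].key == p.key)
                printf("match key=%d build=%d probe=%d\n",
                       p.key, sub[s][i].payload, p.payload);
        }
    }
    return 0;
}
```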
Filters in I/O Partitioning • How to extract the summary efficiently? • Extend the filter scheme used in commercial hash joins • Conventional single-filter scheme: • Represents all build join keys • Filters out probe tuples having no matches • (Construct the filter while partitioning the build relation; test it while partitioning the probe relation)
Background: Bloom Filter • A bit vector • A key is hashed d times (e.g., d = 3) and represented by d bits: Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key) • Construct: for every build join key, set its 3 bits in the vector • Test: given a probe join key, check whether all its 3 bits are 1 • Discard the tuple if some bits are 0 • May have false positives
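A compact Bloom filter sketch matching the slide, with d = 3; the hash functions, mixing constants, and filter size are illustrative assumptions:

```c
/* Bloom filter with d = 3 hash functions: construct sets 3 bits per
 * build key; test checks the same 3 bits for a probe key. */
#include <stdio.h>
#include <string.h>

#define FILTER_BITS 1024  /* filter size (assumed) */

static unsigned char filter[FILTER_BITS / 8];

static unsigned h(int key, int seed) {       /* illustrative hash */
    unsigned x = (unsigned)key * 2654435761u + (unsigned)seed * 40503u;
    return (x >> 7) % FILTER_BITS;
}

static void set_bit(unsigned b) { filter[b / 8] |= 1u << (b % 8); }
static int  get_bit(unsigned b) { return (filter[b / 8] >> (b % 8)) & 1; }

static void construct(int key) {             /* build side */
    for (int d = 0; d < 3; d++) set_bit(h(key, d));
}

static int test(int key) {                   /* probe side */
    for (int d = 0; d < 3; d++)
        if (!get_bit(h(key, d))) return 0;   /* definitely no match */
    return 1;                                /* maybe a match (false positives possible) */
}

int main(void) {
    memset(filter, 0, sizeof filter);
    int build_keys[] = {3, 17, 42};
    for (int i = 0; i < 3; i++) construct(build_keys[i]);
    /* Expect test(17)=1; test(99) is likely 0 (barring a false positive). */
    printf("test(17)=%d test(99)=%d\n", test(17), test(99));
    return 0;
}
```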
Multi-Filter Scheme • Single filter: maps a probe tuple to the entire build relation • Our goal: map a probe tuple to its matching sub-partition(s) • Construct a filter for every sub-partition • Replace a single large filter with multiple small filters
Testing Multi-Filters • When partitioning the probe relation: • Test each probe tuple against all the filters of its partition • This tells which sub-partition(s) the tuple may match • Store the summary of matching relationships in the partitions • This information is later used to extract probe tuples in the order of sub-partition IDs; a special array is constructed with the counting sort technique for this purpose
Cont’d… • Extract probe tuple information for every sub-partition using counting sort • One array per sub-partition; its size is the number of matching probe tuples for that sub-partition • The tuples themselves are never visited or copied during the counting sort • Then join each pair of build and probe sub-partitions
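A minimal counting-sort sketch of this step: given, for each probe tuple, the sub-partition id recorded during filter testing, it builds one index array per sub-partition without touching the tuples themselves. For simplicity each probe tuple maps to a single sub-partition here; the mapping array sub_of is an assumed stand-in for the stored summary.

```c
/* Counting sort over sub-partition ids: count, allocate exact-sized
 * arrays, then scatter tuple *indices* (never the tuples) into them. */
#include <stdio.h>
#include <stdlib.h>

#define NSUB 3

int main(void) {
    /* sub_of[i] = sub-partition that probe tuple i may match
     * (assumed output of the multi-filter test). */
    int sub_of[] = {2, 0, 1, 0, 2, 0};
    int n = 6;

    int cnt[NSUB] = {0};
    for (int i = 0; i < n; i++) cnt[sub_of[i]]++;    /* counting pass */

    int *list[NSUB], fill[NSUB] = {0};
    for (int s = 0; s < NSUB; s++) list[s] = malloc(cnt[s] * sizeof(int));
    for (int i = 0; i < n; i++)                       /* scatter pass */
        list[sub_of[i]][fill[sub_of[i]]++] = i;       /* indices only */

    for (int s = 0; s < NSUB; s++) {
        printf("sub-partition %d:", s);
        for (int j = 0; j < cnt[s]; j++) printf(" %d", list[s][j]);
        printf("\n");
    }
    return 0;
}
```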
Minimizing Cache Misses for Testing Filters • Single-filter scheme: • Compute 3 bit positions • Test 3 bits • Multi-filter scheme, with S sub-partitions per partition: • Compute 3 bit positions • Test the same 3 bit positions in every filter, altogether 3*S bits • May cause 3*S cache misses!
Vertical Filters for Testing • Bits at the same position across the S filters are stored contiguously in memory • 3 cache misses instead of 3*S cache misses! • Horizontal-to-vertical conversion after partitioning the build relation • Very small overhead in practice
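A sketch of the vertical layout under assumed sizes: bit position b of all S filters is packed into one S-bit word, so testing a probe key's 3 bit positions touches 3 words (roughly 3 cache lines) instead of up to 3*S scattered locations. The word width (here uint32_t, so S <= 32) and hash functions are illustrative.

```c
/* Vertical filters: vert[b] holds bit b of every sub-partition filter.
 * Testing a key ANDs 3 words; bit s of the result is set iff all 3 of
 * the key's bits are set in filter s, i.e. the key may match
 * sub-partition s. */
#include <stdio.h>
#include <stdint.h>

#define FILTER_BITS 1024   /* bits per filter (assumed) */

static uint32_t vert[FILTER_BITS];

static unsigned h(int key, int seed) {       /* illustrative hash */
    unsigned x = (unsigned)key * 2654435761u + (unsigned)seed * 40503u;
    return (x >> 7) % FILTER_BITS;
}

/* Construct: set the 3 bits of a build key in sub-partition s's filter. */
static void construct(int key, int s) {
    for (int d = 0; d < 3; d++) vert[h(key, d)] |= 1u << s;
}

/* Test: 3 contiguous-word reads for all S filters at once. */
static uint32_t test(int key) {
    return vert[h(key, 0)] & vert[h(key, 1)] & vert[h(key, 2)];
}

int main(void) {
    construct(42, 0);      /* key 42 lives in sub-partition 0 */
    construct(77, 2);      /* key 77 lives in sub-partition 2 */
    /* Expect roughly 0x1, 0x4, 0 (false positives possible). */
    printf("test(42)=%#x test(77)=%#x test(5)=%#x\n",
           test(42), test(77), test(5));
    return 0;
}
```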
Outline • Motivation • Previous hash join algorithms • Hash join performance on SMP systems • Inspector join • Experimental results • Conclusions
Experimental Setup • Relation schema: 4-byte join attribute + fixed-length payload • No selection, no projection • 50MB memory per CPU available for the join phase • Same join algorithm runs on every CPU, joining different partitions • Detailed cycle-by-cycle simulations • A shared-bus SMP system with 1.5GHz processors • Memory hierarchy is based on the Itanium 2 processor
Partition Phase Wall-Clock Time (curves: GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, inspector join; x-axis: number of CPUs used) • I/O partitioning can take advantage of multiple CPUs: • Cut input relations into equal-sized chunks • Partition one chunk on every CPU • Concatenate the outputs from all CPUs • Enhanced cache partitioning = cache partitioning + advanced prefetching • Inspection incurs very small overhead • Execution time ratio versus the best algorithm: 0.88 to 0.94 • Mainly the computation cost of converting horizontal filters to vertical ones and testing them • Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples
Join Phase Aggregate Time (curves: GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, inspector join; x-axis: number of CPUs used) • Inspector join achieves significantly better performance when 8 or more CPUs are used • Thanks to cache-local optimization plus cache prefetching • 1.7-2.1X speedups over cache prefetching • Memory bandwidth becomes the bottleneck as more processors are used • 1.6-2.0X speedups over enhanced cache partitioning • Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples
Results on Choosing a Suitable Join Phase • Case #1: a large number of duplicate build join keys • Choose enhanced cache partitioning • When a probe tuple on average matches 4 or more sub-partitions • Case #2: nearly sorted input relations • Surprisingly: cache-stationary join is very good
Conclusions • Exploit the multi-pass structure for higher-quality information about the data • Achieve significantly better cache performance • 1.6X speedups over previous cache-friendly algorithms • When 8 or more CPUs are used • Choose the most suitable algorithm for special input cases • The idea may be applicable to other multi-pass algorithms
Thank You!
Previous Algorithms on SMP Systems (charts: wall-clock time and aggregate time on all CPUs vs. number of CPUs used, for GRACE, cache partitioning, and cache prefetching; annotations: bandwidth sharing, re-partition cost) • Join phase performance of joining a 500MB and a 2GB relation (details later in the talk) • Aggregate performance degrades dramatically beyond 4 CPUs • Key: reduce data movement (memory to memory, memory to cache)
More Details in Paper • Moderate memory space requirement for filters • Summary information representation in intermediate partitions • Preprocessing for cache-stationary join phase • Prefetching for improving efficiency and robustness
CPU-Cache-Friendly Hash Joins • Recent studies focus on CPU cache performance • I/O partitioning gives good I/O performance • Random memory accesses cause poor CPU cache performance • Cache Partitioning [Shatdal et al.’94] [Boncz et al.’99] [Manegold et al.’00] • Recursively produce cache-sized partitions from memory-sized partitions • Avoid cache misses during the join phase • Pay a re-partitioning cost • Cache Prefetching [Chen et al.’04] • Exploit memory system parallelism • Use prefetches to overlap multiple cache misses and computation
Special Input Cases • Example case #1: a large number of duplicate build join keys • Count the average number of sub-partitions a probe tuple matches • Must check the tuple against all possible sub-partitions • If too large, cache-stationary join works poorly • Example case #2: nearly sorted input relations • A merge-based join phase might be better?
Varying Number of Duplicates per Build Join Key • Join phase aggregate performance • Choose enhanced cache partitioning • When a probe tuple on average matches 4 or more sub-partitions
Nearly Sorted Cases • Sort both input relations, then randomly move 0%-5% of the tuples • Join phase aggregate performance • Surprisingly: cache-stationary join is very good • Even better than merge join when over 1% of tuples are out of order
Analyzing the Nearly Sorted Case • Partitions are also nearly sorted • Probe tuples matching a sub-partition are almost contiguous • Similar memory behavior to merge join • No cost for sorting out-of-order tuples