Exploiting Multithreaded Architectures to Improve Data Management Operations
Layali Rashid
The Advanced Computer Architecture Group @ U of C (ACAG)
Department of Electrical and Computer Engineering
University of Calgary
Outline
• The SMT and the CMP Architectures
• Join (Hash Join)
  • Motivation
  • Algorithm
  • Results
• Sort (Radix and Quick Sorts)
  • Motivation
  • Algorithms
  • Results
• Index (CSB+-Tree)
  • Motivation
  • Algorithm
  • Results
• Conclusions
The SMT and the CMP Architectures
• Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor.
• Chip Multiprocessor (CMP): more than one processor is integrated on a single chip.
Hash Join Motivation
• Hash join is one of the most important operations in current commercial DBMSs.
• The L2 cache load miss rate is a critical factor in main-memory hash join performance.
• Goal: increase the level of parallelism in hash join.
Architecture-Aware Hash Join (AA_HJ)
• Build Index Partition Phase
  • Tuples are divided equally among the threads; each thread has its own set of L2-cache-sized clusters.
• Build and Probe Index Partition Phase
  • One thread builds a hash table from each key range, while the other threads index-partition the probe relation in the same way as in the previous phase.
• Probe Phase
  • See figure.
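The three AA_HJ phases can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the function names, the `(key, payload)` tuple layout, and the `n_threads` / `n_clusters` parameters are all assumptions, cluster sizes are not actually tuned to the L2 cache, and Python threads only illustrate the control flow (the interpreter lock prevents real parallel speedup).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_partition(rows, offset, n_clusters):
    """Group tuple *indexes* by key range into one thread-private set of clusters."""
    clusters = [[] for _ in range(n_clusters)]
    for i, (key, _payload) in enumerate(rows):
        clusters[hash(key) % n_clusters].append(offset + i)
    return clusters


def split(n_rows, n_threads):
    """Cut [0, n_rows) into at most n_threads nearly equal slices."""
    step = max(1, -(-n_rows // n_threads))  # ceiling division
    return [(lo, min(lo + step, n_rows)) for lo in range(0, n_rows, step)]


def partition_relation(pool, rel, n_threads, n_clusters):
    """Each task index-partitions its own equal share of the relation."""
    futures = [pool.submit(index_partition, rel[lo:hi], lo, n_clusters)
               for lo, hi in split(len(rel), n_threads)]
    return [f.result() for f in futures]


def aa_hj_sketch(build, probe, n_threads=4, n_clusters=8):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Phase 1: index-partition the build relation.
        build_parts = partition_relation(pool, build, n_threads, n_clusters)

        # Phase 2: one task builds a hash table per key range while the
        # remaining workers index-partition the probe relation.
        def build_tables():
            tables = []
            for c in range(n_clusters):
                table = defaultdict(list)
                for per_thread in build_parts:
                    for idx in per_thread[c]:
                        table[build[idx][0]].append(idx)
                tables.append(table)
            return tables

        tables_future = pool.submit(build_tables)
        probe_parts = partition_relation(pool, probe, max(1, n_threads - 1), n_clusters)
        tables = tables_future.result()

        # Phase 3: probe each key range against its own hash table.
        def probe_cluster(c):
            out = []
            for per_thread in probe_parts:
                for j in per_thread[c]:
                    for i in tables[c].get(probe[j][0], ()):
                        out.append((build[i][1], probe[j][1]))
            return out

        chunks = list(pool.map(probe_cluster, range(n_clusters)))
    return [pair for chunk in chunks for pair in chunk]
```

Partitioning indexes rather than copying tuples keeps each cluster small, which is the property the slide relies on when it sizes clusters to the L2 cache.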
AA_HJ Results
• We achieve speedups ranging from 2 to 4.6 compared to PT on a Quad Intel Xeon Dual Core server.
• Speedups on the Pentium 4 with HT range from 2.1 to 2.9 compared to PT.
Memory-Analysis for Multithreaded AA_HJ
• The decrease in the L2 load miss rate is due to cache-sized index partitioning, constructive cache sharing and Group Prefetching.
• The L1 data cache load miss rate increases slightly, from 1.5% to 4%.
The Sort Motivation
• Some researchers find that sort algorithms suffer from high level-two (L2) cache miss rates.
• Others point out that radix sort has high TLB miss rates.
• In addition, most sort algorithms are sequential, which makes it hard to derive efficient parallel sort algorithms from them.
• In our work we target Radix Sort (distribution-based sort) and Quick Sort (comparison-based sort).
Our Parallel Sorts
• Radix Sort
  • A hybrid of Partition Parallel Radix Sort and Cache-Conscious Radix Sort.
  • Repartition large destination buckets only when they are significantly larger than the L2 cache size.
• Quick Sort
  • Use Fast Parallel Quick Sort.
  • Dynamically balance the load across threads.
  • Improve thread parallelism during the sequential cleanup sorting.
  • Stop the recursive partitioning when the size of the subarray is almost equal to the largest cache size.
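The radix-sort rule above (repartition a destination bucket only when it is significantly larger than the cache) can be illustrated with a small single-threaded sketch. It is not the hybrid algorithm from the slides: `cache_keys` (how many keys are assumed to fit in the L2 cache), `slack` (what counts as "significantly larger"), and the use of Python's built-in `sorted` for buckets that already fit are all assumptions made for illustration.

```python
def radix_partition_sort(keys, shift=24, digit_bits=8, cache_keys=4096, slack=2):
    """Sort non-negative integers smaller than 2**(shift + digit_bits).

    Keys are distributed into buckets on one radix digit, most significant
    digit first. A bucket is repartitioned on the next digit only if it holds
    more than slack * cache_keys keys; smaller buckets are sorted directly.
    """
    if len(keys) <= 1:
        return list(keys)

    mask = (1 << digit_bits) - 1
    buckets = [[] for _ in range(1 << digit_bits)]
    for k in keys:
        buckets[(k >> shift) & mask].append(k)

    out = []
    for bucket in buckets:
        if len(bucket) > slack * cache_keys and shift > 0:
            # Oversized bucket: spend another partitioning pass on it.
            next_shift = max(shift - digit_bits, 0)
            out.extend(radix_partition_sort(bucket, next_shift, digit_bits,
                                            cache_keys, slack))
        else:
            # Bucket is assumed to fit in cache: finish it locally.
            out.extend(sorted(bucket))
    return out
```

Skipping the extra pass for buckets that are only slightly over the cache size trades a few cache misses for one fewer full sweep over the data, which appears to be the intent of the repartitioning rule.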
The Sort Timing for the Random Datasets on the SMT Architecture
• Radix Sort and Quick Sort show low L1 and L2 cache miss rates on our machines. Radix Sort has a DTLB store miss rate of up to 26%.
• Radix Sort achieves only a slight speedup on SMT architectures, not exceeding 3%, due to its CPU-intensive nature.
• Execution time for Quick Sort improves by about 25% to 30%.
(Figure panels: Radix Sort, Quick Sort)
The Sort Timing for the Random Datasets on the CMP Architecture
• Speedups for Radix Sort range from 54% (two threads) up to 300% as the thread count grows from 2 to 8.
• Speedups for Quick Sort range from 34% to 417%.
(Figure panels: Radix Sort, Quick Sort)
The Index Motivation
• Although the CSB+-tree achieves significant speedup over the B+-tree, experiments show that a large fraction of its execution time is still spent waiting for data.
• The L2 load miss rate for the single-threaded CSB+-tree is as high as 42%.
Dual-threaded CSB+-Tree
• One CSB+-tree.
• A single thread for bulk loading.
• Two threads for probing.
• Unlike inserts and deletes, search needs no synchronization since it involves reads only.
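The scheme above (one shared index, single-threaded bulk load, two lock-free probing threads) can be sketched as follows. A sorted array searched with `bisect` stands in for the CSB+-tree here, which is an assumption made to keep the example short; the function names are hypothetical, and Python threads illustrate only the structure, not the speedup.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor


def bulk_load(pairs):
    """Single-threaded bulk load: sort once into parallel key/value arrays."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    return [k for k, _ in ordered], [v for _, v in ordered]


def probe(index, keys):
    """Read-only lookups; returns None for keys that are not in the index."""
    index_keys, index_values = index
    found = []
    for key in keys:
        pos = bisect.bisect_left(index_keys, key)
        hit = pos < len(index_keys) and index_keys[pos] == key
        found.append(index_values[pos] if hit else None)
    return found


def dual_threaded_probe(index, keys):
    """Split the probe keys in half; both threads share one index, no locks."""
    half = len(keys) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(probe, index, keys[:half])
        second = pool.submit(probe, index, keys[half:])
        return first.result() + second.result()
```

No lock is needed because the index is never modified after `bulk_load` returns; concurrent inserts or deletes would require synchronization, as the last bullet notes.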
Index Results
• Speedups for the dual-threaded CSB+-tree range from 19% to 68% compared to the single-threaded CSB+-tree.
• Running two threads for memory-bound operations offers more chances to keep the functional units busy.
• Sharing one CSB+-tree between both threads results in constructive behaviour and a reduction of 6% to 8% in the L2 miss rate.
Conclusions
• State-of-the-art parallel architectures (SMT and CMP) have opened opportunities to improve software operations so that they better utilize the underlying hardware resources.
• It is essential to have efficient implementations of database operations.
• We propose architecture-aware multithreaded algorithms for the most important database operations (joins, sorts and indexes).
• We characterize the timing and memory behaviour of these database operations.
Figure 1‑2: Comparison between the SMT and the Dual Core Architectures
Figure 2‑8: Hash Join Base Algorithm
    partition R into R0, R1, …, Rn-1
    partition S into S0, S1, …, Sn-1
    for i = 0 until i = n-1
        use Ri to build hash-tablei
    for i = 0 until i = n-1
        probe Si using hash-tablei
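The base algorithm in Figure 2‑8 can be written as a short runnable sketch. The `(key, payload)` tuple layout, the hash-based partitioning function, and the default `n=4` are assumptions for illustration; the figure itself does not fix them.

```python
from collections import defaultdict


def partition(relation, n):
    """Split a relation into n partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for key, payload in relation:
        parts[hash(key) % n].append((key, payload))
    return parts


def partitioned_hash_join(r, s, n=4):
    """Equi-join r and s on their keys, one partition pair at a time."""
    r_parts = partition(r, n)
    s_parts = partition(s, n)

    # Build: one hash table per partition Ri.
    tables = []
    for r_i in r_parts:
        table = defaultdict(list)
        for key, payload in r_i:
            table[key].append(payload)
        tables.append(table)

    # Probe: tuples of Si can only match entries of hash-table i.
    result = []
    for s_i, table in zip(s_parts, tables):
        for key, s_payload in s_i:
            for r_payload in table.get(key, ()):
                result.append((key, r_payload, s_payload))
    return result
```

Because both relations use the same partitioning function, matching keys always land in the same partition index, so each probe touches only one small hash table.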
Figure 2‑10: AA_HJ Probe Index Partitioning Phase Executed by one Thread
Figure 2‑11: AA_HJ S-Relation Partitioning and Probing Phases
Figure 2‑13: Timing for three Hash Join Partitioning Techniques
Figure 2‑14: Memory Usage for three Hash Join Partitioning Techniques
Figure 2‑18: Memory Usage Comparison of all Hash Join Algorithms
Figure 2‑19: Speedups due to the AA_HJ+SMT and the AA_HJ+GP+SMT Algorithms
Figure 2‑20: Varying Number of Clusters for the AA_HJ+GP+SMT
Figure 2‑21: Varying the Selectivity for Tuple Size = 100Bytes
Figure 2‑22: Time Breakdown Comparison for the Hash Join Algorithms for tuple sizes 20Bytes and 100Bytes
Figure 2‑23: Timing for the Multi-threaded Architecture-Aware Hash Join
Figure 2‑24: Speedups for the Multi-Threaded Architecture-Aware Hash Join
Figure 2‑25: Memory Usage for the Multi-Threaded Architecture-Aware Hash Join
Figure 2‑26: Time Breakdown Comparison for Hash Join Algorithms
Figure 2‑27: The L1 Data Cache Load Miss Rate for NPT and AA_HJ