This paper presents GPU-based implementations of the Apriori algorithm for frequent itemset mining, utilizing bitmap and trie data structures. The performance of the implementations is evaluated and compared.
Frequent Itemset Mining on Graphics Processors. Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He (1), Qiong Luo. Hong Kong Univ. of Sci. and Tech.; (1) Microsoft Research Asia. Presenter: Wenbin Fang
Outline • Contribution • Introduction • Design • Evaluation • Conclusion 2/33
Contribution • Accelerate the Apriori algorithm for Frequent Itemset Mining using Graphics Processors (GPUs). • Two GPU implementations: • Pure Bitmap-based implementation (PBI): processing entirely on the GPU. • Trie-based implementation (TBI): GPU/CPU co-processing. 3/33
Frequent Itemset Mining (FIM) Finding groups of items, or itemsets that co-occur frequently in a transaction database. Minimum support: 2 1-itemsets (frequent items): A : 3 B : 2 C : 3 D : 4 4/33
Frequent Itemset Mining (FIM) Aims at finding groups of items, or itemsets that co-occur frequently in a transaction database. Minimum support: 2 1-itemsets (frequent items): A, B, C, D 2-itemsets: AB: 2 AC: 2 AD: 3 BD: 2 CD: 3 5/33
Frequent Itemset Mining (FIM) Aims at finding groups of items, or itemsets that co-occur frequently in a transaction database. Minimum support: 2 1-itemsets (frequent items): A, B, C, D 2-itemsets: AB, AC, AD, BD, CD 3-itemsets: ABD, ACD 6/33
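The support counts on these slides can be reproduced with a brute-force check. The sketch below is only illustrative: the four-transaction database is a hypothetical one, chosen so that its counts agree with the supports listed above (the original table is not reproduced in this text), and the function names are assumptions.

```python
from itertools import combinations

# Hypothetical transaction database, picked only so that the counts match
# the supports listed on the slides.
TRANSACTIONS = [
    {"A", "B", "D"},
    {"A", "B", "C", "D"},
    {"A", "C", "D"},
    {"C", "D"},
]
MIN_SUPPORT = 2


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def frequent_itemsets_of_size(k, transactions, min_support):
    """Brute force: test every k-combination of the items that occur."""
    items = sorted(set().union(*transactions))
    result = {}
    for combo in combinations(items, k):
        s = support(combo, transactions)
        if s >= min_support:
            result["".join(combo)] = s
    return result


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, frequent_itemsets_of_size(k, TRANSACTIONS, MIN_SUPPORT))
```

Brute force enumerates every combination, which is exactly the cost that Apriori's candidate pruning is meant to avoid.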
Graphics Processors (GPUs) • Exist in commodity machines, mainly for graphics rendering. • Specialized for compute-intensive, highly data-parallel applications. • Compared with CPUs, GPUs provide about 10x the computational horsepower and 10x the memory bandwidth. (Figure: CPU vs. GPU, from the NVIDIA CUDA Programming Guide.) 7/33
Programming on GPUs • OpenGL/DirectX • AMD CTM • NVIDIA CUDA SIMD parallelism (Single Instruction, Multiple Data) The many-core architecture model of the GPU 8/33
Hierarchical multi-threading in NVIDIA CUDA • A kernel launch specifies the # of thread blocks and the # of threads in a thread block. • Each thread block is divided into warps. • A warp = 32 GPU threads => SIMD schedule unit. 9/33
General Purpose GPU Computing (GPGPU) • Applications utilizing GPUs • Scientific computing • Molecular dynamics simulation • Weather forecasting • Linear algebra • Computational finance • Folding@home, SETI@home • Database applications • Basic DB operators [SIGMOD’04] • Sorting [SIGMOD’06] • Join [SIGMOD’08] 10/33
Our work • As a first step, we consider GPU-based Apriori, with the intention of extending the work to another efficient FIM algorithm, FP-growth. • Why Apriori? • It is a classic algorithm for mining frequent itemsets. • It is also applied in other data mining tasks, e.g., clustering and functional dependency discovery. 11/33
The Apriori Algorithm
Input: 1) Transaction database 2) Minimum support
Output: All frequent itemsets
L1 = {All frequent 1-itemsets}
k = 2
While (Lk-1 != empty) {
    // Generate candidate k-itemsets.
    Ck <- Self join on Lk-1
    Ck <- (k-1)-subset test on Ck
    // Generate frequent k-itemsets.
    Lk <- Support counting on Ck
    k += 1
}
Frequent 1-itemsets -> Candidate 2-itemsets -> Frequent 2-itemsets -> Candidate 3-itemsets -> Frequent 3-itemsets -> … -> Candidate (K-1)-itemsets -> Frequent (K-1)-itemsets -> Candidate K-itemsets -> Frequent K-itemsets
12/33
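The loop on this slide can be written out as a short CPU-only sketch. This is not the presenters' implementation: it uses plain Python sets, a naive pairwise self join, and a full scan for support counting, and the function name `apriori` and the sample database are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""

    def count(candidates):
        # Support counting by scanning every transaction for every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: s for c, s in counts.items() if s >= min_support}

    items = set().union(*transactions)
    level = count({frozenset([i]) for i in items})  # L1
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Self join: unions of two frequent (k-1)-itemsets that have k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # (k-1)-subset test: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)  # Lk
        frequent.update(level)
        k += 1
    return frequent


if __name__ == "__main__":
    # Hypothetical database, consistent with the supports shown on the FIM slides.
    db = [{"A", "B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C", "D"}]
    for itemset, s in sorted(apriori(db, 2).items(), key=lambda p: (len(p[0]), sorted(p[0]))):
        print("".join(sorted(itemset)), s)
```

The loop stops once a level produces no frequent itemsets, which matches the `Lk-1 != empty` condition in the pseudocode.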
Outline • Contribution • Introduction • Design • Evaluation • Conclusion 13/33
GPU-based Apriori
Input: 1) Transaction database 2) Minimum support
Output: All frequent itemsets
Both implementations follow the same Apriori loop:
L1 = {All frequent 1-itemsets}
k = 2
While (Lk-1 != empty) {
    // Generate candidate k-itemsets.
    Ck <- Self join on Lk-1
    Ck <- (k-1)-subset test on Ck
    // Generate frequent k-itemsets.
    Lk <- Support counting on Ck
    k += 1
}
• Pure Bitmap-based Impl. (PBI): itemsets stored as a bitmap, candidate generation on the GPU; transactions stored as a bitmap, support counting on the GPU.
• Trie-based Impl. (TBI): itemsets stored in a trie, candidate generation on the CPU; transactions stored as a bitmap, support counting on the GPU.
14/33
Horizontal and Vertical data layout • Horizontal data layout: support counting scans all transactions. • Vertical data layout: support counting is done on specific itemsets: • Intersect two transaction lists. • Count the number of transactions in the intersection result. 15/33
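A small sketch may help contrast the two layouts. It assumes a hypothetical four-transaction database and illustrative function names; the vertical layout here uses Python sets of transaction ids rather than the bitmaps introduced on the next slide.

```python
def to_vertical(transactions):
    """Map each item to the set of transaction ids (tids) that contain it."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def support_horizontal(itemset, transactions):
    """Horizontal layout: scan all transactions and test containment."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support_vertical(tids_a, tids_b):
    """Vertical layout: intersect two tid lists and count the result."""
    return len(tids_a & tids_b)


if __name__ == "__main__":
    # Hypothetical database used only for illustration.
    db = [{"A", "B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C", "D"}]
    vertical = to_vertical(db)
    print(support_horizontal("AD", db))
    print(support_vertical(vertical["A"], vertical["D"]))
```

Both functions return the same support; the vertical version touches only the two lists involved instead of the whole database.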
Bitmap representation for transactions • One row per itemset, one bit per transaction (# of itemsets x # of transactions). • Intersection = bitwise AND operation • Counting = # of 1’s in a string of bits 16/33
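As a minimal illustration of this representation, the sketch below packs a transaction-id list into a Python integer, with bit i set when transaction i contains the itemset. The helper names and the example tid lists are assumptions, not taken from the slides.

```python
def to_bitmap(tids):
    """Pack a collection of transaction ids into an integer bitmap."""
    bits = 0
    for tid in tids:
        bits |= 1 << tid
    return bits


def popcount(bits):
    """Number of 1 bits, i.e. the number of transactions in the bitmap."""
    return bin(bits).count("1")


if __name__ == "__main__":
    # Hypothetical tid lists: AB occurs in transactions 0 and 1, AD in 0, 1 and 2.
    ab = to_bitmap([0, 1])
    ad = to_bitmap([0, 1, 2])
    abd = ab & ad              # intersection = bitwise AND
    print(bin(abd), popcount(abd))
```

Under these assumed tid lists the AND leaves two bits set, so the support of ABD is 2.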
Lookup table • # of 1’s = TABLE[12]; // decimal: 12, binary: 1100 (a string of bits) • Table size: 2^16 = 65536 entries x 1 byte • Constant memory: cacheable, 64 KB, shared by all GPU threads 17/33
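The table described here can be built on the host in a few lines. The sketch below is a plain Python stand-in (a `bytearray` instead of GPU constant memory); `popcount32` is an illustrative helper name, and splitting a 32-bit word into two 16-bit halves is an assumption about how such a table would typically be used.

```python
# TABLE[v] holds the number of 1 bits in the 16-bit value v.
# 2**16 one-byte entries occupy 64 KB.
TABLE = bytearray(1 << 16)
for v in range(1, 1 << 16):
    # Bits of v = bits of v without its lowest bit, plus the lowest bit.
    TABLE[v] = TABLE[v >> 1] + (v & 1)


def popcount32(word):
    """Count the 1 bits of a 32-bit word with two 16-bit table lookups."""
    return TABLE[word & 0xFFFF] + TABLE[(word >> 16) & 0xFFFF]


if __name__ == "__main__":
    print(TABLE[12])            # 12 is 1100 in binary
    print(popcount32(0xFFFF0003))
```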
Support Counting on the GPU (Cont.) • Intersect two transaction lists. • Count the number of transactions in the intersection result. (Figure: thread block 1, thread block 2, and the lookup table; count shown: 2.) 18/33
Support Counting on the GPU (Cont.) • Within a thread block, each thread accesses the vector type int4 in one instruction. • Example: Thread 1 and Thread 2 each AND four ints of AB with four ints of AD to produce ABD. • Lookup table: counts of 1’s for every 16-bit integer (counts: 2 in the example). • Parallel reduce: yields the support for this itemset (support: 2). 19/33
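The data flow on this slide can be simulated sequentially. The sketch below is only a CPU model under stated assumptions: each "thread" is one loop iteration that handles four 32-bit words (mimicking one int4 access), bit counting uses `bin(...).count` in place of the lookup table, and `tree_reduce` reproduces the pairwise access pattern of a parallel reduction without any real parallelism.

```python
WORDS_PER_THREAD = 4  # assumption: one int4 (four 32-bit ints) per thread


def partial_counts(row_a, row_b, words_per_thread=WORDS_PER_THREAD):
    """Each simulated thread ANDs its slice of words and counts the 1 bits."""
    assert len(row_a) == len(row_b)
    partials = []
    for start in range(0, len(row_a), words_per_thread):
        chunk = zip(row_a[start:start + words_per_thread],
                    row_b[start:start + words_per_thread])
        partials.append(sum(bin(a & b).count("1") for a, b in chunk))
    return partials


def tree_reduce(values):
    """Pairwise (tree-shaped) sum, as in a parallel reduction."""
    vals = list(values)
    if not vals:
        return 0
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0]


if __name__ == "__main__":
    # Hypothetical bitmap rows of eight 32-bit words each, i.e. two threads.
    row_ab = [0b0011, 0, 0, 0, 0, 0, 0, 0]
    row_ad = [0b0111, 0, 0, 0, 0, 0, 0, 0]
    partials = partial_counts(row_ab, row_ad)
    print(partials, tree_reduce(partials))
```

With these assumed rows the per-thread counts sum to 2, the same support value shown in the slide's example.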
GPU-based Apriori
Input: 1) Transaction database 2) Minimum support
Output: All frequent itemsets
L1 = {All frequent 1-itemsets}
k = 2
While (Lk-1 != empty) {
    // Generate candidate k-itemsets.
    Join
    Subset test
    // Generate frequent k-itemsets.
    Support counting
    k += 1
}
• Candidate generation • Join, e.g., join two 2-itemsets to obtain a candidate 3-itemset: AC JOIN AD => ACD • Subset test, e.g., test all 2-subsets of ACD: {AC, AD, CD} • Support counting on the GPU
20/33
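The two candidate-generation steps can be sketched on sorted tuples. This uses the common prefix-based join, which is an assumption here (the slide only gives the AC JOIN AD example); the function names are illustrative.

```python
from itertools import combinations


def join(a, b):
    """Join two sorted itemsets of equal length that share all but the last item.

    Returns the merged candidate, or None when the two do not join.
    """
    if a[:-1] == b[:-1] and a[-1] < b[-1]:
        return a + (b[-1],)
    return None


def passes_subset_test(candidate, frequent_prev):
    """True when every subset with one item removed is a known frequent itemset."""
    k = len(candidate)
    return all(s in frequent_prev for s in combinations(candidate, k - 1))


if __name__ == "__main__":
    frequent_2 = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")}
    acd = join(("A", "C"), ("A", "D"))
    print(acd, passes_subset_test(acd, frequent_2))
    abc = join(("A", "B"), ("A", "C"))
    print(abc, passes_subset_test(abc, frequent_2))
```

ACD survives because AC, AD and CD are all frequent, while ABC is dropped because BC is missing from the frequent 2-itemsets.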
GPU-based Apriori: Pure Bitmap-based Impl. (PBI) • Itemsets: bitmap. Candidate generation on the GPU. • Transactions: bitmap. Support counting on the GPU. 21/33
Pure Bitmap-based Impl. (PBI) • Itemsets are stored as a bitmap (# of itemsets x # of items). • Join: bitwise OR (e.g., AB JOIN AD = ABD). • Subset test: binary search (e.g., 2-subsets {AB, AD, BD}). • One GPU thread generates one candidate itemset. 22/33
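The following sketch mimics what a single candidate-generating thread might do in a bitmap-based scheme. It is a CPU illustration under assumptions: itemsets are encoded as integer bitmasks over a hypothetical item universe `ITEMS`, the frequent (k-1)-itemsets are kept as a sorted list of masks, and all function names are invented for this example.

```python
from bisect import bisect_left

ITEMS = "ABCD"  # hypothetical item universe; bit i stands for ITEMS[i]


def encode(itemset):
    """Encode an itemset with distinct items, e.g. "AD", as a bitmask."""
    return sum(1 << ITEMS.index(i) for i in itemset)


def contains(sorted_masks, mask):
    """Binary search for `mask` in a sorted list of itemset bitmasks."""
    pos = bisect_left(sorted_masks, mask)
    return pos < len(sorted_masks) and sorted_masks[pos] == mask


def generate_candidate(mask_a, mask_b, k, sorted_prev):
    """Join by bitwise OR, then test all (k-1)-subsets by binary search."""
    cand = mask_a | mask_b
    if bin(cand).count("1") != k:
        return None  # the OR does not produce a k-itemset
    bit = 1
    while bit <= cand:
        # Clear one item at a time and look the remaining subset up.
        if cand & bit and not contains(sorted_prev, cand & ~bit):
            return None
        bit <<= 1
    return cand


if __name__ == "__main__":
    prev = sorted(encode(s) for s in ["AB", "AC", "AD", "BD", "CD"])
    print(generate_candidate(encode("AB"), encode("AD"), 3, prev) == encode("ABD"))
    print(generate_candidate(encode("AB"), encode("AC"), 3, prev))
```

Keeping the masks sorted is what makes the binary-search subset test possible; on a GPU each such call would be independent of the others, which fits the one-thread-per-candidate design stated on the slide.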
GPU-based Apriori: Trie-based Impl. (TBI) • Itemsets: trie. Candidate generation on the CPU. • Transactions: bitmap. Support counting on the GPU. 23/33
Trie-based Impl. (TBI) • Depth 0: root. • Depth 1: 1-itemsets {A, B, C, D}. • Depth 2: 2-itemsets {AB, AC, AD, BD, CD}. • AB JOIN AC = ABC; the subset test on {AB, AC, BC} fails because BC is not frequent. • AB JOIN AD = ABD; the subset test on {AB, AD, BD} passes. • AC JOIN AD = ACD; the subset test on {AC, AD, CD} passes. • Candidate 3-itemsets: {ABD, ACD}. • Done on the CPU because of 1) irregular memory access and 2) branch divergence. 24/33
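A trie of itemsets and the sibling join can be sketched with nested dictionaries. This is an illustrative CPU version, not the presenters' data structure: the nested-dict layout and the names `build_trie`, `in_trie`, and `candidates_from_trie` are assumptions.

```python
from itertools import combinations


def build_trie(itemsets):
    """Nested dicts: each path from the root spells one sorted itemset."""
    root = {}
    for itemset in itemsets:
        node = root
        for item in itemset:
            node = node.setdefault(item, {})
    return root


def in_trie(root, itemset):
    """Follow the itemset's items from the root; False if the path breaks."""
    node = root
    for item in itemset:
        if item not in node:
            return False
        node = node[item]
    return True


def candidates_from_trie(root, k):
    """Join sibling nodes at depth k-1, then apply the (k-1)-subset test."""
    out = []

    def walk(node, prefix):
        if len(prefix) == k - 2:
            # Children of this node are (k-1)-itemsets sharing the same prefix.
            for a, b in combinations(sorted(node), 2):
                cand = prefix + (a, b)
                if all(in_trie(root, s) for s in combinations(cand, k - 1)):
                    out.append(cand)
            return
        for item in sorted(node):
            walk(node[item], prefix + (item,))

    walk(root, ())
    return out


if __name__ == "__main__":
    trie = build_trie(["AB", "AC", "AD", "BD", "CD"])
    print(candidates_from_trie(trie, 3))
```

The pointer chasing in `in_trie` and the data-dependent branches in `walk` are examples of the irregular memory access and branch divergence that the slide gives as reasons for keeping this step on the CPU.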
Outline • Contribution • Introduction • Design • Evaluation • Conclusion 25/33
Experimental setup • Platform • Experimental datasets • Density = Avg. Length / # of items 26/33
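The density formula is simple enough to check by hand. The sketch below assumes that "Avg. Length" means the average number of items per transaction and that "# of items" counts the distinct items in the dataset; the sample database is hypothetical and is not one of the experimental datasets.

```python
def density(transactions):
    """Average transaction length divided by the number of distinct items."""
    items = set().union(*transactions)
    avg_len = sum(len(t) for t in transactions) / len(transactions)
    return avg_len / len(items)


if __name__ == "__main__":
    # Hypothetical database: lengths 3, 4, 3, 2 over 4 distinct items.
    db = [{"A", "B", "D"}, {"A", "B", "C", "D"}, {"A", "C", "D"}, {"C", "D"}]
    print(density(db))
```

A value near 1 means most transactions contain most items (dense), while a value near 0 means each transaction touches only a small part of the item universe (sparse).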
Apriori Implementations Best Apriori implementation in FIMI repository. (Frequent Itemset Mining Implementations Repository) 27/33
TBI-CPU vs GOETHALS • Dense dataset: Chess • Sparse dataset: Retail • The impact of using the bitmap representation for transactions in support counting: 1.2x ~ 25.7x speedup. 28/33
TBI-GPU vs TBI-CPU • Dense dataset: Chess • Sparse dataset: Retail • The impact of GPU acceleration in support counting: 1.1x ~ 7.8x speedup. 29/33
PBI-GPU vs TBI-GPU • Dense dataset: Chess • Sparse dataset: Retail • The impact of bitmap-based versus trie-based itemsets in candidate generation. • PBI-GPU is faster on the dense dataset. • TBI-GPU is faster on the sparse dataset. 30/33
PBI-GPU/TBI-CPU vs BORGELT • Dense dataset: Chess • Sparse dataset: Retail • Comparison to the best Apriori implementation in FIMI: 1.2x ~ 24.2x. 31/33
Comparison to FP-growth With minsup 1%, 60%, and 0.01% PARSEC benchmark 32/33
Conclusion • GPU-based Apriori • Pure Bitmap-based impl. • Bitmap Representation for itemsets. • Bitmap Representation for transactions. • GPU processing. • Trie-based impl. • Trie Representation for itemsets. • Bitmap Representation for transactions. • GPU + CPU co-processing. • Better than CPU-based Apriori. • Still worse than CPU-based FP-growth 33/33
Backup Slide: Time Breakdown • Time breakdown on dense dataset Chess • Time breakdown on sparse dataset Retail