Cache-conscious Frequent Pattern Mining on a Modern Processor Amol Ghoting, Gregory Buehrer, and Srinivasan Parthasarathy Data Mining Research Laboratory, CSE, The Ohio State University Daehyun Kim, Anthony Nguyen, Yen-Kuang Chen, and Pradeep Dubey, Intel Corporation
Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions
Motivation • Data mining applications • Rapidly growing segment in commerce and science • Interactive response time is important • Compute- and memory-intensive • Modern architectures • Memory wall • Instruction level parallelism (ILP) [Figure: FP-Growth performance saturates (2.4x, 1.6x gaps shown). Note: experiment conducted on specialized hardware]
Contributions • We characterize the performance and memory access behavior of three state-of-the-art frequent pattern mining algorithms • We improve the performance of the three frequent pattern mining algorithms • Cache-conscious prefix tree • Spatial locality + hardware pre-fetching • Path tiling to improve temporal locality • Co-scheduling to improve ILP on a simultaneous multi-threaded (SMT) processor
Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions
Frequent pattern mining (1) • Finds groups of items that co-occur frequently in a transactional data set • Example data set: Customer 1: milk bread cereal; Customer 2: milk bread eggs sugar; Customer 3: milk bread butter; Customer 4: eggs sugar • Minimum support = 2 • Frequent patterns: • Size 1: (milk), (bread), (eggs), (sugar) • Size 2: (milk & bread), (sugar & eggs) • Size 3: None • Search space traversal: breadth-first or depth-first
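The example above can be reproduced with a brute-force miner (an illustrative sketch, not the paper's code): enumerate every candidate itemset, breadth-first over sizes as in Apriori, and keep those contained in at least min_support transactions.

```python
from itertools import combinations

# The four customer transactions from the slide.
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread", "eggs", "sugar"},
    {"milk", "bread", "butter"},
    {"eggs", "sugar"},
]

def frequent_itemsets(transactions, min_support):
    """Enumerate every candidate itemset and keep those whose support
    (number of transactions containing all its items) meets the threshold."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result

freq = frequent_itemsets(transactions, min_support=2)
```

Real miners avoid this exponential enumeration: Apriori prunes candidates whose subsets are infrequent, while pattern-growth methods such as FP-Growth never generate candidates at all.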
Frequent pattern mining (2) • Algorithms under study • FP-Growth (based on the pattern-growth method) • Winner of the 2003 Frequent Itemset Mining Implementations (FIMI) evaluation • Genmax (depth-first search space traversal) • Maximal pattern miner • Apriori (breadth-first search space traversal) • All algorithms use a prefix tree as a data set representation
Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions
Setup • Intel Xeon processor at 2 GHz with 4 GB of main memory • 4-way 8KB L1 data cache • 8-way 512KB L2 cache on chip • 8-way 2MB L3 cache • Intel VTune Performance Analyzer used to collect performance data • Implementations from the FIMI repository • FP-Growth (Gosta Grahne and Jianfei Zhu) • Genmax (Gosta Grahne and Jianfei Zhu) • Apriori (Christian Borgelt) • Custom memory managers
Execution time breakdown [Figure: execution time breakdown per algorithm; the dominant component is support counting in a prefix tree, similar to Count-FPGrowth()]
Operation mix [Figure: operation mix per counting routine. Note: each column need not sum up to 1]
Memory access behavior

                            Count-FPGrowth()   Count-GM()   Count-Apriori()
L1 hit rate                 89%                87%          86%
L2 hit rate                 43%                42%          49%
L3 hit rate                 39%                40%          27%
L3 misses per instruction   0.03               0.03         0.04
CPU utilization             9%                 9%           8%
Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions
FP-tree • Index based on the largest common prefix • Typically results in a compressed data set representation • Node pointers allow for fast searching of items • Each node stores: ITEM, COUNT, NODE POINTER, CHILD POINTERS, PARENT POINTER [Figure: FP-tree built incrementally from a data set with minimum support = 3]
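A pointer-based FP-tree of the kind the figure depicts can be sketched as follows (illustrative Python, not the FIMI implementation; the transactions are hypothetical and already sorted by descending item frequency):

```python
class FPNode:
    """FP-tree node: item, count, parent pointer, child pointers, and a
    node pointer chaining all nodes of the same item (the header list)."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}     # child pointers, keyed by item
        self.next_node = None  # node pointer: next node with the same item

def insert(root, items, header):
    """Insert one transaction, sharing the largest common prefix with
    paths already in the tree; new nodes are threaded onto header lists."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            # Thread the new node onto the front of its item's header list.
            child.next_node = header.get(item)
            header[item] = child
        child.count += 1
        node = child

# Hypothetical transactions, pre-sorted by descending item frequency.
root, header = FPNode(None, None), {}
for txn in [["a", "c", "f", "m", "p"], ["a", "c", "f", "m", "p"],
            ["a", "c", "f", "b", "m"], ["a", "b"]]:
    insert(root, txn, header)
```

The shared prefix a-c-f is stored once with incremented counts, which is why the tree is typically much smaller than the raw data set.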
FP-Growth • Basic step: • For each item in the FP-tree, build its conditional FP-tree • Recursively mine the conditional FP-tree • Dynamic data structure, and only two node fields are used through the bottom-up traversal → poor spatial locality • Large data structure → poor temporal locality • Pointer de-referencing → poor instruction level parallelism (ILP) [Figure: the conditional FP-tree for item m is built and mined; items p, f, c, a, b are processed similarly]
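The first half of the basic step, collecting the weighted prefix paths that seed an item's conditional FP-tree, can be sketched like this (illustrative only; nodes are plain dicts and the tiny tree fragment is hand-wired, not the paper's code):

```python
def conditional_pattern_base(header, item):
    """For every node labeled `item` (reached via the header's node-pointer
    chain), walk parent pointers up to the root and record the prefix path
    together with that node's count."""
    base = []
    node = header[item]
    while node is not None:
        path, p = [], node["parent"]
        while p is not None and p["item"] is not None:
            path.append(p["item"])
            p = p["parent"]
        base.append((path[::-1], node["count"]))
        node = node["next"]
    return base

# Hand-wired FP-tree fragment: root -> a:4 -> c:3 -> f:3 -> m:2
#                              and               f:3 -> b:1 -> m:1
root = {"item": None, "parent": None}
a = {"item": "a", "count": 4, "parent": root}
c = {"item": "c", "count": 3, "parent": a}
f = {"item": "f", "count": 3, "parent": c}
b = {"item": "b", "count": 1, "parent": f}
m2 = {"item": "m", "count": 1, "parent": b, "next": None}
m1 = {"item": "m", "count": 2, "parent": f, "next": m2}
header = {"m": m1}
```

FP-Growth then builds a conditional FP-tree from these weighted prefixes and recurses on it; note that each walk is a bottom-up pointer chase, which is exactly the access pattern the locality problems above describe.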
Cache-conscious prefix tree • Improves cache line utilization • Smaller node size + DFS allocation • Allows the use of hardware cache line pre-fetching for bottom-up traversals • Node pointers are moved into separate header lists and node counts into separate count lists; each node keeps ITEM, PARENT POINTER, CHILD POINTERS [Figure: prefix tree nodes allocated in DFS order, with header lists (for node pointers) and count lists (for node counts)]
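The layout change can be illustrated in miniature (a sketch of the idea, not the paper's C implementation): flatten a pointer tree into parallel arrays in DFS order, so each node shrinks to an (item, parent-index) pair and bottom-up traversals walk small contiguous arrays instead of chasing scattered heap pointers.

```python
def flatten_dfs(root):
    """Lay a pointer-based tree out in DFS order as parallel arrays:
    item[], parent[] (index of parent, -1 for the root), and count[]
    kept separate, mirroring the slide's count lists."""
    item, parent, count = [], [], []
    stack = [(root, -1)]
    while stack:
        node, parent_idx = stack.pop()
        idx = len(item)
        item.append(node["item"])
        parent.append(parent_idx)
        count.append(node["count"])
        for child in reversed(node["children"]):  # preserve left-to-right DFS
            stack.append((child, idx))
    return item, parent, count

def bottom_up(start, item, parent):
    """Bottom-up traversal over the flat arrays: follow parent indices,
    touching only two small, contiguous, prefetch-friendly arrays."""
    path, i = [], start
    while i != -1:
        path.append(item[i])
        i = parent[i]
    return path

# Hypothetical tree: r -> a -> c -> f, and r -> b.
tree = {"item": "r", "count": 4, "children": [
    {"item": "a", "count": 4, "children": [
        {"item": "c", "count": 3, "children": [
            {"item": "f", "count": 3, "children": []}]}]},
    {"item": "b", "count": 1, "children": []}]}
items, parents, counts = flatten_dfs(tree)
```

Because a DFS-ordered path occupies (mostly) adjacent array slots, a hardware stride prefetcher can stay ahead of the traversal, which is impossible with malloc'd nodes scattered across the heap.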
Path tiling to improve temporal locality • DFS order enables breaking the tree into tiles based on node addresses • These tiles can partially overlap • Maximize tree reuse • All tree accesses are restructured to operate on a tile-by-tile basis [Figure: tree partitioned into Tile 1 through Tile N]
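The restructuring can be sketched over flat DFS arrays (illustrative only): in DFS order a parent always precedes its children, so every bottom-up walk moves toward lower indices, and the tiles can be swept from the top of the array down while many walks are advanced together inside each tile.

```python
def tiled_bottom_up(item, parent, starts, tile_size):
    """Run many bottom-up traversals tile by tile: within each tile,
    advance every pending walk as far as it can go before the next tile
    is loaded, so each region of the tree is reused while it is hot."""
    n = len(item)
    paths = {s: [] for s in starts}
    cursor = {s: s for s in starts}  # current position of each walk
    # Sweep tiles from the highest node addresses down to the root's tile.
    for lo in range(((n - 1) // tile_size) * tile_size, -1, -tile_size):
        hi = lo + tile_size
        for s in starts:
            i = cursor[s]
            while i != -1 and lo <= i < hi:  # stay inside the current tile
                paths[s].append(item[i])
                i = parent[i]
            cursor[s] = i  # suspend until this walk's next tile is loaded
    return paths

# Flat DFS arrays for the hypothetical tree r -> a -> c -> f and r -> b.
items = ["r", "a", "c", "f", "b"]
parents = [-1, 0, 1, 2, 0]
paths = tiled_bottom_up(items, parents, starts=[3, 4], tile_size=2)
```

Each traversal produces the same path it would have produced unrestructured; only the order in which tree regions are touched changes, which is what converts scattered accesses into cache-sized bursts.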
Improving ILP using SMT • Simultaneous multi-threading (SMT) • Intel hyper-threading (HT) maintains two hardware contexts on chip • Improves instruction level parallelism (ILP) • When one thread waits, the other thread can use CPU resources • Identifying independent threads is not good enough • Unlikely to hide long cache miss latency • Can lead to cache interference (conflicts) • Solution: restructure the multi-threaded computation to reuse the cache on a tile-by-tile basis [Figure: Thread 1 and Thread 2 work on the same data but perform different computations, tile by tile]
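The co-scheduling idea can be caricatured with two software threads sharing each tile (purely illustrative: the paper targets two hardware contexts on one SMT core, which Python threads do not model, and the two computations here are hypothetical stand-ins).

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(items, target):
    """One thread's computation: count nodes of a given item in the tile."""
    return sum(1 for x in items if x == target)

def process_tiles(item, count, tile_size, target):
    """Both threads operate on the SAME tile (same data, different
    computation) and synchronize at the tile boundary, so they reuse the
    cached tile instead of evicting each other's working set."""
    total, hits = 0, 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        for lo in range(0, len(item), tile_size):
            hi = min(lo + tile_size, len(item))
            f1 = pool.submit(sum, count[lo:hi])               # thread 1
            f2 = pool.submit(count_matches, item[lo:hi], target)  # thread 2
            total += f1.result()  # tile boundary: wait for both threads
            hits += f2.result()
    return total, hits

items = ["r", "a", "c", "f", "b"]
counts = [4, 4, 3, 3, 1]
total, hits = process_tiles(items, counts, tile_size=2, target="a")
```

The design point is the per-tile barrier: letting the two threads drift onto different tiles would reintroduce the cache interference the slide warns about.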
Speedup for FP-Growth (synthetic data set) [Figure: speedup; x-axis values 4000, 4500, 5000, 5500]
Speedup for FP-Growth (real data set) [Figure: speedup; x-axis values 50000, 58350, 66650, 75000] • For FP-Growth, the L1 hit rate improves from 89% to 94% and the L2 hit rate improves from 43% to 98% • Speedup for Genmax: up to 4.5x; for Apriori: up to 3.5x
Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions
Related work (1) • Data mining algorithms • Characterizations • Self organizing maps • Kim et al. [WWC99] • C4.5 • Bradford and Fortes [WWC98] • Sequence mining, graph mining, outlier detection, clustering, and decision tree induction • Ghoting et al. [DAMON05] • Memory placement techniques for association rule mining • Considered the effects of memory pooling and coarse grained spatial locality on association rule mining algorithms in a serial and parallel setting • Parthasarathy et al. [SIGKDD98,KAIS01]
Related work (2) • Database algorithms • DBMS on modern hardware • Ailamaki et al. [VLDB99,VLDB01] • Cache sensitive search trees and B+-trees • Rao and Ross [VLDB99,SIGMOD00] • Prefetching for B+-trees and Hash-Join • Chen et al. [SIGMOD00,ICDE04]
Ongoing and future work • Algorithm re-design for next generation architectures • e.g. graph mining on multi-core architectures • Cache-conscious optimizations for other data mining and bioinformatics applications on modern architectures • e.g. classification algorithms, graph mining • Out-of-core algorithm designs • Microprocessor design targeted at data mining algorithms
Conclusions • Characterized the performance of three popular frequent pattern mining algorithms • Proposed a tile-able cache-conscious prefix tree • Improves spatial locality and allows for cache line pre-fetching • Path tiling improves temporal locality • Proposed novel thread-based decomposition for improving ILP by utilizing SMT • Overall, up to 4.8-fold speedup • Effective algorithm design in data mining needs to take into account modern architectural designs.
Thanks • We would like to acknowledge the following grants • NSF: CAREER-IIS-0347662 • NSF: NGS-CNS-0406386 • NSF: RI-CNS-0403342 • DOE: ECPI-DE-FG02-04ER25611