
Cache-conscious Frequent Pattern Mining on a Modern Processor




Presentation Transcript


  1. Cache-conscious Frequent Pattern Mining on a Modern Processor Amol Ghoting, Gregory Buehrer, and Srinivasan Parthasarathy Data Mining Research Laboratory, CSE The Ohio State University Daehyun Kim, Anthony Nguyen, Yen-Kuang Chen, and Pradeep Dubey Intel Corporation

  2. Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions

  3. Motivation • Data mining applications • A rapidly growing segment in commerce and science • Interactive, so response time is important • Compute- and memory-intensive • Modern architectures • Memory wall • Instruction-level parallelism (ILP) [Chart: FP-Growth speedups of 2.4x and 1.6x, with performance saturating. Note: experiment conducted on specialized hardware]

  4. Contributions • We characterize the performance and memory access behavior of three state-of-the-art frequent pattern mining algorithms • We improve the performance of the three frequent pattern mining algorithms • Cache-conscious prefix tree • Spatial locality + hardware pre-fetching • Path tiling to improve temporal locality • Co-scheduling to improve ILP on a simultaneous multi-threaded (SMT) processor

  5. Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions

  6. Frequent pattern mining (1) • Finds groups of items that co-occur frequently in a transactional data set • Example: • Customer 1: milk bread cereal • Customer 2: milk bread eggs sugar • Customer 3: milk bread butter • Customer 4: eggs sugar • Minimum support = 2 • Frequent patterns: • Size 1: (milk), (bread), (eggs), (sugar) • Size 2: (milk & bread), (eggs & sugar) • Size 3: None • Search space traversal: breadth-first or depth-first
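The example above can be reproduced with a brute-force counting sketch (Python is used for illustration; the function and variable names are ours, and a real miner such as Apriori or FP-Growth prunes the search space rather than enumerating every candidate):

```python
from itertools import combinations

# The slide's example data set, one transaction per customer.
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread", "eggs", "sugar"},
    {"milk", "bread", "butter"},
    {"eggs", "sugar"},
]

def frequent_itemsets(transactions, minsup):
    """Count the support of every candidate itemset and keep those
    meeting minsup. Fine for a toy data set; exponential in general."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, k):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= minsup:
                frequent[candidate] = support
                found_any = True
        if not found_any:  # no frequent k-itemset => none of size k+1 either
            break
    return frequent

patterns = frequent_itemsets(transactions, minsup=2)
```

Running this yields the size-1 patterns (milk), (bread), (eggs), (sugar), the size-2 patterns (milk & bread) and (eggs & sugar), and no size-3 patterns, matching the slide.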

  7. Frequent pattern mining (2) • Algorithms under study • FP-Growth (based on the pattern-growth method) • Winner of the 2003 Frequent Itemset Mining Implementations (FIMI) evaluation • Genmax (depth-first search space traversal) • Maximal pattern miner • Apriori (breadth-first search space traversal) • All algorithms use a prefix tree as a data set representation

  8. Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions

  9. Setup • Intel Xeon processor • 2 GHz with 4GB of main memory • 4-way 8KB L1 data cache • 8-way 512KB L2 cache on chip • 8-way 2MB L3 cache • Intel VTune Performance Analyzer used to collect performance data • Implementations • FIMI repository • FP-Growth (Gösta Grahne and Jianfei Zhu) • Genmax (Gösta Grahne and Jianfei Zhu) • Apriori (Christian Borgelt) • Custom memory managers

  10. Execution time breakdown [Chart: execution time is dominated by support counting in a prefix tree; the counting routines of the other algorithms are similar to Count-FPGrowth()]

  11. Operation mix [Chart: operation mix of the counting routines. Note: each column need not sum up to 1]

  12. Memory access behavior

                              Count-FPGrowth()  Count-GM()  Count-Apriori()
  L1 hit rate                       89%            87%           86%
  L2 hit rate                       43%            42%           49%
  L3 hit rate                       39%            40%           27%
  L3 misses per instruction         0.03           0.03          0.04
  CPU utilization                    9%             9%            8%

  13. Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions

  14. FP-tree [Figure: example FP-tree construction with minimum support = 3; each node stores ITEM, COUNT, PARENT POINTER, CHILD POINTERS, and a NODE POINTER linking nodes of the same item] • Index based on largest common prefix • Typically results in a compressed data set representation • Node pointers allow for fast searching of items
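The node fields named on the slide can be sketched as a minimal pointer-based structure (Python for illustration; `FPNode` and `insert_transaction` are our names, not the paper's, and transactions are assumed to be pre-sorted by descending item frequency):

```python
class FPNode:
    """One FP-tree node: the slide's ITEM, COUNT, PARENT POINTER,
    CHILD POINTERS, and NODE POINTER (link to the next node
    carrying the same item)."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent          # parent pointer, for bottom-up walks
        self.children = {}            # child pointers, keyed by item
        self.node_link = None         # next node with the same item

def insert_transaction(root, items, header):
    """Insert one transaction, sharing the largest common prefix
    with previously inserted transactions."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
            # Thread the new node onto the header list for its item.
            child.node_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child

root, header = FPNode(None, None), {}
for t in [["a", "c", "f", "m", "p"], ["a", "c", "f", "b", "m"], ["a", "b"]]:
    insert_transaction(root, t, header)
```

After the three insertions, the shared prefix a-c-f is stored once with count fields 3, 2, 2, which is the compression the slide refers to.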

  15. FP-Growth • Basic step: • For each item in the FP-tree, build a conditional FP-tree • Recursively mine the conditional FP-tree • Dynamic data structure, and only two node fields are used through the bottom-up traversal • Poor spatial locality • Large data structure • Poor temporal locality • Pointer de-referencing • Poor instruction level parallelism (ILP) [Figure: the conditional FP-tree for m is built and mined; items p, f, c, a, b are processed similarly]

  16. Cache-conscious prefix tree • Improves cache line utilization • Smaller node size + DFS allocation • Allows the use of hardware cache line pre-fetching for bottom-up traversals [Figure: the pointer-based tree is re-laid out in DFS allocation order, with separate header lists (for node pointers) and count lists (for node counts) alongside the ITEM, PARENT POINTER, and CHILD POINTER fields]
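The DFS re-layout can be sketched as parallel arrays (a simplified illustration in Python; the paper's structure keeps separate header and count lists, which we collapse into plain lists here, and the toy tree shape is ours):

```python
# A toy pointer-based prefix tree: (item, count, [children]).
tree = ("r", 0, [("a", 3, [("c", 2, [("f", 2, [])]), ("b", 1, [])])])

def flatten_dfs(root):
    """Re-lay the tree as parallel arrays in DFS allocation order.
    A node's parent always precedes it, so bottom-up traversals move
    toward strictly lower addresses, which suits hardware prefetching."""
    items, counts, parents = [], [], []
    stack = [(root, -1)]
    while stack:
        (item, count, children), parent_idx = stack.pop()
        idx = len(items)
        items.append(item)
        counts.append(count)
        parents.append(parent_idx)
        for child in children:
            stack.append((child, idx))
    return items, counts, parents

items, counts, parents = flatten_dfs(tree)

def path_to_root(idx, items, parents):
    """Bottom-up traversal using small array indices instead of
    chasing full-size node pointers."""
    path = []
    while parents[idx] != -1:       # stop at the root
        path.append(items[idx])
        idx = parents[idx]
    return path
```

Walking up from any node now touches only two compact arrays instead of scattered heap nodes, which is the spatial-locality gain the slide describes.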

  17. Path tiling to improve temporal locality • DFS order enables breaking the tree into tiles based on node addresses • These tiles can partially overlap • Maximize tree reuse • All tree accesses are restructured to operate on a tile-by-tile basis [Figure: the tree is partitioned into tiles 1 through N by address range]
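Path tiling can be sketched as follows (a hypothetical simplification: we use fixed-size, non-overlapping index ranges as tiles, whereas the slide notes that real tiles may partially overlap; the array layout and names follow our earlier DFS-order sketch):

```python
# DFS-order layout of a small tree: node i's parent is parents[i] (< i).
items   = ["r", "a", "b", "c", "f"]
parents = [-1,   0,   1,   1,   3]

def tiled_bottom_up(starts, items, parents, tile_size):
    """Advance many bottom-up walks tile by tile. Tiles are swept from
    high addresses to low; every pending walk does all of its work
    inside the current tile before any walk touches a colder tile,
    so each tile is reused while still cache-resident."""
    visited = []                     # (walk_id, item) in global visit order
    cursors = list(starts)           # each walk's current node index
    hi = len(parents)
    while hi > 0:
        lo = max(0, hi - tile_size)  # current tile covers indices [lo, hi)
        for w in range(len(cursors)):
            idx = cursors[w]
            while lo <= idx < hi:    # walk w stays inside this tile
                visited.append((w, items[idx]))
                idx = parents[idx]
            cursors[w] = idx         # parked below the tile (or at -1)
        hi = lo
    return visited

order = tiled_bottom_up([4, 2], items, parents, tile_size=2)
```

Each walk still sees its own path in root-ward order, but the global access order is grouped by tile rather than by walk.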

  18. Improving ILP using SMT • Simultaneous multi-threading (SMT) • Intel hyper-threading (HT) maintains two hardware contexts on chip • Improves instruction level parallelism (ILP) • When one thread waits, the other thread can use CPU resources • Identifying independent threads is not good enough • Unlikely to hide long cache miss latency • Can lead to cache interference (conflicts) • Solution: restructure the multi-threaded computation to reuse cache on a tile-by-tile basis [Figure: Thread 1 and Thread 2 operate on the same data (the same tile) but perform different computation]
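The co-scheduling structure can be sketched with two lock-stepped threads (Python threads illustrate only the tile-by-tile structure, not an SMT speedup, because of the interpreter's global lock; the work split, counting different items over the same tile, is our hypothetical stand-in for the paper's computation):

```python
import threading

# Toy DFS-order item list and its partition into tiles.
items = ["r", "a", "b", "c", "f", "b", "m"]
tile_size = 3
n = len(items)
tiles = [(lo, min(lo + tile_size, n)) for lo in range(0, n, tile_size)]

barrier = threading.Barrier(2)   # exactly two co-scheduled threads
results = {}

def count_item(target, out_key):
    """Each thread scans the same tile as its sibling, then waits at
    the barrier before moving on: same data, different computation,
    so the threads share the cached tile instead of evicting it."""
    total = 0
    for lo, hi in tiles:
        total += sum(1 for i in range(lo, hi) if items[i] == target)
        barrier.wait()           # advance to the next tile in lock-step
    results[out_key] = total

t1 = threading.Thread(target=count_item, args=("b", "thread1"))
t2 = threading.Thread(target=count_item, args=("m", "thread2"))
t1.start(); t2.start()
t1.join(); t2.join()
```

The barrier is what distinguishes co-scheduling from merely running independent threads: it keeps both hardware contexts inside the same tile at the same time.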

  19. Speedup for FP-Growth (Synthetic data set) [Chart: speedup across support values 4000, 4500, 5000, 5500]

  20. Speedup for FP-Growth (Real data set) [Chart: speedup across support values 50000, 58350, 66650, 75000] • For FP-Growth, the L1 hit rate improves from 89% to 94% and the L2 hit rate improves from 43% to 98% • Speedup: Genmax – up to 4.5x, Apriori – up to 3.5x

  21. Roadmap • Motivation and Contributions • Background • Performance characterization • Cache-conscious optimizations • Related work • Conclusions

  22. Related work (1) • Data mining algorithms • Characterizations • Self organizing maps • Kim et al. [WWC99] • C4.5 • Bradford and Fortes [WWC98] • Sequence mining, graph mining, outlier detection, clustering, and decision tree induction • Ghoting et al. [DAMON05] • Memory placement techniques for association rule mining • Considered the effects of memory pooling and coarse grained spatial locality on association rule mining algorithms in a serial and parallel setting • Parthasarathy et al. [SIGKDD98,KAIS01]

  23. Related work (2) • Database algorithms • DBMS on modern hardware • Ailamaki et al. [VLDB99,VLDB2001] • Cache-sensitive search trees and B+-trees • Rao and Ross [VLDB99,SIGMOD00] • Prefetching for B+-trees and Hash-Join • Chen et al. [SIGMOD00,ICDE04]

  24. Ongoing and future work • Algorithm re-design for next generation architectures • e.g. graph mining on multi-core architectures • Cache-conscious optimizations for other data mining and bioinformatics applications on modern architectures • e.g. classification algorithms, graph mining • Out-of-core algorithm designs • Microprocessor design targeted at data mining algorithms

  25. Conclusions • Characterized the performance of three popular frequent pattern mining algorithms • Proposed a tile-able cache-conscious prefix tree • Improves spatial locality and allows for cache line pre-fetching • Path tiling improves temporal locality • Proposed novel thread-based decomposition for improving ILP by utilizing SMT • Overall, up to 4.8-fold speedup • Effective algorithm design in data mining needs to take into account modern architectural designs.

  26. Thanks • We would like to acknowledge the following grants • NSF: CAREER-IIS-0347662 • NSF: NGS-CNS-0406386 • NSF: RI-CNS-0403342 • DOE: ECPI-DE-FG02-04ER25611
