From Path Tree To Frequent Patterns: A Framework for Mining Frequent Patterns

Yabo Xu, Jeffrey Xu Yu, Guimei Liu, Hongjun Lu. Proc. of the 2002 IEEE International Conference on Data Mining (ICDM’02). Adviser: Jia-Ling Koh. Speaker: Yu-ting Kung

Introduction
• In this paper, the main tasks (for a multi-user environment) are:
  1. Constructing an initial tree for a transactional database (in memory)
  2. Mining using the tree constructed in memory
  3. Converting the in-memory tree to a disk-based tree
  4. Loading a portion of the tree on disk into main memory for mining (the mining step is the same as in task 2)

Introduction (Cont.)
• Data structure: PP-tree
  • A novel coded prefix-path tree
  • Two representations:
    • Memory-based PP-tree
    • Disk-based PP-tree
• Mining algorithm: PP-Mine
  • Operates on the memory-based PP-tree
  • Outperforms FP-growth

Transaction Database
• Example (min_sup threshold = 2)
• Item counts: a:3, b:1, c:3, d:3, e:3, f:1, g:2, h:1, i:1
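The item counts on this slide are enough to derive the frequent 1-items. The sketch below filters them with min_sup = 2 and orders them by descending count. The function name `frequent_items` and the alphabetical tie-break are assumptions made for illustration; the slides only say the order is "like frequency order".

```python
# Item counts copied from the example slide.
counts = {"a": 3, "b": 1, "c": 3, "d": 3, "e": 3, "f": 1, "g": 2, "h": 1, "i": 1}
MIN_SUP = 2

def frequent_items(counts, min_sup):
    """Return items with count >= min_sup, most frequent first.

    Ties are broken alphabetically (an assumption, not stated in the slides).
    """
    kept = [item for item, n in counts.items() if n >= min_sup]
    return sorted(kept, key=lambda item: (-counts[item], item))

F = frequent_items(counts, MIN_SUP)
print(F)       # ['a', 'c', 'd', 'e', 'g']
print(len(F))  # 5
```

Five items survive the threshold, which agrees with the rank N = 5 quoted for the example PP-tree.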

A Coded Prefix-Path Tree
• PP-tree: an ordered tree
• F: the set of frequent 1-items in a total order (e.g., frequency order)
• Node: labelled with a frequent item in F
• Children of a node: listed following that order
• Rank N of a PP-tree: the number of frequent 1-items (N = 5 in the example)
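A prefix-path tree of this kind can be sketched as an ordinary prefix tree whose paths follow the total order on F. The code below is a minimal illustration, not the authors' implementation: the `Node` class, the helper names, and the three transactions are all hypothetical (the slide's transaction table is not reproduced here), and storing a prefix count on every node is an assumption.

```python
class Node:
    """One PP-tree node: a frequent item plus the count of its prefix-path."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> Node

def build_pp_tree(transactions, order):
    """Insert each transaction as a path of frequent items sorted by `order`."""
    rank = {item: i for i, item in enumerate(order)}
    root = Node()
    for t in transactions:
        # Drop infrequent items, then sort the rest by the total order on F.
        path = sorted((x for x in set(t) if x in rank), key=rank.__getitem__)
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def children_in_order(node, order):
    """List a node's children following the total order on F."""
    return [node.children[i] for i in order if i in node.children]

# Hypothetical transactions, invented for this sketch.
order = ["a", "c", "d", "e", "g"]
transactions = [["a", "c", "d"], ["c", "a", "e"], ["c", "d", "b"]]
root = build_pp_tree(transactions, order)
ac = root.children["a"].children["c"]
print(ac.count, [n.item for n in children_in_order(ac, order)])
```

Because every path is sorted by the same order, transactions that share a prefix share nodes, which is what lets one node count stand for a whole prefix-path.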

A Complete Prefix-Path Tree
• Complete tree (rank N): a PP-tree containing all possible nodes over the N items
• Each node is encoded by its position in a pre-order traversal
• Shaded subtree: a PP-tree
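The pre-order encoding can be computed without materialising the complete tree. The sketch below assumes the complete tree of rank N is the full prefix tree over the N ordered items, so the subtree below the item at position j holds 2^(N-1-j) nodes, and it assumes the root has code 0. Both choices, and the function name `preorder_code`, are illustrative assumptions rather than details confirmed by the slides.

```python
def preorder_code(path, order):
    """Pre-order position of `path` in the complete prefix tree over `order`.

    Assumes the root has code 0 and that `path` lists items in `order`.
    """
    n = len(order)
    rank = {item: i for i, item in enumerate(order)}
    code = 0
    prev = -1  # position of the previous item on the path (-1 for the root)
    for item in path:
        j = rank[item]
        if j <= prev:
            raise ValueError("path must follow the total order on F")
        # Earlier siblings are the items between prev and j; skip their subtrees.
        skipped = sum(2 ** (n - 1 - k) for k in range(prev + 1, j))
        code += 1 + skipped
        prev = j
    return code

print(preorder_code(["a", "c"], ["a", "b", "c"]))  # 4
```

For three items the pre-order sequence is root, a, ab, abc, ac, b, bc, c, so the path a-c receives code 4. A node's code is fixed by its path alone, so the same code identifies the same prefix-path in memory and on disk.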

PP-tree Representations
• Memory-based representation: PPM-tree
• Disk-based representation: PPD-tree
  • Represented by four components:
    • T: the tree structure on disk
    • F: stores the N frequent 1-items
    • I: an index indicating the ranges of codes in disk pages
    • min_sup: the minimum support used to build the PPD-tree on disk
• See Figure 3 (next page)
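One way to picture the four components is as a small record whose index maps a node code to the disk page that would hold it. The sketch below is only an illustration: the class name `PPDTree`, its field names, the in-memory lists standing in for disk pages, and the binary-search lookup are all assumptions, since the slides do not specify the on-disk layout.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class PPDTree:
    pages: list        # T: pages of (code, count) records, lists stand in for disk pages
    items: list        # F: the N frequent 1-items in their total order
    first_codes: list  # I: smallest code stored on each page, ascending
    min_sup: int       # min_sup used when the tree was built

    def page_for_code(self, code):
        """Return the number of the page whose code range would contain `code`."""
        pos = bisect_right(self.first_codes, code) - 1
        if pos < 0:
            raise KeyError(code)
        return pos

# Hypothetical two-page tree over three items.
tree = PPDTree(
    pages=[[(1, 2), (4, 1)], [(7, 1)]],
    items=["a", "b", "c"],
    first_codes=[1, 7],
    min_sup=2,
)
print(tree.page_for_code(4))  # 0
```

With an index like this, loading a portion of the tree only needs to touch the pages whose code ranges overlap the requested subtree.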

PP-tree Representation (Figure 3). Labels in the figure: item:count, range of codes, code:count.

How to build a PPD-tree?
• Construction
  • Build a PPM-tree in memory with the given min_sup (task 1)
• Conversion
  • PPM-tree → PPD-tree
  • Using the coding scheme
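The conversion step can be pictured as flattening the in-memory tree into (code, count) records in pre-order and cutting them into pages. The sketch below assumes the same coding as a complete prefix tree with root code 0. The nested-dict tree format, the function names `to_records` and `paginate`, and the page size are hypothetical, and the real PPD-tree layout may differ.

```python
def to_records(tree, order):
    """Flatten a tree into [(code, count)] in pre-order.

    `tree` is a nested dict {item: (count, subtree)}; codes are positions
    in the complete prefix tree over `order`, with the root at code 0.
    """
    n = len(order)
    records = []

    def visit(children, parent_code, parent_rank):
        code = parent_code + 1  # code of the first possible child
        for j in range(parent_rank + 1, n):
            item = order[j]
            if item in children:
                count, subtree = children[item]
                records.append((code, count))
                visit(subtree, code, j)
            # Skip the whole subtree of item j, whether or not the node exists.
            code += 2 ** (n - 1 - j)

    visit(tree, 0, -1)
    return records

def paginate(records, page_size):
    """Split records into pages and index each page by its (first, last) code."""
    pages = [records[i:i + page_size] for i in range(0, len(records), page_size)]
    index = [(page[0][0], page[-1][0]) for page in pages]
    return pages, index

# Hypothetical in-memory tree over the items a, b, c.
ppm = {"a": (2, {"c": (1, {})}), "c": (1, {})}
records = to_records(ppm, ["a", "b", "c"])
pages, index = paginate(records, page_size=2)
print(records)
print(index)
```

Since the pre-order visit emits codes in ascending order, the records arrive already sorted and each page covers a contiguous code range.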

PP-Mine: Mining in Memory
• Based on two properties
  • Notation: i_j, i_k are single-item prefix-paths; the remaining symbols denote general prefix-paths, which may be empty
• Property 1 (push-down)

PP-Mine (Cont.) • Property 2 (push-right) • Example: Figure 4 (next page)

PP-Mine (Cont.)

PP-Mine Algorithm: Example

Experiment (1)
• Data sources
  • Sparse dataset: T25I20D100K (10K items)
  • Dense dataset: T40I10D1K (101 items)
• Three algorithms compared
  • PP-Mine
  • FP-growth
  • H-Mine
• Only the mining phase is compared

Experiment Result(1)

Experiment Result (2)
• Data source: T40I10D100K (59 items)
• = 50%
• Two algorithms compared
  • PP-Mine
  • FP-growth
• Compared quantities
  • t(FP): the time for FP-growth to construct an FP-tree
  • t(PP): the time for PP-load to load a sub PPD-tree plus the time to construct a small PPM-tree

Experiment Result(2)

Conclusion
• The PP-Mine algorithm outperforms FP-growth
  • Reduces both I/O cost and CPU cost
• The PP-Mine algorithm outperforms H-Mine
  • Minimizes counting cost

Coverage
• Definition: the coverage of a prefix-path is the set of all prefix-paths that contain it (including the prefix-path itself)
