Mining Frequent Closed Cubes in 3D Datasets • Liping Ji, Kian-Lee Tan, Anthony K. H. Tung • Computer Science Department, National University of Singapore
Motivation • Frequent Closed Pattern (FCP) mining: great importance, wide application • Previous work is limited to 2D FCP mining: biological data (gene-time, gene-sample); market basket data (transaction-itemset) • Extend 2D FCP mining to the 3D context: biological data (gene-sample-time); marketing data (region-time-items)
Background • Frequent Pattern (FP) and Frequent Closed Pattern (FCP) • Minimum support threshold: minsup = 2 • Transactions and their itemsets: t1: {a1, a2, a3, a5}; t2: {a1, a2, a3}; t3: {a1, a2, a3, a4}; t4: {a3, a5} (Figure labels: FP, FCP)
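The example on this slide can be worked through by brute force. The sketch below is illustrative only and is not the mining algorithm from the talk; the helper names `support`, `frequent_patterns`, and `closed_patterns` are assumptions introduced here. It treats a pattern as frequent when at least minsup transactions contain it, and as closed when no proper superset has the same support.

```python
from itertools import combinations

# Transactions from the slide: transaction id -> itemset.
TRANSACTIONS = {
    "t1": {"a1", "a2", "a3", "a5"},
    "t2": {"a1", "a2", "a3"},
    "t3": {"a1", "a2", "a3", "a4"},
    "t4": {"a3", "a5"},
}


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def frequent_patterns(transactions, minsup):
    """Brute force: test every non-empty itemset (toy data only)."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            count = support(set(combo), transactions)
            if count >= minsup:
                result[frozenset(combo)] = count
    return result


def closed_patterns(frequent):
    """Keep the patterns that have no proper superset of equal support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == frequent[q] for q in frequent)
    }


fps = frequent_patterns(TRANSACTIONS, minsup=2)
fcps = closed_patterns(fps)
```

With minsup = 2 this gives nine frequent patterns, three of which are closed: {a3}, {a3, a5}, and {a1, a2, a3}.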
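Background • Binary Mapping: transactions (T) as rows, items (I) as columns; a cell is 1 when the transaction contains the item • t1: a1 a2 a3 a5; t2: a1 a2 a3; t3: a1 a2 a3 a4; t4: a3 a5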
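The binary mapping can be reproduced in a few lines. This is a minimal sketch; the function name `binary_matrix` and the row/column orientation (transactions as rows, items as columns) are assumptions, since the original matrix was a figure.

```python
# Transactions from the slide: transaction id -> itemset.
TRANSACTIONS = {
    "t1": {"a1", "a2", "a3", "a5"},
    "t2": {"a1", "a2", "a3"},
    "t3": {"a1", "a2", "a3", "a4"},
    "t4": {"a3", "a5"},
}
ITEMS = ["a1", "a2", "a3", "a4", "a5"]


def binary_matrix(transactions, items):
    """One row per transaction (sorted by id), one column per item."""
    return [
        [1 if item in transactions[tid] else 0 for item in items]
        for tid in sorted(transactions)
    ]


matrix = binary_matrix(TRANSACTIONS, ITEMS)
for tid, row in zip(sorted(TRANSACTIONS), matrix):
    print(tid, row)
```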
Frequent Closed Cube • 3D Dataset: three dimensions (height, row, column); each height value gives one row-by-column slice
Frequent Closed Cube • Slices by Height Dimension: h1, h2, h3
Frequent Closed Cube • Closed Cube: Maximal (Figure: slices h1, h2, h3)
Frequent Closed Cube • Definition: Frequent Closed Cube (FCC) • Maximal: cannot be extended in any dimension • Frequent: satisfies the minH, minR, and minC thresholds (minimum numbers of heights, rows, and columns)
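The definition can be checked directly on a small example. The sketch below assumes a cube is a set of heights, rows, and columns whose cells are all 1 in a binary 3D dataset; the dataset `DATA` and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical binary dataset, indexed as DATA[height][row][column].
DATA = [
    [[1, 1, 0], [1, 1, 0], [0, 0, 1]],  # height 0
    [[1, 1, 1], [1, 1, 0], [0, 1, 1]],  # height 1
]


def all_ones(data, heights, rows, cols):
    """True if every cell of the candidate cube is 1."""
    return all(data[h][r][c] == 1 for h in heights for r in rows for c in cols)


def is_closed(data, heights, rows, cols):
    """Maximal: adding any single height, row, or column breaks the cube."""
    if not all_ones(data, heights, rows, cols):
        return False
    n_h, n_r, n_c = len(data), len(data[0]), len(data[0][0])
    for h in set(range(n_h)) - heights:
        if all_ones(data, heights | {h}, rows, cols):
            return False
    for r in set(range(n_r)) - rows:
        if all_ones(data, heights, rows | {r}, cols):
            return False
    for c in set(range(n_c)) - cols:
        if all_ones(data, heights, rows, cols | {c}):
            return False
    return True


def is_fcc(data, heights, rows, cols, min_h, min_r, min_c):
    """Frequent (meets all three thresholds) and closed."""
    frequent = len(heights) >= min_h and len(rows) >= min_r and len(cols) >= min_c
    return frequent and is_closed(data, heights, rows, cols)


print(is_fcc(DATA, {0, 1}, {0, 1}, {0, 1}, 2, 2, 2))  # closed and frequent
print(is_fcc(DATA, {1}, {0, 1}, {0, 1}, 1, 2, 2))     # extendable by height 0
```

In this toy dataset the first cube is an FCC, while the second is not closed because height 0 can be added without introducing a zero.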
RSM vs. CubeMiner • Representative Slice Mining (RSM): extends existing 2D FCP mining algorithms to FCC mining • CubeMiner: operates on the 3D space directly
RSM • Representative Slice (RS) generation: enumerate all possible combinations of slices • 2D FCP mining from each RS • Post-pruning to remove unclosed cubes: if a 2D FCP is also contained in slices other than its contributing slices, the cube is unclosed and is removed; otherwise it is retained.
RSM • Slices by Height Dimension: h1, h2, h3
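The three RSM steps can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the representative slice is assumed to be the cell-wise AND of the combined slices, the 2D miner is a brute-force stand-in for an existing 2D FCP algorithm, thresholds are assumed to be at least 1, and `DATA` and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical binary dataset, indexed as DATA[height][row][column].
DATA = [
    [[1, 1, 0], [1, 1, 0], [0, 0, 1]],  # height 0
    [[1, 1, 1], [1, 1, 0], [0, 1, 1]],  # height 1
]


def representative_slice(data, heights):
    """Assumed definition: cell-wise AND of the chosen height slices."""
    n_rows, n_cols = len(data[0]), len(data[0][0])
    return [
        [int(all(data[h][r][c] for h in heights)) for c in range(n_cols)]
        for r in range(n_rows)
    ]


def closed_patterns_2d(slice2d, min_r, min_c):
    """Brute-force closed (rows, cols) pairs of all-ones cells in one slice."""
    n_rows, n_cols = len(slice2d), len(slice2d[0])
    found = set()
    for size in range(min_r, n_rows + 1):
        for rows in combinations(range(n_rows), size):
            cols = frozenset(
                c for c in range(n_cols) if all(slice2d[r][c] for r in rows)
            )
            if len(cols) < min_c:
                continue
            # Close the row set: every row that contains all of `cols`.
            closed_rows = frozenset(
                r for r in range(n_rows) if all(slice2d[r][c] for c in cols)
            )
            found.add((closed_rows, cols))
    return found


def rsm(data, min_h, min_r, min_c):
    """Enumerate slice combinations, mine each RS, post-prune unclosed cubes."""
    n_h = len(data)
    cubes = set()
    for size in range(min_h, n_h + 1):
        for heights in combinations(range(n_h), size):
            rs = representative_slice(data, heights)
            others = set(range(n_h)) - set(heights)
            for rows, cols in closed_patterns_2d(rs, min_r, min_c):
                # Unclosed if some non-contributing slice also contains it.
                in_other = any(
                    all(data[h][r][c] for r in rows for c in cols) for h in others
                )
                if not in_other:
                    cubes.add((frozenset(heights), rows, cols))
    return cubes


result = rsm(DATA, min_h=1, min_r=2, min_c=2)
```

On this toy dataset the pattern (rows {0, 1}, columns {0, 1}) is found in each single slice but pruned there, because the other slice also contains it; it survives only for the combination of both heights.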
CubeMiner: Cutters (Figure: slice h1 and the cutters derived from h1)
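Since the cutter figure is not recoverable here, the following sketch shows one plausible way cutters could be derived from a slice. It assumes a cutter is a triple (height, row, set of columns holding 0 in that row), which matches the cutters (h1, r1, c4) and (h1, r2, c4c5) used on the splitting-tree slides. The slice contents reuse the Background transactions as a stand-in for h1; that reuse, and the function name, are assumptions.

```python
COLS = ["c1", "c2", "c3", "c4", "c5"]

# Stand-in for slice h1 (assumption: the Background binary matrix).
SLICE_H1 = {
    "r1": [1, 1, 1, 0, 1],
    "r2": [1, 1, 1, 0, 0],
    "r3": [1, 1, 1, 1, 0],
    "r4": [0, 0, 1, 0, 1],
}


def cutters_from_slice(height, rows, col_names):
    """One cutter per row that has zeros: (height, row, zero columns)."""
    cutters = []
    for row_name, cells in rows.items():
        zero_cols = frozenset(c for c, v in zip(col_names, cells) if v == 0)
        if zero_cols:
            cutters.append((height, row_name, zero_cols))
    return cutters


cutters = cutters_from_slice("h1", SLICE_H1, COLS)
for cutter in cutters:
    print(cutter[0], cutter[1], sorted(cutter[2]))
```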
Mining FCC: CubeMiner Splitting Tree • Root: (h1h2h3, r1r2r3r4, c1c2c3c4c5) • First cutter: (h1, r1, c4)
Mining FCC: CubeMiner Splitting Tree • Cutter Checking: check whether the cutter is applicable (A.) to a node • Applicable (A.): every atom of the cutter overlaps the node • Otherwise: not applicable (N.A.)
Mining FCC: CubeMiner Splitting Tree • Left Tree: remove the cutter's left atom h1 from the parent node, giving (h2h3, r1~r4, c1~c5) • Middle Tree: remove the cutter's middle atom r1, giving (h1~h3, r2~r4, c1~c5) • Right Tree: remove the cutter's right atom c4, giving (h1~h3, r1~r4, c1c2c3c5)
Mining FCC: CubeMiner Splitting Tree • Next cutter: (h1, r2, c4c5) • Checking against the three children: N.A. for (h2h3, r1~r4, c1~c5); A. for (h1~h3, r2~r4, c1~c5); A. for (h1~h3, r1~r4, c1c2c3c5)
Mining FCC: CubeMiner Splitting Tree • Applying (h1, r2, c4c5) to (h1~h3, r2~r4, c1~c5) gives (h2h3, r2~r4, c1~c5), (h1~h3, r3r4, c1~c5), and (h1~h3, r2~r4, c1~c3) • Subset Cube: (h2h3, r2~r4, c1~c5) is a subset of the existing node (h2h3, r1~r4, c1~c5) • Left Track Checking
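The splitting steps on these slides can be replayed in code. The sketch below uses the root node and the two cutters shown on the slides; the applicability rule (the cutter's height and row belong to the node and at least one of its columns does) is an assumption inferred from the A./N.A. labels, and the function names are hypothetical. It reproduces only the split and the subset observation, not the full CubeMiner algorithm or its left track checking.

```python
def fs(*names):
    """Shorthand for a frozenset of labels."""
    return frozenset(names)


def is_applicable(cutter, node):
    """Assumed rule: each atom of the cutter overlaps the node."""
    h, r, cols = cutter
    heights, rows, columns = node
    return h in heights and r in rows and bool(cols & columns)


def split(node, cutter):
    """Left, middle, right children: drop one atom of the cutter each."""
    h, r, cols = cutter
    heights, rows, columns = node
    left = (heights - {h}, rows, columns)
    middle = (heights, rows - {r}, columns)
    right = (heights, rows, columns - cols)
    return left, middle, right


def is_subset_cube(a, b):
    """True if cube `a` is contained in cube `b` in every dimension."""
    return all(x <= y for x, y in zip(a, b))


ROOT = (fs("h1", "h2", "h3"), fs("r1", "r2", "r3", "r4"),
        fs("c1", "c2", "c3", "c4", "c5"))
CUTTER_1 = ("h1", "r1", fs("c4"))
CUTTER_2 = ("h1", "r2", fs("c4", "c5"))

left, middle, right = split(ROOT, CUTTER_1)
applicable = [is_applicable(CUTTER_2, node) for node in (left, middle, right)]
grandchildren = split(middle, CUTTER_2)

print(applicable)                               # N.A., A., A. on the slide
print(is_subset_cube(grandchildren[0], left))   # the "Subset Cube" case
```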
Parallelism • RSM task: mining of each representative slice • CubeMiner task: mining of each branch • Processors: each initially keeps a copy of the whole dataset • Tasks run independently and concurrently with little communication cost
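Because each representative slice is an independent task, the RSM side of this scheme maps naturally onto a worker pool. The sketch below is only an illustration of that task structure: it uses Python's `ThreadPoolExecutor` for simplicity (a process pool or separate machines would be the realistic choice for CPU-bound mining), the per-task work is a placeholder that counts the 1-cells of the representative slice instead of mining 2D FCPs, and `DATA` and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical binary dataset shared by all workers: DATA[height][row][column].
DATA = [
    [[1, 1, 0], [1, 1, 0], [0, 0, 1]],  # height 0
    [[1, 1, 1], [1, 1, 0], [0, 1, 1]],  # height 1
]


def representative_slice(data, heights):
    """Assumed definition: cell-wise AND of the chosen height slices."""
    n_rows, n_cols = len(data[0]), len(data[0][0])
    return [
        [int(all(data[h][r][c] for h in heights)) for c in range(n_cols)]
        for r in range(n_rows)
    ]


def task(data, heights):
    """Placeholder for 'mine 2D FCPs from this RS': count its 1-cells."""
    rs = representative_slice(data, heights)
    return heights, sum(sum(row) for row in rs)


def run_parallel(data, min_h, workers=2):
    """One task per slice combination; tasks do not communicate."""
    n_h = len(data)
    tasks = [
        heights
        for size in range(min_h, n_h + 1)
        for heights in combinations(range(n_h), size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda hs: task(data, hs), tasks))


results = run_parallel(DATA, min_h=1)
```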
Mining FCC: Experiments • Real data: yeast cell-cycle regulated genes • Elutriation experiments: 14*9*7161 • CDC15 experiments: 19*9*7761 • Synthetic data: IBM data generator • Synthetic 1: H*R*C = (8~20)*20*1000 • Synthetic 2: H*R*C = 100*100*10000
Experiments: Optimize CubeMiner • Optimal: sort slices in decreasing order of their number of zeros • Prunes off infrequent cubes early (Figure: Elutriation, 14*9*7161)
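The slice ordering heuristic is simple to express. The following is a minimal sketch, assuming "zero decreasing order" means processing the slices with the most 0-cells first; `DATA` and the function names are hypothetical.

```python
# Hypothetical binary dataset, indexed as DATA[height][row][column].
DATA = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # height 0: no zeros
    [[1, 1, 1], [1, 1, 0], [0, 1, 1]],  # height 1: two zeros
    [[1, 1, 0], [1, 1, 0], [0, 0, 1]],  # height 2: four zeros
]


def count_zeros(slice2d):
    """Number of 0-cells in one slice."""
    return sum(row.count(0) for row in slice2d)


def order_slices_by_zeros(data):
    """Slice indices, most zeros first."""
    return sorted(range(len(data)), key=lambda h: count_zeros(data[h]), reverse=True)


order = order_slices_by_zeros(DATA)
print(order)
```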
Experiments: Optimize RSM • Optimal: enumerate slices along the smallest dimension • Slice enumeration takes a relatively long processing time (Figure: Elutriation, 14*9*7161)
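A quick count shows why the enumeration dimension matters. The sketch below assumes RSM enumerates every combination of at least `min_size` slices along the chosen dimension; the function name and the choice `min_size=1` are illustrative assumptions, and the dimension sizes 14 and 9 are taken from the Elutriation dataset.

```python
from math import comb


def slice_combinations(n_slices, min_size=1):
    """Number of slice subsets with at least `min_size` slices."""
    return sum(comb(n_slices, k) for k in range(min_size, n_slices + 1))


# Elutriation is 14*9*7161: compare enumerating along the 14-sized
# dimension with enumerating along the 9-sized one.
along_14 = slice_combinations(14)
along_9 = slice_combinations(9)
print(along_14, along_9)
```

Under these assumptions the 9-sized dimension needs 511 representative slices, against 16383 for the 14-sized one.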
Experiments: RSM vs. CubeMiner • As the smallest dimension grows, CubeMiner outperforms RSM (Figure: synthetic data, varying the size of the height dimension)
Experiments: Parallelism • As the degree of parallelism increases, the response time decreases • Optimal number of processors (Figure: CDC15, varying the number of processors)
Conclusion • Notion of Frequent Closed Cube • RSM: efficient when one of the dimensions is small • CubeMiner: superior for large datasets • Parallel RSM and CubeMiner