This study proposes a method to compress frequent itemsets in databases, reducing redundancy and improving efficiency in pattern mining. By clustering similar patterns and selecting representative ones, the approach achieves high-quality compression while preserving information. The algorithm combines global and local strategies to discover representative patterns, balancing efficiency and performance.
Mining Compressed Frequent-Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005 Presenter: 吳建良
Outline • Introduction • Problem Statement and Analysis • Discovering Representative Patterns • Performance Study
Introduction • Frequent Pattern Mining • Given a database, find all the frequent itemsets • Challenges • Efficiency? • Many scalable mining algorithms are available now • Usability? • High minimum support: only common-sense patterns • Low minimum support: an explosive number of results
Existing Compression Techniques • Lossless compression • Closed frequent patterns • Non-derivable frequent itemsets • Lossy approximation • Maximal frequent patterns
A Motivating Example • Closed frequent patterns • Report P1, P2, P3, P4, P5 • Place too much emphasis on support • No compression • Maximal frequent patterns • Report only P3 • Care only about the expression • Lose the support information • A desirable output: P2, P3, P4 • High-quality compression must consider both expression and support (Table: a subset of frequent itemsets in the Accidents dataset)
Compressing Frequent Patterns • Compressing framework • Clustering frequent patterns by pattern similarity • Pick a representative pattern for each cluster • Key Problems • Similarity measure • Quality of the clustering • Expression and support • Efficiency
Distance Measure • Let P1 and P2 be two closed frequent patterns and T(P) be the set of transactions containing P; the distance between P1 and P2 is D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| • Example: let T(P1) = {t1, t2, t3, t4, t5} and T(P2) = {t1, t2, t3, t4, t6}; then D(P1, P2) = 1 − 4/6 = 1/3 • D characterizes the support but ignores the expression
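A minimal sketch of this distance in Python (the function name and string transaction IDs are illustrative, not from the paper):

```python
def pattern_distance(t1: set, t2: set) -> float:
    """D(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|,
    given the supporting transaction sets of the two patterns."""
    union = t1 | t2
    if not union:
        return 0.0  # convention when both transaction sets are empty
    return 1.0 - len(t1 & t2) / len(union)

# The slide's example: the two patterns share 4 of 6 transactions
T_P1 = {"t1", "t2", "t3", "t4", "t5"}
T_P2 = {"t1", "t2", "t3", "t4", "t6"}
print(pattern_distance(T_P1, T_P2))  # 0.333... = 1/3
```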
Representative Patterns • Incorporate expression into the representative pattern • The representative pattern should be able to express all the other patterns in the same cluster (i.e., it is their superset) • Here the representative pattern is Pr = {38, 16, 18, 12, 17} • The representative pattern is also good w.r.t. distance • D(Pr, P1) ≤ D(P1, P2) and D(Pr, P2) ≤ D(P1, P2) • The distance to Pr can be computed using supports only
δ-Clustering • δ-cover • A pattern P is δ-covered by another pattern P′ if P ⊆ P′ and D(P, P′) ≤ δ • δ-cluster • There exists a representative pattern Pr in the cluster • Each pattern P in the cluster is δ-covered by Pr
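A sketch of the δ-cover test in Python, assuming absolute support counts (names are illustrative). Because P ⊆ Pr implies T(Pr) ⊆ T(P), the distance collapses to 1 − sup(Pr)/sup(P), so no transaction sets are needed:

```python
def delta_covered(p: frozenset, sup_p: int,
                  pr: frozenset, sup_pr: int, delta: float) -> bool:
    """P is delta-covered by Pr iff P ⊆ Pr and D(P, Pr) <= delta.
    With P ⊆ Pr, D(P, Pr) = 1 - sup(Pr)/sup(P)."""
    return p <= pr and 1.0 - sup_pr / sup_p <= delta

# Illustrative supports: sup(P) = 10, sup(Pr) = 7 -> distance 0.3 <= 0.35
print(delta_covered(frozenset({"a", "b"}), 10,
                    frozenset({"a", "b", "c"}), 7, 0.35))  # True
```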
Characteristics of δ-Clustering • If P is δ-covered by a representative pattern Pr, then P ⊆ Pr implies T(Pr) ⊆ T(P), so D(P, Pr) = 1 − sup(Pr)/sup(P) ≤ δ • Suppose sup(P) ≥ min_sup = M; then sup(Pr) ≥ (1 − δ)·sup(P) ≥ (1 − δ)·M • Hence all representative patterns can be found by mining with the lowered support threshold (1 − δ)·M (see the sketch below)
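A tiny sketch of the lowered threshold implied by this bound (the function name is illustrative); it reproduces the "support 2" threshold used in the running example on the next slides:

```python
from math import ceil

def lowered_min_sup(M: int, delta: float) -> int:
    """Every representative pattern has support >= (1 - delta) * M,
    so mining (closed) patterns at this lowered threshold suffices."""
    return ceil((1 - delta) * M)

# The running example: delta = 0.35, M = 3 -> threshold 2
print(lowered_min_sup(3, 0.35))  # 2
```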
Pattern Compression Problem • Find the minimum number of clusters (representative patterns) such that • every frequent pattern is δ-covered by at least one representative pattern • NP-hard • By reduction from the set-covering problem
Discovering Representative Patterns • RPglobal • Assumes all frequent patterns have been mined • Directly applies the greedy set-covering algorithm • Guaranteed bound w.r.t. the optimal solution • RPlocal • Mines directly from the raw dataset • Gains efficiency, loses the bound guarantee • RPcombine • Combines the two methods • Trades off efficiency against performance
RPglobal • Algorithm • Collect the complete coverage information • Find the set of representative patterns greedily, maximizing |Set(RP)| at each step • Example: δ = 0.35, M = 3, FP(M) = {A, B, C, D, AB, BD}; mining with the lowered support threshold 2 gives the closed patterns {B, C, AB, BD, ABC, ABD}
RPglobal (cont.) • Step 1: collect the complete coverage information • Set(A)={A}, Set(B)={B}, Set(C)={C}, Set(D)={D} • Set(AB)={A, B}, Set(AC)={A, C}, Set(AD)={A, D}, Set(BC)={C}, Set(BD)={B, D} • Set(ABC)={A, C, AB}, Set(ABD)={A, D, BD} • Step 2: greedily cover FP(M) = {A, B, C, D, AB, BD} • Pick Set(ABC), then Set(ABD), then Set(AB) or Set(BD)
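A minimal greedy set-cover sketch over the slide's coverage information (function name is illustrative; this is the generic greedy step RPglobal applies, not the paper's exact implementation). Ties may break differently from the slide, but three representatives result either way:

```python
def rp_global(coverage: dict, fp: set) -> list:
    """Greedy set cover: coverage maps each candidate representative
    to the set of frequent patterns it delta-covers; fp is FP(M).
    Each round picks the candidate covering the most uncovered patterns."""
    remaining, chosen = set(fp), []
    while remaining:
        best = max(coverage, key=lambda c: len(coverage[c] & remaining))
        if not coverage[best] & remaining:
            break  # nothing can cover the rest
        chosen.append(best)
        remaining -= coverage[best]
    return chosen

# Coverage information from the slide (delta = 0.35, M = 3)
coverage = {
    "A": {"A"}, "B": {"B"}, "C": {"C"}, "D": {"D"},
    "AB": {"A", "B"}, "AC": {"A", "C"}, "AD": {"A", "D"},
    "BC": {"C"}, "BD": {"B", "D"},
    "ABC": {"A", "C", "AB"}, "ABD": {"A", "D", "BD"},
}
print(rp_global(coverage, {"A", "B", "C", "D", "AB", "BD"}))
# ['ABC', 'BD', 'ABD'] with this tie-breaking; the slide picks
# ABC, ABD, then AB or BD, also three representatives
```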
RPlocal • Depth-first search strategy • A pattern can only be covered by its sons or by patterns visited before • Integrates pattern compression into the frequent-pattern mining process (the FP-growth method) • Benefits • No need to store all outputs • More efficient pruning methods (Figure: for the pattern P = {a, c}, the patterns covering P are P's sons and the previously visited patterns)
RPlocal (cont.) • Algorithm (see the sketch below) • FP-growth mining with a depth-first search strategy • Remember all previously discovered representative patterns • For each pattern P not yet covered • Use the closed_index for closed-pattern pruning • Select the representative pattern Pr that covers P and has the largest coverage
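A highly simplified, non-incremental sketch of this greedy choice, assuming the patterns have already been enumerated in depth-first order as (frozenset, support) pairs. Unlike the real RPlocal, which makes this choice on the fly inside FP-growth and restricts candidates to the current branch and previously visited patterns, this sketch searches the whole list:

```python
def rp_local_sketch(patterns, delta):
    """Scan patterns in DFS order; when the current pattern P is not
    yet delta-covered, emit as a new representative the pattern that
    delta-covers P and covers the most still-uncovered patterns.
    (RPlocal does this during FP-growth, without materializing the
    pattern list, and adds closed_index pruning.)"""
    def covers(rep, rep_sup, p, p_sup):
        return p <= rep and 1.0 - rep_sup / p_sup <= delta

    reps, covered = [], set()
    for p, p_sup in patterns:
        if p in covered:
            continue
        # Candidates that delta-cover P (P always covers itself)
        cands = [(r, rs) for r, rs in patterns if covers(r, rs, p, p_sup)]
        best = max(cands, key=lambda c: sum(
            1 for q, qs in patterns
            if q not in covered and covers(c[0], c[1], q, qs)))
        reps.append(best)
        covered |= {q for q, qs in patterns if covers(best[0], best[1], q, qs)}
    return reps
```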
RPlocal (cont.) • Example: δ = 0.35, M = 3 • Item frequencies: B:4, A:3, C:3, D:3 • Reorder all itemsets by descending frequency (Original Dataset → Reordered Dataset) • Build the FP-tree and depth-first search the pattern space
RPlocal (cont.) • FP-tree|D, Pr set: Set(D) = {D}; extensions DC, DA, DB • sup(DC) = 1 → infrequent, pruned • FP-tree|DA, Pr set: Set(DA) = {D} • DAB, Pr set: Set(DAB) = {D, DA} • After visiting DB: Set(DAB) = {D, DA, DB}
RPlocal (cont.) • FP-tree|C, Pr set: Set(DAB) = {D, DA, DB}, Set(C) = {C}; extensions CA, CB • FP-tree|CA, Pr set: Set(DAB) = {D, DA, DB}, Set(CA) = {C} • CAB, Pr set: Set(DAB) = {D, DA, DB}, Set(CAB) = {C, CA} • CB is eliminated by closed pruning • Finally: Set(DAB) = {D, DA, DB}, Set(CAB) = {C, CA, CB}
Closed Pruning • In the depth-first search, the single items are partitioned into three disjoint sets: • conditional set: items in the current pattern • todo-set: items still to be expanded • done-set: all the other items • closed_index example: current pattern CB • conditional set: {C, B}, todo-set: {}, done-set: {D, A} • closed_index of CB over (D, C, A, B) = (0, 1, 1, 1) • The bit for A, a done-set item, is 1, so CB is not closed
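A small sketch of this check in Python (the dict representation of the closed_index is illustrative; the paper uses a bit vector):

```python
def survives_closed_pruning(closed_index: dict, done_set: set) -> bool:
    """closed_index[i] = 1 means item i occurs in every transaction
    containing the current pattern. If any done-set item has bit 1,
    adding it keeps the support unchanged, so the current pattern is
    not closed and its branch can be pruned."""
    return not any(closed_index[i] for i in done_set)

# Slide example: pattern CB, closed_index over (D, C, A, B) = (0, 1, 1, 1)
closed_index = {"D": 0, "C": 1, "A": 1, "B": 1}
print(survives_closed_pruning(closed_index, done_set={"D", "A"}))
# False: the bit for done-set item A is 1, so CB is not closed
```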
Experimental Setting • Data • Frequent itemset mining dataset repository (http://fimi.cs.helsinki.fi/data/) • Accidents, Chess, Connect, and Pumsb_star datasets • Implementation environment • Pentium 4, 2.6 GHz, 1 GB memory, Linux
Performance Study • Number of representative patterns (Figures: Accidents dataset, δ = 0.1; Chess dataset, δ = 0.1)
Performance Study • Running time (Figures: Pumsb_star dataset, δ = 0.1; Accidents dataset, δ = 0.1)
Performance Study • Quality of representative patterns (Figures: Accidents dataset, min_sup = 0.4, δ = 0.1; Accidents dataset, δ = 0.2)