This study proposes a method to compress frequent itemsets in databases, reducing redundancy and improving efficiency in pattern mining. By clustering similar patterns and selecting representative ones, the approach achieves high-quality compression while preserving information. The algorithm combines global and local strategies to discover representative patterns, balancing efficiency and performance.
Mining Compressed Frequent-Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005 Presenter: 吳建良
Outline • Introduction • Problem Statement and Analysis • Discovering Representative Patterns • Performance Study
Introduction • Frequent Pattern Mining • Given a database, find all the frequent itemsets • Challenges • Efficiency? • Many scalable mining algorithms are available now • Usability? • High minimum support: only common-sense patterns • Low minimum support: an explosive number of results
Existing Compression Techniques • Lossless compression • Closed frequent patterns • Non-derivable frequent itemsets • Lossy approximation • Maximal frequent patterns
A Motivating Example • Closed frequent patterns • Report P1, P2, P3, P4, P5 • Place too much emphasis on support • No compression • Maximal frequent patterns • Report only P3 • Care only about the expression • Lose the support information • A desirable output: P2, P3, P4 • High-quality compression must consider both expression and support (Table: a subset of frequent itemsets in the Accidents dataset)
Compressing Frequent Patterns • Compressing framework • Clustering frequent patterns by pattern similarity • Pick a representative pattern for each cluster • Key Problems • Similarity measure • Quality of the clustering • Expression and support • Efficiency
Distance Measure • Let P1 and P2 be two closed frequent patterns and T(P) be the set of transactions containing P; the distance between P1 and P2 is D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| • Example: let T(P1) = {t1, t2, t3, t4, t5} and T(P2) = {t1, t2, t3, t4, t6}; then D(P1, P2) = 1 − 4/6 = 1/3 • D characterizes the support but ignores the expression
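A minimal sketch of this distance in Python (the function name and string transaction IDs are illustrative, not from the paper):

```python
def pattern_distance(t1: set, t2: set) -> float:
    """D(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|,
    given the supporting transaction sets of the two patterns."""
    union = t1 | t2
    if not union:
        return 0.0  # convention when both transaction sets are empty
    return 1.0 - len(t1 & t2) / len(union)

# The slide's example: the two patterns share 4 of 6 transactions
T_P1 = {"t1", "t2", "t3", "t4", "t5"}
T_P2 = {"t1", "t2", "t3", "t4", "t6"}
print(pattern_distance(T_P1, T_P2))  # 0.333... = 1/3
```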
Representative Patterns • Incorporate expression into the representative pattern • The representative pattern should be able to express all the other patterns in the same cluster (i.e., it is their superset) • Here the representative pattern is Pr = {38, 16, 18, 12, 17} • The representative pattern is also good w.r.t. distance • D(Pr, P1) ≤ D(P1, P2) and D(Pr, P2) ≤ D(P1, P2) • The distance to Pr can be computed using supports only
δ-Clustering • δ-cover • A pattern P is δ-covered by another pattern P′ if P ⊆ P′ and D(P, P′) ≤ δ • δ-cluster • There exists a representative pattern Pr in the cluster • Each pattern P in the cluster is δ-covered by Pr
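A sketch of the δ-cover test in Python, assuming absolute support counts (names are illustrative). Because P ⊆ Pr implies T(Pr) ⊆ T(P), the distance collapses to 1 − sup(Pr)/sup(P), so no transaction sets are needed:

```python
def delta_covered(p: frozenset, sup_p: int,
                  pr: frozenset, sup_pr: int, delta: float) -> bool:
    """P is delta-covered by Pr iff P ⊆ Pr and D(P, Pr) <= delta.
    With P ⊆ Pr, D(P, Pr) = 1 - sup(Pr)/sup(P)."""
    return p <= pr and 1.0 - sup_pr / sup_p <= delta

# Illustrative supports: sup(P) = 10, sup(Pr) = 7 -> distance 0.3 <= 0.35
print(delta_covered(frozenset({"a", "b"}), 10,
                    frozenset({"a", "b", "c"}), 7, 0.35))  # True
```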
Characteristics of δ-Clustering • If P is δ-covered by a representative pattern Pr, then P ⊆ Pr implies T(Pr) ⊆ T(P), so D(P, Pr) = 1 − sup(Pr)/sup(P) ≤ δ • Suppose sup(P) ≥ min_sup = M; then sup(Pr) ≥ (1 − δ)·sup(P) ≥ (1 − δ)·M • Hence all representative patterns can be found by mining with the lowered support threshold (1 − δ)·M (see the sketch below)
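A tiny sketch of the lowered threshold implied by this bound (the function name is illustrative); it reproduces the "support 2" threshold used in the running example on the next slides:

```python
from math import ceil

def lowered_min_sup(M: int, delta: float) -> int:
    """Every representative pattern has support >= (1 - delta) * M,
    so mining (closed) patterns at this lowered threshold suffices."""
    return ceil((1 - delta) * M)

# The running example: delta = 0.35, M = 3 -> threshold 2
print(lowered_min_sup(3, 0.35))  # 2
```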
Pattern Compression Problem • Find the minimum number of clusters (representative patterns) such that • every frequent pattern is δ-covered by at least one representative pattern • NP-hard • By reduction from the set-covering problem
Discovering Representative Patterns • RPglobal • Assumes all frequent patterns have been mined • Directly applies the greedy set-covering algorithm • Guaranteed bound w.r.t. the optimal solution • RPlocal • Mines directly from the raw dataset • Gains efficiency, loses the bound guarantee • RPcombine • Combines the two methods • Trades off efficiency against performance
RPglobal • Algorithm • Collect the complete coverage information • Find the set of representative patterns greedily, maximizing |Set(RP)| at each step • Example: δ = 0.35, M = 3, FP(M) = {A, B, C, D, AB, BD}; mining with the lowered support threshold 2 gives the closed patterns {B, C, AB, BD, ABC, ABD}
RPglobal (cont.) • Step 1: collect the complete coverage information • Set(A)={A}, Set(B)={B}, Set(C)={C}, Set(D)={D} • Set(AB)={A, B}, Set(AC)={A, C}, Set(AD)={A, D}, Set(BC)={C}, Set(BD)={B, D} • Set(ABC)={A, C, AB}, Set(ABD)={A, D, BD} • Step 2: greedily cover FP(M) = {A, B, C, D, AB, BD} • Pick Set(ABC), then Set(ABD), then Set(AB) or Set(BD)
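A minimal greedy set-cover sketch over the slide's coverage information (function name is illustrative; this is the generic greedy step RPglobal applies, not the paper's exact implementation). Ties may break differently from the slide, but three representatives result either way:

```python
def rp_global(coverage: dict, fp: set) -> list:
    """Greedy set cover: coverage maps each candidate representative
    to the set of frequent patterns it delta-covers; fp is FP(M).
    Each round picks the candidate covering the most uncovered patterns."""
    remaining, chosen = set(fp), []
    while remaining:
        best = max(coverage, key=lambda c: len(coverage[c] & remaining))
        if not coverage[best] & remaining:
            break  # nothing can cover the rest
        chosen.append(best)
        remaining -= coverage[best]
    return chosen

# Coverage information from the slide (delta = 0.35, M = 3)
coverage = {
    "A": {"A"}, "B": {"B"}, "C": {"C"}, "D": {"D"},
    "AB": {"A", "B"}, "AC": {"A", "C"}, "AD": {"A", "D"},
    "BC": {"C"}, "BD": {"B", "D"},
    "ABC": {"A", "C", "AB"}, "ABD": {"A", "D", "BD"},
}
print(rp_global(coverage, {"A", "B", "C", "D", "AB", "BD"}))
# ['ABC', 'BD', 'ABD'] with this tie-breaking; the slide picks
# ABC, ABD, then AB or BD, also three representatives
```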
RPlocal • Depth-first search strategy • A pattern can only be covered by its sons or by patterns visited before • Integrates pattern compression into the frequent-pattern mining process (the FP-growth method) • Benefits • No need to store all outputs • More efficient pruning methods (Figure: for the pattern P = {a, c}, the patterns covering P are P's sons and the previously visited patterns)
RPlocal (cont.) • Algorithm (see the sketch below) • FP-growth mining with a depth-first search strategy • Remember all previously discovered representative patterns • For each pattern P not yet covered • Use the closed_index for closed-pattern pruning • Select the representative pattern Pr that covers P and has the largest coverage
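A highly simplified, non-incremental sketch of this greedy choice, assuming the patterns have already been enumerated in depth-first order as (frozenset, support) pairs. Unlike the real RPlocal, which makes this choice on the fly inside FP-growth and restricts candidates to the current branch and previously visited patterns, this sketch searches the whole list:

```python
def rp_local_sketch(patterns, delta):
    """Scan patterns in DFS order; when the current pattern P is not
    yet delta-covered, emit as a new representative the pattern that
    delta-covers P and covers the most still-uncovered patterns.
    (RPlocal does this during FP-growth, without materializing the
    pattern list, and adds closed_index pruning.)"""
    def covers(rep, rep_sup, p, p_sup):
        return p <= rep and 1.0 - rep_sup / p_sup <= delta

    reps, covered = [], set()
    for p, p_sup in patterns:
        if p in covered:
            continue
        # Candidates that delta-cover P (P always covers itself)
        cands = [(r, rs) for r, rs in patterns if covers(r, rs, p, p_sup)]
        best = max(cands, key=lambda c: sum(
            1 for q, qs in patterns
            if q not in covered and covers(c[0], c[1], q, qs)))
        reps.append(best)
        covered |= {q for q, qs in patterns if covers(best[0], best[1], q, qs)}
    return reps
```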
RPlocal (cont.) • Example: δ = 0.35, M = 3 • Item frequencies: B:4, A:3, C:3, D:3 • Reorder all itemsets by descending frequency (Original Dataset → Reordered Dataset) • Build the FP-tree and depth-first search the pattern space
RPlocal (cont.) • FP-tree|D, Pr set: Set(D) = {D}; extensions DC, DA, DB • sup(DC) = 1 → infrequent, pruned • FP-tree|DA, Pr set: Set(DA) = {D} • DAB, Pr set: Set(DAB) = {D, DA} • After visiting DB: Set(DAB) = {D, DA, DB}
RPlocal (cont.) • FP-tree|C, Pr set: Set(DAB) = {D, DA, DB}, Set(C) = {C}; extensions CA, CB • FP-tree|CA, Pr set: Set(DAB) = {D, DA, DB}, Set(CA) = {C} • CAB, Pr set: Set(DAB) = {D, DA, DB}, Set(CAB) = {C, CA} • CB is eliminated by closed pruning • Finally: Set(DAB) = {D, DA, DB}, Set(CAB) = {C, CA, CB}
Closed Pruning • In the depth-first search, the single items are partitioned into three disjoint sets: • conditional set: items in the current pattern • todo-set: items still to be expanded • done-set: all the other items • closed_index example: current pattern CB • conditional set: {C, B}, todo-set: {}, done-set: {D, A} • closed_index of CB over (D, C, A, B) = (0, 1, 1, 1) • The bit for A, a done-set item, is 1, so CB is not closed
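A small sketch of this check in Python (the dict representation of the closed_index is illustrative; the paper uses a bit vector):

```python
def survives_closed_pruning(closed_index: dict, done_set: set) -> bool:
    """closed_index[i] = 1 means item i occurs in every transaction
    containing the current pattern. If any done-set item has bit 1,
    adding it keeps the support unchanged, so the current pattern is
    not closed and its branch can be pruned."""
    return not any(closed_index[i] for i in done_set)

# Slide example: pattern CB, closed_index over (D, C, A, B) = (0, 1, 1, 1)
closed_index = {"D": 0, "C": 1, "A": 1, "B": 1}
print(survives_closed_pruning(closed_index, done_set={"D", "A"}))
# False: the bit for done-set item A is 1, so CB is not closed
```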
Experimental Setting • Data • Frequent itemset mining dataset repository (http://fimi.cs.helsinki.fi/data/) • Accidents, Chess, Connect, and Pumsb_star datasets • Implementation environment • Pentium 4, 2.6 GHz, 1 GB memory, Linux
Performance Study • Number of representative patterns (Figures: Accidents dataset, δ = 0.1; Chess dataset, δ = 0.1)
Performance Study • Running time (Figures: Pumsb_star dataset, δ = 0.1; Accidents dataset, δ = 0.1)
Performance Study • Quality of representative patterns (Figures: Accidents dataset, min_sup = 0.4, δ = 0.1; Accidents dataset, δ = 0.2)