Cartesian Contour: A Concise Representation for a Collection of Frequent Sets. Ruoming Jin, Kent State University. Joint work with Yang Xiang and Lin Liu (KSU).
Frequent Pattern Mining • Summarizing the underlying datasets, providing key insights • Key building block for the data mining toolbox • Association rule mining • Classification • Clustering • Change detection • etc. • Application domains • Business, biology, chemistry, WWW, computer/networking security, software engineering, …
The Problem • The number of patterns is too large • Attempts • Maximal Frequent Itemsets • Closed Frequent Itemsets • Non-Derivable Itemsets • Compressed or Top-k Patterns • … • Tradeoff • Significant information loss • Large size
Pattern Summarization • Using a small number of itemsets to best represent the entire collection of frequent itemsets • The Spanning Set Approach [Afrati-Gionis-Mannila, KDD04] • Exact description = maximal frequent itemsets • Our problem: • Can we find a concise representation that allows both exact and approximate summarization of a collection of frequent itemsets?
Basic Idea • Frequent itemsets: {A,B,G,H}, {A,B,I,J}, {A,B,K,L}, {C,D,G,H}, {C,D,I,J}, {C,D,K,L}, {E,F,G,H}, {E,F,I,J}, {E,F,K,L} (9 itemsets, 36 items) • Covering by a Cartesian product: {{A,B},{C,D},{E,F}} × {{G,H},{I,J},{K,L}} (1 biclique, 6 itemsets, 12 items)
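The counts on this slide can be checked with a short sketch. The helper names `cartesian_product` and `total_items` are illustrative and not from the slides; the sketch assumes the product of two itemset collections means every union of one itemset from each side.

```python
from itertools import product

def cartesian_product(xs, ys):
    """Return every union x | y with x taken from xs and y taken from ys."""
    return {x | y for x, y in product(xs, ys)}

def total_items(itemsets):
    """Total number of items written down across all itemsets."""
    return sum(len(s) for s in itemsets)

# The two sides of the biclique from the slide.
X = [frozenset("AB"), frozenset("CD"), frozenset("EF")]
Y = [frozenset("GH"), frozenset("IJ"), frozenset("KL")]

covered = cartesian_product(X, Y)
print(len(covered), total_items(covered))    # 9 36  (explicit listing)
print(len(X) + len(Y), total_items(X + Y))   # 6 12  (Cartesian representation)
```

Listing the nine itemsets explicitly takes 36 items, while the two sides of the product take 12.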
Cartesian Covering • [Figure; label: non-frequent itemsets]
Problem Formulation • Cartesian product • e.g. • Cost of a Cartesian product • e.g. 1 biclique, 3 itemsets, and 5 items • Covering • e.g. How can we use Cartesian products to concisely represent a collection of frequent itemsets?
Exact and Approximate Covering • Exact representation • Cost: 2 bicliques, 4 itemsets, 6 items • False positives: none • Approximate representation • Cost: 1 biclique, 3 itemsets, 5 items • False positives: {G,C}, {G,D}, {G,C,D}
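A minimal sketch of how false positives arise in an approximate covering. The maximal frequent itemsets {A,B,C,D} and {G} and the product {{A,B},{G}} × {{C,D}} are hypothetical data, chosen only so that the result matches the false positives listed on the slide; the slide's actual dataset is not recoverable from the text. The sketch assumes that a product describes every non-empty subset of each union it generates.

```python
from itertools import combinations, product

def subsets(itemset):
    """All non-empty subsets of an itemset."""
    items = sorted(itemset)
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)}

def described(xs, ys):
    """Itemsets implied frequent by the Cartesian product xs x ys."""
    out = set()
    for x, y in product(xs, ys):
        out |= subsets(x | y)
    return out

def false_positives(xs, ys, maximal_frequent):
    """Described itemsets that are not actually frequent."""
    frequent = set()
    for m in maximal_frequent:
        frequent |= subsets(m)
    return described(xs, ys) - frequent

# Hypothetical data (not from the slides).
mfi = [frozenset("ABCD"), frozenset("G")]
X = [frozenset("AB"), frozenset("G")]
Y = [frozenset("CD")]

fp = false_positives(X, Y, mfi)
print(sorted("".join(sorted(s)) for s in fp))  # ['CDG', 'CG', 'DG']
```

This product uses 1 biclique, 3 itemsets and 5 items, and the price of that compactness is the three itemsets containing G that it wrongly reports as frequent.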
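Covering Maximal Frequent Itemsets • Maximal frequent itemsets (figure labels): MNOVWX, CDEJKL, CDEVWX, MNOGHI, CDEGHI, PQRJKL, CDESTU, ABCGHI, ABCSTU • Itemset collections (figure labels): {{ABC}, {CDE}}, {{MNO}, {PQR}}, {{GHI}, {JKL}}, {{STU}, {VWX}}
Problem Reformulation • Given the maximal frequent itemsets: • Exact representation • Approximate representation • [Figure: frequent itemsets and Cartesian products C1 and C2, shown for each representation]
Minimal Biclique Set Cover Problem • Ground set: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 • Sets shown in the figure: {1, 2, 3, 4, 6, 7, 8, 9}, {5, 10, 11}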
NP-hardness • By reducing Minimal Biclique Set Cover to our problem, we can prove that both Problem 1 (exact) and Problem 2 (approximate) are NP-hard. • Minimal Biclique Set Cover is a variant of the classical Set Cover problem • Can we use the standard set-cover greedy algorithm?
Naïve greedy algorithm • Greedy algorithm: • Each time, choose the biclique with the lowest price, i.e. its cost divided by the number of newly covered elements. • This method has a logarithmic approximation bound. • The problem? • The number of candidate bicliques is $2^{|X|+|Y|}$!
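The greedy rule above is the standard weighted set-cover greedy, sketched below under the assumption that every candidate is already enumerated with a cost and the set of ground elements it covers. The candidate names `B1` to `B3` and their costs are made up for illustration; only the two covered sets for `B1` and `B2` echo the Minimal Biclique Set Cover slide.

```python
from fractions import Fraction

def greedy_cover(ground, candidates):
    """candidates maps a name to (cost, set of ground elements it covers)."""
    uncovered = set(ground)
    chosen = []
    while uncovered:
        best_name, best_price = None, None
        for name, (cost, elems) in candidates.items():
            gain = len(elems & uncovered)      # newly covered elements
            if gain == 0:
                continue
            price = Fraction(cost, gain)       # cost per newly covered element
            if best_price is None or price < best_price:
                best_name, best_price = name, price
        if best_name is None:
            raise ValueError("candidates cannot cover the ground set")
        chosen.append(best_name)
        uncovered -= candidates[best_name][1]
    return chosen

ground = range(1, 12)
candidates = {                                  # hypothetical costs
    "B1": (4, {1, 2, 3, 4, 6, 7, 8, 9}),
    "B2": (3, {5, 10, 11}),
    "B3": (9, set(range(1, 12))),
}
print(greedy_cover(ground, candidates))  # ['B1', 'B2']
```

The loop itself is cheap; the obstacle named on the slide is that the candidate dictionary would need an entry for every possible biclique.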
Candidate Reduction • Assuming one side of the biclique candidate is known, how do we choose the other side?
Greedy Algorithm • Biclique candidate: split into single-Y-vertex bicliques and sort (covering 4, covering 3, covering 3) • Add 1st single-Y-vertex biclique: Cost = 1 • Add 2nd single-Y-vertex biclique: Cost = 5/7 • Add 3rd single-Y-vertex biclique: Cost = 6/8 > 5/7 • Fixed! Cheapest sub-biclique!
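One possible reading of this slide, as a sketch: with the X side fixed, sort the Y-vertices by how many elements they cover, add them one at a time, and keep the prefix with the lowest ratio. The function name, the Y-vertex names, and the covered sets are hypothetical. Counting the cost as the number of vertices (|X| plus the number of chosen Y-vertices) is an assumption that happens to reproduce the values 1, 5/7 and 6/8 shown on the slide for an X side of size 3.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_size, y_cover):
    """y_cover maps each Y-vertex to the non-empty set of elements it covers
    together with the fixed X side. Returns (price, chosen Y-vertices)."""
    # Sort Y-vertices by coverage, largest first (stable for ties).
    order = sorted(y_cover, key=lambda y: len(y_cover[y]), reverse=True)
    covered, best = set(), None
    for k, y in enumerate(order, start=1):
        covered |= y_cover[y]
        # Assumed cost: one unit per vertex on either side.
        price = Fraction(x_size + k, len(covered))
        if best is None or price < best[0]:
            best = (price, order[:k])
    return best

# Hypothetical coverage: 4, 3 and 3 elements, the third overlapping the second.
y_cover = {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {6, 7, 8}}
price, ys = cheapest_sub_biclique(3, y_cover)
print(price, ys)  # 5/7 ['y1', 'y2']
```

Adding `y3` would raise the ratio to 6/8, so the sub-biclique with two Y-vertices is kept.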
Approximation Bound of the Greedy Algorithm • The greedy SubBiclique procedure finds a sub-biclique whose price is at most $e/(e-1)$ times the price of the optimal (cheapest) sub-biclique!
Further Reduction • Using only Idea 1, the time complexity is still exponential. • How can we reduce this further? • Are all the combinations equally important? • No, because some are more likely to connect to the Y side. • Our solution: frequent itemset mining!
Overall Algorithm • Step 1: Use a frequent itemset mining tool to find all the (one-side maximal) biclique candidates; • Step 2: Calculate the cheapest sub-biclique for each candidate using the greedy procedure; • Step 3: Compare all the sub-bicliques and choose the cheapest one; • Step 4: If the MFIs are totally covered, done; otherwise go to Step 2.
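Steps 2 to 4 can be sketched as an outer loop, assuming Step 1 has already produced the candidates. Everything here is illustrative: the candidate names, the encoding of a candidate as an X-side size plus a map from Y-vertices to the MFI identifiers they cover, and the vertex-count cost are assumptions, not details taken from the slides.

```python
from fractions import Fraction

def cheapest_sub_biclique(x_size, y_cover, uncovered):
    """Cheapest prefix of Y-vertices, sorted by newly covered MFIs."""
    gains = {y: elems & uncovered for y, elems in y_cover.items()}
    order = sorted((y for y in gains if gains[y]),
                   key=lambda y: len(gains[y]), reverse=True)
    covered, best = set(), None
    for k, y in enumerate(order, start=1):
        covered |= gains[y]
        price = Fraction(x_size + k, len(covered))   # assumed vertex-count cost
        if best is None or price < best[0]:
            best = (price, tuple(order[:k]), frozenset(covered))
    return best

def cover_mfi(mfi_ids, candidates):
    """candidates maps a name to (x_size, {Y-vertex: set of MFI ids covered})."""
    uncovered, picked = set(mfi_ids), []
    while uncovered:                                  # Step 4: repeat until covered
        options = []
        for name, (x_size, y_cover) in candidates.items():
            sub = cheapest_sub_biclique(x_size, y_cover, uncovered)  # Step 2
            if sub is not None:
                options.append((sub[0], name, sub[1], sub[2]))
        if not options:
            raise ValueError("remaining MFIs cannot be covered")
        _, name, ys, covered = min(options, key=lambda o: o[0])      # Step 3
        picked.append((name, ys))
        uncovered -= covered
    return picked

candidates = {                                        # hypothetical Step 1 output
    "X1": (3, {"y1": {1, 2, 3, 4}, "y2": {5, 6, 7}, "y3": {6, 7, 8}}),
    "X2": (1, {"y4": {8}}),
}
result = cover_mfi(range(1, 9), candidates)
print(result)  # [('X1', ('y1', 'y2')), ('X2', ('y4',))]
```

In the second round the remaining MFI is cheaper to cover with `X2` (ratio 2) than by extending `X1` with `y3` (ratio 4), so the loop switches candidates.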
Approximation Bound • Our algorithm has an approximation ratio of $\frac{e}{e-1}(\ln n + 1)$ with respect to the candidate set (all the sub-bicliques with one side coming from frequent itemset mining).
Speed-up techniques (1) • Using closed itemsets for X and Y • Initially, X and Y each contain all the frequent itemsets (FI). • Using Cartesian products to cover the MFIs is similar to factorizing the MFIs; • The maximal factor itemsets of the MFIs are closed itemsets, whose number is much smaller!
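To illustrate why restricting X and Y to closed itemsets shrinks the search space, here is a brute-force sketch on a tiny hypothetical transaction database. It uses the usual definitions: an itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent. The function names and the data are not from the slides, and real miners avoid this exhaustive enumeration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_support):
    """Brute force: map each frequent itemset to its support."""
    items = sorted(set().union(*transactions))
    result = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = frozenset(combo)
            sup = support(s, transactions)
            if sup >= min_support:
                result[s] = sup
    return result

def closed_itemsets(freq):
    """Frequent itemsets with no proper superset of equal support."""
    return {s for s, sup in freq.items()
            if not any(s < t and sup == freq[t] for t in freq)}

def maximal_itemsets(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in freq if not any(s < t for t in freq)}

# Hypothetical transactions.
transactions = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]
freq = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(freq)
print(len(freq), len(closed), len(maximal_itemsets(freq)))  # 7 3 1
```

Here 7 frequent itemsets reduce to 3 closed ones ({C}, {A,B}, {A,B,C}), with a single maximal itemset.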
Speed-up techniques (2) • Dense graph vs. sparse graph (frequent itemsets and supporting transactions): a tradeoff • If the number of frequent itemsets is small, valuable biclique candidates are not fully used! • If the number of frequent itemsets is large, handling those candidates is too slow!
Speed-up techniques (3) • Iterative procedure • There is a large number of closed itemsets; • Covering the MFIs in one pass can produce a huge number of biclique candidates; • So the MFIs are covered in several passes; • The support level is reduced gradually!
Experiments • Data sets:
Conclusion • We propose an interesting summarization problem that considers the interaction between frequent patterns • We transform this problem into a generalized minimal biclique covering problem and design an approximation algorithm with a provable bound • The experimental results demonstrate the effectiveness and efficiency of our approach
References • [Bayardo98] Roberto J. Bayardo Jr. Efficiently mining long patterns from databases. SIGMOD98. • [Pasquier99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Discovering frequent closed itemsets for association rules. ICDT99. • [Calder07] Toon Calders and Bart Goethals. Non-derivable itemset mining. Data Min. Knowl. Discov. 07. • [Han02] Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM02. • [Xin06] Dong Xin, Hong Cheng, Xifeng Yan, and Jiawei Han. Extracting redundancy-aware top-k patterns. KDD06. • [Xin05] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. VLDB05. • [Afrati04] Foto Afrati, Aristides Gionis, and Heikki Mannila. Approximating a collection of frequent sets. KDD04. • [Yan05] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based approach. KDD05. • [Wang06] Chao Wang and Srinivasan Parthasarathy. Summarizing itemset patterns using probabilistic models. KDD06. • [Jin08] Ruoming Jin, Muad Abu-Ata, Yang Xiang, and Ning Ruan. Effective and efficient itemset pattern summarization: regression-based approaches. KDD08. • [Xiang08] Yang Xiang, Ruoming Jin, David Fuhry, and Feodor F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. KDD08.
Related Work • K-itemset approximation: [Afrati04]. • Differences: • their work is a special case of ours; • their work is expensive for exact description; • our work uses set cover and max-k cover methods. • Restoring the frequency of frequent itemsets: [Yan05, Wang06, Jin08]. • Hyperrectangle covering problem: [Xiang08].