Quantitative Evaluation of Approximate Frequent Pattern Mining Algorithms
Rohit Gupta, Gang Fang, Blayne Field, Michael Steinbach and Vipin Kumar
Email: rohit@cs.umn.edu
Department of Computer Science and Engineering, University of Minnesota, Twin Cities
Outline of the Talk
• Introduction
• Background of Algorithms
• Motivation
• Contributions
• Evaluation Methodology
• Experimental Results and Discussion
  • Synthetic Datasets
  • Effects of higher random noise
  • Efficiency and Scalability
• Conclusions
Introduction
• Traditional association mining algorithms use a strict definition of support
  • For an itemset I', a transaction must contain all of its items in order to be counted as a supporting transaction (a minimal sketch of this strict counting follows below)
• Real-life datasets are noisy
  • True patterns cannot be found at the desired level of support because noise fragments them
  • Too many spurious patterns appear at low support levels
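The sketch below is an illustration of strict support counting, not code from the paper; the function name `exact_support` and the toy transactions are assumptions made for the example.

```python
# Minimal sketch: exact (strict) support, where a transaction supports an
# itemset only if it contains every item of that itemset.
def exact_support(transactions, itemset):
    """Fraction of transactions that contain ALL items in `itemset`."""
    itemset = set(itemset)
    count = sum(1 for t in transactions if itemset <= set(t))
    return count / len(transactions)

# Toy example: one noisy transaction fragments the "true" pattern {a, b, c}.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},        # item "c" lost to noise
    {"a", "b", "c"},
    {"d"},
]
print(exact_support(transactions, {"a", "b", "c"}))  # 0.5 instead of 0.75
```

This is exactly the fragility the slide points at: a single flipped bit removes a transaction from the support set of the full pattern.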
Background
• Weak ETI [Yang et al., 2001] – the pattern sub-matrix overall should be no more than ε% sparse (at most an ε fraction of zeros)
• Strong ETI ('GETI') [Yang et al., 2001] – each row of the pattern sub-matrix should be no more than ε% sparse
• Recursive Weak ETI ('RW') [Seppanen et al., 2004] – the itemset and all of its subsets must also be weak ETIs
• AFI ('AFI') [Liu et al., 2006] – adds a column constraint on top of strong ETI to reduce spurious items
• AC-CLOSE ('AC-CLOSE') [Cheng et al., 2006] – uses Apriori to find a core pattern at low support and then builds on it to find AFI-type patterns
A hedged sketch of the weak vs. strong ETI checks is given below.
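The following sketch is my own illustration of the two basic checks (not the authors' implementations); `is_weak_eti` and `is_strong_eti` are assumed names operating on a 0/1 transaction-by-item sub-matrix.

```python
# Illustrative sketch: weak vs. strong error-tolerant itemset (ETI) checks.
import numpy as np

def is_weak_eti(submatrix, eps):
    """Weak ETI: the overall fraction of zeros in the sub-matrix is <= eps."""
    return (1.0 - submatrix.mean()) <= eps

def is_strong_eti(submatrix, eps_r):
    """Strong ETI: every supporting row is individually at most eps_r sparse."""
    row_sparsity = 1.0 - submatrix.mean(axis=1)
    return bool((row_sparsity <= eps_r).all())

# 4 supporting transactions x 3 items; one row is missing two of three items.
sub = np.array([[1, 1, 1],
                [1, 1, 1],
                [1, 0, 0],   # 2/3 sparse in this row
                [1, 1, 1]])
print(is_weak_eti(sub, 0.25))    # True  (2/12 zeros overall)
print(is_strong_eti(sub, 0.25))  # False (third row violates the row bound)
```

The example shows why strong ETI is the stricter notion: the same sub-matrix can pass the overall sparsity bound while a single heavily corrupted row fails the per-row bound.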
Motivation
• A comprehensive and systematic evaluation of these algorithms is lacking
• Each proposed algorithm is shown to be far superior to the previously proposed algorithms
• However, this may not be true, since the performance of the algorithms is sensitive to the input parameters and to the noise in the data
• Different algorithms may require different sets of parameters (minsup and error tolerances) to discover the true patterns
• Most existing evaluations have been confined to a small set of parameters, often ones favorable to the newly introduced algorithm
Contributions
• Performed a comprehensive evaluation of various approximate pattern mining algorithms
• Compared the algorithms based on the quality of the discovered patterns and on robustness to the input parameters
• Highlighted the importance of choosing optimal parameters, which may differ for each algorithm
• Proposed variations of the existing algorithms that generate higher-quality patterns
Proposed Variations of the Algorithms
• 'GETI-PP' uses an additional parameter (εc) to ensure that every item in each strong ETI generated by the greedy algorithm 'GETI' also satisfies the column constraint
• 'RW-PP' uses an additional parameter (εc) to ensure that every item in each recursive weak ETI generated by the algorithm 'RW' also satisfies the column constraint (see the sketch below)
• An itemset I' is said to be a recursive strong ETI ('RS') if I' and all subsets of I' are strong ETIs
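A hedged sketch of the "-PP" post-processing idea, written as my own illustration rather than the authors' code: given the sub-matrix of a mined pattern, drop any item whose column is sparser than the column tolerance εc.

```python
# Illustrative sketch: enforce the column constraint on an already-mined
# pattern sub-matrix (assumed helper, not the paper's implementation).
import numpy as np

def enforce_column_constraint(submatrix, items, eps_c):
    """Keep only items whose column sparsity is <= eps_c within the pattern."""
    col_sparsity = 1.0 - submatrix.mean(axis=0)
    return [item for item, s in zip(items, col_sparsity) if s <= eps_c]

sub = np.array([[1, 1, 0],
                [1, 1, 0],
                [1, 1, 1],
                [1, 0, 1]])
print(enforce_column_constraint(sub, ["a", "b", "c"], eps_c=0.25))  # ['a', 'b']
```

Item "c" is present in only half of the supporting transactions, so it is pruned; this is the mechanism by which the column tolerance reduces spurious items in GETI-PP and RW-PP.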
Evaluation Methodology
• Recoverability – quantifies how well an approximate pattern mining algorithm recovers the base patterns [Kum et al., 2002]
• Spuriousness – measures the quality of the found patterns, i.e., the fraction of spurious items in them [Kum et al., 2002]
Worked example from the slide figure (base patterns B = {B1, B2}, found patterns F = {F1, F2, F3}):
  Recoverability = (3 + 3) / (|B1| + |B2|) = 6/7
  Spuriousness = (0 + 1 + 2) / (|F1| + |F2| + |F3|) = 3/11
A small sketch of both measures follows.
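The sketch below is a simplified reading of the slide's example, assuming each base pattern is matched to the found pattern with maximum overlap and each found pattern to its best-matching base pattern; the paper's exact matching scheme may differ.

```python
# Simplified sketch of recoverability and spuriousness (assumed matching rule).
def recoverability(base_patterns, found_patterns):
    """For each base pattern, count its best overlap with any found pattern."""
    recovered = sum(max(len(b & f) for f in found_patterns) for b in base_patterns)
    return recovered / sum(len(b) for b in base_patterns)

def spuriousness(base_patterns, found_patterns):
    """For each found pattern, count items outside its best-matching base pattern."""
    spurious = sum(min(len(f - b) for b in base_patterns) for f in found_patterns)
    return spurious / sum(len(f) for f in found_patterns)

# Toy data (not the slide's figure): B1 recovered fully, F2 and F3 each carry noise.
B = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
F = [{"a", "b", "c"}, {"d", "e", "f", "x"}, {"g", "h", "i"}]
print(recoverability(B, F))  # (3 + 3) / 7
print(spuriousness(B, F))    # (0 + 1 + 2) / 10 for this toy data
```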
Evaluation Methodology (Cont.)
• Significance – a single measure that balances the trade-off between recoverability and spuriousness
• Redundancy – quantifies the similarity among the found patterns and hence the actual number of useful patterns (in the slide's example, Redundancy = 1 + 1 + 3 = 5)
• Robustness – since the ground truth for a real dataset is not known, algorithms that generate acceptable-quality patterns over a relatively wide range of input parameters are favored
Experimental Set-Up
• Implemented AFI, GETI, RW, GETI-PP and RW-PP
• AC-CLOSE was simulated using AFI
  • Patterns generated by AFI that also contain a core block are essentially the same as those generated by AC-CLOSE
• Datasets used for the experiments
  • 8 different synthetic datasets – easy to evaluate algorithms on, since the ground truth is known
  • A real zoo dataset
• Only maximal patterns were used for the comparison
• For each noisy dataset, each algorithm is run over a wide range of parameter values (minsup, εr and εc)
• Since the noise is random, the process of adding noise and running the algorithm is repeated 5 times and average results are reported (a sketch of this protocol is shown below)
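The following is an assumed, illustrative version of the noise-and-repeat protocol; `add_random_noise`, `evaluate_with_noise`, and the scoring callback are names I introduce for the sketch, not the authors' harness.

```python
# Hedged sketch: flip entries of a 0/1 data matrix with probability p, run a
# mining routine, and average a quality score over 5 independent noise draws.
import numpy as np

def add_random_noise(data, p, rng):
    """Flip each cell of a binary matrix independently with probability p."""
    flips = rng.random(data.shape) < p
    return np.where(flips, 1 - data, data)

def evaluate_with_noise(clean_data, p, run_and_score, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    scores = [run_and_score(add_random_noise(clean_data, p, rng))
              for _ in range(n_repeats)]
    return float(np.mean(scores))

# `run_and_score` stands in for "mine patterns, then compute significance".
clean = np.ones((20, 10), dtype=int)
print(evaluate_with_noise(clean, 0.05, run_and_score=lambda d: d.mean()))
```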
Synthetic Datasets (No-Noise Version)
Results on Synthetic Dataset #8
• AC-CLOSE appears to be unstable
• GETI-PP and RW-PP show more stable performance than GETI and RW
• The performance of all the algorithms deteriorates as the noise increases
• GETI and RW (which use the single parameter ε) tend to pick up more spurious items
• GETI-PP and RW-PP (the variations of GETI and RW) generate patterns comparable in quality to those of AFI
Results on Synthetic Dataset #6
Optimal Parameters for Different Algorithms on Synthetic Dataset #6
Effects of More Random Noise
• In the previous experiments, random noise was constrained by the smallest implanted true pattern in the data
• In some real-life applications, small but true patterns can be hidden in even larger amounts of random noise
• The trade-off between recoverability and spuriousness becomes even more challenging
• High-noise experiments are computationally expensive
  • Scheme 1: for algorithms that could finish within the time limit (1 hour), the complete parameter space is searched
  • Scheme 2: for algorithms that could not finish within the time limit (1 hour), the parameter space is searched in order of complexity (see the sketch below)
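A hedged sketch of the second scheme, using assumed helper names and an assumed ordering heuristic (cheaper settings first): enumerate the parameter grid in order and stop once the wall-clock budget is spent.

```python
# Illustrative sketch: budgeted parameter search, stopping at a time limit.
import itertools
import time

def budgeted_grid_search(run_algorithm, minsups, row_tols, col_tols,
                         budget_seconds=3600):
    start, results = time.time(), {}
    # Assumed "order of complexity": higher minsup and lower tolerances first,
    # since they typically prune more and finish faster.
    combos = sorted(itertools.product(minsups, row_tols, col_tols),
                    key=lambda c: (-c[0], c[1], c[2]))
    for params in combos:
        if time.time() - start > budget_seconds:
            break  # budget exhausted; remaining combinations are skipped
        results[params] = run_algorithm(*params)
    return results

# `run_algorithm` stands in for one mining run returning a quality score.
res = budgeted_grid_search(lambda s, r, c: s + r + c,
                           minsups=[0.9, 0.7], row_tols=[0.1, 0.2],
                           col_tols=[0.1, 0.2], budget_seconds=5)
print(len(res))
```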
Results on Dataset #8 for More Random Noise
Efficiency and Scalability
• Although AFI is computationally more efficient than GETI-PP for less noisy datasets, it becomes very expensive as the noise increases
• GETI-PP and RW-PP (the simple proposed variations) are computationally efficient and generate patterns of quality comparable to AFI
• RW-PP shows only a marginal increase in run-time as the noise increases from 0% to 12%, after which its run-time increases rapidly
• GETI-PP scales well as the noise increases
• All algorithms were run on a Linux machine with 8 Intel(R) Xeon(R) E5310 CPUs @ 1.60 GHz (with 10 processes)
• Total run-time over 144 parameter combinations (4 minsup values and 6 different row and column tolerance values, i.e., 4 × 6 × 6) is reported in seconds (results on synthetic dataset #8)
Conclusions
• Different algorithms have different sweet spots in the parameter space
• If optimal parameters are provided, the difference in their performance is not as large as was previously believed
• Enforcing the column tolerance εc improves the performance of the single-parameter algorithms GETI and RW
• AFI, GETI-PP and RW-PP all use the column constraint εc and generate patterns of similar quality; however, GETI-PP and RW-PP are computationally more efficient than AFI, especially for very noisy datasets
For source code and datasets, visit: http://www.cs.umn.edu/~kumar/ETI
References
• C. Blake and C. Merz. UCI repository of machine learning databases. University of California, Irvine, 2007.
• H. Cheng, P. S. Yu, and J. Han. AC-Close: Efficiently mining approximate closed itemsets by core pattern recovery. In ICDM, pages 839–844, 2006.
• H. Cheng, P. S. Yu, and J. Han. Approximate frequent itemset mining in the presence of random noise. In Soft Computing for Knowledge Discovery and Data Mining, O. Maimon and L. Rokach (eds.), pages 363–389. Springer, 2008.
• H.-C. Kum, S. Paulsen, and W. Wang. Comparative study of sequential pattern mining frameworks: support framework vs. multiple alignment framework. In ICDM Workshop, 2002.
• J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, and J. Prins. Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In SDM, pages 405–416, 2006.
• J. Liu, S. Paulsen, W. Wang, A. Nobel, and J. Prins. Mining approximate frequent itemsets from noisy data. In ICDM, pages 721–724, 2005.
• J. Pei, A. K. H. Tung, and J. Han. Fault-tolerant frequent pattern mining: Problems and challenges. In DMKD, 2001.
• J. Seppanen and H. Mannila. Dense itemsets. In KDD, pages 683–688, 2004.
• M. Steinbach, P.-N. Tan, and V. Kumar. Support envelopes: A technique for exploring the structure of association patterns. In KDD, pages 296–305, 2004.
• P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Addison-Wesley, 2005.
• C. Yang, U. Fayyad, and P. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In KDD, pages 194–203, 2001.
Thank You
Questions?
All source code, datasets and results are publicly available at: http://www.cs.umn.edu/~kumar/ETI