A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming
• Chuang-Kai Chiou, Judy C. R. Tseng
• Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007
Outline
• Introduction
• Apriori Algorithm
• DHP Algorithm
• MPIP Algorithm
• SIT Algorithm
• Experiment and Evaluation
• Conclusion and Future Work
Introduction
• Apriori algorithm
  • A large number of candidate itemsets is generated.
• Several hash-based algorithms use hash functions to filter out unpromising candidate itemsets:
  • DHP algorithm
  • MPIP algorithm
• SIT algorithm
  • Uses sorting, indexing, and trimming techniques to reduce the number of itemsets to be considered.
  • Combines the advantages of the Apriori and MPIP algorithms.
Apriori Algorithm
[Figure: Apriori example on database D. Each scan of D counts the candidate sets C1, C2, C3 to produce the frequent itemsets L1, L2, L3.]
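The scan-and-filter loop shown in the figure can be sketched in a few lines. This is a minimal textbook-style Apriori, not the authors' implementation; the function name `apriori` and the absolute `min_support` count are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    min_support is an absolute count (an assumption for this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items (C1) and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        prev = list(level)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Scan the database once to count the surviving candidates (Ck -> Lk).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(level)
        k += 1
    return frequent


if __name__ == "__main__":
    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    for itemset, support in sorted(apriori(D, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), support)
```

Note how the candidate set is rebuilt and the whole database is rescanned for every k; this is the cost that the hash-based and SIT approaches aim to reduce.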
DHP Algorithm
[Figure: DHP example on the database.]
MPIP Algorithm (1/2)
• MPIP employs a minimal perfect hashing function for mining L1 and L2.
• It resolves the collision problem that occurs in DHP.
• The time needed for scanning and searching data items is reduced.
• It employs the Apriori algorithm to find the frequent k-itemsets for k > 2.
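The slides do not give MPIP's hash function, so the following is only an illustration of what a collision-free, gap-free (minimal perfect) addressing scheme for 2-itemsets can look like. It assumes items have already been renumbered 1..n; the names `pair_slot` and `count_pairs` and the triangular-number formula are choices made for this sketch, not taken from the paper.

```python
def pair_slot(i, j):
    """Map an unordered pair of distinct item numbers (1-based) to a unique slot.

    With n items, the slots cover exactly 0 .. n*(n-1)//2 - 1, so no two
    pairs collide and no slot is wasted.
    """
    if i == j:
        raise ValueError("a 2-itemset needs two distinct items")
    if i > j:
        i, j = j, i
    return (j - 1) * (j - 2) // 2 + (i - 1)


def count_pairs(transactions, n_items):
    """Count every 2-itemset in one scan using a flat table indexed by pair_slot."""
    table = [0] * (n_items * (n_items - 1) // 2)
    for t in transactions:
        items = sorted(set(t))
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                table[pair_slot(items[a], items[b])] += 1
    return table


if __name__ == "__main__":
    counts = count_pairs([[1, 2, 3], [2, 3], [1, 3]], n_items=3)
    print(counts)  # slot 0 is pair (1, 2), slot 1 is (1, 3), slot 2 is (2, 3)
```

Because each pair owns its own slot, a count read from the table is exact, whereas a DHP-style bucket may mix the counts of several colliding pairs.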
SIT Algorithm (1/5)
• For mining association rules, we propose a revised algorithm, the Sorting-Indexing-Trimming (SIT) approach.
• The SIT approach avoids generating unpromising candidate itemsets and improves performance through sorting, indexing and trimming.
SIT Algorithm (2/5)
• Sorting
  (1) Start from the original transaction database.
  (2) Count how often each item occurs.
  (3) Sort the items by count in increasing order and build a mapping table.
  (4) Translate the items into their mapping numbers.
  (5) Re-sort the items within each transaction.
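The five sorting steps can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sort_and_map`, the 1-based mapping numbers, and the tie-breaking rule (equal counts ordered by item name) are assumptions.

```python
from collections import Counter


def sort_and_map(transactions):
    """Apply the sorting steps to a list of transactions.

    Returns (mapping, counts, mapped_transactions), where mapping assigns
    number 1 to the least frequent item and mapped_transactions holds each
    transaction as an ascending list of mapping numbers.
    """
    # Step 2: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    # Step 3: order items by increasing count (ties by item, an assumption)
    # and build the mapping table.
    ordered = sorted(counts, key=lambda item: (counts[item], item))
    mapping = {item: number for number, item in enumerate(ordered, start=1)}
    # Steps 4 and 5: translate items to mapping numbers, then re-sort
    # each transaction by those numbers.
    mapped = [sorted(mapping[item] for item in set(t)) for t in transactions]
    return mapping, counts, mapped


if __name__ == "__main__":
    db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
    mapping, counts, mapped = sort_and_map(db)
    print(mapping)
    print(mapped)
```

After this step, infrequent items carry the smallest numbers and sit at the front of every transaction, which is what makes the later trimming step a matter of choosing a start position.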
SIT Algorithm (3/5)
• Indexing
[Figure: Apriori versus indexing with an index table; panel labels: "Apriori", "Indexing", "Index Table", "Comparing count=69".]
SIT Algorithm (4/5)
• Trimming
  • If the minimum support is 3, all items with frequency less than 3 are trimmed.
  • To preserve the data, physical trimming is avoided.
  • We only record the starting position and generate the hash table from that position onward.
[Figure: L1]
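One way to realize this logical trimming is sketched below. It assumes the items have already been renumbered in increasing order of frequency, so that mapped item k has the k-th smallest count; the function names `trim_start` and `start_offset` are hypothetical, and the binary search is a convenience of this sketch rather than something stated in the slides.

```python
from bisect import bisect_left


def trim_start(sorted_counts, min_support):
    """Return the first mapped item number whose count reaches min_support.

    sorted_counts[k] is the count of mapped item k + 1, in nondecreasing order.
    """
    return bisect_left(sorted_counts, min_support) + 1


def start_offset(transaction, start):
    """Return the position in a sorted mapped transaction where frequent items begin.

    Nothing is deleted; callers simply read from this offset onward.
    """
    return bisect_left(transaction, start)


if __name__ == "__main__":
    sorted_counts = [1, 2, 3, 3, 3]          # counts of mapped items 1..5
    start = trim_start(sorted_counts, min_support=3)
    for t in [[1, 2, 4], [3, 4, 5], [2, 3, 4, 5], [3, 5]]:
        offset = start_offset(t, start)
        # The full transaction is still stored; only the tail is processed.
        print(t, "-> process", t[offset:])
```

Keeping the data intact means a different minimum support can be applied later by recomputing only the start position.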
SIT Algorithm (5/5)
• The process of the SIT algorithm
  • To find L1 and L2:
    • Apply the sorting, indexing and trimming techniques to the original database.
    • Apply the MPIP algorithm to find L1 and L2.
  • To find the frequent k-itemsets for k > 2:
    • Apply the Apriori algorithm to the database that has been sorted, indexed and trimmed.
    • Find the frequent itemsets.
Experiment and Evaluation (1/2)
• The experiments focus on two parts:
  • Performance of Apriori, SI+Apriori, MPIP, and SIT.
  • Performance of SIT and MPIP under different transaction quantities and lengths.
[Figure: performance of Apriori, SI+Apriori, MPIP, and SIT.]
Experiment and Evaluation (2/2)
• Performance of SIT and MPIP under different transaction quantities and lengths.
• The time for pre-sorting and pre-indexing is taken into account in SIT2.
Conclusion and Future Work
• SIT reduces the number of candidate itemsets and avoids generating unpromising candidate itemsets.
• SIT performs better than Apriori, DHP and MPIP.
• Some problems still need to be dealt with:
  • When the data set grows, sorting and indexing must be repeated before association rules can be mined.
  • Mapping items to their corresponding index numbers is time-consuming for long transactions.