This paper discusses Toivonen's approach to sampling large databases for association rules. It covers the algorithm, analysis, and experimental results.
Sampling Large Databases for Association Rules (Toivonen’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007
Outline • Introduction • Preliminaries • Definitions, and Problem Statement • Two General Approaches • Sampling Method for Mining Association Rules • The algorithm • Analysis • Experimental Results
Introduction • Problem: Discovery of Association Rules • Domain: Very Large Databases • Bottleneck: Time • Main Memory Processing: Negligible • Disk I/O: The Dominant Factor • Suggestion: Minimize the Number of Scans of the Database • Goal: Only One Full Pass Over the Database
Introduction (Cont’d): Overview of Toivonen’s Method Main Steps: • Pick a random sample from the database. • Use the sample to determine all probable association rules. • Verify the results with the rest of the database, i.e., eliminate incorrectly detected association rules and add missing ones. The Main Contribution: To show that all exact frequencies can be found efficiently by first analyzing a random sample and then the whole database with the proposed method.
Preliminaries • Items • I = {I1, I2, …, Im} • Transactions • r = {t1, t2, …, tn}, tj ⊆ I • Support of an itemset • Percentage of transactions which contain that itemset. • Frequent Itemsets • Association Rules • Strong Association Rules
Preliminaries • Association Rule: implication X ⇒ Y where X, Y ⊂ I and X ∩ Y = Ø • Support of Association Rule X ⇒ Y: Percentage of transactions that contain X ∪ Y • Confidence of Association Rule X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X • Problem: Find the strong association rules over a given item set I, i.e., the rules whose support is at least the threshold min_fr and whose confidence is at least min_conf.
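The definitions of support and confidence above can be illustrated with a small sketch. The function names `support` and `confidence` and the toy shopping-basket data are assumptions made for this example only; they do not come from the slides.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Confidence of the rule X => Y: support(X ∪ Y) / support(X)."""
    sx = support(x, transactions)
    if sx == 0:
        raise ValueError("antecedent X never occurs, confidence is undefined")
    return support(frozenset(x) | frozenset(y), transactions) / sx


# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support({"bread", "milk"}, transactions)          # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 with bread
```

With thresholds such as min_fr = 0.5 and min_conf = 0.6, the rule bread ⇒ milk would count as strong in this toy data.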
Algorithms for Mining Association Rules • Level-wise Algorithms Idea: If a set is not frequent then its supersets cannot be frequent. On level k, candidate itemsets X of size k are generated such that all subsets of X are frequent. • Partition Algorithm Idea: Partition the data into sections small enough to be handled in main memory. First Pass: Find the locally frequent itemsets of each section. Second Pass: Count the union of the locally frequent itemsets over the whole database.
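The level-wise idea can be sketched as follows. This is a minimal in-memory illustration in the style of Apriori, not the exact algorithm from any paper; the names `generate_candidates` and `levelwise`, and the toy data, are assumptions for this example.

```python
from itertools import combinations


def generate_candidates(frequent_k, k):
    """Itemsets of size k+1 all of whose size-k subsets are in `frequent_k`."""
    frequent_k = set(frequent_k)
    items = sorted({i for s in frequent_k for i in s})
    candidates = set()
    for s in frequent_k:
        for i in items:
            if i in s:
                continue
            c = s | {i}  # frozenset of size k+1
            if all(frozenset(sub) in frequent_k for sub in combinations(c, k)):
                candidates.add(c)
    return candidates


def levelwise(transactions, min_fr):
    """Return {itemset: frequency} for every itemset with frequency >= min_fr."""
    n = len(transactions)
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # One counting pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c for c, cnt in counts.items() if cnt / n >= min_fr}
        frequent.update({c: counts[c] / n for c in level})
        current = generate_candidates(level, k)
        k += 1
    return frequent


# Hypothetical toy database.
db = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("B")]
result = levelwise(db, min_fr=0.5)
```

Here {B, C} occurs in only one of four transactions, so {A, B, C} is never generated as a candidate. Note that each level costs one pass over the data, which is the cost the sampling method tries to avoid.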
Sampling for Frequent Sets • Major Steps • Random sampling • Finding the frequent itemsets of the sample • Finding other probable candidates using the concept of Negative Border • Using the rest of the database to check the candidates
Negative Border • All sets that are not in our collection of frequent itemsets but all of whose proper subsets are, i.e., the minimal itemsets not in S, where S is the collection of frequent itemsets • Example (items A, …, F): • S = {{A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}} • Negative border of S = {{B, C}, {B, F}, {D}, {E}}
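The negative border can be computed directly from its definition. The sketch below is one straightforward way to do so, assuming S is closed under subsets (as any collection of frequent itemsets is); the function name `negative_border` is an assumption for this example. It is run on the example from the slide.

```python
def negative_border(S, items):
    """Minimal itemsets over `items` that are not in S.

    Assumes S is downward closed: every non-empty subset of a member
    of S is itself in S.
    """
    S = {frozenset(s) for s in S}
    # A singleton has no non-empty proper subsets, so any singleton
    # missing from S is minimal by definition.
    border = {frozenset([i]) for i in items} - S
    # Every larger border set is some member of S extended by one item.
    for s in S:
        for i in items:
            c = s | {i}
            if len(c) == len(s) or c in S:
                continue
            if all(c - {j} in S for j in c):
                border.add(c)
    return border


S = [{"A"}, {"B"}, {"C"}, {"F"}, {"A", "B"}, {"A", "C"},
     {"A", "F"}, {"C", "F"}, {"A", "C", "F"}]
border = negative_border(S, items="ABCDEF")
```

For instance, {A, B, C} is not in the border because its subset {B, C} is not in S, while {B, C} itself is in the border because {B} and {C} both are.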
Frequent Set Discovery • Intuition: Given a collection S of sets that are frequent, the negative border contains the closest itemsets that could be frequent too. • After finding the collection S of itemsets that are frequent in the sample, we check S and the negative border of S against the database: • Success: If no itemset in the negative border is frequent => We can conclude that all frequent sets have already been found. (Why?) • Decrease the minimum support used on the sample to increase the chance of success. • Failure: If at least one frequent itemset is found in the negative border => Some of its supersets may be frequent too. (Why?) • In the case of failure, we can either report failure and stop, or scan the database again and check the supersets to find the exact result.
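The sample-then-verify procedure can be sketched end to end. This is a simplified in-memory illustration under stated assumptions, not Toivonen's exact pseudocode: the helper names, the `lowered_fr` parameter (the reduced threshold applied to the sample), and the seeded random generator are all choices made for this example.

```python
import random


def frequent_sets(transactions, min_fr):
    """Level-wise search for all itemsets with frequency >= min_fr."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    found = set()
    while level:
        level = {c for c in level
                 if sum(1 for t in transactions if c <= t) / n >= min_fr}
        found |= level
        # Join sets of size k into sets of size k+1, then keep only
        # those whose size-k subsets are all frequent.
        level = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in level if all(c - {i} in found for i in c)}
    return found


def negative_border(S, items):
    """Minimal itemsets not in the downward-closed collection S."""
    border = {frozenset([i]) for i in items} - S
    for s in S:
        for i in items:
            c = s | {i}
            if len(c) > len(s) and c not in S and all(c - {j} in S for j in c):
                border.add(c)
    return border


def sample_and_verify(db, sample_size, min_fr, lowered_fr, seed=0):
    """Return (frequent sets found, failure flag) using one full pass over db."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)
    items = sorted({i for t in db for i in t})

    S = frequent_sets(sample, lowered_fr)   # done in main memory
    border = negative_border(S, items)
    candidates = S | border

    counts = {c: 0 for c in candidates}
    for t in db:                            # the single full pass
        for c in candidates:
            if c <= t:
                counts[c] += 1

    n = len(db)
    frequent = {c for c in candidates if counts[c] / n >= min_fr}
    # A frequent set in the border means some superset may have been missed.
    failure = any(c in frequent for c in border)
    return frequent, failure


db = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("B")]
# Degenerate check: sampling the whole database cannot miss anything.
frequent, failure = sample_and_verify(db, sample_size=len(db),
                                      min_fr=0.5, lowered_fr=0.5)
```

In practice the sample is much smaller than the database and `lowered_fr` is set below `min_fr`, which trades more candidates for a lower chance of failure.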
Failure Handling • In the fraction of cases where a possible failure is reported, all frequent sets can be found by making a second pass over the database: The algorithm simply computes the collection of all sets that could possibly be frequent.
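One way to read "all sets that could possibly be frequent" is sketched below: starting from the sets confirmed frequent in the first pass, keep adding border sets until nothing new appears, and count only those additions in the second pass. Skipping sets already counted and found infrequent is an interpretation made for this example, as are the function and parameter names; the original algorithm may be stated differently.

```python
def second_pass_candidates(frequent, known_infrequent, items):
    """Itemsets that still need counting after a reported failure.

    `frequent`: sets confirmed frequent in the first pass.
    `known_infrequent`: sets counted in the first pass and found infrequent.
    Assumes every singleton was counted in the first pass, so each one
    appears in exactly one of the two collections.
    """
    confirmed = {frozenset(s) for s in frequent}
    rejected = {frozenset(s) for s in known_infrequent}
    S = set(confirmed)
    while True:
        new = set()
        for s in S:
            for i in items:
                c = s | {i}
                if (len(c) > len(s) and c not in S and c not in rejected
                        and all(c - {j} in S for j in c)):
                    new.add(c)
        if not new:
            # Only the newly added sets have unknown frequencies.
            return S - confirmed
        S |= new


# Hypothetical failure: {A, B} was in the negative border of the sample's
# frequent sets but turned out to be frequent in the database.
confirmed = ["A", "B", "C", "AB", "AC", "BC"]
to_count = second_pass_candidates(confirmed, known_infrequent=["D"], items="ABCD")
```

Here {A, B, C} was never counted in the first pass because {A, B} looked infrequent in the sample, so it is the one set the second pass has to check. Supersets of {D} are never generated, since {D} is already known to be infrequent.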
Analysis of Sampling • Sample Size and Probability of Failure
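The slide's formulas did not survive, so the following is a hedged sketch rather than a restatement of them. It assumes the Chernoff-type bound usually given for this analysis: for a single itemset, a random sample of size at least ln(2/δ) / (2ε²) keeps the probability that the sample frequency differs from the true frequency by more than ε at or below δ. The function name `sample_size` is an assumption for this example.

```python
import math


def sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer sample size n with n >= ln(2/delta) / (2 * epsilon**2).

    epsilon: tolerated absolute error in an itemset's frequency.
    delta:   tolerated probability of exceeding that error.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


n = sample_size(epsilon=0.01, delta=0.01)  # 26492 under the assumed bound
```

Under this bound the required sample size does not depend on the size of the database, and it grows much faster when ε is tightened than when δ is.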
Conclusion • Advantages: Reduced failure probability, while keeping the candidate count low enough for memory • Disadvantages: Potentially large number of candidates in the second pass
References [1] H. Toivonen, Sampling Large Databases for Association Rules, Proc. of VLDB Conference, India, 1996.