MINING ASSOCIATION RULES FROM LARGE DATABASES USING THE LATTICE-BASED APPROACH AND HYBRID SEARCH METHOD
Arif Djunaidy, Rully Soelaiman, Daning Tyaspamadya
Faculty of Information Technology, ITS - Surabaya
Background - 1
• In data mining, association rules represent relationships that may exist among items in a transactional database
• Because association rules can reflect customers’ behavior, identifying the frequent itemsets and forming the conditional implication rules among items are of paramount importance
• Efficient algorithms that reduce the overhead of mining meaningful association rules are therefore required
• For large databases, however, extracting a set of meaningful association rules may require substantial memory and database scanning, which in turn increase the overall computing time of the mining process
Background - 2
• Discovering all frequent associations in very large databases is challenging
• The search space is exponential in the number of database attributes
• With millions of database objects, minimizing I/O becomes paramount
• Most current approaches are iterative, requiring multiple database scans
• Most approaches use complicated internal data structures, which have poor locality and add space and computation overheads
Key Features of Our Approach
• All frequent itemsets are enumerated via simple “tid-list” intersections
• A lattice-theoretic approach decomposes the original search space (lattice) into smaller pieces (sub-lattices) that can be processed independently and more easily
• A hybrid search strategy enumerates the frequent itemsets within each sub-lattice
• The approach is designed to need only a few database scans, minimizing I/O costs
Problem Statement - 1
• An association rule can be written as A → B, where
  • A is an itemset called the antecedent or left-hand side (LHS), and
  • B is an itemset called the consequent or right-hand side (RHS)
• The association mining task is to discover a set of association rules among a large number of objects in a given database
Problem Statement - 2
• The fundamental task of association rule mining is to generate all association rules X → Y (X, Y are itemsets) that can be extracted from the database. These rules must satisfy both the support and confidence constraints
  • Support constraint: Sup(X ∪ Y) ≥ MinSup
  • Confidence constraint: Sup(X ∪ Y) / Sup(X) ≥ MinCof
• Sup(X), the support of an itemset X, is the number of transactions that contain X as a subset
• An itemset is frequent if its support is at least the minimum support (MinSup) supplied by the user
• The confidence factor represents the conditional probability that a transaction contains Y, given that it contains X
• An association rule is confident if its confidence factor is at least the minimum confidence (MinCof) supplied by the user
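The support and confidence constraints above can be sketched in a few lines of Python. This is only an illustration: the toy database, the function names (`support`, `confidence`, `is_valid_rule`), and the choice to pass MinSup as a fraction of the database size are assumptions, not taken from the slides.

```python
from typing import List, Set

# Hypothetical toy database: one set of purchased items per transaction.
transactions: List[Set[str]] = [
    {"cereal", "milk", "tea", "bread"},
    {"milk", "eggs", "bread"},
    {"cereal", "milk", "tea", "bread"},
    {"cereal", "milk", "eggs", "bread"},
    {"cereal", "milk", "eggs", "tea", "bread"},
    {"milk", "eggs", "tea"},
]


def support(itemset: Set[str], db: List[Set[str]]) -> int:
    """Sup(X): number of transactions that contain X as a subset."""
    return sum(1 for t in db if itemset <= t)


def confidence(x: Set[str], y: Set[str], db: List[Set[str]]) -> float:
    """Sup(X ∪ Y) / Sup(X); defined here as 0.0 when X never occurs."""
    sup_x = support(x, db)
    return support(x | y, db) / sup_x if sup_x else 0.0


def is_valid_rule(x: Set[str], y: Set[str], db: List[Set[str]],
                  min_sup: float, min_conf: float) -> bool:
    """Check both constraints; min_sup is a fraction of the database size."""
    relative_support = support(x | y, db) / len(db)
    return relative_support >= min_sup and confidence(x, y, db) >= min_conf


# cereal -> bread holds at MinSup = 50%, MinCof = 100%; bread -> cereal does not.
print(is_valid_rule({"cereal"}, {"bread"}, transactions, 0.5, 1.0))
print(is_valid_rule({"bread"}, {"cereal"}, transactions, 0.5, 1.0))
```

In this toy data, every transaction with cereal also contains bread, so cereal → bread has confidence 1.0, while bread → cereal reaches only 0.8 and fails a 100% confidence threshold.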
Simple Example - 1
• Consider the sales database of a food store, where the objects represent customers and the itemsets represent the food products they buy
• The discovered patterns are the sets of food items frequently bought together by customers
• An example pattern could be: “60 percent of the customers who buy cereal also buy milk”
• The store can then use this knowledge for shelf placement, stock control, etc.
• Other potential application areas for association rules include catalog design, customer segmentation, and store layout
Simple Example - 2
• MinSup = 50%, MinCof = 100%
The Lattice-Based Approach - 1
• We use the lattice-theoretic approach to:
  • Identify all frequent itemsets
  • Count the support of association rules
• Prerequisite: construct the “tid-list” of each item from the transaction database
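A minimal sketch of the tid-list construction, assuming the database is available as a mapping from transaction id to the set of items in that transaction. The toy data and the name `build_tidlists` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical toy database keyed by transaction id (tid).
transactions: Dict[int, Set[str]] = {
    1: {"cereal", "milk", "tea", "bread"},
    2: {"milk", "eggs", "bread"},
    3: {"cereal", "milk", "tea", "bread"},
    4: {"cereal", "milk", "eggs", "bread"},
    5: {"cereal", "milk", "eggs", "tea", "bread"},
    6: {"milk", "eggs", "tea"},
}


def build_tidlists(db: Dict[int, Set[str]]) -> Dict[str, Set[int]]:
    """Invert the database in one pass: item -> set of tids containing it."""
    tidlists: Dict[str, Set[int]] = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


tidlists = build_tidlists(transactions)
for item in sorted(tidlists):
    print(item, sorted(tidlists[item]))
```

The inversion reads each transaction once, which is why this vertical layout needs only a single pass over the database before any itemset is counted.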
The Lattice-Based Approach - 2
• Construct the “powerset” lattice P(I)
[Figure: powerset lattice P(I) with the maximal frequent itemsets marked, MinSup = 50%]
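To make the powerset lattice concrete, the following brute-force sketch enumerates every non-empty subset of the item set I, keeps the frequent ones, and picks out the maximal frequent itemsets (those with no frequent proper superset). The toy data and the absolute threshold `min_count = 3` (50% of six transactions) are assumptions for illustration. Exhaustive enumeration is exponential in the number of items, which is the cost the lattice decomposition is meant to avoid.

```python
from itertools import combinations
from typing import FrozenSet, List, Set

# Hypothetical toy database (same shape as a small food-store example).
transactions: List[Set[str]] = [
    {"cereal", "milk", "tea", "bread"},
    {"milk", "eggs", "bread"},
    {"cereal", "milk", "tea", "bread"},
    {"cereal", "milk", "eggs", "bread"},
    {"cereal", "milk", "eggs", "tea", "bread"},
    {"milk", "eggs", "tea"},
]
min_count = 3  # assumed: MinSup = 50% of 6 transactions

items = sorted(set().union(*transactions))


def support(itemset: FrozenSet[str]) -> int:
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)


# All non-empty nodes of the powerset lattice P(I): 2^|I| - 1 itemsets.
lattice = [frozenset(c)
           for k in range(1, len(items) + 1)
           for c in combinations(items, k)]

frequent = {s for s in lattice if support(s) >= min_count}

# Maximal frequent itemsets: no frequent proper superset exists.
maximal = {s for s in frequent if not any(s < other for other in frequent)}

print(len(lattice), len(frequent))
for s in sorted(maximal, key=sorted):
    print(sorted(s))
```

Every frequent itemset is a subset of some maximal frequent itemset, so the maximal ones form the upper border of the frequent region of the lattice.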
The Lattice-Based Approach - 3
• Compute the support of itemsets via tid-list intersections
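A short sketch of support counting by tid-list intersection, assuming the per-item tid-lists are already in memory. The tid-lists below are hypothetical toy data, and `tidlist_of` and `support_count` are illustrative names.

```python
from typing import Dict, Iterable, Set

# Hypothetical tid-lists: item -> ids of the transactions that contain it.
tidlists: Dict[str, Set[int]] = {
    "cereal": {1, 3, 4, 5},
    "milk": {1, 2, 3, 4, 5, 6},
    "eggs": {2, 4, 5, 6},
    "tea": {1, 3, 5, 6},
    "bread": {1, 2, 3, 4, 5},
}


def tidlist_of(itemset: Iterable[str], tidlists: Dict[str, Set[int]]) -> Set[int]:
    """Tid-list of a non-empty itemset: intersection of its items' tid-lists."""
    return set.intersection(*(tidlists[item] for item in itemset))


def support_count(itemset: Iterable[str], tidlists: Dict[str, Set[int]]) -> int:
    """Support is the length of the intersected tid-list."""
    return len(tidlist_of(itemset, tidlists))


print(sorted(tidlist_of({"cereal", "tea"}, tidlists)))   # [1, 3, 5]
print(support_count({"eggs", "tea"}, tidlists))          # 2
```

No pass over the original transactions is needed here: the support of any itemset follows from set intersections alone.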
Hybrid Search for Freq. Itemsets - 1
• Hybrid search is used to quickly enumerate all frequent itemsets
• It combines the top-down and bottom-up search strategies and is based on the intuition that the greater the support of a frequent itemset, the more likely it is to be part of a longer frequent itemset
• The hybrid approach has two main steps:
  • An initial phase that rearranges the atoms, and
  • The hybrid process itself, which generates all frequent itemsets by recursion until no more frequent itemsets can be generated
Hybrid Search for Freq. Itemsets - 2
• The first step sorts the atoms in descending order of their supports
• The second step intersects the atoms one at a time
• The intersections start from the pair of atoms with the largest supports, producing longer and longer frequent itemsets
• This phase stops when an extension becomes infrequent (i.e., it no longer satisfies the minimum support requirement)
• The bottom-up phase is then entered
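The two steps described above can be sketched as follows. This is a simplified reading of the slides, not the authors' implementation: atoms are taken to be single items rather than the atoms of a sub-lattice, the bottom-up phase is an ordinary recursive tid-list intersection, and the long itemset found by the first phase is used only to accept its subsets without a support test. The name `hybrid_search`, the toy tid-lists, and `min_count = 3` are assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical tid-lists: item -> ids of the transactions that contain it.
tidlists: Dict[str, Set[int]] = {
    "cereal": {1, 3, 4, 5},
    "milk": {1, 2, 3, 4, 5, 6},
    "eggs": {2, 4, 5, 6},
    "tea": {1, 3, 5, 6},
    "bread": {1, 2, 3, 4, 5},
}


def hybrid_search(tidlists: Dict[str, Set[int]],
                  min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every frequent itemset with its support count."""
    # Step 1: keep frequent atoms, sorted by descending support.
    atoms = sorted(
        (a for a, tids in tidlists.items() if len(tids) >= min_count),
        key=lambda a: (-len(tidlists[a]), a),
    )
    if not atoms:
        return {}

    # Step 2a: extend one itemset atom by atom until an extension is infrequent.
    long_items = [atoms[0]]
    long_tids = set(tidlists[atoms[0]])
    for atom in atoms[1:]:
        candidate = long_tids & tidlists[atom]
        if len(candidate) < min_count:
            break
        long_items.append(atom)
        long_tids = candidate
    long_set = frozenset(long_items)

    # Step 2b: bottom-up recursion over classes of itemsets sharing a prefix.
    frequent: Dict[FrozenSet[str], int] = {}

    def bottom_up(members: List[Tuple[FrozenSet[str], Set[int]]]) -> None:
        for i, (itemset, tids) in enumerate(members):
            frequent[itemset] = len(tids)
            next_class = []
            for other, other_tids in members[i + 1:]:
                union = itemset | other
                common = tids & other_tids
                # Subsets of long_set are frequent by downward closure,
                # so only other candidates need the support test.
                if union <= long_set or len(common) >= min_count:
                    next_class.append((union, common))
            if next_class:
                bottom_up(next_class)

    bottom_up([(frozenset([a]), set(tidlists[a])) for a in atoms])
    return frequent


frequent = hybrid_search(tidlists, min_count=3)
print(len(frequent))
```

On this toy data the first phase grows milk, then {milk, bread}, then {milk, bread, cereal}, and stops when adding eggs drops the support to 2. The bottom-up phase then finds the remaining frequent itemsets, such as those containing tea or eggs.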
Hybrid Search for Freq. Itemsets - 3
[Figure: hybrid search example with the infrequent itemsets marked, MinSup = 50%]
Test Data
• Statistics of Test Data
Experimental Results - 1
• Number of k-itemsets
Experimental Results - 2
• Number of Association Rules
Experimental Results - 3
• Computing Time
Experimental Results - 4
• Support Counting Performance
Experimental Results - 5
• Comparison Results
Conclusions
• Experimental results show that the lattice-based approach combined with the hybrid search method reduces computing time compared with both apriori-based algorithms and a similar lattice-based approach that uses the bottom-up search strategy
• Another advantage of the lattice-based algorithm concerns the time spent scanning the database: it requires only a single database scan, so the I/O overhead is kept to a minimum
• Substantial computing time is still required for large databases. Although the lattice-based approach is relatively powerful, other computing methodologies, such as parallel algorithms in distributed computing environments, need to be considered to address the computing speed problem