Mining Association Rules with Constraints Wei Ning Joon Wong COSC 6412 Presentation
Outline • Introduction • Summary of Approach • Algorithm CAP • Performance Analysis • Conclusion • References
Introduction • Recall mining association rules • Association rule mining finds interesting associations or correlations among a large set of data items.
Problems encountered when mining association rules • Overwhelming number of rules? • Not what you want? • Waiting too long? • Lack of focus
Introduction (cont.) • Example: Walmart • Suppose a manager wants to find out which shoes are the most popular in winter.
Mining frequent itemsets vs. Mining association rules • Mining frequent itemsets is the core step of mining association rules: once the frequent itemsets are known, generating the rules from them is straightforward
Constrained Mining • A naive solution • First find all frequent sets, and then test them for constraint satisfaction • Our approach: • Analyze the properties of constraints comprehensively • Push them as deeply as possible inside the frequent pattern computation.
Frequent Itemsets & Constraints • Given a transaction database TDB (min_sup = 2) • Frequent itemset: a set of items that appears frequently in transactions, e.g. {a, c} • Constraint: a predicate over itemsets • C(I): sum(I) > 50 • C(abd) = true
Mining Frequent Itemsets With Constraints • Given • A transaction database TDB • A support threshold min_sup • A constraint C • Find the complete set of frequent itemsets satisfying the constraint • Use constraint to • Express user’s focus • Improve both effectiveness and efficiency
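The problem statement above, solved the naive way (find all frequent sets first, then test the constraint), can be sketched in a few lines of Python. This is only an illustration: the transactions in `TDB`, the item values in `VALUE`, and the function names are assumptions made up for the example, not data from the slides.

```python
from itertools import combinations

def frequent_itemsets(tdb, min_sup):
    """Brute force: return {itemset: support} for every itemset with support >= min_sup."""
    items = sorted(set().union(*tdb))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in tdb if set(cand) <= t)
            if support >= min_sup:
                result[frozenset(cand)] = support
    return result

def naive_constrained(tdb, min_sup, constraint):
    """Naive solution: mine all frequent sets, then keep those satisfying the constraint."""
    return {s: n for s, n in frequent_itemsets(tdb, min_sup).items() if constraint(s)}

# Hypothetical transaction database and item values (assumptions for illustration).
TDB = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c"}]
VALUE = {"a": 30, "b": 10, "c": 25, "d": 40}

# Constraint C(I): sum(I) > 50, with min_sup = 2.
answer = naive_constrained(TDB, 2, lambda s: sum(VALUE[i] for i in s) > 50)
print(answer)
```

On this toy data five itemsets are frequent, but only {a, c} also satisfies the constraint. Every frequent set is counted before the constraint is consulted, which is the wasted work that pushing constraints into the mining loop tries to avoid.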
Classification of Constraints • We have the following classification of constraints • Anti-monotone • Monotone • Succinct • Convertible • Convertible anti-monotone • Convertible monotone • Strongly convertible • Inconvertible
Anti-Monotone • Definition 1 (Anti-Monotone): A 1-var constraint C is anti-monotone if for all sets S, S′: S ⊇ S′ & S satisfies C ⇒ S′ satisfies C. • Simply put, when an itemset S violates the constraint, so does every superset of S
Is min(S) ≥ v anti-monotone? • S = {5, 10, 14}, v = 7 • For min(S) ≥ 7, {5} violates it • Supersets of {5}: {5, 10}, {5, 14}, {5, 10, 14} • These violate it too, since each still has minimum 5 • Hence min(S) ≥ v is anti-monotone
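The example can be checked mechanically. The sketch below assumes the constraint reads min(S) ≥ v and tests the anti-monotone property by brute force over all non-empty subsets of one small universe. The helper names are hypothetical, and passing the check on one universe illustrates the property but does not prove it.

```python
from itertools import combinations

def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def is_anti_monotone_on(universe, constraint):
    """True if no violating set has a satisfying superset within the universe."""
    sets = list(nonempty_subsets(universe))
    for s in sets:
        if not constraint(s):
            for t in sets:
                if s <= t and constraint(t):
                    return False  # a superset of a violating set satisfies C
    return True

S = {5, 10, 14}
v = 7
print(is_anti_monotone_on(S, lambda x: min(x) >= v))  # True
print(is_anti_monotone_on(S, lambda x: min(x) <= v))  # False
```

The second call shows the contrast: with min(S) ≤ v, the set {10} violates the constraint while its superset {5, 10} satisfies it, so that constraint is not anti-monotone.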
Succinct • Definition 2 (Succinct) • I ⊆ Item is a succinct set if it can be expressed as σ_p(Item) for some selection predicate p. • SP ⊆ 2^Item is a succinct powerset if there is a fixed number of succinct sets Item1, …, Itemk ⊆ Item such that SP can be expressed in terms of the strict powersets of Item1, …, Itemk, using union and minus. • Finally, a 1-var constraint C is succinct provided SAT_C(Item) is a succinct powerset.
Succinct • General idea: we can enumerate all and only those sets that are guaranteed to satisfy the constraint. • If a constraint is succinct, we can directly generate precisely the sets that satisfy it.
Succinct example • Itemset containing a or b • Itemset containing some item with value more than 30
Succinct example • C1: Item.Price ≥ 100 • Item1 = σ_{Item.Price ≥ 100}(Item) = {a, b} • 2^Item1 = {{a}, {b}, {a, b}} • SAT_C1 = {{a}, {b}, {a, b}} • SAT_C1 = 2^Item1 • C1 is succinct
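A small sketch of what "directly generate precisely the sets that satisfy it" can look like. It assumes the constraint is that every item in the set has price ≥ 100. The prices in `PRICE` are hypothetical, chosen so that the selection yields {a, b} as on the slide, and the helper names are made up for the example.

```python
from itertools import combinations

def select(items, predicate):
    """Selection: the items that satisfy the predicate (a succinct set)."""
    return {i for i in items if predicate(i)}

def strict_powerset(items):
    """All non-empty subsets of items."""
    items = sorted(items)
    return {frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)}

# Hypothetical prices (assumption for illustration).
PRICE = {"a": 120, "b": 150, "c": 60, "d": 30}

# Generate the satisfying sets directly, without testing any other set.
item1 = select(PRICE, lambda i: PRICE[i] >= 100)
sat_c1 = strict_powerset(item1)

# Cross-check by generate-and-test over every non-empty subset of all items.
brute = {s for s in strict_powerset(PRICE) if all(PRICE[i] >= 100 for i in s)}
print(sorted(sorted(s) for s in sat_c1))
print(sat_c1 == brute)  # True
```

The direct route enumerates 3 sets, while the generate-and-test route has to examine all 15 non-empty subsets to reach the same answer.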
Convertible • Convert tough constraints into anti-monotone or monotone ones by properly ordering the items
Convertible • Definition: • R is an order of items • Convertible anti-monotone • If itemset X satisfies the constraint, so does every prefix of X w.r.t. R
Convertible example • Constraint C: avg(X) ≥ 25 • Order items in value-descending order • <a, f, g, d, b, h, c, e> • Itemset afd satisfies C • So do its prefixes a and af • Thus, w.r.t. this order, C becomes anti-monotone!
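The prefix property can be illustrated with a short sketch. The slide does not list the item values, so the numbers in `VALUE` are hypothetical, chosen only so that value-descending order reproduces <a, f, g, d, b, h, c, e>. The constraint is assumed to be avg(X) ≥ 25.

```python
# Hypothetical item values (assumption): descending order gives a, f, g, d, b, h, c, e.
VALUE = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20, "e": -30}

# R: items in value-descending order.
ORDER = sorted(VALUE, key=VALUE.get, reverse=True)

def avg(items):
    """Average value of a non-empty collection of items."""
    return sum(VALUE[i] for i in items) / len(items)

def prefixes(itemset, order):
    """Prefixes of the itemset once its items are listed in the given order."""
    seq = [i for i in order if i in itemset]
    return [seq[:k] for k in range(1, len(seq) + 1)]

def satisfies(items):
    """Constraint C: avg(X) >= 25."""
    return avg(items) >= 25

afd_prefixes = prefixes({"a", "f", "d"}, ORDER)
print(afd_prefixes)                             # [['a'], ['a', 'f'], ['a', 'f', 'd']]
print(all(satisfies(p) for p in afd_prefixes))  # True
print(satisfies(["d"]))                         # False
```

The last line is why the ordering matters: {d} is a subset of afd that violates C, so C is not anti-monotone over arbitrary subsets. It only behaves anti-monotonically on prefixes taken in the order R.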
Optional Proof that min(S) ≥ v is Anti-monotone • According to the table, min(S) ≥ v is both anti-monotone and succinct. • Only anti-monotonicity is proved here, due to time limitations. • Something special…
Constraint Classification • [Figure: diagram relating the constraint classes: monotone, anti-monotone, succinct, strongly convertible, convertible anti-monotone, convertible monotone, inconvertible]
Summary of Approach: Recapitulation • Basic idea of mining frequent itemsets with constraints. • Several important classes of constraints were introduced.
Algorithms • There are many algorithms for constraint-based association rule mining. • Algorithm Direct • Algorithm MultiJoins & Reorder • Algorithm Apriori† • Algorithm Hybrid(m) • Algorithm CAP (Main Focus)
Design of Algorithm • Sound • An algorithm is sound provided it only finds frequent sets that satisfy the given constraints. • Complete • An algorithm is complete provided all frequent sets satisfying the given constraints are found.
Algorithm Apriori† • Main idea : Use Apriori Algorithm to get the frequent item sets. Then apply the constraints on the item sets found. • Step 1) Apriori with Cfreq • Step 2) Apply C – Cfreq to get final Ans
Algorithm Apriori† (Pseudocode)
1. C1 consists of sets of size 1; k = 1; Ans = ∅;
2. while (Ck not empty) {
   2.1 conduct db scan to form Lk from Ck;
   2.2 form Ck+1 from Lk based on Cfreq; k++; }
3. for each set S in some Lk: add S to Ans if S satisfies (C – Cfreq).
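The pseudocode above maps fairly directly onto Python. The following is a minimal sketch, not the authors' implementation: the function names, the four-transaction database `TDB`, and the constraint "all items lie in {A, C, E}" are assumptions chosen for illustration.

```python
from itertools import combinations

def apriori_gen(l_k, k):
    """Form C(k+1): sets of size k+1 whose size-k subsets are all in Lk."""
    cands = set()
    for a in l_k:
        for b in l_k:
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in l_k for s in combinations(u, k)):
                cands.add(u)
    return cands

def apriori_dagger(tdb, min_sup, constraint):
    """Level-wise frequent-set mining; the constraint is applied only at the end."""
    c_k = {frozenset([i]) for t in tdb for i in t}   # step 1: C1
    k, levels = 1, []
    while c_k:                                       # step 2
        # 2.1: scan the database to form Lk from Ck
        l_k = {c for c in c_k if sum(1 for t in tdb if c <= t) >= min_sup}
        levels.append(l_k)
        c_k = apriori_gen(l_k, k)                    # 2.2: Ck+1 from Lk
        k += 1
    # step 3: keep the frequent sets that satisfy the remaining constraint
    return {s for l_k in levels for s in l_k if constraint(s)}

# Hypothetical database and constraint (assumptions for illustration).
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
ans = apriori_dagger(TDB, 2, lambda s: s <= {"A", "C", "E"})
print(sorted(sorted(s) for s in ans))
```

On this toy data the loop also counts sets such as {B, E} and {B, C, E}, which are frequent but are discarded in the final step. That discarded work is what CAP aims to avoid.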
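The Apriori† Algorithm: An Example • [Figure: level-wise run of Apriori† on database TDB. 1st scan: C1 → L1; 2nd scan: C2 → L2; 3rd scan: C3 → L3]
The Apriori† Algorithm: An Example (cont.) • Constraint: {A, C, E} ⊇ T.Item • [Figure: the constraint applied to L1, L2, and L3 found from database TDB]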
Algorithm CAP • Succinct and anti-monotone • Strategy I: Replace C1 in the Apriori algorithm by C1^C. • Anti-monotone but non-succinct • Strategy II: Define Ck as in the Apriori algorithm. Drop a set S ∈ Ck from counting if S fails C, i.e., constraint satisfaction is tested before counting is done.
Algorithm CAP (cont.) • Succinct but non-anti-monotone • Strategy III: Too complicated. To be discussed later… • Non-succinct & non-anti-monotone • Strategy IV: Induce any weaker constraint C1 from C. Depending on whether C1 is anti-monotone and/or succinct, use one of Strategies I-III above for the generation of frequent sets.
Algorithm CAP (Pseudocode)
1. if Csam ∪ Csuc ∪ Cnone is non-empty, prepare C1 as indicated in Strategies I, III, and IV; k = 1;
2. if Csuc is non-empty {
   2.1 conduct db scan to form L1 as indicated in Strategy III;
   2.2 form C2 as indicated in Strategy III; k = 2; }
3. while (Ck not empty) {
   3.1 conduct db scan to form Lk from Ck;
   3.2 form Ck+1 from Lk based on Strategy III if Csuc is non-empty, and Strategy II for constraints in Cam; }
4. if Cnone is empty, Ans = ∪ Lk. Otherwise, for each set S in some Lk, add S to Ans iff S satisfies Cnone.
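Strategies I and II are simple enough to sketch in Python. This is a partial illustration only: Strategies III and IV are left out, the parameter names `item_ok` and `am_ok` are hypothetical, and the database `TDB` is made-up toy data. The function also returns how many candidate sets were counted against the database, to make the pruning visible.

```python
from itertools import combinations

def apriori_gen(l_k, k):
    """Form C(k+1): sets of size k+1 whose size-k subsets are all in Lk."""
    cands = set()
    for a in l_k:
        for b in l_k:
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in l_k for s in combinations(u, k)):
                cands.add(u)
    return cands

def cap(tdb, min_sup, item_ok=None, am_ok=None):
    """Sketch of CAP restricted to Strategies I and II.

    item_ok: per-item predicate for a succinct, anti-monotone constraint
             (Strategy I: only qualifying items enter C1).
    am_ok:   set predicate for an anti-monotone constraint
             (Strategy II: test candidates before counting them).
    Returns (answer sets, number of candidates counted).
    """
    c_k = {frozenset([i]) for t in tdb for i in t
           if item_ok is None or item_ok(i)}          # Strategy I
    k, answer, counted = 1, set(), 0
    while c_k:
        if am_ok is not None:
            c_k = {c for c in c_k if am_ok(c)}        # Strategy II
        counted += len(c_k)
        l_k = {c for c in c_k if sum(1 for t in tdb if c <= t) >= min_sup}
        answer |= l_k
        c_k = apriori_gen(l_k, k)
        k += 1
    return answer, counted

# Hypothetical database (assumption); constraint: all items lie in {A, C, E}.
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
ans, n_counted = cap(TDB, 2, item_ok=lambda i: i in {"A", "C", "E"})
_, n_unconstrained = cap(TDB, 2)
print(sorted(sorted(s) for s in ans), n_counted, n_unconstrained)
```

With the constraint pushed into C1, 6 candidates are counted instead of 12 on this toy data, and the candidate {A, C, E} is never generated because {A, E} turns out to be infrequent at level 2.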
The Algorithm CAP: An Example • Constraint: {A, C, E} ⊇ T.Item, with min support count = 2 • Question: Which strategy should we apply? • [Figure: database TDB]
The Algorithm CAP: An Example (cont.) • Apply Strategy I! • [Figure: run of CAP on database TDB. 1st scan: C1 → L1; C2 formed from L1; 2nd scan: C2 → L2; C3] • Note on C3: {A, E} is pruned earlier
Case 3: Succinct but not anti-monotone. Revisit… • Constraint: min(S) < 5 • Items: {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} • Items satisfying the constraint: {1} {2} {3} {4} • Apriori on these items alone: {1} {2} {3} {4}, {1,2} {2,3} … {3,4}, …, {1,2,3,4} • Some possibly frequent sets may be lost, e.g. {1,8}, {1,2,10} • **Information extracted from a past presentation.
Case 3: Succinct but not anti-monotone. Continue… • Algorithm Direct • Idea: Play it safe. Generate C^c_{k+1} by using L^c_k × F, where F is the set of all frequent items. • Algorithm MultiJoins • Algorithm Reorder
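The "play it safe" idea can be sketched as follows, assuming the constraint min(S) < 5 from the previous slide. The function name `direct`, the database `TDB`, and the details of the loop are assumptions for illustration, not the published algorithm. The sketch relies on the fact that, for this particular constraint, every valid set of size k+1 contains a valid set of size k (drop any item other than the minimum).

```python
def direct(tdb, min_sup, constraint):
    """Grow valid frequent sets by one frequent item at a time (Lc_k x F)."""
    def support(s):
        return sum(1 for t in tdb if s <= t)

    # F: all frequent items, whether or not they satisfy the constraint.
    f = {i for t in tdb for i in t if support(frozenset([i])) >= min_sup}
    # Lc_1: frequent single items that satisfy the constraint.
    l_c = {frozenset([i]) for i in f if constraint(frozenset([i]))}
    answer = set()
    while l_c:
        answer |= l_c
        # Cc_(k+1): extend every valid frequent k-set with any frequent item.
        cands = {s | {i} for s in l_c for i in f if i not in s}
        l_c = {c for c in cands if constraint(c) and support(c) >= min_sup}
    return answer

# Hypothetical database over items 1..10 (assumption for illustration).
TDB = [{1, 8}, {1, 8, 9}, {1, 2, 10}, {1, 2, 10}, {8, 9}]
ans = direct(TDB, 2, lambda s: min(s) < 5)
print(sorted(sorted(s) for s in ans))
```

Because candidates are extended with every frequent item rather than only with items below 5, sets such as {1, 8} and {1, 2, 10} are found. The price is a larger candidate set at each level.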
Performance Analysis (Specification) • Programs written in C • Generate transactional databases using program from IBM Almaden Research Center • 100,000 records, domain of 1,000 items • Page size 4KB • SPARC-10 environment
Performance Analysis (Terminology) • Speedup • Ratio of the execution times of two algorithms. • Item Selectivity • x% of the items satisfy the constraints. • Support Threshold • A low support threshold means more frequent sets to process.
Performance Analysis • Note: Support threshold set at 0.5%. • For 10% selectivity, CAP runs 80 times faster than Apriori†! • For 30% selectivity, the speedup is about 10 times.
Performance Analysis • Note: Item selectivity fixed at 30%. • As the support threshold goes up, the number of frequent itemsets goes down and Apriori† improves. • CAP is still at least 8 times faster.
Performance Analysis • Each entry is of the form a/b • a is the number of frequent sets satisfying the constraint. • b is the total number of frequent sets. • For L4 with support of 0.2%, Apriori† finds 1250 frequent sets, of which only 8 are found by CAP.
Conclusion • The ideas of anti-monotonicity, succinctness, and convertibility are introduced in the paper. • Sound, complete, and efficient algorithms are introduced for constraint-based association rule mining.
References • R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD’97. • R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD’98. • J. Pei and J. Han. Can we push more constraints into frequent pattern mining? KDD’00.