Mining Confident Rules Without Support Requirements. Ke Wang, Yu He, D. W. Cheung, F. Y. L. Chin
Association Rules • Given a table over A1,…,Am, C • Find all rules {Ai=ai} → C=c with minimum confidence and minimum support • Support: sup(Ai=ai) = #records containing Ai=ai • Confidence: sup(Ai=ai, C=c) / sup(Ai=ai)
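A minimal sketch of these two measures in Python; the helper names (sup, confidence), the toy table, and the class column name "C" are illustrative assumptions, not the paper's code:

def sup(table, conds):
    # Count the records that contain every attribute=value pair in conds.
    return sum(all(row.get(a) == v for a, v in conds.items()) for row in table)

def confidence(table, body, c):
    # conf(body -> C=c) = sup(body, C=c) / sup(body).
    s = sup(table, body)
    return sup(table, {**body, "C": c}) / s if s else 0.0

table = [
    {"Age": "young", "Gender": "M", "C": "yes"},
    {"Age": "young", "Gender": "F", "C": "no"},
    {"Age": "old",   "Gender": "M", "C": "yes"},
]
print(sup(table, {"Age": "young"}))                # 2
print(confidence(table, {"Age": "young"}, "yes"))  # 0.5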
Low Support Rules • Interesting rules may have unknown, low support • High support rules may have low confidence • Often, patterns are fragmented into many low support rules • Goal: find all rules above the minimum confidence, with no minimum support requirement
Confidence-based Pruning • Without minimum support, the classic support-based pruning is inapplicable • Confident rules are neither downward closed nor upward closed • Need new strategies for pushing the confidence requirement
Confidence-based Pruning r1: Age=young → Buy=yes r2: Age=young, Gender=M → Buy=yes r3: Age=young, Gender=F → Buy=yes Observation 1: if r1 is confident, so is at least one of r2 and r3 (Gender partitions the records covered by r1, so one branch has confidence at least r1's) Observation 2: if no specialized rule of r1 is confident, r1 can be pruned
Confidence-based Pruning • Level-wise rule generation: generate a candidate rule x → c only if, for every attribute A not in x → c, some A-specialization of x → c is confident
The algorithm
Input: table T over A1,…,Am, C, and miniconf
Output: all confident rules
1. k = m;
2. Rule_k = all confident m-rules;
3. while k > 1 and Rule_k is not empty do
4.   generate Cand_{k-1} from Rule_k;
5.   compute the confidence of Cand_{k-1} in one pass of T;
6.   Rule_{k-1} = all confident candidates in Cand_{k-1};
7.   k--;
8. return all Rule_k;
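A sketch of this level-wise loop in Python, reusing sup() and confidence() from the earlier snippet; the rule representation (a sorted tuple of attribute/value pairs plus a class value) and the domains dictionary are assumptions made for illustration:

def mine(table, attrs, domains, classes, miniconf):
    # Rule_m: all confident rules over all m attributes (lines 1-2).
    level = {(tuple(sorted((a, row[a]) for a in attrs)), c)
             for row in table for c in classes
             if confidence(table, {a: row[a] for a in attrs}, c) >= miniconf}
    result, k = set(level), len(attrs)
    while k > 1 and level:
        # Cand_{k-1} (line 4): keep x -> c only if, for every attribute A
        # not in x, some A-specialization of x -> c is confident at level k.
        cands = set()
        for body, c in level:
            for dropped, _ in body:
                x = tuple(p for p in body if p[0] != dropped)
                outside = [a for a in attrs if a not in dict(x)]
                if all(any((tuple(sorted(x + ((a, v),))), c) in level
                           for v in domains[a])
                       for a in outside):
                    cands.add((x, c))
        # Lines 5-6: one pass of T computes each candidate's confidence.
        level = {(x, c) for x, c in cands
                 if confidence(table, dict(x), c) >= miniconf}
        result |= level
        k -= 1
    return result

rules = mine(table, ["Age", "Gender"],
             {"Age": ["young", "old"], "Gender": ["M", "F"]},
             ["yes", "no"], miniconf=1.0)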
Disk-based Implementation • Assumption: T, Rule_k, Cand_{k-1} are stored on disk • We focus on: generating Cand_{k-1} from Rule_k, and computing the confidence of Cand_{k-1} • Key: clustering T, Rule_k, Cand_{k-1} according to the attributes Ai
Clustering by Hash Partitioning • h_i is the hash function for attribute Ai, i=1,…,m • Table T is partitioned into T-buckets • Rule_k is partitioned into R-buckets • Cand_{k-1} is partitioned into C-buckets • A bucket-id is the sequence of hash values of the attributes involved, [b1,…,bk]
Pruning by Checking Bucket Ids • A tuple in a T-bucket supports a candidate in a C-bucket only if the T-bucket id matches the C-bucket id • E.g., T-bucket [A1.1, A2.1, A3.2] matches C-buckets [A1.1, A3.2] and [A1.1, A2.1] • A C-bucket [b1,…,bk] is nonempty only if, for every other attribute A, some R-bucket [b1,…,bk, bA] is nonempty
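A small Python sketch of the matching test; representing a bucket id as an attribute-to-hash-value dictionary is an assumption for illustration:

def t_bucket_id(row, attrs, h, n_buckets):
    # T-bucket id: one hash value per attribute of the tuple.
    return {a: h(row[a]) % n_buckets for a in attrs}

def matches(t_id, c_id):
    # A T-bucket matches a C-bucket iff they agree on every
    # attribute that the C-bucket id mentions.
    return all(t_id.get(a) == v for a, v in c_id.items())

t_id = {"A1": 1, "A2": 1, "A3": 2}          # T-bucket [A1.1, A2.1, A3.2]
print(matches(t_id, {"A1": 1, "A3": 2}))    # True:  matches [A1.1, A3.2]
print(matches(t_id, {"A1": 1, "A2": 1}))    # True:  matches [A1.1, A2.1]
print(matches(t_id, {"A2": 2}))             # False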
Hypergraph H_{k-1} • A vertex corresponds to a T-bucket • An edge corresponds to a C-bucket; an edge contains a vertex if and only if the C-bucket matches the T-bucket • H_{k-1} is kept in memory
The Optimal Blocking • Assume that several T-buckets can be read at a time; each such group is called a T-block • For each T-block, we need to access the matching C-buckets from disk • We want the blocking into T-blocks that minimizes the accesses of C-buckets • This problem is NP-hard
Heuristics • Heuristic I: the more T-buckets match a C-bucket, the higher priority those T-buckets should have in the next T-block • Heuristic II: the more C-buckets match a T-bucket, the higher priority that T-bucket should have in the next T-block
[Figure: bipartite graph matching T-buckets T1–T5 to C-buckets C1–C4] • (T1T2T3)(T4T5): C1,C2,C4 read twice, C3 read once • Heuristic I: (T1T2T5)(T3T4): C1,C2,C4 read once, C3 read twice • Heuristic II: (T1T3T5)(T2T4): C1,C4 read twice, C2,C3 read once
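A greedy Python sketch of Heuristic II on this example; the bipartite adjacency below is an assumption, reconstructed to reproduce the read counts listed above, since the original figure is not preserved in the text:

def next_t_block(edges, unread, block_size):
    # Greedily pick the unread T-buckets that match the most C-buckets.
    block, pool = [], set(unread)
    while pool and len(block) < block_size:
        t = max(sorted(pool), key=lambda u: len(edges[u]))
        block.append(t)
        pool.remove(t)
    return block

edges = {  # T-bucket -> matching C-buckets (assumed adjacency)
    "T1": {"C2", "C3"}, "T2": {"C1"}, "T3": {"C3", "C4"},
    "T4": {"C4"}, "T5": {"C1", "C2"},
}
first = next_t_block(edges, edges, 3)        # ['T1', 'T3', 'T5']
second = sorted(set(edges) - set(first))     # ['T2', 'T4']
print(first, second)  # reproduces the Heuristic II blocking (T1T3T5)(T2T4)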
Experiments • Synthetic datasets from "An Interval Classifier for Database Mining Applications", VLDB '92 • 9 attributes, 1 class • Default data size = 100K
Conclusion • The experiments show that the proposed confidence-based pruning is effective.