Mining Confident Minimal Rules with Fixed-Consequents
Imad Rahal, Dongmei Ren, Weihua (Allan) Wu, and William Perrizo
Computer Science & Operations Research Department, North Dakota State University
Association Rule Mining
• A sub-branch under the broad umbrella of data mining
• Initially proposed in the context of MBR (market-basket research) by Agrawal at IBM Almaden
• The process of extracting interesting, useful, and actionable associations and/or correlation relationships among large sets of data items
• From data to if-then statements
• Probabilistic in nature: the strength of a rule can be measured
An association rule defines a relationship of the form:
• A → C (if A then C)
• A is the antecedent and C is the consequent
• Given a set of transactions D containing items from some itemspace I, the task is to "find all strong association rules with support > s and confidence > c" (see the sketch below)
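To make the two measures concrete, here is a minimal Python sketch (our illustration, not from the slides) computing the support and confidence of a rule A → C over a toy transaction list; all data values are made up.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A -> C) = support(A u C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"milk", "bread"}, {"milk", "diapers"},
                {"milk", "bread", "diapers"}, {"bread"}]
A, C = {"milk"}, {"diapers"}
print(support(A | C, transactions))    # 0.5 (2 of 4 transactions)
print(confidence(A, C, transactions))  # 0.666... (2 of the 3 milk baskets)
```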
APRIORI
• The first successful attempt at ARM
• APRIORI is a two-step process (step 1 is sketched below):
• Find all frequent itemsets
• They must satisfy the minimum support threshold
• This step is the most computationally intensive
• Generate strong association rules from the frequent itemsets
• They must satisfy the minimum confidence threshold
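A compact sketch of APRIORI's first step under the same toy setup, reusing the `support` helper above; the join/prune structure is the standard level-wise one, and `minsup` is an illustrative parameter.

```python
from itertools import combinations

def apriori_frequent(transactions, minsup):
    """Level-wise frequent itemset mining (step 1 of APRIORI)."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsup]
    frequent = list(level)
    while level:
        seen = set(level)
        # Join: build candidates one item larger from the current level.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune by downward closure: every (k-1)-subset must be frequent,
        # then verify the candidate's support against the threshold.
        level = [c for c in candidates
                 if all(frozenset(s) in seen for s in combinations(c, len(c) - 1))
                 and support(c, transactions) >= minsup]
        frequent.extend(level)
    return frequent

print(apriori_frequent(transactions, minsup=0.5))
# e.g. [{'bread'}, {'diapers'}, {'milk'}, {'bread','milk'}, {'diapers','milk'}]
```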
Title
• Mining Confident Minimal Rules with Fixed-Consequents
• Important keywords in the title are:
• Confident
• Minimal
• Fixed-Consequent
Confident
• Confidence-based: no support pruning
• This does not eliminate the importance of support
• Rather, support is so low in our target applications that it is impractical to use as a basis for pruning
• The price is a "blow-up" in the rule space, because the downward closure property of support no longer applies:
• No itemset I can be frequent unless all of its subsets are also frequent
Confidence gives us trust in a rule
• A 95% confident rule induces an error rate of 5% when generalized
• Thus we always want high confidence (so as to tolerate less error)
• Unlike confidence, support fluctuates depending on the dataset we operate on
• Cohen et al. (2001) argue that rules with high support are obvious and uninteresting
In MBR, data store managers always like to see high support values for their rules
• High support lends more statistical significance to the result
• Contrast: analyzing patient record databases for combinations of attribute values associated with specific diseases
• The analysis is repeated so as to detect strong patterns early on
• Support values are expected (and hoped) to be relatively low
A number of confidence-based ARM approaches have been devised to mine rules over item pairs that match some confidence threshold (Fujiwara and Ullman, 2000)
• Our approach is an extension of those: minimal antecedents instead of singletons
• Other approaches use variants of support, e.g., Dense-Miner (Bayardo, 1999)
Minimality
• A rule R is said to be minimal if there exists no other rule S such that:
• R and S are confident,
• R and S have the same consequent,
• and the antecedent of R is a superset of that of S
• In some applications, non-minimal rules don't add much knowledge
• R1: "formula milk" → "diapers" with high enough confidence
• R2: "formula milk", "baby shampoo" → "diapers" adds little beyond R1
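A small sketch (our naming, not the authors') of the minimality test this definition implies: a confident rule is non-minimal when a previously found confident rule for the same consequent has a strictly smaller antecedent.

```python
def is_minimal(antecedent, confident_antecedents):
    """`confident_antecedents` holds the antecedents of already-found
    confident rules for the fixed consequent."""
    return not any(s < antecedent for s in confident_antecedents)

found = [frozenset({"formula milk"})]                 # R1 is confident
r2 = frozenset({"formula milk", "baby shampoo"})
print(is_minimal(r2, found))                          # False: R2 is non-minimal
```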
The support of a minimal rule forms an upper bound on the support of all derived non-minimal rules
• We get the highest-support rules without requiring the user to specify a minimum support threshold
• Minimal antecedents in the literature:
• Ordonez (1999)…medical data
• Becquet (2002)…genome analysis
• Bastide (2000)
Fixed-Consequent
• The consequent of all rules is pre-specified by the user
• Very well motivated in the literature
• Sometimes used for classification
• In the context of Precision Agriculture:
• Finding associations between high yield and other properties (bands like Red, Green, Blue, NIR, …)
• High yield would be the fixed consequent
Approach
• Set Enumeration trees (Rymon 1993) are used to mine all antecedents of the selected consequent
• Proposed before for ARM
• Transform the problem from a random subset-search problem into an organized tree-search problem
• Depth-first discovery of antecedents
Pruning conditions (a combined sketch follows below)
• Less-than-two-support pruning: if support(I → C) < 2, then eliminate I
• All supersets of I will produce rules with support < 2
• Downward closed
• Minimality pruning: if confidence(I → C) >= minconf, then stop expanding I
• All supersets of I can only produce non-minimal rules
• Downward closed
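A minimal sketch of the depth-first SE-tree search with both pruning conditions, reusing the `support` and `confidence` helpers from the earlier sketches; function and variable names are our own. As a later slide points out, this alone can still emit non-minimal rules across branches; the taboo list described below repairs that.

```python
def mine_minimal_rules(items, consequent, transactions, minconf):
    """Depth-first SE-tree search for confident antecedents of a
    fixed consequent, with both downward-closed pruning conditions."""
    rules = []

    def dfs(antecedent, remaining):
        for i, item in enumerate(remaining):
            node = antecedent | {item}
            # Less-than-two-support pruning on the rule's absolute support.
            if support(node | consequent, transactions) * len(transactions) < 2:
                continue
            conf = confidence(node, consequent, transactions)
            if conf >= minconf:
                rules.append((node, conf))    # minimality pruning: don't expand
            else:
                dfs(node, remaining[i + 1:])  # expand rightward children only

    dfs(frozenset(), list(items))
    return rules

print(mine_minimal_rules(["milk", "bread"], frozenset({"diapers"}),
                         transactions, minconf=0.5))
```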
Example: I = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
• minconf = 1/2 = 0.5; the fixed consequent is item 11; C denotes confidence
• Nodes in red have zero or one support
• Terminated nodes with confidence greater than or equal to minconf are labeled with X (they produce rules)
[Figure: SE-tree rooted at { } over items 1–10, each node annotated with its confidence, e.g., C = 3/7 at node 1 and C = 1/2 at node {1, 9}; the rules produced include 10 → 11, {6, 9} → 11, {6, 8} → 11, 3 → 11, {1, 4, 9} → 11, and {4, 6} → 11]
A problem
• Some non-minimal rules might still be generated!
• {1, 9} → 11
• {1, 8, 9} → 11
• To rectify the problem, we utilize an adapted form of Tabu search
During the processing of an item I (a sketch of the check follows below):
• Associate a temporary taboo list (TLI) with I
• Store in it all the nodes that do not produce rules when joined with I
• Before testing a new potential rule including I, X ∪ {I} → C, check whether any subset of X is in TLI
• In Tabu search, we store moves in the space that have been visited without a desired result so as not to revisit them
• Here, we store all itemsets that violate any of the pruning steps (downward closed) so as not to revisit any of their supersets
• Supersets need to be pruned
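A sketch of the taboo-list check (our illustration of the idea): before testing a candidate antecedent, skip it if any already-taboo itemset is one of its subsets.

```python
def violates_taboo(candidate, taboo_list):
    """True if some itemset in `taboo_list` is a subset of `candidate`,
    i.e. the candidate can only yield pruned or non-minimal rules."""
    return any(t <= candidate for t in taboo_list)

taboo = [frozenset({1, 9})]                          # {1,9} -> 11 was confident
print(violates_taboo(frozenset({1, 8, 9}), taboo))   # True: skip {1,8,9} -> 11
```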
Implementation (the vertical layout is sketched below)
• We adopt a vertical data representation
• Antecedents of rules are produced in depth-first order, which would induce a lot of database scans over horizontal data
• Faster computation of support and confidence
• Compressed in the case of highly sparse data: better memory utilization
• Processing is based on logical operations
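A small sketch of the vertical layout, assuming Python ints as bit vectors (one column per item, one bit per transaction); the names are illustrative.

```python
def verticalize(transactions, items):
    """Turn horizontal transactions into vertical bit columns."""
    columns = {i: 0 for i in items}
    for row, t in enumerate(transactions):
        for i in t:
            columns[i] |= 1 << row        # set this transaction's bit
    return columns

cols = verticalize([{"a", "b"}, {"a"}, {"b", "c"}], ["a", "b", "c"])
print(bin(cols["a"]))                     # 0b11 -> item a in transactions 0 and 1
```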
Predicate Tree (P-tree) Technology
[Figure: (a) a 2-column table of bits; (b) the resulting two bit groups; (c) P1, the P-tree for Column 1; (d) P2, the P-tree for Column 2. Each node records the count of 1-bits in its quadrant and is pure-1, pure-0, or mixed]
[Figure: the corresponding Pure-1 trees]
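A simplified 1-D sketch of the P-tree idea (our approximation; real P-trees split quadrant-wise): a bit column splits recursively, pure-0 and pure-1 runs become leaves, and each node keeps its count of 1-bits.

```python
class PTreeNode:
    def __init__(self, bits):
        self.count = sum(bits)            # root count of this subtree
        self.size = len(bits)
        if 0 < self.count < self.size:
            mid = self.size // 2          # mixed node: split in half
            self.children = [PTreeNode(bits[:mid]), PTreeNode(bits[mid:])]
        else:
            self.children = []            # pure-0 or pure-1 leaf

p = PTreeNode([0, 1, 1, 1, 1, 0, 1, 1])
print(p.count)                            # 6 one-bits under the root
```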
Every data item is represented using a P-tree (sketched below)
• Conf(a → c) = RootCount(Pa AND Pc) / RootCount(Pa)
• Additionally, all taboo lists are represented in binary using P-trees to speed up their scan
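A sketch of the confidence formula above over plain int-encoded bit columns standing in for P-trees (an assumption for brevity): AND the two columns, then count the set bits.

```python
def root_count(bits):
    """Number of 1-bits, i.e. the P-tree root count."""
    return bin(bits).count("1")

def conf(p_a, p_c):
    """Conf(a -> c) = RootCount(Pa AND Pc) / RootCount(Pa)."""
    return root_count(p_a & p_c) / root_count(p_a)

Pa = 0b10110110   # transactions containing item a
Pc = 0b10010100   # transactions containing item c
print(conf(Pa, Pc))                       # 3/5 = 0.6
```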
Comparison analysis
• Conducted on a Pentium II 400 MHz with 128 MB SDRAM running Red Hat Linux 9.1
• C++ was used for coding
• No benchmarks exist for this task
• Compared with Dense-Miner (Bayardo 1999)
• Dense-Miner is capable of mining association rules with fixed consequents at very low support thresholds
Fundamental differences exist between the two approaches:
• Dense-Miner mines all association rules, while we mine only minimal, confident rules
• Dense-Miner uses a variant of support (coverage = minimum support divided by the support of the fixed consequent) as a pruning mechanism, while this is not the case in our work (except for support of less than 2)
• Every rule produced by our approach with a support value above the minimum support threshold used for Dense-Miner will also be produced by Dense-Miner
Results for Dense-Miner were observed with the minimum coverage threshold fixed at 1% and 5%
Number of produced rules
• The number ranges from around 500,000 rules to fewer than 10 rules over both data sets
• A larger (smaller) number of rules is produced at higher (lower) confidence thresholds
Summary
• Proposed an approach based on SE-trees, Tabu search, and the P-tree technology for extracting minimal, confident rules with fixed consequents
• Efficient when such rules are desired
• The approach is complete in the sense that it does not miss any minimal, confident rule
• It suffers in situations where the desired minimal rules lie deep in the tree
• A large number of nodes and levels need to be traversed
Future direction
• Finding heuristic measures for estimating the probability of rule availability along certain branches
• Quitting early in cases where such probability is low
• Experiment with limiting how deep down the tree we go using:
• A fixed number of plies
• Iterative deepening