This paper discusses the process of pruning and summarizing discovered associations in association rule mining, focusing on Direction Setting rules. It explores the challenges of handling vast amounts of association rules and proposes a solution through identifying essential relationships in the data.
Pruning and Summarizing the Discovered Associations. Bing Liu, Wynne Hsu, Yiming Ma, National University of Singapore. Presented by Tushita Agrawal and Manju Navani, April 14th, 2005
Outline • Association Rule Mining • Related Work • Chi-Square Test • Direction Setting Rules • Pruning & Finding DS Rules • The Algorithm • Empirical Evaluation CS 583
Association Rules • A class of important regularities in data. • I = {i1, …, in}: a set of items. D: a set of data cases. • An association rule is an implication of the form X -> Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. • Rule X -> Y holds in D with confidence c if c% of the data cases in D that support X also support Y. • The rule has support s in D if s% of the data cases in D contain X ∪ Y.
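The support and confidence definitions above can be sketched directly in code. This is a minimal illustration on a hypothetical toy data set (the item names and data are invented for the example):

```python
# Toy data cases: each case is a set of items.
D = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c"},
    {"b", "c"},
]

def support(itemset, D):
    """Fraction of data cases in D that contain every item in itemset."""
    return sum(1 for case in D if itemset <= case) / len(D)

def confidence(X, Y, D):
    """Fraction of the cases supporting X that also support Y."""
    return support(X | Y, D) / support(X, D)

X, Y = {"a", "b"}, {"c"}
print(support(X | Y, D))    # support of the rule X -> Y: 0.5
print(confidence(X, Y, D))  # confidence of X -> Y: 2/3
```

Here the rule {a, b} -> {c} has support 2/4 = 0.5 (two of four cases contain all of a, b, c) and confidence 2/3 (two of the three cases containing {a, b} also contain c).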
Association Rule Mining • Generating all association rules whose support and confidence are greater than the user-specified minimum support and minimum confidence. • Algorithm: a 2-step process: • Finds all large itemsets that meet the min support constraint. • Generates rules from all large itemsets that satisfy the min confidence constraint.
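The first step, finding all large (frequent) itemsets, is done level by level. The following is a simplified sketch of that level-wise search (it omits Apriori's candidate-pruning step, so it is an illustration of the idea rather than the exact algorithm):

```python
from itertools import combinations

def frequent_itemsets(D, minsup):
    """Level-wise search: level-k candidates are unions of level-(k-1)
    frequent itemsets. Returns a dict mapping itemset -> support."""
    n = len(D)
    freq = {}
    level = [frozenset([i]) for i in sorted({i for case in D for i in case})]
    while level:
        counts = {c: sum(1 for case in D if c <= case) for c in level}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        freq.update(survivors)
        # Join step: build candidates one item larger from this level's survivors.
        keys = list(survivors)
        level = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
    return freq

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
fi = frequent_itemsets(D, minsup=0.5)
print(sorted(map(sorted, fi)))  # all 1- and 2-itemsets are large; {a,b,c} is not
```

Rule generation (step 2) would then enumerate, for each large itemset, the splits X -> Y that meet the minimum confidence.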
Strengths & Drawbacks • Efficiently discovers the complete set of associations that exist in the data. • These associations provide a complete picture of the underlying regularities in the domain. • However, the number of discovered associations can be huge. • Very difficult for a human user to analyze. • Even worse on data sets whose items are highly correlated.
Solution??? • Give an arbitrary small subset of the rules to the user. • NOT SATISFACTORY!!! • A small subset can only give a partial picture of the domain. • “Can the completeness of association rule mining be preserved without overwhelming the user?”
Solution – DS Rules • Prune insignificant rules and find a special subset of association rules that represents the underlying relationships in the data: direction setting (DS) rules. • DS rules give a summary of the behavior of the discovered associations. • They represent the essential relationships, or the structure (skeleton), of the domain.
In this paper... • Association rule mining from a relational table. • Item: attribute = value. • Numerical values are discretized. • Mining is targeted at a specific attribute. • Rules are expressed as X -> y, where y is a value of the target attribute and X is a set of items from the remaining attributes. • Large rule: a rule that meets the minimum support. • Minimum confidence is not used, since it does not reflect the relationships of the domain. • Statistical correlation is the basis for rule finding.
Proposed Technique • Two steps: • Pruning the association rules. • Summarizing the unpruned rules. • Flow: Discovered large rules → Pruning → Significant rules → Summarization → Direction setting rules + Non-direction setting rules.
Related Work • Template-Based Approach: • The most straightforward method for selecting interesting association rules. • A template describes a set of rules in terms of the items occurring in the conditional and consequent parts, e.g., Fruit+, Dairy_product* -> Meat. • The user specifies what he/she wants to see using templates. • The system then finds only the matching rules.
Related Work (cont...) • Subjective Interestingness: • Interactive and iterative approach for finding unexpected rules. • User specifies his/her existing knowledge of the domain. • The system then finds unexpected rules by comparing user’s knowledge with the discovered rules. • If the unexpected rules identified by the system are not truly unexpected, they serve to remind the user what he/she might have forgotten.
Related Work (cont...) • Association Rule Cover Based: • A cover is a subset of the discovered associations that can cover the database. • The number of rules in a cover can be quite small. • A greedy algorithm is proposed to find a good cover and the remaining rules are pruned. • The problem with this method is that the advantage of association rules, its completeness, is lost.
Related Work (cont...) • Constraint-Based Rule Mining: • Technique using minimum improvement: the difference between the confidence of a rule and the confidence of any proper sub-rule with the same consequent. • Prunes rules that do not meet the minimum improvement in confidence. • This paper uses a similar pruning method, but with the chi-square test as the basis of pruning.
Related Work (cont...) • Online Generation of ARs: • Technique to remove two types of redundant rules: simple and strict redundancy. • A rule R1 is redundant w.r.t. another rule R2 if the sup and conf of R1 are always at least as large as the sup and conf of R2. • Simple: removes rules derived from the same itemset. Ex: AB => C is redundant w.r.t. A => BC. • Strict: applies to two itemsets where one is a subset of the other. Ex: X => Y is redundant w.r.t. X => YZ.
Limitations • These approaches do not prune insignificant rules. • They do not provide a summary of the discovered rules.
Chi-Square Test Statistic (χ²) • Tests independence and correlation. • The closer the observed frequency is to the expected frequency, the greater the weight of evidence in favor of independence. • E.g., an association rule for a bank: Job = yes -> Loan = approved [sup = 200/500, conf = 200/300]
Chi-Square Test Statistic (χ²) (cont...) • “Is loan approval dependent on whether one has a job or not?” • Hypothesis: the two attributes are independent. • Of 500 people: • 300 (60%) had a job, 200 (40%) had no job. • Under independence, the 280 approved and 220 not-approved cases should be divided in the same ratio. • χ² tests the significance of the deviation from the expected values.
Chi-Square Test Statistic (χ²) (cont...) • A χ² value of 0 implies the attributes are statistically independent. • If it is higher than a certain threshold, we reject the independence assumption. • In our case χ² = 34.63, which is much larger than the threshold (3.84 at the 95% significance level). • Conclusion: loan approval is correlated with whether one has a job. • How are they correlated?
Chi-Square Test Statistic (χ²) (cont...) Let s be the minimum support and c the significance level. A rule X -> y is said to be (s, c): • Correlated if • the rule’s support exceeds s, and • the χ² value for the rule with respect to the whole data exceeds the χ² threshold at c. • Uncorrelated (independent) if • the rule’s support exceeds s, and • the χ² value for the rule with respect to the whole data does not exceed the χ² threshold at c.
Chi-Square Test Statistic (χ²) (cont...) Types of correlation/direction (f0 is the observed frequency, f the expected frequency): • Positive correlation: if X and y of rule r, X -> y, are correlated and f0 / f > 1, the direction of r is 1. • Negative correlation: if X and y of rule r, X -> y, are correlated and f0 / f < 1, the direction of r is -1. • Independence: if X and y of rule r, X -> y, are independent, the direction of r is 0. The rule in the previous example represents positive correlation: f0 = 200; f = 300 * 280 / 500 = 168.
Chi-Square Test Statistic (χ²) (cont...) A generic 2 × 2 contingency table is used to compute the correlation type for a rule r: X -> y: the rule is compared with the whole population, i.e., the whole data set.
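The 2 × 2 test and the direction computation described above can be sketched as follows, reproducing the loan example's numbers (500 people, 300 with a job, 280 approved, 200 with a job and approved):

```python
def chi_square_2x2(n11, n1_, n_1, n):
    """χ² for a 2x2 contingency table. n11 = count(X and y),
    n1_ = count(X), n_1 = count(y), n = total data cases."""
    obs = [n11, n1_ - n11, n_1 - n11, n - n1_ - n_1 + n11]
    row, col = [n1_, n - n1_], [n_1, n - n_1]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n           # expected frequency
            chi2 += (obs[i * 2 + j] - e) ** 2 / e
    return chi2

def direction(n11, n1_, n_1, n, threshold=3.84):
    """Direction (1 / -1 / 0) of X -> y as defined above;
    threshold = χ² value at the 95% significance level."""
    f = n1_ * n_1 / n                         # expected frequency f
    if chi_square_2x2(n11, n1_, n_1, n) <= threshold:
        return 0                              # independent
    return 1 if n11 / f > 1 else -1           # f0 / f decides the sign

# Loan example from the slides:
print(round(chi_square_2x2(200, 300, 280, 500), 2))  # 34.63
print(direction(200, 300, 280, 500))                  # 1 (positive correlation)
```

The expected count for (Job=yes, approved) is 300 * 280 / 500 = 168, so f0 / f = 200 / 168 > 1 and the rule's direction is 1, matching the slide's conclusion.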
Direction Setting (DS) Rules The set of expected directions of a rule r: • If r is a 1-condition rule, the set of expected directions is {0} (the condition and consequent are expected to be independent). • If r is a k-condition rule (k > 1) of the form r: a1, a2, ..., ak -> y: • View r as a combination of 2 rules, a 1-condition rule r1: ai -> y and a (k-1)-condition rule rrest: a1, ..., ai-1, ai+1, ..., ak -> y. • The expected direction for this combination is determined by the directions of r1 and rrest.
Direction Setting (DS) Rules (cont..) A rule r is a DS rule if it satisfies both: • It has a direction of 1 (positive correlation). • Its direction is not an element of the set of expected directions. A non-direction-setting (non-DS) rule is a positively correlated rule that is not a DS rule.
Direction Setting (DS) Rules (cont..) All possible combinations of r1, rrest and r are written as r1.dir, rrest.dir := r.dir (equivalently rrest.dir, r1.dir := r.dir; the order does not matter). • Cases C(2), D(2), E(2) and F(2) have direction of r as 1, but the expected direction is 0 or unknown. • Thus r sets a new direction: a potential DS rule. • r is a DS rule if, for all combinations of r1 and rrest, the direction of r differs from the expected directions.
Pruning Association Rules • Why? • The number of discovered associations can be huge. • Many mined rules are spurious and insignificant; their existence may simply be due to chance. • Ex: P: Job=yes -> Loan=approved [sup = 60%, conf = 90%] r: Job=yes, Credit_history=good -> Loan=approved [sup = 40%, conf = 91%] • r is insignificant with respect to P (P is more general). • r’s slightly higher confidence is more likely due to chance.
Pruning Association Rules (Cont...) • So r can be pruned w.r.t. P. • Instead of using the whole data set, the correlation of r is tested w.r.t. P, since r only covers a subset of the data cases covered by P. • Rule: given a rule r: X -> y, prune r using each ancestor rule P: X' -> y, which has the same consequent as r but fewer (or 0) conditions. • How is it done? Perform a χ² test on r with respect to P: positive correlation => keep r; otherwise prune r.
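A minimal sketch of this ancestor-based pruning, on hypothetical data (the item names, helper names, and data are invented for the illustration; the paper's actual implementation may differ): the χ² test is run only on the data cases covered by the ancestor's condition.

```python
def chi2_2x2(n11, n1_, n_1, n):
    """χ² for a 2x2 table. n11 = count(extra items & y), n1_ = count(extra
    items), n_1 = count(y), n = cases covered by the ancestor's condition."""
    obs = [n11, n1_ - n11, n_1 - n11, n - n1_ - n_1 + n11]
    exp = [n1_ * n_1 / n, n1_ * (n - n_1) / n,
           (n - n1_) * n_1 / n, (n - n1_) * (n - n_1) / n]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)

def keep_rule(cases, X, X_anc, y, threshold=3.84):
    """cases: list of (items, label). Returns True iff rule X -> y survives
    pruning w.r.t. its ancestor X_anc -> y (X_anc a subset of X)."""
    sub = [(items, label) for items, label in cases if X_anc <= items]
    extra = X - X_anc
    n = len(sub)
    n11 = sum(1 for items, label in sub if extra <= items and label == y)
    n1_ = sum(1 for items, _ in sub if extra <= items)
    n_1 = sum(1 for _, label in sub if label == y)
    if chi2_2x2(n11, n1_, n_1, n) <= threshold:
        return False                 # independent within P's coverage: prune
    return n11 > n1_ * n_1 / n       # keep only if positively correlated

# Synthetic data where Credit=good adds nothing beyond Job=yes:
cases = ([({"Job=yes", "Credit=good"}, "approved")] * 45
         + [({"Job=yes", "Credit=good"}, "not")] * 5
         + [({"Job=yes"}, "approved")] * 45
         + [({"Job=yes"}, "not")] * 5)
print(keep_rule(cases, {"Job=yes", "Credit=good"}, {"Job=yes"}, "approved"))  # False
```

In this data the approval rate is 90% both with and without Credit=good, so χ² is 0 within P's coverage and the longer rule is pruned, mirroring the slide's example.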
Finding DS Rules • Evaluate each 1-condition rule to determine its direction (i.e., -1/0/1). • Proceed level by level to analyze each rule and check whether it follows the direction set by the previous level’s rules or sets a new direction: • At level k (k > 1), for each k-condition rule r, determine its direction. • Examine each combination of a 1-condition rule r1 and a (k-1)-condition rule rrest of r to determine whether r follows the expected direction set by r1 and rrest. • If r follows the direction set by at least one such combination, r is NOT a DS rule. If not, r sets a new direction, and it is a DS rule.
Finding DS Rules (cont...) Combinations, written as r1.dir, rrest.dir := r.dir:
A (1) 1, 1 := 1   (2) 1, 1 := -1   (3) 1, 1 := 0
B (1) 1, 0 := 1   (2) 1, 0 := -1   (3) 1, 0 := 0
C (1) 0, 0 := 0   (2) 0, 0 := 1   (3) 0, 0 := -1
D (1) -1, -1 := -1   (2) -1, -1 := 1   (3) -1, -1 := 0
E (1) -1, 0 := -1   (2) -1, 0 := 1   (3) -1, 0 := 0
F (1) -1, 1 := 1   (2) -1, 1 := -1   (3) -1, 1 := 0
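The DS test implied by the table above can be sketched as follows. The encoding of "expected" is my reconstruction from the slides: direction 1 is expected only from a combination that contains a 1 and no -1 (cases A and B); mixed combinations like (-1, 1) are treated as unexpected, matching the later R5 example:

```python
def follows_expected(d1, drest):
    """True if direction 1 is an expected outcome of combining a 1-condition
    rule with direction d1 and a (k-1)-condition rule with direction drest."""
    return 1 in (d1, drest) and -1 not in (d1, drest)

def is_ds_rule(r_dir, combos):
    """r is direction setting iff it is positively correlated (direction 1)
    and no (r1, rrest) combination already predicts direction 1."""
    if r_dir != 1:
        return False
    return not any(follows_expected(d1, drest) for d1, drest in combos)

# R5's only combination is (R1.dir, R3.dir) = (-1, 1): unexpected, so DS.
print(is_ds_rule(1, [(-1, 1)]))  # True
print(is_ds_rule(1, [(1, 1)]))   # False: case A(1) is expected, non-DS
```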
Some Important Points • Every DS rule r is unexpected w.r.t. all r1 and rrest combinations, because r does not follow their directions. • After seeing the DS rules, the directions of non-DS rules are no longer surprising, as they are just combinations of DS rules and independent rules. • DS rules can guide the user to the related non-DS rules, if he/she is interested; the non-DS rules provide further details about the DS rules.
Interactive Exploration • A user interface to interactively • focus on the essential aspects of the domain, • selectively view the relevant details, and • explore DS and non-DS rules. • Features • DS rules can be viewed according to their levels. Ex: Level 1 rule: R2: (1) (DS) Age=young -> Loan=not-approved Level 2 rule: R5: (1) (DS) Age=young, OH=yes -> Loan=approved
Features (cont…) • View the reason for classifying a rule as DS. Ex: Level 1 rules: R1: (-1) Age=young -> Loan=approved R2: (1) (DS) Age=young -> Loan=not-approved R3: (1) (DS) OH=yes -> Loan=approved R4: (-1) OH=yes -> Loan=not-approved Level 2 rule: R5: (1) (DS) Age=young, OH=yes -> Loan=approved -1, 1 := 1 {unexpected} The system displays R1 & R3 together with their directions.
• View relevant non-DS rules. Ex: Level 1 rules: R1: (1) (DS) Job=yes -> Loan=approved R2: (1) (DS) OH=yes -> Loan=approved Level 2 rule: R3: (1) (non-DS) Job=yes, OH=yes -> Loan=approved 1, 1 := 1 {expected} The system displays R3 when the user clicks to view the relevant non-DS rules that follow the direction of R1 and R2.
P-DS Algorithm P-DS prunes the discovered large rules and finds the DS rules. Input parameters: • F: the set of discovered large rules. • T: the χ² value at a particular significance level. • Process the discovered rules level by level, from 1 to n. • For each rule r at a particular level: • Call compDir to compute the type of correlation/direction of r w.r.t. “-> y”. • If r is a level-1 rule and its direction is 1, then r is a DS rule; • else r’s direction is not 1, so record r as pruned by “-> y”.
P-DS Algorithm (cont..) • For all other levels, process r using all pairs of r1 and rrest: • If r is pruned and cannot be a DS rule, exit the loop for r. • Evaluate r by calling evalPrune(). • If r is still a potential DS rule, analyze r by considering the four direction-setting cases to determine whether it can be a DS rule. • If, from the previous analyses, r can be a DS rule, check whether r is pruned. If pruned, set the direction of r as undefined; else add r to the set of DS rules.
P-DS Algorithm (cont…) Procedure compDir(r, R, T) – computes the direction of rule r: If the χ² value of rule r exceeds the threshold T: if the observed frequency > expected frequency, then direction = 1, else direction = -1. Else direction = 0. Procedure evalPrune(r, rrest) – tries to prune r using rrest: • If rrest is pruned, get the rule that prunes rrest. • If the direction of r is not 1, prune r and record rrest as the rule that prunes it. • If the direction of r is 1, compare it with rrest using χ². If the resulting direction is not 1, prune r.
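The level-wise DS-finding part of P-DS can be sketched as below. This is an assumption-laden simplification: directions are taken as precomputed inputs (compDir is abstracted away), pruning via evalPrune is omitted, and the "expected" check encodes the combination table from the earlier slide:

```python
def find_ds_rules(rules_by_level):
    """rules_by_level[k]: dict mapping condition itemsets (frozensets of
    k+1 items) to their direction (-1/0/1) w.r.t. a fixed consequent y."""
    ds, dir_of = set(), {}
    for level, rules in enumerate(rules_by_level):
        for cond, d in rules.items():
            dir_of[cond] = d
            if level == 0:
                if d == 1:       # 1-condition rules are expected to be 0
                    ds.add(cond)
                continue
            if d != 1:
                continue         # only positively correlated rules can be DS
            expected = False
            for a in cond:       # every (r1, rrest) split of the condition
                d1 = dir_of.get(frozenset([a]), 0)
                drest = dir_of.get(cond - {a}, 0)
                # direction 1 is expected only from a combo with a 1, no -1
                if 1 in (d1, drest) and -1 not in (d1, drest):
                    expected = True
                    break
            if not expected:
                ds.add(cond)
    return ds

# Rules w.r.t. Loan=approved from the interactive-exploration example:
rules_by_level = [
    {frozenset({"Age=young"}): -1, frozenset({"OH=yes"}): 1},
    {frozenset({"Age=young", "OH=yes"}): 1},
]
print(len(find_ds_rules(rules_by_level)))  # 2: OH=yes and Age=young,OH=yes
```

As on the earlier slide, Age=young, OH=yes -> Loan=approved is DS because its only split, (-1, 1), does not predict direction 1.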
Empirical Evaluation • Data with a huge number of associations: 30 data sets in total. • 25 data sets from the UCI Machine Learning Repository. • The remaining 5 are real-life application data sets. • Large rules are processed in memory (combinatorial explosion) – limited to 80,000 rules. • Attributes are discretized into intervals using the target attribute. • Experiments were repeated with: • significance levels for the χ² test of 95% and 90%, • minimum support (minsup) values of 2% and 1%.
Empirical Evaluation (cont…) • As minimum support increases, fewer DS rules are produced. • When the significance level is lowered to 90%, more DS rules are produced, because pruning is less stringent.
Conclusion • The P-DS algorithm: • prunes rules that contain less information than their ancestors, and • identifies the direction setting rules, giving a global picture of the underlying relationships in the domain. • The number of DS rules generated is small enough for manual inspection by a human user. • However: • this technique does not use minimum confidence, and • it does not report rules with negative correlation.