Pruning and Summarizing the Discovered Associations Bing Liu, Wynne Hsu, Yiming Ma (National University of Singapore) Presented by Tushita Agrawal and Manju Navani, April 14th, 2005
Outline • Association Rule Mining • Related Work • Chi-Square Test • Direction Setting Rules • Pruning & Finding DS Rules • The Algorithm • Empirical Evaluation CS 583
Association Rules • A class of important regularities in data. • I = {i1, …, in}: a set of items. D: a set of data cases. • An association rule is an implication of the form X -> Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. • Rule X -> Y holds in D with confidence c if c% of the data cases in D that support X also support Y. • The rule has support s in D if s% of the data cases in D contain X ∪ Y.
Association Rule Mining • Generating all association rules that have support and confidence greater than the user-specified min support and min confidence. • Algorithm: a 2-step process: • Find all large itemsets that meet the min support constraint. • Generate rules from all large itemsets that satisfy the min confidence constraint.
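As a sketch of the two threshold measures used above (not code from the paper; the items and transactions are made-up toy data), support and confidence can be computed directly from their definitions:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = sup(X union Y) / sup(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# toy market-basket data (items and baskets are hypothetical)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# sup({bread, milk}) = 2/4 = 0.5; conf(bread -> milk) = (2/4)/(3/4) = 2/3
```

A miner would enumerate candidate itemsets (e.g. Apriori-style) and keep only those whose support and derived rules' confidence exceed the user's thresholds.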
Strengths & Drawbacks • Efficiently discovers the complete set of associations that exist in the data. • These associations provide a complete picture of the underlying regularities in the domain. • However, the number of discovered associations can be huge. • Very difficult for a human user to analyze. • Even worse for data sets whose items are highly correlated.
Solution??? • Give an arbitrarily small subset of the rules to the user. • NOT SATISFACTORY!!! • A small subset can only give a partial picture of the domain. • "Can the completeness of association rule mining be preserved without overwhelming the user?"
Solution – DS Rules • Prune insignificant rules and find a special subset of association rules that represents the underlying relationships in the data: direction setting (DS) rules. • DS rules give a summary of the behavior of the discovered associations. • They represent the essential relationships, or the structure (skeleton), of the domain.
In this paper... • Association rule mining from a relational table. • Item: attribute = value. • Numerical values are discretized. • Mining is targeted at a specific attribute. • A rule is expressed as X -> y, where y is a value of the target attribute and X is a set of items from the remaining attributes. • Large rule: a rule that meets the min support. • Min confidence is not used, since it does not reflect the relationships of the domain. • Statistical correlation is the basis for rule finding.
Proposed Technique • Two steps: • Pruning the association rules. • Summarizing the unpruned rules. • Flow: discovered large rules -> Pruning -> significant rules -> Summarization -> direction setting rules + non-direction setting rules.
Related Work • Template-Based Approach: • The most straightforward method for selecting interesting association rules. • A template describes a set of rules in terms of the items occurring in the conditional and the consequent parts, e.g., Fruit+, Dairy_product* -> Meat. • The user specifies what he/she wants to see using templates. • The system then finds only the matching rules.
Related Work (cont...) • Subjective Interestingness: • An interactive and iterative approach for finding unexpected rules. • The user specifies his/her existing knowledge of the domain. • The system then finds unexpected rules by comparing the user's knowledge with the discovered rules. • If the unexpected rules identified by the system are not truly unexpected, they serve to remind the user of what he/she might have forgotten.
Related Work (cont...) • Association Rule Cover Based: • A cover is a subset of the discovered associations that can cover the database. • The number of rules in a cover can be quite small. • A greedy algorithm is proposed to find a good cover, and the remaining rules are pruned. • The problem with this method is that the key advantage of association rules, their completeness, is lost.
Related Work (cont...) • Constraint-Based Rule Mining: • A technique using minimum improvement: the difference between the confidence of a rule and the confidence of any proper sub-rule with the same consequent. • Rules that do not meet the minimum improvement in confidence are pruned. • This paper uses a similar pruning method, but with the chi-square test as the basis of pruning.
Related Work (cont...) • Online Generation of ARs: • A technique to remove two types of redundant rules: simple and strict redundancy. • A rule R1 is redundant w.r.t. another rule R2 if the support and confidence of R1 are always at least as large as those of R2. • Simple: removes redundant rules derived from the same itemset. Ex: AB -> C is redundant w.r.t. A -> BC. • Strict: applies to two itemsets where one is a subset of the other. Ex: X -> Y is redundant w.r.t. X -> YZ.
Limitations • These approaches do not prune insignificant rules. • They do not provide a summary of the discovered rules.
Chi-Square Test Statistic (χ²) • Used to test independence and correlation. • The closer the observed frequency is to the expected frequency, the greater the weight of evidence in favor of independence. • E.g., an association rule for a bank: Job = yes -> Loan = approved [sup = 200/500, conf = 200/300]
Chi-Square Test Statistic (χ²) (cont...) • "Is loan approval dependent on whether one has a job or not?" • Hypothesis: the two attributes are independent. • Of 500 people, 300 (60%) had a job and 200 (40%) had no job. • Under independence, the 280 approved and 220 not-approved cases should divide in the same ratio. • χ² tests the significance of the deviation from the expected values.
Chi-Square Test Statistic (χ²) (cont...) • A χ² value of 0 implies the attributes are statistically independent. • If it is higher than a certain threshold, reject the independence assumption. • In our case χ² = 34.63, which is much larger than the threshold (3.84 at the 95% significance level). • Conclusion: loan approval is correlated with whether one has a job. • How are they correlated?
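The 34.63 figure can be reproduced from the counts on the previous slide (a minimal sketch; the function name is ours): 200 job holders were approved and 100 were not, while 80 people without a job were approved and 120 were not.

```python
def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = n11 + n12 + n21 + n22
    rows = [n11 + n12, n21 + n22]            # row totals
    cols = [n11 + n21, n12 + n22]            # column totals
    chi2 = 0.0
    for i, row in enumerate([(n11, n12), (n21, n22)]):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n  # independence assumption
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# (job & approved, job & not, no-job & approved, no-job & not)
chi2 = chi_square_2x2(200, 100, 80, 120)
# chi2 ≈ 34.63 > 3.84 (95% threshold, 1 d.o.f.) -> reject independence
```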
Chi-Square Test Statistic (χ²) (cont...) s: minimum support; c: significance level. X -> y is said to be (s, c)- • Correlated if: • the rule's support exceeds s, and • the χ² value for the rule with respect to the whole data exceeds the χ² threshold at significance level c. • Uncorrelated (independent) if: • the rule's support exceeds s, and • the χ² value for the rule with respect to the whole data does not exceed the χ² threshold at significance level c.
Chi-Square Test Statistic (χ²) (cont...) Types of correlation/direction (fo: observed frequency; fe: expected frequency) • Positive correlation: if X and y of rule r, X -> y, are correlated and fo/fe > 1, then the direction of r is 1. • Negative correlation: if X and y of rule r, X -> y, are correlated and fo/fe < 1, then the direction of r is -1. • Independence: if X and y of rule r, X -> y, are independent, then the direction of r is 0. The rule in the previous example represents positive correlation: fo = 200; fe = 300 * 280/500 = 168.
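The three cases can be folded into one function over the rule's counts (an illustrative helper, not the paper's code; 3.84 is the 95% threshold for one degree of freedom):

```python
def rule_direction(n_xy, n_x, n_y, n, threshold=3.84):
    """Direction of X -> y: 1 (positive), -1 (negative) or 0 (independent).
    n_xy: cases supporting both X and y; n_x, n_y: marginal counts; n: total."""
    # reconstruct the full 2x2 table from the marginals
    table = [(n_xy, n_x - n_xy),
             (n_y - n_xy, n - n_x - n_y + n_xy)]
    rows = [n_x, n - n_x]
    cols = [n_y, n - n_y]
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))
    if chi2 <= threshold:
        return 0                              # independent
    return 1 if n_xy > n_x * n_y / n else -1  # fo vs fe

# Loan example: fo = 200, fe = 168 -> direction 1 (positive correlation)
```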
Chi-Square Test Statistic (χ²) (cont...) A generic 2 x 2 contingency table is used, with rows for X and not-X and columns for y and not-y. To compute the correlation type for rule r: X -> y, compare the rule with the whole population, i.e., the whole data set.
Direction Setting (DS) Rules The set of expected directions of a rule r: • If r is a 1-condition rule, the set of expected directions is {0} (the condition and the consequent are expected to be independent). • If r is a k-condition rule (k > 1) of the form r: a1, a2, ..., ak -> y: • view r as a combination of two rules, a 1-condition rule r1: ai -> y and a (k-1)-condition rule rrest over the remaining k-1 conditions. • The expected direction for each such combination is determined from the directions of r1 and rrest.
Direction Setting (DS) Rules (cont...) A rule r is a DS rule if it satisfies both: • it has a direction of 1 (positive correlation), and • its direction is not an element of the set of expected directions. A non-direction setting (non-DS) rule is a positively correlated rule that is not a DS rule.
Direction Setting (DS) Rules (cont...) All possible combinations of r1, rrest and r are written as r1.dir, rrest.dir := r.dir (given the directions of r1 and rrest, r has the direction on the right). • Cases C(2), D(2), E(2) and F(2) have a direction of 1 for r, but the expected direction is 0, -1 or unknown. • Thus r sets a new direction: a potential DS rule. • r is a DS rule if, for all combinations of r1 and rrest, the direction of r differs from the expected directions.
Pruning Association Rules • Why? • The number of discovered associations can be huge. • Many mined rules are spurious and insignificant; their existence may simply be due to chance. • Ex: P: Job=yes -> Loan=approved [sup = 60%, conf = 90%] r: Job=yes, Credit_history=good -> Loan=approved [sup = 40%, conf = 91%] • r is insignificant with respect to P (P is more general). • r's slightly higher confidence is more likely due to chance.
Pruning Association Rules (cont...) • So r can be pruned w.r.t. P. • Instead of using the whole data set, the correlation of r is tested w.r.t. P, since r covers only a subset of the data cases covered by P. • Rule: given a rule r, X -> y, prune r using each ancestor rule P, which has the same consequent y as r but fewer (possibly 0) conditions. • How? Perform a χ² test on r with respect to P: positive correlation => keep r; otherwise prune r.
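A hedged sketch of this ancestor-based test (our own formulation of the counts, not the paper's code): restrict the 2x2 table to the cases covered by P, split them by whether they also satisfy r's extra conditions, and keep r only if it is positively correlated with y inside P's cover.

```python
def prune_wrt_ancestor(n_r_y, n_r, n_P_y, n_P, threshold=3.84):
    """Chi-square test of rule r inside the cover of its ancestor P.
    n_P: cases covered by P; n_P_y: those with consequent y;
    n_r / n_r_y: the same counts for r (r's cover is a subset of P's).
    Returns True if r should be pruned (no positive correlation w.r.t. P)."""
    table = [(n_r_y, n_r - n_r_y),
             (n_P_y - n_r_y, (n_P - n_r) - (n_P_y - n_r_y))]
    rows = [n_r, n_P - n_r]
    cols = [n_P_y, n_P - n_P_y]
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n_P) ** 2
               / (rows[i] * cols[j] / n_P)
               for i in range(2) for j in range(2))
    expected = n_r * n_P_y / n_P
    return not (chi2 > threshold and n_r_y > expected)

# hypothetical counts mirroring the slide's pattern: P covers 500 cases
# (conf 90%), r covers 220 of them (conf ~91%) -> gain is insignificant
```

With those counts, the tiny confidence improvement is far below the χ² threshold, so r is pruned, just as the slide's example suggests.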
Finding DS Rules • Evaluate each 1-condition rule to determine its direction status (i.e., -1/0/1). • Proceed level by level to analyze each rule and check whether it follows the direction set by previous-level rules or sets a new direction: • At level k (k > 1), for each k-condition rule r, determine its direction. • Examine each combination of a 1-condition rule r1 and a (k-1)-condition rule rrest of r to determine whether r follows the expected direction set by r1 and rrest. • If r follows the direction set by at least one such combination, r is NOT a DS rule. Otherwise r sets a new direction, and it is a DS rule.
Finding DS Rules (cont...) All combination cases (r1.dir, rrest.dir := r.dir): A (1) 1, 1 := 1 (2) 1, 1 := -1 (3) 1, 1 := 0 B (1) 1, 0 := 1 (2) 1, 0 := -1 (3) 1, 0 := 0 C (1) 0, 0 := 0 (2) 0, 0 := 1 (3) 0, 0 := -1 D (1) -1, -1 := -1 (2) -1, -1 := 1 (3) -1, -1 := 0 E (1) -1, 0 := -1 (2) -1, 0 := 1 (3) -1, 0 := 0 F (1) -1, 1 := -1 (2) -1, 1 := 1 (3) -1, 1 := 0
Some Important Points • Every DS rule r is unexpected w.r.t. all r1 and rrest combinations, because r does not follow their directions. • After seeing the DS rules, the directions of non-DS rules are no longer surprising, as they are just combinations of DS rules and independence rules. • DS rules can guide the user to the related non-DS rules, if he/she is interested. The non-DS rules can provide further details with regard to the DS rules.
Interactive Exploration • A user interface to interactively • focus on the essential aspects of the domain, • selectively view the relevant details, • explore DS and non-DS rules. • Features • DS rules can be viewed according to their levels. Ex: Level 1 rule: R2: (1) (DS) Age=young -> Loan=not-approved Level 2 rule: R5: (1) (DS) Age=young, OH=yes -> Loan=approved
Features (cont…) • View the reason for classifying a rule as DS. Ex: Level 1 rules: R1: (-1) Age=young -> Loan=approved R2: (1) (DS) Age=young -> Loan=not-approved R3: (1) (DS) OH=yes -> Loan=approved R4: (-1) OH=yes -> Loan=not-approved Level 2 rule: R5: (1) (DS) Age=young, OH=yes -> Loan=approved -1, 1 := 1 {unexpected} The system displays R1 & R3 together with their directions.
• View relevant non-DS rules. Ex: Level 1 rules: R1: (1) (DS) Job=yes -> Loan=approved R2: (1) (DS) OH=yes -> Loan=approved Level 2 rule: R3: (1) (non) Job=yes, OH=yes -> Loan=approved 1, 1 := 1 {expected} The system displays R3 when the user clicks to view the relevant non-DS rules that follow the directions of R1 and R2.
P-DS Algorithm P-DS prunes the discovered large rules and finds the DS rules. Input parameters: • F: the set of discovered large rules. • T: the χ² value at a particular significance level. • Process the discovered rules level by level from 1 to n. • For each rule r at a particular level: • call compDir to compute the type of correlation/direction of r w.r.t. "-> y". • If r is a level-1 rule and its direction is 1, then r is a DS rule; • else r's direction is not 1, so record r as pruned by "-> y".
P-DS Algorithm (cont..) • For all other levels, process r using all pairs of r1 and rrest: • If r is pruned and cannot be a DS rule, exit. • Evaluate r by calling evalPrune(). • If r is still a potential DS rule, analyze r by considering the four direction setting cases to determine whether it can be a DS rule. • If, from the previous analyses, r can be a DS rule, check whether r is pruned. If pruned, set the direction of r as undefined; else add r to the set of DS rules.
P-DS Algorithm (cont…) Procedure compDir(r, R, T): computes the direction of rule r. If the χ² value of rule r exceeds the threshold value: if observed frequency > expected frequency, then direction = 1; else direction = -1. Otherwise, direction = 0. Procedure evalPrune(r, rrest): tries to prune r using rrest. • If rrest is pruned, get the rule that prunes rrest. • If the direction of r is not 1, prune r and record rrest as its pruning rule. • If the direction of r is 1, compare it with rrest using χ². If the resulting direction is not 1, prune r.
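The compDir step maps directly onto a few lines (a sketch; the χ² statistic and the observed/expected frequencies are assumed precomputed elsewhere, e.g. by a chi-square routine over the rule's 2x2 table):

```python
def comp_dir(f_obs, f_exp, chi2, threshold=3.84):
    """compDir as described above: direction of rule r w.r.t. "-> y".
    chi2 is the precomputed chi-square statistic for r's 2x2 table;
    3.84 is the 95% threshold with 1 degree of freedom."""
    if chi2 > threshold:
        return 1 if f_obs > f_exp else -1  # positively / negatively correlated
    return 0                               # independent

# loan example: f_obs = 200, f_exp = 168, chi2 = 34.63 -> direction 1
```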
Empirical Evaluation • Data with a huge number of associations: • 25 data sets from the UCI Machine Learning Repository, • the remaining 5 are real-life application data sets (30 in total). • Large rules are processed in memory (combinatorial explosion), with a limit of 80,000 rules. • Attributes are discretized into intervals using the target attribute. • Experiments were repeated with: • significance levels for the χ² test of 95% and 90%, • minimum support (minsup) values of 2% and 1%.
Empirical Evaluation (cont…) • As the minimum support increases, fewer DS rules are produced. • When the significance level is lowered to 90%, more DS rules are produced, because pruning is less stringent.
Conclusion • P-DS Algorithm: • prunes rules that contain less information than their ancestors, • identifies the direction setting rules, giving a global picture of the underlying relationships in the domain. • The number of DS rules generated is small enough to be manually inspected by a human user. • However: • this technique does not use minimum confidence, • it does not report rules with negative correlation.