Mining Association Rules with Rough Sets and Large Itemsets - A Comparative Study
Daniel Delic, Hans-J. Lenz, Mattis Neiling
Free University of Berlin, Institute of Applied Computer Science, Garystr. 21, D-14195 Berlin, Germany
Two different methods for the extraction of association rules
• Large itemset method (e.g. Apriori)
• Rough set method
Task
• Comparison of both methods
Interesting questions
• Are there any differences/similarities between the extracted rules?
• If so:
• Which method leads to better rules?
• Could a combination of both procedures improve the quality of the derived rules?
1 INTRODUCTION
Introduction • Large Itemset Method • Rough Set Method • Comparison of the Procedures • Hybrid Procedure Apriori+ • Summary • Outlook • References
2 LARGE ITEMSET METHOD
Type of analyzable data
• "Market basket data": attributes with boolean domains
• Stored in a table, each row representing a market basket
2 LARGE ITEMSET METHOD
Step 1 • Large k-itemset generation with Apriori, minimum support 40%
• Candidate 1-itemsets
• Spaghetti: support = 3 (60%)
• Tomato Sauce: support = 3 (60%)
• Bread: support = 3 (60%)
• Butter: support = 1 (20%)
2 LARGE ITEMSET METHOD
Step 2
• Large 1-itemsets: Spaghetti, Tomato Sauce, Bread
• Candidate 2-itemsets
• {Spaghetti, Tomato Sauce}: support = 2 (40%)
• {Spaghetti, Bread}: support = 2 (40%)
• {Tomato Sauce, Bread}: support = 2 (40%)
2 LARGE ITEMSET METHOD
Step 3
• Large 2-itemsets: {Spaghetti, Tomato Sauce}, {Spaghetti, Bread}, {Tomato Sauce, Bread}
• Candidate 3-itemsets
• {Spaghetti, Tomato Sauce, Bread}: support = 1 (20%)
• Large 3-itemsets: { } (no candidate reaches the minimum support)
2 LARGE ITEMSET METHOD
Step 4 • Association rules
Scheme: If subset then large k-itemset, with support s and confidence c
• s = (support of large k-itemset) / (total count of tuples)
• c = (support of large k-itemset) / (support of subset)
Example
• Total count of tuples = 5
• Large 2-itemset = {Spaghetti, Tomato Sauce}, support(Spaghetti, Tomato Sauce) = 2
• Subsets = { {Spaghetti}, {Tomato Sauce} }
• support(Spaghetti) = 3, support(Tomato Sauce) = 3
Scheme: If {Spaghetti} then {Spaghetti, Tomato Sauce}
Rule: If Spaghetti then Tomato Sauce
Support: s = 2 / 5 = 0.4 = 40%
Confidence: c = 2 / 3 ≈ 0.66 = 66%
2 LARGE ITEMSET METHOD
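To make the four steps concrete, here is a minimal Python sketch (not the authors' implementation). The five baskets are a reconstruction chosen so that every support count matches the example above; with minimum support 40%, the run reproduces the large itemsets and the rule "If Spaghetti then Tomato Sauce" with c = 66%.

```python
from itertools import combinations

# Baskets reconstructed from the stated supports (spaghetti 3, tomato
# sauce 3, bread 3, butter 1; each pair 2; the triple 1).
transactions = [
    {"spaghetti", "tomato sauce", "bread"},
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"tomato sauce", "bread"},
    {"butter"},
]
MIN_SUPPORT = 0.4  # 40%, as on the slides

def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise generation of large k-itemsets (simplified Apriori join).
items = sorted({i for t in transactions for i in t})
large = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]
all_large = []
k = 1
while large:
    all_large.extend(large)
    k += 1
    candidates = {a | b for a in large for b in large if len(a | b) == k}
    large = [c for c in candidates if support(c) >= MIN_SUPPORT]

# Rule generation: split each large itemset into antecedent -> remainder.
for itemset in all_large:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = support(itemset) / support(antecedent)
            print(f"If {set(antecedent)} then {set(itemset - antecedent)}: "
                  f"s={support(itemset):.0%}, c={confidence:.0%}")
```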
Type of analyzable data
• Attributes which can have more than two values
• Predefined set of condition attributes and decision attribute(s)
• Stored in a table, each row containing the values of the predefined attributes
3 ROUGH SET METHOD
Deriving association rules with rough sets
Step 1: Creating partitions over U
Partition: U divided into subsets (equivalence classes) induced by equivalence relations
3 ROUGH SET METHOD
Examples of equivalence relations:
R1 = {(u, v) | u and v have the same temperature}
R2 = {(u, v) | u and v have the same blood pressure}
R3 = {(u, v) | u and v have the same temperature and blood pressure}
R4 = {(u, v) | u and v have the same heart problem}
3 ROUGH SET METHOD
Partition R3* induced by equivalence relation R3 (based on condition attributes)
R3 = {(u, v) | u and v have the same temperature and blood pressure}
R3* = {X1, X2, X3} with X1 = {Adams, Brown}, X2 = {Ford}, X3 = {Gill, Bellows}
3 ROUGH SET METHOD
Partition R4* induced by equivalence relation R4 (based on decision attribute(s))
R4 = {(u, v) | u and v have the same heart problem}
R4* = {Y1, Y2} with Y1 = {Adams, Brown, Gill}, Y2 = {Ford, Bellows}
3 ROUGH SET METHOD
Step 2 • Defining the approximation space
• Overlaying the partitions created by the equivalence relations
• Result: 3 distinct regions in the approximation space
• Positive region: POS_S(Yj) = ∪ {Xi : Xi ⊆ Yj} = X1
• Boundary region: BND_S(Yj) = ∪ {Xi : Xi ∩ Yj ≠ ∅ and Xi ⊄ Yj} = X3
• Negative region: NEG_S(Yj) = ∪ {Xi : Xi ∩ Yj = ∅} = X2
3 ROUGH SET METHOD
• Rules from the positive region (POS_S(Yj) = ∪ {Xi : Xi ⊆ Yj})
• Example for POS_S(Y1)
• X1 = {Adams, Brown} ⊆ Y1 = {Adams, Brown, Gill}
• Clear rule (confidence 100%, support 40%):
• If temperature normal and blood pressure low then heart problem no
3 ROUGH SET METHOD
• Rules from the boundary region (BND_S(Yj) = ∪ {Xi : Xi ∩ Yj ≠ ∅ and Xi ⊄ Yj})
• Example for BND_S(Y1)
• X3 = {Gill, Bellows}, Y1 = {Adams, Brown, Gill}
• Possible rule (confidence < 100%, support 20%):
• If temperature high and blood pressure high then heart problem no
• Confidence: c = |Xi ∩ Yj| / |Xi| = |X3 ∩ Y1| / |X3| = 1 / 2 = 0.5 = 50%
3 ROUGH SET METHOD
• Negative region (NEG_S(Yj) = ∪ {Xi : Xi ∩ Yj = ∅})
• Example for NEG_S(Y1)
• X2 = {Ford}, Y1 = {Adams, Brown, Gill}
• Since X2 ∩ Y1 = ∅, no rule is derivable from the negative region
3 ROUGH SET METHOD
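The three regions are straightforward to compute. Below is a small Python sketch (not the authors' code) that rebuilds the partitions R3* and R4* from the patient table and derives POS, BND and NEG; only Ford's equivalence class is given above, so his condition values are an assumption for illustration.

```python
from collections import defaultdict

# (temperature, blood pressure, heart problem) per person, matching the
# classes X1, X2, X3 and Y1, Y2 above.
patients = {
    "Adams":   ("normal", "low",  "no"),
    "Brown":   ("normal", "low",  "no"),
    "Ford":    ("normal", "high", "yes"),   # condition values assumed
    "Gill":    ("high",   "high", "no"),
    "Bellows": ("high",   "high", "yes"),
}

def partition(key):
    """Equivalence classes of U: u ~ v iff key(u) == key(v)."""
    classes = defaultdict(set)
    for name, row in patients.items():
        classes[key(row)].add(name)
    return list(classes.values())

R3_star = partition(lambda r: (r[0], r[1]))   # condition attributes
R4_star = partition(lambda r: r[2])           # decision attribute

for Y in R4_star:
    pos = set().union(*(X for X in R3_star if X <= Y))
    bnd = set().union(*(X for X in R3_star if X & Y and not X <= Y))
    neg = set().union(*(X for X in R3_star if not (X & Y)))
    print(f"Y = {Y}: POS = {pos}, BND = {bnd}, NEG = {neg}")
```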
Reducts
Simplification of rules by removal of unnecessary attributes
Original rule: If temperature normal and blood pressure low then heart problem no
Simplified (more precise) rule: If blood pressure low then heart problem no
3 ROUGH SET METHOD
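A sketch of this rule simplification (not the authors' algorithm), reusing the `patients` table from the previous example: drop one condition at a time and keep the shorter rule whenever its confidence on the table is still 100%.

```python
def confidence(conditions, decision):
    """conditions: {attribute index: value}; decision: value at index 2."""
    rows = [r for r in patients.values()
            if all(r[i] == v for i, v in conditions.items())]
    return sum(r[2] == decision for r in rows) / len(rows) if rows else 0.0

rule = {0: "normal", 1: "low"}   # temperature normal AND blood pressure low
for drop in list(rule):
    reduced = {i: v for i, v in rule.items() if i != drop}
    if confidence(reduced, "no") == 1.0:
        # Here only dropping attribute 0 (temperature) keeps confidence 100%,
        # yielding the simplified rule "If blood pressure low then no".
        print(f"attribute {drop} is dispensable; reduced rule: {reduced}")
```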
COMPARISON OF THE PROCEDURES
Large itemsets: original data and bitmap

TID | Attributes
1   | spaghetti, tomato sauce
2   | spaghetti, bread

TID | spaghetti | tomato sauce | bread
1   | 1         | 1            | 0
2   | 1         | 0            | 1
3   | 0         | 0            | 1

Rough sets: universe = persons, condition attributes (e.g. blood pressure), decision attribute(s) (e.g. heart problem)

Person | blood pressure | ...
Adams  | low            | ...
Brown  | medium         | ...
Ford   | high           | ...

TID | bp_low | bp_med | bp_high | ...
1   | 1      | 0      | 0       | ...
2   | 0      | 1      | 0       | ...

Prerequisites for comparison of both methods
• Modification of the rough set method (RS-Rules): no fixed decision attribute required (RS-Rules+)
• Compatible data structure: bitmaps
4 DATA TRANSFORMATION
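A sketch of the bitmap transformation in Python (the encoding is assumed, not taken from the paper): each attribute/value pair becomes one boolean column, so market baskets and multi-valued tables share a single input structure.

```python
# One-hot encode a multi-valued table into a 0/1 bitmap.
rows = [{"bp": "low"}, {"bp": "medium"}, {"bp": "high"}]
columns = sorted({(a, v) for row in rows for a, v in row.items()})
bitmap = [[int(row.get(a) == v) for a, v in columns] for row in rows]
print(columns)  # [('bp', 'high'), ('bp', 'low'), ('bp', 'medium')]
print(bitmap)   # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```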
Benchmark data sets¹
• Car Evaluation Database: 1728 tuples, 25 bitmap attributes
• Mushroom Database: 8416 tuples, 12 original attributes selected, 68 bitmap attributes
• Adult: 32561 tuples, 12 original attributes selected, 61 bitmap attributes

Computing times²

Database        | Car Evaluation | Mushroom  | Adult
Minconfidence   | 10%            | 35%       | 17%
Minsupport      | 75%            | 90%       | 94%
Method          | RS+    Apr     | RS+   Apr | RS+   Apr
CPU time [min]  | 3.15   1.10    | 15    2   | 233   44

Results
• Largely identical rules for all examined tables
• Exception: reducts, where the rough set rules are of better quality (more precise rules)

¹ UCI Repository of Machine Learning Databases and Domain Theories (URL: ftp.ics.uci.edu/pub/machine-learning-databases)
² Algorithms written in Visual Basic 6.0, executed on a Win98 PC with an AMD K6-2/400 processor
5 COMPARISON OF THE PROCEDURES
6 HYBRID PROCEDURE Apriori+
Hybrid method Apriori+
• Based on Apriori
• Capable of extracting reducts
• Capable of deriving rules based on a predefined decision attribute
Comparison results (Apriori+ compared to RS-Rules+)
• Identical rules
[Chart: computing times in minutes]
6 HYBRID PROCEDURE Apriori+
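One way the two extensions could look in code; this is a hedged sketch (the paper's actual implementation is not shown, and the rule format and names here are assumptions): filter Apriori's rules to those concluding on the decision attribute, then discard any rule dominated by a shorter rule with a subset antecedent and at least equal confidence, mirroring the effect of reducts.

```python
def apriori_plus(rules, decision="heart_problem="):
    """Keep decision-attribute rules, then prune non-minimal ones."""
    rules = [r for r in rules if all(i.startswith(decision) for i in r["then"])]
    return [r for r in rules
            if not any(o["then"] == r["then"] and o["if"] < r["if"]
                       and o["conf"] >= r["conf"] for o in rules)]

rules = [
    {"if": frozenset({"temp=normal", "bp=low"}),
     "then": frozenset({"heart_problem=no"}), "conf": 1.0},
    {"if": frozenset({"bp=low"}),
     "then": frozenset({"heart_problem=no"}), "conf": 1.0},
]
print(apriori_plus(rules))  # only the shorter (reduct-style) rule survives
```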
• Creation of a compatible data type for both methods
• Comparison of both methods
• RS-Rules+ derived rules that were more precise (due to reducts) than those derived by Apriori
• Apriori+ derived the same rules as RS-Rules+
• Computing times in favor of the large itemset method
Conclusion: a combination of both original methods is the best solution
7 CONCLUSION
More interesting capabilities of rough sets
• Analysing dependencies between rules
• Analysing the impact of one particular condition attribute on the decision attribute(s)
Idea: enhancing the data mining capabilities of Apriori+ with these additional rough set features
Result: a powerful and efficient data mining application (?)
8 OUTLOOK
References
Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules in Large Databases. In: VLDB'94, 487–499. Morgan Kaufmann.
Düntsch, I. and Gediga, G. (1999). Rough Set Data Analysis.
Munakata, T. (1998). Rough Sets. In: Fundamentals of the New Artificial Intelligence, 140–182. New York: Springer-Verlag.