Mining Association Rules with Rough Sets and Large Itemsets - A Comparative Study
Daniel Delic, Hans-J. Lenz, Mattis Neiling
Free University of Berlin, Institute of Applied Computer Science, Garystr. 21, D-14195 Berlin, Germany
Two different methods for the extraction of association rules
• Large itemset method (e.g. Apriori)
• Rough set method
Task
• Comparison of both methods
Interesting questions
• Are there any differences/similarities between the extracted rules?
• If so:
• Which method leads to better rules?
• Could a combination of both procedures improve the quality of the derived rules?
1 INTRODUCTION
Introduction • Large Itemset Method • Rough Set Method • Comparison of the Procedures • Hybrid Procedure Apriori+ • Summary • Outlook • References
2 LARGE ITEMSET METHOD
Type of analyzable data
• "Market basket data": attributes with boolean domains
• Stored in a table, each row representing a market basket
2 LARGE ITEMSET METHOD
Step 1 • Large k-itemset generation with Apriori, minimum support 40%
• Candidate 1-itemsets
• Spaghetti: support = 3 (60%)
• Tomato Sauce: support = 3 (60%)
• Bread: support = 3 (60%)
• Butter: support = 1 (20%)
2 LARGE ITEMSET METHOD
Step 2
• Large 1-itemsets: Spaghetti, Tomato Sauce, Bread
• Candidate 2-itemsets
• {Spaghetti, Tomato Sauce}: support = 2 (40%)
• {Spaghetti, Bread}: support = 2 (40%)
• {Tomato Sauce, Bread}: support = 2 (40%)
2 LARGE ITEMSET METHOD
Step 3
• Large 2-itemsets: {Spaghetti, Tomato Sauce}, {Spaghetti, Bread}, {Tomato Sauce, Bread}
• Candidate 3-itemsets
• {Spaghetti, Tomato Sauce, Bread}: support = 1 (20%)
• Large 3-itemsets: { } (no candidate reaches the minimum support)
2 LARGE ITEMSET METHOD
Step 4 • Association rules
Scheme: If subset then large k-itemset, with support s and confidence c
• s = (support of large k-itemset) / (total count of tuples)
• c = (support of large k-itemset) / (support of subset)
Example
• Total count of tuples = 5
• Large 2-itemset = {Spaghetti, Tomato Sauce}, support(Spaghetti, Tomato Sauce) = 2
• Subsets = { {Spaghetti}, {Tomato Sauce} }
• support(Spaghetti) = 3, support(Tomato Sauce) = 3
Scheme: If {Spaghetti} then {Spaghetti, Tomato Sauce}
Rule: If Spaghetti then Tomato Sauce
Support: s = 2 / 5 = 0.4 = 40%
Confidence: c = 2 / 3 ≈ 0.66 = 66%
2 LARGE ITEMSET METHOD
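To make the four steps concrete, here is a minimal Python sketch (not the authors' implementation). The five baskets are a reconstruction chosen so that every support count matches the example above; with minimum support 40%, the run reproduces the large itemsets and the rule "If Spaghetti then Tomato Sauce" with c = 66%.

```python
from itertools import combinations

# Baskets reconstructed from the stated supports (spaghetti 3, tomato
# sauce 3, bread 3, butter 1; each pair 2; the triple 1).
transactions = [
    {"spaghetti", "tomato sauce", "bread"},
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"tomato sauce", "bread"},
    {"butter"},
]
MIN_SUPPORT = 0.4  # 40%, as on the slides

def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise generation of large k-itemsets (simplified Apriori join).
items = sorted({i for t in transactions for i in t})
large = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]
all_large = []
k = 1
while large:
    all_large.extend(large)
    k += 1
    candidates = {a | b for a in large for b in large if len(a | b) == k}
    large = [c for c in candidates if support(c) >= MIN_SUPPORT]

# Rule generation: split each large itemset into antecedent -> remainder.
for itemset in all_large:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = support(itemset) / support(antecedent)
            print(f"If {set(antecedent)} then {set(itemset - antecedent)}: "
                  f"s={support(itemset):.0%}, c={confidence:.0%}")
```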
Type of analyzable data
• Attributes which can have more than two values
• Predefined set of condition attributes and decision attribute(s)
• Stored in a table, each row containing the values of the predefined attributes
3 ROUGH SET METHOD
Deriving association rules with rough sets
Step 1: Creating partitions over U
Partition: U divided into subsets (equivalence classes) induced by equivalence relations
3 ROUGH SET METHOD
Examples of equivalence relations:
R1 = {(u, v) | u and v have the same temperature}
R2 = {(u, v) | u and v have the same blood pressure}
R3 = {(u, v) | u and v have the same temperature and blood pressure}
R4 = {(u, v) | u and v have the same heart problem}
3 ROUGH SET METHOD
Partition R3* induced by equivalence relation R3 (based on condition attributes)
R3 = {(u, v) | u and v have the same temperature and blood pressure}
R3* = {X1, X2, X3} with X1 = {Adams, Brown}, X2 = {Ford}, X3 = {Gill, Bellows}
3 ROUGH SET METHOD
Partition R4* induced by equivalence relation R4 (based on decision attribute(s))
R4 = {(u, v) | u and v have the same heart problem}
R4* = {Y1, Y2} with Y1 = {Adams, Brown, Gill}, Y2 = {Ford, Bellows}
3 ROUGH SET METHOD
Step 2 • Defining the approximation space
• Overlaying the partitions created by the equivalence relations
• Result: 3 distinct regions in the approximation space
• Positive region: POS_S(Yj) = ∪ {Xi : Xi ⊆ Yj} = X1
• Boundary region: BND_S(Yj) = ∪ {Xi : Xi ∩ Yj ≠ ∅ and Xi ⊄ Yj} = X3
• Negative region: NEG_S(Yj) = ∪ {Xi : Xi ∩ Yj = ∅} = X2
3 ROUGH SET METHOD
• Rules from the positive region (POS_S(Yj) = ∪ {Xi : Xi ⊆ Yj})
• Example for POS_S(Y1)
• X1 = {Adams, Brown} ⊆ Y1 = {Adams, Brown, Gill}
• Clear rule (confidence 100%, support 40%):
• If temperature normal and blood pressure low then heart problem no
3 ROUGH SET METHOD
• Rules from the boundary region (BND_S(Yj) = ∪ {Xi : Xi ∩ Yj ≠ ∅ and Xi ⊄ Yj})
• Example for BND_S(Y1)
• X3 = {Gill, Bellows}, Y1 = {Adams, Brown, Gill}
• Possible rule (confidence < 100%, support 20%):
• If temperature high and blood pressure high then heart problem no
• Confidence: c = |Xi ∩ Yj| / |Xi| = |X3 ∩ Y1| / |X3| = 1 / 2 = 0.5 = 50%
3 ROUGH SET METHOD
• Negative region (NEG_S(Yj) = ∪ {Xi : Xi ∩ Yj = ∅})
• Example for NEG_S(Y1)
• X2 = {Ford}, Y1 = {Adams, Brown, Gill}
• Since X2 ∩ Y1 = ∅, no rule is derivable from the negative region
3 ROUGH SET METHOD
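The three regions are straightforward to compute. Below is a small Python sketch (not the authors' code) that rebuilds the partitions R3* and R4* from the patient table and derives POS, BND and NEG; only Ford's equivalence class is given above, so his condition values are an assumption for illustration.

```python
from collections import defaultdict

# (temperature, blood pressure, heart problem) per person, matching the
# classes X1, X2, X3 and Y1, Y2 above.
patients = {
    "Adams":   ("normal", "low",  "no"),
    "Brown":   ("normal", "low",  "no"),
    "Ford":    ("normal", "high", "yes"),   # condition values assumed
    "Gill":    ("high",   "high", "no"),
    "Bellows": ("high",   "high", "yes"),
}

def partition(key):
    """Equivalence classes of U: u ~ v iff key(u) == key(v)."""
    classes = defaultdict(set)
    for name, row in patients.items():
        classes[key(row)].add(name)
    return list(classes.values())

R3_star = partition(lambda r: (r[0], r[1]))   # condition attributes
R4_star = partition(lambda r: r[2])           # decision attribute

for Y in R4_star:
    pos = set().union(*(X for X in R3_star if X <= Y))
    bnd = set().union(*(X for X in R3_star if X & Y and not X <= Y))
    neg = set().union(*(X for X in R3_star if not (X & Y)))
    print(f"Y = {Y}: POS = {pos}, BND = {bnd}, NEG = {neg}")
```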
Reducts
Simplification of rules by removal of unnecessary attributes
Original rule: If temperature normal and blood pressure low then heart problem no
Simplified (more precise) rule: If blood pressure low then heart problem no
3 ROUGH SET METHOD
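A sketch of this rule simplification (not the authors' algorithm), reusing the `patients` table from the previous example: drop one condition at a time and keep the shorter rule whenever its confidence on the table is still 100%.

```python
def confidence(conditions, decision):
    """conditions: {attribute index: value}; decision: value at index 2."""
    rows = [r for r in patients.values()
            if all(r[i] == v for i, v in conditions.items())]
    return sum(r[2] == decision for r in rows) / len(rows) if rows else 0.0

rule = {0: "normal", 1: "low"}   # temperature normal AND blood pressure low
for drop in list(rule):
    reduced = {i: v for i, v in rule.items() if i != drop}
    if confidence(reduced, "no") == 1.0:
        # Here only dropping attribute 0 (temperature) keeps confidence 100%,
        # yielding the simplified rule "If blood pressure low then no".
        print(f"attribute {drop} is dispensable; reduced rule: {reduced}")
```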
COMPARISON OF THE PROCEDURES
Large itemsets: original data and bitmap

TID | Attributes
1   | spaghetti, tomato sauce
2   | spaghetti, bread

TID | spaghetti | tomato sauce | bread
1   | 1         | 1            | 0
2   | 1         | 0            | 1
3   | 0         | 0            | 1

Rough sets: universe = persons, condition attributes (e.g. blood pressure), decision attribute(s) (e.g. heart problem)

Person | blood pressure | ...
Adams  | low            | ...
Brown  | medium         | ...
Ford   | high           | ...

TID | bp_low | bp_med | bp_high | ...
1   | 1      | 0      | 0       | ...
2   | 0      | 1      | 0       | ...

Prerequisites for comparison of both methods
• Modification of the rough set method (RS-Rules): no fixed decision attribute required (RS-Rules+)
• Compatible data structure: bitmaps
4 DATA TRANSFORMATION
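A sketch of the bitmap transformation in Python (the encoding is assumed, not taken from the paper): each attribute/value pair becomes one boolean column, so market baskets and multi-valued tables share a single input structure.

```python
# One-hot encode a multi-valued table into a 0/1 bitmap.
rows = [{"bp": "low"}, {"bp": "medium"}, {"bp": "high"}]
columns = sorted({(a, v) for row in rows for a, v in row.items()})
bitmap = [[int(row.get(a) == v) for a, v in columns] for row in rows]
print(columns)  # [('bp', 'high'), ('bp', 'low'), ('bp', 'medium')]
print(bitmap)   # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```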
Benchmark data sets¹
• Car Evaluation Database: 1728 tuples, 25 bitmap attributes
• Mushroom Database: 8416 tuples, 12 original attributes selected, 68 bitmap attributes
• Adult: 32561 tuples, 12 original attributes selected, 61 bitmap attributes

Computing times²

Database        | Car Evaluation | Mushroom  | Adult
Minconfidence   | 10%            | 35%       | 17%
Minsupport      | 75%            | 90%       | 94%
Method          | RS+    Apr     | RS+   Apr | RS+   Apr
CPU time [min]  | 3.15   1.10    | 15    2   | 233   44

Results
• Largely identical rules for all examined tables
• Exception: reducts, where the rough set rules are of better quality (more precise rules)

¹ UCI Repository of Machine Learning Databases and Domain Theories (URL: ftp.ics.uci.edu/pub/machine-learning-databases)
² Algorithms written in Visual Basic 6.0, executed on a Win98 PC with an AMD K6-2/400 processor
5 COMPARISON OF THE PROCEDURES
6 HYBRID PROCEDURE Apriori+
Hybrid method Apriori+
• Based on Apriori
• Capable of extracting reducts
• Capable of deriving rules based on a predefined decision attribute
Comparison results (Apriori+ compared to RS-Rules+)
• Identical rules
[Chart: computing times in minutes]
6 HYBRID PROCEDURE Apriori+
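One way the two extensions could look in code; this is a hedged sketch (the paper's actual implementation is not shown, and the rule format and names here are assumptions): filter Apriori's rules to those concluding on the decision attribute, then discard any rule dominated by a shorter rule with a subset antecedent and at least equal confidence, mirroring the effect of reducts.

```python
def apriori_plus(rules, decision="heart_problem="):
    """Keep decision-attribute rules, then prune non-minimal ones."""
    rules = [r for r in rules if all(i.startswith(decision) for i in r["then"])]
    return [r for r in rules
            if not any(o["then"] == r["then"] and o["if"] < r["if"]
                       and o["conf"] >= r["conf"] for o in rules)]

rules = [
    {"if": frozenset({"temp=normal", "bp=low"}),
     "then": frozenset({"heart_problem=no"}), "conf": 1.0},
    {"if": frozenset({"bp=low"}),
     "then": frozenset({"heart_problem=no"}), "conf": 1.0},
]
print(apriori_plus(rules))  # only the shorter (reduct-style) rule survives
```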
• Creation of a compatible data type for both methods
• Comparison of both methods
• RS-Rules+ derived rules that were more precise (due to reducts) than those derived by Apriori
• Apriori+ derived the same rules as RS-Rules+
• Computing times in favor of the large itemset method
Conclusion: a combination of both original methods is the best solution
7 CONCLUSION
More interesting capabilities of rough sets
• Analysing dependencies between rules
• Analysing the impact of one particular condition attribute on the decision attribute(s)
Idea: enhancing the data mining capabilities of Apriori+ with these additional rough set features
Result: a powerful and efficient data mining application (?)
8 OUTLOOK
References
Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules in Large Databases. In: VLDB'94, 487–499. Morgan Kaufmann.
Düntsch, I. and Gediga, G. (1999). Rough Set Data Analysis.
Munakata, T. (1998). Rough Sets. In: Fundamentals of the New Artificial Intelligence, 140–182. New York: Springer-Verlag.