Presentation Transcript


  1. Mining Association Rules with Rough Sets and Large Itemsets - A Comparative Study Daniel Delic, Hans-J. Lenz, Mattis Neiling Free University of Berlin Institute of Applied Computer Science Garystr. 21, D-14195 Berlin, Germany

  2. Task
  • Comparison of two different methods for the extraction of association rules:
  • the large itemset method (e.g. Apriori)
  • the rough set method
  Interesting questions
  • Are there any differences/similarities between the extracted rules?
  • If so: which method leads to better rules?
  • Could a combination of both procedures improve the quality of the derived rules?
  1 INTRODUCTION

  3. Introduction • Large Itemset Method • Rough Set Method • Comparison of the Procedures • Hybrid Procedure Apriori+ • Summary • Outlook • References

  4. LARGE ITEMSET METHOD

  5. Type of analyzable data
  • "Market basket data": attributes with boolean domains
  • Stored in a table → each row represents a market basket
  2 LARGE ITEMSET METHOD

  6. Step 1
  • Large k-itemset generation with Apriori
  • Minimum support 40%
  • Candidate 1-itemsets
  • Spaghetti → support = 3 ≙ 60%
  • Tomato Sauce → support = 3 ≙ 60%
  • Bread → support = 3 ≙ 60%
  • Butter → support = 1 ≙ 20%
  2 LARGE ITEMSET METHOD

  7. Step 2
  • Large 1-itemsets
  • Spaghetti
  • Tomato Sauce
  • Bread
  • Candidate 2-itemsets
  • {Spaghetti, Tomato Sauce} → support = 2 ≙ 40%
  • {Spaghetti, Bread} → support = 2 ≙ 40%
  • {Tomato Sauce, Bread} → support = 2 ≙ 40%
  2 LARGE ITEMSET METHOD

  8. Step 3
  • Large 2-itemsets
  • {Spaghetti, Tomato Sauce}
  • {Spaghetti, Bread}
  • {Tomato Sauce, Bread}
  • Candidate 3-itemsets
  • {Spaghetti, Tomato Sauce, Bread} → support = 1 ≙ 20%
  • Large 3-itemsets
  • { }
  2 LARGE ITEMSET METHOD
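Steps 1–3 can be condensed into one level-wise loop. The following is a minimal Python sketch of that generate-and-test scheme; it is not the paper's implementation (which was written in Visual Basic 6.0), and the function and variable names are illustrative:

```python
from itertools import combinations

def apriori_itemsets(baskets, min_support):
    """Level-wise large-itemset generation (steps 1-3).
    baskets: list of sets of items; min_support: a fraction, e.g. 0.4."""
    n = len(baskets)
    # Candidate 1-itemsets: every item occurring in some basket.
    candidates = [frozenset([item]) for item in {i for b in baskets for i in b}]
    large = {}  # large itemset -> absolute support count
    while candidates:
        # Count each candidate's support in one pass over the baskets.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        survivors = [c for c, cnt in counts.items() if cnt / n >= min_support]
        large.update({c: counts[c] for c in survivors})
        # Join step: merge large k-itemsets into (k+1)-candidates; prune
        # step: keep a candidate only if all of its k-subsets are large.
        k = len(survivors[0]) + 1 if survivors else 0
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == k
                           and all(frozenset(s) in large
                                   for s in combinations(a | b, k - 1))})
    return large
```

Run on the five market baskets of the example with min_support = 0.4, this reproduces the large 1- and 2-itemsets above and terminates after the empty large 3-itemset level.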

  9. Step 4
  • Association rules
  Scheme: if subset then large k-itemset, with support s and confidence c
  • s = (support of large k-itemset) / (total count of tuples)
  • c = (support of large k-itemset) / (support of subset)
  • Example
  • Total count of tuples = 5
  Large 2-itemset = {Spaghetti, Tomato Sauce}
  • Support(Spaghetti, Tomato Sauce) = 2
  Subsets = { {Spaghetti}, {Tomato Sauce} }
  • Support(Spaghetti) = 3
  • Support(Tomato Sauce) = 3
  Scheme: if {Spaghetti} then {Spaghetti, Tomato Sauce}
  Rule: if Spaghetti then Tomato Sauce
  Support: s = 2 / 5 = 0.4 ≙ 40%
  Confidence: c = 2 / 3 ≈ 0.66 ≙ 66%
  2 LARGE ITEMSET METHOD
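Step 4 maps directly onto the counts collected during itemset generation. A hedged sketch, reusing the hypothetical `apriori_itemsets` output (`large` maps each large itemset to its absolute support count):

```python
from itertools import combinations

def rules_from_itemsets(large, n, min_confidence):
    """Derive 'if subset then rest' rules from large itemsets (step 4).
    large: dict itemset -> support count; n: total count of tuples."""
    rules = []
    for itemset, cnt in large.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # Every subset of a large itemset is itself large, so its
                # support count is already available in `large`.
                confidence = cnt / large[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs),
                                  cnt / n, confidence))
    return rules
```

For {Spaghetti, Tomato Sauce} with count 2 and Support(Spaghetti) = 3, this yields the rule above with s = 2/5 = 40% and c = 2/3 ≈ 66%.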

  10. ROUGH SET METHOD

  11. Type of analyzable data
  • Attributes which can have more than two values
  • Predefined set of condition attributes and decision attribute(s)
  • Stored in a table → each row contains values of the predefined attributes
  3 ROUGH SET METHOD

  12. Deriving association rules with rough sets
  Step 1: Creating partitions over U
  Partition: U divided into disjoint subsets (equivalence classes), induced by equivalence relations
  3 ROUGH SET METHOD

  13. Examples of equivalence relations:
  R1 = {(u, v) | u and v have the same temperature}
  R2 = {(u, v) | u and v have the same blood pressure}
  R3 = {(u, v) | u and v have the same temperature and blood pressure}
  R4 = {(u, v) | u and v have the same heart problem}
  3 ROUGH SET METHOD

  14. Partition R3* induced by equivalence relation R3 (based on the condition attributes)
  R3 = {(u, v) | u and v have the same temperature and blood pressure}
  R3 → R3* = {X1, X2, X3} with X1 = {Adams, Brown}, X2 = {Ford}, X3 = {Gill, Bellows}
  3 ROUGH SET METHOD

  15. Partition R4* induced by equivalence relation R4 (based on the decision attribute(s))
  R4 = {(u, v) | u and v have the same heart problem}
  R4 → R4* = {Y1, Y2} with Y1 = {Adams, Brown, Gill}, Y2 = {Ford, Bellows}
  3 ROUGH SET METHOD
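Computing such partitions is a one-pass grouping. A minimal Python sketch; the patient table is reconstructed from the values stated on these slides, except that Ford's temperature is never given and is assumed here for illustration:

```python
def partition(rows, attributes):
    """Equivalence classes: two elements of the universe are equivalent
    iff they agree on all of the given attributes."""
    classes = {}
    for name, row in rows.items():
        key = tuple(row[a] for a in attributes)
        classes.setdefault(key, set()).add(name)
    return list(classes.values())

# Values taken from slides 14-18 where stated; Ford's temperature
# ("normal") is an assumption, chosen so that Ford forms its own class X2.
rows = {
    "Adams":   {"temperature": "normal", "blood pressure": "low",  "heart problem": "no"},
    "Brown":   {"temperature": "normal", "blood pressure": "low",  "heart problem": "no"},
    "Ford":    {"temperature": "normal", "blood pressure": "high", "heart problem": "yes"},
    "Gill":    {"temperature": "high",   "blood pressure": "high", "heart problem": "no"},
    "Bellows": {"temperature": "high",   "blood pressure": "high", "heart problem": "yes"},
}
r3_star = partition(rows, ["temperature", "blood pressure"])  # {X1, X2, X3}
r4_star = partition(rows, ["heart problem"])                  # {Y1, Y2}
```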

  16. Step 2
  • Defining the approximation space by overlapping the partitions created by the equivalence relations
  • Result: 3 distinct regions in the approximation space (here for Yj = Y1)
  • Positive region: POSS(Yj) = ∪ {Xi | Xi ⊆ Yj} = X1
  • Boundary region: BNDS(Yj) = ∪ {Xi | Xi ∩ Yj ≠ ∅ and Xi ⊄ Yj} = X3
  • Negative region: NEGS(Yj) = ∪ {Xi | Xi ∩ Yj = ∅} = X2
  3 ROUGH SET METHOD
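The three regions are plain set comparisons between each condition class Xi and a decision class Yj. A sketch under the same assumed table as before:

```python
def regions(condition_classes, decision_class):
    """Split the condition classes into positive, boundary and negative
    region with respect to one decision class Y (step 2)."""
    pos = [X for X in condition_classes if X <= decision_class]       # X ⊆ Y
    bnd = [X for X in condition_classes                               # X ∩ Y ≠ ∅, X ⊄ Y
           if X & decision_class and not X <= decision_class]
    neg = [X for X in condition_classes if not X & decision_class]    # X ∩ Y = ∅
    return pos, bnd, neg

y1 = {"Adams", "Brown", "Gill"}
pos, bnd, neg = regions(r3_star, y1)   # pos = [X1], bnd = [X3], neg = [X2]
```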

  17. Rules from the positive region (POSS(Yj) = ∪ {Xi | Xi ⊆ Yj})
  • Example for POSS(Y1)
  • X1 = {Adams, Brown} ⊆ Y1 = {Adams, Brown, Gill}
  → Clear rule (confidence 100%, support 40%):
  If temperature normal and blood pressure low then heart problem no
  3 ROUGH SET METHOD

  18. Rules from the boundary region (BNDS(Yj) = ∪ {Xi | Xi ∩ Yj ≠ ∅ and Xi ⊄ Yj})
  • Example for BNDS(Y1)
  • X3 = {Gill, Bellows} overlaps Y1 = {Adams, Brown, Gill} without being contained in it
  → Possible rule (confidence ?, support 20%):
  If temperature high and blood pressure high then heart problem no
  → Confidence: c = |Xi ∩ Yj| / |Xi| = |X3 ∩ Y1| / |X3| = 1 / 2 = 0.5 ≙ 50%
  3 ROUGH SET METHOD

  19. Negative region (NEGS(Yj) = ∪ {Xi | Xi ∩ Yj = ∅})
  • Example for NEGS(Y1)
  • X2 = {Ford}, Y1 = {Adams, Brown, Gill}
  → Since X2 ∩ Y1 = ∅, no rule is derivable from the negative region
  3 ROUGH SET METHOD
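Slides 17–19 collapse into one loop: every condition class that intersects the decision class yields a rule with confidence |Xi ∩ Yj| / |Xi|, and clear rules are the special case of confidence 100%. A hypothetical sketch, continuing the earlier names:

```python
def rough_rules(condition_classes, decision_class, n):
    """Clear rules from the positive region, possible rules from the
    boundary region; the negative region yields no rule."""
    rules = []
    for X in condition_classes:
        overlap = X & decision_class
        if not overlap:                      # negative region: skip
            continue
        confidence = len(overlap) / len(X)   # |Xi ∩ Yj| / |Xi|
        support = len(overlap) / n           # |Xi ∩ Yj| / |U|
        kind = "clear" if confidence == 1.0 else "possible"
        rules.append((X, kind, support, confidence))
    return rules

# rough_rules(r3_star, y1, 5) yields X1 as a clear rule (40%, 100%),
# X3 as a possible rule (20%, 50%), and skips X2.
```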

  20. Reducts
  → Simplification of rules by removal of unnecessary attributes
  Original rule:
  If temperature normal and blood pressure low then heart problem no
  Simplified (more precise) rule:
  If blood pressure low then heart problem no
  3 ROUGH SET METHOD
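One way to mimic the reduct idea in code is a greedy check: drop a condition attribute and keep the shorter rule only if it still classifies without error on the table. This is a simplification of proper reduct computation, sketched here against the same assumed table:

```python
def simplify_rule(rows, conditions, decision):
    """Greedily drop condition attributes while the rule stays exact.
    conditions: dict attribute -> value; decision: (attribute, value)."""
    dec_attr, dec_value = decision
    for attr in list(conditions):
        reduced = {a: v for a, v in conditions.items() if a != attr}
        matching = [u for u, row in rows.items()
                    if all(row[a] == v for a, v in reduced.items())]
        # Keep the shorter rule only if every matching row still has
        # the required decision value.
        if matching and all(rows[u][dec_attr] == dec_value for u in matching):
            conditions = reduced
    return conditions

# Returns {'blood pressure': 'low'} -- temperature proves unnecessary:
simplify_rule(rows, {"temperature": "normal", "blood pressure": "low"},
              ("heart problem", "no"))
```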

  21. COMPARISON OF THE PROCEDURES

  22. Prerequisites for comparison of both methods
  • Modification of the rough set method (RS-Rules) → no fixed decision attribute required (RS-Rules+)
  • Compatible data structure → bitmaps

  Large itemsets ("market basket" table):
  TID | Attributes
  1   | spaghetti, tomato sauce
  2   | spaghetti, bread

  Rough sets (universe: persons; cond. attributes: e.g. blood pressure; dec. attribute(s): heart problem):
  Person | blood pressure | ...
  Adams  | low            | ...
  Brown  | medium         | ...
  Ford   | high           | ...

  Bitmap representation of both:
  TID | spaghetti | tomato sauce | bread
  1   | 1         | 1            | 0
  2   | 1         | 0            | 1
  3   | 0         | 0            | 1

  TID | bp_low | bp_med | bp_high | ...
  1   | 1      | 0      | 0       | ...
  2   | 0      | 1      | 0       | ...

  4 DATA TRANSFORMATION
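The bitmap encoding itself is mechanical: each (attribute, value) pair observed in the data becomes one boolean column. A small sketch; the column naming (e.g. bp_low style) follows the slide, everything else is illustrative:

```python
def to_bitmap(rows, attributes):
    """One boolean column per (attribute, value) pair in the data."""
    columns = sorted({(a, row[a]) for row in rows.values() for a in attributes})
    return {name: {f"{a}_{v}": int(row[a] == v) for a, v in columns}
            for name, row in rows.items()}

# With the assumed patient table, to_bitmap(rows, ["blood pressure"])
# produces 0/1 columns such as 'blood pressure_low', 'blood pressure_high'.
```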

  23. Benchmark data sets¹
  • Car Evaluation Database: 1728 tuples, 25 bitmap attributes
  • Mushroom Database: 8416 tuples, 12 original attributes selected, 68 bitmap attributes
  • Adult: 32561 tuples, 12 original attributes selected, 61 bitmap attributes

  Computing times²
  Database       | Minconfidence | Minsupport | RS+ [min] | Apriori [min]
  Car Evaluation | 10%           | 75%        | 3.15      | 1.10
  Mushroom       | 35%           | 90%        | 15        | 2
  Adult          | 17%           | 94%        | 233       | 44

  • Results
  • Largely similar rules for all examined tables
  • Exception: reducts → quality of the rough set rules is better (more precise rules)

  ¹ UCI Repository of Machine Learning Databases and Domain Theories (URL: ftp.ics.uci.edu/pub/machine-learning-databases)
  ² Algorithms written in Visual Basic 6.0, executed on a Win98 PC with an AMD K6-2/400 processor
  5 COMPARISON OF THE PROCEDURES

  24. HYBRID PROCEDURE Apriori+

  25. [Chart: computing times in minutes]
  • Hybrid method Apriori+
  • based on Apriori
  • capable of extracting reducts
  • capable of deriving rules based on a predefined decision attribute
  • Comparison results (Apriori+ compared to RS-Rules+)
  • identical rules
  6 HYBRID PROCEDURE Apriori+

  26. SUMMARY

  27. • Creation of a compatible data type for both methods
  • Comparison of both methods
  • RS-Rules+ derived rules that were more precise (due to reducts) than those derived by Apriori
  • Apriori+ derived the same rules as RS-Rules+
  • Computing times are in favor of the large itemset method
  Conclusion: a combination of both original methods is the best solution
  7 CONCLUSION

  28. OUTLOOK

  29. More interesting capabilities of rough sets
  • Analyzing dependencies between rules
  • Analyzing the impact of one particular condition attribute on the decision attribute(s)
  Idea: enhancing the data mining capabilities of Apriori+ with these further rough set features
  → Result: a powerful and efficient data mining application (?)
  8 OUTLOOK

  30. References
  Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules in Large Databases. In: VLDB '94, 487–499. Morgan Kaufmann.
  Düntsch, I. and Gediga, G. (1999). Rough Set Data Analysis.
  Munakata, T. (1998). Rough Sets. In: Fundamentals of the New Artificial Intelligence, 140–182. New York: Springer-Verlag.
