Fundamentos de Minería de Datos

Fundamentos de Minería de Datos (Fundamentals of Data Mining). Association rules. Fernando Berzal, fberzal@decsai.ugr.es. Motivation: Association mining searches for interesting relationships among items in a given data set. Examples: Diapers and six-packs are bought together, especially on Thursday evenings (a myth?).

Presentation Transcript


  1. Fundamentos de Minería de Datos (Fundamentals of Data Mining). Association rules. Fernando Berzal, fberzal@decsai.ugr.es

  2. Motivation. Association mining searches for interesting relationships among items in a given data set. EXAMPLES: • Diapers and six-packs are bought together, especially on Thursday evenings (a myth?) • A sequence such as first buying a digital camera and then a memory card is a frequent (sequential) pattern • … Outline: • Motivation • Definition • Discovery • Variations • Visualization • Extensions • Applications • ART • ATBAR

  3. Motivation: MARKET BASKET ANALYSIS. The earliest form of association rule mining. Applications: catalog design, store layout, cross-marketing…

  4. Definition. Item: • In transactional databases: any of the items included in a transaction. • In relational databases: an (attribute, value) pair. k-itemset: a set of k items. Itemset support: support(I) = P(I), i.e. the fraction of transactions that contain every item in I.

  5. Definition. Association rule X ⇒ Y • Support: support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y) • Confidence: confidence(X ⇒ Y) = support(X ∪ Y) / support(X) = P(Y|X). NOTE: Both support and confidence are relative (fractions of transactions, not absolute counts).
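
The two definitions above can be checked on a small example. The following is a minimal sketch, assuming transactions are stored as Python frozensets; the basket contents and the names `TRANSACTIONS`, `support` and `confidence` are hypothetical and do not come from the slides.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

# Hypothetical toy data set (not from the slides): five market baskets.
TRANSACTIONS: List[Transaction] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "beer"}),
]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Relative support: fraction of transactions containing every item."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """confidence(X => Y) = support(X u Y) / support(X).

    Assumes support(X) > 0; otherwise the ratio is undefined.
    """
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)


diapers_support = support({"diapers"}, TRANSACTIONS)
diapers_beer_confidence = confidence({"diapers"}, {"beer"}, TRANSACTIONS)
```

In this toy data set, diapers appear in three of the five baskets and two of those three also contain beer, so the rule diapers ⇒ beer has support 2/5 and confidence 2/3.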

  6. Discovery. Association rule mining: 1) Find all frequent itemsets. 2) Generate strong association rules from the frequent itemsets. Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
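
The second step, rule generation, can be sketched as follows. This is an illustration under stated assumptions, not the method used in the slides: the frequent itemsets are assumed to be already available in a dictionary that maps each itemset to its relative support, and the names `strong_rules`, `FREQUENT` and `RULES` as well as the numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
# (antecedent, consequent, support, confidence)
Rule = Tuple[Itemset, Itemset, float, float]


def strong_rules(frequent: Dict[Itemset, float],
                 min_confidence: float) -> List[Rule]:
    """Split every frequent itemset into antecedent => consequent.

    `frequent` must contain every frequent itemset, so that the support
    of any antecedent (a subset of a frequent itemset) can be looked up.
    The minimum support threshold is already satisfied by construction.
    """
    rules: List[Rule] = []
    for itemset, supp in frequent.items():
        # Antecedents are the non-empty proper subsets of the itemset.
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = supp / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Hypothetical supports, chosen only for illustration.
FREQUENT: Dict[Itemset, float] = {
    frozenset({"A"}): 0.8,
    frozenset({"B"}): 0.6,
    frozenset({"A", "B"}): 0.5,
}
RULES = strong_rules(FREQUENT, min_confidence=0.8)
```

With these numbers only B ⇒ A survives the confidence threshold (0.5 / 0.6 ≈ 0.83), whereas A ⇒ B reaches only 0.5 / 0.8 = 0.625.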

  7. Discovery: Apriori. Observation: All non-empty subsets of a frequent itemset must also be frequent. Algorithm: Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates). Reference: Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94
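
The level-wise search described above can be sketched in a few lines. This is a simplified, unoptimized illustration of the Apriori idea (join frequent k-itemsets, prune candidates that have an infrequent subset, count the survivors), not the implementation from the cited paper; the toy baskets and the names `apriori`, `TRANSACTIONS` and `FREQ` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[str]


def apriori(transactions: List[Itemset],
            min_support: float) -> Dict[Itemset, float]:
    """Return every frequent itemset with its relative support."""
    n = len(transactions)

    def frequent_only(candidates: Iterable[Itemset]) -> Dict[Itemset, float]:
        # One pass over the data per level: count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {frozenset([item]) for t in transactions for item in t}
    current = frequent_only(items)  # frequent 1-itemsets
    frequent: Dict[Itemset, float] = {}
    k = 1
    while current:
        frequent.update(current)
        candidates = set()
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            # Prune step: every k-subset of a candidate must be frequent.
            if len(union) == k + 1 and all(
                    frozenset(s) in current for s in combinations(union, k)):
                candidates.add(union)
        current = frequent_only(candidates)
        k += 1
    return frequent


# Hypothetical toy data set (not from the slides).
TRANSACTIONS: List[Itemset] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "beer"}),
]
FREQ = apriori(TRANSACTIONS, min_support=0.6)
```

With a minimum support of 0.6, the four single items and the pair {bread, milk} are frequent; no 3-itemset is even generated as a candidate, because a candidate requires all of its 2-subsets to be frequent.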

  8. Discovery: Apriori improvements (I) • Reducing the number of candidates: Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95 • Sampling: Toivonen: "Sampling Large Databases for Association Rules", VLDB'96; Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97 • Partitioning: Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95

  9. Discovery: Apriori improvements (II) • Transaction reduction: Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID) • Dynamic itemset counting: Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC); Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)

  10. Discovery. Apriori-like algorithm: TBAR (Tree-based association rule mining). Reference: Berzal, Cubero, Sánchez & Serrano: "TBAR: An efficient method for association rule mining in relational databases", Data & Knowledge Engineering, 2001

  11. Discovery: TBAR. [Figure: TBAR itemset tree with levels L1, L2 and L3; each node is labeled with an item and its support count, e.g. A #7, B #9, C #7, D #8.] Annotations: 7 instances with A (L1); 6 instances with AB, 5 instances with AD, 6 instances with BC (L2); 5 instances with ABD (L3).

  12. Discovery. An alternative to Apriori: Compress the database representing frequent items into a frequent-pattern tree (FP-tree)… Reference: Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000

  13. Discovery: A challenge. When an itemset is frequent, all its subsets are also frequent (a frequent k-itemset has 2^k − 1 non-empty subsets, all of them frequent). • Closed itemset C: There exists no proper super-itemset S such that support(S) = support(C). • Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent.
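
The two definitions above can be applied directly to a table of frequent itemsets. The sketch below assumes that absolute support counts are available for all frequent itemsets; the counts and the names `closed_and_maximal` and `FREQUENT_COUNTS` are hypothetical. Checking only frequent supersets is enough for closedness, because a superset with the same support as a frequent itemset is itself frequent.

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def closed_and_maximal(
        frequent: Dict[Itemset, int]) -> Tuple[List[Itemset], List[Itemset]]:
    """Return (closed frequent itemsets, maximal frequent itemsets)."""
    closed: List[Itemset] = []
    maximal: List[Itemset] = []
    for itemset, count in frequent.items():
        # Proper supersets that are themselves frequent.
        supersets = [s for s in frequent if itemset < s]
        if not supersets:
            maximal.append(itemset)
        # Closed: no proper superset has the same support.
        if all(frequent[s] < count for s in supersets):
            closed.append(itemset)
    return closed, maximal


# Hypothetical absolute support counts (minimum count 2), consistent with
# the transactions {A,B}, {A,B}, {A,B,C}, {A,C}, {C}.
FREQUENT_COUNTS: Dict[Itemset, int] = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"C"}): 3,
    frozenset({"A", "B"}): 3,
    frozenset({"A", "C"}): 2,
}
CLOSED, MAXIMAL = closed_and_maximal(FREQUENT_COUNTS)
```

Here {B} is not closed because {A, B} has the same count, and only {A, B} and {A, C} are maximal. Every maximal itemset is also closed, so the maximal itemsets give the most compact summary, while the closed itemsets additionally preserve the support of every frequent itemset.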

  14. Variations. Based on the kinds of patterns to be mined: • Frequent itemset mining (transactional and relational data) • Sequential pattern mining (sequence data sets, e.g. bioinformatics) • Structured pattern mining (structured data, e.g. graphs)

  15. Variations. Based on the types of values handled: • Boolean association rules • Quantitative association rules • Fuzzy association rules. Reference: Delgado, Marín, Sánchez & Vila: "Fuzzy association rules: General model and applications", IEEE Transactions on Fuzzy Systems, 2003

  16. Variations. More options: • Generalized association rules (a.k.a. multilevel association rules) • Constraint-based association rule mining • Incremental algorithms • Top-k algorithms • … See also: ICDM FIMI, Workshop on Frequent Itemset Mining Implementations, http://fimi.cs.helsinki.fi/

  17. Visualization. Integrated into data mining tools to help users understand data mining results: • Table-based approach, e.g. SAS Enterprise Miner, DBMiner… • 2D matrix-based approach, e.g. SGI MineSet, DBMiner… • Graph-based techniques, e.g. DBMiner ball graphs

  18. Visualization: Tables

  19. Visualization: Visual aids

  20. Visualization: 2D Matrix

  21. Visualization: Graphs

  22. Visualization: VisAR. Based on parallel coordinates (Techapichetvanich & Datta, ADMA'2005)

  23. Extensions. Confidence is not the best possible interestingness measure for rules, e.g. a very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent: "X went to war ⇒ X did not serve in Vietnam" (from the US Census).
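
A small numeric illustration makes the problem with confidence concrete. The supports below are made up for the example (they are not the Census figures), and lift, also known as interest, is used here only as one common alternative measure: lift(X ⇒ Y) = support(X ∪ Y) / (support(X) · support(Y)), which equals 1 when X and Y are independent.

```python
def confidence(supp_xy: float, supp_x: float) -> float:
    """confidence(X => Y) = support(X u Y) / support(X)."""
    return supp_xy / supp_x


def lift(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """lift(X => Y) = support(X u Y) / (support(X) * support(Y))."""
    return supp_xy / (supp_x * supp_y)


# Hypothetical supports: Y is a very frequent item (90% of transactions).
SUPP_X, SUPP_Y, SUPP_XY = 0.20, 0.90, 0.17

rule_confidence = confidence(SUPP_XY, SUPP_X)
rule_lift = lift(SUPP_XY, SUPP_X, SUPP_Y)
```

The rule looks strong by confidence (0.85), yet its lift is below 1: Y is actually slightly less likely when X is present (85%) than in the data set as a whole (90%).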

  24. Extensions. Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991): P1: ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)·supp(C). P2: ACC(A⇒C) monotonically increases with supp(A⇒C). P3: ACC(A⇒C) monotonically decreases with supp(A) (or supp(C)).
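
As a quick sanity check of the three properties, consider the simple difference supp(A⇒C) − supp(A)·supp(C), often called leverage and usually attributed to Piatetsky-Shapiro. The slides do not name this measure, so its use here is an assumption made for illustration, and the numeric supports are hypothetical.

```python
def leverage(supp_ac: float, supp_a: float, supp_c: float) -> float:
    """supp(A => C) - supp(A) * supp(C): zero under independence."""
    return supp_ac - supp_a * supp_c


# P1: zero when supp(A => C) equals supp(A) * supp(C).
p1_value = leverage(0.3 * 0.5, 0.3, 0.5)

# P2: grows with supp(A => C) when supp(A) and supp(C) are fixed.
p2_low, p2_high = leverage(0.18, 0.3, 0.5), leverage(0.20, 0.3, 0.5)

# P3: shrinks as supp(A) grows when the other two values are fixed.
p3_small_a, p3_large_a = leverage(0.20, 0.3, 0.5), leverage(0.20, 0.4, 0.5)
```

Each comparison varies one quantity while holding the other two fixed, which is how the monotonicity properties P2 and P3 are usually read.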

  25. Extensions. Certainty factors… • … satisfy Piatetsky-Shapiro's properties • … are widely used in expert systems • … are not symmetric (unlike interest/lift) • … can substitute conviction when CF > 0. Reference: Berzal, Blanco, Sánchez & Vila: "Measuring the accuracy and interest of association rules: A new framework", Intelligent Data Analysis, 2002
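
The slide does not spell out the formula, so the sketch below assumes the usual certainty-factor definition from the expert-systems literature: CF(A⇒C) = (conf − supp(C)) / (1 − supp(C)) when conf > supp(C), (conf − supp(C)) / supp(C) when conf < supp(C), and 0 otherwise. Conviction is taken as (1 − supp(C)) / (1 − conf). The function names and the numbers are hypothetical.

```python
import math


def certainty_factor(conf: float, supp_c: float) -> float:
    """Certainty factor of A => C from its confidence and supp(C)."""
    if conf > supp_c:
        return (conf - supp_c) / (1.0 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


def conviction(conf: float, supp_c: float) -> float:
    """Conviction of A => C; infinite when the rule never fails."""
    if conf == 1.0:
        return math.inf
    return (1.0 - supp_c) / (1.0 - conf)


# Hypothetical rule: confidence 0.8 with a consequent present in 50% of data.
cf_positive = certainty_factor(0.8, 0.5)
cf_negative = certainty_factor(0.3, 0.5)
cf_from_conviction = 1.0 - 1.0 / conviction(0.8, 0.5)
```

Under these definitions a positive certainty factor equals 1 − 1/conviction, which is one way to read the last bullet: both measures rank rules identically when CF > 0, but the certainty factor is bounded between −1 and 1.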

  26. Extensions. References: Hilderman & Hamilton: "Evaluation of interestingness measures for ranking discovered knowledge", PAKDD, 2001; Tan, Kumar & Srivastava: "Selecting the right objective measure for association analysis", Information Systems, vol. 29, pp. 293-313, 2004; Berzal, Cubero, Marín, Sánchez, Serrano & Vila: "Association rule evaluation for classification purposes", TAMIDA'2005

  27. Applications. Two sample applications where association rules have been successful: • Classification (ART) • Anomaly detection (ATBAR). References: Berzal, Cubero, Sánchez & Serrano: "ART: A hybrid classification model", Machine Learning Journal, 2004; Balderas, Berzal, Cubero, Eisman & Marín: "Discovering Hidden Association Rules", KDD'2005, Chicago, Illinois, USA

  28. Classification. Classification models based on association rules: • Partial classification models, e.g. Bayardo • "Associative" classification models, e.g. CBA (Liu et al.) • Bayesian classifiers, e.g. LB (Meretakis et al.) • Emerging patterns, e.g. CAEP (Dong et al.) • Rule trees, e.g. Wang et al. • Rules with exceptions, e.g. Liu et al.

  29. Classification. GOAL: Simple, intelligible, and robust classification models obtained in an efficient and scalable way. MEANS: Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]

  30. ART Classification Model. IDEA: Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model. ART = Association Rule Tree. KEY: Association rules + "else" branches. Hybrid between decision trees and decision lists.

  31. ART Classification Model: SPLICE

  32. ART classification model. Example: ART vs. TDIDT. [Figure: two panels labeled ART and TDIDT]

  33. ART classification model: Final comments. Classification models: • Acceptable accuracy • Reduced complexity • Attribute interactions • Robustness (noise & primary keys). Classifier building method: • Efficient algorithm • Good scalability properties • Automatic parameter selection

  34. Anomaly detection. It is often more interesting to find surprising non-frequent events than frequent ones. EXAMPLES: • Abnormal network activity patterns in intrusion detection systems. • Exceptions to "common" rules in Medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…) • …

  35. Anomaly detection. Anomalous association rule: Confident rule representing homogeneous deviations from common behavior.

  36. Anomaly detection: Anomalous association rule. • X usually implies Y (dominant rule): X ⇒ Y is frequent and confident. • When X does not imply Y, then it usually implies A (the Anomaly): X¬Y ⇒ A is confident and XY ⇒ ¬A is confident.

  37. Anomaly detection. X ⇒ Y is the dominant rule. X ⇒ A when ¬Y is the anomalous rule.
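
The conditions behind an anomalous association rule (X ⇒ Y frequent and confident, X¬Y ⇒ A confident, XY ⇒ ¬A confident) can be tested by brute force on a small data set. The sketch below is only an illustration of the definition, not the ATBAR algorithm; the function name `is_anomalous_rule`, the thresholds and the toy transactions are hypothetical.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def is_anomalous_rule(transactions: List[Transaction],
                      x: Iterable[str], y: Iterable[str], a: Iterable[str],
                      min_support: float, min_confidence: float) -> bool:
    """Check whether 'X => A when not Y' is an anomalous rule w.r.t. X => Y."""
    xs, ys, as_ = frozenset(x), frozenset(y), frozenset(a)
    n = len(transactions)
    with_x = [t for t in transactions if xs <= t]
    with_xy = [t for t in with_x if ys <= t]
    with_x_not_y = [t for t in with_x if not ys <= t]
    if not with_xy or not with_x_not_y:
        return False  # No dominant rule, or nothing deviates from it.

    # Dominant rule X => Y must be frequent and confident.
    dominant = (len(with_xy) / n >= min_support
                and len(with_xy) / len(with_x) >= min_confidence)
    # X and not Y => A must be confident.
    conf_anomaly = sum(1 for t in with_x_not_y if as_ <= t) / len(with_x_not_y)
    # X and Y => not A must be confident.
    conf_not_a = sum(1 for t in with_xy if not as_ <= t) / len(with_xy)
    return (dominant and conf_anomaly >= min_confidence
            and conf_not_a >= min_confidence)


# Hypothetical data: 8 transactions follow x => y, 2 deviate towards a.
TRANSACTIONS: List[Transaction] = (
    [frozenset({"x", "y"})] * 7
    + [frozenset({"x", "y", "a"})]
    + [frozenset({"x", "a"})] * 2
)
found = is_anomalous_rule(TRANSACTIONS, {"x"}, {"y"}, {"a"}, 0.5, 0.8)
not_found = is_anomalous_rule(TRANSACTIONS, {"x"}, {"y"}, {"b"}, 0.5, 0.8)
```

In this example x ⇒ y holds in 8 of 10 transactions, both deviating transactions contain a, and a is absent from 7 of the 8 transactions that follow the dominant rule, so "x ⇒ a when ¬y" qualifies at a confidence threshold of 0.8.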

  38. Anomaly detection. Suzuki et al.'s "Exception Rules": X ⇒ Y is an association rule. X ∧ I ⇒ ¬Y is the exception rule. I is the "interacting" itemset. X I is the reference rule. Drawbacks: • Too many exceptions • The "cause" needs to be present

  39. Anomaly detection: ATBAR. Anomalous association rules. [Figure: ATBAR itemset tree after the first scan and the second scan. Node labels include A #7, B #9, C #7, D #8, B #6 and D #5; the extended node A* holds the counts A #7, AB #6, AC #4, AD #5, AE #3 and AF #3, with the non-frequent ones marked.]

  40. Anomaly detection: ATBAR. Anomalous association rules. [Figure: ATBAR itemset tree after the first scan and the second scan. First-level nodes A #7, B #9, C #7 and D #8 carry the extended nodes A*, B*, C* and D*; deeper nodes include B #6, C #6, D #5 and D #7.]

  41. Anomaly detection: ATBAR. Anomalous association rules: Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR.

  42. Anomaly detection: Results. Experiments on health-related datasets from the UCI Machine Learning Repository: • Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules) • Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)

  43. Anomaly detection: Results. An example from the Census dataset: if WORKCLASS: Local-gov then CAPGAIN: [99999.0, 99999.0] (7 out of 7) when not CAPGAIN: [0.0, 20051.0]. Here CAPGAIN: [99999.0, 99999.0] is the "anomaly" and CAPGAIN: [0.0, 20051.0] is the usual consequent.

  44. Anomaly detection: Results • Anomalous association rules (novel characterization of potentially interesting knowledge) • An efficient algorithm for discovering anomalous association rules: ATBAR • Some heuristics for filtering the discovered anomalous association rules