Fundamentals of Data Mining: Association Rules
Fernando Berzal (fberzal@decsai.ugr.es)

Motivation
Association mining searches for interesting relationships among the items in a given data set.
EXAMPLES
• Diapers and six-packs are bought together, especially on Thursday evenings (a myth?)
• Buying a digital camera first and a memory card later is a frequent (sequential) pattern
• …
Outline: Motivation • Definition • Discovery • Variations • Visualization • Extensions • Applications • ART • ATBAR

Motivation
MARKET BASKET ANALYSIS
The earliest form of association rule mining.
Applications: catalog design, store layout, cross-marketing…

Definition
Item
• In transactional databases: any of the items included in a transaction.
• In relational databases: an (attribute, value) pair.
k-itemset
• A set of k items.
Itemset support
• support(I) = P(I), i.e. the fraction of transactions that contain every item in I.

Definition
Association rule X ⇒ Y
• Support: support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)
• Confidence: confidence(X ⇒ Y) = support(X ∪ Y) / support(X) = P(Y|X)
NOTE: Both support and confidence are relative.

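A minimal sketch of the support and confidence definitions, assuming each transaction is represented as a set of item names. The helper names `support` and `confidence` and the toy baskets are illustrative and do not come from the slides.

```python
def support(itemset, transactions):
    """Relative support: fraction of transactions containing every item."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """confidence(X => Y) = support(X u Y) / support(X)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical market baskets, made up for this example.
baskets = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support({"diapers", "beer"}, baskets)          # 2 of 4 baskets
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)  # 2 of the 3 diaper baskets
```

Both values are fractions of the transaction count, which is one way to read the note that support and confidence are relative.
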
Discovery
Association rule mining
• Find all frequent itemsets.
• Generate strong association rules from the frequent itemsets.
Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.

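The second step can be sketched as follows, assuming the first step has already produced a complete table of frequent itemsets and their relative supports. The function name `strong_rules` and the support values are hypothetical.

```python
from itertools import combinations


def strong_rules(supports, min_confidence):
    """Generate rules X => Y from frequent itemsets.

    `supports` maps each frequent itemset (a frozenset) to its relative
    support and is assumed to contain every subset of every itemset in it.
    """
    rules = []
    for itemset, supp in supports.items():
        # Try every non-empty proper subset as the antecedent.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                conf = supp / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, supp, conf))
    return rules


# Hypothetical supports; every itemset here already meets minimum support.
example_supports = {
    frozenset("A"): 0.8,
    frozenset("B"): 0.8,
    frozenset("D"): 0.6,
    frozenset("AB"): 0.6,
    frozenset("AD"): 0.6,
}

rules_at_90 = strong_rules(example_supports, 0.9)
rules_at_70 = strong_rules(example_supports, 0.7)
```

With a 0.9 confidence threshold only D ⇒ A survives in this toy table, whereas a 0.7 threshold also admits A ⇒ B, B ⇒ A, and A ⇒ D.
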
Discovery
Apriori
Observation: All non-empty subsets of a frequent itemset must also be frequent.
Algorithm: Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates).
Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94

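The level-wise idea can be illustrated with a short, unoptimized sketch. It is not the implementation from the cited paper: it assumes in-memory transactions given as sets, and it counts candidates by brute force. The names `apriori`, `count_frequent`, and `baskets` are illustrative.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: relative support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count_frequent(candidates):
        # One pass over the data to count every candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = count_frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 1
    while level:
        prev = set(level)
        # Join: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: drop candidates that have a non-frequent k-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        level = count_frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical transactions, made up for this example.
baskets = [{"A", "B", "D"}, {"A", "B"}, {"A", "D"}, {"B", "C"}, {"A", "B", "D"}]
frequent_itemsets = apriori(baskets, 0.5)
```

In this toy run the candidate {A, B, D} is discarded without being counted, because its subset {B, D} is not frequent. That pruning is exactly what the observation above licenses.
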
Discovery
Apriori improvements (I)
• Reducing the number of candidates
  Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95
• Sampling
  Toivonen: "Sampling Large Databases for Association Rules", VLDB'96
  Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97
• Partitioning
  Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95

Discovery
Apriori improvements (II)
• Transaction reduction
  Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)
• Dynamic itemset counting
  Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC)
  Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)

Discovery
Apriori-like algorithm: TBAR (Tree-based association rule mining)
Berzal, Cubero, Sánchez & Serrano: "TBAR: An efficient method for association rule mining in relational databases", Data & Knowledge Engineering, 2001

Discovery: TBAR
[Figure: TBAR itemset tree with levels L1, L2, and L3. Each node is labeled with an item and its support count (e.g. A #7, B #9, C #7, D #8). Annotations: 7 instances with A (L1); 6 instances with AB, 5 instances with AD, 6 instances with BC (L2); 5 instances with ABD (L3).]

Discovery
An alternative to Apriori: Compress the database representing frequent items into a frequent-pattern tree (FP-tree)…
Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000

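To give a feel for the compression step, here is a rough sketch that only builds the prefix tree. It omits the header table, the node links, and the mining phase (FP-growth) described in the cited paper. The class `FPNode`, the function `build_fp_tree`, the alphabetical tie-breaking rule, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item plus a shared-prefix count."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # Second scan: insert each transaction, most frequent items first,
    # so that transactions sharing frequent items share a tree prefix.
    root = FPNode(None)
    for t in transactions:
        path = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # ties broken alphabetically
        )
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, frequent


# Hypothetical transactions, made up for this example.
baskets = [{"A", "B", "D"}, {"A", "B"}, {"A", "D"}, {"B", "C"}, {"A", "B", "D"}]
fp_root, fp_items = build_fp_tree(baskets, min_count=2)
```

Here the five transactions collapse into two branches under the root (A and B), with the non-frequent item C dropped before insertion.
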
Discovery
A challenge
When an itemset is frequent, all its subsets are also frequent, so a single long frequent itemset implies an exponential number of frequent sub-itemsets.
• Closed itemset C: There exists no proper super-itemset S such that support(S) = support(C).
• Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent.

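Both definitions can be checked directly on a table of frequent itemsets. The sketch below assumes absolute support counts (to avoid comparing floats) and a complete table of frequent itemsets. The function name `closed_and_maximal` and the counts are hypothetical.

```python
def closed_and_maximal(frequent):
    """Split frequent itemsets into closed and maximal ones.

    `frequent` maps frozensets to absolute support counts. Only frequent
    supersets need checking: a superset with the same support as a
    frequent itemset would itself be frequent.
    """
    closed, maximal = set(), set()
    for itemset, count in frequent.items():
        supersets = [s for s in frequent if itemset < s]  # proper supersets
        if not supersets:
            maximal.add(itemset)
        if all(frequent[s] != count for s in supersets):
            closed.add(itemset)
    return closed, maximal


# Hypothetical support counts (minimum count 2), made up for this example.
frequent_counts = {
    frozenset("A"): 4,
    frozenset("B"): 4,
    frozenset("D"): 3,
    frozenset("AB"): 3,
    frozenset("AD"): 3,
    frozenset("BD"): 2,
    frozenset("ABD"): 2,
}
closed_sets, maximal_sets = closed_and_maximal(frequent_counts)
```

In this table {D} is not closed because {A, D} has the same count, and {A, B, D} is the only maximal itemset. Every maximal itemset is also closed, so the maximal set is the more compact (but lossy) summary.
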
Variations
Based on the kinds of patterns to be mined:
• Frequent itemset mining (transactional and relational data)
• Sequential pattern mining (sequence data sets, e.g. bioinformatics)
• Structured pattern mining (structured data, e.g. graphs)

Variations
Based on the types of values handled:
• Boolean association rules
• Quantitative association rules
• Fuzzy association rules
Delgado, Marín, Sánchez & Vila: "Fuzzy association rules: General model and applications", IEEE Transactions on Fuzzy Systems, 2003

Variations
More options:
• Generalized association rules (a.k.a. multilevel association rules)
• Constraint-based association rule mining
• Incremental algorithms
• Top-k algorithms
• …
ICDM FIMI: Workshop on Frequent Itemset Mining Implementations
http://fimi.cs.helsinki.fi/

Visualization
Integrated into data mining tools to help users understand data mining results:
• Table-based approach, e.g. SAS Enterprise Miner, DBMiner…
• 2D matrix-based approach, e.g. SGI MineSet, DBMiner…
• Graph-based techniques, e.g. DBMiner ball graphs

Visualization: Tables
Visualization: Visual aids
Visualization: 2D Matrix
Visualization: Graphs

Visualization: VisAR
Based on parallel coordinates (Techapichetvanich & Datta, ADMA'2005)

Extensions
Confidence is not the best possible interestingness measure for rules.
e.g. A very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent:
X went to war ⇒ X did not serve in Vietnam (from the US Census)

Extensions
Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991)
P1 ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)·supp(C)
P2 ACC(A⇒C) monotonically increases with supp(A⇒C)
P3 ACC(A⇒C) monotonically decreases with supp(A) (or supp(C))

Extensions
Certainty factors…
• … satisfy Piatetsky-Shapiro's properties
• … are widely used in expert systems
• … are not symmetric (unlike interest/lift)
• … can substitute conviction when CF>0
Berzal, Blanco, Sánchez & Vila: "Measuring the accuracy and interest of association rules: A new framework", Intelligent Data Analysis, 2002

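The sketch below compares lift, conviction, and the certainty factor, assuming the usual textbook definitions in terms of the rule confidence and the consequent support. The exact formulation should be checked against the cited paper. The function names and the numbers are illustrative, not taken from the slides.

```python
def lift(conf, supp_c):
    """Interest/lift: conf(A => C) / supp(C); symmetric in A and C."""
    return conf / supp_c


def conviction(conf, supp_c):
    """(1 - supp(C)) / (1 - conf(A => C)); unbounded when conf = 1."""
    return float("inf") if conf == 1.0 else (1.0 - supp_c) / (1.0 - conf)


def certainty_factor(conf, supp_c):
    """Certainty factor of A => C, in [-1, 1]; 0 means independence."""
    if conf > supp_c:
        return (conf - supp_c) / (1.0 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


# Hypothetical rule with a very frequent consequent:
# 90% confidence looks strong, but the consequent alone holds 95% of the time.
cf_misleading = certainty_factor(0.90, 0.95)
lift_misleading = lift(0.90, 0.95)

# Hypothetical rule with positive dependence.
cf_positive = certainty_factor(0.90, 0.50)
conviction_positive = conviction(0.90, 0.50)
```

For the first rule the certainty factor is slightly negative even though the confidence is 0.9, which is the kind of situation the Vietnam example warns about. For the second rule, the certainty factor equals 1 − 1/conviction, which is one way to see why it can stand in for conviction when CF>0 while staying bounded.
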
Extensions
References:
• Hilderman & Hamilton: "Evaluation of interestingness measures for ranking discovered knowledge", PAKDD, 2001
• Tan, Kumar & Srivastava: "Selecting the right objective measure for association analysis", Information Systems, vol. 29, pp. 293-313, 2004
• Berzal, Cubero, Marín, Sánchez, Serrano & Vila: "Association rule evaluation for classification purposes", TAMIDA'2005

Applications
Two sample applications where association rules have been successful:
• Classification (ART)
• Anomaly detection (ATBAR)
Berzal, Cubero, Sánchez & Serrano: "ART: A hybrid classification model", Machine Learning Journal, 2004
Balderas, Berzal, Cubero, Eisman & Marín: "Discovering Hidden Association Rules", KDD'2005, Chicago, Illinois, USA

Classification
Classification models based on association rules
• Partial classification models, e.g. Bayardo
• "Associative" classification models, e.g. CBA (Liu et al.)
• Bayesian classifiers, e.g. LB (Meretakis et al.)
• Emerging patterns, e.g. CAEP (Dong et al.)
• Rule trees, e.g. Wang et al.
• Rules with exceptions, e.g. Liu et al.

Classification
GOAL
Simple, intelligible, and robust classification models obtained in an efficient and scalable way
MEANS
Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]

ART Classification Model
IDEA
Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.
ART = Association Rule Tree
KEY
Association rules + "else" branches
Hybrid between decision trees and decision lists

ART Classification Model
SPLICE
ART classification model
Example: ART vs. TDIDT

ART classification model
Final comments
Classification models
• Acceptable accuracy
• Reduced complexity
• Attribute interactions
• Robustness (noise & primary keys)
Classifier building method
• Efficient algorithm
• Good scalability properties
• Automatic parameter selection

Anomaly detection
It is often more interesting to find surprising non-frequent events than frequent ones.
EXAMPLES
• Abnormal network activity patterns in intrusion detection systems.
• Exceptions to "common" rules in medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…)
• …

Anomaly detection
Anomalous association rule
Confident rule representing homogeneous deviations from common behavior.

Anomaly detection
Anomalous association rule
• X usually implies Y (dominant rule): X ⇒ Y is frequent and confident
• When X does not imply Y, then it usually implies A (the anomaly): X ¬Y ⇒ A is confident
• X Y ⇒ ¬A is confident

Anomaly detection
• X ⇒ Y is the dominant rule
• X ⇒ A when ¬Y is the anomalous rule

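The conditions behind an anomalous rule can be tested by brute force on a small data set. The following sketch only illustrates the definition and is not the ATBAR algorithm. It assumes X is an itemset while Y and A are single items, and the name `is_anomalous`, the thresholds, and the data are made up.

```python
def is_anomalous(transactions, x, y, a, min_supp, min_conf):
    """Check whether 'X => A when not Y' is an anomalous rule."""
    x = frozenset(x)
    with_x = [t for t in transactions if x <= t]
    with_xy = [t for t in with_x if y in t]
    with_x_not_y = [t for t in with_x if y not in t]
    if not with_xy or not with_x_not_y:
        return False

    # X => Y must be frequent and confident (the dominant rule).
    dominant = (
        len(with_xy) / len(transactions) >= min_supp
        and len(with_xy) / len(with_x) >= min_conf
    )
    # X, not Y => A must be confident.
    anomaly = sum(a in t for t in with_x_not_y) / len(with_x_not_y) >= min_conf
    # X, Y => not A must be confident.
    exclusive = sum(a not in t for t in with_xy) / len(with_xy) >= min_conf
    return dominant and anomaly and exclusive


# Hypothetical data: x usually comes with y; otherwise it comes with a.
homogeneous = [{"x", "y"}] * 8 + [{"x", "a"}] * 2
# Same dominant rule, but the deviations are not homogeneous.
scattered = [{"x", "y"}] * 8 + [{"x", "a"}, {"x", "b"}]

found = is_anomalous(homogeneous, {"x"}, "y", "a", min_supp=0.5, min_conf=0.75)
not_found = is_anomalous(scattered, {"x"}, "y", "a", min_supp=0.5, min_conf=0.75)
```

The second data set shows why the deviations must be homogeneous: when the exceptions to X ⇒ Y do not agree on a common alternative, no anomalous rule is reported.
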
Anomaly detection
Suzuki et al.'s "Exception Rules"
X ⇒ Y is an association rule
X I ⇒ ¬Y is the exception rule
I is the "interacting" itemset
X ⇒ I is the reference rule
• Too many exceptions
• The "cause" needs to be present

Anomaly detection: ATBAR
Anomalous association rules
[Figure: ATBAR data structure after the first scan and the second scan. Nodes are labeled with an item and its support count (A #7, B #9, C #7, D #8; B #6 and D #5 below A). Node A is extended with A*, shown with A #7, AB #6, AC #4, AD #5, AE #3, AF #3. Legend: non-frequent.]
Anomaly detection: ATBAR
Anomalous association rules
[Figure: ATBAR data structure after the first scan and the second scan, with every first-level node extended (A #7 with A*, B #9 with B*, C #7 with C*, D #8 with D*).]

Anomaly detection: ATBAR
Anomalous association rules
Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR.

Anomaly detection: Results
Experiments on health-related datasets from the UCI Machine Learning Repository
• Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)
• Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)

Anomaly detection: Results
An example from the Census dataset:
if WORKCLASS: Local-gov
then CAPGAIN: [99999.0 , 99999.0] (7 out of 7) (the "anomaly")
when not CAPGAIN: [0.0 , 20051.0] (the usual consequent)

Anomaly detection: Results
• Anomalous association rules (a novel characterization of potentially interesting knowledge)
• An efficient algorithm for discovering anomalous association rules: ATBAR
• Some heuristics for filtering the discovered anomalous association rules