CS 590M Fall 2001: Security Issues in Data Mining

CS 590M Fall 2001: Security Issues in Data Mining Lecture 5: Association Rules, Sequential Associations

Why Association Rules? • Understand attributes, not entities • Discover relationships that • Show some dependency between attributes • Are “interesting” • Give an understanding of the data space

Formal Definition • Data: • Items={i1,…,in} • Transactions T={t1,…,tm} where ti = {ij1, …, ijk} • Support: Given AI, supp(A) = |{t  T | t  A}| / |T| • Goal: Find rules AB with support ≥ s and confidence ≥ c where: • A, B  I, A  B =  • s = supp(A  B), c = supp(A  B) / supp(A)

Sample: Market Basket

Types of associations • Machine-learning base: classification / decision rules • Entities independent, unordered • Find rules leading to target class • To get rule sets, re-run for all classes as targets • Market-basket • Collection of related entities with same key • Can be modeled as independent entities, sparse data • Sequential • Like market basket, but group by distance rather than same key

Historical Association Rule Learning • Decision tree converted to rules • ID3, as discussed in previous lecture • Direct production of decision rules • CN2, others • Problem: Algorithms don’t scale well to many practical problems

Database community contribution: Market Basket Association Rules • Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207--216, Washington, D.C., May 1993. • Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499

Database community contribution: Market Basket Association Rules • Practical problems often have sparse data • Many attributes, few items per transaction • Goal is typically search for high support • High support = broad impact • High confidence not crucial (as opposed to classification) • Very Large data sets(main-memory algorithms impractical)

A-Priori Algorithm • Observation: if A has support s, then • i  A, supp(i) ≥ s • Gives bottom-up algorithm • Find single items with support ≥ s • Just look at transaction subsets with those items for pairs • Recurse

A-Priori Algorithm • First, generate all large itemsets • Sets X  I such that supp(X) ≥ s (threshold) • Captures “supp(A  B) ≥ s” part of problem • Second, find high-confidence rules that are subsets of X • B = Xi , A = X-B • To find confidence, need supp(A)But A will be in all large itemsets – don’t need to go back to the database!

A-Priori Algorithm L1 = {large 1-itemsets}; for ( k = 2; Lk-1  ; k++ ) Ck = select p.i1, p.iY, …, p.ik-1, q.ik-1 from Lk-1 p, Lk-1 q where p.i1 = q.i1, …, p.ik-2 = q.ik-2  transactions t  T Ct = subset(Ck, t); // Candidates contained in t  candidates c  Ct: c.count++; Lk = {c  Ck | c.count  minsup} Answer = k Lk;

Frequent episodes for sequential associations • Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo: Discovering Frequent Episodes in Sequences. In First International Conference on Knowledge Discovery and Data Mining (KDD'95), 210 - 215, Montreal, Canada, August 1995. AAAI Press. • Instead of transaction, items grouped by sliding window in time • Same basic idea as A-Priori

Frequent Episodes: Definition • Event types E • Event (A,t) where A in E • Sequence S=((A1,t1),…,(An,tn)) • Frequent episode F = (Ai, …, Aj) where •  tl, tm such that t1tl<…<tm  tn tm-tl  window: • count( ((Ai,tl), …, (Aj, tm)) )  support

Applications/Issues in Security • Frequent episodes in intrusion detection data • What does this tell us? • Preventing the discovery of associations • Known items to protect • What if we don’t know what we want to protect?

CS 590M Fall 2001: Security Issues in Data Mining

CS 590M Fall 2001: Security Issues in Data Mining

Presentation Transcript

Data Mining with Clementine

Frequent Item Mining

CS490D: Introduction to Data Mining Prof. Walid Aref

DATA MINING Introductory and Advanced Topics Part I

Drug Safety Assessment and Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

DATA WAREHOUSING AND DATA MINING

Advanced Topics in Data Mining: Association Rules

DATA WAREHOUSING AND DATA MINING

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

CSE 538 Web Search and Mining Web Crawling

DATA MINING LECTURE 5

Data Mining Tools

CS490D: Introduction to Data Mining Prof. Chris Clifton

Data Mining with DB

Data Mining using Fractals and Power laws

Data Mining with CANape 9.0

Software Design CS 406 Software Engineering I Fall 2001