140 likes | 258 Views
CS 590M Fall 2001: Security Issues in Data Mining. Lecture 5: Association Rules, Sequential Associations. Why Association Rules?. Understand attributes , not entities Discover relationships that Show some dependency between attributes Are “interesting”
E N D
CS 590M Fall 2001: Security Issues in Data Mining Lecture 5: Association Rules, Sequential Associations
Why Association Rules? • Understand attributes, not entities • Discover relationships that • Show some dependency between attributes • Are “interesting” • Give an understanding of the data space
Formal Definition • Data: • Items={i1,…,in} • Transactions T={t1,…,tm} where ti = {ij1, …, ijk} • Support: Given AI, supp(A) = |{t T | t A}| / |T| • Goal: Find rules AB with support ≥ s and confidence ≥ c where: • A, B I, A B = • s = supp(A B), c = supp(A B) / supp(A)
Types of associations • Machine-learning base: classification / decision rules • Entities independent, unordered • Find rules leading to target class • To get rule sets, re-run for all classes as targets • Market-basket • Collection of related entities with same key • Can be modeled as independent entities, sparse data • Sequential • Like market basket, but group by distance rather than same key
Historical Association Rule Learning • Decision tree converted to rules • ID3, as discussed in previous lecture • Direct production of decision rules • CN2, others • Problem: Algorithms don’t scale well to many practical problems
Database community contribution: Market Basket Association Rules • Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207--216, Washington, D.C., May 1993. • Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
Database community contribution: Market Basket Association Rules • Practical problems often have sparse data • Many attributes, few items per transaction • Goal is typically search for high support • High support = broad impact • High confidence not crucial (as opposed to classification) • Very Large data sets(main-memory algorithms impractical)
A-Priori Algorithm • Observation: if A has support s, then • i A, supp(i) ≥ s • Gives bottom-up algorithm • Find single items with support ≥ s • Just look at transaction subsets with those items for pairs • Recurse
A-Priori Algorithm • First, generate all large itemsets • Sets X I such that supp(X) ≥ s (threshold) • Captures “supp(A B) ≥ s” part of problem • Second, find high-confidence rules that are subsets of X • B = Xi , A = X-B • To find confidence, need supp(A)But A will be in all large itemsets – don’t need to go back to the database!
A-Priori Algorithm L1 = {large 1-itemsets}; for ( k = 2; Lk-1 ; k++ ) Ck = select p.i1, p.iY, …, p.ik-1, q.ik-1 from Lk-1 p, Lk-1 q where p.i1 = q.i1, …, p.ik-2 = q.ik-2 transactions t T Ct = subset(Ck, t); // Candidates contained in t candidates c Ct: c.count++; Lk = {c Ck | c.count minsup} Answer = k Lk;
Frequent episodes for sequential associations • Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo: Discovering Frequent Episodes in Sequences. In First International Conference on Knowledge Discovery and Data Mining (KDD'95), 210 - 215, Montreal, Canada, August 1995. AAAI Press. • Instead of transaction, items grouped by sliding window in time • Same basic idea as A-Priori
Frequent Episodes: Definition • Event types E • Event (A,t) where A in E • Sequence S=((A1,t1),…,(An,tn)) • Frequent episode F = (Ai, …, Aj) where • tl, tm such that t1tl<…<tm tn tm-tl window: • count( ((Ai,tl), …, (Aj, tm)) ) support
Applications/Issues in Security • Frequent episodes in intrusion detection data • What does this tell us? • Preventing the discovery of associations • Known items to protect • What if we don’t know what we want to protect?