Learn about association analysis and association rule mining, and how to find rules that predict the occurrence of items based on other items in transactions. Explore examples and understand the concepts of support and confidence in rule evaluation. Discover strategies for efficient frequent itemset generation.
Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. • Market-basket transactions (transaction table not reproduced here). • Example of Association Rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. • Implication means co-occurrence, not causality!
Items and transactions • Let: • I = {i1, i2, …, id} be the set of all items in a market-basket data set and • T = {t1, t2, …, tN} be the set of all transactions. • Each transaction ti contains a subset of items chosen from I. • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Transaction width • The number of items present in a transaction. • A transaction tj is said to contain an itemset X if X is a subset of tj. • E.g., the second transaction contains the itemset {Bread, Diaper} but not {Bread, Milk}.
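A minimal Python sketch of these definitions. The transaction contents below are assumed to match the market-basket table used throughout these slides (the table itself is not reproduced in this transcript), and the helper name contains() is purely illustrative:

# Transactions represented as sets of items (assumed contents).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# I = the set of all items appearing in any transaction
items = set().union(*transactions)

def contains(transaction, itemset):
    # A transaction t contains itemset X iff X is a subset of t.
    return set(itemset) <= transaction

print(contains(transactions[1], {"Bread", "Diaper"}))  # True
print(contains(transactions[1], {"Bread", "Milk"}))    # False
print(len(transactions[1]))                            # transaction width = 4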
Definition: Frequent Itemset • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 = σ/N • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold
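A short sketch of support count and support, reusing the transactions list from the previous sketch (the function names are illustrative):

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain itemset X
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    # s(X) = sigma(X) / N: fraction of transactions containing X
    return support_count(itemset, transactions) / len(transactions)

sigma = support_count({"Milk", "Bread", "Diaper"}, transactions)
s = support({"Milk", "Bread", "Diaper"}, transactions)
print(sigma, s)    # 2  0.4  (i.e. 2/5)
print(s >= 0.4)    # True: frequent with respect to minsup = 0.4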
Rule Measure: Support and Confidence • Find all the rules X ∧ Y ⇒ Z with minimum confidence and support • support, s: probability that a transaction contains {X ∪ Y ∪ Z} • confidence, c: conditional probability that a transaction containing {X ∪ Y} also contains Z • With minimum support 50% and minimum confidence 50%, we have • A ⇒ C (support 50%, confidence 66.6%) • C ⇒ A (support 50%, confidence 100%)
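A sketch of these two measures. The slide's original transaction table is not shown here, so the baskets below are a small hypothetical list chosen so the numbers match the slide (A ⇒ C: 50%, 66.6%; C ⇒ A: 50%, 100%):

baskets = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]  # hypothetical
N = len(baskets)

def sup(itemset):
    return sum(1 for t in baskets if itemset <= t) / N

def rule_measures(lhs, rhs):
    # support: P(lhs U rhs); confidence: P(rhs | lhs) = s(lhs U rhs) / s(lhs)
    s = sup(lhs | rhs)
    c = s / sup(lhs)
    return s, c

print(rule_measures({"A"}, {"C"}))  # (0.5, 0.666...)
print(rule_measures({"C"}, {"A"}))  # (0.5, 1.0)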
Definition: Association Rule • Association Rule • An implication expression of the form X → Y, where X and Y are itemsets • Example: {Milk, Diaper} → {Beer} • Rule Evaluation Metrics (X → Y) • Support (s) • Fraction of transactions that contain both X and Y • Confidence (c) • Measures how often items in Y appear in transactions that contain X
Mining Association Rules - An Example • For rule A ⇒ C: • Support = P(A ∪ C) = 50% • Confidence = P(C|A) = 66.6% • So, rule A ⇒ C is acceptable because it satisfies both the minimum support and minimum confidence thresholds.
Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds ⇒ Computationally prohibitive!
Mining Association Rules • Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67), {Milk, Beer} → {Diaper} (s=0.4, c=1.0), {Diaper, Beer} → {Milk} (s=0.4, c=0.67), {Beer} → {Milk, Diaper} (s=0.4, c=0.67), {Diaper} → {Milk, Beer} (s=0.4, c=0.5), {Milk} → {Diaper, Beer} (s=0.4, c=0.5) • Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements. If the itemset is infrequent, then all six candidate rules can be pruned immediately without us having to compute their confidence values.
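A sketch that enumerates all binary partitions of {Milk, Diaper, Beer} and computes their measures, reusing the transactions list and support() helper from the earlier sketches. It reproduces the shared support of 0.4 and the differing confidences listed above:

from itertools import combinations

itemset = frozenset({"Milk", "Diaper", "Beer"})

for r in range(1, len(itemset)):
    for lhs in map(frozenset, combinations(itemset, r)):
        rhs = itemset - lhs
        s = support(itemset, transactions)   # identical for all six rules
        c = s / support(lhs, transactions)   # depends on the antecedent
        print(f"{set(lhs)} -> {set(rhs)}: s={s:.1f}, c={c:.2f}")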
Mining Association Rules • Two-step approach: • Frequent Itemset Generation • Generate all itemsets whose support ≥ minsup (these itemsets are called frequent itemsets) • Rule Generation • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (these rules are called strong rules) • The computational requirements for frequent itemset generation are more expensive than those of rule generation. • We focus first on frequent itemset generation.
Frequent Itemset Generation • Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation • Brute-force approach: • Each itemset in the lattice is a candidate frequent itemset • Count the support of each candidate by scanning the database • Match each transaction against every candidate • Complexity ~ O(NMw) ⇒ expensive, since M = 2^d! • w is the maximum transaction width.
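A brute-force counting sketch along these lines, enumerating every non-empty candidate itemset and matching it against every transaction (it assumes the transactions list from the earlier sketch; the function name is illustrative). It is only workable for a handful of items:

from itertools import combinations

def brute_force_counts(transactions):
    items = sorted(set().union(*transactions))   # d unique items
    counts = {}
    # M = 2^d - 1 non-empty candidate itemsets
    for k in range(1, len(items) + 1):
        for candidate in map(frozenset, combinations(items, k)):
            # match every transaction against every candidate: O(N * M * w)
            counts[candidate] = sum(1 for t in transactions if candidate <= t)
    return counts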
Computational Complexity • Given d unique items: • Total number of itemsets = 2^d • Total number of possible association rules: R = 3^d - 2^(d+1) + 1
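A quick numeric check of both counts for small d (the function names are illustrative):

def total_itemsets(d):
    return 2 ** d   # all subsets of d items (including the empty set)

def total_rules(d):
    # R = 3^d - 2^(d+1) + 1: each item goes to the antecedent, the consequent,
    # or neither, minus the cases where either side of the rule is empty
    return 3 ** d - 2 ** (d + 1) + 1

for d in (2, 3, 6):
    print(d, total_itemsets(d), total_rules(d))
# 2 4 2
# 3 8 12
# 6 64 602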
Frequent Itemset Generation Strategies • Reduce the number of candidates (M) • Use pruning techniques to reduce M • Complete search: M = 2^d • Reduce the number of transactions (N) • Reduce the size of N as the size of the itemset increases • Used by DHP (Direct Hashing and Pruning) and vertical-based mining algorithms • Reduce the number of comparisons (NM) • Use efficient data structures to store the candidates or transactions • No need to match every candidate against every transaction
Reducing Number of Candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • This is due to the anti-monotone property of support: the support of an itemset never exceeds the support of any of its subsets • Stated conversely: • If an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.
Illustrating Apriori Principle • (Lattice figure: an itemset found to be infrequent, i.e. its support < minsup, has all of its supersets pruned.)
Illustrating Apriori Principle • Items (1-itemsets), Minimum Support = 3 • Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs • Triplets (3-itemsets): with the Apriori principle we need to keep only one triplet, because it is the only one whose subsets are all frequent • If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 • With support-based pruning: 6C1 + 4C2 + 1 = 6 + 6 + 1 = 13 • This is what the Apriori principle achieves for us.
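The same candidate-count arithmetic, checked with math.comb:

import math

no_pruning = sum(math.comb(6, k) for k in (1, 2, 3))    # 6 + 15 + 20 = 41
with_pruning = math.comb(6, 1) + math.comb(4, 2) + 1    # 6 + 6 + 1 = 13
print(no_pruning, with_pruning)                          # 41 13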
Apriori Algorithm • Method: • Let k=1 • Generate frequent itemsets of length 1 • Repeat until no new frequent itemsets are identified • Generate length (k+1) candidate itemsets from length k frequent itemsets • Prune candidate itemsets containing subsets of length k that are infrequent • Count the support of each candidate by scanning the DB • Eliminate candidates that are infrequent, leaving only those that are frequent
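A compact sketch of this level-wise procedure: generate (k+1)-candidates by joining frequent k-itemsets, prune candidates that have an infrequent k-subset, then count supports with one pass over the database. It assumes the transactions list defined earlier and an absolute minsup count; apriori is an illustrative name, not a library call:

from itertools import combinations

def apriori(transactions, minsup_count):
    # Returns {frozenset: support_count} for all frequent itemsets.
    # k = 1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {x: c for x, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Generate length (k+1) candidates from length k frequent itemsets
        keys = list(frequent)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                if len(union) == k + 1:
                    # Prune: every k-subset of the candidate must be frequent
                    if all(frozenset(sub) in frequent
                           for sub in combinations(union, k)):
                        candidates.add(union)
        # Count candidate supports with a single scan of the database
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Eliminate infrequent candidates, keep the frequent ones
        frequent = {x: c for x, c in counts.items() if c >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Example: minimum support count of 3 on the five transactions above
# print(apriori(transactions, 3))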
Mining Frequent Itemsets (the Key Step) • Find the frequent itemsets: the sets of items whose support is at least the minimum support • A subset of a frequent itemset must also be a frequent itemset • i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets) • Use the frequent itemsets to generate association rules.
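A sketch of the rule-generation step that follows: for each frequent itemset, emit every binary partition whose confidence meets minconf. Confidences are computed from the support counts already gathered during frequent itemset generation, so no further database scans are needed. generate_rules is an illustrative name, and the input is assumed to be the dictionary produced by the apriori() sketch above:

from itertools import combinations

def generate_rules(frequent_counts, minconf):
    # frequent_counts: {frozenset: support_count}, e.g. from apriori()
    rules = []
    for itemset, count in frequent_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # conf(X -> Y) = s(X U Y) / s(X); lhs is frequent by the
                # Apriori property, so its count is already available
                conf = count / frequent_counts[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Example (using the apriori() sketch above):
# rules = generate_rules(apriori(transactions, 3), minconf=0.6)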