Parallel Association Rule Mining Presented by: Ramoza Ahsan and Xiao Qin November 5th, 2013
Outline • Background of Association Rule Mining • Apriori Algorithm • Parallel Association Rule Mining • Count Distribution • Data Distribution • Candidate Distribution • FP-tree Mining and Growth • Fast Parallel Association Rule Mining without Candidate Generation • More Readings
Association Rule Mining • Association rule mining finds interesting patterns in data; analysis of past transaction data can provide valuable information on customer buying behavior. • Each record usually contains the transaction date and the items bought. • The literature has focused mostly on serial mining. • Support and confidence are the parameters of association rule mining.
Association Rule Mining Parameters • The support, supp(X), of an itemset X is the proportion of transactions in the data set that contain the itemset. • The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X ∪ Y)/supp(X). • In the five-transaction example, supp({milk, bread, egg}) = 1/5, and the rule {milk, bread} -> {egg} has confidence 0.5.
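As a concrete illustration (not from the slides), here is a minimal Python sketch of the two definitions; the five-transaction dataset below is a made-up stand-in for the example table on the slide:

```python
# A made-up 5-transaction dataset standing in for the slide's example table.
transactions = [
    {"milk", "bread", "egg"},
    {"milk", "bread"},
    {"milk", "beer"},
    {"bread", "butter"},
    {"beer", "diaper"},
]

def support(itemset, transactions):
    """supp(X): fraction of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"milk", "bread", "egg"}, transactions))      # 0.2, i.e. 1/5
print(confidence({"milk", "bread"}, {"egg"}, transactions))  # 0.5
```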
Outline • Background of Association Rule Mining • Apriori Algorithm • Parallel Association Rule Mining • Count Distribution • Data Distribution • Candidate Distribution • FP-tree Mining and Growth • Fast Parallel Association Rule Mining without Candidate Generation • FP-tree over Hadoop
Apriori Algorithm Apriori runs in two steps: • generation of candidate itemsets • pruning of the itemsets that are infrequent Frequent itemsets are generated level-wise. Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
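A compact sketch of the level-wise loop, assuming transactions are Python sets; this illustrates the principle and is not the implementation evaluated in the papers:

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining; minsup is an absolute count."""
    # L1: frequent single items.
    counts = Counter(item for t in transactions for item in t)
    level = {frozenset([i]) for i, c in counts.items() if c >= minsup}
    frequent = set(level)
    k = 2
    while level:
        # Candidate generation: join Lk-1 with itself, then keep only the
        # k-itemsets whose (k-1)-subsets are all frequent (Apriori principle).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count the candidates over the data and prune the infrequent ones.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        level = {c for c, n in counts.items() if n >= minsup}
        frequent |= level
        k += 1
    return frequent
```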
Apriori Algorithm for generating frequent itemsets (worked example) • Minimum support = 2
Parallel Association Rule Mining • The paper presents parallel algorithms for generating frequent itemsets. • Each of the N processors has its own private memory and disk. • The data is distributed evenly across the processors' disks. • Count Distribution focuses on minimizing communication. • Data Distribution exploits the aggregate memory of the system. • Candidate Distribution reduces synchronization between processors.
Algorithm 1: Count Distribution • Each processor generates the complete candidate set Ck from the complete frequent itemset Lk-1. • Each processor scans its local data partition and develops local support counts. • Processors exchange their counts to derive global counts; this step requires synchronization. • Each processor computes Lk from Ck. • Each processor independently decides whether to continue or stop.
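A toy simulation of one pass in plain Python, with list entries standing in for processors; a real implementation would exchange the counters over the network (e.g. with an all-to-all reduction) rather than in a local loop:

```python
from collections import Counter

def count_distribution_pass(partitions, candidates, minsup):
    """One pass of Count Distribution, simulated: every 'processor' holds the
    full candidate set Ck but counts only its own data partition."""
    # Each processor counts all candidates over its local partition.
    local_counts = [Counter(c for t in part for c in candidates if c <= t)
                    for part in partitions]
    # Count exchange: summing the local counters plays the role of the
    # synchronized all-to-all message exchange.
    global_counts = Counter()
    for cnt in local_counts:
        global_counts += cnt
    # Every processor derives the same Lk from the same global counts.
    return {c for c, n in global_counts.items() if n >= minsup}
```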
Algorithm 2: Data Distribution • Partition the dataset into N small chunks. • Partition the set of candidate k-itemsets into N exclusive subsets. • Each of the N nodes takes one candidate subset and counts its frequency over one chunk at a time, until it has counted through all the chunks. • Aggregate the counts.
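A simulated sketch of one pass, under the same toy conventions as above; in a real system, the inner loop over chunks is where each node must receive the other nodes' data partitions:

```python
from collections import Counter

def data_distribution_pass(partitions, candidates, minsup, n_nodes):
    """One pass of Data Distribution, simulated: node i counts only its own
    1/N slice of Ck, but must see every data chunk to do so."""
    cand = sorted(candidates, key=sorted)      # deterministic 1/N split of Ck
    subsets = [cand[i::n_nodes] for i in range(n_nodes)]
    level = set()
    for my_cands in subsets:                   # one loop body per node
        cnt = Counter()
        for chunk in partitions:               # local chunk first, then the
            for t in chunk:                    # chunks shipped in from peers
                cnt.update(c for c in my_cands if c <= t)
        level |= {c for c, n in cnt.items() if n >= minsup}
    return level                               # the union plays the role of broadcasting Lk
```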
Algorithm 2: Data Distribution [Figure: each of the N nodes holds 1/N of the data and 1/N of Ck; the nodes synchronize by circulating their data partitions so that every candidate subset is counted against every chunk.]
Algorithm 3: Candidate Distribution • If the workload is not balanced, every processor can end up waiting for whichever processor finishes last in each pass. • The Candidate Distribution algorithm does away with this dependency by partitioning both the data and the candidates.
Algorithm 3: Candidate Distribution [Figure: the data is repartitioned into Data_1 … Data_5 and Lk-1 into Lk-1_1 … Lk-1_5, so that node i generates and counts its own Ck_i independently of the other nodes.]
Data Partition and L Partition • Data: in each pass, every node grabs the tuples it needs from the dataset. • L: let L3 = {ABC, ABD, ABE, ACD, ACE}. The items in each itemset are lexicographically ordered, and the itemsets are partitioned on their common (k-1)-long prefixes.
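A small sketch of the prefix-based partition of Lk, reproducing the L3 example; itemsets are represented as lexicographically sorted tuples:

```python
from collections import defaultdict

def partition_by_prefix(level, k):
    """Group k-itemsets by their common (k-1)-long lexicographic prefix."""
    groups = defaultdict(list)
    for itemset in level:
        items = tuple(sorted(itemset))
        groups[items[:k - 1]].append(items)
    return dict(groups)

L3 = [frozenset(s) for s in ("ABC", "ABD", "ABE", "ACD", "ACE")]
for prefix, members in partition_by_prefix(L3, 3).items():
    print(prefix, members)
# ('A', 'B') [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'B', 'E')]
# ('A', 'C') [('A', 'C', 'D'), ('A', 'C', 'E')]
```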
Rule Generation • Example: frequent itemsets {ABCDE} and {AB}. • One rule that can be generated from them is AB => CDE, with support sup(ABCDE) and confidence sup(ABCDE)/sup(AB).
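A hedged sketch of rule generation from a single frequent itemset; it assumes a `support` dictionary mapping every frequent itemset (as a frozenset) to its support, as produced by the mining phase:

```python
from itertools import combinations

def rules_from_itemset(itemset, support, minconf):
    """Emit every rule X -> (itemset - X) whose confidence clears minconf."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = support[items] / support[lhs]   # e.g. sup(ABCDE)/sup(AB)
            if conf >= minconf:
                rules.append((lhs, items - lhs, support[items], conf))
    return rules
```

Every antecedent X is itself frequent (by the Apriori principle), so its support is guaranteed to be in the dictionary.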
Outline • Background of Association Rule Mining • Apriori Algorithm • Parallel Association Rule Mining • Count Distribution • Data Distribution • Candidate Distribution • FP-tree Mining and Growth • Fast Parallel Association Rule Mining without Candidate Generation • FP-tree over Hadoop
FP-Tree Algorithm The FP-tree allows frequent itemset discovery without candidate itemset generation: • Step 1: build a compact data structure called the FP-tree, using two passes over the data set. • Step 2: extract frequent itemsets directly from the FP-tree.
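A minimal sketch of Step 1; the node class and header-table layout are illustrative choices, not the paper's data structures:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, minsup):
    """Two passes: (1) count items, (2) insert each transaction with its
    frequent items sorted by descending global frequency."""
    freq = Counter(i for t in transactions for i in t)          # pass 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root = FPNode(None, None)
    header = {}                                # item -> list of tree nodes
    for t in transactions:                                      # pass 2
        path = sorted((i for i in t if i in freq),
                      key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, freq
```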
FP-Tree & FP-Growth example [figure; min supp = 3]
Fast Parallel Association Rule Mining Without Candidate Generation • Phase 1: • Each processor is given an equal number of transactions. • Each processor locally counts the items. • The local counts are summed to obtain global counts. • Infrequent items are pruned, and the frequent items are stored in a header table in descending order of frequency (see the sketch below). • A parallel frequent-pattern tree is constructed on each processor. • Phase 2: the FP-trees are mined as in the FP-growth algorithm, using the global counts in the header table.
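A sketch of the Phase 1 bookkeeping under the same toy conventions as the earlier simulations; `partitions` holds each processor's share of the transactions:

```python
from collections import Counter

def global_header_table(partitions, minsup):
    """Local item counts per processor, summed into global counts; infrequent
    items are pruned and the rest kept in descending order of frequency."""
    local = [Counter(i for t in part for i in t) for part in partitions]
    global_counts = sum(local, Counter())      # the count-exchange step
    return {i: c
            for i, c in sorted(global_counts.items(), key=lambda kv: -kv[1])
            if c >= minsup}                    # insertion order = frequency order
```

Each processor can then build its local FP-tree from its own transactions (e.g. with `build_fptree` above), ordering items by this shared header table.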
Example with min supp = 4 [figure: Step 1 through Step 4, after pruning infrequent items]
FP-tree for P0 [figure: the tree built incrementally from P0's local transactions, with item:count labels on the nodes]
Frequent Pattern Strings • All frequent-pattern trees are shared by all processors. • Each processor generates conditional pattern bases for its respective items in the header table. • Merging all conditional pattern bases of the same item yields the frequent string. • If an item's support is below the threshold, it is not added to the final frequent string.
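A small sketch of the merge step; each per-processor base is assumed to be a Counter mapping a prefix path (a tuple of items) to its count, which is one plausible encoding of a conditional pattern base:

```python
from collections import Counter

def merge_pattern_bases(per_processor_bases, minsup):
    """Merge the conditional pattern bases that the processors produced for
    the same header-table item; paths below minsup are dropped."""
    merged = Counter()
    for base in per_processor_bases:           # base: Counter{prefix path: count}
        merged += base
    return {path: n for path, n in merged.items() if n >= minsup}
```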
More Readings
FP-Growth on Hadoop • 3 MapReduce passes
FP-Growth on Hadoop Core