Hash-Based Algorithm for Mining Association Rules
Data Mining • Mining Association Rules
Mining Association Rules • Support • Obtain Large Itemset • Confidence • Generate Association Rules
Apriori-based approach • In the Apriori-based approach, we first search the given set of structures for small frequent substructures. • At each subsequent step, a new substructure is created by adding one node to a frequent substructure. • Only nodes that were identified as frequent in the first step are used to extend a frequent substructure. • Each time a new substructure is created, the set of structures is scanned to determine whether the new substructure is frequent.
Apriori (figure, Sup=2) • D → Scan D → C1 → L1 • C2 → Scan D → C2 with counts → L2 • C3 → Scan D → C3 with counts → L3
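The level-wise search described above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `apriori` is hypothetical, and the database `D` is borrowed from Example1 later in the deck, on the assumption that it is a reasonable stand-in for the D shown in the figure.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search; returns {frozenset(itemset): support count}."""
    def count(candidates):
        # One scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {i for t in transactions for i in t}
    level = count({frozenset([i]) for i in items})  # L1
    large = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Ck: unions of two large (k-1)-itemsets that have exactly k items.
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have a (k-1)-subset which is not large.
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = count(cands)  # Lk
        large.update(level)
        k += 1
    return large

# Assumed database (taken from Example1), minimum support 2.
D = [frozenset("ACD"), frozenset("BCE"), frozenset("BCDE"), frozenset("BE")]
result = apriori(D, min_sup=2)
```

Each pass through the loop costs one full scan of D, which is the inefficiency the following slides address.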
Apriori Cont. • Disadvantages • Inefficient • Produces many useless candidates
DHP (Direct Hashing with Efficient Pruning for Fast Data Mining) • Prunes useless candidates in advance • Reduces the database size at each iteration
DHP example (figure) • Database D, Min sup=2 • Hash function: H{[x y]} = ((order of x)*10 + (order of y)) mod 7 • Hash table H2: hash address, bucket count (e.g. the number of items hashed to bucket 0), bit vector
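The hash table H2 can be built as in the sketch below. Two things are assumptions here: the "order" of an item is taken to be its alphabetical position (A=1, B=2, and so on), and the database is the one from Example1 later in the deck, since the figure's D is not reproduced in the text. The names `h2`, `build_h2` and `dhp_c2` are hypothetical.

```python
from itertools import combinations

# Assumption: "order of x" is the alphabetical position of the item.
ORDER = {item: i for i, item in enumerate("ABCDE", start=1)}

def h2(x, y):
    """Hash function from the slide: ((order of x)*10 + (order of y)) mod 7."""
    return (ORDER[x] * 10 + ORDER[y]) % 7

def build_h2(transactions, min_sup, n_buckets=7):
    """Hash every 2-itemset of every transaction; return counts and bit vector."""
    buckets = [0] * n_buckets
    for t in transactions:
        for x, y in combinations(sorted(t), 2):
            buckets[h2(x, y)] += 1
    bit_vector = [int(c >= min_sup) for c in buckets]
    return buckets, bit_vector

def dhp_c2(large_items, bit_vector):
    """Keep a pair as a candidate only if its bucket bit is set."""
    return [x + y for x, y in combinations(sorted(large_items), 2)
            if bit_vector[h2(x, y)]]

# Assumed database (taken from Example1), minimum support 2.
D = [set("ACD"), set("BCE"), set("BCDE"), set("BE")]
buckets, bit_vector = build_h2(D, min_sup=2)
c2 = dhp_c2("BCDE", bit_vector)
```

With this assumed data, BD and DE both hash to bucket 3. Each occurs in only one transaction, but the shared bucket reaches a count of 2, so both survive the filter. This is the collision effect that motivates a perfect hash.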
Motivation • Apriori and DHP generate C_i from L_{i-1}, which may be the bottleneck • Collisions occur in DHP • Designing a perfect hashing function for every transaction database is a thorny problem
Definition • A Join operation joins two different (k-1)-itemsets, p_1p_2…p_{k-1} and q_1q_2…q_{k-1}, to produce the k-itemset p_1p_2…p_{k-1}q_{k-1}, where p_2=q_1, p_3=q_2, …, p_{k-2}=q_{k-3}, p_{k-1}=q_{k-2}. • Example: ABC and BCD join to produce ABCD. • Of the 3-itemsets of ABCD (ABC, ABD, ACD, BCD), only one pair, ABC and BCD, satisfies the join definition.
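The join condition translates directly into a slice comparison. The sketch below assumes itemsets are written as strings of single-letter items in sorted order; the function name `join` is chosen for illustration.

```python
def join(p, q):
    """Join two different (k-1)-itemsets into a k-itemset, or return None.

    The condition p_2=q_1, ..., p_{k-1}=q_{k-2} says that p without its
    first item equals q without its last item.
    """
    if p != q and len(p) == len(q) and p[1:] == q[:-1]:
        return p + q[-1]
    return None

# Check the slide's example: which ordered pairs of 3-itemsets of ABCD join?
three_itemsets = ["ABC", "ABD", "ACD", "BCD"]
joinable = [(p, q) for p in three_itemsets for q in three_itemsets
            if join(p, q) is not None]
```

Because exactly one pair of (k-1)-subsets can produce a given k-itemset, each candidate is generated once and no duplicate check is needed.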
Algorithm • PHS (Perfect Hashing and Data Shrinking)
Example1 (sup=2)
TID   Items   Itemsets (from L1 items)
100   ACD     (CD)
200   BCE     (BC)(BE)(CE)
300   BCDE    (BC)(BD)(BE)(CD)(CE)(DE)
400   BE      (BE)
Itemset   (BC) (BD) (BE) (CD) (CE) (DE)
Support    2    1    3    2    2    1
Encoding   A    B    C    D
Original  (BC) (BE) (CD) (CE)
Example2 (sup=2)
TID   Itemsets
100   Null
200   (AD)
300   (AC)(AD)
400   Null
Itemset   (AB) (AC) (AD) (BC) (BD) (CD)
Support    0    1    2    0    0    0
Encoding   A
Original  (AD)
Decode -> (AD) = (BC)(CE) = BCE
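The two examples can be reproduced with the short sketch below. It is an illustration under simplifying assumptions, not the PHS algorithm itself: itemsets stay in decoded form (for example "BC" rather than the encoded symbol A), and a `Counter` stands in for the perfect hash table. The names `next_level` and `mine` are hypothetical.

```python
from collections import Counter

def join(p, q):
    # p < q keeps single items in order; for longer sorted itemsets it
    # follows automatically from the join condition.
    return p + q[-1] if p < q and p[1:] == q[:-1] else None

def next_level(db, min_sup):
    """db maps TID -> list of k-itemsets (sorted strings) in that transaction.

    Returns the large k-itemsets with their counts, plus the shrunken
    database whose entries are the (k+1)-itemsets formed by joins.
    """
    counts = Counter(s for itemsets in db.values() for s in itemsets)
    large = {s: n for s, n in counts.items() if n >= min_sup}
    shrunk = {}
    for tid, itemsets in db.items():
        kept = [s for s in itemsets if s in large]   # data shrinking
        joined = (join(p, q) for p in kept for q in kept)
        shrunk[tid] = sorted(j for j in joined if j)
    return large, shrunk

def mine(db, min_sup):
    level = {tid: sorted(items) for tid, items in db.items()}
    result = {}
    while any(level.values()):
        large, level = next_level(level, min_sup)
        result.update(large)
    return result

D = {100: "ACD", 200: "BCE", 300: "BCDE", 400: "BE"}
result = mine(D, min_sup=2)
_, after_l1 = next_level({t: sorted(i) for t, i in D.items()}, 2)
```

After the first call the shrunken database matches the Itemsets column of Example1, and the only large 3-itemset found is BCE, in agreement with the decode step of Example2.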
Problem on Hash Table • Consider a database that contains p transactions, each made up of unique items and of equal length N, with a minimum support of 1. • Loading density:
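A rough way to see the problem, assuming that loading density means the fraction of table slots that actually hold an itemset, and that the perfect hash table reserves one slot for every possible 2-itemset over the items in the database. Both are assumptions made for this sketch, and the function name `loading_density` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def loading_density(transactions):
    """Occupied slots / total slots for a table with one slot per 2-itemset.

    Assumes the database contains at least two distinct items.
    """
    items = set().union(*transactions)
    slots = comb(len(items), 2)
    occupied = {pair for t in transactions
                for pair in combinations(sorted(t), 2)}
    return Fraction(len(occupied), slots)

# p = 3 transactions of length N = 4, no item shared between transactions.
db = [set("ABCD"), set("EFGH"), set("IJKL")]
density = loading_density(db)
```

Under these assumptions the table has C(pN, 2) slots while only p·C(N, 2) of them are used, so the density shrinks as the number of transactions grows.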
How to Improve the Loading Density • Two-level perfect hash scheme (partial hash) • Figure: itemsets (AB) (AC) (AD) (BC) (BD) (CD) with support 0 1 2 0 0 0; first level A, B, C; hash table entries C (count 1), D (count 2), Null, Null
Conclusions • We examined in this paper the issue of mining association rules among items in a large database of sales transactions. • The problem of discovering large itemsets was solved by first constructing a candidate set of itemsets and then identifying, within this candidate set, those itemsets that meet the large itemset requirement.