An Efficient Algorithm for Incremental Mining of Association Rules
Chin-Chen Chang, Yu-Chiang Li, Jung-San Lee
RIDE-SDMA '05
Speaker: 董原賓 Advisor: 柯佳伶
Introduction
• Previous incremental mining algorithms
  • FUP (Fast Update Algorithm)
  • FUP2
  • Negative border
  ※ They all have to rescan the original database
• Problem
  • Publication-like databases, e.g. publication databases, web log records, etc.
  • The original database is normally much larger than the incremental database
• Solution
  • NFUP (New Fast Update Algorithm)
Definition
• DB: the original database
• db: the set of newly added transactions
• DB+: DB + db
• n, P_n: db is divided into n partitions, db = P_1 ∪ P_2 ∪ … ∪ P_{n-1} ∪ P_n
• db_{m,n} = P_m ∪ P_{m+1} ∪ … ∪ P_{n-1} ∪ P_n
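The partition notation can be illustrated with a minimal sketch. The function name `suffix_partitions` and the list-of-lists representation are assumptions made for illustration; they do not come from the slides.

```python
def suffix_partitions(partitions, m):
    """Return db_{m,n}: all transactions of P_m, P_{m+1}, ..., P_n.

    partitions -- list [P_1, ..., P_n], each a list of transactions
    m          -- 1-based index of the first partition to include
    """
    merged = []
    for part in partitions[m - 1:]:
        merged.extend(part)
    return merged


# Hypothetical toy data: three partitions with 1, 2 and 1 transactions.
partitions = [[{"A"}], [{"B"}, {"C"}], [{"D"}]]
db_2_n = suffix_partitions(partitions, 2)   # P_2 ∪ P_3
db_1_n = suffix_partitions(partitions, 1)   # the whole of db
```

With m = 1 the result is db itself, and with m = n it is the latest partition alone.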
Definition
• α set: frequent itemsets in DB+
• β set: itemsets frequent in db_{m,n} (m ≤ n) but infrequent in db_{m-1,n}
• γ set: itemsets frequent in db_{m,m} but infrequent in db_{m+1,n}
• X.count: occurrence count of itemset X
• X.start: number of the partition at which X becomes frequent
• X.type: denotes one of the three types α, β, and γ
FUP (Fast Update Algorithm)
• In case 2, the itemset's count is easily calculated
• In case 3, FUP needs to rescan the original database
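The case table from this slide did not survive, so the sketch below assumes the usual four-way breakdown of FUP by whether an itemset is frequent in DB and in db. The case numbering is an assumption chosen to agree with the two bullets above, and the function names are hypothetical.

```python
def fup_case(frequent_in_DB, frequent_in_db):
    """Classify an itemset into one of four cases (assumed numbering)."""
    if frequent_in_DB and frequent_in_db:
        return 1  # frequent in both, so frequent in DB+
    if frequent_in_DB:
        return 2  # count in DB is already known; only db must be counted
    if frequent_in_db:
        return 3  # count in DB is unknown
    return 4      # infrequent in both, so infrequent in DB+


def needs_rescan_of_DB(case):
    """Only case 3 lacks the DB count needed to decide frequency in DB+."""
    return case == 3
```

Case 3 is the expensive one: the itemset was never counted in DB, which is the rescan NFUP sets out to avoid.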
NFUP (New Fast Update Algo.)
• A backward method that only requires scanning the incremental database
• An itemset that is frequent in the incremental database is also important, even if it is infrequent in the updated database
• Partition the incremental database (db) by time interval
NFUP
• The set of frequent itemsets of DB is known in advance
• NFUP scans the partitions backward: the last partition is scanned first
• In each partition, the process is performed like that of Apriori
• Scan from P_n to P_1 and find the α, β, γ itemsets in db
• After P_1 is scanned, the occurrence counts are accumulated with the itemsets of DB
• The latest partition is scanned first: initialize the variables and accumulate the occurrence counts
• Still frequent in P_m: accumulate the count
• Still frequent in db_{m,n}: accumulate the count
• Only frequent in db_{m+1,n}: remove from the α set and add into the β set
• Does not belong to any set and is frequent in P_m: check if P_m is the latest partition
  • Yes: α set
  • No: γ set
Example
• Min sup = 50%, threshold for P2: 3 x 0.5 = 1.5
• Scan P2, 1-itemsets: {A: 2} {B: 2} {C: 3} {D: 1} {E: 1} {F: 2}
• Run Apriori-gen
• Scan P2, 2-itemsets: {AB: 2} {AC: 2} {AF: 1} {BC: 2} {BF: 1} {CF: 2}
• Scan P2, 3-itemsets: {ABC: 2}
• For each itemset:
  • Check if the itemset belongs to the α set
  • Else check that the itemset doesn't belong to any set
  • Check if the itemset's count >= 1.5
  • Check if P2 is the latest partition (yes: α, no: γ)
• Sets after scanning P2 (itemset: start, count)
  • α set: {A}: 2, 2; {B}: 2, 2; {C}: 2, 3; {F}: 2, 2; {AB}: 2, 2; {AC}: 2, 2; {BC}: 2, 2; {CF}: 2, 2; {ABC}: 2, 2
  • β set: (empty)
  • γ set: (empty)
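The "Run Apriori-gen" step can be sketched as the standard join-and-prune candidate generation from Apriori. This is a minimal illustration, not the speaker's code: the function name `apriori_gen` and the sorted-tuple representation of itemsets are assumptions.

```python
from itertools import combinations


def apriori_gen(frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    frequent -- iterable of sorted tuples, each of length k-1
    """
    frequent = set(frequent)
    candidates = set()
    for a in frequent:
        for b in frequent:
            # Join step: same first k-2 items, last items in sorted order.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates


# Frequent itemsets of P2 taken from the example slide.
c2 = apriori_gen({("A",), ("B",), ("C",), ("F",)}, 2)
c3 = apriori_gen({("A", "B"), ("A", "C"), ("B", "C"), ("C", "F")}, 3)
```

Here `c2` holds the six 2-itemsets counted in P2 (AB, AC, AF, BC, BF, CF), and `c3` holds only ABC, which agrees with the candidates listed on the slide.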
Example
• Min sup = 50%, threshold for P1: 3 x 0.5 = 1.5
• Scan P1, 1-itemsets: {A: 1} {B: 3} {C: 2} {D: 1} {E: 3} {F: 0}
• Run Apriori-gen
• Scan P1, 2-itemsets: {AB: 1} {AC: 0} {BC: 2} {BE: 3} {CE: 2}
• For each itemset:
  • Check if the itemset belongs to the α set
    • Yes: accumulate the count
    • Count < s*|db_{m,n}| = 0.5 x 6 = 3: β set
  • Else check that the itemset doesn't belong to any set
  • Check if the itemset's count >= 1.5
  • Check if P1 is the latest partition (yes: α, no: γ)
• Sets after scanning P1 (itemset: start, count)
  • α set: {A}: 1, 3; {B}: 1, 5; {C}: 1, 5; {AB}: 1, 3; {BC}: 1, 4
  • β set: {F}: 2, 2; {AC}: 2, 2; {CF}: 2, 2; {ABC}: 2, 2
  • γ set: {E}: 1, 3; {BE}: 1, 3; {CE}: 1, 2
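The two scans in the example can be replayed with a small sketch of the per-partition update. It is a simplification under stated assumptions, not the authors' implementation: the names `Entry` and `scan_partition` are hypothetical, the per-partition counts are passed in directly instead of being produced level by level, entries already in the β or γ set are left untouched, and an itemset demoted to β is assumed to keep the start and count it had while still frequent.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    start: int   # X.start
    count: int   # X.count
    kind: str    # X.type: "alpha", "beta" or "gamma"


def scan_partition(table, counts, m, n, part_size, suffix_size, min_sup):
    """Fold the counts found in P_m into the table.

    counts      -- itemset -> occurrence count inside P_m
    part_size   -- |P_m|
    suffix_size -- |db_{m,n}|
    """
    # Alpha itemsets: accumulate the count, then re-check against db_{m,n}.
    for itemset, entry in table.items():
        if entry.kind != "alpha":
            continue  # simplification: beta/gamma entries are not updated
        total = entry.count + counts.get(itemset, 0)
        if total >= min_sup * suffix_size:
            entry.start, entry.count = m, total
        else:
            entry.kind = "beta"  # assumption: keeps its earlier start/count
    # Itemsets in no set yet that are frequent inside P_m.
    for itemset, c in counts.items():
        if itemset not in table and c >= min_sup * part_size:
            table[itemset] = Entry(m, c, "alpha" if m == n else "gamma")
    return table


# Counts copied from the example slides.
p2 = {"A": 2, "B": 2, "C": 3, "D": 1, "E": 1, "F": 2, "AB": 2, "AC": 2,
      "AF": 1, "BC": 2, "BF": 1, "CF": 2, "ABC": 2}
p1 = {"A": 1, "B": 3, "C": 2, "D": 1, "E": 3, "F": 0, "AB": 1, "AC": 0,
      "BC": 2, "BE": 3, "CE": 2}

table = {}
scan_partition(table, p2, m=2, n=2, part_size=3, suffix_size=3, min_sup=0.5)
scan_partition(table, p1, m=1, n=2, part_size=3, suffix_size=6, min_sup=0.5)
for itemset, e in table.items():
    print(itemset, e.kind, e.start, e.count)
```

Under these assumptions the table ends with A, B, C, AB and BC in the α set, F, AC, CF and ABC in the β set, and E, BE and CE in the γ set, matching the sets shown after the scan of P1.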
Example α set start count β set start count γ set start count {A} 1 0 3 7 {F} 2 2 {E} 1 3 {B} 0 1 5 8 {AC} 2 2 {BE} 1 3 {C} 0 1 9 5 {CF} 2 2 {CE} 1 2 {AB} 1 3 {ABC} {AB} 2 1 2 3 {AE} 0 3 {BC} 1 4 {BC} 1 4 {ABC} 2 2
Experiment
• Intel Pentium IV 1.5 GHz CPU, 640 MB main memory
• Microsoft Windows 2000 Professional
• Synthetic datasets: