Fast Algorithms for Mining Frequent Itemsets

Fast Algorithms for Mining Frequent Itemsets (探勘頻繁項目集合之快速演算法研究) • Doctoral dissertation draft • Advisor: Prof. Chin-Chen Chang (張真誠) • Graduate student: Yu-Chiang Li (李育強) • Dept. of Computer Science and Information Engineering, National Chung Cheng University • Date: May 31, 2007

Outline • Introduction • Background and Related Work • NFP-Tree Structure • Fast Share Measure (FSM) Algorithm • Three Efficient Algorithms • Direct Candidate Generation (DCG) Algorithm • Isolated Items Discarding Strategy (IIDS) • Maximum Item Conflict First (MICF) Sanitization Method • Conclusions

Introduction • Data mining techniques have been developed to find a small set of precious nuggets in reams of data (Cabena et al., 1998; Kantardzic, 2002) • Mining association rules is one of the most important data mining problems • It splits into two sub-problems (Agrawal & Srikant, 1994) • Identifying all frequent itemsets • Using these frequent itemsets to generate association rules • The first sub-problem plays an essential role in mining association rules

Introduction (cont'd) • Mining frequent itemsets • Mining share-frequent itemsets • Mining high utility itemsets • Hiding sensitive patterns

Support-Confidence Framework (1/4) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
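The level-wise search behind Apriori can be sketched in a few lines of Python. This is a minimal illustration, not the implementation evaluated in the thesis: the database `toy_transactions` is hypothetical (the slide's own table is not reproduced here), and the function name `apriori` is chosen only for this example.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support count} for every itemset whose
    relative support is at least min_sup (level-wise search)."""
    n = len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= min_sup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= min_sup}
        result.update(frequent)
        k += 1
    return result


# Hypothetical five-transaction database, minSup = 40% as on the slide.
toy_transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "C"},
    {"A", "B", "C"},
]
frequent_itemsets = apriori(toy_transactions, 0.4)
```

With this toy data, item D appears in only one of five transactions and is dropped at level 1, so no candidate containing D is ever counted.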

Support-Confidence Framework (2/4) • FP-growth algorithm (Han et al., 2000; Han et al., 2004)  

Support-Confidence Framework (3/4)   

Support-Confidence Framework (4/4) Conditional FP-tree of “D” Conditional FP-tree of “BD”

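Share-Confidence Framework (1/4) • Measure value: $mv(i_p, T_q)$, the measure value of item $i_p$ in transaction $T_q$ • mv({D}, T01) = 1 • mv({C}, T03) = 3 • Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$ • tmv(T02) = 10 • Total measure value: $Tmv(DB) = \sum_{T_q \in DB} tmv(T_q)$ • Tmv(DB) = 47 • Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X} mv(i_p, T_q)$, for $X \subseteq T_q$ • imv({A, E}, T02) = 5 • Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions that contain X • lmv({BC}) = 2 + 5 + 5 = 12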

Share-Confidence Framework (2/4) • Itemset share: $SH(X) = lmv(X) / Tmv(DB)$ • SH({BC}) = 12/47 = 25.5% • SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset • With minShare = 30%, {BC} is not SH-frequent
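The share measures on the two preceding slides can be computed directly from a transaction table. The following is a minimal Python sketch, assuming each transaction is a dict that maps items to measure values. The database `toy_db` is hypothetical (it is not the database shown on the slides), and the function names simply mirror the slide notation.

```python
def tmv(transaction):
    """Transaction measure value: sum of all measure values in one transaction."""
    return sum(transaction.values())


def total_mv(db):
    """Total measure value Tmv(DB): sum of tmv over all transactions."""
    return sum(tmv(t) for t in db.values())


def imv(itemset, transaction):
    """Itemset measure value of X in one transaction (0 if X is not contained)."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] for item in itemset)


def lmv(itemset, db):
    """Local measure value: imv summed over the transactions that contain X."""
    return sum(imv(itemset, t) for t in db.values())


def share(itemset, db):
    """Itemset share SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(itemset, db) / total_mv(db)


# Hypothetical database: item -> measure value (e.g. quantity) per transaction.
toy_db = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3, "D": 2},
    "T03": {"A": 2, "D": 1},
    "T04": {"B": 3, "C": 1, "E": 3},
}

min_share = 0.30
bc_share = share({"B", "C"}, toy_db)   # lmv = 3 + 4 + 4 = 11, Tmv = 20
ad_share = share({"A", "D"}, toy_db)   # lmv = 3, Tmv = 20
```

On this toy data {B, C} passes the 30% threshold while {A, D} does not.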

Share-Confidence Framework (3/4) • ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003) • variants of exhaustive search • prune only the candidate itemsets whose local measure values are exactly zero • SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003) • level-wise, like Apriori • its output contains errors: some SH-frequent itemsets can be missed • In short, the three algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets

Share-Confidence Framework (4/4) ZSP Algorithm SIP Algorithm

Utility Mining (1/2) • Internal utility: $iu(i_p, T_q)$ • iu({D}, T01) = 1 • iu({C}, T03) = 3 • External utility: $eu(i_p)$ • eu({D}) = 3 • eu({C}) = 1 • Utility value in a transaction: $util(X, T_q) = \sum_{i_p \in X} iu(i_p, T_q) \times eu(i_p)$ • util({C, E, F}, T02) = util(C, T02) + util(E, T02) + util(F, T02) = 3×1 + 1×5 + 2×2 = 12 • Local utility: $Lutil(X) = \sum_{T_q \in db_X} util(X, T_q)$, where $db_X$ is the set of transactions that contain X • Lutil({C, D}) = util({C, D}, T01) + util({C, D}, T04) + util({C, D}, T06) = 4 + 7 + 5 = 16

Utility Mining (2/2) • Total utility: $Tutil(DB) = \sum_{T_q \in DB} util(T_q, T_q)$ • Tutil(DB) = 122 • The utility value of X in DB: $UTIL(X) = Lutil(X) / Tutil(DB)$ • UTIL({C, D}) = 16/122 = 13.1% • High utility itemset: if UTIL(X) >= minUtil, X is a high utility itemset
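The utility definitions can be checked with a short Python sketch. It assumes that the utility of an itemset in a transaction is the sum of internal utility times external utility over its items, which is what the worked example util({C, E, F}, T02) = 12 suggests. The database `toy_db` is hypothetical: only the values for C, E and F in its first transaction and the external utilities of C, D, E and F are taken from the slide examples; the value for A and the other two transactions are invented for illustration.

```python
def util(itemset, transaction, eu):
    """Utility of X in one transaction: sum of iu * eu (0 if X is not contained)."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] * eu[item] for item in itemset)


def lutil(itemset, db, eu):
    """Local utility: util summed over the transactions that contain X."""
    return sum(util(itemset, t, eu) for t in db.values())


def tutil(db, eu):
    """Total utility: utility of every whole transaction, summed over DB."""
    return sum(util(set(t), t, eu) for t in db.values())


def utility_ratio(itemset, db, eu):
    """UTIL(X) = Lutil(X) / Tutil(DB)."""
    return lutil(itemset, db, eu) / tutil(db, eu)


# External utilities; the value for A is an assumption.
eu = {"A": 1, "C": 1, "D": 3, "E": 5, "F": 2}

# Internal utilities (quantities); T2 and T3 are hypothetical.
toy_db = {
    "T1": {"A": 4, "C": 3, "E": 1, "F": 2},
    "T2": {"C": 1, "D": 1},
    "T3": {"C": 2, "D": 2, "E": 1},
}

cef_in_t1 = util({"C", "E", "F"}, toy_db["T1"], eu)   # 3*1 + 1*5 + 2*2
cd_ratio = utility_ratio({"C", "D"}, toy_db, eu)      # (4 + 8) / 33
```

Setting every external utility to 1 reduces these functions to the share measures, which is why the share-based algorithms on the later slides carry over to utility mining.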

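Privacy-Preserving in Mining Frequent Itemsets • Optimal sanitization is an NP-hard problem (Atallah et al., 1999) • DB: database, DB’: released database • RI: the set of restrictive itemsets • ~RI: the set of non-restrictive itemsets • Misses cost = $\frac{\#{\sim}RI(DB) - \#{\sim}RI(DB')}{\#{\sim}RI(DB)}$, where $\#{\sim}RI(\cdot)$ is the number of non-restrictive itemsets discovered from the given database • Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)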
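As a small illustration of the misses cost, the sketch below assumes the usual definition from the sanitization literature: the fraction of non-restrictive itemsets that were frequent in DB but are no longer found in the released database DB’. The function name `misses_cost` and the example itemsets are hypothetical.

```python
def misses_cost(non_restrictive_in_db, non_restrictive_in_released):
    """Fraction of non-restrictive frequent itemsets lost by sanitization.

    Both arguments are collections of itemsets; the second is assumed to be
    a subset of the first, since sanitization only removes items.
    """
    before = len(non_restrictive_in_db)
    if before == 0:
        return 0.0
    after = len(non_restrictive_in_released)
    return (before - after) / before


# Hypothetical mining results before and after sanitization.
found_in_db = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("ABC")}
found_in_released = {frozenset("AB"), frozenset("AC")}

cost = misses_cost(found_in_db, found_in_released)
```

A misses cost of 0 means the sanitization hid the restrictive itemsets without losing any legitimate pattern.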

NFP-Tree (1/4) • NFP-growth Algorithm • NFP-tree construction

NFP-Tree (2/4)  

NFP-Tree (3/4)   

NFP-Tree (4/4) Conditional NFP-tree of “D(3,4)”

Experimental Results (1/3) • PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional • All algorithms were coded in VC++ 6.0 • Datasets: • Real: BMS-WebView-1, BMS-WebView-2, Connect-4 • Artificial: generated by the IBM synthetic data generator

Experimental Results (2/3)

Fast Share Measure (FSM) Algorithm • ML: maximum transaction length in DB • MV: maximum measure value in DB • min_lmv = minShare × Tmv(DB) • Level Closure Property: given a minShare and a k-itemset X • Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k + 1 are infrequent • Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k + k' are infrequent • Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv, all supersets of X are infrequent

minShare=30% • Let CF(X)=lmv(X)+(lmv(X)/k)×MV ×(ML-k) • Prune X if CF(X)<min_lmv • CF({ABC})=3+(3/3)×3×(6-3)=12<14.1=min_lmv
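The pruning test on this slide is a one-line computation. The sketch below reproduces the slide's example for {ABC} (lmv = 3, k = 3, MV = 3, ML = 6, Tmv(DB) = 47, minShare = 30%); the function names `critical_function` and `can_prune` are chosen for this illustration only.

```python
def critical_function(lmv_x, k, mv_max, ml):
    """CF(X) = lmv(X) + (lmv(X) / k) * MV * (ML - k)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)


def can_prune(lmv_x, k, mv_max, ml, min_lmv):
    """X and all of its supersets can be discarded when CF(X) < min_lmv."""
    return critical_function(lmv_x, k, mv_max, ml) < min_lmv


# Values from the slide's example.
min_lmv = 0.30 * 47                      # minShare * Tmv(DB), about 14.1
cf_abc = critical_function(3, 3, 3, 6)   # 3 + 1 * 3 * 3
prune_abc = can_prune(3, 3, 3, 6, min_lmv)
```

The bound is an overestimate of what any superset of X could contribute, so a value below min_lmv rules out every superset at once.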

Experimental Results (1/2) • T4.I2.D100k.N50.S10 • minShare = 0.8% • ML = 14

Three Efficient Algorithms • EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RC_{k-1}, EFSM joins each itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently • Reduces the time complexity from $O(n^{2k-2})$ to $O(n^k)$
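The join described for EFSM can be sketched as follows. This shows only the candidate-generation step, under the assumption that RC_{k-1} is a set of (k-1)-itemsets and RC_1 is a set of single items; any further checks the actual algorithm applies to candidates are not shown. The name `efsm_join` is hypothetical.

```python
def efsm_join(rc_prev, rc1_items):
    """Extend every itemset in RC_{k-1} by one item from RC_1.

    rc_prev   -- iterable of frozensets, each of size k-1
    rc1_items -- iterable of single items
    Returns the set of distinct k-itemset candidates C_k.
    """
    candidates = set()
    for itemset in rc_prev:
        for item in rc1_items:
            if item not in itemset:
                candidates.add(itemset | {item})
    return candidates


# Hypothetical retained candidates.
rc_2 = {frozenset("AB"), frozenset("BC")}
rc_1 = {"A", "B", "C", "D"}
c_3 = efsm_join(rc_2, rc_1)
```

Each itemset is paired with at most |RC_1| items, instead of with every other itemset in RC_{k-1}, which is where the saving in join work comes from.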

X^{k+1}: an arbitrary superset of X with length k+1 in DB • S(X^{k+1}): the set that contains all X^{k+1} in DB • dbS(X^{k+1}): the set of transactions that each contain at least one X^{k+1} • SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM • SuFSM (Support-counted FSM): • Theorem 3. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML − k) < min_lmv, all supersets of X are infrequent

SuFSM (Support-counted FSM) • lmv(X)/k ≥ Sup(X) ≥ Sup(S(X^{k+1})) • Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2 • Each of the following conditions is sufficient to conclude that no superset of X is an SH-frequent itemset; by the chain above, each bound is no larger than the one before it, so it prunes at least as many candidates • lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv • lmv(X) + Sup(X) × MV × (ML − k) < min_lmv • lmv(X) + Sup(S(X^{k+1})) × MV × (ML − k) < min_lmv
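To see how the three bounds on this slide compare, the sketch below evaluates them on a small hypothetical database. It assumes that Sup(·) is an absolute transaction count and that Sup(S(X^{k+1})) counts the transactions containing X plus at least one further item; `toy_db` and `pruning_bounds` are names invented for this example.

```python
def pruning_bounds(x, db):
    """Return the FSM, Sup(X)-based and SuFSM upper bounds for itemset x.

    x  -- set of items
    db -- dict: transaction id -> {item: measure value}
    """
    k = len(x)
    ml = max(len(t) for t in db.values())                      # ML
    mv_max = max(v for t in db.values() for v in t.values())   # MV
    holders = [t for t in db.values() if x.issubset(t)]        # transactions containing X
    lmv_x = sum(t[item] for t in holders for item in x)
    sup_x = len(holders)
    # Transactions that contain X and at least one more item.
    sup_s = sum(1 for t in holders if len(t) > k)
    slack = mv_max * (ml - k)
    return {
        "FSM": lmv_x + (lmv_x / k) * slack,
        "Sup(X)": lmv_x + sup_x * slack,
        "SuFSM": lmv_x + sup_s * slack,
    }


# Hypothetical database.
toy_db = {
    "T1": {"A": 1, "B": 1, "C": 1},
    "T2": {"A": 2, "B": 1},
    "T3": {"A": 1, "B": 3, "D": 1, "E": 1},
    "T4": {"C": 2, "D": 2},
}
bounds = pruning_bounds({"A", "B"}, toy_db)
```

Here lmv({A, B}) = 9, Sup = 3 and Sup(S) = 2, so the bounds shrink from 36 to 27 to 21; a smaller bound falls below min_lmv more often and therefore prunes more.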

ShFSM (Share-counted FSM) • Theorem 4. If Tmv(dbS(X^{k+1})) < min_lmv, all supersets of X are infrequent • Pruning conditions compared: • FSM: lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv • SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML − k) < min_lmv • ShFSM: Tmv(dbS(X^{k+1})) < min_lmv

ShFSM (Share-counted FSM) • Ex. X = {AB} • Tmv(dbS(X^{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14.1 = min_lmv
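The ShFSM test sums whole-transaction measure values over the transactions that contain X together with at least one additional item. A minimal sketch, using a hypothetical database rather than the one on the slide, with the invented function name `shfsm_bound`:

```python
def shfsm_bound(x, db):
    """Tmv over the transactions that contain some (k+1)-superset of x.

    A transaction qualifies when it contains every item of x and has
    at least one further item.
    """
    k = len(x)
    return sum(
        sum(t.values())
        for t in db.values()
        if x.issubset(t) and len(t) > k
    )


# Hypothetical database: item -> measure value per transaction.
toy_db = {
    "T1": {"A": 1, "B": 1, "C": 1},
    "T2": {"A": 2, "B": 1},
    "T3": {"A": 1, "B": 3, "D": 1, "E": 1},
    "T4": {"C": 2, "D": 2},
}

total = sum(sum(t.values()) for t in toy_db.values())   # Tmv(DB) = 16
min_lmv = 0.6 * total                                   # hypothetical minShare = 60%
bound_ab = shfsm_bound({"A", "B"}, toy_db)              # tmv(T1) + tmv(T3) = 3 + 6
prune_supersets = bound_ab < min_lmv
```

No superset of X can collect more measure value than the qualifying transactions hold in total, so once that sum is below min_lmv every superset can be skipped.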

Experimental Results (1/3) minShare=0.3%

Experimental Results (2/3) minShare=0.3%

Experimental Results (3/3) • T6.I4.D100k.N200.S10 • minShare = 0.1% • ML=20

Direct Candidate Generation (DCG) Algorithm

Isolated Items Discarding Strategy (IIDS) for Utility Mining

IIDS (1/2) ShFSM minUtil=30%

IIDS (2/2) FUM minUtil=30%

Experimental Results (5/5) minUtil = 0.12%

Maximum Item Conflict First (MICF) Sanitization Method • Tdegree(Tq): the degree of conflict of a sensitive transaction Tq, i.e., the number of restrictive itemsets included in Tq • If Tdegree(Tq) > 1, Tq is a conflicting transaction
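The transaction conflict degree is straightforward to compute. A minimal sketch, assuming transactions and restrictive itemsets are plain Python sets; the example data and the function names are hypothetical.

```python
def tdegree(transaction, restrictive_itemsets):
    """Number of restrictive itemsets fully contained in the transaction."""
    return sum(1 for r in restrictive_itemsets if r <= transaction)


def is_conflicting(transaction, restrictive_itemsets):
    """A sensitive transaction is conflicting when its Tdegree exceeds 1."""
    return tdegree(transaction, restrictive_itemsets) > 1


# Hypothetical restrictive itemsets and transactions.
restrictive = [frozenset("DF"), frozenset("AB"), frozenset("CD")]
t_conflicting = {"A", "B", "D", "F"}   # contains {D, F} and {A, B}
t_single = {"C", "D", "E"}             # contains only {C, D}

degree_conflicting = tdegree(t_conflicting, restrictive)
degree_single = tdegree(t_single, restrictive)
```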

Idegree({D}, {D, F}, T05)=1 • Idegree({F}, {D, F}, T05)=0 • MaxIdegree: store the maximum value of the conflict degree among items in a transaction • MICF: select an item with MaxIdegree to delete in each iteration
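The "delete the most conflicting item first" idea can be illustrated with a greedy sketch. This is a simplified reading, not the MICF procedure itself: the slides do not spell out the definition of Idegree, so the code assumes an item's conflict count is the number of restrictive itemsets still contained in the transaction that include that item. It also handles a single transaction and keeps deleting until no restrictive itemset remains, which is a further assumption. The name `max_conflict_first` and the example data are hypothetical.

```python
def max_conflict_first(transaction, restrictive_itemsets):
    """Greedily delete the item shared by the most restrictive itemsets.

    Returns (sanitized transaction, list of deleted items in order).
    """
    remaining = set(transaction)
    removed = []
    while True:
        # Restrictive itemsets that the transaction still supports.
        present = [r for r in restrictive_itemsets if r <= remaining]
        if not present:
            return remaining, removed
        # Assumed conflict count: how many supported restrictive
        # itemsets contain each item.
        conflicts = {}
        for r in present:
            for item in r:
                conflicts[item] = conflicts.get(item, 0) + 1
        # Ties are broken alphabetically to keep the result deterministic.
        victim = max(sorted(conflicts), key=conflicts.get)
        remaining.remove(victim)
        removed.append(victim)


# Hypothetical example: D belongs to both restrictive itemsets.
restrictive = [frozenset("DF"), frozenset("AD")]
sanitized, deleted = max_conflict_first({"A", "D", "F", "G"}, restrictive)
```

Deleting D hides both restrictive itemsets with a single removal, whereas starting with A or F would require two deletions; this is the motivation for preferring the item with the highest conflict degree.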

Idegree({D}, {D, F}, T06)=1 • Idegree({F}, {D, F}, T06)=0
