Efficient Algorithms for Mining Share-Frequent Itemsets Authors: Y. C. Li, J. S. Yeh and C. C. Chang Speaker: Yu-Chiang Li Date :July 28, 2005
Outline • Introduction • Related Work • Enhanced Fast Share Measure (EFSM) Algorithm • Support-Counted Fast Share Measure (SuFSM) Algorithm • Share-Counted Fast Share Measure (ShFSM) Algorithm • Experimental Results • Conclusions
Introduction (1/2) • Goal: discovering the buying patterns of customers • Itemset: a group of items (products) bought together in a transaction • Support: the ratio of the number of transactions containing the itemset to the total number of transactions; it ignores purchase quantities, so its informative feedback is limited • Share: the ratio of the total count of the itemset's items, summed over the transactions containing the itemset, to the total count of all items in the database
Introduction (2/2) • Share-confidence framework: provides useful information about numerical values associated with transaction items (Carter et al., 1997) • Share-frequent (SH-frequent) itemset: usually includes some infrequent subsets • The Fast Share Measure (FSM) algorithm discovers share-frequent itemsets efficiently on small datasets • This study proposes Enhanced FSM (EFSM), SuFSM and ShFSM to discover share-frequent itemsets more efficiently than FSM
Related Work • Support-Confidence Framework (Agrawal et al., 1993) • Each item is a binary variable denoting whether the item was purchased • Apriori (Agrawal & Srikant, 1994) & Apriori-like algorithms • Pattern-growth algorithms (Han et al., 2000; Han et al., 2004) • Share-Confidence Framework (Carter et al., 1997) • The support-confidence framework does not analyze the exact number of products purchased • The support count method does not measure the profit or cost of an itemset • Exhaustive search algorithm (Carter et al., 2000) • FSM algorithm (Li et al., 2005)
Related Work Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
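Share-Confidence Framework • Measure value: mv(ip, Tq), the numerical value (e.g. purchased quantity) of item ip in transaction Tq • mv({D}, T01) = 1 • mv({C}, T03) = 3 • Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$ • tmv(T02) = 9 • Total measure value: $Tmv(DB) = \sum_{T_q \in DB} tmv(T_q)$ • Tmv(DB) = 44 • Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X} mv(i_p, T_q)$, for $X \subseteq T_q$ • imv({A, E}, T02) = 4 • Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions containing X • lmv({BC}) = 2+4+5 = 11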
Itemset share: $SH(X) = lmv(X) / Tmv(DB)$ • SH({BC}) = 11/44 = 25% • SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset • Ex. minShare = 30%: {BC} is not SH-frequent, since 25% < 30%
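The measure definitions above can be sketched in a few lines of Python. The database below is hypothetical (the table from the slides is not reproduced in this transcript), and the function names simply mirror the slide notation; this is an illustration, not the authors' implementation.

```python
from fractions import Fraction

# Hypothetical database (NOT the one shown on the slides):
# transaction id -> {item: measure value, here the purchased quantity}
DB = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3, "D": 2},
    "T03": {"A": 2, "C": 1, "D": 1},
    "T04": {"B": 3, "C": 2, "E": 1},
}

def tmv(tid):
    """Transaction measure value: sum of all measure values in one transaction."""
    return sum(DB[tid].values())

def total_mv():
    """Total measure value Tmv(DB): sum of tmv over all transactions."""
    return sum(tmv(tid) for tid in DB)

def imv(itemset, tid):
    """Itemset measure value: sum over the itemset's items, 0 if not contained."""
    items = DB[tid]
    if not set(itemset).issubset(items):
        return 0
    return sum(items[i] for i in itemset)

def lmv(itemset):
    """Local measure value: imv summed over the transactions containing the itemset."""
    return sum(imv(itemset, tid) for tid in DB)

def share(itemset):
    """Itemset share SH(X) = lmv(X) / Tmv(DB), kept exact with Fraction."""
    return Fraction(lmv(itemset), total_mv())

print(lmv({"B", "C"}), total_mv(), share({"B", "C"}))
```

On this toy database {B, C} appears in three of four transactions, so its support is 75%, while its share is 12/20 = 60%; the two measures generally differ because share weights each occurrence by quantity.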
Existing algorithms • ZP (Zero Pruning), ZSP (Zero Subset Pruning) • Variants of exhaustive search • Prune the candidate itemsets whose local measure values are exactly zero • FSM (Fast Share Measure) (Li et al., 2005) • Fast on a small dataset • Generates too many candidates • Existing algorithms are inefficient on large datasets
FSM: Fast Share Measure Algorithm • ML: maximum transaction length in DB • MV: maximum measure value in DB • k: length of itemset X • Let min_lmv = minShare × Tmv • Let CF_FSM(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k) • If CF_FSM(X) < min_lmv, all supersets of X are infrequent
FSM: Fast Share Measure Algorithm • minShare = 30%, ML = 6, MV = 3, Tmv = 44 • min_lmv = 14 (30% × 44 = 13.2, rounded up because measure values are integers) • Prune X if CF_FSM(X) < min_lmv • Let X = {A B C} • CF_FSM(X) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14 = min_lmv
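The FSM pruning test can be checked numerically with a minimal sketch. The function names are illustrative, and rounding the threshold up to an integer is an assumption made here to match the value 14 on the slide.

```python
import math
from fractions import Fraction

def min_lmv(min_share, tmv_db):
    """Smallest integer lmv that reaches the share threshold.
    Assumption: measure values are integers, so the threshold is rounded up."""
    return math.ceil(min_share * tmv_db)

def cf_fsm(lmv_x, k, mv_max, ml_max):
    """FSM critical function: lmv(X) + (lmv(X)/k) * MV * (ML - k)."""
    return lmv_x + Fraction(lmv_x, k) * mv_max * (ml_max - k)

# Values from the slide: minShare = 30%, ML = 6, MV = 3, Tmv = 44, X = {A B C}
threshold = min_lmv(Fraction(30, 100), 44)
bound = cf_fsm(lmv_x=3, k=3, mv_max=3, ml_max=6)

# X and all of its supersets are pruned when the bound falls below the threshold
print(threshold, bound, bound < threshold)
```

Using `Fraction` rather than floats avoids rounding surprises when the threshold is computed from a percentage.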
Enhanced FSM (EFSM) Algorithm • EFSM: instead of joining any two itemsets in RC_{k-1}, EFSM joins each itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently • Reduces the time complexity from O(n^{2k-2}) to O(n^k)
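A minimal sketch of this join step follows. The name `efsm_join` and the trick of only appending items that sort after the largest item already in the itemset (to avoid generating the same candidate twice) are assumptions made for illustration; the slides do not spell out these details.

```python
def efsm_join(rc_prev, rc_1):
    """Generate candidate k-itemsets by extending each (k-1)-itemset in
    rc_prev with one single item from rc_1.

    rc_prev: iterable of frozensets, each of size k-1
    rc_1:    iterable of single items that survived pruning at level 1
    """
    items = sorted(rc_1)
    candidates = set()
    for itemset in rc_prev:
        last = max(itemset)
        for item in items:
            # Assumed ordering rule: only extend with items sorting after
            # the largest member, so each candidate is produced once.
            if item > last:
                candidates.add(itemset | {item})
    return candidates

rc_1 = ["A", "B", "C"]
rc_2 = [frozenset("AB"), frozenset("AC")]
c_3 = efsm_join(rc_2, rc_1)
print(c_3)
```

Each of the (k-1)-itemsets is paired with at most n single items, rather than with every other (k-1)-itemset, which is where the reduction in join cost comes from.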
SuFSM (Support-counted FSM) • X^{k+1}: an arbitrary superset of X with length k+1 in DB • S(X^{k+1}): the set of all X^{k+1} in DB • dbS(X^{k+1}): the set of transactions that each contain at least one X^{k+1} • SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM • SuFSM (Support-counted FSM): • Theorem 1. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent
SuFSM (Support-counted FSM) • lmv(X)/k >= Sup(X) >= Sup(S(X^{k+1})) • Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2 • No superset of X is an SH-frequent itemset if any of the following three inequalities holds: • lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv • lmv(X) + Sup(X) × MV × (ML - k) < min_lmv • lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
ShFSM (Share-counted FSM) • dbS(X^{k+1}): the set of transactions that each contain at least one X^{k+1} • ShFSM (Share-counted FSM): • Theorem 2. If Tmv(dbS(X^{k+1})) < min_lmv, all supersets of X are infrequent • FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv • SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv • ShFSM: Tmv(dbS(X^{k+1})) < min_lmv • CF_FSM(X) >= CF_SuFSM(X) >= CF_ShFSM(X)
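The ordering of the three bounds can be justified with a short derivation. It assumes that every measure value is a positive integer no larger than MV, that every transaction holds at most ML items, and that Sup denotes an absolute support count; $db_X$ (the transactions containing X) is notation introduced here for convenience.

```latex
\begin{align*}
Tmv\bigl(dbS(X^{k+1})\bigr)
  &= \sum_{T_q \in dbS(X^{k+1})}
     \Bigl( imv(X, T_q) + \sum_{i_p \in T_q \setminus X} mv(i_p, T_q) \Bigr) \\
  &\le lmv(X) + Sup\bigl(S(X^{k+1})\bigr) \times MV \times (ML - k) \\
  &\le lmv(X) + Sup(X) \times MV \times (ML - k) \\
  &\le lmv(X) + \frac{lmv(X)}{k} \times MV \times (ML - k)
\end{align*}
```

The first inequality holds because $dbS(X^{k+1}) \subseteq db_X$ and each transaction has at most $ML - k$ items outside X, each worth at most MV. The second holds because every transaction containing a superset of X also contains X. The third holds because each transaction containing X contributes at least k to lmv(X), so $k \times Sup(X) \le lmv(X)$.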
FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv • SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv • ShFSM: Tmv(dbS(X^{k+1})) < min_lmv • Ex. X = {BCD} • CF_FSM(X) = 9 + (9/3) × 3 × (6 - 3) = 36 • CF_SuFSM(X) = 9 + 2 × 3 × (6 - 3) = 27 • CF_ShFSM(X) = 6 + 8 = 14
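The three critical values for X = {BCD} can be evaluated directly from the expressions on the slide. This is a minimal sketch with illustrative function names; the inputs (lmv = 9, k = 3, MV = 3, ML = 6, support count of supersets = 2, transaction values 6 and 8, min_lmv = 14) are taken from the slides.

```python
def cf_fsm(lmv_x, k, mv_max, ml_max):
    """FSM bound: lmv(X) + (lmv(X)/k) * MV * (ML - k)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml_max - k)

def cf_sufsm(lmv_x, sup_supersets, k, mv_max, ml_max):
    """SuFSM bound: lmv(X) + Sup(S(X^{k+1})) * MV * (ML - k)."""
    return lmv_x + sup_supersets * mv_max * (ml_max - k)

def cf_shfsm(tmvs_of_superset_transactions):
    """ShFSM bound: total measure value of the transactions in dbS(X^{k+1})."""
    return sum(tmvs_of_superset_transactions)

MIN_LMV = 14

fsm = cf_fsm(9, 3, 3, 6)
sufsm = cf_sufsm(9, 2, 3, 3, 6)
shfsm = cf_shfsm([6, 8])

for name, value in [("FSM", fsm), ("SuFSM", sufsm), ("ShFSM", shfsm)]:
    print(name, value, "prune" if value < MIN_LMV else "keep")
```

Evaluated as written, the expressions give 36, 27 and 14, so none of the three tests prunes {BCD} here, but the ShFSM bound sits exactly at the threshold and is by far the tightest.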
ShFSM (Share-counted FSM) • Ex. X = {AB} • Tmv(dbS(X^{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
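The ShFSM test can be sketched against a small database. The data and the threshold below are hypothetical (the slide's table is not available in this transcript), and the helper names are illustrative only.

```python
# Hypothetical database: transaction id -> {item: measure value}
DB = {
    "T01": {"A": 1, "B": 2, "C": 3},
    "T02": {"A": 2, "B": 1},
    "T03": {"B": 1, "C": 2, "D": 1},
    "T04": {"A": 1, "B": 1, "D": 2, "E": 2},
}
MIN_LMV = 14  # assumed threshold for this toy example

def superset_transactions(db, itemset):
    """dbS(X^{k+1}): transactions holding X plus at least one further item."""
    x = set(itemset)
    # x < set(t) is a proper-subset test, so a transaction equal to X is excluded
    return {tid: t for tid, t in db.items() if x < set(t)}

def sup_supersets(db, itemset):
    """Sup(S(X^{k+1})) as an absolute count of transactions."""
    return len(superset_transactions(db, itemset))

def cf_shfsm(db, itemset):
    """Tmv(dbS(X^{k+1})): total measure value of the superset transactions."""
    return sum(sum(t.values()) for t in superset_transactions(db, itemset).values())

x = {"A", "B"}
bound = cf_shfsm(DB, x)
print(sup_supersets(DB, x), bound, "prune" if bound < MIN_LMV else "keep")
```

Here T02 contains exactly {A, B} and therefore cannot host any longer superset, so only T01 and T04 count; their combined value of 12 is below the assumed threshold, and no superset of {A, B} needs to be generated.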
Experimental Results (1/3) • PC: Pentium IV 1.5 GHz, 1.5 GB SDRAM, running Windows XP Professional • All algorithms were coded in VC++ 6.0 Figure 1 Figure 2
Experimental Results (2/3) minShare=0.1% Figure 3 Figure 4
Experimental Results (3/3) • Dataset: T6.I4.D100k.N200.S10 • minShare = 0.1% • ML = 20, MV = 10 • Tmv = 2,302,443
Conclusions • This study proposes the Enhanced FSM (EFSM) algorithm to efficiently reduce the time complexity of the join step • We have also developed SuFSM and ShFSM from EFSM • SuFSM and ShFSM can efficiently prune the candidates, and significantly improve the performance • The experimental results have indicated that ShFSM has the best performance • In the future, we plan to develop even more advanced algorithms to accelerate the process of identifying all share-frequent itemsets