Fast Algorithms for Mining Frequent Itemsets (挖掘頻繁項目集合之快速演算法研究) Ph.D. Dissertation Proposal Advisor: Prof. 張真誠 Student: 李育強 Dept. of Computer Science and Information Engineering, National Chung Cheng University Date: January 20, 2005
Outline • Introduction • Background and Related Work • A New FP-Tree for Mining Frequent Itemsets • Efficient Algorithms for Mining Share-Frequent Itemsets • Conclusions
Introduction • Data mining techniques have been developed to find small sets of precious nuggets in reams of data • Mining association rules is one of the most important data mining problems • It consists of two sub-problems • Identifying all frequent itemsets • Using these frequent itemsets to generate association rules • The first sub-problem plays the essential role in mining association rules • This proposal addresses both mining frequent itemsets and mining share-frequent itemsets
Background and Related Work • Support-Confidence Framework • Each item is a binary variable denoting whether the item was purchased • Apriori (Agrawal & Srikant, 1994) & Apriori-like algorithms • Pattern-growth algorithms (Han et al., 2000; Han et al., 2004) • Share-Confidence Framework (Carter et al., 1997) • The support-confidence framework does not analyze the exact number of products purchased • The support count does not measure the profit or cost of an itemset • Exhaustive search algorithms • Fast algorithms (but with errors)
Support-Confidence Framework (1/3) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
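The Apriori level-wise search named above can be sketched in a few lines. The following is a minimal, illustrative implementation (not the thesis's code); the transaction data in the test is hypothetical, chosen only to exercise the slide's minSup = 40% setting.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (frozensets) with support >= min_sup.

    min_sup is a fraction of |DB|, e.g. 0.4 for the slide's minSup = 40%.
    """
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}

    def support(itemset):
        # fraction of transactions that contain every item of `itemset`
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = {}
    level = {c for c in items if support(c) >= min_sup}
    k = 1
    while level:
        for c in level:
            frequent[c] = support(c)
        # candidate generation: join frequent k-itemsets, keep size k+1
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_sup}
        k += 1
    return frequent
```

The downward-closure property is what makes the subset check sound: any superset of an infrequent itemset is itself infrequent, so such candidates can be discarded before counting.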
Support-Confidence Framework (2/3) • FP-growth algorithm (Han et al., 2000; Han et al., 2004)
Support-Confidence Framework (3/3) Conditional FP-tree of “D” Conditional FP-tree of “BD”
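The conditional FP-tree construction illustrated above ("conditional FP-tree of D", then of "BD") can be sketched compactly. The following is a minimal FP-growth implementation for illustration only (it is the classic algorithm, not the thesis's NFP-tree variant), and the test data is hypothetical.

```python
from collections import defaultdict

class FPNode:
    __slots__ = ("item", "count", "parent", "children")
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def fp_growth(transactions, min_count):
    """Mine all itemsets with absolute support >= min_count."""
    # 1. count single items and discard infrequent ones
    counts = defaultdict(int)
    for t in transactions:
        for i in t:
            counts[i] += 1
    freq = {i: c for i, c in counts.items() if c >= min_count}
    if not freq:
        return {}
    # 2. insert each transaction, items sorted by descending frequency
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> its nodes (node links)
    for t in transactions:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for i in path:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    # 3. for each item, build its conditional pattern base (the prefix
    #    paths of its nodes) and recurse on the conditional FP-tree
    result = {}
    for item in sorted(freq, key=lambda i: (freq[i], i)):
        result[frozenset([item])] = sum(n.count for n in header[item])
        cond_db = []
        for n in header[item]:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * n.count)
        for suffix, sup in fp_growth(cond_db, min_count).items():
            result[suffix | {item}] = sup
    return result
```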
Share-Confidence Framework (1/6) • Measure value: mv(ip, Tq), the quantity of item ip purchased in transaction Tq • mv({D}, T01) = 1 • mv({C}, T03) = 3 • Transaction measure value: tmv(Tq) = Σ mv(ip, Tq), summed over all items ip in Tq • tmv(T02) = 9 • Total measure value: Tmv(DB) = Σ tmv(Tq), summed over all transactions Tq in DB • Tmv(DB) = 44 • Itemset measure value: imv(X, Tq) = Σ mv(ip, Tq), summed over all items ip in X, where X ⊆ Tq • imv({A, E}, T02) = 4 • Local measure value: lmv(X) = Σ imv(X, Tq), summed over all transactions Tq containing X • lmv({BC}) = 2 + 4 + 5 = 11
Share-Confidence Framework (2/6) • Itemset share: SH(X) = lmv(X)/Tmv(DB) • SH({BC}) = 11/44 = 25% • SH-frequent: if SH(X) ≥ minShare, X is a share-frequent (SH-frequent) itemset • minShare = 30%
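The measures defined on the previous two slides translate directly into code. Below is an illustrative sketch where a database is a dict mapping transaction IDs to {item: quantity} dicts; the database in the test is hypothetical (it is not the slides' example table, so the numbers differ from mv({D}, T01) = 1 etc.).

```python
def tmv(db, tid):
    """Transaction measure value: sum of measure values in one transaction."""
    return sum(db[tid].values())

def Tmv(db):
    """Total measure value of the whole database."""
    return sum(tmv(db, tid) for tid in db)

def imv(db, itemset, tid):
    """Itemset measure value of X in Tq; 0 unless Tq contains all of X."""
    if not all(i in db[tid] for i in itemset):
        return 0
    return sum(db[tid][i] for i in itemset)

def lmv(db, itemset):
    """Local measure value: imv(X, Tq) summed over transactions containing X."""
    return sum(imv(db, itemset, tid) for tid in db)

def SH(db, itemset):
    """Itemset share: lmv(X) / Tmv(DB)."""
    return lmv(db, itemset) / Tmv(db)
```

An itemset X is then SH-frequent exactly when `SH(db, X) >= min_share`.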
Share-Confidence Framework (3/6) • ZP (Zero Pruning) and ZSP (Zero Subset Pruning) • variants of exhaustive search • prune the candidate itemsets whose local measure values are exactly zero • SIP (Share Infrequent Pruning) • Apriori-like • with errors • CAC (Combine All Counted) and PCAC (Parametric CAC) • derived from ZSP, using a predict function • with errors • IAB (Item Add-Back) and PIAB (Parametric IAB) • join each share-frequent itemset with each 1-itemset • with errors • Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets
Share-Confidence Framework (4/6) ZP Algorithm SIP & IAB Algorithms
Share-Confidence Framework (5/6) ZSP Algorithm
Share-Confidence Framework (6/6) • PSH(XY) = SH(X) + SH(Y) × |db_X|/|DB|, if |db_X| < |db_Y| … (1) • PSH(XY) = SH(Y) + SH(X) × |db_Y|/|DB|, if |db_Y| < |db_X| … (2) • PSH(XY) = ((1) + (2))/2, if |db_X| = |db_Y| • PSH(AB) = (22.7% + 18.2% × 4/6 + 18.2% + 22.7% × 4/6)/2 = 34.1% • PSH(AE) = 9.1% + 22.7% × (2/6) = 16.7% < 30% • CAC Algorithm
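CAC's predict function above can be written as one small routine. This is a sketch of the three cases as the slide states them; the test reproduces the slide's two worked examples (SH(A) = 22.7%, SH(B) = 18.2%, SH(E) = 9.1%, |db_A| = |db_B| = 4, |db_E| = 2, |DB| = 6).

```python
def psh(sh_x, sh_y, db_x, db_y, db_size):
    """Predicted share of the joined itemset XY (CAC's predict function).

    sh_x, sh_y : shares SH(X), SH(Y)
    db_x, db_y : |db_X|, |db_Y|, the transaction counts of X and Y
    db_size    : |DB|
    """
    if db_x < db_y:                                  # case (1)
        return sh_x + sh_y * db_x / db_size
    if db_y < db_x:                                  # case (2)
        return sh_y + sh_x * db_y / db_size
    # |db_X| == |db_Y|: average of the two one-sided predictions
    return (sh_x + sh_y * db_x / db_size +
            sh_y + sh_x * db_y / db_size) / 2
```

When the predicted share PSH(XY) falls below minShare, CAC drops XY from the candidate set, which is why the method can miss itemsets (the "with errors" caveat above).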
A New FP-Tree for Mining Frequent Itemsets (1/3) • NFP-growth Algorithm • NFP-tree construction
A New FP-Tree for Mining Frequent Itemsets (3/3) Conditional NFP-tree of “D(3,4)”
Experimental Results (1/4) • PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional • All algorithms were coded in VC++ 6.0 • Datasets: • Real: BMS-WebView-1, BMS-WebView-2, Connect-4 • Artificial: generated by the IBM synthetic data generator
A Fast Algorithm for Mining Share-Frequent Itemsets • FSM: Fast Share Measure algorithm • ML: maximum transaction length in DB • MV: maximum measure value in DB • min_lmv = minShare × Tmv(DB) • Level Closure Property: given a minShare and a k-itemset X • Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent • Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent • Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv, all supersets of X are infrequent
FSM: Fast Share Measure algorithm • minShare = 30% • Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML − k) • Prune X if CF(X) < min_lmv • CF({ABC}) = 3 + (3/3) × 3 × (6 − 3) = 12 < 14 = min_lmv
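FSM's critical function CF(X) from Corollary 1 is simple enough to state as code. This sketch checks the pruning condition exactly as the slide defines it; the test reproduces the slide's example CF({ABC}) = 3 + (3/3) × 3 × (6 − 3) = 12 < 14.

```python
def cf(lmv_x, k, mv_max, ml):
    """Critical function CF(X): an upper bound on the local measure value
    of any superset of a k-itemset X (Corollary 1 of FSM).

    lmv_x  : lmv(X)
    k      : |X|
    mv_max : MV, the maximum measure value of any item in DB
    ml     : ML, the maximum transaction length in DB
    """
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

def prunable(lmv_x, k, mv_max, ml, min_lmv):
    """X and all of its supersets can be pruned when CF(X) < min_lmv."""
    return cf(lmv_x, k, mv_max, ml, ) < min_lmv
```

Because CF(X) bounds lmv of every superset of X, a candidate failing the test can be discarded without losing any SH-frequent itemset, unlike the SIP/CAC/IAB heuristics.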
Experimental Results (1/2) • T4.I2.D100k.N50.S10 • minShare = 0.8% • ML = 14
Efficient Algorithms for Mining Share-Frequent Itemsets • EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RC_{k-1}, EFSM joins each itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently • Reduces the time complexity of candidate generation from O(n^(2k−2)) to O(n^k)
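EFSM's candidate generation step can be sketched as a single set comprehension; the itemsets in the test are hypothetical.

```python
def efsm_join(rc_prev, rc_1):
    """EFSM candidate generation: join each surviving (k-1)-itemset with a
    single surviving 1-item, rather than joining pairs of (k-1)-itemsets.

    rc_prev : set of frozensets, the remaining candidates RC_{k-1}
    rc_1    : set of single items, the remaining candidates RC_1
    """
    return {x | frozenset([i])
            for x in rc_prev
            for i in rc_1
            if i not in x}
```

Joining against the single items of RC_1 keeps the join loop at |RC_{k-1}| × |RC_1| pairs, which is the source of the complexity improvement claimed above.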
Efficient Algorithms for Mining Share-Frequent Itemsets • Xk+1: an arbitrary superset of X with length k+1 in DB • S(Xk+1): the set that contains all Xk+1 in DB • dbS(Xk+1): the set of transactions of which each contains at least one Xk+1 • SuFSM and ShFSM derive from EFSM and prune the candidates more efficiently than FSM • SuFSM (Support-counted FSM): • Theorem 3. If lmv(X) + Sup(S(Xk+1)) × MV × (ML − k) < min_lmv, all supersets of X are infrequent
SuFSM (Support-counted FSM) • lmv(X)/k ≥ Sup(X) ≥ Sup(S(Xk+1)) ≥ maxSup(Xk+1) • Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}k+1)) = 2, maxSup(Xk+1) = 1 • If any of the following four inequalities holds, no superset of X is an SH-frequent itemset: • lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv • lmv(X) + Sup(X) × MV × (ML − k) < min_lmv • lmv(X) + Sup(S(Xk+1)) × MV × (ML − k) < min_lmv • lmv(X) + maxSup(Xk+1) × MV × (ML − k) < min_lmv
ShFSM (Share-counted FSM) • Theorem 4. If Tmv(dbS(Xk+1)) < min_lmv, all supersets of X are infrequent • FSM: lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv • SuFSM: lmv(X) + Sup(S(Xk+1)) × MV × (ML − k) < min_lmv • ShFSM: Tmv(dbS(Xk+1)) < min_lmv
ShFSM (Share-counted FSM) • Ex. X = {AB} • Tmv(dbS(Xk+1)) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
Experimental Results (2/4) minShare=0.3%
Experimental Results (3/4) minShare=0.3%
Experimental Results (4/4) • T6.I4.D100k.N200.S10 • minShare = 0.1% • ML=20
Conclusions • Support measure • NFP-growth uses two counters per tree node to reduce the number of tree nodes • It applies a smaller tree and header table to discover frequent itemsets efficiently • Future work: develop superior data structures and extend the pattern-growth approach
Share measure • The proposed algorithms efficiently decrease the number of candidates to be counted • ShFSM achieves the best performance • Future work: develop superior algorithms to accelerate the discovery of all SH-frequent itemsets