Sequential PAttern Mining using A Bitmap Representation
Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick
ACM SIGKDD 2002
Outline • Introduction • The SPAM Algorithm • Data Representation • Experimental Evaluation • Conclusion
Introduction
• Sequential pattern mining: analyzing a database of customers' purchase transactions to find whether there are ordering relationships between the items a customer buys over time.
• Input: a dataset over I = {a, b, c, d}, sorted by CID and TID. Output: frequent sequences such as ({b}, {d, c}, {c}).
Introduction (Frequent)
• Sa: a sequence. Example: ({a}, {c})
• supD(Sa): the support of Sa in D, i.e. the number of customer sequences in D that contain Sa. Example: for Sa = ({a}, {b, c}), supD(Sa) = 2.
• Given a support threshold minSup, a sequence Sa is called a frequent sequential pattern on D if supD(Sa) ≥ minSup.
• Example: with minSup = 2, Sa is deemed frequent. (The dataset contains one sequence per customer.)
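The support computation can be sketched in Python. `is_subsequence` and `support` are illustrative helpers, not the paper's pseudocode, and the dataset D below is a toy example rather than the slide's:

```python
# Sketch of the support computation. Helper names are ours; D is a
# made-up toy dataset (one sequence of itemsets per customer).

def is_subsequence(pattern, customer_seq):
    """True iff pattern (a list of itemsets) occurs, in order, within
    customer_seq, with each pattern itemset contained in some
    transaction's itemset."""
    i = 0
    for itemset in customer_seq:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(pattern, dataset):
    """Number of customer sequences in the dataset containing pattern."""
    return sum(is_subsequence(pattern, seq) for seq in dataset)

D = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}],
    [{"a", "d"}, {"c"}, {"b", "c"}],
    [{"a", "e", "f"}, {"a", "b"}],
]
Sa = [{"a"}, {"b", "c"}]
print(support(Sa, D))  # 2, so Sa is frequent for minSup = 2
```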
The SPAM Algorithm (Lexicographic Tree for Sequences)
Consider all sequences arranged in a sequence tree with the following structure:
• The root of the tree is labeled with ∅.
[Figure 1: The Lexicographic Sequence Tree for I = {a, b}: root {}, with child a; below a: the sequences a, a and a, b, and the itemset (a, b); below a, a: a, a, a; a, a, b; a, (a, b)]
The SPAM Algorithm (Lexicographic Tree for Sequences)
• Each sequence in the sequence tree can be considered as either a sequence-extended sequence or an itemset-extended sequence.
(Figure 1: The Lexicographic Sequence Tree, as above)
The SPAM Algorithm (Lexicographic Tree for Sequences)
• For example, let Sa = ({a, b, c}, {a, b}).
• S-step (sequence-extended sequence of Sa): ({a, b, c}, {a, b}, {d})
• I-step (itemset-extended sequence of Sa): ({a, b, c}, {a, b, d})
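A minimal sketch of the two extension steps, representing a sequence as a Python list of itemsets; `s_extend` and `i_extend` are our names, not the paper's:

```python
# Sketch of sequence-extension (S-step) and itemset-extension (I-step).
# Helper names are ours, not the paper's.

def s_extend(seq, item):
    """S-step: append a new itemset containing only `item`."""
    return seq + [{item}]

def i_extend(seq, item):
    """I-step: add `item` to the last itemset of the sequence."""
    return seq[:-1] + [seq[-1] | {item}]

Sa = [{"a", "b", "c"}, {"a", "b"}]
print(s_extend(Sa, "d"))  # the sequence ({a, b, c}, {a, b}, {d})
print(i_extend(Sa, "d"))  # the sequence ({a, b, c}, {a, b, d})
```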
The SPAM Algorithm (Lexicographic Tree for Sequences)
• Sn: the set of candidate items that are considered for possible S-step extensions of node n (abbreviated s-extensions).
• Example: S({a}) = {a, b, c, d}, giving the s-extensions a, a; a, b; a, c; a, d.
The SPAM Algorithm (Lexicographic Tree for Sequences)
• In: the set of candidate items that are considered for possible I-step extensions of node n (abbreviated i-extensions).
• Example: I({a}) = {b, c, d}, giving the i-extensions (a, b); (a, c); (a, d).
The SPAM Algorithm (Depth-First Tree Traversal)
• The tree is traversed in a standard depth-first manner.
• If supD(S) ≥ minSup, store S and recurse on S's children.
• If supD(S) < minSup, there is no need to recurse on S's children: by the Apriori principle, none of them can be frequent.
• The search space is nevertheless huge, which motivates pruning.
(Figure 1: The Lexicographic Sequence Tree, I = {a, b})
The SPAM Algorithm (Pruning)
• S-step pruning: Suppose we are at node ({a}) in the tree, with S({a}) = {a, b, c, d} and I({a}) = {b, c, d}, and suppose that ({a}, {c}) and ({a}, {d}) are not frequent. By the Apriori principle, ({a},{a},{c}), ({a},{b},{c}), ({a},{a,c}), ({a},{b,c}), ({a},{a},{d}), ({a},{b},{d}), ({a},{a,d}), and ({a},{b,d}) are not frequent either.
• Hence, at node ({a},{a}) or ({a},{b}), we do not have to perform I-steps or S-steps using items c and d, i.e. S({a},{a}) = S({a},{b}) = {a, b}, I({a},{a}) = {b}, and I({a},{b}) = ∅.
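The effect of S-step pruning on a child's candidate sets can be sketched as follows; `prune_s_step` and the `freq` predicate are illustrative stand-ins for the frequency tests SPAM actually performs:

```python
# Sketch of S-step pruning's effect on a child node's candidate sets.
# `prune_s_step` and `freq` are illustrative stand-ins, not SPAM's code.

def prune_s_step(S_n, I_n, is_frequent):
    """Items whose s-extension of the current node is infrequent are
    dropped from both candidate sets inherited by the children
    (Apriori principle)."""
    S_child = [i for i in S_n if is_frequent(i)]
    I_child = [i for i in I_n if is_frequent(i)]
    return S_child, I_child

# Slide example: at node ({a}), the s-extensions by c and d are infrequent.
freq = lambda item: item not in {"c", "d"}
print(prune_s_step(["a", "b", "c", "d"], ["b", "c", "d"], freq))
# (['a', 'b'], ['b']): matches S({a},{a}) = {a, b} and I({a},{a}) = {b}
```

(The further lexicographic filtering, e.g. I({a},{b}) keeping only items after b, is omitted from this sketch.)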
[Figure: the sequence tree under node ({a}) after S-step pruning; the subtrees below ({a},{c}) and ({a},{d}) and all extensions using c and d are removed]
The SPAM Algorithm (Pruning)
• I-step pruning (example): Consider the same node ({a}) described in the previous section. The possible itemset-extended sequences are ({a, b}), ({a, c}), and ({a, d}).
• If ({a, c}) is not frequent, then ({a, b, c}) must also not be frequent, by the Apriori principle.
• Hence, I({a, b}) = {d}, S({a, b}) = {a, b}, and S({a, d}) = {a, b}.
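I-step pruning amounts to the same Apriori filtering applied to the I-candidate set; `prune_i_step` is again an illustrative helper, not the paper's code:

```python
# Sketch: drop from the I-candidate set any item whose itemset-extension
# at the parent node was already found infrequent. Helper name is ours.

def prune_i_step(i_candidates, infrequent_items):
    """Apriori filtering of the I-candidate set."""
    return [i for i in i_candidates if i not in infrequent_items]

# Slide example: at node ({a, b}), the candidates after b are c and d;
# c is pruned because ({a, c}) is infrequent, so I({a, b}) = {d}.
print(prune_i_step(["c", "d"], {"c"}))  # ['d']
```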
[Figure: the sequence tree under node ({a}) after I-step pruning; the subtree below ({a, b, c}) is removed]
Data Representation (Data Structure) • Our algorithm uses a vertical bitmap representation of the data. • A vertical bitmap is created for each item in the dataset, and each bitmap has a bit corresponding to each transaction in the dataset. • If item i appears in transaction j, then the bit corresponding to transaction j of the bitmap for item i is set to one; otherwise, the bit is set to zero.
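Building the vertical bitmaps can be sketched as follows, using one Python integer per item as its bit vector. The transaction list is a made-up example, and the paper's actual layout additionally partitions and pads bits per customer, which is omitted here:

```python
# Sketch: one vertical bitmap per item, a Python int as the bit vector.
# Made-up transactions; per-customer partitioning/padding is omitted.
from collections import defaultdict

# (CID, TID, items) triples, sorted by CID and then TID.
transactions = [
    (1, 1, {"a", "b", "d"}),
    (1, 3, {"b", "c", "d"}),
    (2, 2, {"b"}),
    (2, 4, {"a", "b", "c"}),
]

bitmaps = defaultdict(int)
for j, (_, _, items) in enumerate(transactions):
    for item in items:
        bitmaps[item] |= 1 << j  # set bit j: item appears in transaction j

# Bit j of bitmaps[i] is one iff item i appears in transaction j.
print(format(bitmaps["a"], "04b"))  # '1001': item a in transactions 0 and 3
```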
Data Representation (Data Structure)
• If the size of a customer's sequence is between 2^k + 1 and 2^(k+1), we consider it as a 2^(k+1)-bit sequence.
• Example: for a sequence of size 3, k = 1, so it is stored as a 2^(1+1) = 4-bit slice.
• Dataset sorted by CID and TID, padded to 4 bits per customer:
CID 1: TIDs 1, 3, 6, (-)
CID 2: TIDs 2, 4, (-), (-)
CID 3: TIDs 5, 7, (-), (-)
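The width calculation can be sketched as follows, assuming sequences of size at least 2; `bitmap_width` is our helper name:

```python
# Sketch of the slice-width rule: a sequence of size m with
# 2**k + 1 <= m <= 2**(k+1) is stored in 2**(k+1) bits.
# Helper name is ours; assumes m >= 2.

def bitmap_width(m):
    """Smallest power of two (at least 2) that can hold m transactions."""
    return 1 << max(1, (m - 1).bit_length())

print(bitmap_width(3))  # 4: a size-3 sequence occupies a 4-bit slice
```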
Data Representation (Candidate Generation) S-step Process
• Let k be the index of the first bit of B(sa) with value one.
• We first generate a bitmap from B(sa) such that all bits less than or equal to k are set to zero, and all bits after k are set to one.
• We call this bitmap a transformed bitmap.
• We then AND the transformed bitmap with the item bitmap.
• The resultant bitmap has the properties described above, and it is exactly the bitmap for the generated sequence.
Data Representation (Candidate Generation) S-step Process
(Figure: S-step example, showing B(sa), the transformed bitmap, and the AND with the item bitmap producing the result)
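The S-step on a single customer's 4-bit slice can be sketched as follows. Assumptions: bit 0 is the customer's earliest transaction, and `s_step` and `WIDTH` are our names, not the paper's:

```python
# Sketch of the S-step on one customer's 4-bit slice. Bit 0 is assumed
# to be the earliest transaction; helper names are ours.
WIDTH = 4

def s_step(seq_bitmap, item_bitmap):
    """Transform B(sa): clear all bits <= k and set all bits after k,
    where k is the index of the first set bit; then AND with the
    item's bitmap."""
    k = (seq_bitmap & -seq_bitmap).bit_length() - 1  # lowest set bit
    transformed = ((1 << WIDTH) - 1) & ~((1 << (k + 1)) - 1)
    return transformed & item_bitmap

B_a = 0b0101  # item a in this customer's 1st and 3rd transactions
B_b = 0b0110  # item b in the 2nd and 3rd transactions
print(format(s_step(B_a, B_b), "04b"))  # '0110': ({a},{b}) ends at txns 2, 3
```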
Data Representation (Candidate Generation) I-step Process
• The result is obtained by ANDing B(sa) directly with the item's bitmap; no transformed bitmap is needed.
(Figure: I-step example)
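A sketch of the I-step on one 4-bit slice: since the added item must occur in the same transactions where the current sequence already ends, a plain AND of the two bitmaps suffices:

```python
# Sketch: the I-step is a plain AND of B(sa) with the item's bitmap,
# on one customer's 4-bit slice (bit order as in the S-step sketch).
B_seq = 0b0110   # transactions where the current sequence can end
B_item = 0b0100  # transactions containing the candidate item
B_result = B_seq & B_item
print(format(B_result, "04b"))  # '0100'
```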
Experimental Evaluation (Comparison with SPADE and PrefixSpan)
• We compared the three algorithms on several small, medium, and large datasets for various minimum support values.
• This set of tests shows that SPAM outperforms SPADE by about a factor of 2.5 on small datasets and by more than an order of magnitude on reasonably large datasets.
• PrefixSpan outperforms SPAM slightly on very small datasets, but on large datasets SPAM outperforms PrefixSpan by over an order of magnitude.
Conclusion
• We presented an algorithm to quickly find all frequent sequences in a list of transactions.
• The algorithm utilizes a depth-first traversal of the search space combined with a vertical bitmap representation to store each sequence.
• Experimental results demonstrate that our algorithm outperforms SPADE and PrefixSpan on large datasets.