PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu. 17th ICDE’01. Group members: 沈郁棋, 簡志佳, 吳永斌, 曾文彥, 沈家譽.
Outline • Introduction • FreeSpan • PrefixSpan algorithm • Improvement of PrefixSpan • Experimental results and performance • Conclusion • References • Discussions
Introduction • What is a sequence and sequential pattern mining? • Applications of sequential pattern mining • Previous method: • GSP algorithm • FreeSpan algorithm • New method proposed in this paper: • PrefixSpan
Sequence and subsequence • A sequence is an ordered list of itemsets. • Each sequence consists of a list of elements, and each element consists of a set of items. • Example: α = <a(abc)(ac)d(cf)> is a sequence with five elements: a, (abc), (ac), d, and (cf). • An element may contain one item or many items. Items within an element are unordered, and we list them alphabetically.
Sequence and subsequence (Cont.) • If α = <a(abc)(ac)d(cf)> is a sequence, then β = <a(bc)df> is a subsequence of α. • The number of items in a sequence is the length of the sequence. For example, the length of α is 9. • [Figure: transaction database vs. sequence database]
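The subsequence relation and the length of a sequence can be checked with a few lines of Python. This is only an illustrative sketch: the representation (each element written as a string of items) and the names `is_subsequence` and `length` are assumptions made here, not part of the slides.

```python
def is_subsequence(beta, alpha):
    """True if every element of beta is contained, in order, in a distinct element of alpha."""
    i = 0
    for elem in alpha:
        # Greedy matching: map the next unmatched element of beta to the
        # earliest element of alpha that contains all of its items.
        if i < len(beta) and set(beta[i]) <= set(elem):
            i += 1
    return i == len(beta)


def length(seq):
    """Length of a sequence = total number of items over all elements."""
    return sum(len(elem) for elem in seq)


alpha = ["a", "abc", "ac", "d", "cf"]   # <a(abc)(ac)d(cf)>
beta = ["a", "bc", "d", "f"]            # <a(bc)df>
print(is_subsequence(beta, alpha), length(alpha))
```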
Sequential Pattern Mining • Proposed by R. Agrawal and R. Srikant in 1995 • Given a user-specified minimum support threshold, sequential pattern mining is to find all of the frequent subsequences. • Applications: • Analyses of customer purchase behavior • Web access pattern • Scientific experiments • Disease treatments • Natural disasters • DNA sequences
GSP Algorithm
• An Apriori-like method
• Example (min_sup = 2):
SID | Sequence
10 | <(bd)cb(ac)>
20 | <(bf)(ce)b(fg)>
30 | <(ah)(bf)abf>
40 | <(be)(ce)d>
50 | <a(bd)bcb(ade)>
• 1st scan: 8 candidates <a> <b> <c> <d> <e> <f> <g> <h>; 6 length-1 sequential patterns (seed set)
• 2nd scan: 51 candidates <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>; 19 length-2 sequential patterns (seed set)
• 3rd scan: 46 candidates <abb> <aab> <aba> <baa> <bab> …; 19 length-3 sequential patterns
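Disadvantages of GSP • Potentially huge set of candidate sequences • An Apriori-based method may generate a large set of candidate sequences, because the candidates cover all possible permutations of the elements. • For example, if there are 1000 frequent sequences of length-1, <a1>, <a2>, …, <a1000>, the number of length-2 candidates will be 1000 × 1000 + (1000 × 999) / 2 = 1,499,500. The first term is derived from the set <a1a1>, <a1a2>, …, <a1000a1000>, and the second term is derived from the set <(a1a2)>, <(a1a3)>, …, <(a999a1000)>.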
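The candidate count can be reproduced with a short calculation. The function name `length2_candidates` is hypothetical; the formula assumes every ordered pair <xy> (including <xx>) and every unordered pair <(xy)> of frequent items becomes a length-2 candidate.

```python
def length2_candidates(n):
    """Number of length-2 candidates generated from n frequent length-1 sequences."""
    ordered_pairs = n * n            # <xy> for every x and y, including <xx>
    itemset_pairs = n * (n - 1) // 2  # <(xy)> for every unordered pair x != y
    return ordered_pairs + itemset_pairs


print(length2_candidates(6))     # the GSP example with 6 frequent items
print(length2_candidates(1000))  # the 1000-item case
```

With 6 frequent items this gives the 51 candidates of the second scan in the GSP example.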
Disadvantages of GSP (Cont.) • Multiple scans of databases • The length of each candidate sequence grows by one at each database scan. • To find <(abc) (abc) (abc) (abc) (abc)>, GSP must scan the database at least 15 times.
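Disadvantages of GSP (Cont.) • Difficulties at mining long sequential patterns • Suppose the database contains only a single sequence of length 100, <a1a2…a100>, and min_sup = 1. • length-1 candidate sequences: 100 • length-2 candidate sequences: 100 × 100 + (100 × 99) / 2 = 14,950 • length-3 candidate sequences: C(100, 3) = 161,700 • Total: more than C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 - 1 ≈ 10^30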
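The growth in the number of candidates for a single long sequence can be checked numerically. The sketch below assumes the sequence consists of 100 distinct items, as in the PrefixSpan paper's discussion of this case; the variable names are illustrative only.

```python
from math import comb

n = 100  # number of distinct items in the single sequence

length1 = n                              # one candidate per item
length2 = n * n + n * (n - 1) // 2       # <xy> pairs plus <(xy)> pairs
length3 = comb(n, 3)                     # conservative count used on the slide
lower_bound = sum(comb(n, i) for i in range(1, n + 1))  # equals 2**100 - 1

print(length1, length2, length3)
print(f"{lower_bound:.3e}")              # order of magnitude of the total
```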
FreeSpan : Frequent Pattern-Projected Sequential Pattern Mining • A divide-and-conquer approach • Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns. • Mining each projected database to find its patterns.
FreeSpan • Finding f_list: find the set of frequent items and list them in support-descending order. • SID: sequence_id • Items: a, b, c, d, e, f, g • (item, support): (a,4), (b,4), (c,4), (d,3), (e,3), (f,3), (g,1) • Support threshold: 2 • f_list = a:4, b:4, c:4, d:3, e:3, f:3
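A minimal sketch of the f_list computation follows. It assumes the sequence database is the four-sequence example that appears in the later FreeSpan slides, writes each element as a string of items, and breaks ties alphabetically; the name `f_list` as a function is a choice made here.

```python
from collections import Counter

# Assumed example database (SID -> sequence).
DB = {
    10: ["a", "abc", "ac", "d", "cf"],       # <a(abc)(ac)d(cf)>
    20: ["ad", "c", "bc", "ae"],             # <(ad)c(bc)(ae)>
    30: ["ef", "ab", "df", "c", "b"],        # <(ef)(ab)(df)cb>
    40: ["e", "g", "af", "c", "b", "c"],     # <eg(af)cbc>
}


def f_list(db, min_sup):
    """Frequent items with their supports, in support-descending order."""
    support = Counter()
    for seq in db.values():
        # An item counts once per sequence, however often it occurs in it.
        support.update(set("".join(seq)))
    frequent = [(item, n) for item, n in support.items() if n >= min_sup]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(f_list(DB, min_sup=2))
```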
FreeSpan • f_list = a:4, b:4, c:4, d:3, e:3, f:3 • The complete set of sequential patterns in the sequence database can be divided into 6 disjoint subsets: • The ones containing only item a. • The ones containing item b but no items after b in f_list. • The ones containing item c but no items after c in f_list. • The ones containing item d but no items after d in f_list. • The ones containing item e but no items after e in f_list. • The ones containing item f.
FreeSpan • Finding sequential patterns containing only item a. • By scanning the sequence database once, the only two sequential patterns are found: <a> and <aa>. • <a> = <(a)>: support 4 > threshold 2 (frequent) • <aa>: support 2 = threshold 2 (frequent) • <aaa>: support 1 < threshold 2 (not frequent)
FreeSpan
• Finding sequential patterns containing item b but no item after b in f_list.
• This can be achieved by constructing the {b}-projected database:
• Step 1: find the sequences in the sequence database that contain item b.
• Step 2: remove all items after b in f_list (a:4, b:4, c:4, d:3, e:3, f:3).
Sequence containing b | After removal
<a(abc)(ac)d(cf)> | <a(ab)a>
<(ad)c(bc)(ae)> | <aba>
<(ef)(ab)(df)cb> | <(ab)b>
<eg(af)cbc> | <ab>
FreeSpan • {b}-projected database: <a(ab)a>, <aba>, <(ab)b>, <ab> • By scanning the projected database once more, all sequential patterns containing item b but no item after b in f_list are found: <b>, <ab>, <ba>, <(ab)> • Support=3 • Then the other subsets of sequential patterns are found in the same way.
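The construction of the {b}-projected database can be sketched as follows. This is an illustration under assumptions, not FreeSpan's actual implementation: elements are written as strings of items, and the function name `project` is hypothetical.

```python
DB = [
    ["a", "abc", "ac", "d", "cf"],       # <a(abc)(ac)d(cf)>
    ["ad", "c", "bc", "ae"],             # <(ad)c(bc)(ae)>
    ["ef", "ab", "df", "c", "b"],        # <(ef)(ab)(df)cb>
    ["e", "g", "af", "c", "b", "c"],     # <eg(af)cbc>
]
F_LIST = ["a", "b", "c", "d", "e", "f"]


def project(db, f_list, item):
    """Keep sequences containing `item`; drop every item after it in f_list."""
    keep = set(f_list[: f_list.index(item) + 1])
    projected = []
    for seq in db:
        if not any(item in elem for elem in seq):
            continue  # the sequence does not contain the item at all
        reduced = ["".join(x for x in elem if x in keep) for elem in seq]
        projected.append([elem for elem in reduced if elem])  # drop emptied elements
    return projected


for seq in project(DB, F_LIST, "b"):
    print(seq)
```

Infrequent items such as g are removed as well, because they do not appear in f_list at all.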
FreeSpan • FreeSpan is more efficient than GSP. • The major cost of FreeSpan is dealing with projected databases. If a pattern appears in every sequence of a database, its projected database does not shrink. • Example: with f_list = a:4, b:4, c:4, d:3, e:3, f:3, the {f}-projected database is <a(abc)(ac)d(cf)>, <(ef)(ab)(df)cb>, <e(af)cbc>.
PrefixSpan (Prefix-Projected Sequential Pattern Growth)
Prefix • Given two sequences α = <a1a2…an> and β = <b1b2…bm>, m ≤ n. • Sequence β is called a prefix of α if and only if: • bi = ai for i ≤ m-1; • bm ⊆ am; • all items in (am - bm) are alphabetically after those in bm. • Example: α = <a(abc)(ac)d(cf)>, β = <a(abc)a>
Postfix (Projection) • Let α’ = <a1a2…an> be the projection of α w.r.t. prefix β = <a1a2…am-1a’m> (m ≤ n). • Sequence γ = <a’’mam+1…an> is called the postfix of α w.r.t. prefix β, denoted as γ = α/β, where a’’m = (am - a’m). • We also denote α = β⋅γ. • Example: α’ = <a(abc)(ac)d(cf)>, β = <a(abc)a>, γ = <(_c)d(cf)>.
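The prefix and postfix definitions can be exercised with a small sketch. It assumes elements are written as alphabetically sorted strings of items, and it includes the condition from the PrefixSpan paper that the items left over in the last element must come alphabetically after those in the prefix. The names `is_prefix`, `postfix`, and `show` are illustrative.

```python
def is_prefix(beta, alpha):
    """True if beta is a prefix of alpha in the sense defined above."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    # The first m-1 elements must be identical.
    if any(set(b) != set(a) for b, a in zip(beta[:-1], alpha)):
        return False
    b_last, a_last = set(beta[-1]), set(alpha[m - 1])
    if not b_last <= a_last:
        return False
    # Remaining items of the m-th element must sort after the prefix items.
    return all(x > max(b_last) for x in a_last - b_last)


def postfix(alpha, beta):
    """Postfix of alpha w.r.t. prefix beta, or None if beta is not a prefix."""
    if not is_prefix(beta, alpha):
        return None
    m = len(beta)
    rest = "".join(sorted(set(alpha[m - 1]) - set(beta[-1])))
    # A leading "_" marks an element whose first items belong to the prefix.
    return (["_" + rest] if rest else []) + list(alpha[m:])


def show(seq):
    return "<" + "".join(e if len(e) == 1 else "(" + e + ")" for e in seq) + ">"


alpha = ["a", "abc", "ac", "d", "cf"]
print(show(postfix(alpha, ["a", "abc", "a"])))
```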
PrefixSpan – Algorithm • Input: a sequence database S and the minimum support threshold min_support. • Output: the complete set of sequential patterns. • Method: call PrefixSpan(<>, 0, S). • Subroutine: PrefixSpan(α, L, S|α). • Parameters: • α: a sequential pattern; • L: the length of α; • S|α: the α-projected database if α ≠ <>, otherwise the sequence database S.
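The recursion can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: instead of copying postfixes, each projected database is kept as a map from a sequence index to the earliest element where the prefix ends, and the example database is assumed to be the four-sequence database from the FreeSpan slides. All function and variable names are choices made here.

```python
from collections import defaultdict

DB = [
    ["a", "abc", "ac", "d", "cf"],       # <a(abc)(ac)d(cf)>
    ["ad", "c", "bc", "ae"],             # <(ad)c(bc)(ae)>
    ["ef", "ab", "df", "c", "b"],        # <(ef)(ab)(df)cb>
    ["e", "g", "af", "c", "b", "c"],     # <eg(af)cbc>
]


def fmt(pattern):
    """Render a pattern such as [{'a'}, {'a', 'b'}] as '<a(ab)>'."""
    parts = ["".join(sorted(e)) for e in pattern]
    return "<" + "".join(p if len(p) == 1 else "(" + p + ")" for p in parts) + ">"


def prefixspan(db, min_sup):
    seqs = [[frozenset(e) for e in s] for s in db]
    patterns = {}

    def grow(prefix, proj):
        # proj: sequence index -> index of the earliest element that can
        # hold the last itemset of prefix (-1 for the empty prefix).
        last = prefix[-1] if prefix else None
        s_ext = defaultdict(dict)  # item appended as a new element
        i_ext = defaultdict(dict)  # item added to the last element
        for sid, end in proj.items():
            seq = seqs[sid]
            for j in range(end + 1, len(seq)):
                for x in seq[j]:
                    s_ext[x].setdefault(sid, j)  # keeps the earliest j
            if last is not None:
                top = max(last)
                for j in range(end, len(seq)):
                    if last <= seq[j]:
                        for x in seq[j]:
                            if x > top:
                                i_ext[x].setdefault(sid, j)
        for x in sorted(s_ext):
            if len(s_ext[x]) >= min_sup:
                new = prefix + [frozenset(x)]
                patterns[fmt(new)] = len(s_ext[x])
                grow(new, s_ext[x])
        for x in sorted(i_ext):
            if len(i_ext[x]) >= min_sup:
                new = prefix[:-1] + [last | {x}]
                patterns[fmt(new)] = len(i_ext[x])
                grow(new, i_ext[x])

    grow([], {sid: -1 for sid in range(len(seqs))})
    return patterns


patterns = prefixspan(DB, min_sup=2)
print({p: s for p, s in patterns.items() if p.startswith("<a") and len(p) <= 6})
```

No candidate sequences are generated: every extension that is counted actually occurs in some projected sequence.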
PrefixSpan - Example • Step 1: Find length-1 sequential patterns. (min_sup=2) length-1 sequential patterns : <a><b><c><d><e><f>
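Step 2: Divide the search space. • The length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f> partition the complete set of sequential patterns into six subsets: those having prefix <a>, those having prefix <b>, …, and those having prefix <f>. • Length-2 sequential patterns having prefix <a>: <aa>:2, <ab>:4, <(ab)>:2, <ac>:4, <ad>:2, <af>:2 • <aaa> has support 1 < min_sup, so <aaa> is not a sequential pattern and the growth of this branch terminates.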
Efficiency of PrefixSpan • No candidate sequence needs to be generated • Projected databases keep shrinking • Major cost of PrefixSpan: constructing projected databases -> Can be improved by bi-level projections
Bi-level Projection • The major cost of PrefixSpan is to construct projected databases. • A bi-level projection scheme is proposed to reduce the number and the size of projected databases.
Bi-level Projection - Example • Scan S to find the length-1 sequential patterns:<a>, <b>, <c>, <d>, <e>, <f>.
Bi-level Projection - Example • Instead of constructing projected databases for each length-1 sequential pattern, we construct a 6×6 lower triangular matrix M, as shown in Table 3. • A diagonal cell holds one counter. For example, M[c, c] = 3 means that sequence <cc> appears in three sequences in S. • Every other cell holds three counters. For example, M[a, c] = (4, 2, 1) means that support(<ac>) = 4, support(<ca>) = 2 and support(<(ac)>) = 1.
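The contents of such a matrix can be illustrated with a brute-force sketch. Note the assumptions: the example database is the four-sequence database from the FreeSpan slides, and each cell is computed by direct support counting rather than by the single database scan that bi-level projection uses. The names `contains`, `support`, and `s_matrix` are hypothetical.

```python
from itertools import combinations_with_replacement

DB = [
    ["a", "abc", "ac", "d", "cf"],       # <a(abc)(ac)d(cf)>
    ["ad", "c", "bc", "ae"],             # <(ad)c(bc)(ae)>
    ["ef", "ab", "df", "c", "b"],        # <(ef)(ab)(df)cb>
    ["e", "g", "af", "c", "b", "c"],     # <eg(af)cbc>
]


def contains(seq, pattern):
    """Greedy check that pattern is a subsequence of seq."""
    i = 0
    for elem in seq:
        if i < len(pattern) and set(pattern[i]) <= set(elem):
            i += 1
    return i == len(pattern)


def support(pattern):
    return sum(contains(seq, pattern) for seq in DB)


def s_matrix(items):
    """Triangular matrix: M[x, x] = sup(<xx>); M[x, y] = (sup(<xy>), sup(<yx>), sup(<(xy)>))."""
    M = {}
    for x, y in combinations_with_replacement(items, 2):
        if x == y:
            M[x, y] = support([x, x])
        else:
            M[x, y] = (support([x, y]), support([y, x]), support([x + y]))
    return M


M = s_matrix("abcdef")
print(M["c", "c"], M["a", "c"])
```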
Bi-level Projection - Example • The <ab>-projected database contains three sequences: <(_c)(ac)(cf)>, <(_c)a> and <c>. • Three frequent items are found: <a>, <c> and <(_c)>. Length-2 pattern <(_c)a> can be generated
Bi-level Projection - Example • Using level-by-level projection, 53 projected databases are constructed to find the complete set of 53 sequential patterns. • Only 22 projected databases are constructed by bi-level projection.
Optimization • “Do we need to include every item in a postfix in the projected databases?”
Optimization • Consider the <ac>-projected database… Exclude item d from <ac>-projected database!
Pseudo-Projection • By examining a set of projected databases, one can observe that postfixes of a sequence often appear repeatedly in recursive projected databases. • Sequence <a(abc)(ac)d(cf)> has postfixes <(abc)(ac)d(cf)> and <(_c)(ac)d(cf)> as projections in <a>- and <ab>-projected databases, respectively.
Pseudo-Projection • Every projection consists of two pieces of information: pointer to the sequence in database and offset of the postfix in the sequence.
Pseudo-Projection • For example, suppose the sequence database S in Table 1 can be held in main memory. • When constructing the <a>-projected database, the projection of sequence s1 = <a(abc)(ac)d(cf)> consists of two pieces: a pointer to s1 and an offset set to 2, i.e., the postfix <(abc)(ac)d(cf)>. • The <ab>-projected database contains a pointer to s1 and an offset set to 4, i.e., the postfix <(_c)(ac)d(cf)>.
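The pointer-and-offset idea can be sketched as follows. The sketch assumes offsets are 1-based item positions within the sequence, which is consistent with the offsets 2 and 4 in the example; the function name `postfix_at` and the string-per-element representation are assumptions made for illustration.

```python
def postfix_at(seq, offset):
    """Rebuild the postfix that starts at the given 1-based item offset of seq."""
    out = []
    pos = 1  # position of the first item of the current element
    for elem in seq:
        end = pos + len(elem)  # position just past this element
        if offset <= pos:
            # The whole element lies in the postfix.
            out.append(elem if len(elem) == 1 else "(" + elem + ")")
        elif offset < end:
            # The postfix starts inside this element: mark the cut with "_".
            out.append("(_" + elem[offset - pos:] + ")")
        pos = end
    return "<" + "".join(out) + ">"


s1 = ["a", "abc", "ac", "d", "cf"]   # <a(abc)(ac)d(cf)>

# A pseudo-projection entry is just (pointer to the sequence, offset).
a_projection = (s1, 2)
ab_projection = (s1, 4)
print(postfix_at(*a_projection))
print(postfix_at(*ab_projection))
```

Only the two small values are stored per projected sequence; the postfix itself is never copied.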
Dataset • The number of items is set to 1000. • There are 10000 sequences in the data set. • The average number of items within elements is set to 8. • The average number of elements in a sequence is set to 8.
Performance (Figure 1) • GSP: the GSP algorithm • FreeSpan: FreeSpan with alternative level projection • PrefixSpan-1: PrefixSpan with level-by-level projection • PrefixSpan-2: PrefixSpan with bi-level projection • Both FreeSpan and PrefixSpan outperform GSP. • The PrefixSpan methods are more efficient than FreeSpan. • When the support threshold is low, PrefixSpan-1 requires a major effort to generate projected databases. • The performance curves of PrefixSpan-1 and PrefixSpan-2 are close when the support threshold is not low.
Performance (Cont.) (Figure 2) • This figure shows that pseudo-projection further improves the efficiency of PrefixSpan when the projected databases can be held in main memory. • A related question becomes: “Can such a method be extended to disk-based processing?”
Performance (Cont.) • This figure shows the I/O costs of PrefixSpan-1 and PrefixSpan-2 as well as of their pseudo-projection variations • Dataset • The number of items is set to 1000. • There are 1 million sequences in the data set. • The average number of items within elements is set to 8. • The average number of elements in a sequence is set to 8. Figure 3
Performance (Cont.) • This figure shows the scalability of PrefixSpan-1 and PrefixSpan-2 with respect to the number of sequences. • Both are linearly scalable • Since the support threshold is set to 0.20%, PrefixSpan-2 performs better. Figure 4
Conclusions • PrefixSpan is an efficient pattern-growth method. • PrefixSpan performs better than GSP and FreeSpan. • Prefix-projection reduces the size of the projected databases and leads to efficient processing. • Bi-level projection and pseudo-projection improve sequential pattern mining efficiency.
References
• [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), pages 487–499, Santiago, Chile, Sept. 1994.
• [2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE’95), pages 3–14, Taipei, Taiwan, Mar. 1995.
• [3] C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32–38, 1998.
• [4] M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proc. 1999 Int. Conf. Very Large Data Bases (VLDB’99), pages 223–234, Edinburgh, UK, Sept. 1999.
• [5] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE’99), pages 106–115, Sydney, Australia, Apr. 1999.
• [6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pages 355–359, Boston, MA, Aug. 2000.
• [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’00), pages 1–12, Dallas, TX, May 2000.
• [8] H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. In Proc. 1998 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD’98), pages 12:1–12:7, Seattle, WA, June 1998.
• [9] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259–289, 1997.
• [10] B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. 1998 Int. Conf. Data Engineering (ICDE’98), pages 412–421, Orlando, FL, Feb. 1998.
• [11] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), pages 3–17, Avignon, France, Mar. 1996.
Discussions • PrefixSpan assumes a static database. If new data is added after mining, PrefixSpan cannot update its results without mining again. • Related setting: incremental databases, data streams • If the database is dynamic, it grows too large, because old data still has to be considered while mining patterns. • Related setting: progressive databases • PrefixSpan cannot take the duration of an item into account. • Related setting: time-interval databases