Advanced Topics in Data Mining: Sequential Patterns

Sequential Pattern Analysis

Sequential Pattern Mining • Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data • A record in such data typically consists of the transaction date and the items bought in the transaction • Very often, data records also contain customer-id, particularly when the purchase has been made using a credit card or a frequent-buyer card • Catalog companies also collect such data using the orders they receive

Sequential Pattern Mining • An example of such a pattern is that customers typically rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi” • These rentals need not be consecutive • Customers who rent some other videos in between also support this sequential pattern • Elements of a sequential pattern need not be simple items • “Computer Science and Programming Language”, followed by “Data Structure”, followed by “System Programs and Operating Systems” is an example of a sequential pattern in which the elements are sets of items

Sequential Pattern Mining • Given: transactions with Transaction Time, Customer Id and Items Bought • [Figure: Original Database and the resulting Answer Set]

Definition • The length of a sequence is the number of itemsets in the sequence • A sequence of length k is called a k-sequence • The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction • The itemset i and the 1-sequence <i> have the same support • An itemset with minimum support is called a large (frequent) itemset or litemset
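A minimal sketch of the itemset-support definition above, assuming each customer sequence is a list of transactions and each transaction is a set of items. The data is the four-sequence example database that appears later in these slides; the name `itemset_support` is illustrative only.

```python
# Each customer sequence is a list of transactions (sets of items).
# Data: the four-sequence example database used later in these slides.
DB = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

def itemset_support(db, itemset):
    """Fraction of customers with at least one transaction containing `itemset`."""
    itemset = set(itemset)
    hits = sum(any(itemset <= t for t in seq) for seq in db.values())
    return hits / len(db)

print(itemset_support(DB, {"a", "b"}))  # supported by customers 10 and 30
print(itemset_support(DB, {"g"}))       # supported by customer 40 only
```

With a minimum support count of 2 out of 4 customers, (a b) would be a litemset and (g) would not.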

AprioriAll Algorithm • Each itemset in a large sequence must have minimum support • Any large sequence must be a list of litemsets • Finding all sequential patterns in five phases • Sort Phase • Litemset Phase • Transformation Phase • Sequence Phase • Maximal Phase

AprioriAll Algorithm: Sort Phase • [Figure: Customer-Sequence Version of the Database]

AprioriAll Algorithm: Litemset Phase • Apriori/DHP, FP-Growth • min_sup_count = 2

AprioriAll Algorithm: Transformation Phase

AprioriAll Algorithm: Sequence Phase • [Figure panels: Customer Sequences, Large 1-Sequences, Large 2-Sequences, Large 3-Sequences, Large 4-Sequences, Maximal Large Sequences]

Sequence Phase: Candidate Generation

AprioriAll Algorithm: Maximal Phase • The sequence <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8) • The sequence <(3) (5)> is not contained in <(3 5)> (and vice versa) • The former represents items 3 and 5 being bought one after the other • The latter represents items 3 and 5 being bought together • In a set of sequences, a sequence s is maximal if s is not contained in any other sequence
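The containment test above can be sketched as a greedy left-to-right scan. This is an illustrative helper (the name `is_contained` and the tuple representation of itemsets are assumptions), checked against the slide's own examples.

```python
def is_contained(a, b):
    """True if sequence `a` is contained in sequence `b`.

    Each itemset of `a` must be a subset of a distinct itemset of `b`,
    and the matched itemsets must appear in the same order.
    """
    j = 0
    for itemset in a:
        # Greedy leftmost match: advance until an itemset of b covers this one.
        while j < len(b) and not set(itemset) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1  # the next itemset of a must match strictly later in b
    return True

big = [(7,), (3, 8), (9,), (4, 5, 6), (8,)]
print(is_contained([(3,), (4, 5), (8,)], big))  # True
print(is_contained([(3,), (5,)], [(3, 5)]))     # False
print(is_contained([(3, 5)], [(3,), (5,)]))     # False
```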

AprioriAll Algorithm: Answer Set • With minimum support set to 25%, i.e., a minimum support of 2 customers • <(30) (90)> and <(30) (40 70)> are maximal • <(10 20) (30)>, which is supported only by customer 2, does not have minimum support • <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)>, though having minimum support, are not in the answer because they are not maximal
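As a hedged illustration of the maximal phase, the sketch below filters the large sequences listed on this slide down to the maximal ones. The helper names `contains` and `maximal` are hypothetical, and the quadratic pairwise check is chosen for clarity rather than efficiency.

```python
def contains(big, small):
    """True if `small` is contained in `big` (itemset-wise subset, order kept)."""
    it = iter(big)  # shared iterator, so matches must move strictly forward
    return all(any(set(s) <= set(b) for b in it) for s in small)

def maximal(seqs):
    """Keep only the sequences not contained in any other sequence of the set."""
    return [s for s in seqs
            if not any(s != t and contains(t, s) for t in seqs)]

# Large sequences with minimum support, as listed on the slide.
large = [
    [(30,)], [(40,)], [(70,)], [(90,)],
    [(30,), (40,)], [(30,), (70,)], [(40, 70)],
    [(30,), (90,)], [(30,), (40, 70)],
]
print(maximal(large))  # [[(30,), (90,)], [(30,), (40, 70)]]
```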

Summary

Discussions • The AprioriAll algorithm generates a huge set of candidate sequences • If there are 1000 frequent sequences of length 1, the algorithm will generate 1000 × 1000 + (1000 × 999) / 2 = 1,499,500 candidate sequences of length 2 • Many database scans during mining • Difficulty in mining long sequential patterns
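The candidate count above can be reproduced with a few lines of arithmetic. Reading the first term as ordered two-element candidates <x y> and the second as unordered single-element pairs <(x y)> is an assumption about how the slide splits the total.

```python
n = 1000  # number of frequent length-1 sequences

two_elements = n * n            # <x y>: ordered pairs, x may equal y (assumed reading)
one_element = n * (n - 1) // 2  # <(x y)>: unordered pairs in one element (assumed reading)

print(two_elements + one_element)  # 1499500
```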

Methods to Improve AprioriAll’s Efficiency • PrefixSpan • Without Candidate Generation • Reduce Database Scan (Scan Database Twice) & Database Size • The general idea of the method is to use projected sequence databases to confine the search and the growth of subsequence fragments

PrefixSpan • PrefixSpan-1 • Single-Level Projection • PrefixSpan-2 • Bi-Level Projection • S-Matrix • PrefixSpan with Pseudo-Projection

Definition • A sequence database (items within an element are listed alphabetically): • SID 10: <a(abc)(ac)d(cf)> • SID 20: <(ad)c(bc)(ae)> • SID 30: <(ef)(ab)(df)cb> • SID 40: <eg(af)cbc> • A sequence, e.g. <(ef)(ab)(df)cb>, is an ordered list of elements • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> • Let min_sup = 2; then <(ab)c> is a sequential pattern

Definition • Prefix and Postfix (Projection) • <a>, <aa>, <a(ab)>, <a(abc)>, … are prefixes of sequence <a(abc)(ac)d(cf)> • Given Sequence <a(abc)(ac)d(cf)>
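A small sketch of prefix projection, assuming each element is written as a string of single-character items in alphabetical order. The helper names (`leftmost_end`, `postfix`, `fmt`) are hypothetical. A leading `_` marks an element whose first items were consumed by the prefix.

```python
def leftmost_end(seq, prefix):
    """Index in `seq` where the leftmost match of `prefix` ends (None if no match)."""
    pos = -1
    for elem in prefix:
        pos += 1
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
    return pos

def postfix(seq, prefix):
    """Postfix of `seq` with respect to a non-empty `prefix`."""
    end = leftmost_end(seq, prefix)
    if end is None:
        return None
    last_item = max(prefix[-1])
    # Items of the matched element that come after the prefix's last item.
    rest = "".join(x for x in seq[end] if x > last_item)
    return (["_" + rest] if rest else []) + list(seq[end + 1:])

def fmt(seq):
    """Render a list of elements in the slides' <...> notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"

s = ["a", "abc", "ac", "d", "cf"]  # <a(abc)(ac)d(cf)>
for p in (["a"], ["a", "a"], ["a", "b"]):
    print(fmt(p), "->", fmt(postfix(s, p)))
```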

PrefixSpan-1 • Find Length-1 (L1) Sequential Patterns • Construct Projected Database According to L1 • Mining Each Projected DB Recursively
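The three steps above can be sketched as a recursive prefix-growth miner. This is a simplified illustration, not the published implementation: instead of materialising projected databases it recomputes the leftmost match of the prefix in every sequence, which is assumed here to give the same local counts. All function names are hypothetical. The data is the four-sequence example database from these slides.

```python
from collections import Counter

def leftmost_end(seq, pattern):
    """Index where the leftmost match of `pattern` ends; -1 if empty, None if absent."""
    pos = -1
    for elem in pattern:
        pos += 1
        while pos < len(seq) and not elem <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return None
    return pos

def local_counts(db, prefix):
    """Per-sequence counts of the items that can extend `prefix`.

    Returns (s_ext, i_ext): items starting a new element after the prefix,
    and items joining the prefix's last element (written (_x) on the slides).
    """
    s_ext, i_ext = Counter(), Counter()
    for seq in db:
        end = leftmost_end(seq, prefix)
        if end is None:
            continue
        s_ext.update(set().union(*seq[end + 1:]))
        if prefix:
            last = prefix[-1]
            start = leftmost_end(seq, prefix[:-1]) + 1
            found = set()
            for elem in seq[start:]:
                if last <= elem:
                    found |= {x for x in elem if x > max(last)}
            i_ext.update(found)
    return s_ext, i_ext

def prefixspan(db, min_count, prefix=()):
    """Yield (pattern, support count) for every frequent pattern extending `prefix`."""
    s_ext, i_ext = local_counts(db, prefix)
    for x, n in sorted(i_ext.items()):
        if n >= min_count:
            grown = prefix[:-1] + (prefix[-1] | {x},)
            yield grown, n
            yield from prefixspan(db, min_count, grown)
    for x, n in sorted(s_ext.items()):
        if n >= min_count:
            grown = prefix + (frozenset([x]),)
            yield grown, n
            yield from prefixspan(db, min_count, grown)

DB = [[frozenset(e) for e in s] for s in (
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
)]
patterns = dict(prefixspan(DB, 2))
print(patterns[(frozenset("a"), frozenset("bc"), frozenset("a"))])  # 2
```

Each pattern is a tuple of frozensets, so `(frozenset("a"), frozenset("bc"), frozenset("a"))` stands for <a(bc)a>.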

PrefixSpan-1: An Example • Min_Support_Count = 2 • L1: <a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3, <f>: 3

PrefixSpan-1: An Example • Scanning the <a>-projected database once: a: 2, b: 4, c: 4, d: 2, e: 1, f: 2, (_b): 2, (_c): 1, (_d): 1, (_e): 1, (_f): 1 • L2: <aa>: 2, <ab>: 4, <(ab)>: 2, <ac>: 4, <ad>: 2, <af>: 2

PrefixSpan-1: An Example • Scanning the <ab>-projected database once: a: 2, c: 2, d: 1, f: 1, (_c): 2 • L3: <a(bc)>: 2, <aba>: 2, <abc>: 2

PrefixSpan-1: An Example • Scanning the <a(bc)>-projected database once: a: 2, c: 1, d: 1, f: 1 • L4: <a(bc)a>: 2

Completeness of PrefixSpan-1 • SDB: SID 10 <a(abc)(ac)d(cf)>, SID 20 <(ad)c(bc)(ae)>, SID 30 <(ef)(ab)(df)cb>, SID 40 <eg(af)cbc> • Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f> • Having prefix <a>: the <a>-projected database is <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> • Length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> • Having prefix <aa>, …, having prefix <af>: <aa>-proj. db, …, <af>-proj. db • Having prefix <b>: <b>-projected database … • Having prefix <c>, …, <f>: …

Analysis • No candidate sequence needs to be generated by PrefixSpan • Projected databases keep shrinking • The major cost of PrefixSpan is the construction of projected databases

PrefixSpan-2 • Find Length-1 Sequential Patterns • Construct Triangular Matrix M (S-Matrix) • By scanning the DB a second time, the S-matrix can be filled in • Construct Projected Database • For each length-2 sequential pattern, construct its projected DB • Mine each projected DB recursively
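A hedged sketch of what the S-matrix holds, computed here by brute-force containment checks rather than by the single extra scan the method describes. The entry layout is an assumption read off the example values in these slides: the diagonal holds the support of <xx>, and the entry in row y, column x (x before y alphabetically) holds the supports of <xy>, <yx> and <(xy)>. Function names are hypothetical.

```python
from itertools import combinations

def contains(seq, pattern):
    """True if `pattern` (a list of item sets) is a subsequence of `seq`."""
    it = iter(seq)  # shared iterator, so matches must move strictly forward
    return all(any(p <= e for e in it) for p in pattern)

def support(db, pattern):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db)

def s_matrix(db, items):
    """Diagonal: support of <xx>. Below it: (<xy>, <yx>, <(xy)>) for x < y."""
    m = {(x, x): support(db, [{x}, {x}]) for x in items}
    for x, y in combinations(sorted(items), 2):
        m[(y, x)] = (support(db, [{x}, {y}]),
                     support(db, [{y}, {x}]),
                     support(db, [{x, y}]))
    return m

DB = [[set(e) for e in s] for s in (
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
)]
M = s_matrix(DB, "abcdef")
print(M[("b", "a")], M[("b", "b")], M[("d", "c")], M[("f", "e")])
# (4, 2, 2) 1 (1, 3, 0) (2, 0, 1)
```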

PrefixSpan-2: An Example • Min_Support = 2 • L1: <a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3, <f>: 3

PrefixSpan-2: An Example • S-Matrix (a diagonal entry is the support of <xx>; the entry in row y, column x is the support of <xy>, <yx> and <(xy)>)
     a         b         c         d         e         f
a    2
b    (4,2,2)   1
c    (4,2,1)   (3,3,2)   3
d    (2,1,1)   (2,2,0)   (1,3,0)   0
e    (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
f    (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1
• <ab> happens 4 times • <bb> happens 1 time • <dc> happens 3 times • <(ef)> happens 1 time

PrefixSpan-2: An Example • <ab>-projected database: <(_c)(ac)d(cf)>, <(_c)a>, <c> • Local length-1 sequential patterns: <a>, <c>, <(_c)> • S-matrix of the <ab>-projected database:
       a         c         (_c)
a      0
c      (1,0,1)   1
(_c)   (∅,2,∅)   (∅,1,∅)   ∅
• The entry (∅,2,∅) leads to the pattern <a(bc)a> • There is no hope of forming (_cc), so there is no need to count it

Benefits of Bi-Level Projection • More patterns are found in each scan • Far fewer projections • In this example, there are 53 patterns • 53 level-by-level projections • 22 bi-level projections

Speed-Up by Pseudo-Projection • Major Cost of PrefixSpan: Projection • Postfixes of sequences often appear repeatedly in recursive projected databases • When the (projected) database can be held in main memory, use pointers to form projections • Pointer to the sequence • Offset of the postfix • Example: s = <a(abc)(ac)d(cf)> • s|<a>: (pointer to s, 2), i.e. <(abc)(ac)d(cf)> • s|<ab>: (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
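A minimal sketch of pseudo-projection, assuming elements are strings of single-character items in alphabetical order and offsets count items from 1, as in the example above. A projection is then just an offset into the original sequence, and the postfix is rebuilt only when needed. The function names are hypothetical.

```python
def pseudo_project(seq, prefix):
    """1-based offset of the first item after the leftmost match of `prefix`.

    Returns None if `prefix` does not occur in `seq`.
    """
    pos = -1
    for elem in prefix:
        pos += 1
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
    items_before = sum(len(e) for e in seq[:pos])
    return items_before + seq[pos].index(max(prefix[-1])) + 2

def postfix_at(seq, offset):
    """Rebuild the postfix from an offset, marking a split element with '_'."""
    out, seen = [], 0
    for elem in seq:
        start = seen + 1            # 1-based position of this element's first item
        seen += len(elem)
        if seen < offset:
            continue                # element lies entirely before the offset
        cut = max(offset - start, 0)
        out.append(("_" if cut else "") + elem[cut:])
    return out

s = ["a", "abc", "ac", "d", "cf"]        # <a(abc)(ac)d(cf)>
print(pseudo_project(s, ["a"]))          # 2
print(pseudo_project(s, ["a", "b"]))     # 4
print(postfix_at(s, 4))                  # ['_c', 'ac', 'd', 'cf']
```

Storing `(sequence id, offset)` pairs avoids copying postfixes, at the cost of re-reading the original sequence on each access.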

Mining Time-Gap Sequential Patterns (TGSP) • Sequential Pattern • A → B → C • Time-Gap Sequential Pattern • A → B → C, with time gap (3-5) between A and B and (5-7) between B and C

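Transaction-Time Sequence Database • [Figure: a transaction database converted into a transaction-time sequence database]

Transaction-Time Sequence • k-transaction-time sequence • <I1(T1), I2(T2), …, Ik(Tk)> • Customer 1 contains the 3-transaction-time sequence <c(11), a(16), c(16)>

Transaction Time-Gap Sequences & Item Sequences • k-transaction time-gap sequence • Written <I1, (t1), I2, (t2), …, (tk-1), Ik>, where each Ii is a single item and ti is the time gap between the purchases of Ii and Ii+1 • The transaction time-gap sequence of the 3-transaction-time sequence <A(10), B(15), D(30)> is <A, (5), B, (15), D> • k-item sequence • Written <I1, I2, …, Ik>: the items arranged in order of purchase time; items bought at the same time are listed with the smaller item number first • The 3-item sequence corresponding to the 3-transaction-time sequence <A(10), B(15), D(30)> is <A, B, D>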
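A short sketch of the two conversions defined on this slide, assuming a transaction-time sequence is a list of `(item, time)` pairs already sorted by time. The function names are illustrative.

```python
def to_gap_sequence(timed):
    """<I1(T1), ..., Ik(Tk)> -> [I1, t1, I2, ..., t(k-1), Ik], with ti = T(i+1) - Ti."""
    out = [timed[0][0]]
    for (_, t_prev), (item, t) in zip(timed, timed[1:]):
        out += [t - t_prev, item]
    return out

def to_item_sequence(timed):
    """Drop the times and keep the items in purchase order."""
    return [item for item, _ in timed]

seq = [("A", 10), ("B", 15), ("D", 30)]
print(to_gap_sequence(seq))   # ['A', 5, 'B', 15, 'D']
print(to_item_sequence(seq))  # ['A', 'B', 'D']
```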

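Time-Gap Sequences & Containment • k-time-gap sequence • Written <I1, R1, I2, R2, …, Rk-1, Ik>, where each Ii is a single item and Ri = li ~ ui is a time range, meaning that the time gap between the purchases of Ii and Ii+1 lies between li and ui • Example of a 4-time-gap sequence • <A, (5~8), B, (3~6), C, (5~8), D> • The transaction time-gap sequence <A, (7), B, (4), C, (5), D> is contained in the time-gap sequence <A, (5~8), B, (3~6), C, (5~8), D>

Supporting a Sequence • The customer transaction-time sequence C = <A(15), B(22), C(26), D(31), E(39)> contains the 4-transaction-time sequence <A(15), B(22), C(26), D(31)> • The transaction time-gap sequence of this 4-transaction-time sequence is <A, (7), B, (4), C, (5), D>, which is contained in the time-gap sequence S = <A, (5~8), B, (3~6), C, (5~8), D> • Therefore the customer transaction-time sequence C supports the time-gap sequence S, and C also supports the item sequence <A B C D>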
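The support test in this example can be sketched as a backtracking search over the customer's purchases. The sketch assumes inclusive range bounds, which the gap of 5 against the range 5~8 in the example suggests. The name `supports` and the list-based pattern encoding are hypothetical.

```python
def supports(customer, pattern):
    """True if some subsequence of `customer` matches the time-gap `pattern`.

    customer: list of (item, time) pairs sorted by time.
    pattern:  [item, (lo, hi), item, (lo, hi), ..., item].
    """
    items, ranges = pattern[0::2], pattern[1::2]

    def extend(k, prev_time, start):
        if k == len(items):
            return True
        for i in range(start, len(customer)):
            item, t = customer[i]
            if item != items[k]:
                continue
            if k > 0:
                lo, hi = ranges[k - 1]
                if not lo <= t - prev_time <= hi:  # bounds assumed inclusive
                    continue
            if extend(k + 1, t, i + 1):
                return True
        return False

    return extend(0, None, 0)

C = [("A", 15), ("B", 22), ("C", 26), ("D", 31), ("E", 39)]
S = ["A", (5, 8), "B", (3, 6), "C", (5, 8), "D"]
print(supports(C, S))  # True
```

Backtracking matters when an item occurs several times: a greedy choice of the first matching purchase could miss a later one whose gap falls inside the range.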

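Support • The support of a k-time-gap sequence is the ratio of the number of customers supporting it to the total number of customers in the database • If the support of a k-time-gap sequence is greater than or equal to the user-specified minimum support, it is called a frequent k-time-gap sequence • The support of a k-item sequence is the ratio of the number of customers supporting it to the total number of customers in the database • If the support of a k-item sequence is greater than or equal to the minimum support, it is called a frequent k-item sequence

Mining Time-Gap Sequential Patterns • Find the frequent 1-item sequences • Find the frequent 2-item sequences • Generate the 2-item-sequence database • Find the frequent 2-time-gap sequences • Generate the k-item-sequence database (k ≥ 3) • Find the frequent k-time-gap sequences (k ≥ 3) • Find the time-gap sequential patterns

Finding the Frequent 1-Item Sequences • Assume the minimum support is 1/2 • The item supports are A = 8/8 = 1, B = 8/8 = 1, C = 7/8, D = 6/8, E = 3/8, F = 5/8 • Items A, B, C, D and F are therefore the frequent 1-item sequences

Finding the Frequent 2-Item Sequences • Generate the candidate 2-item sequences • Pairing the frequent 1-item sequences A, B, C, D, F yields the candidate 2-item sequences <AA>, <AB>, <AC>, <AD>, <AF>, <BA>, <BB>, <BC>, <BD>, <BF>, <CA>, <CB>, <CC>, <CD>, <CF>, <FA>, <FB>, <FC>, <FD>, <FF> • Generate the frequent 2-item sequences • Scan the database and compute the support of each candidate 2-item sequence • <AA> = 1/8, <AB> = 1, <AC> = 5/8, <AD> = 6/8, <AF> = 5/8, <BA> = 1/8, <BB> = 0, <BC> = 5/8, <BD> = 6/8, <BF> = 5/8, <CA> = 2/8, <CB> = 2/8, <CC> = 0, <CD> = 5/8, <CF> = 2/8, <FA> = 1/8, <FB> = 0, <FC> = 2/8, <FD> = 3/8, <FF> = 0 • Frequent 2-item sequences (minimum support 1/2) • <AB>, <AC>, <AD>, <AF>, <BC>, <BD>, <BF>, <CD> are the frequent 2-item sequences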
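The filtering step can be checked with exact fractions. The supports below are copied from the slide; the dictionary layout and variable names are illustrative.

```python
from fractions import Fraction as F

# Candidate 2-item sequences and their supports, as listed on the slide.
support = {
    "AA": F(1, 8), "AB": F(1), "AC": F(5, 8), "AD": F(6, 8), "AF": F(5, 8),
    "BA": F(1, 8), "BB": F(0), "BC": F(5, 8), "BD": F(6, 8), "BF": F(5, 8),
    "CA": F(2, 8), "CB": F(2, 8), "CC": F(0), "CD": F(5, 8), "CF": F(2, 8),
    "FA": F(1, 8), "FB": F(0), "FC": F(2, 8), "FD": F(3, 8), "FF": F(0),
}
min_support = F(1, 2)

frequent = [seq for seq, s in support.items() if s >= min_support]
print(frequent)  # ['AB', 'AC', 'AD', 'AF', 'BC', 'BD', 'BF', 'CD']
```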

Generating the 2-Item-Sequence Database • Frequent 2-item sequences: <AB>, <AC>, <AD>, <AF>, <BC>, <BD>, <BF>, <CD>

Generating the 2-Item-Sequence Database • When the 2-item-sequence database is generated, customer 1 is decomposed into {A(5) C(10)}, {A(5) B(13)}, {A(5) A(15)}, {C(10) B(13)}, {C(10) A(15)}, {C(10) C(20)}, {B(13) A(15)}, {B(13) C(20)}, {A(15) C(20)} • {A(5) C(20)} is not generated
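The slide does not state the decomposition rule, so the sketch below uses one rule that reproduces the listed pairs: each purchase is paired with the first later purchase of each item. Both this rule and customer 1's sequence (read off the listed pairs) are assumptions; the function name is hypothetical.

```python
def two_item_pairs(timed):
    """Pair every purchase with the first later purchase of each item.

    timed: list of (item, time) pairs sorted by time.
    Assumed rule; it matches the pairs listed for customer 1.
    """
    pairs = []
    for i, first in enumerate(timed):
        seen = set()
        for second in timed[i + 1:]:
            if second[0] not in seen:  # only the nearest later occurrence of an item
                seen.add(second[0])
                pairs.append((first, second))
    return pairs

# Customer 1's sequence, inferred from the pairs on the slide.
customer1 = [("A", 5), ("C", 10), ("B", 13), ("A", 15), ("C", 20)]
for first, second in two_item_pairs(customer1):
    print(first, second)
```

Under this rule {A(5) C(20)} is skipped because C(10) is already the nearest C after A(5).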

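2-Item-Sequence Database

Finding the Frequent 2-Time-Gap Sequences • Minimum density: 5 (count) • Minimum support: 15 (count) • Unit length: 1 • Output the frequent time-gap sequence A → B [1,3] • 1. If the data list of item sequence AB yields no frequent time-gap sequence, delete the data list of item sequence AB • 2. Prune from the data list of item sequence AB the entries whose projection points do not lie in u1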
