Multi-dimensional Sequential Pattern Mining

From: 10th ACM International Conference on Information and Knowledge Management (CIKM 2001), Atlanta. Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal. Presented by 阮士峰 (student ID 69121507, second-year in-service master's program)

Outline • Why multidimensional sequential pattern mining? • Problem definition • UniSeq Algorithms • Dim-Seq and Seq-Dim • Experimental results • Conclusions

Why Sequential Pattern Mining? • Sequential pattern mining: Finding time-related frequent patterns (frequent subsequences) • Many data and applications are time-related • Customer shopping patterns, telephone calling patterns • Natural disasters (e.g., earthquake, hurricane) • Disease and treatment • Stock market fluctuation • Weblog click stream analysis • DNA sequence analysis

Sequential Pattern: Basics • A sequence: <(bd)cb(ac)>, with elements (bd), c, b, (ac) • A sequence database:
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
• <ad(ae)> is a subsequence of <a(bd)bcb(ade)> • Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern

Multi-Dimensional Sequence Database • P = (*,Chicago,*,<bf>) matches tuples 20 and 30 • With min_sup = 2, P is an MD sequential pattern
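Matching an MD pattern against a tuple can be sketched as follows. The table for this slide is not part of the transcript, so the database below is an assumption: tuples 20 and 30 are pieced together from the Dim-Seq and Seq-Dim examples later in the talk, and tuples 10 and 40 are hypothetical values chosen to agree with the item counts quoted on the UniSeq slide. The names `md_db` and `matches` are illustrative only.

```python
WILDCARD = "*"

def is_subsequence(pattern, sequence):
    """True if each pattern element is contained, in order, in a distinct sequence element."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

# (cid, cust-grp, city, age-grp, sequence); rows 10 and 40 are assumed, see above.
md_db = [
    (10, "business", "Boston", "middle", [{"b", "d"}, {"c"}, {"b"}, {"a"}]),
    (20, "professional", "Chicago", "young", [{"b", "f"}, {"c", "e"}, {"f", "g"}]),
    (30, "business", "Chicago", "middle", [{"a", "h"}, {"a"}, {"b"}, {"f"}]),
    (40, "education", "New York", "retired", [{"b", "e"}, {"c", "e"}]),
]

def matches(pattern, row):
    """An MD pattern matches a tuple if every non-wildcard dimension value is equal
    and the sequence part of the pattern is a subsequence of the tuple's sequence."""
    *dims, seq_pattern = pattern
    _cid, *values, seq = row
    dims_ok = all(p == WILDCARD or p == v for p, v in zip(dims, values))
    return dims_ok and is_subsequence(seq_pattern, seq)

P = ("*", "Chicago", "*", [{"b"}, {"f"}])
matching = [row[0] for row in md_db if matches(P, row)]
print(matching)  # [20, 30], so the support of P is 2
```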

Problem definition • Sequential patterns are useful • “try a 100-hour free internet access package” → “subscribe to 15 hours/month package” → “upgrade to 30 hours/month package” → “upgrade to unlimited package” • Marketing, product design & development • Problem: lack of focus • Various groups of customers may have different patterns • MD sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining

UniSeq • Embed MD information into sequences • Mine the extended sequence database using sequential pattern mining methods • (Table 1: SDB; Table 2: SDBMD)

UniSeq (cont.) • The sequence database SDBMD can be mined using PrefixSpan. • In the first scan of the database, PrefixSpan finds all frequent single-item sequences. These are <business>:2, <Chicago>:2, <middle>:2, <a>:2, <b>:4, <c>:3, <e>:2 and <f>:2. • The complete set of sequential patterns can then be partitioned into 8 subsets, one per frequent item used as prefix.
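The embedding step and the first scan can be sketched as below. Two things are assumptions, since the tables are missing from the transcript: the dimension values are embedded as one extra leading element of each sequence, and the rows of the database (in particular tuples 10 and 40) are hypothetical values chosen so that the counts agree with the ones listed above. `embed` and `frequent_items` are illustrative names.

```python
from collections import Counter

# ((cust-grp, city, age-grp), sequence); assumed data, see the note above.
md_db = [
    (("business", "Boston", "middle"), [{"b", "d"}, {"c"}, {"b"}, {"a"}]),
    (("professional", "Chicago", "young"), [{"b", "f"}, {"c", "e"}, {"f", "g"}]),
    (("business", "Chicago", "middle"), [{"a", "h"}, {"a"}, {"b"}, {"f"}]),
    (("education", "New York", "retired"), [{"b", "e"}, {"c", "e"}]),
]

def embed(dims, seq):
    """Turn an MD tuple into one extended sequence: dimension values become a leading element."""
    return [set(dims)] + seq

sdb_md = [embed(dims, seq) for dims, seq in md_db]

def frequent_items(db, min_sup):
    """First scan: count each item once per sequence and keep those meeting min_sup."""
    counts = Counter()
    for seq in db:
        counts.update(set().union(*seq))
    return {item: n for item, n in counts.items() if n >= min_sup}

freq = frequent_items(sdb_md, min_sup=2)
print(sorted(freq.items()))
```

On this data the scan returns the eight frequent items from the slide, with `b` supported by all four sequences. The sketch relies on dimension values and sequence items never sharing a name.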

UniSeq (cont.) • Ex: the <Chicago>-projected database contains two postfix sequences: <(bf)(ce)f> and <middle aabf>. • Output the sequential pattern <Chicago>, then find the frequent items in this projected database. • They are <b> and <f>, which form the sequential patterns <Chicago b>:2 and <Chicago f>:2, respectively. • The <Chicago b>-projected database in turn contains the postfix sequences <(_f)f> and <f>, which share one frequent item, f. • This yields <Chicago bf>:2 → (*,Chicago,*,<bf>)

Mine Sequential Patterns by Prefix Projections • Step 1: find length-1 sequential patterns • <a>, <b>, <c>, <d>, <e>, <f> • Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: • The ones having prefix <a>; • The ones having prefix <b>; • … • The ones having prefix <f>
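The two steps above (find the frequent items, then recurse into one projected database per prefix) can be sketched as a short recursion. This is a deliberately simplified version, not the full PrefixSpan: it assumes every element is a single item, so each sequence is written as a plain string, and the toy database is hypothetical.

```python
from collections import Counter

def prefixspan(db, min_sup, prefix=()):
    """Yield (pattern, support) pairs; each sequence in db is a string of single-item elements."""
    # Step 1: count each item once per (projected) sequence.
    counts = Counter()
    for seq in db:
        counts.update(set(seq))
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        pattern = prefix + (item,)
        yield pattern, sup
        # Step 2: project on the first occurrence of `item` and recurse on the postfixes.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        yield from prefixspan(projected, min_sup, pattern)

# Hypothetical toy database, not taken from the slides.
toy_db = ["abcb", "acb", "bc"]
patterns = dict(prefixspan(toy_db, min_sup=2))
for pattern, sup in patterns.items():
    print("<" + "".join(pattern) + ">", sup)
```

No candidate sequences are generated: every reported pattern is grown from a frequent item actually seen in a projected database.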

Find Seq. Patterns with Prefix <a> • Only need to consider projections w.r.t. <a> • <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> • Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> • Further partition into 6 subsets • Having prefix <aa>; • … • Having prefix <af>
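Building the <a>-projected database can be sketched as below. The slide does not show the original sequence database, so the four input sequences are an assumption, chosen so that the projection reproduces the four postfixes listed above. Elements are written as alphabetically sorted tuples, and `project` and `fmt` are illustrative names.

```python
def project(seq, item):
    """Postfix of `seq` w.r.t. the single-item prefix <item>, or None if `item` does not occur."""
    for i, elem in enumerate(seq):
        if item in elem:
            # Items after `item` in the same element stay in the postfix as a '_' element.
            rest = tuple(x for x in elem if x > item)
            head = [("_",) + rest] if rest else []
            return head + seq[i + 1:]
    return None

def fmt(seq):
    """Render a sequence in the slide notation, e.g. <(_d)c(bc)(ae)>."""
    parts = []
    for elem in seq:
        if elem[0] == "_":
            parts.append("(_" + "".join(elem[1:]) + ")")
        elif len(elem) == 1:
            parts.append(elem[0])
        else:
            parts.append("(" + "".join(elem) + ")")
    return "<" + "".join(parts) + ">"

# Assumed input database (not shown in the transcript).
sdb = [
    [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")],
    [("a", "d"), ("c",), ("b", "c"), ("a", "e")],
    [("e", "f"), ("a", "b"), ("d", "f"), ("c",), ("b",)],
    [("e",), ("g",), ("a", "f"), ("c",), ("b",), ("c",)],
]

projected = [fmt(project(s, "a")) for s in sdb]
print(projected)
```

Only the part of each sequence after the first occurrence of `a` is kept, which is why projected databases shrink as the prefix grows.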

Completeness of PrefixSpan • SDB → length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f> • Having prefix <a> → <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> → length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> • Having prefix <aa> → <aa>-proj. db → … • … • Having prefix <af> → <af>-proj. db → … • Having prefix <b> → <b>-projected database → … • Having prefix <c>, …, <f> → …

Efficiency of PrefixSpan • No candidate sequence needs to be generated • Projected databases keep shrinking • Major cost of PrefixSpan: constructing projected databases

Dim-Seq • First find MD-patterns • E.g. (*,Chicago,*) • Form projected sequence database • <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*) • Find seq. pat in projected database • E.g. (*,Chicago,*,<bf>)

Seq-Dim • Find sequential patterns • E.g. <bf> • Form projected MD-database • E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf> • Mine MD-patterns • E.g. (*,Chicago,*,<bf>)
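The middle step of Seq-Dim, forming the projected MD-database for a sequential pattern, can be sketched as follows. The database is an assumption, since the table is not in the transcript: the two Chicago tuples follow the examples in this talk, the other two rows are hypothetical, and value capitalization is normalized to lowercase. `project_md` is an illustrative name.

```python
def is_subsequence(pattern, sequence):
    """True if each pattern element is contained, in order, in a distinct sequence element."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

# ((cust-grp, city, age-grp), sequence); assumed data, see the note above.
md_db = [
    (("business", "Boston", "middle"), [{"b", "d"}, {"c"}, {"b"}, {"a"}]),
    (("professional", "Chicago", "young"), [{"b", "f"}, {"c", "e"}, {"f", "g"}]),
    (("business", "Chicago", "middle"), [{"a", "h"}, {"a"}, {"b"}, {"f"}]),
    (("education", "New York", "retired"), [{"b", "e"}, {"c", "e"}]),
]

def project_md(seq_pattern, db):
    """Keep the dimension values of every tuple whose sequence contains seq_pattern."""
    return [dims for dims, seq in db if is_subsequence(seq_pattern, seq)]

projected = project_md([{"b"}, {"f"}], md_db)
print(projected)
# [('professional', 'Chicago', 'young'), ('business', 'Chicago', 'middle')]
```

MD-pattern mining then runs on this small table only, which is where (*,Chicago,*) comes from.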

Dim-Seq and Seq-Dim • The problem of multi-dimensional sequential pattern mining can be reduced to two sub-problems: sequential pattern mining and MD-pattern mining • As introduced before, sequential pattern mining can be done efficiently by PrefixSpan. • For MD-pattern mining, we adopt a BUC-like algorithm.

BUC algorithm • Kevin Beyer, Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg CUBE. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 359-370, May 31-June 3, 1999, Philadelphia, Pennsylvania, United States

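Mining MD-Patterns (BUC-like) • Lattice of MD-patterns, from most general to most specific: • All • (cust-grp,*,*), (*,city,*), (*,*,age-grp) • (cust-grp,city,*), (cust-grp,*,age-grp) • (cust-grp,city,age-grp) • BUC processing starts from All and moves toward more specific patterns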
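A BUC-like traversal of this lattice can be sketched as a recursive partitioning. This is a simplified illustration rather than the algorithm from the BUC paper or the authors' implementation: it starts from All, partitions the tuples on one dimension at a time, and recurses only into partitions that meet min_sup. The function name `md_patterns` and the input tuples are assumptions, and min_sup is taken to be at least 1.

```python
from collections import defaultdict

def md_patterns(tuples, min_sup, start=0, pattern=None, out=None):
    """Return {MD-pattern: support} for all patterns with support >= min_sup ('*' = any value)."""
    if out is None:
        out = {}
    if len(tuples) < min_sup:
        return out
    n_dims = len(tuples[0])
    if pattern is None:
        pattern = ("*",) * n_dims          # the All node of the lattice
    out[pattern] = len(tuples)
    # Refine the pattern on each remaining dimension, in a fixed order to avoid duplicates.
    for d in range(start, n_dims):
        groups = defaultdict(list)
        for t in tuples:
            groups[t[d]].append(t)
        for value, part in groups.items():
            if len(part) >= min_sup:       # prune infrequent partitions early
                refined = pattern[:d] + (value,) + pattern[d + 1:]
                md_patterns(part, min_sup, d + 1, refined, out)
    return out

# Hypothetical projected MD-database for <bf> (cust-grp, city, age-grp).
projected = [("professional", "Chicago", "young"), ("business", "Chicago", "middle")]
print(md_patterns(projected, min_sup=2))
# {('*', '*', '*'): 2, ('*', 'Chicago', '*'): 2}
```

Because a partition below min_sup is never expanded, none of its more specific patterns are ever visited.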

Experimental results • Run on a Pentium III PC with 1 GB of main memory. • Implemented using Microsoft Visual C++ 6.0 • In this dataset, the number of items is set to 10,000, and the number of sequences is 10,000. The average number of items within each element is 2.5. The average number of elements in one sequence is 8.

Scalability Over Dimensionality

Scalability Over Cardinality

Scalability Over Support Threshold

Scalability Over Database Size

Pros & Cons of Algorithms • Seq-Dim is efficient and scalable • Fastest in most cases • UniSeq is also efficient and scalable • Fastest with low dimensionality • Dim-Seq has poor scalability

Conclusions • MD seq. pat. mining is interesting and useful • Mining MD seq. pat. efficiently • UniSeq, Dim-Seq, and Seq-Dim • Future work • Applications of sequential pattern mining

End of presentation

References (1) • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, pages 487-499. • R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages 3-14. • K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg CUBE. SIGMOD'99, pages 359-370. • C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998. • M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. VLDB'99, pages 223-234. • J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, pages 106-115. • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.

References (2) • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, pages 1-12. • H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7. • H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997. • B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, pages 412-421. • J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224. • R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.
