Mining Sequential Patterns
Rakesh Agrawal & Ramakrishnan Srikant
Presented by: Saleha Rizvi, Sadia Masud
Outline • Introduction • Terminology • Formal Definition of MSP • Finding Sequential Patterns • 3 Algorithms • Performance Evaluation • Conclusion
Introduction • Sequential patterns have the form X -> Y: if 'X' occurs, 'Y' will also occur later, with a specified support (e.g. laptop -> battery). • They describe things that happen over time and in a sequence; every transaction carries a timestamp.
Introduction Assumptions: • A customer has no more than one transaction at the same time. • Quantity is not important; the occurrence of an item is. • If a customer buys the same item in different transactions of the same sequence, it is counted only once. Properties: • The patterns found are inter-transaction patterns. • Patterns consist of an ordered list of itemsets: (a, b) != (b, a). • Patterns need not be consecutive/contiguous, e.g. for T = (a, b, c, d), all of (a, c, d), (a, d), (b, d) are valid subsequences of T. • An element of a pattern can be a single item or a set of items.
Terminology • Transaction: the set of items bought at one time, <customer id, time, items bought>. • Itemset: a non-empty set of items, e.g. itemset i = (i1 i2 … im), where each ij is an item. • Sequence: an ordered list of itemsets, e.g. sequence s = <s1 s2 … sn>, where each si is an itemset. Example for John: Time 1: <(laptop, battery)>; Time 2: <(laptop bag)>; Time 3: <(Norton Antivirus)>. D = the database of transactions; John's transaction at time 2 (t2) is <(laptop bag)>. John's sequence becomes <(laptop, battery) (laptop bag) (Norton Antivirus)>.
Terminology • Sequence containership: a sequence a = <a1 a2 … an> is contained in another sequence b = <b1 b2 … bm> iff there exist indices i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin, i.e. each itemset of 'a' is a subset of a distinct itemset of 'b', in order (see the sketch below). a = <(4) (7 9 10) (1 3 6)>, b = <(8) (4 2) (2 7 8 9 10) (2 1 4 3 7 9 6)>, c = <(4 11) (7 9 10) (1 3 6)>. a is contained in b; c is not contained in b (no itemset of b contains (4 11)). d = <(4) (2)>: 4 and 2 bought in different transactions. e = <(4 2)>: 4 and 2 bought in the same transaction. d is not contained in e, and vice versa.
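A minimal sketch of this containment test in Python (illustrative only; the representation and names are assumptions, not the paper's code). A sequence is a list of item sets; each itemset of 'a' is matched, in order, against a later itemset of 'b':

def contains(a, b):
    """Return True if sequence a is contained in sequence b.
    a, b: lists of sets of items. Each itemset of a must be a subset of a
    distinct itemset of b, with the order preserved."""
    j = 0  # current position in b
    for itemset in a:
        # advance through b until an itemset covering 'itemset' is found
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next itemset of a must match a strictly later itemset of b
    return True

# Examples from the slide
a = [{4}, {7, 9, 10}, {1, 3, 6}]
b = [{8}, {4, 2}, {2, 7, 8, 9, 10}, {2, 1, 4, 3, 7, 9, 6}]
c = [{4, 11}, {7, 9, 10}, {1, 3, 6}]
print(contains(a, b))  # True
print(contains(c, b))  # False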
Terminology • Maximal sequence: a sequence that is not contained in any other sequence of the set under consideration. Given b = <(8) (4 2) (2 7 8 9 10) (2 1 4 3 7 9 6)>: a = <(4) (7 9 10) (1 3 6)> is not maximal, since it is contained in b; c = <(4) (7 9 10) (1 3 6 5)> is maximal, since it is not contained in b. • Customer's support for a sequence: a customer supports 's' if 's' is contained in that customer's sequence. E.g. John supports sequence 'r' but not 's': John's sequence = <(laptop, battery) (laptop bag) (Norton Antivirus)>, r = <(laptop) (laptop bag)>, s = <(laptop) (laptop bag) (notebook cam)>.
Terminology • Support for a sequence: the fraction of customers who support that sequence. E.g. if the database has 100 customers and 10 of them support sequence 'r', the support of 'r' is 10%. Note this differs from ARM, where support is counted in terms of transactions/occurrences (see the sketch below). • Large sequence: a sequence satisfying minimum support. • Length of a sequence: the number of itemsets in the sequence; a k-sequence is a sequence with k itemsets.
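Reusing the contains() helper sketched above, per-customer support can be computed as follows (a rough illustration; customer_sequences is assumed to be a collection of customer sequences, each a list of item sets):

def support(s, customer_sequences):
    """Fraction of customer sequences that contain sequence s."""
    seqs = list(customer_sequences)
    return sum(contains(s, cs) for cs in seqs) / len(seqs)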
Formal Definition of MSP • Given D (a database of customer transactions): "Find the maximal sequences among all sequences that meet the user-specified minimum support." • This involves two steps: • Find all large sequences, i.e. those that meet minimum support. • Among these, keep only the maximal ones.
Finding Sequential Patterns • The solution has 5 phases: • Sort Phase • Litemset Phase • Transformation Phase • Sequence Phase • Maximal Phase
Sort Phase • Sort the database in ascending order, with customer id as the primary key and transaction time as the secondary key. • This places all transactions belonging to one customer in contiguous tuples. • The database is thereby expressed as a set of customer sequences (see the sketch below).
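A hedged sketch of this phase in Python; the in-memory list-of-tuples representation is an assumption made for illustration (the paper operates on a transaction database):

from collections import defaultdict

def to_customer_sequences(transactions):
    """transactions: list of (customer_id, time, items) tuples.
    Returns {customer_id: [itemset1, itemset2, ...]} with the itemsets
    ordered by transaction time (customer id primary key, time secondary)."""
    by_customer = defaultdict(list)
    for cust, t, items in sorted(transactions, key=lambda r: (r[0], r[1])):
        by_customer[cust].append(frozenset(items))
    return dict(by_customer)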
Example: Customer Sequences • Minimum support = 40% (~2 customers). • <(30) (90)> and <(30) (40 70)> have minimum support and are maximal => these are the desired sequential patterns. • <(30)>, <(30) (40)>, <(40) (70)>, etc. have minimum support but are not maximal. • Sequences without minimum support: <(10 20) (30)>, <(30 50 70)>.
Litemset Phase • Some definitions: • Support for an itemset: the fraction of customers who bought the items in that itemset in a single transaction. • Litemset (large itemset): an itemset with minimum support. • Note: a large sequence must be a list of litemsets, i.e. each itemset in a large sequence must itself have minimum support. • Find the set of all litemsets L (i.e., all large 1-sequences), using the Apriori algorithm. • Note: the support count is incremented only once per customer, even if the customer buys the itemset in several transactions. • Map the set of litemsets to a set of contiguous integers (this reduces the cost of later comparisons); see the sketch below.
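As a rough illustration of the per-customer counting (this is a naive enumeration, not the Apriori algorithm itself, which generates candidates level by level), the following sketch counts each itemset at most once per customer; all names and the max_size cut-off are assumptions:

from itertools import combinations

def litemsets(customer_sequences, min_support, max_size=3):
    """Find itemsets whose per-customer support meets min_support.
    customer_sequences: {customer_id: [itemset, ...]}; min_support: a fraction."""
    n_customers = len(customer_sequences)
    counts = {}
    for seq in customer_sequences.values():
        seen = set()  # itemsets already counted for this customer
        for transaction in seq:
            for size in range(1, max_size + 1):
                for combo in combinations(sorted(transaction), size):
                    seen.add(frozenset(combo))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c / n_customers >= min_support}

The mapping to contiguous integers is then simply an enumeration of the resulting set, e.g. litemset_ids = {ls: i for i, ls in enumerate(found_litemsets)}.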
Transformation Phase • Each transaction is replaced by the set of all litemsets contained in that transaction. • If a transaction contains no litemsets, it is dropped. • If a customer sequence contains no litemsets at all, the whole sequence is dropped. • Note: such a customer still counts towards the total number of customers. • Each litemset is then replaced by the integer it was mapped to in the litemset phase (see the sketch below).
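A hedged sketch of the transformation, reusing the hypothetical litemset_ids mapping from the previous step:

def transform(customer_sequences, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains.
    litemset_ids: {frozenset(itemset): integer id}.
    Drops empty transactions and customers whose sequence becomes empty
    (such customers still count towards the total number of customers)."""
    transformed = {}
    for cust, seq in customer_sequences.items():
        new_seq = []
        for transaction in seq:
            ids = {lid for ls, lid in litemset_ids.items() if ls <= transaction}
            if ids:
                new_seq.append(ids)
        if new_seq:
            transformed[cust] = new_seq
    return transformed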
Transformation Phase [figure: an example customer sequence before and after the transformation, with annotations marking which itemsets are "not a large itemset", which are "a large itemset", and the transformed sequence containing only large itemsets]
Sequence Phase • Use the litemsets from the transformed DB to find the desired sequences. • To achieve this: • Make multiple passes over the data. • In each pass, use the large sequences from the previous pass to generate new potentially large (candidate) sequences, as in ARM. • Count the support of the candidate sequences. • At the end of the pass, determine which candidate sequences are actually large, i.e. meet minimum support.
Maximal Phase • Find the maximal sequences among the set of large sequences. • Often combined with the Sequence Phase. • Let S = the set of all large sequences and n = the length of the longest sequence. • Algorithm to find maximal sequences (sketched below):
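A simple quadratic sketch of this step, reusing the contains() helper from the Terminology section; it is equivalent in effect to deleting, from the longest sequences down, every subsequence of an already-kept sequence, but the implementation details here are assumptions:

def maximal_sequences(large_sequences):
    """Keep only the sequences not contained in any other large sequence.
    large_sequences: a list of distinct sequences (each a list of item sets)."""
    return [s for s in large_sequences
            if not any(s is not t and contains(s, t) for t in large_sequences)]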
Sequence Phase: Detailed • Two approaches: • Count All: counts non-maximal sequences as well; they are pruned out in the Maximal Phase. • Count Some: tries to avoid counting non-maximal sequences by counting longer sequences first; the maximal-phase pruning is folded into the algorithm itself (no separate Maximal Phase).
Sequence Phase (Contd.) • Three Algorithms: • AprioriAll (Count All) • AprioriSome (Count Some) • DynamicSome (Count Some)
Algorithm AprioriAll • In the first pass, the output of the litemset phase is used to initialize the set of large 1-sequences; each subsequent pass starts from the large sequences found in the previous pass.
AprioriAll Candidate Generation • The apriori-generate function: • Join step: join Lk-1 with Lk-1. • Prune step: delete every candidate sequence c in Ck that has some (k-1)-subsequence not in Lk-1. • E.g. <1 2 4 3> is pruned because <2 4 3> is not in L3. (A sketch follows.)
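A hedged sketch of apriori-generate over transformed sequences, where a candidate k-sequence is a tuple of k litemset ids (the names and representation are assumptions made for illustration):

from itertools import combinations

def apriori_generate(large_prev):
    """Generate candidate k-sequences from the large (k-1)-sequences.
    large_prev: set of tuples of litemset ids, each of length k-1.
    Join: p and q agree on everything but their last element; the candidate
    is p extended with the last element of q. Prune: drop any candidate having
    a (k-1)-subsequence (one element removed) that is not in large_prev."""
    candidates = set()
    for p in large_prev:
        for q in large_prev:
            if p != q and p[:-1] == q[:-1]:
                candidates.add(p + (q[-1],))
    return {c for c in candidates
            if all(tuple(c[i] for i in idx) in large_prev
                   for idx in combinations(range(len(c)), len(c) - 1))}

For instance, if <2 4 3> is not among the large 3-sequences, the candidate <1 2 4 3> fails the prune step, matching the slide's example.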
Another Example • Minimum support = 40% (i.e., 2 customer sequences). • The Sequence Phase yields L1, L2, L3, L4. • Maximal Phase: the maximal large sequences <1 2 3 4>, <1 3 5> and <4 5> are the sequential patterns.
AprioriSome • Two phases: • Forward phase: count candidate sequences of only certain lengths, skipping the others. • Backward phase: for the skipped lengths, delete candidate sequences that are non-maximal (contained in an already-found longer large sequence), count the remaining candidates, and also delete non-maximal sequences found in the forward phase. (A high-level sketch follows.)
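A rough control-flow sketch of these two phases, under several simplifying assumptions: each transformed customer sequence is a plain tuple of litemset ids (one per transaction), count() and is_subseq() are hypothetical helpers defined here, apriori_generate() is the sketch from the AprioriAll slide, and next_length() stands in for the paper's next() heuristic (with k+1 it degenerates to counting every length):

def is_subseq(a, b):
    """True if tuple a occurs in tuple b as an order-preserving subsequence."""
    it = iter(b)
    return all(x in it for x in a)

def apriori_some(customer_seqs, L1, min_count, next_length=lambda k: k + 1):
    """Simplified AprioriSome control flow (illustration only)."""
    def count(c):
        return sum(1 for s in customer_seqs if is_subseq(c, s))

    C = {1: set(L1)}   # candidate sequences by length
    L = {1: set(L1)}   # large sequences, only for the lengths actually counted
    last, k = 1, 2
    # Forward phase: generate candidates for every length, count only some.
    while C[k - 1]:
        prev = L.get(k - 1, C[k - 1])   # large seqs if counted, else candidates
        C[k] = apriori_generate(prev)
        if C[k] and k == next_length(last):
            L[k] = {c for c in C[k] if count(c) >= min_count}
            last = k
        k += 1
    # Backward phase: from the longest length down, skip candidates already
    # contained in a known longer large sequence, count the rest, keep maximal.
    maximal = []
    for k in sorted(C, reverse=True):
        if k not in L:
            remaining = {c for c in C[k]
                         if not any(is_subseq(c, m) for m in maximal)}
            L[k] = {c for c in remaining if count(c) >= min_count}
        else:
            L[k] = {s for s in L[k]
                    if not any(is_subseq(s, m) for m in maximal)}
        maximal.extend(L[k])
    return maximal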
DynamicSome • Three phases: • Forward phase: count candidate sequences of only certain lengths. • Intermediate phase: generate candidates from the large sequences and the database (customer sequences), i.e. on-the-fly candidate generation. • Backward phase: delete non-maximal candidates from the forward phase, count candidate sequences of the lengths not used in the forward phase, and delete non-maximal candidate sequences found in this backward pass.
Candidate Generation • Differences in candidate generation: • AprioriAll counts candidate sequences of every length, including non-maximal ones. • AprioriSome counts sequences of only some lengths, generated from the previously found large sequences, and prunes non-maximal sequences. • DynamicSome generates candidates from the large sequences and the database (customer sequences) and does not prune non-maximal sequences.
Performance Evaluation: Execution Time (ET) vs. Minimum Support • ET increases as minimum support decreases, for all algorithms, because the number of large sequences grows. • DynamicSome performs worst: it generates and counts a larger number of candidates (no pruning of non-maximal sequences). • AprioriSome and AprioriAll are comparable; AprioriSome performs a little better at lower minimum support values.
Performance Evaluation (contd.) • AprioriAll counts all large sequences, including non-maximal ones (which are later pruned out). • AprioriSome avoids counting many non-maximal sequences by counting longer sequences first, which is its main advantage over AprioriAll.
Performance Evaluation: ET vs. Scale-Up • Scale-up experiment: execution time as the number of customers grows, for different numbers of transactions per customer and different levels of minimum support. • Excellent scale-up property: execution time scales linearly with the number of customers.
Performance Evaluation: Scale-Up with Transactions per Customer • The number of transactions per customer is increased while the number of items per transaction is kept fixed. • Execution time increases gradually. • The number of large sequences grows as the customer sequences get longer.
MSP vs. ARM • Similarities: • Both mine patterns of the form X -> Y, meaning that if X exists, Y also exists with a certain support. • Dissimilarities: • Order is important in MSP but not in ARM. • ARM finds intra-transaction patterns; MSP finds inter-transaction patterns (over a period of time). • ARM uses no timestamp.
Applications of MSP • Discovering frequent buying patterns (large retail organizations, catalog companies, etc.). • Analysis of patients' medical records to predict sequential symptoms of a disease. • Web-access patterns.
Conclusion • Advantages: • Complete: finds ALL sequential patterns. • Scaling: can handle very large databases with millions of customer sequences. • Disadvantage: • Cannot restrict discovery to patterns related to items of interest (i.e. CAR). • Future extensions: • Sequential patterns across item categories. • Pushing constraints into the discovery process: • Item constraints: sequential patterns with specific items. • Time constraints: minimum and maximum gap between adjacent elements of the patterns.
References • R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th Int'l Conference on Data Engineering, Taipei, Taiwan, March 1995. • R. Agrawal and R. Srikant. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th International Conference on Extending Database Technology (EDBT '96): Advances in Database Technology, Avignon, France, March 25-29, 1996. • M. Wojciechowski. Discovering and Processing Sequential Patterns in Databases. VII Conference on Extending Database Technology, PhD Workshop, Konstanz, Germany, 2000.