Explore the GSP algorithm for mining sequential patterns with customizable constraints and taxonomies. Comparison with AprioriAll algorithm and practical examples provided.
Mining Sequential Patterns: Generalizations and Performance Improvements • R. Srikant and R. Agrawal, IBM Almaden Research Center • Advisor: Dr. Hsu • Presented by: M.H. Lin
Outline • Motivation • Objective • Introduction • Problem Statement • The New Algorithm: GSP • Performance Evaluation • Conclusion • Personal Opinion
Motivation • The problem of mining sequential patterns was recently introduced. • Limitations of AprioriAll [Agrawal, 1995]: • Absence of time constraints • Rigid definition of a transaction • Absence of taxonomies
Objective • We present GSP, a new algorithm that discovers these generalized sequential patterns. • We empirically compare the performance of GSP with that of the AprioriAll algorithm.
Introduction • Instance • A database of sequences, called data-sequences • Each sequence is a list of transactions ordered by transaction-time • Each transaction is a set of items • Definitions: • A sequential pattern consists of a list of itemsets • Support: the number of data-sequences that contain the pattern • Problem: • To discover all the sequential patterns with a user-specified minimum support
Example of a Sequential Pattern • Database of a book club: each data-sequence corresponds to all book selections of a given customer, and each transaction contains the books that customer selected in one order • A sequential pattern: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’
Features of a Sequential Pattern • E.g.: 5% of customers bought ‘Foundation’, then ‘Foundation and Empire’ and ‘Ringworld’, then ‘Second Foundation’ • Maximum and/or minimum time gaps between adjacent elements • E.g.: the time between buying ‘Foundation’ and then buying ‘Foundation and Empire’ and ‘Ringworld’ should be within 3 months • A sliding time window over the sequence-pattern elements • E.g.: one week • Mo: BK-a, Sa: BK-b, next Su: BK-c • This data-sequence supports the pattern “BK-a” and “BK-b”, then “BK-c” • User-defined taxonomies • Example coming soon….
A User-defined Taxonomy • A customer who bought Foundation, then Perfect Spy, would support the following patterns: • Foundation, then Perfect Spy • Asimov, then Perfect Spy • Science Fiction, then Le Carre • …
The Old Algorithm: AprioriAll • A 3-phase algorithm • Phase 1: finds all frequent itemsets with minimum support • Phase 2: transforms the database such that each transaction contains only the frequent itemsets • Phase 3: finds sequential patterns • Pros: • Can discover all frequent sequential patterns • Cons: • Computationally expensive in both space and time • Not feasible to incorporate sliding windows
Problem Statement • Definitions: • Let I = {i_1, i_2, …, i_m} be a set of literals, called items • Let T be a directed acyclic graph on the literals. • An itemset is a non-empty set of items • A sequence is an ordered list of itemsets • We denote a sequence s by <s_1 s_2 … s_n>, where s_j is an itemset. • We denote an element of a sequence by (x_1, x_2, …, x_m), where x_j is an item. • A sequence <a_1 a_2 … a_n> is a subsequence of another sequence <b_1 b_2 … b_m> if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}. • E.g.: <(3)(4,5)(8)> is a subsequence of <(7)(3,8)(9)(4,5,6)(8)> • E.g.: <(3)(5)> is not a subsequence of <(3,5)>
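The subsequence definition and the support count can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `is_subsequence` and `support` are assumptions, sequences are represented as lists of tuples, and neither taxonomies nor time constraints are handled here.

```python
def is_subsequence(a, b):
    """Return True if sequence a is a subsequence of sequence b.

    Each sequence is a list of itemsets (any iterables of hashable items).
    """
    j = 0  # next position in b that is still available for matching
    for element in a:
        needed = set(element)
        # Greedy: take the earliest remaining element of b that covers `needed`.
        while j < len(b) and not needed <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1  # elements of a must map to strictly increasing positions in b
    return True


def support(pattern, database):
    """Count the data-sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, d) for d in database)


# The two examples from the slide.
print(is_subsequence([(3,), (4, 5), (8,)], [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))  # True
print(is_subsequence([(3,), (5,)], [(3, 5)]))  # False
```

Greedy earliest matching is sufficient for this plain definition, because matching an element as early as possible never rules out a later match.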
Problem Statement (contd.) • A data-sequence contains a sequence s if s is a subsequence of the data-sequence. • Plus taxonomies: • A transaction T contains an item x ∈ I if x is in T or x is an ancestor of some item in T. • Plus sliding windows: • A data-sequence d = <d_1 … d_m> contains a sequence s = <s_1 … s_n> if there exist integers l_1 ≤ u_1 < l_2 ≤ u_2 < … < l_n ≤ u_n such that • 1. s_i is contained in d_{l_i} ∪ … ∪ d_{u_i} (the union of transactions l_i through u_i), 1 ≤ i ≤ n, and • 2. transaction-time(d_{u_i}) – transaction-time(d_{l_i}) ≤ window-size, 1 ≤ i ≤ n • Plus time constraints: • 3. transaction-time(d_{l_i}) – transaction-time(d_{u_{i-1}}) > min-gap, 2 ≤ i ≤ n, and • 4. transaction-time(d_{u_i}) – transaction-time(d_{l_{i-1}}) ≤ max-gap, 2 ≤ i ≤ n.
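The four conditions above can be checked directly with an exhaustive search over the index pairs (l_i, u_i). The sketch below is a brute-force illustration under stated assumptions, not GSP's counting procedure: the function name `contains`, the (time, itemset) representation, and the day numbers in the example are all hypothetical, transaction-times are assumed to be distinct and sorted, and taxonomies are ignored.

```python
def contains(data, pattern, window=0, min_gap=0, max_gap=float("inf")):
    """Check whether a data-sequence contains a pattern under the
    sliding-window, min-gap and max-gap conditions.

    data:    list of (transaction_time, set_of_items), sorted by time
    pattern: list of itemsets
    """
    times = [t for t, _ in data]
    items = [set(x) for _, x in data]

    def search(i, prev_l, prev_u):
        # Try to place pattern element i after the previous window [prev_l, prev_u].
        if i == len(pattern):
            return True
        start = 0 if prev_u is None else prev_u + 1
        for l in range(start, len(data)):
            # Condition 3: gap from the end of the previous window must exceed min_gap.
            if prev_u is not None and times[l] - times[prev_u] <= min_gap:
                continue
            union = set()
            for u in range(l, len(data)):
                # Condition 2: the window may not span more than `window`.
                if times[u] - times[l] > window:
                    break
                # Condition 4: end of this window vs. start of the previous one.
                if prev_l is not None and times[u] - times[prev_l] > max_gap:
                    break
                union |= items[u]
                # Condition 1: the element is covered by the union of d_l..d_u.
                if set(pattern[i]) <= union and search(i + 1, l, u):
                    return True
        return False

    return search(0, None, None)


# Monday, Saturday, and the Sunday of the following week (assumed day numbers).
d = [(1, {"BK-a"}), (6, {"BK-b"}), (14, {"BK-c"})]
p = [{"BK-a", "BK-b"}, {"BK-c"}]
print(contains(d, p, window=7))              # True
print(contains(d, p, window=0))              # False
print(contains(d, p, window=7, max_gap=10))  # False
```

With `window=0` each element must come from a single transaction, which recovers the original, rigid definition of a transaction.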
Problem Definition • Input: • Database D: data-sequences • Taxonomy T: a DAG, not necessarily a tree • User-specified min-gap and max-gap time constraints • A user-specified sliding window size • A user-specified minimum support • Goal: • To find all sequences whose support is greater than the user-specified minimum support
Example • Minimum support: 2 data-sequences • With AprioriAll • <(Ringworld)(Ringworld Engineers)> • A sliding window of 7 days adds the pattern • <(Foundation, Ringworld)(Ringworld Engineers)> • Max-gap of 30 days • Both patterns are dropped • Adding the taxonomy, with no sliding window or time constraints, adds one pattern • <(Foundation)(Asimov)>
GSP: Basic Structure • Phase 1: makes the first pass over the database • To yield all the 1-element frequent sequences • Phase 2: the kth pass: • Starts with the seed set found in the (k-1)th pass to generate candidate sequences, each of which has one more item than a seed sequence • Makes a new pass over D to find the support of these candidate sequences • The frequent candidates become the seed for the next pass • Phase 3: terminates when • no more frequent sequences are found, or • no candidate sequences are generated
GSP: implementation • Generating Candidates: • To generate as few candidates as possible while maintaining completeness • Counting Candidates: • To determine the candidate sequence’s support • Implementing Taxonomies
Candidate Generation • Definitions: • k-sequence: a sequence with k items • L_k: the set of frequent k-sequences • C_k: the set of candidate k-sequences • Goal: given the set of all frequent (k-1)-sequences, generate a candidate set that is a superset of all frequent k-sequences • Algorithm: • Join phase: join L_{k-1} with L_{k-1}. s1 can join with s2 if the sequence obtained by dropping the first item of s1 is the same as the sequence obtained by dropping the last item of s2 • Prune phase: delete candidate sequences that have a contiguous (k-1)-subsequence whose support count is less than the minimum support
Candidate Generation: Example • Join phase: • <(1,2)(3)> joins with <(2)(3,4)> => <(1,2)(3,4)> • <(1,2)(3)> joins with <(2)(3)(5)> => <(1,2)(3)(5)> • Prune phase: • <(1,2)(3)(5)> is dropped because its contiguous subsequence <(1)(3)(5)> is not in L_3
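The join and prune steps can be sketched as follows. This is a minimal illustration under assumptions: sequences are tuples of sorted item tuples, the helper names are hypothetical, the set of frequent 3-sequences used in the example is assumed so that it reproduces the two joins and the prune shown on the slide, and the special handling needed when joining frequent 1-sequences with each other is left out.

```python
def drop_first(s):
    """Remove the first item of the first element of s."""
    rest = s[0][1:]
    return ((rest,) if rest else ()) + s[1:]


def drop_last(s):
    """Remove the last item of the last element of s."""
    rest = s[-1][:-1]
    return s[:-1] + ((rest,) if rest else ())


def join(s1, s2):
    """Join s1 with s2, or return None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) == 1:
        return s1 + ((last,),)               # last item was its own element in s2
    return s1[:-1] + (s1[-1] + (last,),)     # last item joins the last element of s1


def contiguous_subsequences(s):
    """All contiguous subsequences of s with exactly one item fewer."""
    result = []
    for i, element in enumerate(s):
        # An item may be dropped from the first or last element,
        # or from any element that has at least two items.
        if len(element) >= 2 or i in (0, len(s) - 1):
            for j in range(len(element)):
                rest = element[:j] + element[j + 1:]
                result.append(s[:i] + ((rest,) if rest else ()) + s[i + 1:])
    return result


def generate_candidates(frequent):
    """Join phase followed by prune phase (sketch for k >= 3)."""
    frequent = set(frequent)
    joined = set()
    for s1 in frequent:
        for s2 in frequent:
            candidate = join(s1, s2)
            if candidate is not None:
                joined.add(candidate)
    return {c for c in joined
            if all(sub in frequent for sub in contiguous_subsequences(c))}


# Assumed frequent 3-sequences, chosen to match the slide's example.
L3 = [((1, 2), (3,)), ((1, 2), (4,)), ((1,), (3, 4)),
      ((1, 3), (5,)), ((2,), (3, 4)), ((2,), (3,), (5,))]
print(generate_candidates(L3))  # {((1, 2), (3, 4))}
```

Here <(1,2)(3)(5)> is produced by the join but removed by the prune, because <(1)(3)(5)> is not among the assumed frequent 3-sequences.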
Counting Candidates • Problem: given a set of candidate sequences C and a data-sequence d, find all sequences in C that are contained in d. • Two techniques are used: • Hash-tree data structure: to reduce the number of candidates in C that need to be checked • Transforming the representation of the data-sequence d: to efficiently find whether a specific candidate is a subsequence of d
Hash-Tree Structure • Purpose: reducing the number of candidates that need to be checked • Leaf node: a list of sequences • Interior node: a hash table • Operations: • Adding candidate sequences to the hash-tree • Finding the candidates contained in a data-sequence • Min-gap • Max-gap • Sliding window size
Representation Transformation • Purpose: to efficiently find the first occurrence of an element • Transform each data-sequence into transaction-links, where each link is identified by one item • E.g.: max-gap = 30, min-gap = 5, window-size = 0, <(1,2)(3)(4)> • E.g.: window-size = 7, find (2,6) after time = 20
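One way to picture the transformed representation is a map from each item to the sorted list of transaction-times at which it occurs. The sketch below is an assumption-laden illustration: the names `build_links` and `find_element` are hypothetical, the data-sequence is made-up sample data (the slide's figure is not preserved), and using an inclusive lower bound on the repeated passes is a choice of this sketch.

```python
from bisect import bisect_left, bisect_right


def build_links(data):
    """Map each item to the sorted list of transaction-times containing it.

    data: list of (transaction_time, items), sorted by time
    """
    links = {}
    for t, items in data:
        for x in items:
            links.setdefault(x, []).append(t)
    return links


def find_element(links, element, after, window):
    """Find an occurrence of `element` after time `after` whose items
    all fall within `window`; return (start_time, end_time) or None."""
    bound, strict = after, True
    while True:
        found = []
        for item in element:
            times = links.get(item, [])
            # First pass: times strictly after `after`; later passes: times >= bound.
            k = bisect_right(times, bound) if strict else bisect_left(times, bound)
            if k == len(times):
                return None  # some item never occurs late enough
            found.append(times[k])
        start, end = min(found), max(found)
        if end - start <= window:
            return start, end
        # Too spread out: no valid window can start before end - window.
        bound, strict = end - window, False


# Hypothetical data-sequence: (time, items).
d = [(10, (1, 2)), (25, (4, 6)), (45, (3,)), (50, (1, 2)),
     (65, (3,)), (90, (2, 4)), (95, (6,))]
links = build_links(d)
print(find_element(links, (2, 6), after=20, window=7))  # (90, 95)
```

On this sample data the search first sees item 2 at time 50 and item 6 at time 25, which are too far apart, and it advances twice before settling on times 90 and 95.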
Implementing Taxonomies • Basic idea: • Replace each data-sequence d with an “extended sequence” d’, where each transaction d’_i contains all the items in the corresponding transaction d_i as well as all their ancestors. • E.g.: <(Foundation, Ringworld)(Second Foundation)> => <(Foundation, Ringworld, Asimov, Niven, Science Fiction)(Second Foundation, Asimov, Science Fiction)> • Optimizations • Pre-compute the ancestors of each item, and drop infrequent ancestors before each new pass • Do not count patterns with an element that contains both an item x and its ancestor y • Problem: redundancy • E.g.
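The extended-sequence idea can be sketched as below. The `PARENTS` map is an assumption reconstructed only from the items named in the example above, and the function names are hypothetical; each item maps to a list of parents because the taxonomy is a DAG rather than a tree.

```python
# Assumed fragment of the taxonomy: item -> list of direct parents.
PARENTS = {
    "Foundation": ["Asimov"],
    "Second Foundation": ["Asimov"],
    "Ringworld": ["Niven"],
    "Asimov": ["Science Fiction"],
    "Niven": ["Science Fiction"],
}


def ancestors(item, parents):
    """Return all ancestors of `item` by walking the DAG upwards."""
    seen = set()
    stack = list(parents.get(item, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen


def extend_sequence(sequence, parents):
    """Add to every transaction the ancestors of all its items."""
    extended = []
    for transaction in sequence:
        items = set(transaction)
        for x in transaction:
            items |= ancestors(x, parents)
        extended.append(items)
    return extended


d = [{"Foundation", "Ringworld"}, {"Second Foundation"}]
for transaction in extend_sequence(d, PARENTS):
    print(sorted(transaction))
```

Once data-sequences are extended this way, patterns at any level of the taxonomy can be counted with the same containment test that is used for plain items.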
Performance Evaluation • Comparison of GSP and AprioriAll • Result: GSP is 2 to 20 times faster • Contributing factors: • Fewer candidates • Directly finding the candidates • Scale-up: • GSP scales linearly with the number of data-sequences • Effects of time constraints and sliding windows: • There was no performance degradation
Conclusion • GSP is a Generalized Sequence Mining Algorithm • Discovering all the sequential patterns • Good Customizability • Has been incorporated into IBM’s data mining product
Personal Opinion • Hash-tree structure: main memory limitation • Multiple passes over the database • Apply GSP to CIS data