Explore sequential pattern mining in databases with a focus on frequent sequences using algorithms like Apriori, GSP, SPADE, and DFS-Mine. Learn the concepts of support, maximal sequences, and lattice structures. Discover how candidate sequences are examined and generated.
Mining Sequential Patterns Dimitrios Gunopulos, UCR
Finding Frequent Sequential Patterns • The problem: Given a sequence of discrete events that may repeat: A B A C D A C E B A B C… Find patterns that repeat frequently. • For example: A followed by B (A->B), or A followed by C (A->C) The patterns should occur within a window W. • Applications in telecommunication data, networks, biology
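As a rough illustration of "A followed by B within a window W", the sketch below counts how many occurrences of one event are followed by another event within the next W - 1 positions. The function name `count_followed_by` and this exact counting convention are assumptions made for the example; the slides do not fix how occurrences inside a window are counted.

```python
def count_followed_by(events, first, second, window):
    """Count occurrences of `first` that are followed by `second`
    within the same window of `window` consecutive events.

    Hypothetical helper: the counting convention is an assumption.
    """
    count = 0
    for i, event in enumerate(events):
        if event != first:
            continue
        # The window starts at position i, so only the next
        # (window - 1) events are inspected.
        if second in events[i + 1 : i + window]:
            count += 1
    return count


# The event sequence from the slide, with the trailing ellipsis dropped.
events = list("ABACDACEBABC")
a_then_b = count_followed_by(events, "A", "B", window=3)
a_then_c = count_followed_by(events, "A", "C", window=3)
print(a_then_b, a_then_c)
```

A larger window admits more matches, so the window size W acts as a knob between strict adjacency and loose co-occurrence.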
Sequences • Sequence ((T=90F) (H=60%, P=1.1atm)): the itemset (T=90F) occurs at time t1 and the itemset (H=60%, P=1.1atm) at a later time t2; each item is an attribute-value pair • k-sequence: sequence with k items • T1H2P1T3P2, P1T2H4P2T5: 5-sequences • S1 is a subsequence of S2 • T1P1T2 is a subsequence of H1T1P2H2P1T2 (T1 matches within H1T1, P1T2 within H2P1T2) • H1P1T2 H1T1P2H2P1T2
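The subsequence relation on sequences of itemsets can be sketched as follows. The representation (a list of Python sets) and the grouping of H1T1P2H2P1T2 into the itemsets (H1,T1) (P2) (H2,P1,T2) are assumptions made for illustration; the slide text does not preserve the original grouping.

```python
def is_subsequence(s1, s2):
    """Return True if every itemset of s1 is contained in a distinct
    itemset of s2, with the order of itemsets preserved.

    Sequences are lists of sets (an assumed representation).
    """
    pos = 0
    for itemset in s1:
        # Greedily advance to the first remaining itemset of s2
        # that contains the current itemset of s1.
        while pos < len(s2) and not itemset <= s2[pos]:
            pos += 1
        if pos == len(s2):
            return False
        pos += 1  # the matched itemset of s2 cannot be reused
    return True


# Assumed grouping of the longer sequence from the slide.
big = [{"H1", "T1"}, {"P2"}, {"H2", "P1", "T2"}]
small = [{"T1"}, {"P1", "T2"}]
print(is_subsequence(small, big))
```

Greedy matching is enough here: taking the earliest itemset that fits never rules out a later match.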
Sequential Patterns: The Problem • support or frequency of a sequence S, σ(S): the total number of times sequence S is encountered • user-specified minimum threshold min_sup • S is frequent: σ(S) ≥ min_sup • S is a maximal frequent sequence: S is frequent and all of its supersequences are non-frequent • S is a minimal non-frequent sequence: S is non-frequent and all of its subsequences are frequent • The problem • Given: database D and min_sup • Find all frequent sequences in D
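A minimal sketch of the support test, under two simplifying assumptions that are not in the slides: sequences are flat strings of single-character items (itemsets are ignored), and a pattern is counted once per database sequence that contains it. The toy database and the names `support` and `is_frequent` are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match,
    # so later items must be found further to the right.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain `pattern` (assumed definition)."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))


def is_frequent(pattern, database, min_sup):
    return support(pattern, database) >= min_sup


# Toy database, invented for the example.
database = ["ABCD", "ACBD", "ABD", "BCA"]
print(support("AB", database), support("CD", database))
```

With `min_sup = 3`, the pattern AB is frequent in this toy database and CD is not.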
Algorithms for Sequential Patterns • Apriori, GSP [Srikant, Agrawal, EDBT 1996] [Mannila, Toivonen, Verkamo, DMKD 1997] • SPADE, Parallel Spade [Zaki, 2001] • FreeSpan, PrefixSpan [Han et al, SIGKDD 2000], [Pei et al, ICDE 2001] • Sequential Patterns with constraints [Garofalakis et al, VLDB 99] • DFS-Mine [Tsoukatos and Gunopulos, SSTD 2001]
The Lattice Structure • Lemma: All subsequences of a frequent sequence are frequent
SPADE ([Zaki, 2001]) • Lattice-based approach • vertical id-list format • enumerates all frequent sequences • uses equivalence classes to decompose the problem: • two k-sequences belong to the same []i class if they have the same i-length prefix • each class fits in main memory • generates a (k+1)-sequence by intersecting the id-lists of two k-sequences that share a common (k-1)-length prefix • minimizes I/O cost with 2 database scans: • frequent 1-sequences, frequent 2-sequences
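The vertical id-list idea can be sketched as below. Each item (or sequence) keeps a list of (sequence id, event id) pairs, and a longer sequence is obtained by a temporal join of two id-lists. This is a simplified illustration, not SPADE itself: it handles only single-item events, and the names `to_vertical`, `temporal_join` and `support`, as well as the toy database, are assumptions.

```python
from collections import defaultdict


def to_vertical(database):
    """Convert {sid: [(eid, item), ...]} into {item: [(sid, eid), ...]}."""
    idlists = defaultdict(list)
    for sid, events in database.items():
        for eid, item in events:
            idlists[item].append((sid, eid))
    return idlists


def temporal_join(idlist_x, idlist_y):
    """Id-list of 'x, later followed by y': keep each (sid, eid) of y
    that has an occurrence of x with a smaller eid in the same sid."""
    earliest_x = {}
    for sid, eid in idlist_x:
        earliest_x[sid] = min(eid, earliest_x.get(sid, eid))
    return [(sid, eid) for sid, eid in idlist_y
            if sid in earliest_x and earliest_x[sid] < eid]


def support(idlist):
    """Number of distinct database sequences in an id-list."""
    return len({sid for sid, _ in idlist})


# Toy database, invented for the example.
database = {
    1: [(1, "A"), (2, "B"), (3, "C")],
    2: [(1, "B"), (2, "A"), (3, "C")],
    3: [(1, "A"), (2, "C"), (3, "B")],
}
v = to_vertical(database)
ab = temporal_join(v["A"], v["B"])   # 2-sequence A -> B
ac = temporal_join(v["A"], v["C"])   # 2-sequence A -> C
abc = temporal_join(ab, ac)          # both share the prefix A
print(support(ab), support(ac), support(abc))
```

Once the id-lists of the 2-sequences are in memory, longer sequences are counted by joins alone, which is why no further database scans are needed in this scheme.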
DFS-Mine • Depth-First-Search approach • fast discovery of long maximal frequent patterns • uses a minimal amount of memory • some sequences are deduced to be frequent from the lattice • candidate (k+1)-sequence: intersect a k-sequence with all frequent items (FreqItems) • in main memory: • S.Useless: set of items that sequence S must not be intersected with • MaxFreqList: List of Maximal Frequent Sequences • MinNonFreqList: List of Minimal Non Frequent Sequences • scan database to determine the support of candidate sequences
MaxFreqList - MinNonFreqList • [Figure: candidate BCD checked against ABCDE in MaxFreqList; candidate BCD checked against CD in MinNonFreqList] • Lemma: All subsequences of a frequent sequence are also frequent • S is inserted in MaxFreqList if: • S is not in MaxFreqList • S is not a subsequence of a sequence in MaxFreqList • S was scanned in database and was found to be frequent • Subsequences of S in MaxFreqList are removed. • S is inserted in MinNonFreqList if: • S is not in MinNonFreqList • S is not a supersequence of a sequence in MinNonFreqList • S was scanned in database and was found to be non-frequent • Supersequences of S in MinNonFreqList are removed.
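The insertion rules for the two lists can be sketched as follows. Sequences are simplified to flat strings, the function names are hypothetical, and the caller is assumed to have already determined by a database scan that the sequence is frequent (for `insert_maximal`) or non-frequent (for `insert_minimal`).

```python
def is_subsequence(small, big):
    """True if the items of `small` appear in `big` in order."""
    it = iter(big)
    return all(item in it for item in small)


def insert_maximal(seq, max_freq_list):
    """Insert a sequence that a database scan found to be frequent."""
    # Skip if seq is already listed or covered by a longer listed sequence.
    if any(is_subsequence(seq, m) for m in max_freq_list):
        return max_freq_list
    # Listed subsequences of seq are no longer maximal: remove them.
    kept = [m for m in max_freq_list if not is_subsequence(m, seq)]
    kept.append(seq)
    return kept


def insert_minimal(seq, min_non_freq_list):
    """Insert a sequence that a database scan found to be non-frequent."""
    # Skip if seq is already listed or is a supersequence of a listed one.
    if any(is_subsequence(m, seq) for m in min_non_freq_list):
        return min_non_freq_list
    # Listed supersequences of seq are no longer minimal: remove them.
    kept = [m for m in min_non_freq_list if not is_subsequence(seq, m)]
    kept.append(seq)
    return kept


max_list = insert_maximal("ABCDE", ["BCD", "AE"])
min_list = insert_minimal("CD", ["BCD"])
print(max_list, min_list)
```

Because a sequence is a subsequence of itself, the first check in each function also covers the "S is not already in the list" condition.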
Examining Candidate Sequences • k-sequence S is intersected with all items Ij in FreqItems - S.Useless • resulting sets SET(S+Ij) for all Ij • for each resulting sequence: • check MinNonFreqList • check MaxFreqList • scan database for all unknown sequences (if any) in SET(S+Ij) for all Ij (1 pass) • update MaxFreqList, MinNonFreqList
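The examination step might look roughly like this: candidates are first classified against the two in-memory lists, and only the ones that remain unknown are counted in a single pass over the database. Flat strings stand in for sequences, support is counted once per database sequence, and all names and data are assumptions for the example. The final list update is left out of the sketch.

```python
def is_subsequence(small, big):
    """True if the items of `small` appear in `big` in order."""
    it = iter(big)
    return all(item in it for item in small)


def classify(candidate, max_freq_list, min_non_freq_list):
    """Decide a candidate's status from the lists alone, if possible."""
    # A supersequence of a known non-frequent sequence is non-frequent.
    if any(is_subsequence(m, candidate) for m in min_non_freq_list):
        return "non-frequent"
    # A subsequence of a known frequent sequence is frequent (the lemma).
    if any(is_subsequence(candidate, m) for m in max_freq_list):
        return "frequent"
    return "unknown"


def examine(candidates, max_freq_list, min_non_freq_list, database, min_sup):
    status = {c: classify(c, max_freq_list, min_non_freq_list)
              for c in candidates}
    unknown = [c for c in candidates if status[c] == "unknown"]
    # One pass over the database counts all unknown candidates together.
    counts = dict.fromkeys(unknown, 0)
    for seq in database:
        for c in unknown:
            if is_subsequence(c, seq):
                counts[c] += 1
    for c in unknown:
        status[c] = "frequent" if counts[c] >= min_sup else "non-frequent"
    return status, unknown


# Toy data, invented for the example.
database = ["ABCDE", "ABCE", "ACDE", "BD"]
status, scanned = examine(["AB", "DBA", "CE", "EA"],
                          max_freq_list=["ABC"],
                          min_non_freq_list=["DB"],
                          database=database,
                          min_sup=2)
print(status, scanned)
```

In this run only CE and EA require the database pass; AB and DBA are settled from the lists.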
Generating sequences • k-sequence S + Ij in FreqItems - S.Useless = candidate (k+1)-sequences • ABCD + E: 1. EABCD 2. AEBCD 3. AEBCD 4. ABECD 5. ABECD 6. ABCED 7. ABCED 8. ABCDE 9. ABCDE • ABCD + D: 1. DABCD 2. ADBCD 3. ADBCD 4. ABDCD 5. ABDCD 6. ABCDD 7. ABCDD 8. ABCDD 9. ABCDD • insert item Ij in all possible positions that follow its rightmost occurrence in the k-sequence S. If the item does not occur at all in the sequence, then it is inserted in all positions.
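The insertion rule can be sketched for flat strings as below. This is a simplification: the slide lists nine numbered candidates per example, several of which read identically in the extracted text, whereas the sketch treats a sequence as a plain string and therefore yields each distinct string once. The function name `generate_candidates` is hypothetical.

```python
def generate_candidates(seq, item):
    """Insert `item` at every position after its rightmost occurrence
    in `seq`, or at every position if it does not occur at all."""
    # str.rfind returns -1 when the item is absent, so start becomes 0.
    start = seq.rfind(item) + 1
    return [seq[:pos] + item + seq[pos:]
            for pos in range(start, len(seq) + 1)]


with_e = generate_candidates("ABCD", "E")
with_d = generate_candidates("ABCD", "D")
print(with_e, with_d)
```

Restricting insertions to positions after the rightmost occurrence avoids producing the same string twice when the item already appears in the sequence.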
Useless Set of a sequence S • after intersecting S with item Ij, Ij is inserted in S.Useless • when intersecting S with item Ij, all items Ik (k < j) are in S.Useless • S.Useless is 'inherited' by the (k+1)-sequences produced • [Figure: Sequence S with SET(S+A), SET(S+B), SET(S+D), SET(S+E) and, one level further, SET(S+A,A), SET(S+A,B)=SET(S+B,A), SET(S+B,B), SET(S+D,E)] • [Figure, in slide order: DAB + E: EDAB, DEAB, DEAB, DAEB, DAEB, DABE, DABE; "Bound to be not frequent"; Scenario 1; AB + D: DAB, ADB, ADB, ABD, ABD; "not frequent"; AB + E: EAB, AEB, AEB, ABE, ABE; Scenario 2]
Open Problems • Output-subexponential algorithms for maximal sequential patterns • Efficient algorithms for finding episodes (approximate sequential patterns – edit distance)
Spatiotemporal Datasets • [Figures: Temperature Map US; Snow-ice-rain radar US; Snow-ice-rain radar NE; Precipitation radar Bay Area; Precipitation radar Lakes]
Mining Spatiotemporal Data • CONQUEST, [Stolorz et al, KDD 1995] • Patterns in global climate change • SKICAT, [Fayyad et al, 1996] • Image processing techniques and classification techniques to identify objects in satellite pictures • GeoMiner [Han et al, 1997] • MultiMediaMiner, [Zaiane et al, 1998] • Data Cube structure. Mining of association and classification rules. • DFS-Mine, [Tsoukatos et al, 2001] • Discovery of spatiotemporal patterns
Open Problems • Similarity models and indexing techniques for higher-dimensional time series • Efficient trend detection/subsequence matching algorithms • Algorithms to capture the data distribution when it changes over time • New models for capturing the evolution of spatial phenomena over time