Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos Presented by George Liu / Luis L. Perez
Time series? • Definition • Applications • Financial markets • Weather forecasting • Healthcare
What kind of problem are we trying to solve? • Whole sequence matching • Given a database S with n sequences, all of them equally long, and a query sequence Q of the same length. • Find all sequences in S that match Q. • Subsequence matching • Given a database S with n sequences, with potentially different lengths, and a query sequence Q. • Find all sequences in S that contain a subsequence matching Q.
Useful notation • Given a sequence S • Len(S) denotes the length of the sequence • S[i] denotes the ith element • S[i:j] denotes the subsequence between S[i] and S[j] • Given two sequences, S and Q • D(S,Q) denotes the distance between S and Q. • Euclidean • Distance bound: e • Max. distance for two sequences to be considered “equal”
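The notation above can be sketched in a few lines of Python. The function names `dist` and `similar` are hypothetical, chosen only for illustration, and the sketch assumes that two sequences count as "equal" when their Euclidean distance is at most e.

```python
import math

def dist(s, q):
    """Euclidean distance D(S, Q) between two equal-length sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def similar(s, q, eps):
    """True when S and Q are within the distance bound e (assumed inclusive)."""
    return dist(s, q) <= eps
```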
Naïve approaches • Sequential scanning • Clearly unfeasible • R-tree • Might work, but dimensionality is extremely high (proportional to sequence length) • Poor performance • What can we do to improve performance?
Dimensionality reduction • Redundant data, lots of patterns • Feature extraction • Data transformation • Cosine • Wavelet • Fourier <-- we'll focus on this.
Discrete Fourier Transform • Map a sequence x in the time domain to a sequence X in the frequency domain • Reversible! • Fast and easy-to-implement algorithms • Energy preservation property • Key concept in dimensionality reduction. • Just keep the first 2 or 3 coefficients.
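Parseval's theorem • Let S and Q be the original sequences, and S' and Q' their DFTs. • With all coefficients kept, the DFT preserves Euclidean distance: D(S,Q) = D(S',Q') • Why is this important? • If S' and Q' keep only the first few coefficients, the feature-space distance can only underestimate the true one: D(S',Q') <= D(S,Q). Remember the bound e. • D(S,Q) < e ---> D(S', Q') < e • We will get no false dismissals (false alarms are possible and are filtered out later).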
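A small numerical check may make the distance argument more concrete. This is a minimal sketch, assuming a DFT normalised by 1/sqrt(n) (the normalisation under which Euclidean distance is preserved exactly); the helper names `dft` and `cdist` and the sample data are made up for illustration.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT with 1/sqrt(n) normalisation (assumed here)."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * f * t / n) for t, v in enumerate(x))
        / math.sqrt(n)
        for f in range(n)
    ]

def cdist(a, b):
    """Euclidean distance that also works for complex coefficients."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

# Two made-up sequences of equal length.
s = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
q = [1.5, 2.5, 2.0, 4.0, 4.5, 6.5, 5.0, 6.0]

S, Q = dft(s), dft(q)
d_time = cdist(s, q)            # distance in the time domain
d_full = cdist(S, Q)            # all coefficients: same distance
d_trunc = cdist(S[:3], Q[:3])   # first 3 coefficients: never larger
```

Because `d_trunc` can only drop non-negative terms from the sum, any pair within e in the time domain is also within e in feature space, which is the "no false dismissals" property.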
Subsequence Matching • The problem: • You are given a collection of N sequences of real numbers (S1, S2, ..., SN), potentially of different lengths. • The user specifies a query subsequence Q and the tolerance e, the max. acceptable dissimilarity. • You want to return all the sequences Si, along with the correct offsets k, that contain a subsequence within distance e of Q. • Solutions: • many!
Possible Solutions • 1) Brute Force method - Sequentially scan every possible subsequence of the data sequences for a match. • 2) I-Naive - Transform all subsequences to points in feature space and store those points in an R-tree. • 3) ST-Index - Transform all subsequences to points in feature space. Store MBRs of sub-trails in an R*-tree. • Note: I-Naive and ST-Index are similar in the initial steps.
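As a baseline, the brute-force method can be sketched as follows. The function name `sequential_scan` and the return format (pairs of sequence index and offset) are assumptions made for this example.

```python
import math

def sequential_scan(sequences, query, eps):
    """Check every offset of every sequence; return (sequence index, offset) pairs."""
    w = len(query)
    matches = []
    for idx, s in enumerate(sequences):
        for k in range(len(s) - w + 1):
            # Euclidean distance between the query and the window starting at k.
            d = math.sqrt(sum((s[k + i] - query[i]) ** 2 for i in range(w)))
            if d <= eps:
                matches.append((idx, k))
    return matches
```

Every window is compared in full, so the cost grows with the total number of points times the query length, which is what the indexed methods try to avoid.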
Possible Solutions I-naive • *Assume that the min. query length is w. w depends on the application (e.g., stock-market analysts interested in weekly/monthly patterns use a larger w). • Procedure: • 1) Use a "sliding window" of size w to extract every subsequence of length w from a sequence S. • 2) Use the DFT to map each subsequence of size w to a point in feature space. • 3) This produces a trail of Len(S)-w+1 points.
Possible Solutions I-naive • Procedure cont: • 4) Store all the points of the trails in feature space in a spatial access method. (R*-tree) • 5) When presented with a query of length w and tolerance e, extract the features of the query and perform a range query with radius e on the spatial access method. • 6) Discard false alarms by retrieving all those subsequences and calculating their actual distance from the query. • Note: Very, very slow approach. Worse than Sequential Scan. You have a large R*-tree (tall and slow).
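The I-naive procedure can be sketched end to end as below. This is only an illustration under stated assumptions: a linear scan over the trail stands in for the R*-tree range query, the DFT uses 1/sqrt(n) normalisation, and the names `features`, `trail` and `i_naive_search` are hypothetical.

```python
import cmath
import math

def features(window, fc=3):
    """First fc DFT coefficients (1/sqrt(n) normalisation) as 2*fc real values."""
    n = len(window)
    point = []
    for f in range(fc):
        c = sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(window)) / math.sqrt(n)
        point += [c.real, c.imag]
    return point

def trail(s, w, fc=3):
    """Slide a window of length w over s: one feature point per offset."""
    return [features(s[k:k + w], fc) for k in range(len(s) - w + 1)]

def i_naive_search(s, query, eps, fc=3):
    """Range query in feature space, then verify on the raw data."""
    w = len(query)
    q = features(query, fc)
    hits = []
    for k, p in enumerate(trail(s, w, fc)):    # stand-in for the R*-tree lookup
        if math.dist(p, q) <= eps:             # candidate (may be a false alarm)
            if math.dist(s[k:k + w], query) <= eps:   # actual distance check
                hits.append(k)
    return hits
```

A sequence of length Len(S) yields Len(S)-w+1 points, one per window offset, which is why the resulting index becomes so large.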
Possible Solutions ST-Index • *Assume that the min. query length is w. w depends on the application (e.g., stock-market analysts interested in weekly/monthly patterns use a larger w). • Procedure: • 1) Use a "sliding window" of size w to extract every subsequence of length w from a sequence S. • 2) Use the DFT to map each subsequence of size w to a point in feature space. • 3) This produces a trail of Len(S)-w+1 points.
Possible Solutions ST-Index • Procedure cont. • 4) Divide the trail of points in feature space into sub-trails. (algorithm mentioned later) • 5) Represent each sub-trail by its MBR (minimum bounding rectangle). • 6) Store the MBRs in a spatial access method. (e.g., R*-tree)
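Steps 4 to 6 can be illustrated with a short sketch. It assumes fixed-size sub-trails purely for simplicity and returns a plain list where a real implementation would insert into an R*-tree; `mbr` and `build_index` are hypothetical names.

```python
def mbr(points):
    """Minimum bounding rectangle of a list of points: (low corner, high corner)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high

def build_index(trail, size):
    """Cut the trail into sub-trails of `size` points; keep one MBR per sub-trail.

    Each entry is (mbr, first offset, last offset), so that the matching
    subsequences can be fetched later for verification.
    """
    entries = []
    for start in range(0, len(trail), size):
        chunk = trail[start:start + size]
        entries.append((mbr(chunk), start, start + len(chunk) - 1))
    return entries
```

Storing one rectangle per sub-trail instead of one point per offset is what keeps the index small.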
Insertions • Problem: How do we divide these trails into sub-trails? • Two heuristics: • 1) Every sub-trail has a predetermined, fixed number of points. (I-fixed) • 2) Every sub-trail has a predetermined, fixed length. (I-fixed) • Solution: Use an "adaptive heuristic." (I-adaptive)
I-adaptive Algorithm • Based on the idea of the marginal cost of a point in terms of disk accesses. • Marginal cost (mc) = disk accesses for a given MBR / number of points k in that MBR • Algorithm:
Assign the first point of the trail to a sub-trail.
FOR each successive point
    IF it increases the marginal cost of the current sub-trail
    THEN start another sub-trail
    ELSE include it in the current sub-trail
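The greedy rule can be sketched in Python as follows. The slide does not say how disk accesses are estimated, so this sketch assumes a common estimate for an MBR with side lengths L_i in a feature space normalised to the unit hypercube: the product of (L_i + 0.5). Treat that formula and the names `disk_accesses` and `i_adaptive` as assumptions.

```python
def disk_accesses(low, high):
    """Assumed cost estimate: product of (side length + 0.5) over all dimensions."""
    cost = 1.0
    for lo, hi in zip(low, high):
        cost *= (hi - lo) + 0.5
    return cost

def i_adaptive(trail):
    """Greedily split a trail into sub-trails using the marginal cost per point."""
    subtrails = [[trail[0]]]
    low, high = list(trail[0]), list(trail[0])
    for p in trail[1:]:
        k = len(subtrails[-1])
        old_mc = disk_accesses(low, high) / k
        # MBR of the current sub-trail if p were added to it.
        new_low = [min(a, b) for a, b in zip(low, p)]
        new_high = [max(a, b) for a, b in zip(high, p)]
        new_mc = disk_accesses(new_low, new_high) / (k + 1)
        if new_mc > old_mc:
            # p would raise the marginal cost: start another sub-trail.
            subtrails.append([p])
            low, high = list(p), list(p)
        else:
            subtrails[-1].append(p)
            low, high = new_low, new_high
    return subtrails
```

Points that stay close together keep the MBR small and lower the cost per point, so they share a sub-trail; a jump in feature space inflates the MBR and triggers a split.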
Searching • Consider the window length w and distance bound e. • Let Q be the query sequence. • If Len(Q) = w, it's all good. • Algorithm Search_Short: • Use DFT to map Q to a point q in feature space. The query region is a sphere of radius e centered at q. • Retrieve all the sub-trails whose MBRs intersect the query region using our index. • Examine the corresponding subsequences and throw away false alarms.
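The intersection test at the heart of Search_Short can be sketched like this. A sphere of radius e around q intersects an MBR exactly when the minimum distance from q to the rectangle is at most e. The linear scan below is a stand-in for the index lookup, and `min_dist_to_mbr` and `search_short` are hypothetical names; entries are assumed to be (low corner, high corner) pairs.

```python
import math

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    total = 0.0
    for x, lo, hi in zip(q, low, high):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
        # Inside the rectangle's extent in this dimension: contributes 0.
    return math.sqrt(total)

def search_short(q_point, eps, entries):
    """Indices of the MBRs that intersect the sphere of radius eps around q_point."""
    return [i for i, (low, high) in enumerate(entries)
            if min_dist_to_mbr(q_point, low, high) <= eps]
```

The returned sub-trails are only candidates; each one still has to be checked against the raw data to discard false alarms.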
Searching • Now, what if Len(Q) > w? • Requires more analysis, but basically assume that Len(Q) = p*w • So we can split Q into p subsequences of length w. • What about the radius? r = e/sqrt(p)
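The radius e/sqrt(p) follows from a short counting argument. The derivation below is a sketch in which q_1, ..., q_p denote the p pieces of Q, each of length w, and s_1, ..., s_p the aligned pieces of a matching subsequence S; these symbols are introduced here only for illustration.

```latex
% Squared Euclidean distance splits over the p aligned pieces:
D^2(S, Q) = \sum_{i=1}^{p} D^2(s_i, q_i) \le e^2 .
% Suppose every piece were farther than e/\sqrt{p}:
D(s_i, q_i) > \frac{e}{\sqrt{p}} \ \text{for all } i
\;\Longrightarrow\;
\sum_{i=1}^{p} D^2(s_i, q_i) > p \cdot \frac{e^2}{p} = e^2 ,
% which contradicts D(S, Q) \le e. Hence at least one piece satisfies
D(s_i, q_i) \le \frac{e}{\sqrt{p}} .
```

So searching each piece with radius e/sqrt(p) and taking the union of the results cannot miss a true match.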
Searching • So we have... • Algorithm Search_Long: • Break sequence Q into p sub-queries of length w, each with radius e/sqrt(p) • Retrieve from the index all the sub-trails whose MBRs intersect at least one of the sub-query regions. • Examine the corresponding subsequences, discard false alarms.
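Search_Long can be sketched along the same lines. The sketch assumes Len(Q) is at least w and ignores any leftover tail beyond p*w points; `to_point` is a hypothetical callback that maps a window of length w to its feature point (for example, the first few DFT coefficients), and a linear scan again stands in for the index.

```python
import math

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    total = 0.0
    for x, lo, hi in zip(q, low, high):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
    return math.sqrt(total)

def search_long(query, eps, w, to_point, entries):
    """Return sorted (MBR index, piece index) candidate pairs for a long query."""
    p = len(query) // w
    if p == 0:
        raise ValueError("query must be at least w points long")
    radius = eps / math.sqrt(p)          # smaller radius per sub-query
    candidates = set()
    for i in range(p):
        q = to_point(query[i * w:(i + 1) * w])
        for idx, (low, high) in enumerate(entries):
            if min_dist_to_mbr(q, low, high) <= radius:
                candidates.add((idx, i))
    return sorted(candidates)
```

Keeping the piece index next to each candidate lets the verification step work out where the full query would have to start in the data sequence.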
Experimental results • Stock price database with ~300,000 points • 1 number = 4 bytes • DFT keeping first 3 coefficients (actually 6 features: real and imaginary parts) • w = 512 bytes • R*-tree
Experimental results • Space • Naïve methods: 24 MB • This method: 5 KB • Time - “short” queries (Len(Q) = w) • 3 to 100 times better response times • Time - “long” queries (Len(Q) > w) • 10 to 100 times better response times