
Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos


Presentation Transcript


  1. Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos Presented by George Liu / Luis L. Perez

  2. Time series? • Definition • Applications • Financial markets • Weather forecasting • Healthcare

  3. What kind of problem are we trying to solve? • Whole sequence matching • Given a database S with n sequences, all of them equally long, and a query sequence Q of the same length. • Find all sequences in S that match with Q. • Subsequence matching • Given a database S with n sequences, with potentially different lengths, and a query sequence Q. • Find all sequences in S that contain Q.

  4. Useful notation • Given a sequence S • Len(S) denotes the length of the sequence • S[i] denotes the ith element • S[i:j] denotes the subsequence from S[i] to S[j] • Given two sequences S and Q • D(S,Q) denotes the (Euclidean) distance between S and Q • Distance bound e: the maximum distance at which two sequences are still considered "equal"
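In code, this notation boils down to a Euclidean distance plus a tolerance check. A minimal Python sketch (the function names are ours, not the paper's):

```python
import numpy as np

def distance(s, q):
    """D(S, Q): Euclidean distance between two equal-length sequences."""
    s, q = np.asarray(s, dtype=float), np.asarray(q, dtype=float)
    return float(np.linalg.norm(s - q))

def match(s, q, eps):
    """Two sequences are considered 'equal' when D(S, Q) <= eps."""
    return distance(s, q) <= eps
```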

  5. Naïve approaches • Sequential scanning • Clearly infeasible • R-tree • Might work, but the dimensionality is extremely high (proportional to the sequence length) • Poor performance • What can we do to improve performance?

  6. Dimensionality reduction • Redundant data, lots of patterns • Feature extraction • Data transformation • Cosine • Wavelet • Fourier <-- we'll focus on this.

  7. Discrete Fourier Transform • Maps a sequence x in the time domain to a sequence X in the frequency domain • Reversible! • Fast, easy-to-implement algorithms • Energy-preservation property • Key concept for dimensionality reduction: just keep the first 2 or 3 coefficients.
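As a concrete sketch (assuming NumPy; the function name and the choice of the orthonormal DFT are ours), mapping a window to its first few coefficients might look like:

```python
import numpy as np

def dft_features(x, n_coeffs=3):
    """Map a time-domain sequence to its first n_coeffs DFT coefficients.
    The orthonormal DFT (norm='ortho') preserves Euclidean distances
    (Parseval), which the next slide relies on.  Each complex coefficient
    contributes (real, imag), so 3 coefficients give a 6-D feature point."""
    X = np.fft.fft(np.asarray(x, dtype=float), norm="ortho")
    return np.concatenate([X[:n_coeffs].real, X[:n_coeffs].imag])
```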

  8. Parseval's theorem • Let S and Q be the original sequences and S', Q' the sequences after applying the DFT; then D(S,Q) = D(S',Q') • Why is this important? • Keeping only the first few coefficients can only underestimate the distance, so with the bound e: D(S,Q) <= e => D(S',Q') <= e • We will get no false dismissals.
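A quick empirical check of the no-false-dismissals argument, in the same sketch form (random data is only for illustration):

```python
import numpy as np

def dft_features(x, n_coeffs=3):
    X = np.fft.fft(np.asarray(x, dtype=float), norm="ortho")
    return np.concatenate([X[:n_coeffs].real, X[:n_coeffs].imag])

rng = np.random.default_rng(0)
for _ in range(1000):
    s, q = rng.normal(size=128), rng.normal(size=128)
    d_true = np.linalg.norm(s - q)                              # D(S, Q)
    d_feat = np.linalg.norm(dft_features(s) - dft_features(q))  # D(S', Q')
    # Dropping all but the first few coefficients can only shrink the
    # distance, so D(S,Q) <= e implies D(S',Q') <= e: no false dismissals.
    assert d_feat <= d_true + 1e-9
```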

  9. Subsequence Matching • The problem: • You are given a collection of N sequences of real numbers (S1, S2, ..., SN), potentially of different lengths. • The user specifies a query subsequence Q and the tolerance e, the maximum acceptable dissimilarity. • We want to return all the sequences, along with the offsets k at which they match the query within e. • Solutions: • many!

  10. Possible Solutions • 1) Brute force: sequentially scan every possible subsequence of the data sequences for a match. • 2) I-naive: transform all subsequences to points in feature space and store those points in an R-tree. • 3) ST-Index: transform all subsequences to points in feature space and store the MBRs of sub-trails in an R*-tree. • Note: I-naive and ST-Index share the same initial steps.

  11. Possible Solutions: I-naive • *Assume the minimum query length is w; w changes with the application (e.g., stock-market users interested in weekly/monthly patterns would pick a larger w). • Procedure: • 1) Use a "sliding window" to extract every subsequence of length w from each data sequence. • 2) Apply the DFT to map each window to a point in feature space. • 3) This produces a trail of Len(S)-w+1 points.

  12. Possible Solutions: I-naive • Procedure cont.: • 4) Store all the points of the trails in a spatial access method (e.g., an R*-tree). • 5) Given a query of length w and tolerance e, extract the features of the query and perform a spatial range query with radius e. • 6) Discard false alarms by retrieving the candidate subsequences and computing their actual distance from the query. • Note: very slow in practice, worse than sequential scan, because storing one point per offset makes the R*-tree large (tall and slow). A sketch of steps 1–3 follows below.
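Steps 1–3 (shared by I-naive and ST-Index) amount to sliding a window over each data sequence and extracting its features; a minimal sketch (names are ours):

```python
import numpy as np

def sliding_window_trail(sequence, w, n_coeffs=3):
    """Slide a window of length w over the data sequence and map each
    window to a point in feature space (first n_coeffs DFT coefficients).
    Returns the trail: len(sequence) - w + 1 feature points."""
    seq = np.asarray(sequence, dtype=float)
    trail = []
    for offset in range(len(seq) - w + 1):
        X = np.fft.fft(seq[offset:offset + w], norm="ortho")
        trail.append(np.concatenate([X[:n_coeffs].real, X[:n_coeffs].imag]))
    return np.array(trail)
```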

  13. Possible Solutions: ST-Index • *Assume the minimum query length is w; w changes with the application (e.g., stock-market users interested in weekly/monthly patterns would pick a larger w). • Procedure: • 1) Use a "sliding window" to extract every subsequence of length w from each data sequence. • 2) Apply the DFT to map each window to a point in feature space. • 3) This produces a trail of Len(S)-w+1 points.

  14. Possible Solutions: ST-Index • Procedure cont.: • 4) Divide the trail of points in feature space into sub-trails (the algorithm is described later). • 5) Represent each sub-trail by its MBR (minimum bounding rectangle), as sketched below. • 6) Store the MBRs in a spatial access method (e.g., an R*-tree).
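Step 5 is just a per-dimension min/max over each sub-trail; in sketch form:

```python
import numpy as np

def mbr(subtrail):
    """Minimum bounding rectangle of a sub-trail: the per-dimension
    lower and upper corners of its points in feature space."""
    pts = np.asarray(subtrail, dtype=float)
    return pts.min(axis=0), pts.max(axis=0)
```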

  15.–19. MBRs in F-Dimension (figure slides: the trail in feature space is divided into sub-trails, each covered by an MBR)

  20. Insertions • Problem: how do we divide these trails into sub-trails? • Two simple heuristics: • 1) Every sub-trail contains a predetermined, fixed number of points (I-fixed). • 2) Every sub-trail covers a predetermined, fixed length. • Solution: use an "adaptive" heuristic instead (I-adaptive).

  21. I-adaptive Algorithm • Based on the marginal cost of a point in terms of disk accesses: marginal cost (mc) = (estimated disk accesses for an MBR) / (number of points k in that MBR) • Algorithm: assign the first point of the trail to a sub-trail. FOR each successive point: IF including it would increase the marginal cost of the current sub-trail THEN start a new sub-trail ELSE include it in the current sub-trail (a sketch follows slide 22).

  22. I-adaptive Algorithm
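A sketch of the I-adaptive split. The prod(L_i + 0.5) cost estimate below is the flavor of disk-access model the paper uses; treat the exact formula here as illustrative rather than the paper's definitive cost function:

```python
import numpy as np

def mbr_cost(low, high):
    """Rough estimate of the disk-access cost of an MBR with side
    lengths L_i (illustrative cost model: prod(L_i + 0.5))."""
    return float(np.prod((high - low) + 0.5))

def adaptive_split(trail):
    """I-adaptive heuristic: grow the current sub-trail point by point,
    starting a new sub-trail whenever adding the next point would
    increase the marginal cost (MBR cost / number of points)."""
    trail = np.asarray(trail, dtype=float)
    subtrails = [[trail[0]]]
    for point in trail[1:]:
        current = subtrails[-1]
        old_low, old_high = np.min(current, axis=0), np.max(current, axis=0)
        new_low, new_high = np.minimum(old_low, point), np.maximum(old_high, point)
        old_mc = mbr_cost(old_low, old_high) / len(current)
        new_mc = mbr_cost(new_low, new_high) / (len(current) + 1)
        if new_mc > old_mc:
            subtrails.append([point])    # start a new sub-trail
        else:
            current.append(point)        # extend the current sub-trail
    return [np.asarray(st) for st in subtrails]
```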

  23. Searching • Consider the window length w and the distance bound e. • Let Q be the query sequence. • If Len(Q) = w, it's all good: • Algorithm Search_Short: • Use the DFT to map Q to a point q in feature space; the query region is a sphere of radius e around q. • Retrieve, via the index, all sub-trails whose MBRs intersect the query region. • Throw away false alarms by checking the actual distances.
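A self-contained sketch of Search_Short, with a linear scan over the trail standing in for the R*-tree range query (the names are ours):

```python
import numpy as np

def dft_features(x, n_coeffs=3):
    X = np.fft.fft(np.asarray(x, dtype=float), norm="ortho")
    return np.concatenate([X[:n_coeffs].real, X[:n_coeffs].imag])

def search_short(data_seq, query, eps, n_coeffs=3):
    """Find all offsets where the window of length len(query) matches the
    query within eps.  The feature-space filter keeps every true match
    (no false dismissals); the final loop removes the false alarms."""
    data = np.asarray(data_seq, dtype=float)
    query = np.asarray(query, dtype=float)
    w = len(query)
    q_point = dft_features(query, n_coeffs)
    candidates = [off for off in range(len(data) - w + 1)
                  if np.linalg.norm(dft_features(data[off:off + w], n_coeffs)
                                    - q_point) <= eps]
    return [off for off in candidates
            if np.linalg.norm(data[off:off + w] - query) <= eps]
```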

  24. Searching • Now, what if Len(Q) > w? • Requires more analysis, but basically write Len(Q) = p*w • So we can split Q into p subsequences, each of length w. • What about the radius? Each sub-query is searched with r = e/sqrt(p)

  25. Searching • So we have... • Algorithm Search_Long: • Break the query Q into p sub-queries, each searched with radius e/sqrt(p) • Retrieve from the index all sub-trails whose MBRs intersect at least one of the sub-query regions. • Examine the corresponding subsequences and discard false alarms.
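And Search_Long in the same sketch form (again a plain scan over the trail stands in for the R*-tree; names are ours):

```python
import numpy as np

def dft_features(x, n_coeffs=3):
    X = np.fft.fft(np.asarray(x, dtype=float), norm="ortho")
    return np.concatenate([X[:n_coeffs].real, X[:n_coeffs].imag])

def search_long(data_seq, query, eps, w, n_coeffs=3):
    """Split the query into p pieces of length w, run each piece as a
    range query with the reduced radius eps/sqrt(p), take the union of
    the implied start offsets, then discard false alarms using the true
    distance over the full query.  Assumes len(query) is a multiple of w."""
    data = np.asarray(data_seq, dtype=float)
    query = np.asarray(query, dtype=float)
    p = len(query) // w
    reduced = eps / np.sqrt(p)
    # feature point of every window of the data sequence (the trail)
    trail = [dft_features(data[off:off + w], n_coeffs)
             for off in range(len(data) - w + 1)]
    candidates = set()
    for i in range(p):
        q_pt = dft_features(query[i * w:(i + 1) * w], n_coeffs)
        for off, pt in enumerate(trail):
            if np.linalg.norm(pt - q_pt) <= reduced:   # piece i matches here
                start = off - i * w                    # implied start of full query
                if 0 <= start <= len(data) - len(query):
                    candidates.add(start)
    return sorted(off for off in candidates
                  if np.linalg.norm(data[off:off + len(query)] - query) <= eps)
```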

  26. Experimental results

  27. Experimental results • Stock-price database with ~300,000 points • Each point is a 4-byte number • DFT keeping the first 3 coefficients (6 real values, since each complex coefficient has a real and an imaginary part) • Window length w = 512 • R*-tree index

  28. Experimental results • Space • Naïve methods: 24 MB • This method: 5 KB • Time, "short" queries (Len(Q) = w): 3 to 100 times better response times • Time, "long" queries (Len(Q) > w): 10 to 100 times better response times
