
Fast Subsequence Matching in Time-Series Databases

Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos. Department of Computer Science and ISR, University of Maryland at College Park. Presented by Rui Li.



  1. Fast Subsequence Matching in Time-Series Databases • Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos • Department of Computer Science and ISR, University of Maryland at College Park • Presented by Rui Li

  2. Abstract • Goal: To find an efficient indexing method to locate subsequences in a time-series database that match a query pattern within a given tolerance • Main Idea: • Map each time series into a small set of multidimensional rectangles in feature space • The rectangles can be readily indexed using traditional spatial access methods, e.g., the R*-tree

  3. Introduction • Hot Problem: Searching similar patterns in time-series databases • Applications: • financial, marketing and production time series, e.g. stock prices • scientific databases, e.g. weather, geological, environmental data

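  4. Introduction (cont.) • Similarity Queries: • Whole Matching: the query and the stored time series have the same length • Subsequence Matching: the query matches only part of a stored time series (partial matching); report the time series along with the matching offset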

  5. Introduction (cont.) • Whole Matching (Previous Work) • Use a distance-preserving transform (e.g., DFT) to extract f features from each time series (e.g., the first f DFT coefficients), and map the series to a point in the f-dimensional feature space • A spatial access method (e.g., an R*-tree) can then be used to answer approximate (range) queries
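
The feature-extraction step for whole matching can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `dft_features` is hypothetical, the DFT is computed naively, and it is scaled by 1/sqrt(n) (an assumption) so that distances in feature space never exceed distances in the time domain.

```python
import cmath
import math

def dft_features(x, f):
    """Return the first f DFT coefficients of x as 2*f real numbers.

    Assumption: orthonormal scaling (1/sqrt(n)), so the Euclidean distance
    between feature vectors lower-bounds the distance between the series.
    """
    n = len(x)
    feats = []
    for k in range(f):
        c = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        c /= math.sqrt(n)
        feats.extend([c.real, c.imag])
    return feats

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 2.0, 4.0, 3.0]
fa, fb = dft_features(a, 2), dft_features(b, 2)

# The feature-space distance never exceeds the true distance.
print(math.dist(fa, fb) <= math.dist(a, b))  # -> True
```

Each series becomes one point with 2*f real coordinates (real and imaginary parts), which is what a spatial access method would index.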

  6. Introduction (cont.) • Subsequence Matching (Goal) • Map time series into rectangles in feature space • Spatial access methods as the eventual indexing mechanism

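  7. Background • To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the following formula: $D_{feature}(F(O_1), F(O_2)) \le D(O_1, O_2)$ • Parseval Theorem: $\sum_{t=0}^{n-1} |x_t|^2 = \sum_{k=0}^{n-1} |X_k|^2$, where $X$ is the DFT of the sequence $x$ • The DFT preserves the Euclidean distance between two time series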
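
A short derivation shows why keeping only the first f DFT coefficients cannot cause false dismissals. It assumes the orthonormal DFT (scaled by $1/\sqrt{n}$) and writes $X_k$, $Y_k$ for the coefficients of two length-$n$ series $x$, $y$; the symbols are chosen here for illustration.

```latex
\begin{aligned}
D^2(x, y) &= \sum_{t=0}^{n-1} |x_t - y_t|^2
           = \sum_{k=0}^{n-1} |X_k - Y_k|^2
           && \text{(Parseval, using linearity of the DFT)} \\
          &\ge \sum_{k=0}^{f-1} |X_k - Y_k|^2
           = D_{feature}^2\bigl(F(x), F(y)\bigr)
           && \text{(dropping non-negative terms)}
\end{aligned}
```

So any pair within tolerance ε in the time domain is also within ε in feature space; the index may return extra candidates, but it cannot miss a true match.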

  8. Proposed Method • Mapping each time series to a trail in feature space • Use a sliding window of size w and place it at every possible offset • For each such placement of the window, extract the features of the subsequence inside the window • A time series of length L is mapped to a trail in feature space, consisting of L-w+1 points: one point for each offset
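
The sliding-window mapping above can be sketched as follows. The helper names `dft_features` and `trail` are hypothetical, and the naive orthonormal DFT is an assumption made for the sake of a self-contained example.

```python
import cmath
import math

def dft_features(window, f):
    """First f orthonormal DFT coefficients as 2*f real numbers (assumed scaling)."""
    n = len(window)
    feats = []
    for k in range(f):
        c = sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        c /= math.sqrt(n)
        feats.extend([c.real, c.imag])
    return feats

def trail(series, w, f):
    """Place a window of size w at every offset and extract its features."""
    return [dft_features(series[t:t + w], f) for t in range(len(series) - w + 1)]

series = [math.sin(0.3 * t) + 0.05 * t for t in range(20)]
points = trail(series, w=8, f=2)
print(len(points))  # -> 13, i.e. L - w + 1 with L = 20 and w = 8
```

Consecutive points of the trail come from windows that share all but one sample, which is why the trail tends to move smoothly through feature space.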

  9. Example 1

  10. Example 2 • (a) a sample stock-price time series • (b) its trail in the feature space of the 0-th and 1-st DFT coefficients • (c) its trail in the feature space of the 1-st and 2-nd DFT coefficients

  11. Proposed Method (cont.) • Indexing the trails • Simply storing the individual points of the trail in an R*-tree is inefficient • Exploit the fact that successive points of the trail tend to be similar, i.e., the contents of the sliding window in nearby offsets tend to be similar • Divide the trail into sub-trails and represent each of them with its minimum bounding (hyper)-rectangle (MBR) • Store only a few MBRs

  12. Proposed Method (cont.) • Indexing the trails (cont.) • Can guarantee ‘no false dismissals’: when a query arrives, all the MBRs that intersect the query region are retrieved, i.e., all the qualifying sub-trails are retrieved, plus some false alarms
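
The MBR test behind this guarantee can be sketched as follows. The function names are hypothetical; the query region is taken to be a sphere of radius `eps` around the query point `q`, and an MBR qualifies when its closest point to `q` lies within that radius.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a sub-trail, as (low, high) corner lists."""
    dims = range(len(points[0]))
    low = [min(p[d] for p in points) for d in dims]
    high = [max(p[d] for p in points) for d in dims]
    return low, high

def min_dist(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, low, high)))

def intersects(q, eps, low, high):
    return min_dist(q, low, high) <= eps

sub_trail = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
low, high = mbr(sub_trail)

q, eps = (2.0, -0.5), 0.6
print(intersects(q, eps, low, high))                   # -> True (MBR is retrieved)
print(any(math.dist(q, p) <= eps for p in sub_trail))  # -> False (a false alarm)
```

The second print illustrates a false alarm: the rectangle touches the query sphere although no point of the sub-trail does. The reverse cannot happen, because every trail point lies inside its MBR.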

  13. Return to Example 1 (figure, with the tolerance ε marked)

  14. Proposed Method (cont.) • Indexing the trails (cont.) • Map a time series into a set of rectangles in feature space • Each MBR corresponds to a sub-trail

  15. Proposed Method (cont.) • For each MBR we have to store • $t_{start}$, $t_{end}$, which are the offsets of the first and last window positionings covered by the sub-trail • A unique identifier for each time series • The extent of the MBR in each dimension, i.e., $(F_{1,low}, F_{1,high}, F_{2,low}, F_{2,high}, \ldots)$ • Store the MBRs in an R*-tree • Recursively group the MBRs into parent MBRs, grandparent MBRs, etc.
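
One possible in-memory layout for such an entry is sketched below. The class and field names (`LeafEntry`, `series_id`, `t_start`, `t_end`, `low`, `high`) are assumptions for illustration, and `parent_mbr` only shows how a parent rectangle encloses its children, not how an R*-tree chooses the grouping.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LeafEntry:
    series_id: int      # unique identifier of the time series
    t_start: int        # offset of the first window position in the sub-trail
    t_end: int          # offset of the last window position in the sub-trail
    low: List[float]    # lower extent of the MBR in each feature dimension
    high: List[float]   # upper extent of the MBR in each feature dimension

def parent_mbr(entries):
    """Smallest rectangle enclosing the MBRs of all child entries."""
    dims = range(len(entries[0].low))
    low = [min(e.low[d] for e in entries) for d in dims]
    high = [max(e.high[d] for e in entries) for d in dims]
    return low, high

e1 = LeafEntry(series_id=7, t_start=0, t_end=3, low=[0.0, 1.0], high=[2.0, 3.0])
e2 = LeafEntry(series_id=7, t_start=4, t_end=9, low=[1.0, 0.5], high=[4.0, 2.0])
print(parent_mbr([e1, e2]))  # -> ([0.0, 0.5], [4.0, 3.0])
```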

  16. Example1 (cont.) • assuming a fan-out of 4

  17. Proposed Method (cont.) • The structure of a leaf node and a non-leaf node

  18. Proposed Method (cont.) • Two questions • Insertions: when a new time series is inserted, what is a good way to divide its trail into sub-trails • Queries: how to handle queries, especially the ones that are longer than the sliding window

  19. Proposed Method (cont.) • Insertion – Dividing trails into sub-trails • Goal: Optimal division so that the number of disk accesses is minimized

  20. Example 3: fixed heuristic vs. adaptive heuristic

  21. Proposed Method (cont.) • Insertion (cont.) • Group trail-points into sub-trails by means of an adaptive heuristic • Based on a greedy algorithm, using a cost function to estimate the number of disk accesses for each of the options

  22. Proposed Method (cont.) • Insertion (cont.) • The cost function: $DA(L) = \prod_{i=1}^{n} (L_i + 0.5)$, where $L = (L_1, \ldots, L_n)$ are the sides of the n-dimensional MBR of a node in an R-tree • The marginal cost of each point: $mc = DA(L) / k$, where k is the number of points in this MBR
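
A small worked example may help. It assumes the cost estimate has the form $DA(L) = \prod_i (L_i + 0.5)$ with marginal cost $mc = DA(L)/k$, as in the original paper; the numbers are invented for illustration.

```latex
\begin{aligned}
\text{MBR with sides } L &= (0.5,\ 1.5) \text{ holding } k = 4 \text{ points:} \\
DA(L) &= (0.5 + 0.5)(1.5 + 0.5) = 1.0 \times 2.0 = 2.0 \\
mc &= DA(L) / k = 2.0 / 4 = 0.5 \\[4pt]
\text{Single point, } L &= (0,\ 0),\ k = 1: \\
DA(L) &= 0.5 \times 0.5 = 0.25, \qquad mc = 0.25
\end{aligned}
```

A point that enlarges the MBR only slightly lowers the marginal cost, because the cost is shared among more points; a distant point inflates the product faster than k grows.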

  23. Proposed Method (cont.) • Insertion (cont.) • Algorithm: • Assign the first point of the trail to a sub-trail (this would be a predefined small MBR) • FOR each successive point: IF it increases the marginal cost of the current sub-trail THEN start a new sub-trail, ELSE include it in the current sub-trail
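
The greedy loop above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the cost is taken to be the product of (side + 0.5) over all dimensions, the first point starts with a zero-extent MBR, the trail is non-empty, and the names `disk_accesses` and `split_trail` are hypothetical.

```python
def disk_accesses(low, high):
    """Estimated disk accesses for an MBR: product of (side + 0.5). Assumed form."""
    da = 1.0
    for lo, hi in zip(low, high):
        da *= (hi - lo) + 0.5
    return da

def split_trail(points):
    """Greedily group consecutive trail points into sub-trails."""
    subtrails = []
    current = [points[0]]
    low, high = list(points[0]), list(points[0])
    for p in points[1:]:
        new_low = [min(a, b) for a, b in zip(low, p)]
        new_high = [max(a, b) for a, b in zip(high, p)]
        mc_now = disk_accesses(low, high) / len(current)
        mc_new = disk_accesses(new_low, new_high) / (len(current) + 1)
        if mc_new > mc_now:
            # The point would raise the marginal cost: close this sub-trail.
            subtrails.append(current)
            current, low, high = [p], list(p), list(p)
        else:
            current.append(p)
            low, high = new_low, new_high
    subtrails.append(current)
    return subtrails

pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
print([len(s) for s in split_trail(pts)])  # -> [3, 2]
```

In the example the jump to (5.0, 5.0) would raise the marginal cost from 0.14 to about 7.56, so a new sub-trail starts there.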

  24. Proposed Method (cont.) • Insertion (cont.) • The algorithm may not work well under certain circumstances • The algorithm’s goal is to minimize the size of each MBR, so why not use clustering techniques instead?

  25. Proposed Method (cont.) • Searching: Queries longer than w • First, the base case Len(Q)=w; the search algorithm proceeds as follows: • Map Q to a point q in the feature space; the query corresponds to a sphere with center q and radius ε • Retrieve the sub-trails whose MBRs intersect the query region • Examine the corresponding time series, and discard the false alarms
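
The three steps for Len(Q)=w can be sketched end to end as follows. Several simplifications are assumptions of this sketch: a linear scan over the entries stands in for the R*-tree, sub-trails are cut into fixed-size chunks, offsets are zero-based, and all function names are hypothetical.

```python
import cmath
import math

def dft_features(window, f):
    """First f orthonormal DFT coefficients as 2*f real numbers (assumed scaling)."""
    n = len(window)
    feats = []
    for k in range(f):
        c = sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        c /= math.sqrt(n)
        feats.extend([c.real, c.imag])
    return feats

def build_index(series_by_id, w, f, chunk):
    """One entry per sub-trail: (series_id, t_start, t_end, low, high)."""
    entries = []
    for sid, s in series_by_id.items():
        pts = [dft_features(s[t:t + w], f) for t in range(len(s) - w + 1)]
        for start in range(0, len(pts), chunk):
            sub = pts[start:start + chunk]
            low = [min(c) for c in zip(*sub)]
            high = [max(c) for c in zip(*sub)]
            entries.append((sid, start, start + len(sub) - 1, low, high))
    return entries

def search(entries, series_by_id, q, eps, f):
    """Return (series_id, offset) pairs whose window is within eps of q."""
    w = len(q)
    qf = dft_features(q, f)  # step 1: map Q to a point in feature space
    hits = []
    for sid, t0, t1, low, high in entries:
        gap = math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                            for x, l, h in zip(qf, low, high)))
        if gap > eps:
            continue  # step 2: this MBR misses the query sphere
        s = series_by_id[sid]
        for t in range(t0, t1 + 1):  # step 3: discard false alarms
            if math.dist(s[t:t + w], q) <= eps:
                hits.append((sid, t))
    return hits

db = {0: [math.sin(0.2 * t) for t in range(40)]}
index = build_index(db, w=8, f=2, chunk=5)
hits = search(index, db, db[0][10:18], eps=1e-6, f=2)
print((0, 10) in hits)  # -> True
```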

  26. Proposed Method (cont.) • Searching (cont.) • If Len(Q)>w, consider the following Lemma: • Consider two sequences Q and S of the same length Len(Q)=Len(S)=p*w • Consider their p disjoint subsequences $q_i = Q[i \cdot w + 1 : (i+1) \cdot w]$ and $s_i = S[i \cdot w + 1 : (i+1) \cdot w]$, where $0 \le i \le p-1$ • If Q and S agree within tolerance ε, then at least one of the pairs of corresponding subsequences agrees within tolerance $\varepsilon/\sqrt{p}$
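
The lemma follows from a short argument by contradiction. The sketch below assumes Euclidean distance $D$, writes $q_i$, $s_i$ for the $i$-th disjoint length-$w$ pieces of $Q$ and $S$, and takes the per-piece tolerance to be $\varepsilon/\sqrt{p}$.

```latex
\begin{aligned}
&\text{Suppose every pair disagrees: } D(q_i, s_i) > \varepsilon/\sqrt{p}
  \quad \text{for all } 0 \le i \le p-1. \\
&\text{The pieces are disjoint and cover } Q \text{ and } S, \text{ so} \\
&\qquad D^2(Q, S) = \sum_{i=0}^{p-1} D^2(q_i, s_i)
  > p \cdot \frac{\varepsilon^2}{p} = \varepsilon^2, \\
&\text{which contradicts } D(Q, S) \le \varepsilon. \\
&\text{Hence } D(q_i, s_i) \le \varepsilon/\sqrt{p} \text{ for at least one } i.
\end{aligned}
```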

  27. Proposed Method (cont.) • Searching (cont.) • If Len(Q)>w, the search algorithm proceeds as follows: • The query time series Q is broken into p sub-queries, which correspond to p spheres in the feature space with radius $\varepsilon/\sqrt{p}$ • Retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions • Examine the corresponding subsequences of the time series, and discard the false alarms
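
The query-splitting step can be sketched as follows, together with a numeric check that some piece falls within the reduced radius. The function name `split_query` is hypothetical, the per-piece radius is assumed to be eps / sqrt(p), and the sketch assumes Len(Q) >= w and ignores any trailing samples beyond the last full piece when probing the index.

```python
import math

def split_query(q, w, eps):
    """Break q into p = len(q) // w disjoint pieces; return (pieces, radius)."""
    p = len(q) // w
    pieces = [q[i * w:(i + 1) * w] for i in range(p)]
    return pieces, eps / math.sqrt(p)

w = 4
q = [float(t) for t in range(12)]
# A stored sequence that differs from q by a different amount in each piece.
s = [x + d for x, d in zip(q, [0.1] * 4 + [0.3] * 4 + [0.2] * 4)]

eps = math.dist(q, s)  # q and s agree within exactly this tolerance
q_pieces, radius = split_query(q, w, eps)
s_pieces, _ = split_query(s, w, eps)

piece_dists = [math.dist(a, b) for a, b in zip(q_pieces, s_pieces)]
print(min(piece_dists) <= radius)  # -> True
```

When piece i of the query matches a stored window at offset t, the candidate start of the full match is t - i*w; the final check compares the whole query against the series at that offset to discard false alarms.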

  28. Experiments • Experiments were run on a stock-price database of 329,000 points • Only the first 3 frequencies of the DFT are used; thus the feature space has 6 dimensions (real and imaginary parts of each retained DFT coefficient) • Sliding window size w=512

  29. Experiments (cont.) • Query time series were generated by taking random offsets into the time series and obtaining subsequences of length Len(Q) from those offsets
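
Query generation of this kind can be sketched as follows. The function name, the fixed seed, and the toy series are assumptions for illustration; the source does not say how the random offsets were drawn.

```python
import random

def random_queries(series, query_len, count, seed=0):
    """Draw `count` subsequences of length query_len at random offsets."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    last = len(series) - query_len  # largest valid starting offset
    offsets = [rng.randint(0, last) for _ in range(count)]
    return [(t, series[t:t + query_len]) for t in offsets]

series = [float(t % 17) for t in range(1000)]
queries = random_queries(series, query_len=64, count=5)
print(all(len(q) == 64 for _, q in queries))  # -> True
```

Because every query is cut from the stored data, each one has at least one exact match, at its own offset.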

  30. Experiments (cont.) • Four groups of experiments were carried out • Comparison of the proposed method against the method that has sub-trails with only one point each • Experiments to compare the response time • Experiments with queries longer than w • Experiments with larger databases

  31. Related Works (citations) • Continuous queries over data streams • Similarity indexing with M-tree/SS-tree, etc. • Efficient time series matching by wavelets • Fast similarity search in the presence of noise, scaling, and translation in time-series databases

  32. Thank you!
