Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University of Maryland at College Park Presented by Rui Li
Abstract • Goal: To find an efficient indexing method to locate time series in a database • Main Idea: • Map each time series into a small set of multidimensional rectangles in feature space • Rectangles can be readily indexed using traditional spatial access methods, e.g., R*-tree
Introduction • Hot Problem: Searching for similar patterns in time-series databases • Applications: • financial, marketing and production time series, e.g., stock prices • scientific databases, e.g., weather, geological, environmental data
Introduction (cont.) • Similarity Queries: • Whole Matching • Subsequence Matching • partial matching • report time series along with offset
Introduction (cont.) • Whole Matching (Previous Work) • Use a distance-preserving transform (e.g., DFT) to extract f features from time series (e.g., the first f DFT coefficients), and then map them into points in the f-dimensional feature space • Spatial access method (e.g., R*-trees) can be used to search for approximate queries
Introduction (cont.) • Subsequence Matching (Goal) • Map time series into rectangles in feature space • Spatial access methods as the eventual indexing mechanism
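Background • To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the lower-bounding condition: D_feature(F(O1), F(O2)) ≤ D(O1, O2) • Parseval's Theorem: Σ_i |x_i|² = Σ_f |X_f|², where X is the DFT of the sequence x • The DFT therefore preserves the Euclidean distance between two time series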
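The lower-bounding property can be checked numerically. The following is a minimal sketch, assuming an orthonormal DFT (scaled by 1/√n) and two made-up sequences; the helper names `dft`, `features` and `euclidean` are illustrative and do not come from the slides.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT (scaled by 1/sqrt(n)), so Parseval's theorem holds with equality."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]

def euclidean(a, b):
    """Euclidean distance; abs() handles both real and complex entries."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

def features(x, f):
    """Keep only the first f DFT coefficients (hypothetical helper)."""
    return dft(x)[:f]

# Two made-up sequences of equal length (illustrative data only).
s1 = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]
s2 = [1.5, 2.5, 3.0, 3.5, 4.0, 5.0, 6.5, 6.0]

d_time = euclidean(s1, s2)                            # distance in the time domain
d_full = euclidean(dft(s1), dft(s2))                  # distance over all coefficients
d_feat = euclidean(features(s1, 2), features(s2, 2))  # distance in feature space

print(d_time, d_full, d_feat)
```

Because dropping coefficients can only shrink the distance, a range query in feature space may return false alarms but never misses a qualifying sequence.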
Proposed Method • Mapping each time series to a trail in feature space • Use a sliding window of size w and place it at every possible offset • For each such placement of the window, extract the features of the subsequence inside the window • A time series of length L is mapped to a trail in feature space, consisting of L-w+1 points: one point for each offset
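The sliding-window mapping can be sketched as follows. This is an illustration only: the function names, the synthetic series, and the choice to store each retained DFT coefficient as a (real, imaginary) pair are assumptions.

```python
import cmath
import math

def window_features(window, f):
    """Real and imaginary parts of the first f orthonormal DFT coefficients of one window."""
    n = len(window)
    feats = []
    for k in range(f):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        feats.extend([c.real, c.imag])
    return tuple(feats)

def trail(series, w, f):
    """Place a window of size w at every offset; each placement yields one feature point."""
    return [window_features(series[i:i + w], f) for i in range(len(series) - w + 1)]

# Synthetic series of length L = 40 (illustrative data only).
series = [math.sin(0.3 * t) + 0.05 * t for t in range(40)]
points = trail(series, w=8, f=2)

print(len(points))     # number of trail points, L - w + 1
print(len(points[0]))  # dimensionality of the feature space, 2 * f
```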
Example2 (a) a sample stock-price time series (b) its trail in the feature space of the 0-th and 1-st DFT coefficients (c) its trail of the 1-st and 2-nd DFT coefficients
Proposed Method (cont.) • Indexing the trails • Simply storing the individual points of the trail in an R*-tree is inefficient • Exploit the fact that successive points of the trail tend to be similar, i.e., the contents of the sliding window in nearby offsets tend to be similar • Divide the trail into sub-trails and represent each of them with its minimum bounding (hyper)-rectangle (MBR) • Store only a few MBRs
Proposed Method (cont.) • Indexing the trails (cont.) • Can guarantee ‘no false dismissals’: when a query arrives, all the MBRs that intersect the query region are retrieved, i.e., all the qualifying sub-trails are retrieved, plus some false alarms
Proposed Method (cont.) • Indexing the trails (cont.) • Map a time series into a set of rectangles in feature space • Each MBR corresponds to a sub-trail
Proposed Method (cont.) • For each MBR we have to store • t_start and t_end, which are the offsets of the first and last window positionings covered by the sub-trail • A unique identifier for each time series • The extent of the MBR in each dimension, i.e., (F1_low, F1_high, F2_low, F2_high, …) • Store the MBRs in an R*-tree • Recursively group the MBRs into parent MBRs, grandparent MBRs, etc.
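One possible in-memory layout for such an entry is sketched below. The class name `SubTrailMBR`, its field names, and the helper `make_mbr` are hypothetical and chosen for illustration; the slides do not prescribe a concrete record format.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class SubTrailMBR:
    series_id: int            # unique identifier of the time series
    t_start: int              # offset of the first window positioning in the sub-trail
    t_end: int                # offset of the last window positioning in the sub-trail
    low: Tuple[float, ...]    # lower bound of the MBR in each feature dimension
    high: Tuple[float, ...]   # upper bound of the MBR in each feature dimension

def make_mbr(series_id: int, t_start: int, points: Sequence[Tuple[float, ...]]) -> SubTrailMBR:
    """Bound a run of consecutive trail points with one rectangle."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return SubTrailMBR(series_id, t_start, t_start + len(points) - 1, low, high)

# Three consecutive 2-dimensional trail points starting at offset 10 (made-up values).
entry = make_mbr(series_id=7, t_start=10, points=[(0.1, 0.5), (0.3, 0.4), (0.2, 0.7)])
print(entry)
```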
Example1 (cont.) • assuming a fan-out of 4
Proposed Method (cont.) • The structure of a leaf node and a non-leaf node
Proposed Method (cont.) • Two questions • Insertions: when a new time series is inserted, what is a good way to divide its trail into sub-trails • Queries: how to handle queries, especially the ones that are longer than the sliding window
Proposed Method (cont.) • Insertion – Dividing trails into sub-trails • Goal: Optimal division so that the number of disk accesses is minimized
Proposed Method (cont.) • Insertion (cont.) • Group trail-points into sub-trails by means of an adaptive heuristic • Based on a greedy algorithm, using a cost function to estimate the number of disk accesses for each of the options
Proposed Method (cont.) • Insertion (cont.) • The cost function: DA(L) = ∏_{i=1..n} (L_i + 0.5), where L = (L_1, L_2, …, L_n) are the sides of the n-dimensional MBR of a node in an R-tree • The marginal cost of each point: mc = DA(L) / k, where k is the number of points in this MBR
Proposed Method (cont.) • Insertion (cont.) • Algorithm: • Assign the first point of the trail to a sub-trail (which would be a predefined small MBR) • FOR each successive point: IF it increases the marginal cost of the current sub-trail THEN start a new sub-trail, ELSE include it in the current sub-trail
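A minimal sketch of this greedy division follows. It assumes the cost estimate ∏(L_i + 0.5) divided by the number of points as the marginal cost, feature values already scaled to the unit hypercube, and a degenerate single-point MBR in place of the "predefined small MBR"; the function names and sample points are illustrative.

```python
from math import prod

def marginal_cost(points):
    """Estimated disk accesses of the MBR around `points`, divided by the point count."""
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    return prod(side + 0.5 for side in sides) / len(points)

def divide_trail(points):
    """Greedy pass: open a new sub-trail whenever adding a point raises the marginal cost."""
    subtrails = [[points[0]]]
    for p in points[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [p]) > marginal_cost(current):
            subtrails.append([p])   # the point is too far away: start a new sub-trail
        else:
            current.append(p)       # the point is cheap to absorb: extend the sub-trail
    return subtrails

# A made-up 2-dimensional trail with a jump between the third and fourth points.
trail_points = [(0.10, 0.10), (0.11, 0.12), (0.12, 0.11),
                (0.90, 0.85), (0.91, 0.86), (0.92, 0.88)]
groups = divide_trail(trail_points)
print([len(g) for g in groups])
```

On this input the jump in the trail raises the marginal cost, so the points split into two sub-trails of three points each.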
Proposed Method (cont.) • Insertion (cont.) • The algorithm may not work well under certain circumstances • The algorithm's goal is to minimize the size of each MBR, so why not use clustering techniques?
Proposed Method (cont.) • Searching – Queries longer than w • If Len(Q)=w, the search algorithm proceeds as follows: • Map Q to a point q in the feature space; the query corresponds to a sphere with center q and radius ε • Retrieve the sub-trails whose MBRs intersect the query region • Examine the corresponding time series, and discard the false alarms
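The retrieval step amounts to a sphere-rectangle intersection test. The sketch below uses a linear scan over a handful of made-up MBRs instead of an R*-tree, purely to show the geometric test; the function names and data are assumptions.

```python
import math

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the axis-aligned rectangle [low, high]."""
    total = 0.0
    for x, lo, hi in zip(q, low, high):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
        # if lo <= x <= hi, this dimension contributes nothing
    return math.sqrt(total)

def candidate_mbrs(q, eps, mbrs):
    """Names of all MBRs intersecting the sphere with center q and radius eps."""
    return [name for name, low, high in mbrs if min_dist_to_mbr(q, low, high) <= eps]

# (name, low corner, high corner) for three illustrative sub-trail MBRs.
mbrs = [
    ("A", (0.00, 0.00), (0.20, 0.20)),
    ("B", (0.50, 0.50), (0.70, 0.70)),
    ("C", (0.25, 0.00), (0.40, 0.10)),
]
hits = candidate_mbrs((0.22, 0.10), 0.05, mbrs)
print(hits)
```

Every returned MBR is only a candidate: the sub-trail it bounds still has to be checked against the actual time series to discard false alarms.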
Proposed Method (cont.) • Searching (cont.) • If Len(Q)>w, consider the following Lemma: • Consider two sequences Q and S of the same length Len(Q)=Len(S)=p*w • Consider their p disjoint subsequences q_i = Q[i*w+1 : (i+1)*w] and s_i = S[i*w+1 : (i+1)*w], where 0 ≤ i ≤ p-1 • If Q and S agree within tolerance ε, then at least one of the pairs (q_i, s_i) of corresponding subsequences agrees within tolerance ε/√p
Proposed Method (cont.) • Searching (cont.) • If Len(Q)>w, the search algorithm proceeds as follows: • The query time series Q is broken into p sub-queries, which correspond to p spheres in the feature space with radius ε/√p • Retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions • Examine the corresponding subsequences of the time series, and discard the false alarms
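The reduced radius comes from splitting the squared distance across the p pieces: if the whole sequences are within ε, the closest pair of pieces must be within ε/√p. The sketch below checks this on made-up sequences; the names `pieces` and `subquery_radius` are illustrative.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length real sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pieces(seq, w):
    """Split a sequence of length p*w into p disjoint pieces of length w."""
    return [seq[i:i + w] for i in range(0, len(seq), w)]

def subquery_radius(eps, p):
    """Search radius used for each of the p sub-queries."""
    return eps / math.sqrt(p)

w = 4
# Illustrative query Q and data sequence S, both of length p*w = 12.
Q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 8.0, 7.0, 6.0]
S = [1.1, 2.0, 2.9, 4.2, 5.5, 6.0, 6.5, 8.0, 9.0, 8.3, 7.0, 5.8]

p = len(Q) // w
eps = dist(Q, S)  # tightest tolerance under which Q and S still match
piece_dists = [dist(q, s) for q, s in zip(pieces(Q, w), pieces(S, w))]

print(piece_dists, subquery_radius(eps, p))
```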
Experiments • Experiments were run on a stock-price database of 329,000 points • Only the first 3 frequencies of the DFT are used; thus the feature space has 6 dimensions (real and imaginary parts of each retained DFT coefficient) • Sliding window size w=512
Experiments (cont.) • Query time series were generated by taking random offsets into the time series and obtaining subsequences of length Len(Q) from those offsets
Experiments (cont.) • Four groups of experiments were carried out • Comparison of the proposed method against the method that has sub-trails with only one point each • Experiments to compare the response time • Experiments with queries longer than w • Experiments with larger databases
Related Works (citations) • Continuous queries over data streams • Similarity indexing with M-tree/SS-tree, etc. • Efficient time series matching by wavelets • Fast similarity search in the presence of noise, scaling, and translation in time-series databases