
Subsequence Matching in Time Series Databases



  1. Subsequence Matching in Time Series Databases Xiaojin Xu 04-25-2006

  2. Papers • Online Event driven Subsequence Matching over Financial Data Streams • Huanmei Wu, Betty Salzberg, Donghui Zhang • Fast Subsequence Matching in Time-Series Databases • C. Faloutsos, M. Ranganathan, Y. Manolopoulos

  3. Challenges of Subsequence Matching over Financial Data Streams • Existing techniques of subsequence matching • Mainly focus on discovering the similarity between an online query subsequence and a traditional database • Queried data are static • Subsequence similarity over financial data streams • Data change constantly, so a single-pass search is required • Movement can be predicted by observing a repetitive pattern of waves (zigzag shapes) • The relative position of the upper and lower end points is important in subsequence similarity • Subsequence similarity should be flexible with regard to time shifting and scaling, amplitude rescaling, etc.

  4. Our online event-driven subsequence matching meets the requirements of financial data analysis • Database is a dynamic stream database which stores recent financial data. • 3-tier online segmentation and pruning • Similarity measure: distance function is defined based on a permutation of the subsequence • Event-driven matching over an up-to-date database: query will be carried out only when there is a new end point • A new definition of trend for financial data stream

  5. Processing Online Data Stream • Translating massive data streams into manageable data for database before matching • Aggregation and Smoothing • Piecewise linear representation • Online segmentation and pruning

  6. Aggregation and Smoothing • One unique value for each time instance over a fixed time interval • Use a p-interval moving average to filter out noise and generate a clean trend signal: MA(i) = ( X(i−p+1) + … + X(i) ) / p • X(i) is the value at time i, for i = 1, 2, ..., n • n is the number of periods
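As a concrete illustration of the smoothing step, here is a minimal Python sketch of a p-interval moving average (the function name and the handling of the first p−1 points, where no full window exists yet, are my own choices, not from the paper):

```python
def moving_average(x, p):
    """p-interval simple moving average of a list of values.

    For the first p-1 positions there is no full window, so we average
    over the points seen so far (one common convention; an assumption).
    """
    out = []
    for i in range(len(x)):
        window = x[max(0, i - p + 1): i + 1]
        out.append(sum(window) / len(window))
    return out
```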

  7. Piecewise Linear Representation (PLR) • Segment over Bollinger Band Percent (%b) • %b indicator: middle_band = p-period moving average; upper_band = middle_band + 2 × p-period standard deviation; lower_band = middle_band − 2 × p-period standard deviation; %b = (close price − lower_band) / (upper_band − lower_band) • Advantages of the %b indicator • Smoothed moving trend similar to the price movement • Normalized value of the real price • Sensitive to price changes
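The %b computation can be sketched directly from the slide's formulas (whether sample or population standard deviation is used is not stated, so that choice is an assumption):

```python
import statistics

def percent_b(prices, p):
    """Bollinger %b of the latest price over the last p periods.

    middle band = p-period moving average; upper/lower bands sit two
    standard deviations above/below it.
    """
    window = prices[-p:]
    middle = sum(window) / p
    sd = statistics.pstdev(window)  # population std dev (an assumption)
    upper = middle + 2 * sd
    lower = middle - 2 * sd
    return (prices[-1] - lower) / (upper - lower)
```

A price sitting exactly on the middle band yields %b = 0.5; on the upper band, %b = 1.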

  8. Segmentation • Use a sliding window which • Can contain at most m points • Begins after the last identified end point and ends right before the current point • Contains only the last m points if there are more than m • Segmentation over %b finds a possible upper or lower end point in the current sliding window • If the current point is Pj(Xj,tj), an upper point Pi(Xi,ti) is a point in the sliding window that satisfies: 1. Xi = max( X values of the current sliding window ) 2. Xi > Xj + δ (δ is a given error threshold) 3. Pi(Xi,ti) is the last point satisfying the above two conditions
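A minimal sketch of the upper-end-point test above, assuming the window is a non-empty list of (value, time) pairs (names are illustrative, not from the paper):

```python
def find_upper_point(window, current, delta):
    """Return the last point in `window` satisfying the slide's three
    conditions for an upper end point, or None if no point qualifies.

    window  -- list of (x, t) pairs in the current sliding window
    current -- the current point (xj, tj)
    delta   -- the error threshold
    """
    xj, _ = current
    x_max = max(x for x, _ in window)
    candidate = None
    for x, t in window:
        if x == x_max and x > xj + delta:  # conditions 1 and 2
            candidate = (x, t)             # keep the last match (condition 3)
    return candidate
```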

  9. Segmentation (Cont’d)

  10. Pruning • Purpose — smoothing over recently identified end points • Two steps • Filter: pruning on %b • Refinement: pruning on the raw data stream • Pruning rule — if the absolute %b or raw data values of two adjacent end points differ by less than a certain threshold, that line segment is removed.
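The pruning rule is a simple threshold check that can be applied once to %b values (filter) and once to raw values (refinement); a sketch:

```python
def should_prune(end_point_values, threshold):
    """Return True when the last line segment should be removed, i.e.
    when the values of its two adjacent end points differ by less than
    the given threshold. `end_point_values` may hold %b values (filter
    step) or raw data values (refinement step).
    """
    if len(end_point_values) < 2:
        return False
    return abs(end_point_values[-1] - end_point_values[-2]) < threshold
```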

  11. Pruning (Cont’d)

  12. Online segmentation and pruning • Whenever an upper/lower point is identified, the previous line segment is checked for pruning • First check the need for pruning on %b • If pruning on %b occurs, no pruning on raw data is done; the system waits for the next stream data to come in • If no pruning on %b is done, the same line segment is checked for pruning on raw data • Which point is kept after pruning? • Compare the last end point with the third-to-last end point. If they are upper points, the one with the larger value is kept; otherwise, keep the point with the smaller value.

  13. Online segmentation and pruning

  14. Online segmentation and pruning • Strategy for identifying end points • a smaller threshold δs for segmentation over %b, to ensure sensitivity and reduce delay • a larger threshold δpb for pruning over %b, to filter out noise • a separate threshold δpd for pruning over the raw stream data • Online segmentation and pruning run simultaneously • At most three end points need to be kept for the segmentation and pruning procedure • All fixed end points are updated into the database in real time

  15. Permutation • Subsequence matching • Find the subsequence of end points that is similar to the query subsequence • Permutation • The stream of end points S = {(X1,t1), (X2,t2), …, (Xn,tn)}, divided into two subsets of upper and lower end points respectively, gives S’ • S’ = {[(X1,t1), (X3,t3), …, (Xn−1,tn−1)], [(X2,t2), (X4,t4), …, (Xn,tn)]} • Sorting the X values of each subset gives S” • S” = {[Xi1, Xi3, …, Xin−1], [Xi2, Xi4, …, Xin]} where Xi1 ≤ Xi3 ≤ … ≤ Xin−1 and Xi2 ≤ Xi4 ≤ … ≤ Xin • {i1, i3, …, in−1, i2, i4, …, in} is the permutation of S
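A sketch of the permutation computation. It assumes, as on the slide, that end points alternate and odd 1-based positions are the upper subset and even positions the lower subset (which subset comes first is arbitrary here):

```python
def permutation(xs):
    """Permutation of a list of end-point values: split into odd-position
    and even-position subsets, sort each by value, and return the
    original 1-based indices in that sorted order.
    """
    upper = [(x, i + 1) for i, x in enumerate(xs) if i % 2 == 0]
    lower = [(x, i + 1) for i, x in enumerate(xs) if i % 2 == 1]
    return [i for _, i in sorted(upper)] + [i for _, i in sorted(lower)]
```

Two subsequences can then be compared for similarity by first checking that their permutations are equal.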

  16. Subsequence Similarity • Definition: S = {(X1,t1), (X2,t2), …, (Xn,tn)}, S’ = {(X1’,t1’), (X2’,t2’), …, (Xn’,tn’)}; S and S’ are similar if two conditions are satisfied: (1) S and S’ have the same permutation; (2) d(S,S’) < γ, where d is the distance function (its formula appears only as a figure in the original slide) and α, β, γ ≥ 0 are user-defined parameters • The permutation provides flexibility for time scaling and amplitude rescaling

  17. Event-driven subsequence match • Stream data are massive and real-time; doing a similarity search only after a fixed time period may lose potentially important information • Event — a new potential end point is identified and no pruning is needed • Event-driven subsequence match • Performs a subsequence similarity search automatically only when there is a new event • The generated query subsequence is the most recent n fixed and potential end points • Advantage: reduces the huge computation burden while maintaining sensitivity to changes

  18. Application: Trend Prediction • Trend of an end point: the tendency of the raw stream k end points after the current end point E (ε is a user-defined parameter): if Ek.X ≥ E.X + ε, E.trend = UP; if Ek.X ≤ E.X − ε, E.trend = DOWN; if E.X − ε < Ek.X < E.X + ε, E.trend = NOTREND; if Ek does not exist, E.trend = UNDEFINED • Predicting the trend of a query event: the subsequence similarity search returns a list of retrieved end points; F(D) = (# of retrieved end points with trend D) / (total # of retrieved end points) × 100%; if |F(UP) − F(DOWN)| < F(NOTREND) + λ, predict NOTREND; else if F(UP) > F(DOWN), predict UP; else predict DOWN (λ is a user-defined threshold)
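The voting scheme above can be sketched as follows; the input is simply the list of trend labels of the retrieved end points:

```python
def predict_trend(trends, lam):
    """Majority-vote trend prediction over retrieved end points.

    trends -- list of labels "UP" / "DOWN" / "NOTREND"
    lam    -- user-defined threshold λ
    """
    total = len(trends)
    f = {d: 100.0 * trends.count(d) / total
         for d in ("UP", "DOWN", "NOTREND")}   # the F(D) percentages
    if abs(f["UP"] - f["DOWN"]) < f["NOTREND"] + lam:
        return "NOTREND"
    return "UP" if f["UP"] > f["DOWN"] else "DOWN"
```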

  19. Conclusion • The online simultaneous segmentation and pruning algorithm for PLR achieves quick identification of new end points while maintaining accurate segmentation • The new similarity measure, a permutation plus a distance function, performs better than measures based on Euclidean distance • Experiments demonstrated that event-driven search outperformed searches at any fixed time period.

  20. Fast Subsequence Matching in Time-Series Databases • Whole matching • Given N data sequences S1, S2, …, SN and a query sequence Q, find those sequences that are within distance ε of Q; the Si and Q have the same length • Subsequence matching • Given N data sequences S1, S2, …, SN of arbitrary lengths, a query sequence Q, and a tolerance ε, find the data sequences Si that contain matching subsequences (within distance ε of Q)

  21. Whole matching • Use a distance-preserving transform (e.g., the DFT) to extract f features from each sequence • Map the f features to points in an f-dimensional feature space • Use a spatial access method (e.g., an R*-tree) to answer range/approximate queries • Precondition: data sequences and query sequences all have the same length
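A sketch of the feature-extraction step: keep the first f DFT coefficients of a sequence as its point in feature space. With a unitary normalization the DFT preserves Euclidean distance (Parseval's theorem), so truncating to f coefficients can only underestimate distances and thus causes no false dismissals:

```python
import cmath

def dft_features(seq, f):
    """First f coefficients of the unitary DFT of seq, computed naively
    from the definition (an FFT would be used in practice).
    """
    n = len(seq)
    return [
        sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n)
            for t in range(n)) / (n ** 0.5)   # 1/sqrt(n): unitary scaling
        for k in range(f)
    ]
```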

  22. Subsequence Matching Defined • Given N data sequences of real numbers S1, S2, …, SN of potentially different lengths • The user specifies a query subsequence Q of length Len(Q) and a tolerance ε (maximum distance) • Find quickly all sequences Si and the correct offsets k such that the subsequence Si[k : k+Len(Q)−1] matches the query sequence: D(Q, Si[k : k+Len(Q)−1]) ≤ ε • Sequential scanning is inefficient because of its time overhead

  23. ST-index • Assume the minimum query length is w • Use a sliding window of size w and place it at every possible position on every data sequence • For each placement, extract the features of the subsequence inside the window • A data sequence of length Len(S) is thus mapped to a trail in feature space • The trail consists of Len(S) − w + 1 points; each point represents one possible offset of the sliding window
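Mapping a sequence to its trail is then one feature point per window placement; a sketch, parameterized over any feature extractor (e.g., a truncated DFT):

```python
def trail(seq, w, features):
    """Trail of `seq` in feature space: apply `features` to every
    placement of a width-w sliding window, giving Len(S)-w+1 points.
    """
    return [features(seq[i:i + w]) for i in range(len(seq) - w + 1)]
```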

  24. How to index the trails • A straightforward way — I-naive • Keep track of the individual points of each trail and store them in a spatial access method • Problem • Storing the individual points of a trail in an R*-tree is inefficient in space and speed • Almost every point in a data sequence corresponds to a point in the f-dimensional feature space: a 1:f increase in storage

  25. MBR • Divide the trail into sub-trails; each sub-trail is represented by its minimum bounding (hyper-)rectangle (MBR) • Only a few MBRs need to be stored • When a query arrives, retrieve all the MBRs that intersect the query region • Some false alarms are included (their MBRs intersect the query region, but their sub-trails do not) • MBRs belonging to the same trail may overlap

  26. MBR (Cont’d) • Information stored per MBR • tstart, tend: offsets of the first and last positionings • sequence_id: unique identifier of the data sequence • (F1low, F1high, F2low, F2high, …): extent of the MBR

  27. MBR (Cont’d) • Group the MBRs to form higher-level MBRs • Non-leaf nodes do not store sequence_id or offsets

  28. Insertion — how to divide trails into sub-trails • I-fixed method • The sub-trail size is a fixed number or a simple function of Len(S) • The resulting MBRs fit the trail poorly.

  29. I-adaptive method • Goal: adapt to the distribution of the trail's points • Cost function • L = (L1, L2, …, Ln): sides of the n-dimensional MBR of a node in an R-tree • DA(L): average number of disk accesses for that node • Marginal cost of each point in a sub-trail of k points with MBR sides L: mc = DA(L)/k

  30. I-adaptive method: Algorithm • Assign the first point of the trail to a trivial sub-trail • FOR each successive point • IF it increases the marginal cost of the current sub-trail • THEN start another sub-trail • ELSE include it in the current sub-trail
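A Python sketch of the algorithm above. The paper derives its own DA(L) formula; the product of (Li + 0.5) over the MBR sides used here is one common R-tree cost model, so treat that particular cost function as an assumption:

```python
def i_adaptive(points):
    """Split a trail (list of feature-space points, each a tuple of
    coordinates) into sub-trails, growing the current sub-trail while a
    new point does not increase its marginal cost mc = DA(L)/k.
    """
    def marginal_cost(pts):
        # DA(L) modeled as the product of (side + 0.5) over all dims
        cost = 1.0
        for d in range(len(pts[0])):
            vals = [p[d] for p in pts]
            cost *= (max(vals) - min(vals)) + 0.5
        return cost / len(pts)

    subtrails = [[points[0]]]            # first point: trivial sub-trail
    for p in points[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [p]) > marginal_cost(current):
            subtrails.append([p])        # cost would rise: new sub-trail
        else:
            current.append(p)
    return subtrails
```

Nearby points are absorbed into one tight MBR, while a distant point (which would stretch the rectangle) starts a fresh sub-trail.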

  31. Searching: Len(Q) = w • Q is mapped to a point qf in feature space; the query corresponds to a sphere in feature space with center qf and radius ε • Retrieve the sub-trails whose MBRs intersect the query region using the index • Examine the corresponding subsequences of the data sequences to discard false alarms

  32. Searching: Len(Q) = p·w • If Q and S agree within tolerance ε, then at least one of the pairs (si, qi) of corresponding subsequences agrees within tolerance ε/√p • Q is broken into p sub-queries, which correspond to p spheres in feature space of radius ε/√p • Retrieve the sub-trails whose MBRs intersect at least one sub-query region using the ST-index • Examine the corresponding subsequences of the data sequences to discard false alarms
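The lemma behind the ε/√p radius can be checked numerically; this sketch splits both sequences into p equal pieces and tests whether some pair of pieces lies within ε/√p (Euclidean distance throughout):

```python
import math

def some_piece_within(q, s, p, eps):
    """Return True if at least one of the p corresponding piece pairs of
    q and s is within eps/sqrt(p). Whenever the full Euclidean distance
    D(q, s) <= eps, this must hold (otherwise all p squared piece
    distances would exceed eps**2 / p, summing past eps**2).
    """
    w = len(q) // p
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return any(
        dist(q[i * w:(i + 1) * w], s[i * w:(i + 1) * w]) <= eps / math.sqrt(p)
        for i in range(p)
    )
```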

  33. Conclusion • Designed a method that efficiently handles approximate queries for subsequence matching • It fulfills the following requirements: • Fast — experimental results showed orders-of-magnitude savings over sequential scanning • Small space overhead • Dynamic • Correct: no false dismissals

  34. Thank you!
