
Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping


Presentation Transcript


  1. Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping Presented by John Clark, March 24, 2014

  2. Paper Summary • Describes algorithmic changes to existing Dynamic Time Warping (DTW) calculations to increase search efficiency • Focuses on time-series data but demonstrates application to related data mining problems • Enables unprecedented volumes of data to be searched quickly • Attempts to correct the erroneous belief that DTW is too slow for general data mining

  3. Background • Time Series Data and Queries • Example Time Series • Dynamic Time Warping (DTW)

  4. Time Series Data and Queries • Time Series (T) • an ordered list of data points: T = t1, t2, ..., tm • contains shorter subsequences • Subsequence (Ti,k) • contiguous subset of the time series • starts at position i of the original series T and has length k • Ti,k = ti, ti+1, ..., ti+k-1, where 1 <= i <= (m - k + 1) • Candidate Subsequence (C) • subsequence of T to match against a known query • |C| = k • Query (Q) • input time series • |Q| = n • Euclidean Distance (ED) • distance between Q and C where |Q| = |C| • ED(Q, C) = sqrt( Σ i=1..n (qi - ci)^2 )
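
  A minimal Python sketch of the Euclidean distance between a query and an equal-length candidate, and of enumerating candidate subsequences of a longer series. This is my own NumPy-based illustration, not the paper's UCR Suite code:

    import numpy as np

    def euclidean_distance(Q, C):
        """ED(Q, C) = sqrt(sum_i (q_i - c_i)^2); requires |Q| == |C|."""
        Q, C = np.asarray(Q, dtype=float), np.asarray(C, dtype=float)
        assert Q.shape == C.shape, "query and candidate must have equal length"
        return np.sqrt(np.sum((Q - C) ** 2))

    # Candidate subsequences Ti,k of a toy series T (m = 100, k = 20)
    T = np.sin(np.linspace(0, 10, 100))
    k = 20
    candidates = [T[i:i + k] for i in range(len(T) - k + 1)]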

  5. Example Time Series • Medical Data: EEG and ECG • Financial Data: stock prices, financial transactions • Web Data: clickstreams • Misc. Data: Video and audio sequences

  6. Dynamic Time Warping (DTW) • ED is a one-to-one mapping of two sequences • DTW allows a non-linear mapping between two sequences • Formulation • construct an n x n matrix • element (i, j) = ED(qi, cj) for points qi and cj • apply a path constraint: the Sakoe-Chiba band • find the warping path • Warping Path (P) • contiguous set of matrix elements that defines a mapping between Q and C • P = p1, p2, ..., pt, ..., pT, where pt = (i, j)t and n <= T <= 2n - 1
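
  A minimal Python sketch of DTW under the Sakoe-Chiba band, filling only matrix cells with |i - j| <= r. This is an illustrative implementation with names of my own choosing, not the authors' code:

    import numpy as np

    def dtw_distance(Q, C, r):
        """Banded DTW: only cells within the Sakoe-Chiba band |i - j| <= r are filled."""
        n = len(Q)                          # assumes len(C) == len(Q)
        cost = np.full((n + 1, n + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(max(1, i - r), min(n, i + r) + 1):
                d = (Q[i - 1] - C[j - 1]) ** 2          # squared point distance
                cost[i, j] = d + min(cost[i - 1, j],        # insertion
                                     cost[i, j - 1],        # deletion
                                     cost[i - 1, j - 1])    # match
        return np.sqrt(cost[n, n])          # cost of the best warping path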

  7. Paper • Claims • Assumptions • Known Optimizations • UCR Suite • Experiments • Additional Applications • Conclusions

  8. Claims • The time series data mining bottleneck is similarity search time • Most time series work plateaus at millions of objects • Large datasets can be searched exactly with DTW more quickly than with current state-of-the-art Euclidean distance search algorithms • The authors’ tests used the largest set of time series data ever • The design is applicable to other mining problems • Allows real-time monitoring • DTW myths abound • Exact search is faster than any current approximate or indexed search

  9. Assumptions • Time Series Subsequences must be Normalized • Dynamic Time Warping is the best measure • no distance measure better than DTW was found in a survey of 800 papers • Arbitrary Query Lengths cannot be Indexed • no known techniques support similarity search of arbitrary lengths in billion+ point datasets • There exist data mining problems for which we are willing to wait several hours for an answer

  10. Time Series Subsequences must be Normalized • Intuitive idea, but not always implemented • Example: analysis of video frames • normalized analysis error rate: 0.087 • non-normalized analysis error rates when an offset or scaling of +/- 10% is applied: 0.326 and 0.193 • the error rate is off by at least 50% for an offset/scale of only +/- 5% on real data
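
  A minimal sketch of z-normalizing a subsequence before comparison (the property the slide argues is essential); the guard for constant subsequences is my own addition:

    import numpy as np

    def z_normalize(x):
        """Rescale a subsequence to zero mean and unit variance."""
        x = np.asarray(x, dtype=float)
        mu, sigma = x.mean(), x.std()
        if sigma == 0:                      # constant subsequence: avoid division by zero
            return x - mu
        return (x - mu) / sigma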

  11. Known Optimizations • Using Squared Distance • removes expensive square root computation without changing relative rankings • Lower Bounding • Early Abandoning of ED and LB_Keogh • Early Abandoning of DTW • Exploiting Multicores • linear speedup

  12. Lower Bounding • Speed up sequential search by setting up a lower bound and pruning unpromising candidates • LB_Kim (modified) O(1) • LB_Keogh O(n)
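
  A sketch of LB_Keogh: build the upper/lower envelope of the z-normalized query for warping window r, then sum the squared amounts by which the candidate escapes the envelope. Helper names are my own; this is the standard formulation, not code from the paper:

    import numpy as np

    def query_envelope(Q, r):
        """Upper/lower envelope of Q: running max/min over a window of half-width r."""
        Q = np.asarray(Q, dtype=float)
        n = len(Q)
        U = np.array([Q[max(0, i - r):min(n, i + r + 1)].max() for i in range(n)])
        L = np.array([Q[max(0, i - r):min(n, i + r + 1)].min() for i in range(n)])
        return U, L

    def lb_keogh(U, L, C):
        """Sum of squared excursions of candidate C outside [L, U]; lower-bounds DTW."""
        lb = 0.0
        for u, l, c in zip(U, L, C):
            if c > u:
                lb += (c - u) ** 2
            elif c < l:
                lb += (l - c) ** 2
        return np.sqrt(lb)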

  13. Early Abandoning of ED and LB_Keogh • Include a best-so-far (BSF) value to aid in early termination • If sum of squared differences exceeds BSF, terminate computation
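
  A sketch of early abandoning: accumulate squared differences against the squared best-so-far and stop as soon as the running sum can no longer win (illustrative only):

    def early_abandon_ed(Q, C, bsf_sq):
        """Return the squared ED, or infinity if the partial sum exceeds bsf_sq."""
        total = 0.0
        for q, c in zip(Q, C):
            total += (q - c) ** 2
            if total >= bsf_sq:             # the sum can only grow, so abandon now
                return float("inf")
        return total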

  14. Early Abandoning of DTW • Compute a full LB_Keogh lower bound • Compute DTW incrementally to form a new lower bound • intermediate lower bound = DTW(Q1:k, C1:k) + LB_Keogh(Qk+1:n, Ck+1:n) • DTW(Q1:n, C1:n) >= intermediate lower bound • If (BSF < intermediate lower bound), abandon DTW
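
  A sketch of the abandonment test described above: if the DTW distance of the first k points plus the LB_Keogh contribution of the remaining points already exceeds the best-so-far, the full DTW cannot be smaller. The cumulative tail array and helper names are my own assumptions:

    import numpy as np

    def keogh_tail_bounds(U, L, C):
        """tail[k] = squared LB_Keogh contribution of points k..n-1 (right-to-left cumsum)."""
        C, U, L = map(np.asarray, (C, U, L))
        per_point = np.where(C > U, (C - U) ** 2, np.where(C < L, (L - C) ** 2, 0.0))
        tail = np.zeros(len(C) + 1)
        tail[:-1] = per_point[::-1].cumsum()[::-1]
        return tail

    def should_abandon_dtw(partial_dtw_sq, k, tail, bsf_sq):
        """DTW(Q, C)^2 >= DTW(Q[:k], C[:k])^2 + LB_Keogh(Q[k:], C[k:])^2."""
        return partial_dtw_sq + tail[k] >= bsf_sq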

  15. UCR Suite • Early Abandoning Z-normalization • Reordering Early Abandoning • Reversing Query/Data Role in LB_Keogh • Cascading Lower Bounds

  16. Early Abandoning Z-Normalization • Normalization takes longer than computing the Euclidean Distance • Approach: interleave early abandoning of ED or LB_Keogh with online Z-normalization
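
  A sketch of interleaving online z-normalization with early-abandoning ED: running sums give each window's mean and standard deviation in O(1), and candidate points are normalized only as far as the distance loop survives. Function and variable names are hypothetical:

    import numpy as np

    def search_with_online_norm(T, Q_norm, bsf_sq):
        """Slide a window of length n over T, normalizing points lazily while
        accumulating the early-abandoning squared distance; returns the best
        squared distance found."""
        T = np.asarray(T, dtype=float)
        n = len(Q_norm)
        best = bsf_sq
        s, s2 = T[:n].sum(), (T[:n] ** 2).sum()     # running sums for the current window
        for i in range(len(T) - n + 1):
            mu = s / n
            var = max(s2 / n - mu * mu, 0.0)
            sigma = np.sqrt(var) if var > 0 else 1e-12
            total = 0.0
            for j in range(n):
                c = (T[i + j] - mu) / sigma         # normalize this point only when needed
                total += (c - Q_norm[j]) ** 2
                if total >= best:
                    break                           # abandon both ED and normalization
            else:
                best = total
            if i + n < len(T):                      # O(1) update of the running sums
                s += T[i + n] - T[i]
                s2 += T[i + n] ** 2 - T[i] ** 2
        return best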

  17. Reordering Early Abandoning • Traditionally compute distance / normalization in time-series order (left to right) • Approach • sort the indices based on absolute values of Z-normalized Q • compute distance / normalization with new order
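
  A sketch of the reordering idea: visit indices in decreasing absolute z-normalized query value, since those positions tend to contribute the largest squared differences and therefore trigger abandonment earliest. The ordering is computed once per query and reused:

    import numpy as np

    def reordered_early_abandon_ed(Q_norm, C_norm, bsf_sq, order=None):
        """Early-abandoning squared ED evaluated in a pruning-friendly index order."""
        Q_norm = np.asarray(Q_norm, dtype=float)
        if order is None:
            order = np.argsort(-np.abs(Q_norm))     # largest |q_i| first
        total = 0.0
        for idx in order:
            total += (Q_norm[idx] - C_norm[idx]) ** 2
            if total >= bsf_sq:
                return float("inf")
        return total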

  18. Reversing the Query/Data Role in LB_Keogh • Normally LB_Keogh is computed around the query • only needs to be done once, which saves time and space • Proposal: compute the lower bound around the candidate in a “just-in-time” fashion • calculated only if all other lower bounds fail to prune • removes the space overhead • the increased time overhead is offset by increased pruning of full DTW calculations
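
  A brief sketch of the reversed, just-in-time bound: the envelope is built around the candidate rather than the query, and only for candidates the cheaper bounds failed to prune. It reuses the query_envelope and lb_keogh helpers from the earlier sketch:

    def lb_keogh_ec(Q_norm, C_norm, r):
        """LB_KeoghEC: envelope around the (already normalized) candidate, bounded against Q."""
        U_c, L_c = query_envelope(C_norm, r)    # same routine, roles swapped
        return lb_keogh(U_c, L_c, Q_norm)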

  19. Cascading Lower Bounds • Multiple options for lower bounds • LB_KimFL, LB_KeoghEQ, LB_KeoghEC, early-abandoning DTW • The suggestion is to use all of them in a cascading fashion to maximize the amount of pruning • Can prune more than 99.9999% of DTW calculations
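
  A sketch of wiring the bounds into a cascade, cheapest first, so each stage prunes candidates before the next, more expensive one runs. It reuses helpers from the earlier sketches; the LB_Kim variant shown checks only the first and last aligned points, a simplified O(1) stand-in for LB_KimFL:

    def lb_kim_fl(Q_norm, C_norm):
        """O(1) bound from the first and last aligned points (simplified LB_KimFL)."""
        return (Q_norm[0] - C_norm[0]) ** 2 + (Q_norm[-1] - C_norm[-1]) ** 2

    def cascade_prunes(Q_norm, C_norm, U, L, r, bsf_sq):
        """True if some lower bound proves this candidate cannot beat the best-so-far."""
        if lb_kim_fl(Q_norm, C_norm) >= bsf_sq:             # O(1)
            return True
        if lb_keogh(U, L, C_norm) ** 2 >= bsf_sq:           # O(n), envelope around the query
            return True
        if lb_keogh_ec(Q_norm, C_norm, r) ** 2 >= bsf_sq:   # O(n), envelope around the candidate
            return True
        return False        # fall through to (early-abandoning) DTW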

  20. Experiments • Tests • Random Walk Baseline • Supporting Long Queries: EEG • Supporting Very Long Queries: DNA • Real-time Medical and Gesture Data • Algorithms • Naive • Z-norm and ED/DTW at each step • State-of-the-art (SOTA) • Z-norm, early abandoning, LB_Keogh for DTW • UCR Suite • all speedups • GOd’s ALgorithm (GOAL) • maintains only the mean and std. dev. online, in O(1) • a lower bound on the fastest possible time

  21. Random Walk Results

  22. EEG Results

  23. DNA Results

  24. Real-time Medical and Gesture Data • 8,518,554,188 ECG datapoints sampled at 256 Hz

  25. Application of UCR to Existing Mining Algorithms

  26. Paper’s Discussion and Conclusions • Focused on fast sequential search • Believed to be faster than all known indexed searches • Shown that UCR-DTW is faster than all current Euclidean distance searches (SOTA-ED) • Reason: SOTA-ED must spend O(n) normalizing each subsequence, while UCR-DTW’s weighted-average (amortized) cost per subsequence is less than O(n) • Compares the UCR method to a recent SOTA embedding-based DTW search called EBSM

  27. EBSM Comparison to UCR-DTW

  28. Conclusions • Well written • easy to follow • clear distinction and explanation of the modifications • thorough experimentation, with source code and pseudo-code made available • Not terribly innovative but very effective • the additions are straightforward and surprisingly intuitive • the execution / integration of the components makes this algorithm stand out • Defers a lot of explanation and theory of existing components to cited papers
