Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping
Presented by John Clark, March 24, 2014
Paper Summary
• Describes algorithmic changes to existing Dynamic Time Warping (DTW) calculations to increase search efficiency
• Focuses on time-series data, but demonstrates application to related data mining problems
• Enables far larger volumes of data to be searched quickly than previously possible
• Attempts to correct the erroneous belief that DTW is too slow for general data mining
Background
• Time Series Data and Queries
• Example Time Series
• Dynamic Time Warping (DTW)
Time Series Data and Queries
• Time Series (T)
  • An ordered list of data points: T = t1, t2, ..., tm
  • Contains shorter subsequences
• Subsequence (Ti,k)
  • Contiguous subset of the time series data
  • Starts at position i of the original series T and has length k
  • Ti,k = ti, ti+1, ..., ti+k-1, where 1 <= i <= (m - k + 1)
• Candidate Subsequence (C)
  • Subsequence of T to match against a known query
  • |C| = k
• Query (Q)
  • Time series input
  • |Q| = n
• Euclidean Distance (ED)
  • Distance between Q and C, defined only where |Q| = |C|
  • ED(Q, C) = sqrt( sum_{i=1..n} (qi - ci)^2 )
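A minimal sketch of the ED definition above, assuming equal-length, already normalized sequences; the function name is hypothetical and not from the paper.

```python
import math

def euclidean_distance(Q, C):
    """ED(Q, C) for equal-length sequences: the square root of the
    sum of squared point-wise differences."""
    assert len(Q) == len(C)
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(Q, C)))
```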
Example Time Series
• Medical Data: EEG and ECG
• Financial Data: stock prices, financial transactions
• Web Data: clickstreams
• Misc. Data: video and audio sequences
Dynamic Time Warping (DTW)
• ED is a one-to-one mapping of two sequences
• DTW allows a non-linear mapping between two sequences
• Formulation
  • Construct an n x n matrix
  • The (i, j) element is the distance d(qi, cj) between points qi and cj
  • Apply a path constraint: the Sakoe-Chiba Band
  • Find the warping path
• Warping Path (P)
  • Contiguous set of matrix elements that defines a mapping between Q and C
  • P = p1, p2, ..., pt, ..., pT, where pt = (i, j)t and n <= T <= 2n - 1
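A sketch of the formulation above as a standard dynamic program constrained by a Sakoe-Chiba band; variable names and the choice to return the squared distance (in line with the squared-distance optimization mentioned later) are assumptions, not the paper's code.

```python
def dtw_distance(Q, C, r):
    """Constrained DTW between equal-length sequences Q and C.
    r is the Sakoe-Chiba band half-width: cell (i, j) is reachable only if |i - j| <= r.
    Returns the squared DTW distance of the best warping path."""
    n = len(Q)
    INF = float("inf")
    # cost[i][j] holds the cumulative cost of the best warping path ending at (i, j)
    cost = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + r + 1)):
            d = (Q[i] - C[j]) ** 2
            if i == 0 and j == 0:
                cost[i][j] = d
            else:
                best_prev = min(
                    cost[i - 1][j] if i > 0 else INF,                 # step in Q only
                    cost[i][j - 1] if j > 0 else INF,                 # step in C only
                    cost[i - 1][j - 1] if i > 0 and j > 0 else INF,   # diagonal step
                )
                cost[i][j] = d + best_prev
    return cost[n - 1][n - 1]
```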
Paper
• Claims
• Assumptions
• Known Optimizations
• UCR Suite
• Experiments
• Additional Applications
• Conclusions
Claims
• The time series data mining bottleneck is similarity search time
• Most time series work plateaus at millions of objects
• Large datasets can be searched exactly with DTW more quickly than with current state-of-the-art Euclidean distance search algorithms
• The authors' tests used the largest set of time series data ever examined
• The design is applicable to other mining problems
• It allows real-time monitoring
• Myths about DTW abound
• Exact search is faster than any current approximate or indexed search
Assumptions
• Time series subsequences must be normalized
• Dynamic Time Warping is the best measure
  • No known distance measure is better than DTW, after a search of over 800 papers
• Arbitrary query lengths cannot be indexed
  • No known techniques support similarity search of arbitrary lengths in billion+ datasets
• There exist data mining problems that we are willing to wait several hours to answer
Time Series Subsequences must be Normalized
• Intuitive idea, but not always implemented
• Example: analysis of video frames
  • Normalized analysis error rate: 0.087
  • Non-normalized analysis error rates when an offset or scaling of +/- 10% is applied: 0.326 and 0.193
  • On real data, the error rate worsens by at least 50% even for an offset/scale of +/- 5%
Known Optimizations
• Using Squared Distance
  • Removes the expensive square root computation without changing relative rankings
• Lower Bounding
• Early Abandoning of ED and LB_Keogh
• Early Abandoning of DTW
• Exploiting Multicores
  • Linear speedup
Lower Bounding
• Speed up sequential search by computing a lower bound and pruning unpromising candidates
• LB_Kim (modified): O(1)
• LB_Keogh: O(n)
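A sketch of how LB_Keogh can be computed, assuming the query's upper and lower envelope (U, L) within the warping band has been precomputed; the names are hypothetical, and the bound is kept in squared form to match the squared-distance optimization.

```python
def lb_keogh(C, U, L):
    """LB_Keogh lower bound on DTW(Q, C).
    U[i] and L[i] are the upper/lower envelope of the query Q within the warping band:
    U[i] = max(Q[i-r : i+r+1]), L[i] = min(Q[i-r : i+r+1]).
    Any point of C falling outside the envelope contributes its squared
    distance to the nearer envelope bound."""
    total = 0.0
    for i, c in enumerate(C):
        if c > U[i]:
            total += (c - U[i]) ** 2
        elif c < L[i]:
            total += (c - L[i]) ** 2
    return total
```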
Early Abandoning of ED and LB_Keogh
• Maintain a best-so-far (BSF) value to aid in early termination
• If the sum of squared differences exceeds the BSF, terminate the computation
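A sketch of the early-abandoning idea for squared ED, assuming already normalized inputs; the best-so-far value would come from the surrounding search loop, and the function name is hypothetical.

```python
def early_abandon_ed(Q, C, best_so_far):
    """Squared Euclidean distance with early abandoning: stop as soon as
    the partial sum already exceeds the best-so-far (BSF) distance."""
    total = 0.0
    for q, c in zip(Q, C):
        total += (q - c) ** 2
        if total >= best_so_far:
            return float("inf")  # candidate cannot beat the current best match
    return total
```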
Early Abandoning of DTW
• Compute a full LB_Keogh lower bound
• Compute DTW incrementally to form a new lower bound
  • Intermediate lower bound = DTW(Q1:k, C1:k) + LB_Keogh(Qk+1:n, Ck+1:n)
  • DTW(Q1:n, C1:n) >= intermediate lower bound
• If the intermediate lower bound exceeds the BSF, abandon the DTW computation
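One way to make the intermediate lower bound cheap to evaluate is to precompute the LB_Keogh contribution of every suffix of the candidate, as in this hedged sketch; the helper name is hypothetical and not the paper's code. While the DTW matrix is being filled, the partial DTW cost over the first k points plus contrib[k] (the LB_Keogh contribution of the remaining points) can then be compared against the BSF after each step.

```python
def suffix_lb_contributions(C, U, L):
    """Precompute LB_Keogh contributions of each candidate suffix C[k:],
    so a partially computed DTW over the first k points can be combined with
    contrib[k] to form the intermediate lower bound described above."""
    n = len(C)
    contrib = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        c = C[i]
        d = (c - U[i]) ** 2 if c > U[i] else (c - L[i]) ** 2 if c < L[i] else 0.0
        contrib[i] = contrib[i + 1] + d
    return contrib  # contrib[k] = LB_Keogh over positions k..n-1
```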
UCR Suite
• Early Abandoning Z-normalization
• Reordering Early Abandoning
• Reversing the Query/Data Role in LB_Keogh
• Cascading Lower Bounds
Early Abandoning Z-Normalization
• Normalization can take longer than computing the Euclidean distance itself
• Approach: interleave the early abandoning of ED or LB_Keogh with online Z-normalization
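A sketch of the interleaving, assuming the query is already z-normalized and that the candidate window's mean and standard deviation are maintained incrementally from running sums (O(1) per slide of the window); names are hypothetical.

```python
def early_abandon_znorm_ed(Q_norm, C, mean, std, best_so_far):
    """Interleave z-normalization of the candidate with early-abandoning squared ED.
    Q_norm is the already z-normalized query; mean/std describe the current
    candidate window and are assumed to be maintained from running sums."""
    total = 0.0
    for q, c in zip(Q_norm, C):
        c_norm = (c - mean) / std   # normalize this point only when it is actually needed
        total += (q - c_norm) ** 2
        if total >= best_so_far:
            return float("inf")     # abandon: the rest of C is never normalized
    return total
```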
Reordering Early Abandoning
• Traditionally, the distance / normalization is computed in time-series order (left to right)
• Approach
  • Sort the indices based on the absolute values of the Z-normalized Q
  • Compute the distance / normalization in the new order
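A sketch of the reordering idea: visit the points in order of decreasing |z-normalized query value|, since those points tend to contribute the largest differences and therefore trigger abandonment earliest; the function names are hypothetical.

```python
def reorder_indices(Q_norm):
    """Comparison order: largest |z-normalized query value| first."""
    return sorted(range(len(Q_norm)), key=lambda i: abs(Q_norm[i]), reverse=True)

def early_abandon_ed_reordered(Q_norm, C_norm, order, best_so_far):
    """Early-abandoning squared ED, evaluated in the precomputed order."""
    total = 0.0
    for i in order:
        total += (Q_norm[i] - C_norm[i]) ** 2
        if total >= best_so_far:
            return float("inf")
    return total
```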
Reversing the Query/Data Role in LB_Keogh
• Normally, LB_Keogh is computed around the query
  • This only needs to be done once, which saves time and space
• Proposal: also compute the lower bound around the candidate, in a "just-in-time" fashion
  • Calculate it only if all other lower bounds fail to prune
  • Removes the space overhead
  • The increased time overhead is offset by the increased pruning of full DTW calculations
Cascading Lower Bounds
• Multiple options for lower bounds
  • LB_KimFL, LB_KeoghEQ, LB_KeoghEC, Early Abandoning DTW
• The suggestion is to use all of them in a cascading fashion to maximize the amount of pruning
• More than 99.9999% of the full DTW calculations can be pruned (see the sketch below)
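A hedged sketch of the cascade, reusing the lb_keogh and dtw_distance sketches above; lb_kim_fl here is a simplified stand-in for the paper's LB_KimFL, and the candidate is assumed to be already normalized (in the real UCR Suite, normalization is itself interleaved and abandonable).

```python
def lb_kim_fl(Q_norm, C_norm):
    """O(1) LB_Kim variant: squared distances of the first and the last pair of points."""
    return (Q_norm[0] - C_norm[0]) ** 2 + (Q_norm[-1] - C_norm[-1]) ** 2

def dtw_if_not_pruned(Q_norm, C_norm, U, L, r, best_so_far):
    """Apply the lower bounds cheapest-first; run the full DTW only if none of them prunes.
    The paper's cascade also tries LB_Keogh with the query/data roles reversed and
    the early-abandoning DTW before falling back to the full computation."""
    if lb_kim_fl(Q_norm, C_norm) >= best_so_far:
        return None                                   # pruned in O(1)
    if lb_keogh(C_norm, U, L) >= best_so_far:
        return None                                   # pruned in O(n)
    d = dtw_distance(Q_norm, C_norm, r)               # expensive full DTW
    return d if d < best_so_far else None
```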
Experiments
• Tests
  • Random Walk Baseline
  • Supporting Long Queries: EEG
  • Supporting Very Long Queries: DNA
  • Real-time Medical and Gesture Data
• Algorithms
  • Naive: Z-norm and ED / DTW at each step
  • State-of-the-art (SOTA): Z-norm, early abandoning, LB_Keogh for DTW
  • UCR Suite: all speedups
  • God's Algorithm (GOAL): only maintains the mean and std. dev. online in O(1); a lower bound on the fastest possible time
Real-time Medical and Gesture Data
• 8,518,554,188 ECG datapoints sampled at 256 Hz
Paper's Discussion and Conclusions
• Focused on fast sequential search, which is believed to be faster than all known indexing searches
• Shown that UCR-DTW is faster than all current Euclidean distance searches (SOTA-ED)
  • Reason: ED requires an O(n) normalization step for each subsequence, while UCR-DTW's weighted average cost per subsequence is less than O(n)
• The UCR method is also compared to a recent SOTA embedding-based DTW search called EBSM
Conclusions
• Well written
  • Easy to follow
  • Clear distinction and explanation of the modifications
  • Thorough experimentation, with source code and pseudo-code available
• Not terribly innovative, but very effective
  • The additions are straightforward and surprisingly intuitive
  • The execution / integration of the components makes this algorithm stand out
• Defers a lot of the explanation and theory of existing components to cited papers