Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping
Presented by John Clark, March 24, 2014
Paper Summary
• Describes algorithmic changes to existing Dynamic Time Warping (DTW) calculations to increase search efficiency
• Focuses on time-series data, but demonstrates application to related data mining problems
• Enables far larger volumes of data to be searched quickly than previously possible
• Attempts to correct the erroneous belief that DTW is too slow for general data mining
Background
• Time Series Data and Queries
• Example Time Series
• Dynamic Time Warping (DTW)
Time Series Data and Queries
• Time Series (T)
  • An ordered list of data points: T = t1, t2, ..., tm
  • Contains shorter subsequences
• Subsequence (Ti,k)
  • Contiguous subset of the time series data
  • Starts at position i of the original series T and has length k
  • Ti,k = ti, ti+1, ..., ti+k-1, where 1 <= i <= (m - k + 1)
• Candidate Subsequence (C)
  • Subsequence of T to match against a known query
  • |C| = k
• Query (Q)
  • Time series input
  • |Q| = n
• Euclidean Distance (ED)
  • Distance between Q and C, defined only where |Q| = |C|
  • ED(Q, C) = sqrt( sum_{i=1..n} (qi - ci)^2 )
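A minimal sketch of the ED definition above, assuming equal-length, already normalized sequences; the function name is hypothetical and not from the paper.

```python
import math

def euclidean_distance(Q, C):
    """ED(Q, C) for equal-length sequences: the square root of the
    sum of squared point-wise differences."""
    assert len(Q) == len(C)
    return math.sqrt(sum((q - c) ** 2 for q, c in zip(Q, C)))
```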
Example Time Series
• Medical Data: EEG and ECG
• Financial Data: stock prices, financial transactions
• Web Data: clickstreams
• Misc. Data: video and audio sequences
Dynamic Time Warping (DTW)
• ED is a one-to-one mapping of two sequences
• DTW allows a non-linear mapping between two sequences
• Formulation
  • Construct an n x n matrix
  • The (i, j) element is the distance d(qi, cj) between points qi and cj
  • Apply a path constraint: the Sakoe-Chiba Band
  • Find the warping path
• Warping Path (P)
  • Contiguous set of matrix elements that defines a mapping between Q and C
  • P = p1, p2, ..., pt, ..., pT, where pt = (i, j)t and n <= T <= 2n - 1
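A sketch of the formulation above as a standard dynamic program constrained by a Sakoe-Chiba band; variable names and the choice to return the squared distance (in line with the squared-distance optimization mentioned later) are assumptions, not the paper's code.

```python
def dtw_distance(Q, C, r):
    """Constrained DTW between equal-length sequences Q and C.
    r is the Sakoe-Chiba band half-width: cell (i, j) is reachable only if |i - j| <= r.
    Returns the squared DTW distance of the best warping path."""
    n = len(Q)
    INF = float("inf")
    # cost[i][j] holds the cumulative cost of the best warping path ending at (i, j)
    cost = [[INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + r + 1)):
            d = (Q[i] - C[j]) ** 2
            if i == 0 and j == 0:
                cost[i][j] = d
            else:
                best_prev = min(
                    cost[i - 1][j] if i > 0 else INF,                 # step in Q only
                    cost[i][j - 1] if j > 0 else INF,                 # step in C only
                    cost[i - 1][j - 1] if i > 0 and j > 0 else INF,   # diagonal step
                )
                cost[i][j] = d + best_prev
    return cost[n - 1][n - 1]
```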
Paper
• Claims
• Assumptions
• Known Optimizations
• UCR Suite
• Experiments
• Additional Applications
• Conclusions
Claims
• The time series data mining bottleneck is similarity search time
• Most time series work plateaus at millions of objects
• Large datasets can be searched exactly with DTW more quickly than with current state-of-the-art Euclidean distance search algorithms
• The authors' tests used the largest set of time series data ever examined
• The design is applicable to other mining problems
• It allows real-time monitoring
• Myths about DTW abound
• Exact search is faster than any current approximate or indexed search
Assumptions
• Time series subsequences must be normalized
• Dynamic Time Warping is the best measure
  • No known distance measure is better than DTW, after a search of over 800 papers
• Arbitrary query lengths cannot be indexed
  • No known techniques support similarity search of arbitrary lengths in billion+ datasets
• There exist data mining problems that we are willing to wait several hours to answer
Time Series Subsequences must be Normalized
• Intuitive idea, but not always implemented
• Example: analysis of video frames
  • Normalized analysis error rate: 0.087
  • Non-normalized analysis error rates when an offset or scaling of +/- 10% is applied: 0.326 and 0.193
  • On real data, the error rate worsens by at least 50% even for an offset/scale of +/- 5%
Known Optimizations
• Using Squared Distance
  • Removes the expensive square root computation without changing relative rankings
• Lower Bounding
• Early Abandoning of ED and LB_Keogh
• Early Abandoning of DTW
• Exploiting Multicores
  • Linear speedup
Lower Bounding
• Speed up sequential search by computing a lower bound and pruning unpromising candidates
• LB_Kim (modified): O(1)
• LB_Keogh: O(n)
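A sketch of how LB_Keogh can be computed, assuming the query's upper and lower envelope (U, L) within the warping band has been precomputed; the names are hypothetical, and the bound is kept in squared form to match the squared-distance optimization.

```python
def lb_keogh(C, U, L):
    """LB_Keogh lower bound on DTW(Q, C).
    U[i] and L[i] are the upper/lower envelope of the query Q within the warping band:
    U[i] = max(Q[i-r : i+r+1]), L[i] = min(Q[i-r : i+r+1]).
    Any point of C falling outside the envelope contributes its squared
    distance to the nearer envelope bound."""
    total = 0.0
    for i, c in enumerate(C):
        if c > U[i]:
            total += (c - U[i]) ** 2
        elif c < L[i]:
            total += (c - L[i]) ** 2
    return total
```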
Early Abandoning of ED and LB_Keogh
• Maintain a best-so-far (BSF) value to aid in early termination
• If the sum of squared differences exceeds the BSF, terminate the computation
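A sketch of the early-abandoning idea for squared ED, assuming already normalized inputs; the best-so-far value would come from the surrounding search loop, and the function name is hypothetical.

```python
def early_abandon_ed(Q, C, best_so_far):
    """Squared Euclidean distance with early abandoning: stop as soon as
    the partial sum already exceeds the best-so-far (BSF) distance."""
    total = 0.0
    for q, c in zip(Q, C):
        total += (q - c) ** 2
        if total >= best_so_far:
            return float("inf")  # candidate cannot beat the current best match
    return total
```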
Early Abandoning of DTW
• Compute a full LB_Keogh lower bound
• Compute DTW incrementally to form a new lower bound
  • Intermediate lower bound = DTW(Q1:k, C1:k) + LB_Keogh(Qk+1:n, Ck+1:n)
  • DTW(Q1:n, C1:n) >= intermediate lower bound
• If the intermediate lower bound exceeds the BSF, abandon the DTW computation
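One way to make the intermediate lower bound cheap to evaluate is to precompute the LB_Keogh contribution of every suffix of the candidate, as in this hedged sketch; the helper name is hypothetical and not the paper's code. While the DTW matrix is being filled, the partial DTW cost over the first k points plus contrib[k] (the LB_Keogh contribution of the remaining points) can then be compared against the BSF after each step.

```python
def suffix_lb_contributions(C, U, L):
    """Precompute LB_Keogh contributions of each candidate suffix C[k:],
    so a partially computed DTW over the first k points can be combined with
    contrib[k] to form the intermediate lower bound described above."""
    n = len(C)
    contrib = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        c = C[i]
        d = (c - U[i]) ** 2 if c > U[i] else (c - L[i]) ** 2 if c < L[i] else 0.0
        contrib[i] = contrib[i + 1] + d
    return contrib  # contrib[k] = LB_Keogh over positions k..n-1
```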
UCR Suite
• Early Abandoning Z-normalization
• Reordering Early Abandoning
• Reversing the Query/Data Role in LB_Keogh
• Cascading Lower Bounds
Early Abandoning Z-Normalization
• Normalization can take longer than computing the Euclidean distance itself
• Approach: interleave the early abandoning of ED or LB_Keogh with online Z-normalization
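A sketch of the interleaving, assuming the query is already z-normalized and that the candidate window's mean and standard deviation are maintained incrementally from running sums (O(1) per slide of the window); names are hypothetical.

```python
def early_abandon_znorm_ed(Q_norm, C, mean, std, best_so_far):
    """Interleave z-normalization of the candidate with early-abandoning squared ED.
    Q_norm is the already z-normalized query; mean/std describe the current
    candidate window and are assumed to be maintained from running sums."""
    total = 0.0
    for q, c in zip(Q_norm, C):
        c_norm = (c - mean) / std   # normalize this point only when it is actually needed
        total += (q - c_norm) ** 2
        if total >= best_so_far:
            return float("inf")     # abandon: the rest of C is never normalized
    return total
```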
Reordering Early Abandoning
• Traditionally, the distance / normalization is computed in time-series order (left to right)
• Approach
  • Sort the indices based on the absolute values of the Z-normalized Q
  • Compute the distance / normalization in the new order
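A sketch of the reordering idea: visit the points in order of decreasing |z-normalized query value|, since those points tend to contribute the largest differences and therefore trigger abandonment earliest; the function names are hypothetical.

```python
def reorder_indices(Q_norm):
    """Comparison order: largest |z-normalized query value| first."""
    return sorted(range(len(Q_norm)), key=lambda i: abs(Q_norm[i]), reverse=True)

def early_abandon_ed_reordered(Q_norm, C_norm, order, best_so_far):
    """Early-abandoning squared ED, evaluated in the precomputed order."""
    total = 0.0
    for i in order:
        total += (Q_norm[i] - C_norm[i]) ** 2
        if total >= best_so_far:
            return float("inf")
    return total
```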
Reversing the Query/Data Role in LB_Keogh
• Normally, LB_Keogh is computed around the query
  • This only needs to be done once, which saves time and space
• Proposal: also compute the lower bound around the candidate, in a "just-in-time" fashion
  • Calculate it only if all other lower bounds fail to prune
  • Removes the space overhead
  • The increased time overhead is offset by the increased pruning of full DTW calculations
Cascading Lower Bounds
• Multiple options for lower bounds
  • LB_KimFL, LB_KeoghEQ, LB_KeoghEC, Early Abandoning DTW
• The suggestion is to use all of them in a cascading fashion to maximize the amount of pruning
• More than 99.9999% of the full DTW calculations can be pruned (see the sketch below)
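A hedged sketch of the cascade, reusing the lb_keogh and dtw_distance sketches above; lb_kim_fl here is a simplified stand-in for the paper's LB_KimFL, and the candidate is assumed to be already normalized (in the real UCR Suite, normalization is itself interleaved and abandonable).

```python
def lb_kim_fl(Q_norm, C_norm):
    """O(1) LB_Kim variant: squared distances of the first and the last pair of points."""
    return (Q_norm[0] - C_norm[0]) ** 2 + (Q_norm[-1] - C_norm[-1]) ** 2

def dtw_if_not_pruned(Q_norm, C_norm, U, L, r, best_so_far):
    """Apply the lower bounds cheapest-first; run the full DTW only if none of them prunes.
    The paper's cascade also tries LB_Keogh with the query/data roles reversed and
    the early-abandoning DTW before falling back to the full computation."""
    if lb_kim_fl(Q_norm, C_norm) >= best_so_far:
        return None                                   # pruned in O(1)
    if lb_keogh(C_norm, U, L) >= best_so_far:
        return None                                   # pruned in O(n)
    d = dtw_distance(Q_norm, C_norm, r)               # expensive full DTW
    return d if d < best_so_far else None
```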
Experiments
• Tests
  • Random Walk Baseline
  • Supporting Long Queries: EEG
  • Supporting Very Long Queries: DNA
  • Real-time Medical and Gesture Data
• Algorithms
  • Naive: Z-norm and ED / DTW at each step
  • State-of-the-art (SOTA): Z-norm, early abandoning, LB_Keogh for DTW
  • UCR Suite: all speedups
  • God's Algorithm (GOAL): only maintains the mean and std. dev. online in O(1); a lower bound on the fastest possible time
Real-time Medical and Gesture Data
• 8,518,554,188 ECG datapoints sampled at 256 Hz
Paper's Discussion and Conclusions
• Focused on fast sequential search, which is believed to be faster than all known indexing searches
• Shown that UCR-DTW is faster than all current Euclidean distance searches (SOTA-ED)
  • Reason: ED requires an O(n) normalization step for each subsequence, while UCR-DTW's weighted average cost per subsequence is less than O(n)
• The UCR method is also compared to a recent SOTA embedding-based DTW search called EBSM
Conclusions
• Well written
  • Easy to follow
  • Clear distinction and explanation of the modifications
  • Thorough experimentation, with source code and pseudo-code available
• Not terribly innovative, but very effective
  • The additions are straightforward and surprisingly intuitive
  • The execution / integration of the components makes this algorithm stand out
• Defers a lot of the explanation and theory of existing components to cited papers