iSAX: Indexing and Mining Terabyte Sized Time Series Jin Shieh, Eamonn Keogh Computer Science & Eng. Dept. University of California, Riverside
Outline • Introduction • Motivating example • iSAX representation • Indexing time series • Experimental evaluation • Conclusion
Introduction • Our work extends a popular symbolic representation of time series to allow for the indexing and retrieval of millions of time series • Symbolic Aggregate approXimation (SAX) • Represent a time series T of length n in w-dimensional space using PAA: $\bar{T} = \bar{t}_1, \ldots, \bar{t}_w$ • Where the ith element of $\bar{T}$ is: $\bar{t}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1)+1}^{\frac{n}{w} i} t_j$ • Then discretize $\bar{T}$ into a vector of symbols • Breakpoints map to a small alphabet of cardinality a • [Figure: three panels over time steps 0 to 16 with values from -3 to 3: a time series T; PAA(T,4); iSAX(T,4,4) with regions labeled 00, 01, 10, 11]
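The PAA and discretization steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, symbols are encoded as integers 0..a-1 with 0 as the lowest region (an assumed convention), the input is assumed to be already z-normalized, and n is assumed to be a multiple of w. Breakpoints are taken as equiprobable quantiles of the standard normal distribution.

```python
from bisect import bisect_right
from statistics import NormalDist


def paa(series, w):
    """Piecewise Aggregate Approximation: mean of each of w equal-length segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of w")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def breakpoints(a):
    """The a-1 breakpoints that cut N(0, 1) into a equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / a) for k in range(1, a)]


def sax(series, w, a):
    """SAX word as integer symbols; 0 is the lowest region (assumed encoding)."""
    bps = breakpoints(a)
    return [bisect_right(bps, v) for v in paa(series, w)]


word = sax([-2, -2, -0.3, -0.3, 0.3, 0.3, 2, 2], w=4, a=4)
print(word)
```

With cardinality 4 the breakpoints are roughly -0.67, 0 and 0.67, so each PAA value is simply looked up in that sorted list.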
Introduction (cont.) • SAX is lower bounding • Given two SAX representations $T^a$, $S^a$, a lower bound to the Euclidean distance is: $\mathrm{MINDIST}(T^a, S^a) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \mathrm{dist}(t_i, s_i)^2}$ • dist(ti,si) is the smallest distance between the breakpoints that characterize each symbol, 0 if they overlap
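A small sketch of the lower bound for two SAX words of equal cardinality is given below. The names `symbol_dist` and `mindist` are hypothetical, and symbols are assumed to be integers 0..a-1 with 0 as the lowest region. Equal or adjacent symbols share a breakpoint, so their distance is 0; otherwise the distance is the gap between the nearest breakpoints of the two regions.

```python
from math import sqrt
from statistics import NormalDist


def breakpoints(a):
    """The a-1 breakpoints that cut N(0, 1) into a equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / a) for k in range(1, a)]


def symbol_dist(r, c, bps):
    """Smallest distance between the regions of symbols r and c (0 if they touch)."""
    if abs(r - c) <= 1:
        return 0.0
    # Lower edge of the higher region minus upper edge of the lower region.
    return bps[max(r, c) - 1] - bps[min(r, c)]


def mindist(word_t, word_s, n, a):
    """Lower bound on the Euclidean distance between the original length-n series."""
    bps = breakpoints(a)
    w = len(word_t)
    total = sum(symbol_dist(t, s, bps) ** 2 for t, s in zip(word_t, word_s))
    return sqrt(n / w) * sqrt(total)


print(mindist([0, 3], [3, 0], n=16, a=4))
```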
Motivating Example • Why not just index using SAX? • For example: index 1,000,000 time series using SAX • Choose SAX parameters • cardinality = 8, wordlength = 4 • 8^4 = 4,096 possible SAX word labels • Place time series which map to the same label in the same file on disk • Compute label for query and retrieve matching file • Time series in file likely to be good approximate matches • Average label occupancy 1,000,000/4,096 ≈ 244 (reasonable)
Motivating Example (cont.) • In practice, the distribution of time series to SAX word labels is not uniform! • Some labels are empty • Others hold a disproportionate percentage of the dataset • Ideal condition: We want to give a threshold th, and have the number of entries n mapped to each label satisfy 1 ≤ n ≤ th • Favor larger n • How can we achieve this? We need to make SAX more flexible
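The occupancy argument can be explored with a small simulation. The sketch below is an assumption-laden illustration rather than the authors' experiment: it generates Gaussian random walks (10,000 instead of 1,000,000, to keep it fast), z-normalizes them, maps each to a SAX word with cardinality 8 and word length 4, and counts how many series land on each label. All function names are hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter
from math import sqrt
from statistics import NormalDist


def znorm(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu = sum(series) / len(series)
    sigma = sqrt(sum((x - mu) ** 2 for x in series) / len(series))
    return [(x - mu) / sigma for x in series]


def sax_word(series, w, bps):
    """SAX word (tuple of integer symbols) for a z-normalized series."""
    seg = len(series) // w
    means = [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]
    return tuple(bisect_right(bps, m) for m in means)


def label_occupancy(num_series, length, w, a, seed=0):
    """Count how many random walks map to each SAX label."""
    rng = random.Random(seed)
    bps = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    counts = Counter()
    for _ in range(num_series):
        walk, x = [], 0.0
        for _ in range(length):
            x += rng.gauss(0, 1)
            walk.append(x)
        counts[sax_word(znorm(walk), w, bps)] += 1
    return counts


counts = label_occupancy(10_000, 64, w=4, a=8)
print("possible labels:", 8 ** 4)
print("occupied labels:", len(counts))
print("largest bucket: ", max(counts.values()))
print("uniform average:", 10_000 / 8 ** 4)
```

Comparing the largest bucket with the uniform average gives a quick feel for how uneven the mapping is on a given dataset.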
iSAX Representation • SAX uses a single hard-coded cardinality • Unable to differentiate only on dimensions of interest • We will show that the indexing problem can be solved if we extend SAX to allow: • Different cardinalities within a single word • Comparison of words with different cardinalities • We call this extension indexable SAX (iSAX)
iSAX Representation (cont.) • Multi-resolution property • Readily convert to any lower resolution that differs by a power of two • Lower bounding distance between iSAX words enforced through examination of both sets of breakpoints • iSAX offers a bit-aware, quantized, multi-resolution representation with variable granularity
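The multi-resolution property and the mixed-cardinality lower bound can be sketched as below. This is one possible formulation under stated assumptions, not necessarily the authors' procedure: each symbol is a hypothetical `(value, cardinality)` pair with a power-of-two cardinality, value 0 is the lowest region, and the distance between two symbols is computed directly from the interval each one covers under its own set of breakpoints.

```python
from math import inf, sqrt
from statistics import NormalDist


def demote(value, card, new_card):
    """Convert a symbol to a lower power-of-two cardinality by dropping low bits."""
    bits, new_bits = card.bit_length() - 1, new_card.bit_length() - 1
    if new_bits > bits:
        raise ValueError("can only convert to a lower or equal cardinality")
    return value >> (bits - new_bits)


def region(value, card):
    """Interval of real values covered by a symbol at its own cardinality."""
    nd = NormalDist()
    lo = -inf if value == 0 else nd.inv_cdf(value / card)
    hi = inf if value == card - 1 else nd.inv_cdf((value + 1) / card)
    return lo, hi


def symbol_dist(sym_t, sym_s):
    """Gap between two symbol regions; 0 if the regions overlap or touch."""
    lo_t, hi_t = region(*sym_t)
    lo_s, hi_s = region(*sym_s)
    return max(0.0, lo_t - hi_s, lo_s - hi_t)


def isax_mindist(word_t, word_s, n):
    """Lower bound for two iSAX words whose symbols may differ in cardinality."""
    w = len(word_t)
    total = sum(symbol_dist(t, s) ** 2 for t, s in zip(word_t, word_s))
    return sqrt(n / w) * sqrt(total)


print(demote(0b110, 8, 4))               # keeps the two high-order bits
print(symbol_dist((0b110, 8), (1, 2)))   # regions overlap, so the gap is zero
```

Because the breakpoints at cardinality c are a subset of those at 2c, dropping a low-order bit always yields a region that contains the original one.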
Indexing with iSAX • Split a set of time series represented by a common iSAX word into mutually exclusive subsets (using the multi-resolution property): • Increase cardinality along d dimensions, where w is the word length and 1 ≤ d ≤ w • Fan-out rate bounded by 2^d • Iterative doubling • Given a base cardinality b, cardinality at the i-th increase is b·2^i • Breakpoints at each cardinality align with those of the next higher cardinality • Allows for index structures which are hierarchical, with non-overlapping regions, and a controlled fan-out rate
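The splitting step can be illustrated with a short sketch. It assumes the same hypothetical `(value, cardinality)` symbol encoding as a bit string read as an integer; doubling the cardinality of one symbol appends one bit, so splitting along d dimensions yields at most 2^d child words.

```python
from itertools import product


def cardinality_after(b, i):
    """Cardinality after the i-th doubling of base cardinality b."""
    return b * 2 ** i


def split(word, dims):
    """Child words obtained by doubling the cardinality along the given dimensions."""
    children = []
    for bits in product((0, 1), repeat=len(dims)):
        child = list(word)
        for dim, bit in zip(dims, bits):
            value, card = child[dim]
            # Appending one bit refines the region into its lower or upper half.
            child[dim] = ((value << 1) | bit, card * 2)
        children.append(child)
    return children


for child in split([(0, 2), (1, 2)], dims=[0]):
    print(child)
```

Each child covers a sub-region of its parent, so the subsets are mutually exclusive and the resulting hierarchy has non-overlapping regions.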
Indexing with iSAX (cont.) • Simple tree-based index (base cardinality b, word length w, threshold th) • Hierarchically subdivides SAX space until the number of entries in each subspace falls within th • Leaf nodes point to index files on disk • Internal nodes designate a split in SAX space • Approximate Search • Similar time series often represented by same iSAX word • Traverse index until leaf • Match iSAX representation at each level • Apply heuristics if no match • Exact Search • Leverage approximate search • Prune search space • Lower bounding distance
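A toy in-memory version of such a tree, with approximate search only, might look like the sketch below. Several choices here are assumptions made for illustration and are not taken from the slides: leaves keep their series in memory instead of in files on disk, an overflowing leaf is split along one dimension chosen round-robin by depth, a leaf that still overflows after a split is only split again on its next insert, and the "no match" heuristic is replaced by returning an empty list. The class and method names are hypothetical.

```python
from bisect import bisect_right
from functools import lru_cache
from statistics import NormalDist


@lru_cache(maxsize=None)
def breakpoints(card):
    nd = NormalDist()
    return tuple(nd.inv_cdf(k / card) for k in range(1, card))


def paa(series, w):
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


class Node:
    def __init__(self, cards, depth, leaf):
        self.cards = list(cards)                 # per-dimension cardinalities
        self.depth = depth
        self.children = None if leaf else {}     # internal node: word -> child
        self.entries = [] if leaf else None      # leaf node: (PAA vector, series)

    def key(self, vec):
        """Word of a PAA vector at this node's cardinalities."""
        return tuple(bisect_right(breakpoints(c), v) for v, c in zip(vec, self.cards))


class ISaxIndex:
    def __init__(self, b, w, th):
        self.w, self.th = w, th
        self.root = Node([b] * w, depth=0, leaf=False)

    def insert(self, series):
        vec = paa(series, self.w)
        node = self.root
        while node.children is not None:         # descend to the matching leaf
            k = node.key(vec)
            if k not in node.children:
                node.children[k] = Node(node.cards, node.depth + 1, leaf=True)
            node = node.children[k]
        node.entries.append((vec, series))
        if len(node.entries) > self.th:
            self._split(node)

    def _split(self, node):
        dim = (node.depth - 1) % self.w          # round-robin choice (assumption)
        node.cards[dim] *= 2                     # double cardinality on one dimension
        entries, node.entries, node.children = node.entries, None, {}
        for vec, series in entries:
            k = node.key(vec)
            if k not in node.children:
                node.children[k] = Node(node.cards, node.depth + 1, leaf=True)
            node.children[k].entries.append((vec, series))

    def approximate_search(self, query):
        vec = paa(query, self.w)
        node = self.root
        while node.children is not None:
            node = node.children.get(node.key(vec))
            if node is None:
                return []                        # no-match heuristic omitted here
        return [series for _, series in node.entries]


index = ISaxIndex(b=2, w=2, th=2)
for s in ([-2, -2, 2, 2], [-1, -1, 1, 1], [-0.1, -0.1, 0.5, 0.5]):
    index.insert(s)
print(index.approximate_search([-1.5, -1.5, 1.5, 1.5]))
```

An exact search would start from the approximate answer as a best-so-far and then visit only the nodes whose lower bounding distance to the query is below it.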
Experimental Evaluation • We conduct experiments to identify characteristics of the iSAX representation: • Tightness of the lower bound • Indexing performance on massive datasets • Applicability to data-mining algorithms
Tightness of Lower Bounds • TLB = LowerBoundDist(T’,S’) / EuclideanDist(T,S) • For a given dataset • Time series length [480, 960, 1440, 1920] • Bytes available for representation [16, 24, 32, 40] • Results similar across thirty datasets • [Figure: TLB on the Koski ECG dataset for iSAX, DCT, APCA, DFT, PAA/DWT, CHEB and IPLA, plotted against time series length (480 to 1920) and bytes available (16 to 40)]
Indexing Performance on Massive Datasets • Indexed random walk datasets of [1, 2, 4, 8] million time series of length 256 • Parameters: b = 4, w = 8, th = 100 • Generated 39,255; 57,365; 92,209; 162,340 index files, respectively • Approximate Search (1000 queries): [Figure: percentage of queries (0 to 100) against size of random walk database (1m, 2m, 4m, 8m); series: outside top 1000, at least 1 from top 100, at least 1 from top 10, 1 from top 1 (true nearest neighbor)] • Exact Search (100 queries):
Data Mining • Definition: Time Series Set Difference (TSSD) (A,B). Given two collections of time series A and B, the time series set difference is the subsequence in A whose distance from its nearest neighbor in B is maximal • Electrocardiogram dataset from a 45-year-old male subject with suspected sleep-disordered breathing • 7.2 hours as reference set B (1,000,000 time series) • 8 minutes 39 seconds as “novel” set A (20,000 time series) where the patient woke up • [Figure: The Time Series Set Difference discovered between ECGs recorded during a waking cycle and the previous 7.2 hours (respiration pattern change in accordance with change in sleep stages); x-axis 0 to 250, y-axis -4 to 4]
Data Mining (cont.) • Solutions: • Sequential scan A across B • Exact search each entry in A using index on B • Leverage approximate and exact search • Order A by approximate search distance in a queue • Perform exact search using index on B in descending distance • Suspend if distance becomes lower than next entry in the queue • If search completes, return as TSSD
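The combined approximate and exact strategy in the last solution can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the "approximate search" is replaced by the nearest neighbor within a small prefix of B, and the "exact search using the index" is replaced by a resumable linear scan of B. What the sketch preserves is the control flow: candidates are ordered by their best-so-far distance, the most promising one is searched further, and the search is suspended as soon as its distance falls below the next entry in the queue. The function name `tssd` and the `sample` parameter are hypothetical.

```python
import heapq
from math import dist, inf


def tssd(A, B, sample=1):
    """Index into A (and distance) of the series farthest from its nearest neighbor in B."""
    heap = []
    for i, a in enumerate(A):
        # Stand-in for approximate search: an upper bound on the true NN distance.
        bsf = min(dist(a, b) for b in B[:sample])
        heapq.heappush(heap, (-bsf, i, sample))   # max-heap via negated distance
    while heap:
        neg_bsf, i, pos = heapq.heappop(heap)
        bsf = -neg_bsf
        next_best = -heap[0][0] if heap else -inf
        # Stand-in for exact search: resume scanning B where this entry stopped.
        while pos < len(B) and bsf >= next_best:
            bsf = min(bsf, dist(A[i], B[pos]))
            pos += 1
        if bsf < next_best:
            heapq.heappush(heap, (-bsf, i, pos))  # suspend: another entry looks better
        else:
            return i, bsf                         # search completed: this is the TSSD
    return None


A = [(0, 0), (5, 5), (1, 1)]
B = [(0, 0), (1, 0), (4, 4.5)]
print(tssd(A, B))
```

The result is exact because every queued value is an upper bound on that entry's true nearest-neighbor distance, so an entry whose completed search still tops the queue cannot be beaten.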
Conclusion • Introduced the iSAX representation and showed how it can be used for indexing time series • Demonstrated scalability and efficacy on massive datasets • Showed how approximate and exact search can be used in conjunction to produce exact results on data mining problems