
iSAX: Indexing and Mining Terabyte Sized Time Series

Jin Shieh, Eamonn Keogh. Computer Science & Eng. Dept., University of California, Riverside. Outline: Introduction, Motivating example, iSAX representation, Indexing time series, Experimental evaluation, Conclusion.


Presentation Transcript


  1. iSAX: Indexing and Mining Terabyte Sized Time Series • Jin Shieh, Eamonn Keogh • Computer Science & Eng. Dept., University of California, Riverside

  2. Outline • Introduction • Motivating example • iSAX representation • Indexing time series • Experimental evaluation • Conclusion

  3. Introduction [Figure: a time series T, its PAA(T,4) approximation, and the word iSAX(T,4,4) with symbols 00, 01, 10, 11] • Our work extends a popular symbolic representation of time series to allow for the indexing and retrieval of millions of time series • Symbolic Aggregate approXimation (SAX) • Represent a time series T of length n in w-dimensional space using PAA, giving $\bar{T} = \bar{t}_1, \ldots, \bar{t}_w$ • Where the ith element of $\bar{T}$ is: $\bar{t}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} T_j$ • Then discretize $\bar{T}$ into a vector of symbols • Breakpoints map to a small alphabet a of symbols
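The PAA and discretization steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the series is already z-normalized, that n is a multiple of w, that breakpoints cut the standard normal distribution into equiprobable regions, and that symbols are integers with 0 denoting the lowest region. The function names `paa`, `breakpoints`, and `sax` are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal-length segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is a multiple of w")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def breakpoints(a):
    """The a - 1 breakpoints that cut N(0, 1) into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]


def sax(series, w, a):
    """Map each PAA value to an integer symbol in 0..a-1 (assumed: 0 = lowest region)."""
    bps = breakpoints(a)
    return [bisect_right(bps, v) for v in paa(series, w)]


T = [-1.5, -1.1, -0.3, -0.1, 0.1, 0.3, 1.1, 1.5]
print(sax(T, 4, 4))  # [0, 1, 2, 3]
```

With cardinality 4 each symbol fits in two bits, which is where labels such as 00, 01, 10, 11 come from.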

  4. Introduction (cont.) • SAX is lower bounding • Given SAX representations $T^a$, $S^a$, a lower bound to the Euclidean distance is: $\mathrm{MINDIST}(T^a, S^a) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \mathrm{dist}(t_i, s_i)^2}$ • dist(t_i, s_i) is the smallest distance between the breakpoints that characterize each symbol, 0 if they overlap
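A hedged sketch of the lower-bounding distance follows. It assumes the same conventions as a typical SAX implementation (equiprobable normal breakpoints, integer symbols with 0 as the lowest region, both words at the same cardinality a); `symbol_dist` and `mindist` are illustrative names.

```python
from math import sqrt
from statistics import NormalDist


def breakpoints(a):
    """The a - 1 breakpoints that cut N(0, 1) into a equiprobable regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]


def symbol_dist(r, c, bps):
    """Gap between the regions of symbols r and c; 0 if they are equal or adjacent."""
    if abs(r - c) <= 1:
        return 0.0
    lo, hi = min(r, c), max(r, c)
    # Region lo ends at bps[lo]; region hi starts at bps[hi - 1].
    return bps[hi - 1] - bps[lo]


def mindist(word_t, word_s, a, n):
    """Lower bound on the Euclidean distance between the two original length-n series."""
    w = len(word_t)
    bps = breakpoints(a)
    total = sum(symbol_dist(r, c, bps) ** 2 for r, c in zip(word_t, word_s))
    return sqrt(n / w) * sqrt(total)


# Two length-8 series whose 2-symbol words at cardinality 4 are [0, 0] and [3, 3].
print(mindist([0, 0], [3, 3], a=4, n=8))
```

For example, the constant series of eight -1.0 values and eight 1.0 values map to these two words; their Euclidean distance is sqrt(32) ≈ 5.66, and the value printed is no larger than that.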

  5. Motivating Example • Why not just index using SAX? • For example: index 1,000,000 time series using SAX • Choose SAX parameters • cardinality = 8, word length = 4 • 8^4 = 4,096 possible SAX word labels • Place time series which map to the same label in the same file on disk • Compute label for query and retrieve matching file • Time series in file likely to be good approximate matches • Average label occupancy 1,000,000/4,096 ≈ 244 (reasonable)

  6. Motivating Example (cont.) • In practice, the distribution of time series to SAX word labels is not uniform! • Some labels are empty • Other labels hold a disproportionate percentage of the dataset • Ideal condition: We want to give a threshold th, and have the number of entries n mapped to each label satisfy 1 ≤ n ≤ th • Favor larger n • How can we achieve this? We need to make SAX more flexible

  7. iSAX Representation • SAX uses a single hard-coded cardinality • Unable to differentiate only on dimensions of interest • We will show that the indexing problem can be solved if we extend SAX to allow: • Different cardinalities within a single word • Comparison of words with different cardinalities • We call this extension indexable SAX (iSAX)

  8. iSAX Representation (cont.) • Multi-resolution property • Readily convert to any lower resolution that differs by a power of two • Lower bounding distance between iSAX words enforced through examination of both sets of breakpoints • iSAX offers a bit aware, quantized, multi-resolution representation with variable granularity
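The multi-resolution property can be illustrated with a short sketch. It assumes cardinalities are powers of two, breakpoints are equiprobable under N(0, 1), and symbols are integers with 0 as the lowest region; under those assumptions the breakpoints of a lower cardinality are a subset of those of a higher one, so converting down amounts to dropping low-order bits. The names `symbol` and `demote` are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def symbol(value, cardinality):
    """Symbol of a single PAA value at the given cardinality (0 = lowest region)."""
    bps = [NormalDist().inv_cdf(k / cardinality) for k in range(1, cardinality)]
    return bisect_right(bps, value)


def demote(sym, from_card, to_card):
    """Convert a symbol to a lower power-of-two cardinality by dropping low-order bits."""
    shift = from_card.bit_length() - to_card.bit_length()
    if shift < 0:
        raise ValueError("can only convert to a lower cardinality")
    return sym >> shift


s8 = symbol(0.8, 8)
for card in (8, 4, 2):
    bits = card.bit_length() - 1
    # The demoted symbol agrees with discretizing the value directly at that cardinality.
    print(card, format(demote(s8, 8, card), f"0{bits}b"), symbol(0.8, card))
```

Comparing two words of different cardinalities can then be done by demoting the higher-cardinality symbol, or by reasoning about both sets of breakpoints as the slide describes.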

  9. Indexing with iSAX • Split a set of time series represented by a common iSAX word into mutually exclusive subsets (using multi-resolution property): • Increase cardinality along d dimensions, for word length w, 1 ≤ d ≤ w • Fan-out rate bounded by 2^d • Iterative doubling • Given a base cardinality b, cardinality at i-th increase is b·2^i • Breakpoints of successive cardinalities align (overlap) • Allows for index structures which are hierarchical, with non-overlapping regions, and a controlled fan-out rate
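The splitting rule above can be sketched as follows. The representation is an assumption made for illustration: an iSAX word is a tuple of (symbol, cardinality) pairs, with integer symbols whose extra low-order bit distinguishes the two halves of a region when its cardinality doubles. `split` and `cardinality_after` are hypothetical helper names.

```python
from itertools import product


def split(word, dims):
    """Return the 2**len(dims) child words obtained by doubling cardinality on dims."""
    children = []
    for bits in product((0, 1), repeat=len(dims)):
        child = list(word)
        for d, bit in zip(dims, bits):
            sym, card = child[d]
            # Doubling the cardinality appends one bit to the symbol.
            child[d] = (2 * sym + bit, 2 * card)
        children.append(tuple(child))
    return children


def cardinality_after(b, i):
    """Cardinality after the i-th doubling of base cardinality b."""
    return b * 2 ** i


word = ((1, 2), (0, 2))          # two dimensions, both at cardinality 2
print(split(word, [0]))          # fan-out 2: split along dimension 0 only
print(len(split(word, [0, 1])))  # fan-out 4 = 2**2
print(cardinality_after(4, 3))   # 32
```

Because each child covers a disjoint half of its parent's region along every split dimension, the children are mutually exclusive and together cover the parent.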

  10. Indexing with iSAX (cont.) • Simple tree-based index (base cardinality b, word length w, threshold th) • Hierarchically subdivides SAX space until the number of entries in each subspace falls within th • Leaf nodes point to index files on disk • Internal nodes designate a split in SAX space • Approximate Search • Similar time series often represented by same iSAX word • Traverse index until leaf • Match iSAX representation at each level • Apply heuristics if no match • Exact Search • Leverage approximate search • Prune search space • Lower bounding distance
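A minimal in-memory sketch of such a tree is given below, under several stated assumptions that are not taken from the slides: entries are PAA vectors kept in Python lists (standing in for index files on disk), a full leaf is split in two along the dimension with the lowest current cardinality, splitting stops at a hypothetical `max_card`, and approximate search simply returns an empty list when no node matches (the slides mention heuristics for that case, which are omitted here). Exact search is not shown.

```python
from bisect import bisect_right
from statistics import NormalDist


def symbol(value, card):
    """Symbol of a PAA value at the given cardinality (0 = lowest region)."""
    bps = [NormalDist().inv_cdf(k / card) for k in range(1, card)]
    return bisect_right(bps, value)


class Node:
    def __init__(self, word):
        self.word = word        # tuple of (symbol, cardinality) per dimension
        self.entries = []       # PAA vectors; stands in for an index file on disk
        self.children = None    # None for a leaf, else dict: symbol -> Node
        self.split_dim = None   # dimension whose cardinality was doubled


class ISaxIndex:
    def __init__(self, b, w, th, max_card=256):
        self.b, self.w, self.th, self.max_card = b, w, th, max_card
        self.root = {}          # base-cardinality word -> Node

    def _base_word(self, vec):
        return tuple((symbol(v, self.b), self.b) for v in vec)

    def _descend(self, node, vec):
        """Follow the child whose word matches vec until a leaf is reached."""
        while node.children is not None:
            d = node.split_dim
            card = node.word[d][1]
            node = node.children[symbol(vec[d], 2 * card)]
        return node

    def insert(self, vec):
        word = self._base_word(vec)
        node = self.root.setdefault(word, Node(word))
        self._insert(node, vec)

    def _insert(self, node, vec):
        leaf = self._descend(node, vec)
        leaf.entries.append(vec)
        if len(leaf.entries) > self.th:
            self._split(leaf)

    def _split(self, leaf):
        d = min(range(self.w), key=lambda i: leaf.word[i][1])
        sym, card = leaf.word[d]
        if card >= self.max_card:
            return              # cannot refine further; leave the leaf overfull
        leaf.split_dim, leaf.children = d, {}
        for bit in (0, 1):
            child_word = list(leaf.word)
            child_word[d] = (2 * sym + bit, 2 * card)
            leaf.children[2 * sym + bit] = Node(tuple(child_word))
        entries, leaf.entries = leaf.entries, []
        for e in entries:
            self._insert(leaf, e)

    def approximate_search(self, vec):
        node = self.root.get(self._base_word(vec))
        return [] if node is None else self._descend(node, vec).entries


idx = ISaxIndex(b=2, w=2, th=2)
for v in [(-1.0, -1.0), (-0.9, -0.8), (-0.1, -0.2), (1.0, 1.0)]:
    idx.insert(v)
print(idx.approximate_search((-0.95, -0.9)))
```

In this toy run the third insertion overflows the leaf for the all-low word, which is then split along dimension 0; the query lands in the child holding the two most similar vectors.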

  11. Experimental Evaluation • We conduct experiments to identify characteristics of the iSAX representation: • Tightness of the lower bound • Indexing performance on massive datasets • Applicability to data-mining algorithms

  12. Tightness of Lower Bounds [Figure: TLB on the Koski ECG dataset for iSAX, DCT, APCA, DFT, PAA/DWT, CHEB, IPLA; axes: time series length, bytes available, TLB] • TLB = LowerBoundDist(T’,S’) / EuclideanDist(T,S) • For a given dataset • Time series length [480, 960, 1440, 1920] • Bytes available for representation [16, 24, 32, 40] • Results similar across thirty datasets

  13. Indexing Performance on Massive Datasets [Figure: percentage of queries vs. size of random walk database (1m, 2m, 4m, 8m); series: 1 from top 1 (true nearest neighbor), at least 1 from top 10, at least 1 from top 100, outside top 1000] • Indexed random walk datasets of [1, 2, 4, 8] million time series of length 256 • Parameters: b = 4, w = 8, th = 100 • Generated [39,255, 57,365, 92,209, 162,340] index files • Approximate Search (1000 queries) • Exact Search (100 queries)

  14. Data Mining [Figure: The Time Series Set Difference discovered between ECGs recorded during a waking cycle and the previous 7.2 hours (respiration pattern change in accordance with change in sleep stages)] • Definition: Time Series Set Difference (TSSD)(A,B). Given two collections of time series A and B, the time series set difference is the subsequence in A whose distance from its nearest neighbor in B is maximal • Electrocardiogram dataset from a 45-year-old male subject with suspected sleep-disordered breathing • 7.2 hours as reference set B (1,000,000 time series) • 8 minutes 39 seconds as “novel” set A (20,000 time series) where the patient woke up

  15. Data Mining (cont.) • Solutions: • Sequential scan A across B • Exact search each entry in A using index on B • Leverage approximate and exact search • Order A by approximate search distance in a queue • Perform exact search using index on B in descending distance • Suspend if distance becomes lower than next entry in the queue • If search completes, return as TSSD
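The queue-based strategy can be sketched as below. This is a simplified variant, not the slides' exact procedure: instead of suspending an exact search midway, it refines one candidate at a time, replacing its approximate nearest-neighbor distance (an upper bound on the true one) with the exact distance and re-queuing it. The `approx_nn_dist` and `exact_nn_dist` callables are brute-force stand-ins for searches on an index over B, and all names are hypothetical.

```python
import heapq
from math import dist


def tssd(A, B, approx_nn_dist, exact_nn_dist):
    """Return the element of A with the largest nearest-neighbor distance to B."""
    # Max-heap on distance (negated); approximate distances are upper bounds.
    heap = [(-approx_nn_dist(a, B), i, False) for i, a in enumerate(A)]
    heapq.heapify(heap)
    while heap:
        neg_d, i, is_exact = heapq.heappop(heap)
        if is_exact:
            # Its exact distance is at least every other candidate's upper bound.
            return A[i], -neg_d
        heapq.heappush(heap, (-exact_nn_dist(A[i], B), i, True))
    return None


def approx_nn_dist(a, B):
    """Stand-in: the distance to any one member of B upper-bounds the true NN distance."""
    return dist(a, B[0])


def exact_nn_dist(a, B):
    """Stand-in: brute-force nearest-neighbor distance."""
    return min(dist(a, b) for b in B)


A = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
B = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(tssd(A, B, approx_nn_dist, exact_nn_dist))
```

Candidates whose upper bound never reaches the top of the queue are never searched exactly, which is the source of the savings over running an exact search for every entry in A.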

  16. Conclusion • Introduced the iSAX representation and showed how it can be used for indexing time series • Demonstrated scalability and efficacy on massive datasets • Showed how approximate and exact search can be used in conjunction to produce exact results on data mining problems

  17. THANK YOU!
