Streaming Pattern Discovery in Multiple Time-Series

Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos
PARALLEL DATA LABORATORY, Carnegie Mellon University

Motivation
• Co-evolving time series (data streams) appear in many different applications, e.g.:
  • Disk access traffic in network clusters
  • Internet flow traffic in a network
  • Temperatures in a large building
  • Chlorine concentration in a water distribution network
• Values are typically correlated
• It would be very useful if we could summarize them on the fly
http://www.pdl.cmu.edu/

Example
[Figure: chlorine concentrations over time (Phase 1, Phase 2, Phase 3) in a water distribution network, starting from normal operation; curves are labeled "sensors near leak" and "sensors away from leak".]

Goals
• Discover “hidden” (latent) variables for:
  • Summarization of main trends for users
  • Efficient forecasting, spotting outliers/anomalies
• Incremental, real-time computation
• Limited memory requirements

Example: chlorine measurements
[Figure: chlorine concentrations over time (Phase 1, Phase 2, Phase 3) in a water distribution network, covering normal operation and a major leak; curves are labeled "sensors near leak" and "sensors away from leak".]

Example: hidden variable
[Figure: actual measurements of chlorine concentrations (n streams) next to the k hidden variable(s), with k = 1, during Phase 1.]
We would like to discover a few “hidden (latent) variables” that summarize the key trends

Example: hidden variable tracking
[Figure: actual measurements of chlorine concentrations (n streams) next to the k hidden variable(s), with k = 2, during Phase 1 and Phase 2.]
We would like to discover a few “hidden (latent) variables” that summarize the key trends

Example: hidden variable tracking
[Figure: actual measurements of chlorine concentrations (n streams) next to the k hidden variable(s), with k = 1, during Phase 1, Phase 2 and Phase 3.]
We would like to discover a few “hidden (latent) variables” that summarize the key trends

Method outline
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?

1. How to capture correlations?
• First sensor
[Figure: temperature T1 over time, between 20°C and 30°C.]

1. How to capture correlations?
• First sensor
• Second sensor
[Figure: temperature T2 over time, between 20°C and 30°C.]

1. How to capture correlations
• Correlations:
  • Let’s take a closer look at the first three value-pairs…
[Figure: temperature T2 plotted against temperature T1, both axes from 20°C to 30°C.]

1. How to capture correlations
• First three lie (almost) on a line in the space of value-pairs…
  • O(n) numbers for the slope, and
  • One number for each value-pair (offset on line)
• offset = “hidden variable”
[Figure: the value-pairs at time=1, time=2 and time=3, plotted as temperature T2 against temperature T1, both axes from 20°C to 30°C.]

1. How to capture correlations
• Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: temperature T2 plotted against temperature T1, both axes from 20°C to 30°C.]
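To make the "one number per value-pair" idea concrete, here is a minimal sketch. The readings, the noise, and the assumption that the line runs along the direction (1, 1) through the origin are all made up for illustration; they are not taken from the slides.

```python
import numpy as np

# Hypothetical readings from two nearby temperature sensors (degrees C).
t1 = np.array([20.0, 22.0, 24.0, 26.0, 28.0, 30.0])
t2 = t1 + np.array([0.1, -0.1, 0.05, -0.05, 0.1, -0.1])  # almost equal to t1

pairs = np.column_stack([t1, t2])                # one (T1, T2) row per time tick
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)  # assumed unit vector along the line

# Offset of each value-pair along the line: the "hidden variable".
hidden = pairs @ direction

# Rebuild approximate (T1, T2) pairs from the single number per tick.
reconstructed = np.outer(hidden, direction)
max_abs_error = np.max(np.abs(pairs - reconstructed))
print(hidden)
print(max_abs_error)
```

Each tick is stored as one offset instead of two readings, and the reconstruction error stays small because the pairs lie close to the assumed line.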

Method outline
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?

Experiments: chlorine concentration
[Figure: measurements from a sensor and their reconstruction from the hidden variables. Source: CMU Civil Engineering.]
• 166 streams
• 2 hidden variables (~4% error)

Experiments: chlorine concentration
• Both capture global, periodic pattern
• Second: ~ first, but “phase-shifted”
• Can express any “phase-shift”…
[Figure: the hidden variables. Source: CMU Civil Engineering.]

Conclusion
• Many settings with hundreds of streams, but
  • Stream values are, by nature, related
• We proposed a method that
  • discovers hidden variables that summarize the main trends for users
  • requires only incremental computation, without buffering any past data
• Future work:
  • Apply to more applications, e.g., performance monitoring for storage systems and network systems

Related work
• Stream SVD [Guha, Gunopulos, Koudas / KDD03]
• StatStream [Zhu, Shasha / VLDB02]
• Clustering
  • [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al. / TKDE]
  • [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
• Classification
  • [Wang, Fan, et al. / KDD03], [Hulten, Spencer, Domingos / KDD01]
• Piecewise approximations
  • [Palpanas, Vlachos, Keogh, et al. / ICDE 2004]

Experiments: Light measurements
[Figure: measurement and reconstruction.]
• 54 sensors
• 2-4 hidden variables (~6% error)

Experiments: Light measurements
• 1 & 2: main trend (as before)
• 3 & 4: potential anomalies and outliers
[Figure: the hidden variables, with "intermittent" labels.]

Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?

2. Incremental update
• For each new point
  • Project onto current line
  • Estimate error
[Figure: a new value and its error relative to the current line, plotted as temperature T2 against temperature T1, both axes from 20°C to 30°C.]

2. Incremental update
• For each new point
  • Project onto current line
  • Estimate error
  • Rotate line in the direction of the error and in proportion to its magnitude
  • O(n) time
[Figure: a new value and its error relative to the current line, plotted as temperature T2 against temperature T1, both axes from 20°C to 30°C.]

2. Incremental update
• For each new point
  • Project onto current line
  • Estimate error
  • Rotate line in the direction of the error and in proportion to its magnitude
[Figure: temperature T2 plotted against temperature T1, both axes from 20°C to 30°C.]

Stream correlations: Principal Component Analysis (PCA)
• The “line” is the first principal component (PC) vector
• This line is optimal: it minimizes the sum of squared projection errors
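The optimality claim can be checked numerically with a small batch sketch. The data is synthetic, the line is assumed to pass through the origin (no mean-centering), and `projection_error` is a helper name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic value-pairs from two correlated sensors (illustrative data only).
base = rng.uniform(20.0, 30.0, size=200)
X = np.column_stack([base, base + rng.normal(0.0, 0.2, size=200)])

def projection_error(X, w):
    """Sum of squared distances from the rows of X to the line spanned by w."""
    w = w / np.linalg.norm(w)
    residual = X - np.outer(X @ w, w)
    return float(np.sum(residual ** 2))

# First right singular vector = first principal direction (uncentered data).
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1 = vt[0]

best = projection_error(X, pc1)
# Compare against 100 random candidate directions.
others = [projection_error(X, rng.normal(size=2)) for _ in range(100)]
print(best, min(others))
```

No random direction should give a smaller squared projection error than the first principal direction.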

2. Incremental update: given number of hidden variables k
• Assuming k is known
• We know how to update the slope
  • (detailed equations in paper)
• For each new point $x$ and for $i = 1, \dots, k$:
  • $y_i := w_i^T x$ (proj. onto $w_i$)
  • $d_i \leftarrow d_i + y_i^2$ (energy $\propto$ $i$-th eigenval.)
  • $e_i := x - y_i w_i$ (error)
  • $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
  • $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: a new point x, its projection y1 onto w1, the error e1, and the updated w1.]
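The per-point update listed on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function name `update`, the initial values of `W` and `d`, and the synthetic two-sensor stream are all made up here, and the detailed equations are in the paper.

```python
import numpy as np

def update(W, d, x):
    """Apply the slide's update for one new point x.

    W: (k, n) array whose rows are the current estimates w_i (modified in place).
    d: (k,) array of accumulated energies d_i (modified in place).
    Returns the k hidden variables y_i for this point.
    """
    x = np.asarray(x, dtype=float).copy()
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        y[i] = W[i] @ x               # project onto w_i
        d[i] += y[i] ** 2             # accumulate energy
        e = x - y[i] * W[i]           # projection error
        W[i] += (y[i] / d[i]) * e     # rotate w_i toward the error
        x -= y[i] * W[i]              # continue with the remainder
    return y

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0]])            # assumed starting direction, k = 1
d = np.array([0.01])                  # assumed small positive initial energy
for _ in range(500):
    base = rng.uniform(20.0, 30.0)
    point = np.array([base, base + rng.normal(0.0, 0.2)])
    update(W, d, point)

# Cosine between the tracked direction and the true direction (1, 1).
cosine = abs(W[0] @ np.array([1.0, 1.0])) / (np.linalg.norm(W[0]) * np.sqrt(2.0))
print(cosine)
```

Each point costs O(nk) work and no past points are stored. In this synthetic stream the tracked direction ends up closely aligned with the direction (1, 1) along which the value-pairs lie.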

Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust k, the number of hidden variables?

3. Number of hidden variables
• If we had three sensors with similar measurements
• Again: points would lie on a line (i.e., one hidden variable, k = 1), but in 3-D space
[Figure: value-tuple space with axes T1, T2, T3.]

3. Number of hidden variables
• Assume one sensor intermittently gets stuck
• Now, no line can give a good approximation
[Figure: value-tuple space with axes T1, T2, T3.]

3. Number of hidden variables
• Assume one sensor intermittently gets stuck
• Now, no line can give a good approximation
• But a plane will do (two hidden variables, k = 2)
[Figure: value-tuple space with axes T1, T2, T3.]
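The stuck-sensor scenario can be reproduced with synthetic data. The sketch below assumes readings expressed as deviations from a setpoint, and a third sensor that reads 0 on every other tick; both choices are invented for illustration. It uses a batch SVD rather than the streaming update, purely to show how much energy one line versus one plane retains.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ticks = 300

# Three sensors with identical readings (deviations from a setpoint).
t = rng.normal(0.0, 5.0, size=n_ticks)
X = np.column_stack([t, t, t])

# Hypothetical fault: the third sensor is stuck at 0 on every other tick.
stuck = np.arange(n_ticks) % 2 == 0
X[stuck, 2] = 0.0

# Fraction of total energy retained by the best k-dimensional approximation.
s = np.linalg.svd(X, compute_uv=False)
energy_fraction = np.cumsum(s ** 2) / np.sum(s ** 2)
print(energy_fraction)
```

With this data a single line (k = 1) retains clearly less than 95% of the energy, while a plane (k = 2) retains essentially all of it.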

Number of hidden variables (PCs)
• Keep track of energy maintained by approximation with k variables (PCs):
  • Reconstruction accuracy, w.r.t. total squared error
• Increment (or decrement) k if fraction of energy maintained goes below (or above) a threshold
  • If below 95%, $k \leftarrow k + 1$
  • If above 98%, $k \leftarrow k - 1$
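The threshold rule can be written as a small function. The name `adjust_k`, its parameters, and the guards that keep k between 1 and an optional `k_max` are assumptions added for this sketch; the slide only states the 95% and 98% thresholds.

```python
def adjust_k(k, energy_kept, energy_total, low=0.95, high=0.98, k_max=None):
    """Return the new number of hidden variables given the retained energy.

    energy_kept: energy captured by the current k hidden variables.
    energy_total: total energy of the input streams.
    """
    fraction = energy_kept / energy_total
    if fraction < low and (k_max is None or k < k_max):
        return k + 1   # approximation too coarse: add a hidden variable
    if fraction > high and k > 1:
        return k - 1   # more accuracy than needed: drop a hidden variable
    return k

print(adjust_k(1, 0.90, 1.0))  # fraction below 95%
print(adjust_k(2, 0.99, 1.0))  # fraction above 98%
print(adjust_k(2, 0.96, 1.0))  # fraction between the thresholds
```

The gap between the two thresholds gives hysteresis, so k does not flip back and forth when the retained energy hovers near a single cutoff.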
