Discover hidden variables that summarize the main trends in multiple time-series data streams, support forecasting and outlier detection, and can be computed incrementally in real time.
Streaming Pattern Discovery in Multiple Time-Series
Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos
PARALLEL DATA LABORATORY, Carnegie Mellon University
Motivation
• Co-evolving time series (data streams) appear in many different applications, e.g.:
  • Disk access traffic in network clusters
  • Internet flow traffic in a network
  • Temperatures in a large building
  • Chlorine concentration in a water distribution network
• Values are typically correlated
• It would be very useful if we could summarize them on the fly
http://www.pdl.cmu.edu/
Example
[Figure: chlorine concentrations over time (Phase 1, Phase 2, Phase 3) in a water distribution network, for sensors near the leak and sensors away from the leak; label: normal operation]
Goals
• Discover “hidden” (latent) variables for:
  • Summarization of main trends for users
  • Efficient forecasting, spotting outliers/anomalies
• Incremental, real-time computation
• Limited memory requirements
Example: chlorine measurements
[Figure: chlorine concentrations over time (Phase 1, Phase 2, Phase 3) in a water distribution network, for sensors near the leak and sensors away from the leak; labels: normal operation, major leak]
Example: hidden variable
• We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: Phase 1; actual chlorine concentration measurements (n streams) next to the k hidden variable(s), k = 1]
Example: hidden variable tracking
[Figure: Phase 1 and Phase 2; actual chlorine concentration measurements (n streams) next to the k hidden variable(s), k = 2]
Example: hidden variable tracking
[Figure: Phase 1, Phase 2, and Phase 3; actual chlorine concentration measurements (n streams) next to the k hidden variable(s), k = 1]
Method outline
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?
1. How to capture correlations?
• First sensor
[Figure: temperature T1 over time, ranging between 20°C and 30°C]
1. How to capture correlations?
• First sensor
• Second sensor
[Figure: temperature T2 over time, ranging between 20°C and 30°C]
1. How to capture correlations
• Correlations:
  • Let’s take a closer look at the first three value-pairs…
[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C]
1. How to capture correlations
• The first three value-pairs (time = 1, 2, 3) lie (almost) on a line in the space of value-pairs…
• Offset on the line = “hidden variable”
• O(n) numbers for the slope, and one number for each value-pair (offset on line)
[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C, with the points for time = 1, 2, 3 marked on the line]
1. How to capture correlations
• Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C]
Method outline
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?
Experiments: chlorine concentration
[Figure: measurements from a sensor and their reconstruction from the hidden variables]
• 166 streams
• 2 hidden variables (~4% error)
[CMU Civil Engineering]
Experiments: chlorine concentration
• Both hidden variables capture the global, periodic pattern
• Second: ~ first, but “phase-shifted”
• Can express any “phase-shift”…
[Figure: the hidden variables over time]
[CMU Civil Engineering]
Conclusion
• Many settings have hundreds of streams, but stream values are, by nature, related
• We proposed a method that:
  • discovers hidden variables as a summarization of the main trends for users
  • requires only incremental computation, without buffering any past data
• Future work:
  • Apply to more applications, e.g., performance monitoring for storage systems and network systems
Related work
• Stream SVD [Guha, Gunopulos, Koudas / KDD03]
• StatStream [Zhu, Shasha / VLDB02]
• Clustering
  • [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al. / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
• Classification
  • [Wang, Fan, et al. / KDD03], [Hulten, Spencer, Domingos / KDD01]
• Piecewise approximations
  • [Palpanas, Vlachos, Keogh, et al. / ICDE 2004]
Experiments: Light measurements
[Figure: a light measurement and its reconstruction]
• 54 sensors
• 2-4 hidden variables (~6% error)
Experiments: Light measurements
• Hidden variables 1 & 2: main trend (as before)
• Hidden variables 3 & 4: potential anomalies and outliers (intermittent)
[Figure: the hidden variables over time]
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?
2. Incremental update
• For each new point:
  • Project onto current line
  • Estimate error
[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C, showing the new value and its error relative to the current line]
2. Incremental update
• For each new point:
  • Project onto current line
  • Estimate error
  • Rotate line in the direction of the error and in proportion to its magnitude
• O(n) time
[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C, showing the new value and its error relative to the current line]
2. Incremental update
• For each new point:
  • Project onto current line
  • Estimate error
  • Rotate line in the direction of the error and in proportion to its magnitude
[Figure: scatter plot of temperature T2 against temperature T1, both axes from 20°C to 30°C]
Stream correlations: Principal Component Analysis (PCA)
• The “line” is the first principal component (PC) vector
• This line is optimal: it minimizes the sum of squared projection errors
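The “line” idea can be tried offline on a handful of value-pairs. The sketch below is only an illustration, not the streaming method of the talk: it computes the first principal direction of 2-D points in closed form from the centered scatter matrix, then uses the offset along that direction as the hidden variable. The temperature values and the function names (`first_pc_2d`, `hidden_variable`, `reconstruct`) are made up for this example, and the data are centered on their mean, which is an assumption of this sketch.

```python
import math

def first_pc_2d(points):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Angle of the major axis of the 2x2 scatter matrix (closed form).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def hidden_variable(point, mean, direction):
    """Offset of the point along the line (one number per value-pair)."""
    return ((point[0] - mean[0]) * direction[0]
            + (point[1] - mean[1]) * direction[1])

def reconstruct(y, mean, direction):
    """Map the hidden variable back to an approximate value-pair."""
    return (mean[0] + y * direction[0], mean[1] + y * direction[1])

# Hypothetical (T1, T2) readings in degrees Celsius, roughly on the line T2 = T1.
pairs = [(20.0, 20.5), (22.0, 21.8), (24.0, 24.3),
         (26.0, 25.9), (28.0, 28.2), (30.0, 29.7)]

mean, direction = first_pc_2d(pairs)
errors = []
for p in pairs:
    y = hidden_variable(p, mean, direction)
    approx = reconstruct(y, mean, direction)
    errors.append(math.hypot(p[0] - approx[0], p[1] - approx[1]))

print("direction:", direction)
print("largest reconstruction error:", max(errors))
```

Each value-pair is stored as a single offset, and the direction is shared by all pairs, which is the saving the slides describe.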
2. Incremental update: given number of hidden variables k
• Assuming k is known
• We know how to update the slope (detailed equations in paper)
• For each new point $x$ and for $i = 1, \ldots, k$:
  • $y_i := w_i^T x$ (projection onto $w_i$)
  • $d_i \leftarrow d_i + y_i^2$ (energy, proportional to the i-th eigenvalue)
  • $e_i := x - y_i w_i$ (error)
  • $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
  • $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: new point x, its projection y1 onto w1, the error e1, and the updated w1]
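The five update steps on this slide translate almost line by line into code. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `incremental_update`, the starting direction `[1.0, 0.0]`, the small positive starting energy `1e-3`, and the synthetic two-sensor stream are all choices made for this example. The remainder step uses the freshly updated w_i, which is one reading of the slide.

```python
import math

def incremental_update(W, d, x):
    """Process one new point x (length n).

    W: list of k direction estimates (each a list of n floats), updated in place.
    d: list of k energy estimates, updated in place.
    Returns the k hidden-variable values for this point.
    """
    x = list(x)
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * xj for wj, xj in zip(w, x))             # projection onto w_i
        d[i] += y * y                                        # energy
        e = [xj - y * wj for xj, wj in zip(x, w)]            # error
        W[i] = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]  # update estimate
        x = [xj - y * wj for xj, wj in zip(x, W[i])]         # remainder for next i
        ys.append(y)
    return ys

# Hypothetical stream: two sensors reporting nearly the same temperature.
W = [[1.0, 0.0]]   # assumed starting direction
d = [1e-3]         # assumed small positive starting energy
for t in range(200):
    base = 20.0 + (t % 10)
    x = (base, base + 0.1 * (-1) ** t)
    ys = incremental_update(W, d, x)

norm = math.sqrt(sum(c * c for c in W[0]))
direction = [c / norm for c in W[0]]
print("estimated direction:", direction)
```

Each point costs O(nk) arithmetic and nothing from the past is buffered: only the k direction estimates and k energies are kept between points.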
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust k, the number of hidden variables?
3. Number of hidden variables
• If we had three sensors with similar measurements
• Again: points would lie on a line (i.e., one hidden variable, k = 1), but in 3-D space
[Figure: value-tuple space with axes T1, T2, T3]
3. Number of hidden variables
• Assume one sensor intermittently gets stuck
• Now, no line can give a good approximation
[Figure: value-tuple space with axes T1, T2, T3]
3. Number of hidden variables
• Assume one sensor intermittently gets stuck
• Now, no line can give a good approximation
• But a plane will do (two hidden variables, k = 2)
[Figure: value-tuple space with axes T1, T2, T3]
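The line-versus-plane argument can be checked numerically. The sketch below is an illustration under assumptions: the readings are invented, the third sensor is assumed to get stuck at 24, and the line and plane are taken through the origin. It measures how far a value-tuple lies from a line (k = 1) and from a plane (k = 2).

```python
import math

def residual_norm(x, basis):
    """Distance from x to the subspace spanned by an orthonormal basis."""
    r = list(x)
    for b in basis:
        y = sum(bi * xi for bi, xi in zip(b, x))    # hidden variable along b
        r = [ri - y * bi for ri, bi in zip(r, b)]   # remove that component
    return math.sqrt(sum(ri * ri for ri in r))

s3 = 1.0 / math.sqrt(3.0)
s2 = 1.0 / math.sqrt(2.0)
line = [(s3, s3, s3)]                        # k = 1: all three sensors agree
plane = [(s2, s2, 0.0), (0.0, 0.0, 1.0)]     # k = 2: third sensor may deviate

normal = (24.0, 24.0, 24.0)   # hypothetical reading, all sensors agree
stuck = (28.0, 28.0, 24.0)    # hypothetical reading, third sensor stuck at 24

print("normal vs line: ", residual_norm(normal, line))
print("stuck  vs line: ", residual_norm(stuck, line))
print("stuck  vs plane:", residual_norm(stuck, plane))
```

The stuck reading is several degrees away from the line but lies on the plane, so one extra hidden variable is enough to describe it.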
Number of hidden variables (PCs)
• Keep track of energy maintained by approximation with k variables (PCs):
  • Reconstruction accuracy, w.r.t. total squared error
• Increment (or decrement) k if fraction of energy maintained goes below (or above) a threshold
  • If below 95%, $k \leftarrow k + 1$
  • If above 98%, $k \leftarrow k - 1$
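The threshold rule can be written as a small function. This is a sketch only: the name `adjust_k`, its parameters, and the bounds 1 ≤ k ≤ n are assumptions of this example; the 95% and 98% thresholds are the ones on the slide. How the two energies are accumulated over the stream is left to the caller.

```python
def adjust_k(k, energy_total, energy_kept, n, low=0.95, high=0.98):
    """Return the new number of hidden variables.

    energy_total: energy of the original measurements.
    energy_kept:  energy retained by the k hidden variables.
    n:            number of streams (assumed upper bound on k).
    """
    if energy_total <= 0.0:
        return k                  # nothing observed yet, keep k unchanged
    fraction = energy_kept / energy_total
    if fraction < low and k < n:
        return k + 1              # approximation too coarse, add a variable
    if fraction > high and k > 1:
        return k - 1              # more variables than needed, drop one
    return k

print(adjust_k(1, 100.0, 90.0, n=3))   # 90% kept: below 95%
print(adjust_k(2, 100.0, 99.0, n=3))   # 99% kept: above 98%
print(adjust_k(2, 100.0, 96.5, n=3))   # between the thresholds
```

Keeping a gap between the two thresholds avoids flipping k back and forth when the retained fraction hovers near a single cut-off.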