Data Mining on Streams. Christos Faloutsos CMU. THANK YOU!. Prof. Panos Ipeirotis Julia Mills. Outline. Problem and motivation Single-sequence mining: AWSOM Co-evolving sequences: SPIRIT Lag correlations: BRAID Conclusions. Problem definition - example.
Data Mining on Streams Christos Faloutsos CMU C. Faloutsos
THANK YOU! • Prof. Panos Ipeirotis • Julia Mills
Outline • Problem and motivation • Single-sequence mining: AWSOM • Co-evolving sequences: SPIRIT • Lag correlations: BRAID • Conclusions
Problem definition - example • Each sensor collects data (x1, x2, …, xt, …)
Problem definition • Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …) • Find: patterns; correlations; outliers • incrementally!
Limitations / Challenges • Find patterns using a method that is: • nimble: limited resources (memory; bandwidth, power, CPU) • incremental: on-line, ‘any-time’ response; single pass (‘you get to see it only once’) • automatic: no human intervention, e.g., in remote environments
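A minimal sketch of what "single pass, limited resources" can look like in code. It uses Welford's running mean and variance, a standard technique that is not taken from the slides; the class name `RunningStats` and the sample readings are illustrative assumptions.

```python
class RunningStats:
    """Single-pass mean and variance in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one value; the value itself is never stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of the values seen so far (0.0 if fewer than two)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical sensor readings, each seen exactly once.
stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
```

Each reading is touched once and then discarded, so an answer is available at any time and memory does not grow with the length of the stream.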
Application domains • Sensor devices • Temperature, weather measurements • Road traffic data • Geological observations • Patient physiological data • Embedded devices • Network routers • Intelligent (active) disks
Motivation - Applications (cont’d) • ‘Smart house’ • sensors monitor temperature, humidity, air quality • video surveillance
Motivation - Applications (cont’d) • civil/automobile infrastructure • bridge vibrations [Oppenheim+02] • road conditions / traffic monitoring
Motivation - Applications (cont’d) • Weather, environment/anti-pollution • volcano monitoring • air/water pollutant monitoring
Motivation - Applications (cont’d) • Computer systems • ‘Active Disks’ (buffering, prefetching) • web servers (ditto) • network traffic monitoring • ...
InteMon • w/ Evan Hoke, Jimeng Sun • self-* PetaByte data center at CMU
Outline • Problem and motivation • Single-sequence mining: AWSOM • Co-evolving sequences: SPIRIT • Lag correlations: BRAID • Conclusions
Single sequence mining - AWSOM • with Spiros Papadimitriou (CMU -> IBM) and Anthony Brockwell (CMU/Stat)
Problem definition • Semi-infinite streams of values (time series) x1, x2, …, xt, … • Find patterns, forecasts, outliers… [Figure: example time series annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “‘Noise’??”]
Requirements / Goals • Adapt and handle arbitrary periodic components, and be: • nimble (limited resources, single pass) • on-line, any-time • automatic (no human intervention/tuning)
Overview • Introduction / Related work • Background • Main idea • Experimental results
Wavelets: Example – Haar transform [Figure: signal xt and its Haar coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and the “constant” V4,1, arranged by time (horizontal axis) and frequency (vertical axis)]
Wavelets: Why we like them • Wavelets compress many real signals well: image compression and processing; vision; astronomy, seismology, … • Wavelet coefficients can be updated as new points arrive
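A rough sketch of how Haar coefficients can be updated as new points arrive. It assumes the orthonormal Haar convention (sums and differences divided by √2, left value minus right value); the slides do not state which normalization or sign they use. The class name `StreamingHaar` and the input values are illustrative, not from the talk.

```python
import math


class StreamingHaar:
    """Incremental Haar transform keeping one pending value per level."""

    def __init__(self):
        # pending[l] holds an unpaired smooth value waiting at level l, or None.
        self.pending = []
        # (level, detail coefficient) pairs in the order they are produced.
        # Kept in a list only for demonstration; a streaming consumer would
        # use each coefficient and drop it.
        self.details = []

    def update(self, x):
        """Consume one point; emit every detail coefficient it completes."""
        level, value = 0, float(x)
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet: park the value and wait for the next one.
                self.pending[level] = value
                return
            left = self.pending[level]
            self.pending[level] = None
            # Detail (difference) coefficient at level + 1 ...
            self.details.append((level + 1, (left - value) / math.sqrt(2)))
            # ... and the smooth (average) value is passed one level up.
            value = (left + value) / math.sqrt(2)
            level += 1


haar = StreamingHaar()
for x in [4, 2, 6, 8]:
    haar.update(x)
```

The `pending` list has one slot per level, so the working state grows only logarithmically with the number of points seen.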
Overview • Introduction / Related work • Background • Main idea • Experimental results
AWSOM [Figure: signal xt and its wavelet coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1, arranged by time (horizontal axis) and frequency (vertical axis)]
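AWSOM - idea • Wl,t ≈ βl,1 Wl,t-1 + βl,2 Wl,t-2 + … • Wl’,t’ ≈ βl’,1 Wl’,t’-1 + βl’,2 Wl’,t’-2 + … [Figure: coefficient Wl,t shown with the preceding coefficients Wl,t-1, Wl,t-2 at the same level; likewise Wl’,t’ with Wl’,t’-1, Wl’,t’-2]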
More details… • Update of wavelet coefficients (incremental) • Update of linear models (incremental; RLS) • Feature selection (single-pass) • Not all correlations are significant • Throw away the insignificant ones (“noise”)
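The "incremental; RLS" step can be illustrated with a textbook recursive least squares update. This is a generic sketch, not the authors' implementation: the class name `RLS`, the initial gain `delta`, the forgetting factor, and the toy series (a sampled sine standing in for the coefficients at one wavelet level) are all assumptions made for illustration.

```python
import math


class RLS:
    """Recursive least squares for y ~ w . x with k regression coefficients."""

    def __init__(self, k, delta=1e4, forget=1.0):
        self.k = k
        self.forget = forget           # 1.0 means no forgetting
        self.w = [0.0] * k             # current coefficient estimates
        # Gain matrix, started as a large multiple of the identity.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Fold in one (x, y) pair in O(k^2) time; returns the prior error."""
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = self.forget + sum(xi * pi for xi, pi in zip(x, Px))
        gain = [p / denom for p in Px]
        err = y - self.predict(x)
        self.w = [wi + gi * err for wi, gi in zip(self.w, gain)]
        # P is symmetric, so x^T P equals (P x)^T.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.forget
                   for j in range(k)] for i in range(k)]
        return err


# Toy series: s_t = sin(0.3 t) satisfies s_t = 2 cos(0.3) s_{t-1} - s_{t-2}.
series = [math.sin(0.3 * t) for t in range(200)]
model = RLS(k=2)
for t in range(2, len(series)):
    model.update([series[t - 1], series[t - 2]], series[t])
```

After the loop, `model.w` should be close to `[2 * math.cos(0.3), -1.0]`, the exact recurrence of the toy series. Each update costs O(k²) and never revisits earlier points.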
Complexity • Model update • Space: O(lg N + mk²), i.e. O(lg N) • Time: O(k²), i.e. O(1) • Where • N: number of points (so far) • k: number of regression coefficients; fixed • m: number of linear models; O(lg N)
Overview • Introduction / Related work • Background • Main idea • Experimental results
Results - Synthetic data • Triangle pulse • Mix (sine + square) • AR captures wrong trend (or none) • Seasonal AR estimation fails [Figure panels: AWSOM, AR, Seasonal AR]
Results - Real data • Automobile traffic • Daily periodicity • Bursty “noise” at smaller scales • AR fails to capture any trend • Seasonal AR estimation fails
Results - Real data • Sunspot intensity • Slightly time-varying “period” • AR captures wrong trend • Seasonal ARIMA: wrong downward trend, despite help by human!
Conclusions • Adapt and handle arbitrary periodic components, and • nimble: limited memory (logarithmic), constant-time update • on-line, any-time: single pass over the data • automatic: no human intervention/tuning
Outline • Problem and motivation • Single-sequence mining: AWSOM • Co-evolving sequences: SPIRIT • Lag correlations: BRAID • Conclusions
Part 2 • SPIRIT: Mining co-evolving streams [Papadimitriou, Sun, Faloutsos, VLDB05]
Motivation • E.g., chlorine concentration in water distribution network
Motivation • May have hundreds of measurements, but it is unlikely they are completely unrelated! [Figure: chlorine concentrations in a water distribution network over Phase 1, Phase 2, Phase 3; annotation: normal operation]
Motivation [Figure: chlorine concentrations in a water distribution network over Phase 1, Phase 2, Phase 3, for sensors near the leak and sensors away from the leak; annotations: normal operation, major leak]
Motivation • We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: actual measurements (n streams) of chlorine concentrations next to k hidden variable(s); Phase 1, k = 1]
Motivation • We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: actual measurements (n streams) of chlorine concentrations next to k hidden variable(s); Phase 1 and Phase 2, k = 2]
Motivation • We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: actual measurements (n streams) of chlorine concentrations next to k hidden variable(s); Phase 1, Phase 2 and Phase 3, k = 1]
Goals • Discover “hidden” (latent) variables for: • Summarization of main trends for users • Efficient forecasting, spotting outliers/anomalies • and the usual: • nimble: limited memory requirements • on-line, any-time: (single pass etc.) • automatic: no special parameters to tune
Related work: Stream mining • Stream SVD [Guha, Gunopulos, Koudas / KDD03] • StatStream [Zhu, Shasha / VLDB02] • Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04] • Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
Related work: Stream mining • Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004] • Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03] • …
Overview: Part 2 • Method • Experiments • Conclusions & Other work
Stream correlations • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?
1. How to capture correlations? [Figure: first sensor, Temperature t1 (20°C to 30°C) vs. time]
1. How to capture correlations? [Figure: first sensor and second sensor, Temperature t2 (20°C to 30°C) vs. time]
1. How to capture correlations • Correlations: Let’s take a closer look at the first three value-pairs… [Figure: scatter plot of Temperature t2 vs. Temperature t1, both 20°C to 30°C]
1. How to capture correlations • First three lie (almost) on a line in the space of value-pairs… • O(n) numbers for the slope, and • One number for each value-pair (offset on line) • offset = “hidden variable” [Figure: scatter plot of Temperature t2 vs. Temperature t1, both 20°C to 30°C, with the points for time=1, time=2, time=3 marked]
1. How to capture correlations • Other pairs also follow the same pattern: they lie (approximately) on this line [Figure: scatter plot of Temperature t2 vs. Temperature t1, both 20°C to 30°C]
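The "line plus one offset per value-pair" picture can be sketched for two sensors as follows. This is a batch fit of the principal direction using the closed-form 2-D solution, not the incremental method the talk goes on to describe. Centering on the mean, the function names, and the temperature values are illustrative assumptions.

```python
import math


def principal_direction(pairs):
    """Return (center, unit direction) of the best-fit line through 2-D points."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    # Angle of the leading eigenvector of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def hidden_variable(pair, center, direction):
    """Offset of one value-pair along the line: the single 'hidden' number."""
    return ((pair[0] - center[0]) * direction[0]
            + (pair[1] - center[1]) * direction[1])


# Hypothetical (t1, t2) temperature readings from two sensors, in °C.
pairs = [(20.0, 20.0), (25.0, 25.0), (30.0, 30.0)]
center, direction = principal_direction(pairs)
offsets = [hidden_variable(p, center, direction) for p in pairs]
```

Once the direction is known, each new value-pair is summarized by its single offset, and `center + offset * direction` gives back an approximation of both readings.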
Stream correlations • Step 1: How to capture correlations? • Step 2: How to do it incrementally, when we have a very large number of points? • Step 3: How to dynamically adjust the number of hidden variables?