Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos C. Faloutsos
Joint work with • Anthony Brockwell (CMU/Stat) • Deepayan Chakrabarti (CMU) • Spiros Papadimitriou (CMU) • Chenxi Wang (CMU) • Yang Wang (CMU)
Outline • Introduction - motivation • Problem #1: Stream Mining • Motivation • Main idea • Experimental results • Problem #2: Graphs & Virus propagation • Conclusions
Introduction • Sensor devices • Temperature, weather measurements • Road traffic data • Geological observations • Patient physiological data • Embedded devices • Network routers • Intelligent (active) disks
Introduction • Limited resources • Memory • Bandwidth • Power • CPU • Remote environments • No human intervention
Introduction – problem definition • Given a semi-infinite stream of values (time series) x1, x2, …, xt, … • Find patterns, forecasts, outliers…
Introduction • E.g., [Figure annotations: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]
Introduction • Can we capture these patterns • automatically • with limited resources? [Figure annotations: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]
Related work – Statistics: Time series forecasting • Main problem: “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91] • Typically: • Resource intensive • Cannot update online • AR(I)MA and seasonal variants • ARFIMA, GARCH, …
Related work – Databases: Continuous Queries • Typically, different focus: • “Compression” • Not generative models • Largely orthogonal problem… • Gilbert, Guha, Indyk et al. (STOC 2002) • Garofalakis, Gibbons (SIGMOD 2002) • Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003) • Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002) • Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002) • Madden+ [SIGMOD02], [SIGMOD03]
Goals • Adapt and handle arbitrary periodic components • No human intervention/tuning • Also: • Single pass over the data • Limited memory (logarithmic) • Constant-time update
Outline • Introduction - motivation • Problem #1: Stream Mining • Motivation • Main idea • Experimental results • Problem #2: Graphs & Virus propagation • Conclusions
Wavelets: “Straight” signal [Figure: xt vs. time t; labels I1 … I8]
Wavelets: Introduction – Haar [Figure: xt and Haar coefficients W1,1–W1,4, W2,1, W2,2, W3,1, V4,1; axes: time, frequency]
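To make the Haar picture concrete, here is a minimal Python sketch of an orthonormal Haar decomposition. The function name `haar_decompose`, the power-of-two length requirement, and the sign convention (left value minus right value) are assumptions for illustration; the slides do not specify them.

```python
import math

def haar_decompose(x):
    """Orthonormal Haar transform (illustrative sketch).

    Returns (details, smooth): details[0] holds the finest-level
    coefficients (level 1), details[-1] the coarsest; smooth is the
    single remaining scaling coefficient.
    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    smooth = list(x)
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        # Detail = scaled difference of each pair; smooth = scaled sum.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, smooth
```

For an 8-point input this yields 4, 2, and 1 detail coefficients at levels 1, 2, and 3, plus one smooth coefficient. Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared input values.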
Wavelets • So? • Wavelets compress many real signals well… • Image compression and processing • Vision; astronomy, seismology, … • Wavelet coefficients can be updated as new points arrive [Kotidis+]
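The claim that coefficients can be updated as points arrive can be sketched as follows. This is not the method of [Kotidis+]; it is a plain streaming Haar update, and the class name `StreamingHaar` and its `push` method are hypothetical. It keeps at most one unpaired value per level, so memory grows logarithmically with the number of points seen.

```python
import math

class StreamingHaar:
    """Incremental Haar detail coefficients (illustrative sketch)."""

    def __init__(self):
        # pending[l] holds the unpaired smooth value at level l, or None.
        self.pending = []

    def push(self, value):
        """Feed one sample; return the (level, detail) pairs it completes."""
        completed = []
        level, carry = 0, value
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet at this level: park the value and stop.
                self.pending[level] = carry
                return completed
            left = self.pending[level]
            self.pending[level] = None
            completed.append((level + 1, (left - carry) / math.sqrt(2)))
            # The pair's smooth value moves up one level.
            carry = (left + carry) / math.sqrt(2)
            level += 1
```

Each sample touches only the levels it completes, so the amortized work per point is constant, and the `pending` list has about lg N entries after N points.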
Wavelets: Correlations [Figure: xt and coefficients W1,1–W1,4, W2,1, W2,2, W3,1, V4,1, with an “=” annotation between coefficients; axes: time, frequency]
Main idea: Correlations • Wavelets are good… • …we can do even better • One number… • …and the fact that they are equal/correlated
Proposed method • W_{l,t} ≈ β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + … • W_{l’,t’} ≈ β_{l’,1} W_{l’,t’-1} + β_{l’,2} W_{l’,t’-2} + … • Small windows suffice… (k~4) [Figure labels: W_{l,t-2}, W_{l,t-1}, W_{l,t}; W_{l’,t’-2}, W_{l’,t’-1}, W_{l’,t’}]
More details… • Update of wavelet coefficients (incremental) • Update of linear models (incremental; RLS) • Feature selection (single-pass) • Not all correlations are significant • Throw away the insignificant ones • very important!! [see paper]
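The incremental update of a linear model via recursive least squares (RLS) can be sketched as below. This is a textbook RLS update, not the paper's code: the class name `RLS`, the initial scale `delta`, and the forgetting factor `lam` are assumptions. In the setting of these slides, the input vector would hold the k previous wavelet coefficients and the target the current one.

```python
import random

class RLS:
    """Textbook recursive least squares for y ≈ w · x (illustrative sketch)."""

    def __init__(self, k, delta=1e3, lam=1.0):
        self.k = k
        self.lam = lam            # forgetting factor; 1.0 means no forgetting
        self.w = [0.0] * k        # current regression coefficients
        # P approximates the inverse of the regressor covariance matrix.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Absorb one (x, y) pair in O(k^2) time; return the prediction error."""
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)
        self.w = [wi + g * err for wi, g in zip(self.w, gain)]
        # P is symmetric, so x^T P equals Px transposed.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(k)]
                  for i in range(k)]
        return err

def demo(n=200, seed=0):
    """Recover hypothetical coefficients (0.5, -0.3) from noiseless data."""
    rng = random.Random(seed)
    model = RLS(2)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        model.update(x, 0.5 * x[0] - 0.3 * x[1])
    return model.w
```

Each update costs O(k²) time and the state is a k × k matrix plus k weights, so with a small fixed window (k around 4) the per-point cost is effectively constant.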
SKIP Complexity • Model update • Space: O(lg N + m k²) ≈ O(lg N) • Time: O(k²) ≈ O(1) • Where • N: number of points (so far) • k: number of regression coefficients; fixed • m: number of linear models; O(lg N) [see paper]
Outline • Introduction - motivation • Problem #1: Stream Mining • Motivation • Main idea • Experimental results • Problem #2: Graphs & Virus propagation • Conclusions
Setup • First half used for model estimation • Models applied forward to forecast entire second half • AR, Seasonal AR (SAR): R • Simplest possible estimation – no maximum likelihood estimation (MLE), etc. • … vs. Python scripts
Results: Synthetic data – Triangle pulse • Triangle pulse • AR captures wrong trend (or none) • Seasonal AR (SAR) estimation fails
Results: Synthetic data – Mix • Mix (sine + square pulse) • AR captures wrong trend (or none) • Seasonal AR estimation fails
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales (filtered)
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • AR fails to capture any trend (average) • Seasonal AR estimation fails
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • AWSOM spots periodicities, automatically
Results: Real data – Automobile • Automobile traffic • Daily periodicity with rush-hour peaks • Bursty “noise” at smaller time scales • Generation with identified noise
Results: Real data – Sunspot • Sunspot intensity – Slightly time-varying “period” • AR captures wrong trend (average) • Seasonal ARIMA • Captures immediate wrong downward trend • Requires human to determine seasonal component period (fixed)
Results: Real data – Sunspot • Sunspot intensity – Slightly time-varying “period” • Estimation: 40 minutes (R) vs. 9 seconds (Python)
SKIP Variance • Variance (log-power) vs. scale: • “Noise” diagnostic (if decreasing linear…) • Can use to estimate noise parameters (~Hurst exponent) [Slide annotation: ~ 1 hour]
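One way to compute such a variance-versus-scale diagnostic is sketched below, assuming an orthonormal Haar transform and a power-of-two input length. The function names `level_power` and `log_power_slope` are hypothetical. The mapping from the fitted slope to a Hurst exponent depends on the noise model and is deliberately not shown.

```python
import math

def level_power(x):
    """Mean squared Haar detail coefficient per level (level 1 = finest)."""
    n = len(x)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 2")
    smooth, powers = list(x), []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        powers.append(sum(d * d for d in detail) / len(detail))
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return powers

def log_power_slope(powers):
    """Least-squares slope of log2(power) vs. level, skipping zero-power levels."""
    pts = [(lvl, math.log2(p)) for lvl, p in enumerate(powers, start=1) if p > 0]
    if len(pts) < 2:
        raise ValueError("need at least two levels with nonzero power")
    mean_l = sum(l for l, _ in pts) / len(pts)
    mean_v = sum(v for _, v in pts) / len(pts)
    sxx = sum((l - mean_l) ** 2 for l, _ in pts)
    sxy = sum((l - mean_l) * (v - mean_v) for l, v in pts)
    return sxy / sxx
```

A roughly linear log-power curve is the pattern the slide refers to; the slope from `log_power_slope` is the quantity one would then relate to a noise parameter.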
Running time [Figure: time (t) vs. stream size (N)]
Space requirements • Equal total number of model parameters
Conclusion • Adapt and handle arbitrary periodic components • No human intervention/tuning • Single pass over the data • Limited memory (logarithmic) • Constant-time update [Callouts: no human; limited resources]
Outline • Introduction - motivation • Problem #1: Streams • Problem #2: Graphs & Virus propagation • Motivation & problem definition • Related work • Main idea • Experiments • Conclusions
Introduction • Graphs are ubiquitous [Figures: Protein Interactions [genomebiology.com]; Internet Map [lumeta.com]; Food Web [Martinez ’91]; Friendship Network [Moody ’01]]
Introduction • What can we do with graph analysis? • Immunization • Information dissemination • Network value of a customer [Domingos+] [Figure: “Needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” marked]
Problem definition • Q1: How does a virus spread across an arbitrary network? • Q2: Will it create an epidemic? • (in a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)
Framework • Susceptible-Infected-Susceptible (SIS) model • Cured nodes immediately become susceptible [Figure: state diagram; Susceptible/healthy → Infected & infectious (infected by neighbor); Infected & infectious → Susceptible/healthy (cured internally)]
The model • (virus) Birth rate β: probability that an infected neighbor attacks • (virus) Death rate δ: probability that an infected node heals [Figure: node N with neighbors N1, N2, N3; legend: Healthy, Infected; labels: Prob. β (on two edges), Prob. δ]
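The SIS model with parameters β and δ can be simulated in discrete time as sketched below. The function names, the adjacency-dict graph representation, and the rule that a node healed in a step is not reinfected within that same step are assumptions made for illustration; the slides do not fix these details.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS step; returns the new set of infected nodes."""
    newly = set()
    for u in sorted(infected):
        for v in adj[u]:
            # Each infected neighbor attacks independently with probability beta.
            if v not in infected and rng.random() < beta:
                newly.add(v)
    # Each infected node heals with probability delta and becomes susceptible.
    survivors = {u for u in sorted(infected) if rng.random() >= delta}
    return survivors | newly

def simulate_sis(adj, seeds, beta, delta, steps, seed=0):
    """Return the number of infected nodes at times 0..steps."""
    rng = random.Random(seed)
    infected = set(seeds)
    counts = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        counts.append(len(infected))
    return counts

# Hypothetical example: a 4-node path graph 0-1-2-3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

With β = 1 and δ = 0 the infection sweeps along the path one hop per step; with β = 0 and δ = 1 it dies out after one step. Intermediate values give random trajectories that depend on the seed.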
Epidemic threshold τ • Defined as the value of τ such that if β/δ < τ, an epidemic cannot happen • Thus, • given a graph • compute its epidemic threshold
Epidemic threshold τ • What should τ depend on? • avg. degree? and/or highest degree? • and/or variance of degree? • and/or determinant of the adjacency matrix?
Basic Homogeneous Model • Homogeneous graphs [Kephart-White ’91, ’93] • Epidemic threshold = 1/<k> • Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)
Power-law Networks • Model for Barabási-Albert networks • [Pastor-Satorras & Vespignani, ’01, ’02] • Epidemic threshold = <k> / <k²> • for BA type networks, with only γ = 3 (γ = power-law exponent, i.e., the slope)
Epidemic threshold • Homogeneous graphs: 1/<k> • BA (γ = 3): <k> / <k²> • more complicated graphs? • arbitrary, REAL graphs? • how many parameters??
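The two degree-based thresholds listed above can be computed from a degree sequence as sketched here. The function name `degree_thresholds` and the adjacency-dict input are assumptions. Each formula was derived for a specific graph class, so applying both to an arbitrary graph is only a comparison aid, not a claim that either holds there.

```python
def degree_thresholds(adj):
    """Return (1/<k>, <k>/<k^2>) for a graph given as {node: [neighbors]}."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    k1 = sum(degrees) / len(degrees)                  # <k>, mean degree
    k2 = sum(d * d for d in degrees) / len(degrees)   # <k^2>, mean squared degree
    return 1.0 / k1, k1 / k2

# Hypothetical examples: a 4-node ring (all degrees equal) and a 3-leaf star.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

On the ring the two values coincide, since <k²> = <k>² when all degrees are equal. On the star they differ, which illustrates why a single degree statistic does not settle the question for arbitrary graphs.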
Epidemic threshold • [Theorem] We have no epidemic if β/δ < τ = 1/λ1,A
Epidemic threshold • [Theorem] We have no epidemic if β/δ < τ = 1/λ1,A • β: attack prob. • δ: recovery prob. • τ: epidemic threshold • λ1,A: largest eigenvalue of adj. matrix A • Proof: [Wang+03]
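A minimal sketch of computing τ = 1/λ1,A for an undirected graph follows. It uses power iteration on A + I (the shift avoids oscillation on bipartite graphs) with a Rayleigh quotient for the eigenvalue. The function names, iteration limits, and adjacency-dict representation are assumptions; a numerical linear algebra library would normally be used for large graphs.

```python
import math

def largest_eigenvalue(adj, iters=1000, tol=1e-12):
    """Largest adjacency eigenvalue of an undirected graph {node: [neighbors]}."""
    nodes = list(adj)
    x = {u: 1.0 for u in nodes}
    lam = 0.0
    for _ in range(iters):
        # Multiply by (A + I); the shift keeps bipartite graphs from oscillating.
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in nodes}
        norm = math.sqrt(sum(v * v for v in y.values()))
        x = {u: y[u] / norm for u in nodes}
        # Rayleigh quotient x^T A x for the unit vector x.
        new_lam = sum(x[u] * sum(x[v] for v in adj[u]) for u in nodes)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam

def epidemic_threshold(adj):
    """tau = 1 / lambda_1; infinite if the graph has no edges."""
    lam = largest_eigenvalue(adj)
    return math.inf if lam == 0 else 1.0 / lam

# Hypothetical example: a star with 4 leaves has lambda_1 = sqrt(4) = 2.
star4 = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

For `star4` the threshold is 0.5, so under the theorem no epidemic occurs when β/δ is below 0.5 on that graph.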
Epidemic threshold for various networks • sanity checks / older results: • Homogeneous networks • λ1,A = <k>; τ = 1/<k> • where <k> = average degree • This is the same result as that of Kephart & White!