Data Mining Meets Systems: Tools and Case Studies. Christos Faloutsos, SCS CMU
Thanks • Spiros Papadimitriou (CMU -> IBM) • Mengzhi Wang (CMU -> Google) • Jimeng Sun (CMU -> IBM) C. Faloutsos
Outline • Problem 1: workload characterization (fractals) • Problem 2: self-* monitoring (SVD) • Problem 3: BGP mining (wavelets) • (Problem 4: sensor mining) (tensors) • (Problem 5: Large graphs & Hadoop) (PageRank)
Problem #1 • Goal: given a signal (e.g., # bytes over time) • Find: patterns, periodicities, and/or compress [Figure: # bytes per 30' vs. time; the same setting covers packets per day or earthquakes per year]
Problem #1 • model bursty traffic • generate realistic traces • (Poisson does not work) [Figure: # bytes vs. time, with a Poisson trace for comparison]
Motivation • predict queue length distributions (e.g., to give probabilistic guarantees) • “learn” traffic, for buffering, prefetching, ‘active disks’, web servers
Q: any ‘pattern’? • Not Poisson • spike; silence; more spikes; more silence… • any rules? [Figure: # bytes vs. time]
Solution: self-similarity [Figure: two plots of # bytes vs. time]
But: • Q1: How to generate realistic traces; extrapolate? • Q2: How to estimate the model parameters?
Approach • Q1: How to generate a sequence that is • bursty • self-similar • and has similar queue length distributions?
Approach • A: ‘binomial multifractal’ [Wang+02] • ~ 80-20 ‘law’: • 80% of bytes/queries etc. fall in the first half • repeat recursively within each half • b: bias factor (e.g., 80%)
binary multifractals [Figure: one split into two halves, labeled 20 and 80]
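The recursive 80-20 split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [Wang+02]: the function name `b_model`, the `randomize` flag and the `seed` parameter are hypothetical, and by default the heavy share always goes to the first half, as on the slide.

```python
import random


def b_model(total, levels, b=0.8, randomize=False, seed=0):
    """Spread `total` over 2**levels time slots with the 80-20 style rule.

    At every level each interval gives a fraction `b` of its volume to one
    half and 1 - b to the other half.
    """
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    rng = random.Random(seed)
    trace = [float(total)]
    for _ in range(levels):
        refined = []
        for volume in trace:
            heavy, light = b * volume, (1.0 - b) * volume
            # Assumption: by default the heavy share goes to the first half,
            # as on the slide; randomize=True flips a fair coin instead.
            if randomize and rng.random() < 0.5:
                refined.extend((light, heavy))
            else:
                refined.extend((heavy, light))
        trace = refined
    return trace


if __name__ == "__main__":
    trace = b_model(total=1_000_000, levels=10, b=0.8)
    print(len(trace), "slots; busiest slot holds", round(max(trace)), "bytes")
```

With `randomize=False` the output is deterministic, which makes the recursion easy to check by hand; setting `randomize=True` is one plausible way to obtain less regular-looking traces.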
Parameter estimation • Q2: How to estimate the bias factor b? • A: MANY ways [Crovella+96] • Hurst exponent • variance plot • even DFT amplitude spectrum! (‘periodogram’) • More robust: ‘entropy plot’ [Wang+02] • [Wang+02]: Mengzhi Wang, Tara Madhyastha, Ngai Hang Chan, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002
Entropy plot • Rationale: • burstiness: inverse of uniformity • entropy measures uniformity of a distribution • find entropy at several granularities, to see whether/how our distribution is close to uniform.
Entropy plot • Entropy E(n) after n levels of splits; $p_1$, $p_2$ are the fractions of bytes in each half, and $p_{2,1}, \dots, p_{2,4}$ the fractions in each quarter • n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$ • n=2: $E(2) = -\sum_{i} p_{2,i} \log_2 p_{2,i}$
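The computation of E(n) at each level can be sketched as follows. This is an illustrative sketch, assuming the trace length is a power of two so that every level splits evenly; the function name `entropy_plot` is hypothetical.

```python
import math


def entropy_plot(trace):
    """Return [E(1), ..., E(n_max)] for a trace of length 2**n_max.

    E(n) is the entropy (in bits) of the fractions of total volume that
    fall in each of the 2**n equal-width intervals.
    """
    n_max = len(trace).bit_length() - 1
    if n_max < 1 or 2 ** n_max != len(trace):
        raise ValueError("trace length must be a power of two, at least 2")
    total = float(sum(trace))
    fractions = [value / total for value in trace]
    entropies = []
    for _ in range(n_max):
        # Empty intervals contribute nothing (0 * log 0 is taken as 0).
        entropies.append(-sum(p * math.log2(p) for p in fractions if p > 0))
        # Merge neighbouring intervals to move one level coarser.
        fractions = [fractions[i] + fractions[i + 1]
                     for i in range(0, len(fractions), 2)]
    entropies.reverse()  # collected finest-first; report E(1) first
    return entropies


if __name__ == "__main__":
    print(entropy_plot([64, 16, 16, 4]))
```

Plotting the returned values against n = 1, 2, ... gives the entropy plot; a uniform trace gains one bit per level.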
Real traffic • Has a linear entropy plot (-> self-similar) [Figure: entropy E(n) vs. # of levels (n); slope 0.73]
Observation - intuition • slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit • uniform dataset: slope = 1 • multi-point: slope = 0 [Figure: entropy E(n) vs. # of levels (n); slope 0.73]
Some more entropy plots • Poisson vs. real: real traffic has slope 0.73; Poisson has slope ~1 (-> uniformly distributed)
B-model • b-model traffic gives a perfectly linear entropy plot • Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$ • Fitting: do the entropy plot; get the slope; solve for b [Figure: E(n) vs. n]
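The fitting step (get the slope, then solve the lemma for b) could look like the sketch below. The slides do not say how the equation is solved; bisection on [0.5, 1) is an assumption made here, chosen because the slope expression decreases monotonically on that range. The function names are hypothetical.

```python
import math


def binary_entropy(b):
    """Slope predicted by the lemma: -b log2 b - (1 - b) log2 (1 - b)."""
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def fit_slope(entropies):
    """Least-squares slope of E(n) against n = 1, 2, ..., len(entropies)."""
    n = len(entropies)
    if n < 2:
        raise ValueError("need at least two levels to fit a slope")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(entropies) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, entropies))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def bias_from_slope(slope, tol=1e-12):
    """Solve binary_entropy(b) = slope for b in [0.5, 1) by bisection."""
    if not 0.0 < slope <= 1.0:
        raise ValueError("slope must lie in (0, 1]")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid  # entropy still too high: b must be more biased
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(bias_from_slope(0.73))
```

For the slope of 0.73 reported for real traffic, the solution lies just below b = 0.8, in line with the 80-20 intuition.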
Experimental setup • Disk traces (from HP [Wilkes 93]) • web traces from LBL http://repository.cs.vt.edu/ lbl-conn-7.tar.Z
Model validation • Linear entropy plots • Bias factors b: 0.6-0.8 • Smallest b (smoothest): NNTP traffic
Web traffic - results • LBL, NCDF of queue lengths (log-log scales) [Figure: Prob(> l) vs. queue length l]
Conclusions • Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic
Books • Fractals: Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (probably the BEST book on fractals!)
Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & hadoop)
Clusters/data center monitoring • Monitor correlations of multiple measurements • Automatically flag anomalous behavior • InteMon: intelligent monitoring system • warsteiner.db.cs.cmu.edu/demo/intemon.jsp
Publication • Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44, ACM Press, July 2006.
Under the hood: SVD • Singular Value Decomposition • Done incrementally • Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005, Trondheim, Norway.
Singular Value Decomposition (SVD) • SVD (~ LSI ~ KL ~ PCA ~ spectral analysis...) • LSI: S. Dumais; M. Berry • KL: e.g., Duda+Hart • PCA: e.g., Jolliffe • Details: [Press+] [Figure: scatter plot of u of CPU1 (x-axis) vs. u of CPU2 (y-axis), one point per time tick (t=1, t=2, ...)]
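For the two-CPU picture, the leading SVD direction can be computed in closed form. The sketch below is a batch illustration only, not the incremental method of the VLDB 2005 paper. It assumes the rows of the data matrix X are the (CPU1, CPU2) readings per time tick, with no mean-centering, and uses the fact that the first right singular vector of X is the top eigenvector of the 2 x 2 matrix X^T X. The function name and the sample readings are hypothetical.

```python
import math


def first_singular_direction(xs, ys):
    """Return ((vx, vy), sigma): the first right singular vector and the
    largest singular value of the matrix with rows (xs[t], ys[t])."""
    # Entries of the symmetric 2 x 2 matrix X^T X = [[a, b], [b, c]].
    a = sum(x * x for x in xs)
    b = sum(x * y for x, y in zip(xs, ys))
    c = sum(y * y for y in ys)
    # Orientation of the major axis of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    # Largest eigenvalue of X^T X; its square root is the singular value.
    top_eigenvalue = (a + c) / 2.0 + math.hypot((a - c) / 2.0, b)
    return (math.cos(theta), math.sin(theta)), math.sqrt(top_eigenvalue)


if __name__ == "__main__":
    cpu1 = [0.1, 0.2, 0.4, 0.8]    # made-up readings for illustration
    cpu2 = [0.05, 0.1, 0.2, 0.4]   # here CPU2 is always half of CPU1
    (vx, vy), sigma = first_singular_direction(cpu1, cpu2)
    # Projecting each tick on the direction gives one hidden variable.
    hidden = [x * vx + y * vy for x, y in zip(cpu1, cpu2)]
    print((vx, vy), sigma, hidden)
```

When the two measurements are strongly correlated, one hidden variable along this direction summarizes both; a tick that falls far from the line is a candidate anomaly.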
Outline • Problem 1: workload characterization • Problem 2: self-* monitoring • Problem 3: BGP mining • (Problem 4: sensor mining) • (Problem 5: Large graphs & hadoop)
BGP updates • With: Aditya Prakash (CMU), Michalis Faloutsos (UC Riverside), Nicholas Valler (UC Riverside), Dave Andersen (CMU)
Tool #0: Time plot [Figure: time series of # updates per 600 s, Washington router, 09/2004-09/2006]
Tool #0: Time plot • Observation #1: Missing values • Observation #2: Bursty
Tool #1: Wavelets
Wavelets - DWT • Short window Fourier transform (SWFT) • But: how short should the window be? [Figure: value vs. time, and frequency vs. time]
Wavelets - DWT • Answer: multiple window sizes! -> DWT [Figure: frequency vs. time layouts for the time domain, DFT, SWFT and DWT]
Haar Wavelets • subtract the sum of the left half from the sum of the right half • repeat recursively for quarters, eighths, ...
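The two bullets above translate almost literally into code. The sketch below computes unnormalized Haar detail coefficients; the slides do not specify a scaling factor, so the absence of normalization is an assumption, as is the power-of-two length requirement. The function name `haar_details` is hypothetical.

```python
def haar_details(signal):
    """Return Haar detail coefficients per level, coarsest level first.

    Level 0 holds one value: (sum of right half) - (sum of left half).
    Level 1 repeats this inside each half (two values), level 2 inside
    each quarter, and so on down to neighbouring pairs.
    """
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two, at least 2")
    levels = []
    width = n
    while width >= 2:
        half = width // 2
        level = []
        for start in range(0, n, width):
            left = sum(signal[start:start + half])
            right = sum(signal[start + half:start + width])
            level.append(right - left)
        levels.append(level)
        width = half
    return levels


if __name__ == "__main__":
    # A single spike produces a nonzero coefficient at every scale.
    print(haar_details([0, 0, 0, 8, 0, 0, 0, 0]))
```

This version recomputes the sums at each level for clarity, so it costs O(n log n); a pyramid that reuses pairwise sums brings the cost down to O(n).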
‘Tornado Plot’ for Washington Router • Dark areas correspond to high energy [Figure: time on one axis, frequency (low to high) on the other]
Tornado Plot: Wavelet Transform for Washington Router, 09/2004-09/2006, all coefficients and detail levels 1-12 • Observations: • 1. Obvious spikes (E1): tornados that “touch down” • 2. Prolonged spikes (E2 and E3): when coarser scales have high values but finer scales do not • 3. Intermittent waves (E4 and E5): high-energy entries at nearby scales correspond to local periodic motion
Magnification of updates on 28th Aug. 2005 • E2: prolonged spike, a sustained period of relatively high activity [Figure: # updates vs. time]
Tool #2: logarithms • Prominent ‘clothesline’ at ~50 updates per 600 secs • Culprit IP addresses: 192.211.42.0/24, 216.109.38.0/24, 207.157.115.0/24 • All from Alabama (Supercomputing Center)!
Outline • Problem 1: workload characterization (fractals) • Problem 2: self-* monitoring (SVD) • Problem 3: BGP mining (wavelets) • (Problem 4: sensor mining) (tensors) • (Problem 5: Large graphs & Hadoop) (PageRank)
Main point • Two-way street: • <- DM can use such infrastructures to find patterns • -> DM can help such systems/networks etc. to become self-healing, self-adjusting, ‘self-*’ • Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes