Data Mining using Fractals and Power Laws
Christos Faloutsos, Carnegie Mellon University
Thank you!
• Prof. Hsing-Kuo Kenneth PAO
• Prof. Yuh-Jye LEE
• Hsin Yeh
C. Faloutsos
And also thanks to
• Lei LI
• Leman AKOGLU
• Ian ROLEWICZ
• Ching-Hao (Eric) MAO
• Ming-Kung (Morgan) SUN
• Yi-Ren (Ian) YEH
Overview
• Goals/motivation: find patterns in large datasets:
  • (A) Sensor data
  • (B) Network intrusion data
• Solutions: self-similarity and power laws
• Discussion
Applications of sensors/streams
• Network monitoring
[Figure: # alerts vs. time]
Applications of sensors/streams
• Financial, sales, economic series
• Medical: monitoring of ECGs+, blood pressure, etc.
Motivation - Applications
• Scientific data: seismological; astronomical; environment / anti-pollution; meteorological
Motivation - Applications (cont’d)
• Computer systems
  • web servers (buffering, prefetching)
  • ...
http://repository.cs.vt.edu/lbl-conn-7.tar.Z
Web traffic
• [Crovella Bestavros, SIGMETRICS’96]
Self-* Storage (Ganger+)
• “self-*” = self-managing, self-tuning, self-healing, …
• Goal: 1 petabyte (PB)
• www.pdl.cmu.edu/SelfStar
[Figure: a survivable, self-managing storage infrastructure of ~1 PB, built from storage bricks (0.5–5 TB each)]
Problem definition
• Given: one or more sequences x_1, x_2, …, x_t, … (y_1, y_2, …, y_t, …)
• Find: patterns; clusters; outliers; forecasts
Problem
• Find patterns in large datasets
• Q: Then, how to generate such bursty traffic?
[Figure: # bytes vs. time; annotations: Poisson; independent, identically distributed]
Solutions
• New tools: power laws, self-similarity and ‘fractals’ work where traditional assumptions fail
• Let’s see the details:
Overview
• Goals/motivation: find patterns in large datasets:
  • (A) Sensor data
  • (B) Network data
• Solutions: self-similarity and power laws
• Discussion
What is a fractal?
• = self-similar point set, e.g., Sierpinski triangle
• zero area: (3/4)^inf
• infinite length! (3/2)^inf (each step multiplies the total edge length by 3/2)
• ...
• Q: What is its dimensionality??
• A: log 3 / log 2 = 1.58 (!?!)
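As a worked check of the answer above (a sketch using the standard similarity-dimension formula, which the slide does not spell out): each refinement step replaces a triangle by N = 3 copies, each scaled by s = 1/2. The same formula gives 1 for a line (2 copies at scale 1/2) and 2 for a filled square (4 copies at scale 1/2).

```latex
\begin{align*}
  D   &= \frac{\log N}{\log (1/s)} = \frac{\log 3}{\log 2} \approx 1.585 \\
  A_n &= \left(\tfrac{3}{4}\right)^{n} A_0 \longrightarrow 0
         \quad (n \to \infty)
\end{align*}
```

Here A_0 is the area of the starting triangle and A_n the area left after n steps (each step keeps 3 of the 4 sub-triangles); both symbols are introduced only for this illustration.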
Intrinsic (‘fractal’) dimension
• Notation: nn(<= r) = number of neighbors within distance r
• Q: fractal dimension of a line?
• A: nn(<= r) ~ r^1 (‘power law’: y = x^a)
• Q: fd of a plane?
• A: nn(<= r) ~ r^2
• fd == slope of (log(nn) vs. log(r))
Sierpinski triangle
• == ‘correlation integral’ = CDF of pairwise distances
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
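A minimal sketch of how such a slope could be measured, assuming points sampled on the Sierpinski triangle with the "chaos game" and a brute-force pair count. The function names, sample size, and radii are illustrative choices, not taken from the slides.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    points = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2   # jump halfway to a random corner
        if i >= 20:                          # drop the first few transient points
            points.append((x, y))
    return points

def correlation_integral(points, radii):
    """For each r in radii, count the pairs of points at distance <= r."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            d = math.hypot(xi - xj, yi - yj)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

radii = [2.0 ** -k for k in range(3, 7)]     # 1/8, 1/16, 1/32, 1/64
points = sierpinski_points(1000)
pairs = correlation_integral(points, radii)
fd = loglog_slope(radii, pairs)
print(f"estimated fractal dimension: {fd:.2f}")
```

With a finite sample the estimate only approximates log 3 / log 2; the brute-force count is quadratic in the number of points, so a real dataset would need a faster method.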
Observations: Fractals <-> power laws
Closely related:
• fractals <=>
• self-similarity <=>
• scale-free <=>
• power laws (y = x^a; F = K r^-2)
• (vs. y = e^(-ax) or y = x^a + b)
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
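One way to tell a power law from an exponential, sketched here with made-up curves: a power law is a straight line on log-log axes, while an exponential is not. The exponents (-2 and 0.2) and the helper name `r_squared` are assumptions chosen for illustration.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation: 1.0 means the points lie on a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

xs = [float(x) for x in range(1, 51)]
power_law = [x ** -2 for x in xs]                 # y = x^a with a = -2
exponential = [math.exp(-0.2 * x) for x in xs]    # y = e^(-ax) with a = 0.2

# Compare straightness on log-log axes.
log_x = [math.log(x) for x in xs]
power_fit = r_squared(log_x, [math.log(y) for y in power_law])
exp_fit = r_squared(log_x, [math.log(y) for y in exponential])
print(f"log-log straightness: power law {power_fit:.3f}, exponential {exp_fit:.3f}")
```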
Outline
• Problems
• Self-similarity and power laws
• Solutions to posed problems
• Discussion
Solution #1: traffic
• disk traces: self-similar (also: [Leland+94])
• How to generate such traffic?
[Figure: #bytes vs. time]
Solution #1: traffic
• disk traces (80-20 ‘law’) – ‘multifractals’
[Figure: #bytes vs. time, split 20% / 80%]
80-20 / multifractals
• p ; (1-p) in general
• yes, there are dependencies
[Figure: recursive 20 / 80 split]
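The recursive 80-20 split can be sketched as a multiplicative cascade. This is a minimal illustration, assuming each interval sends a fraction `b` of its volume to a randomly chosen half and 1 - `b` to the other; the function name `b_model` and its parameters are hypothetical, not taken from the slides.

```python
import random

def b_model(total, levels, b=0.8, seed=0):
    """Split `total` units of traffic over 2**levels time ticks.

    At every level each interval passes a fraction b of its volume to one
    half (chosen at random) and the remaining 1 - b to the other half.
    """
    rng = random.Random(seed)
    volumes = [float(total)]
    for _ in range(levels):
        next_volumes = []
        for v in volumes:
            left, right = b * v, (1 - b) * v
            if rng.random() < 0.5:           # randomly pick the busy half
                left, right = right, left
            next_volumes.extend((left, right))
        volumes = next_volumes
    return volumes

trace = b_model(total=1_000_000, levels=10, b=0.8)
busiest = sorted(trace, reverse=True)
top_share = sum(busiest[: len(trace) // 5]) / sum(trace)
print(f"{len(trace)} ticks; busiest 20% of ticks carry {top_share:.0%} of the bytes")
```

Setting `b=0.5` gives perfectly even traffic; values closer to 1 give burstier traces.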
More on 80/20: PQRS
• Part of ‘self-* storage’ project
[Figure: time vs. cylinder#, with quadrants labeled p, q, r, s]
Overview
• Goals/motivation: find patterns in large datasets:
  • (A) Sensor data
  • (B) Network data
• Solutions: self-similarity and power laws
  • sensor/traffic data
  • network data
• Discussion
Problem definition
• Input: records of the form <source-ip, target-ip, timestamp, alert-type>
• e.g., <192.168.2.5; 128.2.220.159; 3am June 6; ICMP-redirect-host>
• Goal: find patterns / anomalies
Power laws in intrusion data
[Figure: count vs. rank]
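A count-vs-rank plot like this could be built as sketched below. The slide does not say what is being counted, so counting alerts per source IP is an assumption, and the alert records are made up for illustration (hypothetical source k raises about 1000/k alerts).

```python
import math
from collections import Counter

def rank_count(alerts):
    """Alert counts per source IP, sorted from most to least active."""
    counts = Counter(alert[0] for alert in alerts)   # alert[0] is the source IP
    return sorted(counts.values(), reverse=True)

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Made-up <source-ip, target-ip, timestamp, alert-type> records.
alerts = [
    (f"10.0.0.{k}", "128.2.220.159", tick, "ICMP-redirect-host")
    for k in range(1, 51)
    for tick in range(round(1000 / k))
]

counts = rank_count(alerts)
ranks = list(range(1, len(counts) + 1))
slope = loglog_slope(ranks, counts)
print(f"{len(counts)} sources; log-log slope of count vs. rank: {slope:.2f}")
```

A roughly straight line on log-log axes (here with slope near -1 by construction) is the signature of a power law.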
Q: Can we visually summarize / classify our sequences?
[Figure: example sequences labeled human-like, robot-like, and robot-like (bursty)]
Answer: yes! Two features:
• F1: how periodic (24h-cycle) a sequence is
• F2: how bursty it is
Q: how to measure burstiness?
A: Fractal dimension!
Burstiness & f.d.
• uniform: fd = 1
• bursts within bursts within bursts: 0 < fd < 1
• all at the same time-tick: fd = 0
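The three cases above can be reproduced with a small sketch that estimates the fractal dimension of a set of timestamps from the slope of log(#pairs within r) vs. log(r). The "bursts within bursts" sequence is modeled here as a Cantor-like dust, which is an assumption made for illustration; all names and parameters are hypothetical.

```python
import math
import random

def pairs_within(times, radii):
    """For each r in radii, count pairs of timestamps at most r apart."""
    counts = [0] * len(radii)
    for i in range(len(times)):
        for j in range(i + 1, len(times)):
            gap = abs(times[i] - times[j])
            for k, r in enumerate(radii):
                if gap <= r:
                    counts[k] += 1
    return counts

def burstiness_fd(times, radii):
    """Slope of log(#pairs within r) versus log(r), fitted by least squares."""
    counts = pairs_within(times, radii)
    lx = [math.log(r) for r in radii]
    ly = [math.log(c) for c in counts]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

def cantor_times(n, depth, rng):
    """Hypothetical 'bursts within bursts': keep only the outer thirds, recursively."""
    return [sum(rng.choice((0, 2)) * 3.0 ** -(i + 1) for i in range(depth))
            for _ in range(n)]

rng = random.Random(0)
n = 500
radii = [3.0 ** -k for k in range(2, 7)]

fd_uniform = burstiness_fd([rng.random() for _ in range(n)], radii)
fd_bursty = burstiness_fd(cantor_times(n, 16, rng), radii)
fd_same_tick = burstiness_fd([0.5] * n, radii)
print(f"uniform: {fd_uniform:.2f}  bursty: {fd_bursty:.2f}  same tick: {fd_same_tick:.2f}")
```

For this particular dust the theoretical value is log 2 / log 3 (about 0.63); other bursty sequences would land elsewhere between 0 and 1.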
6. Proposed Methods: The FDP Plot
• Notice: clustering w.r.t. alert types!
Can we visually summarize / classify our sequences?
[Figure: example sequences labeled human-like, robot-like, and robot-like (bursty)]
Examples
• human-like behavior
Outline
• Problems
• Fractals
• Solutions
• Discussion
  • what else can they solve?
  • how frequent are fractals?
What else can they solve?
• separability [KDD’02]
• forecasting [CIKM’02]
• dimensionality reduction [SBBD’00]
• non-linear axis scaling [KDD’02]
• disk trace modeling [PEVA’02]
• selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00]
• ...
Problem #3 - spatial data mining
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol)
• ‘spiral’ and ‘elliptical’ galaxies
• patterns? (not Gaussian; not uniform)
• attraction/repulsion?
• separability??
Solution #3: spatial data mining [w/ Seeger, Traina, Traina, SIGMOD00]
• CORRELATION INTEGRAL!
• Observations: 1.8 slope; plateau!; repulsion!
[Figure: log(#pairs within <= r) vs. log(r), with curves for ell-ell, spi-spi, and spi-ell pairs]
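A curve for a mixed pair type such as spi-ell could be computed by counting cross pairs between two point sets, as in this brute-force sketch. The point sets here are random placeholders, not Sloan Digital Sky Survey data, and the names `pair_counts`, `spirals`, and `ellipticals` are hypothetical.

```python
import math
import random

def pair_counts(set_a, set_b, radii):
    """Count pairs (a, b), a from set_a and b from set_b, at distance <= r."""
    counts = [0] * len(radii)
    for ax, ay in set_a:
        for bx, by in set_b:
            d = math.hypot(ax - bx, ay - by)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

rng = random.Random(0)
# Placeholder positions in the unit square (NOT real galaxy coordinates).
spirals = [(rng.random(), rng.random()) for _ in range(300)]
ellipticals = [(rng.random(), rng.random()) for _ in range(300)]

radii = [2.0 ** -k for k in range(1, 6)]      # 1/2 ... 1/32
cross = pair_counts(spirals, ellipticals, radii)
for r, c in zip(radii, cross):
    print(f"log(r) = {math.log(r):6.2f}   log(#pairs) = {math.log(c):6.2f}")
```

For same-type curves (ell-ell, spi-spi) each unordered pair within one set would be counted once instead. Plotting log(#pairs) against log(r) for each pair type gives curves of the kind shown on the slide.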