
Data Mining using Fractals and Power laws

Christos Faloutsos, Carnegie Mellon University


Presentation Transcript


  1. Data Mining using Fractals and Power laws • Christos Faloutsos • Carnegie Mellon University

  2. Thank you! • Prof. Hsing-Kuo Kenneth PAO • Prof. Yuh-Jye LEE • Hsin Yeh C. Faloutsos

  3. And also thanks to • Lei LI • Leman AKOGLU • Ian ROLEWICZ • Ching-Hao (Eric) MAO • Ming-Kung (Morgan) SUN • Yi-Ren (Ian) YEH

  4. Overview • Goals/motivation: find patterns in large datasets: • (A) Sensor data • (B) Network intrusion data • Solutions: self-similarity and power laws • Discussion

  5. Applications of sensors/streams • network monitoring [Figure: # alerts vs. time]

  6. Applications of sensors/streams • Financial, sales, economic series • Medical: ECGs+; blood pressure etc. monitoring

  7. Motivation - Applications • Scientific data: seismological; astronomical; environment / anti-pollution; meteorological

  8. Motivation - Applications (cont’d) • Computer systems • web servers (buffering, prefetching) • ... • http://repository.cs.vt.edu/lbl-conn-7.tar.Z

  9. Web traffic • [Crovella & Bestavros, SIGMETRICS’96]

  10. Self-* Storage (Ganger+) • “self-*” = self-managing, self-tuning, self-healing, … [Figure: survivable, self-managing storage infrastructure; a storage brick (0.5–5 TB); ~1 PB]

  11. Self-* Storage (Ganger+) • “self-*” = self-managing, self-tuning, self-healing, … • Goal: 1 petabyte (PB) • www.pdl.cmu.edu/SelfStar [Figure: survivable, self-managing storage infrastructure; a storage brick (0.5–5 TB); ~1 PB]

  12. Problem definition • Given: one or more sequences x_1, x_2, …, x_t, … (y_1, y_2, …, y_t, …) • Find: patterns; clusters; outliers; forecasts

  13. Problem • Find patterns, in large datasets [Figure: # bytes vs. time]

  14. Problem • Find patterns, in large datasets [Figure: # bytes vs. time; label: Poisson, indep., ident. distr.]

  15. Problem • Find patterns, in large datasets [Figure: # bytes vs. time; label: Poisson, indep., ident. distr.]

  16. Problem • Find patterns, in large datasets • Q: Then, how to generate such bursty traffic? [Figure: # bytes vs. time; label: Poisson, indep., ident. distr.]

  17. Solutions • New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail • Let’s see the details:

  18. Overview • Goals/motivation: find patterns in large datasets: • (A) Sensor data • (B) Network data • Solutions: self-similarity and power laws • Discussion

  19. What is a fractal? • = self-similar point set, e.g., Sierpinski triangle • zero area: (3/4)^inf • infinite length!: (4/3)^inf • Q: What is its dimensionality?? [Figure: Sierpinski triangle]

  20. What is a fractal? • = self-similar point set, e.g., Sierpinski triangle • zero area: (3/4)^inf • infinite length!: (4/3)^inf • Q: What is its dimensionality?? • A: log 3 / log 2 = 1.58 (!?!) [Figure: Sierpinski triangle]
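The 1.58 figure on slide 20 follows from self-similarity. A short worked derivation, assuming the standard similarity-dimension definition (the slide states only the result; the symbols N and s are introduced here for illustration):

```latex
% N = number of self-similar copies, s = linear scaling factor of each copy.
% The Sierpinski triangle is made of N = 3 copies of itself, each scaled by s = 1/2.
D = \frac{\log N}{\log (1/s)}, \qquad
N = 3,\quad s = \tfrac{1}{2}
\;\Longrightarrow\;
D = \frac{\log 3}{\log 2} \approx 1.585
```

The same formula gives D = 1 for a line segment (2 copies at scale 1/2) and D = 2 for a filled square (4 copies at scale 1/2), which is why a value between 1 and 2 is the natural answer here.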

  21. Intrinsic (‘fractal’) dimension • Q: fractal dimension of a line? • Q: fd of a plane?

  22. Intrinsic (‘fractal’) dimension • Q: fractal dimension of a line? • A: nn(<= r) ~ r^1, where nn(<= r) is the number of neighbors within distance r (‘power law’: y = x^a) • Q: fd of a plane? • A: nn(<= r) ~ r^2 • fd == slope of (log(nn) vs. log(r))

  23. Sierpinski triangle • == ‘correlation integral’ = CDF of pairwise distances [Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
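The slope reading on slides 22 and 23 can be tried numerically. The sketch below is an illustration rather than the author's code: it samples the Sierpinski triangle with the "chaos game" (an assumed sampling method, not mentioned on the slides), counts pairs within each radius, and fits the log-log slope by least squares. The function names, the radii and the sample size are all hypothetical choices.

```python
import bisect
import math
import random


def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    points = []
    for step in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        if step >= 20:  # discard a short burn-in
            points.append((x, y))
    return points


def correlation_integral(points, radii):
    """Return, for each radius (sorted ascending), the number of pairs
    of points whose distance is <= that radius."""
    hist = [0] * (len(radii) + 1)
    for i, (xi, yi) in enumerate(points):
        for xj, yj in points[i + 1:]:
            d = math.hypot(xi - xj, yi - yj)
            # index of the smallest radius that still covers this pair
            hist[bisect.bisect_left(radii, d)] += 1
    counts, running = [], 0
    for k in range(len(radii)):
        running += hist[k]
        counts.append(running)
    return counts


def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


if __name__ == "__main__":
    radii = [0.01, 0.02, 0.04, 0.08, 0.16]
    triangle = sierpinski_points(800)
    line = [(i / 797, 0.0) for i in range(797)]
    print("Sierpinski:", loglog_slope(radii, correlation_integral(triangle, radii)))
    print("line:      ", loglog_slope(radii, correlation_integral(line, radii)))
```

With a finite sample the fitted slopes are only approximate: the triangle should land in the neighborhood of 1.58 and the line near 1, with some downward bias from edge effects at the larger radii.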

  24. Observations: Fractals <-> power laws • Closely related: • fractals <=> • self-similarity <=> • scale-free <=> • power laws (y = x^a; F = K r^(-2)) • (vs. y = e^(-ax) or y = x^a + b) [Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

  25. Outline • Problems • Self-similarity and power laws • Solutions to posed problems • Discussion

  26. Solution #1: traffic • disk traces: self-similar (also: [Leland+94]) • How to generate such traffic? [Figure: #bytes vs. time]

  27. Solution #1: traffic • disk traces (80-20 ‘law’) – ‘multifractals’ [Figure: #bytes vs. time, split 20% / 80%]

  28. 80-20 / multifractals [Figure: 20 / 80 split]

  29. 80-20 / multifractals • p ; (1-p) in general • yes, there are dependencies [Figure: 20 / 80 split]
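One way to read the 80-20 recursion on slides 27 to 29 is as a repeated split of the total volume, where each interval passes a fraction p to one half and 1-p to the other. The sketch below makes that concrete under stated assumptions: the function name `b_model`, the random choice of which half receives the larger share, and the parameter defaults are illustrative and not taken from the slides.

```python
import random


def b_model(total, levels, p=0.8, seed=0):
    """Generate 2**levels per-tick volumes by recursive p / (1-p) splits.

    Assumption: at every split, a fair coin decides which half
    receives the larger share p of the parent's volume.
    """
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        refined = []
        for volume in series:
            heavy, light = volume * p, volume * (1 - p)
            if rng.random() < 0.5:
                refined.extend([heavy, light])
            else:
                refined.extend([light, heavy])
        series = refined
    return series


if __name__ == "__main__":
    traffic = b_model(total=1_000_000, levels=10, p=0.8)
    print(len(traffic), "ticks; busiest tick carries", max(traffic), "bytes")
```

Every split conserves the parent's volume, so the series always sums to `total`. Setting p = 0.5 gives perfectly uniform traffic, and pushing p toward 1 makes the series burstier.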

  30. More on 80/20: PQRS • Part of ‘self-* storage’ project [Figure: time vs. cylinder#]

  31. More on 80/20: PQRS • Part of ‘self-* storage’ project [Figure: quadrants labeled p, q, r, s]

  32. Overview • Goals/motivation: find patterns in large datasets: • (A) Sensor data • (B) Network data • Solutions: self-similarity and power laws • sensor/traffic data • network data • Discussion

  33. Problem definition • <source-ip, target-ip, timestamp, alert-type> • e.g., <192.168.2.5; 128.2.220.159; 3am June 6; ICMP-redirect-host> • goal: find patterns / anomalies

  34. Power laws in intrusion data [Figure: count vs. rank]

  35. [No text on this slide]

  36. Q: Can we visually summarize / classify our sequences? [Figure: sequences labeled human-like, robot-like, robot-like (bursty)]

  37. Answer: yes! • two features: • F1: how periodic (24h-cycle) is a sequence • F2: how bursty it is • Q: how to measure burstiness? • A: Fractal dimension!

  38. Burstiness & f.d. • uniform: fd = 1 • @same time-tick: fd = 0

  39. Burstiness & f.d. • uniform: fd = 1 • bursts within bursts within bursts: 0 < fd < 1 • @same time-tick: fd = 0
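The three cases on slide 39 can be reproduced with a small estimator. The slides do not say which estimator is used, so the sketch below assumes one common choice, the correlation fractal dimension obtained by box counting: merge adjacent time ticks level by level, sum the squared occupancy fractions at each level, and take the slope against cell size in log-log scale. The names `burstiness_fd` and `cascade` are hypothetical.

```python
import math


def burstiness_fd(counts):
    """Estimate the correlation fractal dimension of a per-tick event
    count series whose length is a power of two."""
    n = len(counts)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 2")
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("series must contain at least one event")
    level = [c / total for c in counts]  # occupancy fraction per cell
    xs, ys = [], []
    while True:
        xs.append(math.log2(1.0 / len(level)))          # log2(cell size)
        ys.append(math.log2(sum(m * m for m in level)))  # log2(sum of squares)
        if len(level) == 1:
            break
        # merge adjacent cells to move to the next coarser resolution
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def cascade(levels, p=0.8):
    """Deterministic p / (1-p) cascade, used here only as bursty test input."""
    series = [1.0]
    for _ in range(levels):
        series = [v * f for v in series for f in (p, 1 - p)]
    return series


if __name__ == "__main__":
    print("uniform:       ", burstiness_fd([1] * 256))
    print("single tick:   ", burstiness_fd([0] * 100 + [7] + [0] * 155))
    print("80-20 cascade: ", burstiness_fd(cascade(8, p=0.8)))
```

For an exact p / (1-p) cascade this estimator equals -log2(p^2 + (1-p)^2), about 0.56 for p = 0.8, which sits between the uniform case (1) and the single-tick case (0). On real, sparse alert data the finest resolutions are noisy, so restricting the fit to a middle range of scales may be a sensible refinement.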

  40. 6. Proposed Methods: The FDP Plot • Notice: clustering wrt alert types!

  41. Can we visually summarize / classify our sequences? [Figure: sequences labeled human-like, robot-like, robot-like (bursty)]

  42. Examples • human-like behavior

  43. Examples • human-like behavior

  44. Examples • human-like behavior

  45. Examples • human-like behavior

  46. Outline • Problems • Fractals • Solutions • Discussion • what else can they solve? • how frequent are fractals?

  47. What else can they solve? • separability [KDD’02] • forecasting [CIKM’02] • dimensionality reduction [SBBD’00] • non-linear axis scaling [KDD’02] • disk trace modeling [PEVA’02] • selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00] • ...

  48. Problem #3 - spatial d.m. • Galaxies (Sloan Digital Sky Survey w/ B. Nichol) • ‘spiral’ and ‘elliptical’ galaxies • patterns? (not Gaussian; not uniform) • attraction/repulsion? • separability??

  49. Solution #3: spatial d.m. • CORRELATION INTEGRAL! • 1.8 slope • plateau! • repulsion! [Figure: log(#pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]

  50. Solution #3: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00] • 1.8 slope • plateau! • repulsion! [Figure: log(#pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
