1 / 83

Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional Data

Explore finding patterns in large datasets using self-similarity and power laws with a focus on sensor data, network/graph data, and various applications like smart house monitoring and financial series. Learn about fractals, graphs, and more from Christos Faloutsos at Carnegie Mellon University.

tommyhanson
Download Presentation

Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional Data Christos Faloutsos Carnegie Mellon University C. Faloutsos

  2. THANK YOU! • Prof. Jiawei Han • Prof. Kevin Chang C. Faloutsos

  3. Overview • Goals/ motivation: find patterns in large datasets: • (A) Sensor data • (B) network/graph data • Solutions: self-similarity and power laws • Discussion C. Faloutsos

  4. Applications of sensors/streams • ‘Smart house’: monitoring temperature, humidity etc • Financial, sales, economic series C. Faloutsos

  5. Motivation - Applications • Medical: ECGs +; blood pressure etc monitoring • Scientific data: seismological; astronomical; environment / anti-pollution; meteorological C. Faloutsos

  6. Motivation - Applications (cont’d) • Computer systems • web servers (buffering, prefetching) • network traffic monitoring • ... http://repository.cs.vt.edu/lbl-conn-7.tar.Z C. Faloutsos

  7. Problem definition • Given: one or more sequences x1 , x2 , … , xt , …; (y1, y2, … , yt, …) • Find • patterns; clusters; outliers; forecasts; C. Faloutsos

  8. Problem #1 # bytes • Find patterns, in large datasets time C. Faloutsos

  9. Problem #1 # bytes • Find patterns, in large datasets time Poisson indep., ident. distr C. Faloutsos

  10. Problem #1 # bytes • Find patterns, in large datasets time Poisson indep., ident. distr C. Faloutsos

  11. Problem #1 # bytes • Find patterns, in large datasets time Poisson indep., ident. distr Q: Then, how to generate such bursty traffic? C. Faloutsos

  12. Overview • Goals/ motivation: find patterns in large datasets: • (A) Sensor data • (B) network/graph data • Solutions: self-similarity and power laws • Discussion C. Faloutsos

  13. Problem #2 - network and graph mining • How does the Internet look like? • How does the web look like? • What constitutes a ‘normal’ social network? • What is the ‘market value’ of a customer? • which gene/species affects the others the most? C. Faloutsos

  14. Problem#2 Given a graph: • which node to market-to / defend / immunize first? • Are there un-natural sub-graphs? (eg., criminals’ rings)? [from Lumeta: ISPs 6/1999] C. Faloutsos

  15. Solutions • New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail • Let’s see the details: C. Faloutsos

  16. Overview • Goals/ motivation: find patterns in large datasets: • (A) Sensor data • (B) network/graph data • Solutions: self-similarity and power laws • Discussion C. Faloutsos

  17. What is a fractal? = self-similar point set, e.g., Sierpinski triangle: zero area: (3/4)^inf infinite length! (4/3)^inf ... Q: What is its dimensionality?? C. Faloutsos

  18. What is a fractal? = self-similar point set, e.g., Sierpinski triangle: zero area: (3/4)^inf infinite length! (4/3)^inf ... Q: What is its dimensionality?? A: log3 / log2 = 1.58 (!?!) C. Faloutsos

  19. Q: fractal dimension of a line? A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a) Q: fd of a plane? A: nn ( <= r ) ~ r^2 fd== slope of (log(nn) vs.. log(r) ) Intrinsic (‘fractal’) dimension C. Faloutsos

  20. log(#pairs within <=r ) 1.58 log( r ) Sierpinsky triangle == ‘correlation integral’ = CDF of pairwise distances C. Faloutsos

  21. log(#pairs within <=r ) 1.58 log( r ) Observations: Fractals <-> power laws Closely related: • fractals <=> • self-similarity <=> • scale-free <=> • power laws ( y= xa ; F=K r-2) • (vs y=e-ax or y=xa+b) C. Faloutsos

  22. Outline • Problems • Self-similarity and power laws • Solutions to posed problems • Discussion C. Faloutsos

  23. #bytes time Solution #1: traffic • disk traces: self-similar: (also: [Leland+94]) • How to generate such traffic? C. Faloutsos

  24. 20% 80% Solution #1: traffic • disk traces (80-20 ‘law’) – ‘multifractals’ #bytes time C. Faloutsos

  25. 80-20 / multifractals 20 80 C. Faloutsos

  26. 80-20 / multifractals 20 80 • p ; (1-p) in general • yes, there are dependencies C. Faloutsos

  27. Overview • Goals/ motivation: find patterns in large datasets: • (A) Sensor data • (B) network/graph data • Solutions: self-similarity and power laws • sensor/traffic data • network/graph data • Discussion C. Faloutsos

  28. Problem #2 - topology How does the Internet look like? Any rules? C. Faloutsos

  29. Patterns? • avg degree is, say 3.3 • pick a node at random - what is the degree you expect it to have? count ? avg: 3.3 degree C. Faloutsos

  30. Patterns? • avg degree is, say 3.3 • pick a node at random - what is the degree you expect it to have? • A: 1!! count avg: 3.3 degree C. Faloutsos

  31. Patterns? • avg degree is, say 3.3 • pick a node at random - what is the degree you expect it to have? • A: 1!! • A’: very skewed distr. • Corollary: the mean is meaningless! • (and std -> infinity (!)) count avg: 3.3 degree C. Faloutsos

  32. -0.82 att.com log(degree) ibm.com log(rank) Solution#2: Rank exponent R • A1: Power law in the degree distribution [SIGCOMM99] internet domains C. Faloutsos

  33. Solution#2’: Eigen Exponent E Eigenvalue • A2: power law in the eigenvalues of the adjacency matrix Exponent = slope E = -0.48 May 2001 Rank of decreasing eigenvalue C. Faloutsos

  34. Power laws - discussion • do they hold, over time? • do they hold on other graphs/domains? C. Faloutsos

  35. Power laws - discussion • do they hold, over time? • Yes! for multiple years [Siganos+] • do they hold on other graphs/domains? • Yes! • web sites and links [Tomkins+], [Barabasi+] • peer-to-peer graphs (gnutella-style) • who-trusts-whom (epinions.com) C. Faloutsos

  36. att.com -0.82 log(degree) ibm.com log(rank) Time Evolution: rank R Domain level • The rank exponent has not changed! [Siganos+] C. Faloutsos

  37. The Peer-to-Peer Topology count [Jovanovic+] • Number of immediate peers (= degree), follows a power-law degree C. Faloutsos

  38. epinions.com count • who-trusts-whom [Richardson + Domingos, KDD 2001] (out) degree C. Faloutsos

  39. Outline • problems • Fractals • Solutions • Discussion • what else can they solve? • how frequent are fractals? C. Faloutsos

  40. What else can they solve? • separability [KDD’02] • forecasting [CIKM’02] • dimensionality reduction [SBBD’00] • non-linear axis scaling [KDD’02] • disk trace modeling [PEVA’02] • selectivity of spatial queries [PODS’94, VLDB’95, ICDE’00] • ... C. Faloutsos

  41. Problem #3 - spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. Nichol) • - ‘spiral’ and ‘elliptical’ galaxies • - patterns? (not Gaussian; not uniform) • attraction/repulsion? • separability?? C. Faloutsos

  42. Solution#3: spatial d.m. CORRELATION INTEGRAL! log(#pairs within <=r ) - 1.8 slope - plateau! - repulsion! ell-ell spi-spi spi-ell log(r) C. Faloutsos

  43. Solution#3: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00] log(#pairs within <=r ) - 1.8 slope - plateau! - repulsion! ell-ell spi-spi spi-ell log(r) C. Faloutsos

  44. r1 r2 r2 r1 spatial d.m. Heuristic on choosing # of clusters C. Faloutsos

  45. Solution#3: spatial d.m. log(#pairs within <=r ) - 1.8 slope - plateau! - repulsion! ell-ell spi-spi spi-ell log(r) C. Faloutsos

  46. cc mpg skip Problem#4: dim. reduction • given attributes x1, ... xn • possibly, non-linearly correlated • drop the useless ones C. Faloutsos

  47. cc mpg skip Problem#4: dim. reduction • given attributes x1, ... xn • possibly, non-linearly correlated • drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’) Solution: keep on dropping attributes, until the f.d. changes! [SBBD’00] C. Faloutsos

  48. Outline • problems • Fractals • Solutions • Discussion • what else can they solve? • how frequent are fractals? C. Faloutsos

  49. Fractals & power laws: appear in numerous settings: • medical • geographical / geological • social • computer-system related • <and many-many more! see [Mandelbrot]> C. Faloutsos

  50. Log(#octants) 2.63 = fd octree levels Fractals: Brain scans • brain-scans C. Faloutsos

More Related