
Data Mining Tools


Presentation Transcript


  1. Data Mining Tools: a crash course. C. Faloutsos

  2. Subset of: www.cs.cmu.edu/~christos/TALKS/SIGMETRICS03-tut/

  3. High-level Outline
  • [I - Traditional Data Mining tools: classification, CART trees; clustering
  • II - Time series: analysis and forecasting: ARIMA; Fourier, Wavelets]
  • III - New Tools: SVD
  • IV - New Tools: Fractals & power laws

  4. High-level Outline
  • [I - Traditional Data Mining tools
  • II - Time series: analysis and forecasting]
  • III - New Tools: SVD
  • IV - New Tools: Fractals & power laws

  5. III - SVD - outline
  • Introduction - motivating problems
  • Definition - properties
  • Interpretation / intuition
  • Solutions to posed problems
  • Conclusions

  6. SVD - Motivation
  • Problem #1: find patterns in a matrix (e.g., traffic patterns from several IP sources)
  • compression; dimensionality reduction

  7. Problem #1
  • ~10^6 rows; ~10^3 columns; no updates
  • compress / find patterns

  8. SVD - in short: it gives the best hyperplane to project on.

  10. III - SVD - outline
  • Introduction - motivating problems
  • Definition - properties
  • Interpretation / intuition
  • Solutions to posed problems
  • Conclusions

  11. SVD - Definition
  • A = U L V^T - example:

  12. (Skip) SVD - notation. Conventions:
  • bold capitals -> matrix (e.g., A, U, L, V)
  • bold lower-case -> column vector (e.g., x, v1, u3)
  • regular lower-case -> scalars (e.g., λ1, λr)

  13. (Skip) SVD - Definition
  A[n x m] = U[n x r] L[r x r] (V[m x r])^T
  • A: n x m matrix (e.g., n customers, m days)
  • U: n x r matrix (n customers, r 'concepts')
  • L: r x r diagonal matrix (strength of each 'concept'; r: rank of the matrix)
  • V: m x r matrix (m days, r 'concepts')

  14. (Skip) SVD - Properties
  THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U L V^T, where
  • U, L, V: unique (*)
  • U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix)
  • L: its diagonal entries (the singular values, loosely called 'eigenvalues' below) are positive, sorted in decreasing order
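
As a concrete check of these properties, a minimal numpy sketch (the 4 x 3 matrix below is invented for illustration; numpy returns L as a vector s of singular values):

    import numpy as np

    # Toy matrix, standing in for the n x m customers-by-days matrix.
    A = np.array([[1.0, 1.0, 0.0],
                  [2.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0],
                  [0.0, 0.0, 4.0]])

    # Reduced ('economy') SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # U column-orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # V column-orthonormal
    print(np.all(np.diff(s) <= 0))                      # s sorted in decreasing order
    print(np.allclose(U @ np.diag(s) @ Vt, A))          # A = U L V^T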

  15. SVD - example
  • customers; days; #packets
  • (figure: the customers-by-days matrix, rows split into customer groups 'Com.' and 'Res.')

  16. SVD - Example
  • A = U L V^T - example:
  • (figure: the customers-by-days matrix written as a product of three matrices; day labels We, Th, Fr, Sa, Su; customer groups Com./Res.)

  17. III - SVD - outline
  • Introduction - motivating problems
  • Definition - properties
  • Interpretation / intuition
    • #1: customers, days, 'concepts'
    • #2: best projection - dimensionality reduction
  • Solutions to posed problems
  • Conclusions

  18. SVD - Interpretation #1: 'customers', 'days' and 'concepts'
  • U: customer-to-concept similarity matrix
  • V: day-to-concept similarity matrix
  • L: its diagonal elements give the 'strength' of each concept

  19. SVD - Interpretation #1
  • A = U L V^T - example: rank = 2, so L is a 2x2 diagonal matrix
  • (figure: the factored matrices; day labels We..Su, customer groups Com./Res.)

  20. SVD - Interpretation #1
  • A = U L V^T - example: rank = 2 means 2 'concepts'
  • (figure: as above)

  21. (reminder)
  • customers; days; #packets
  • (figure: the data matrix, customer groups Com./Res.)

  22. SVD - Interpretation #1
  • A = U L V^T - example:
  • U: customer-to-concept similarity matrix
  • (figure: the first column of U is the 'weekday' concept, the second the 'weekend' concept)

  24. (Skip) SVD - Interpretation #1
  • A = U L V^T - example:
  • (figure: the columns of U are unit vectors)

  25. SVD - Interpretation #1
  • A = U L V^T - example:
  • (figure: the first diagonal entry of L is the strength of the 'weekday' concept)

  26. SVD - Interpretation #1
  • A = U L V^T - example:
  • V: day-to-concept similarity matrix
  • (figure: the first column of V is the 'weekday' concept)
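
To make interpretation #1 concrete, a small numpy sketch with a made-up rank-2 customers-by-days matrix in the spirit of the slides' example (all numbers invented):

    import numpy as np

    # Hypothetical 5 customers x 5 days (We, Th, Fr, Sa, Su) packet counts:
    # the first three customers are active on weekdays, the last two on weekends.
    A = np.array([[2, 2, 2, 0, 0],
                  [4, 4, 4, 0, 0],
                  [6, 6, 6, 0, 0],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 5, 5]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = np.sum(s > 1e-10)            # numerical rank; here r = 2 'concepts'

    print(s[:r])                     # strength of each concept
    print(np.round(U[:, :r], 2))     # customer-to-concept similarities
    print(np.round(Vt[:r, :].T, 2))  # day-to-concept similarities (columns of V)

Each customer loads on exactly one concept in this toy matrix; real data would mix them.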

  27. III - SVD - outline
  • Introduction - motivating problems
  • Definition - properties
  • Interpretation / intuition
    • #1: customers, days, 'concepts'
    • #2: best projection - dimensionality reduction
  • Solutions to posed problems
  • Conclusions

  28. SVD - Interpretation #2
  • best axis to project on ('best' = minimizes the sum of squared projection errors)

  29. SVD - Interpretation #2 (figure only)

  30. SVD - Interpretation #2 (figure only)

  31. SVD - Interpretation #2
  • SVD gives the best axis v1 to project on (figure: projection onto v1 has minimum RMS error)

  32. SVD - Interpretation #2
  • A = U L V^T - example: (figure: the product, with v1 highlighted as the first column of V)

  33. SVD - Interpretation #2
  • A = U L V^T - example: (figure: L's first entry gives the variance, i.e. 'spread', on the v1 axis)

  34. SVD - Interpretation #2
  • SVD gives the best axis v1 to project on; the spread along v1 is measured by λ1 (minimum RMS error)
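
A quick numpy sketch of interpretation #2: find v1, project onto it, and measure the RMS error (the 2-d point cloud is invented):

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented point cloud, elongated along the x-axis and roughly
    # centered at the origin (std ~3 along x, ~0.5 along y).
    X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    v1 = Vt[0]                     # best axis to project on

    proj = np.outer(X @ v1, v1)    # each point's projection onto v1
    rms = np.sqrt(np.mean(np.sum((X - proj) ** 2, axis=1)))
    print(v1, rms)                 # v1 ~ (+-1, 0); RMS error ~ 0.5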

  35. SVD, PCA and the v vectors
  • how to 'read' the v vectors (= principal components)

  36. SVD
  • Recall: A = U L V^T - example: (figure: the factored matrices)

  37. SVD
  • First principal component v1: the weekday entries (We, Th, Fr) share the same sign -> weekdays are positively correlated
  • similarly for v2 (Sa, Su)
  • (we'll see negative correlations later)

  38. (Skip) SVD - Complexity
  • O(n * m^2) or O(n^2 * m), whichever is less
  • less work if we just want the eigenvalues
  • ... or if we only want the first k eigenvectors
  • ... or if the matrix is sparse [Berry]
  • Implemented in any linear algebra package (LINPACK, matlab, Splus, mathematica, ...)
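
For the sparse / first-k cases, one option is scipy's truncated SVD; a sketch, with scipy standing in (by assumption) for the packages listed above:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Hypothetical large sparse matrix: 10^4 x 10^3, ~0.1% non-zeros.
    A = sparse_random(10_000, 1_000, density=0.001, format="csr", random_state=0)

    # Truncated SVD: only the k strongest singular triplets, much cheaper
    # than a full O(n * m^2) decomposition.
    U, s, Vt = svds(A, k=5)

    # svds returns singular values in ascending order; flip to match the
    # decreasing-order convention used on these slides.
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    print(s)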

  39. SVD - conclusions so far
  • SVD: A = U L V^T: unique (*)
  • U: row-to-concept similarities
  • V: column-to-concept similarities
  • L: strength of each concept
  • (*) see [Press+92]

  40. SVD - conclusions so far
  • dimensionality reduction: keep the first few strongest eigenvalues (80-90% of the 'energy' [Fukunaga])
  • SVD picks up linear correlations
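
A sketch of that 80-90% 'energy' heuristic: keep the smallest k whose singular values retain the requested fraction of the total energy (the threshold and the test matrix below are illustrative):

    import numpy as np

    def choose_k(s, energy=0.90):
        """Smallest k such that the first k singular values keep the
        requested fraction of the total energy (sum of squares)."""
        cum = np.cumsum(s ** 2) / np.sum(s ** 2)
        return int(np.searchsorted(cum, energy) + 1)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 20)) @ np.diag(np.linspace(2.0, 0.1, 20))
    s = np.linalg.svd(A, compute_uv=False)
    print(choose_k(s))          # k that keeps 90% of the energy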

  41. III - SVD - outline
  • Introduction - motivating problems
  • Definition - properties
  • Interpretation / intuition
  • Solutions to posed problems
    • P1: patterns in a matrix; compression
  • Conclusions

  42. SVD & visualization: • Visualization for free! • Time-plots are not enough: C. Faloutsos

  43. SVD & visualization: • Visualization for free! • Time-plots are not enough: C. Faloutsos

  44. SVD & visualization
  • phone-call dataset: SVD projects the 365-d vectors onto the best 2 dimensions, then plot
  • (figure: the 2-d scatter-plot shows no Gaussian clusters; Zipf-like distribution)

  45. SVD and visualization
  • NBA dataset: ~500 players; ~30 attributes (#games, #points, #rebounds, ...)

  46. SVD and visualization
  • could be a network dataset: N IP sources, k attributes (#http bytes, #http packets)
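
A sketch of this 'visualization for free' recipe: project each row onto the two strongest right singular vectors (v1, v2) and scatter-plot. The matrix below is a random stand-in for the players-by-attributes (or IPs-by-attributes) data, and matplotlib is an assumption:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)

    # Stand-in for a ~500 x ~30 players-by-attributes matrix.
    A = rng.lognormal(size=(500, 30))

    # Center, then project every row onto the best 2-d plane (v1, v2).
    Ac = A - A.mean(axis=0)
    U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
    XY = Ac @ Vt[:2].T            # n x 2 coordinates

    plt.scatter(XY[:, 0], XY[:, 1], s=5)
    plt.xlabel("v1"); plt.ylabel("v2")
    plt.show()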

  47. SVD ~ PCA (= Principal Component Analysis)
  • PCA: get the eigenvectors v1, v2, ...
  • ignore entries with small absolute value
  • try to interpret the rest
  • Moreover: PCA gives rules for free!

  48. Skip PCA & Rules NBA dataset - V matrix (term to ‘concept’ similarities) v1 C. Faloutsos

  49. Skip PCA & Rules • (Ratio) Rule#1: minutes:points = 2:1 • corresponding concept? v1 C. Faloutsos

  50. Skip PCA & Rules • RR1: minutes:points = 2:1 • corresponding concept? • A: ‘goodness’ of player • (in a systems setting, could be ‘volume of traffic’ generated by this IP address) C. Faloutsos
