1 / 56

Statistical Approaches to Mining Multivariate Data Streams

Statistical Approaches to Mining Multivariate Data Streams. Eric Vance, Duke University Department of Statistical Science David Banks, Duke University Tamraparni Dasu, AT&T Labs - Research. JSM July 31, 2007 Salt Lake City, Utah. Data Streams. Huge amounts of complex data

tudor
Download Presentation

Statistical Approaches to Mining Multivariate Data Streams

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistical Approaches to Mining Multivariate Data Streams Eric Vance, Duke University Department of Statistical Science David Banks, Duke University Tamraparni Dasu, AT&T Labs - Research JSM July 31, 2007 Salt Lake City, Utah

  2. Data Streams • Huge amounts of complex data • Rapid rate of accumulation • One-time access to raw data

  3. Change Detection • Problem: Change detection in complicated data streams • Three criteria • Nonparametric • Fast • Statistical guarantees

  4. E-Commerce Server Data • Data description • One server in a network of servers • Data polled every 5 minutes • 27 variables • 5 week time period • Quality issues • Many variables unimportant or unchanging • Missing data

  5. E-Commerce Server Data • Variable Elimination • Several variables remain constant (Total Swap) • Predictable and non-informative (Used SysDisk Space) • Correlated (CPU Used%, CPU User%) • 6 Variables Selected • CPU Used% • Number of Procs • Number of Threads • Ping Latency • Used Swap • Log Packets = log(In Packets + Out Packets)

  6. Daily Boxplots CPU Procs Threads

  7. Daily Boxplots Ping Swap Packets

  8. Our Approach • Partitioning scheme (Multivariate Histogram) • Data depth: Rank each point in relation to Mahalanobis distance from center of data • Data Pyramid: Determine which direction in the data is most extreme • Profile based comparison • Identify changes in profiles over time (week to week)

  9. Partitioning Example in 2D • 5 “center-outward” depth layers • 4 pyramids A: depth 4 pyramid +y B: depth 3 pyramid -x C: depth 1 pyramid +x A C B

  10. Identify Depth and Direction • Compute center of comparison Data Sphere • Calculate Mahalanobis distance for each point • In which of the 5 quantiles of depth is • Determine direction of greatest variation for

  11. Data Partition in 6 Dimensions

  12. Partitioning Into Bins

  13. Partitioning Into Bins CPU

  14. Partitioning Into Bins CPU CPU

  15. Partitioning Into Bins CPU Procs

  16. Partitioning Into Bins Procs CPU Procs

  17. Partitioning Into Bins Procs CPU Threads

  18. Partitioning Into Bins Procs CPU Threads Threads

  19. Partitioning Into Bins Procs CPU Threads Threads Threads

  20. Partitioning Into Bins Procs CPU Threads Threads Threads Threads

  21. Partitioning Into Bins Procs CPU Threads Ping

  22. Partitioning Into Bins Procs CPU Threads Ping Ping

  23. Partitioning Into Bins Procs CPU Threads Ping Swap

  24. Partitioning Into Bins Procs CPU Threads Ping Swap Swap

  25. Partitioning Into Bins Procs CPU Threads Ping Swap Packets

  26. Partitioning Into Bins Procs CPU Packets Threads Ping Swap Packets

  27. Partitioning Into Bins: Weeks 1 and 2

  28. Partitioning Into Bins: Weeks 1 and 2 Swap

  29. Partitioning Into Bins: Weeks 1 and 2 Swap Swap

  30. Partitioning Into Bins: Weeks 1 and 2 Swap Swap Swap

  31. Partitioning Into Bins: Weeks 1 and 2 Swap Swap

  32. Partitioning Into Bins: Weeks 1 and 2 Threads

  33. Partitioning Into Bins: Weeks 1 and 2 Threads Threads

  34. 6 Variables Over 5 Weeks

  35. Week 1 Results

  36. Week 2 Results

  37. 6 Variables Over 5 Weeks

  38. 6 Variables Over 5 Weeks Threads

  39. Week 3 Results

  40. Week 3 Results Threads

  41. Week 4 Results

  42. Week 4 Results Procs

  43. Week 4 Results Threads

  44. 6 Variables Over 5 Weeks

  45. 6 Variables Over 5 Weeks Packets

  46. Week 5 Results

  47. Week 5 Results Packets

  48. Trimmed Mean and Covariance • Applying partitioning method using a trimmed mean and trimmed covariance matrix • Use 90% most central data points • Recompute center • Recompute Mahalanobis distances using new center and new covariance matrix • Bins are more uniformly filled • More points appear closer to trimmed center • More variables become “most extreme”

  49. Trimmed Comparison: Week 1

  50. Trimmed Results: Week 1

More Related