
Stream Clustering


alessa


Presentation Transcript


  1. Stream Clustering CSE 902

  2. Big Data

  3. Stream analysis • Stream: a continuous flow of data • Challenges • Volume: it is not possible to store all the data • One-time access: it is not possible to process the data using multiple passes • Real-time analysis: certain applications need real-time analysis of the data • Temporal locality: the data evolves over time, so the model should be adaptive

  4. Stream Clustering [figure labels: Topic cluster, Article Listings]

  5. Stream Clustering • Online Phase • Summarize the data into memory-efficient data structures • Offline Phase • Use a clustering algorithm to find the data partition

  6. Stream Clustering Algorithms

  7. Prototypes • Stream, LSearch

  8. CF-Trees • Summarize the data in CF-vectors, each storing • Linear sum of data points • Squared sum of data points • Number of points • Algorithms: Scalable k-means, Single pass k-means
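
A minimal Python sketch of the CF-vector from slide 8, assuming per-dimension sums. The class name `CFVector` and its methods are illustrative and do not come from the slides. The point it shows is that the three statistics are additive, so a summary can be updated and merged in a single pass, and the centroid and radius can be recovered from them without the raw points.

```python
import math


class CFVector:
    """Clustering feature: point count, linear sum, and squared sum per dimension."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0
        self.ls = [0.0] * dim  # linear sum of the points
        self.ss = [0.0] * dim  # squared sum of the points

    def add(self, x):
        """Absorb one point in O(dim) time."""
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v

    def merge(self, other):
        """CF-vectors are additive, so merging is a component-wise sum."""
        self.n += other.n
        for j in range(self.dim):
            self.ls[j] += other.ls[j]
            self.ss[j] += other.ss[j]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        var = sum(self.ss[j] / self.n - (self.ls[j] / self.n) ** 2
                  for j in range(self.dim))
        return math.sqrt(max(var, 0.0))
```

For example, adding the points (0, 0) and (2, 0) gives a centroid of (1, 0) and a radius of 1.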

  9. Microclusters • CF-Trees with a “time” element • CluStream • Linear sum and squared sum of timestamps • Delete old microclusters, or merge microclusters whose timestamps are close to each other • Sliding Window Clustering • Timestamp of the most recent data point added to the vector • Maintain only the most recent T microclusters • DenStream • Microclusters are associated with weights based on recency • Outliers are detected by creating separate microclusters for them
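
The recency weighting mentioned for DenStream can be sketched as follows. This assumes an exponential fading function of the form 2^(-lam * dt), which is the form usually associated with DenStream; the class name `FadingMicroCluster`, the parameter `lam`, and the method names are hypothetical. Only the weight and the linear sum are kept, to keep the sketch short.

```python
class FadingMicroCluster:
    """Microcluster whose statistics decay exponentially with elapsed time."""

    def __init__(self, dim, lam):
        self.lam = lam          # decay rate (assumed parameter name)
        self.weight = 0.0       # faded number of points
        self.ls = [0.0] * dim   # faded linear sum
        self.t_last = None      # time of the last update

    def _fade_to(self, t):
        """Scale the stored statistics by 2^(-lam * elapsed time)."""
        if self.t_last is not None:
            f = 2.0 ** (-self.lam * (t - self.t_last))
            self.weight *= f
            self.ls = [f * s for s in self.ls]
        self.t_last = t

    def add(self, x, t):
        self._fade_to(t)
        self.weight += 1.0
        self.ls = [s + v for s, v in zip(self.ls, x)]

    def weight_at(self, t):
        """Weight the microcluster would have at time t with no new points."""
        if self.t_last is None:
            return 0.0
        return self.weight * 2.0 ** (-self.lam * (t - self.t_last))

    def center(self):
        return [s / self.weight for s in self.ls]
```

A microcluster whose `weight_at(t)` falls below a chosen threshold is a candidate for deletion or for treatment as an outlier microcluster; the threshold itself is a design choice not given in the slides.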

  10. Grids • D-Stream • Assign the data to grid cells • Each grid cell is weighted by the recency of the points added to it • Each grid cell is associated with a label • DGClust • Distributed clustering of sensor data • Sensors maintain local copies of the grid and communicate their updates to a central site
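
A small sketch of the grid idea in the style of D-Stream: points are mapped to cells, and each cell keeps a density that decays with time. The class name `DecayedGrid`, the parameters `cell_size` and `decay`, and the label thresholds are assumptions made for illustration; the slides do not give concrete values or names.

```python
import math


class DecayedGrid:
    """Maps points to grid cells and keeps a time-decayed density per cell."""

    def __init__(self, cell_size, decay):
        self.cell_size = cell_size
        self.decay = decay   # factor in (0, 1) applied per unit of time
        self.cells = {}      # cell index -> (density, time of last update)

    def cell_of(self, x):
        return tuple(math.floor(v / self.cell_size) for v in x)

    def add(self, x, t):
        """Decay the cell's density to time t, then count the new point."""
        key = self.cell_of(x)
        density, t_last = self.cells.get(key, (0.0, t))
        self.cells[key] = (density * self.decay ** (t - t_last) + 1.0, t)
        return key

    def density(self, key, t):
        density, t_last = self.cells.get(key, (0.0, t))
        return density * self.decay ** (t - t_last)

    def label(self, key, t, dense_at, sparse_at):
        """Label a cell by comparing its density with two assumed thresholds."""
        d = self.density(key, t)
        if d >= dense_at:
            return "dense"
        if d <= sparse_at:
            return "sparse"
        return "transitional"
```

In an offline phase, neighbouring cells labelled dense would typically be connected into clusters; that step is omitted here.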

  11. StreamKM++ (Coresets) • A weighted set S is a coreset for a data set D if the clustering of S approximates the clustering of D with an error margin of ε • Maintain the data in buckets • The first bucket can hold any number of points between 0 and m • Every other bucket holds either exactly 0 or exactly m points • Merge the data in buckets using a coreset tree • Reference: StreamKM++: A Clustering Algorithm for Data Streams, Ackermann et al., Journal of Experimental Algorithmics, 2012
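
The bucket bookkeeping on slide 11 follows a merge-and-reduce pattern, sketched below. Note the loud assumption: `reduce_stub` is a placeholder that keeps m points by uniform sampling and rescales their weights; it is not the coreset-tree construction that StreamKM++ uses, and all names here (`BucketStream`, `reduce_stub`, `summary`) are hypothetical.

```python
import random


def reduce_stub(points, m, rng):
    """Placeholder reduction (NOT a coreset tree): uniform sample of m weighted
    points, with weights rescaled so the total weight is preserved."""
    total = sum(w for _, w in points)
    kept = rng.sample(points, m)
    scale = total / sum(w for _, w in kept)
    return [(x, w * scale) for x, w in kept]


class BucketStream:
    """Bucket 0 holds up to m raw points; every other bucket holds 0 or m points."""

    def __init__(self, m, seed=0):
        self.m = m
        self.rng = random.Random(seed)
        self.buckets = [[]]  # buckets[0] collects raw points with weight 1

    def insert(self, x):
        self.buckets[0].append((x, 1.0))
        if len(self.buckets[0]) < self.m:
            return
        # Bucket 0 is full: carry its m points upward, merging with full buckets.
        carry, self.buckets[0] = self.buckets[0], []
        i = 1
        while i < len(self.buckets) and self.buckets[i]:
            carry = reduce_stub(self.buckets[i] + carry, self.m, self.rng)
            self.buckets[i] = []
            i += 1
        if i == len(self.buckets):
            self.buckets.append([])
        self.buckets[i] = carry

    def summary(self):
        """All weighted points currently stored, across buckets."""
        return [p for bucket in self.buckets for p in bucket]
```

With m = 4 and 14 inserted points, the buckets hold 2, 4, and 4 points, so only about m log(n/m) points are stored at any time. A weighted clustering algorithm would then be run on `summary()`.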

  12. Kernel-based Clustering

  13. Kernel-based Stream Clustering • Use non-linear distance measures to define similarity between data points in the stream • Challenges • Quadratic running time complexity • Computationally expensive to compute centers using linear sums and squared sums (CF-vector approach will not work)
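
To make the second challenge concrete: in kernel space the cluster center is only defined implicitly, so the distance from a point to a center has to be expanded in terms of kernel evaluations against every member of the cluster. The sketch below uses the standard expansion of the squared feature-space distance; the function names and the choice of an RBF kernel with parameter `gamma` are assumptions for illustration.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel, used here only as an example of a non-linear similarity."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_dist2(x, cluster, kernel=rbf):
    """Squared feature-space distance from x to the mean of `cluster`:
    k(x, x) - (2/n) * sum_y k(x, y) + (1/n^2) * sum_{y, z} k(y, z)."""
    n = len(cluster)
    cross = sum(kernel(x, y) for y in cluster) / n
    within = sum(kernel(y, z) for y in cluster for z in cluster) / (n * n)
    return kernel(x, x) - 2.0 * cross + within
```

The `within` term alone needs a number of kernel evaluations that is quadratic in the cluster size, and none of the terms can be reduced to a fixed-size linear sum or squared sum of the raw points, which is why the CF-vector approach does not carry over.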

  14. Stream Kernel k-means (sKKM) • Kernel k-means • Weighted Kernel k-means • History from only the preceding data chunk is retained • Reference: Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

  15. Statistical Leverage Scores • Measure the influence of a point in the low-rank approximation • Leverage score

  16. Statistical Leverage Scores • Used to characterize the matrices that can be approximated accurately from a sample of their entries • Leverage scores are 1, 1, 1: all rows are equally important, so all the entries of the matrix need to be sampled • If the singular vectors/eigenvectors are spread out (uncorrelated with the standard basis), then the matrix can be approximated with a small number of samples
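
The formula for the leverage score did not survive in the transcript, so the sketch below uses a common textbook definition as an assumption: the leverage score of row i is the squared norm of row i of an orthonormal basis for the column space of the matrix. To stay within the standard library it builds that basis with Gram-Schmidt over the full column space (assuming full column rank) instead of a truncated SVD; the function names are illustrative.

```python
import math


def orthonormal_columns(a):
    """Modified Gram-Schmidt on the columns of `a` (a list of rows).
    Assumes the columns are linearly independent."""
    n, d = len(a), len(a[0])
    basis = []
    for j in range(d):
        v = [a[i][j] for i in range(n)]
        for q in basis:
            proj = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - proj * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    return basis  # d orthonormal vectors, each of length n


def leverage_scores(a):
    """Squared row norms of the orthonormal basis; they sum to the rank."""
    basis = orthonormal_columns(a)
    return [sum(q[i] ** 2 for q in basis) for i in range(len(a))]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
spread_out = [[1.0], [1.0], [1.0], [1.0]]
print(leverage_scores(identity))    # every row has score 1
print(leverage_scores(spread_out))  # every row has score 0.25
```

The two inputs mirror the contrast on the slide: for the identity matrix every row is indispensable, while for a matrix whose basis is spread evenly over the rows, any row is about as informative as any other and a small sample suffices.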

  17. Approximate Stream kernel k-means • Uses statistical leverage scores to determine which data points in the stream are potentially “important” • Retain the important points and discard the rest • Use an approximate version of kernel k-means to obtain the clusters • Linear time complexity • Bounded amount of memory

  18. Approximate Stream kernel k-means

  19. Importance Sampling • Sampling probability • Kernel matrix construction

  20. Clustering • Using kernel k-means to recluster M each time a point is added will be expensive • Reduce complexity by employing a low-dimensional representation of the data • Constrain the cluster centers to the top k eigenvectors of the kernel matrix • Kernel k-means • “Approximate” Kernel k-means

  21. Clustering • “Approximate” Kernel k-means • Solve by running k-means on … • Running time complexity: …
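
The idea on slides 20 and 21, restricting the centers to the span of the top eigenvectors and then running ordinary k-means in that low-dimensional space, can be sketched as follows. Several things here are assumptions rather than the authors' method: the RBF kernel, the power iteration with deflation used to get the eigenpairs, the scaling of each eigenvector by the square root of its eigenvalue, the deterministic farthest-point initialization, and all function and parameter names.

```python
import math


def rbf_matrix(points, gamma):
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]


def top_eigenpairs(k_mat, n_eig, iters=200):
    """Power iteration with deflation; assumes a symmetric PSD matrix
    whose rank is at least n_eig."""
    n = len(k_mat)
    mat = [row[:] for row in k_mat]
    pairs = []
    for r in range(n_eig):
        v = [1.0 + 0.1 * ((i + r) % 3) for i in range(n)]  # fixed start vector
        for _ in range(iters):
            w = [sum(m * x for m, x in zip(row, v)) for row in mat]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(vi * sum(m * x for m, x in zip(row, v))
                  for vi, row in zip(v, mat))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                mat[i][j] -= lam * v[i] * v[j]
    return pairs


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(rows, k, iters=20):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in rows]
        for j in range(k):
            members = [p for p, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def approx_kernel_kmeans(points, k, n_eig, gamma=1.0):
    """k-means on the rows of the eigenvector embedding of the kernel matrix."""
    pairs = top_eigenpairs(rbf_matrix(points, gamma), n_eig)
    rows = [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(len(points))]
    return kmeans_labels(rows, k)
```

Each k-means iteration now costs time proportional to the number of points times `n_eig`, instead of touching the full kernel matrix, which is the source of the saving the slide refers to.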

  22. Updating eigenvectors • Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering • Update the eigenvectors and eigenvalues incrementally • Running time complexity: … • … contains the eigenvalues of sparse matrix … • Component orthogonal to …

  23. Approximate Stream Kernel k-means

  24. Network Traffic Monitoring • Clustering used to detect intrusions in the network • Network Intrusion Data set • TCP dump data from seven weeks of LAN traffic • 10 classes: 9 types of intrusions, 1 class of legitimate traffic • Around 200 points clustered per second

  25. Summary • Efficient kernel-based stream clustering algorithm - linear running time complexity • Memory required is bounded • Real-time clustering is possible • Limitation: does not account for data evolution
