Stream Clustering CSE 902
Stream analysis • Stream: Continuous flow of data • Challenges • Volume: Not possible to store all the data • One-time access: Not possible to process the data using multiple passes • Real-time analysis: Certain applications need real-time analysis of the data • Temporal locality: Data evolves over time, so the model should be adaptive
Stream Clustering • [Figure: article listings grouped into topic clusters]
Stream Clustering • Online Phase • Summarize the data into memory-efficient data structures • Offline Phase • Use a clustering algorithm to find the data partition
Prototypes • Stream, LSearch
CF-Trees • Summarize the data in CF-vectors, each storing • Linear sum of data points • Squared sum of data points • Number of points • Examples: Scalable k-means, Single pass k-means
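The three statistics in a CF-vector are additive, which is what makes single-pass summarization possible. The sketch below is a minimal illustration under that assumption; the class name `CFVector` and its methods are hypothetical and not taken from any of the algorithms named on the slide.

```python
import math


class CFVector:
    """Illustrative cluster-feature vector: count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # per-dimension linear sum
        self.ss = [0.0] * dim  # per-dimension squared sum

    def add(self, x):
        """Absorb one point; the point itself can then be discarded."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        """Two summaries combine by simple addition of their statistics."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the absorbed points from the centroid."""
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))
```

The centroid and radius come from the sums alone, so an offline clustering step can work on the summaries without revisiting the stream.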
Microclusters • CF-Trees with a “time” element • CluStream • Stores the linear sum and squared sum of timestamps • Deletes old microclusters, or merges microclusters whose timestamps are close to each other • Sliding Window Clustering • Stores the timestamp of the most recent data point added to the vector • Maintains only the most recent T microclusters • DenStream • Microclusters are weighted by recency • Outliers are detected by creating separate microclusters
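A rough sketch of how timestamp sums can be carried alongside the usual CF statistics is given below. It is a simplification: staleness is judged by the mean timestamp only, which is an assumption made here for brevity and not CluStream's exact rule. `MicroCluster` and `drop_stalest` are hypothetical names.

```python
import math


class MicroCluster:
    """CF-vector extended with linear and squared sums of timestamps."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim
        self.t_ls = 0.0  # linear sum of timestamps
        self.t_ss = 0.0  # squared sum of timestamps

    def add(self, x, t):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.t_ls += t
        self.t_ss += t * t

    def mean_time(self):
        return self.t_ls / self.n

    def time_std(self):
        var = self.t_ss / self.n - self.mean_time() ** 2
        return math.sqrt(max(var, 0.0))


def drop_stalest(clusters):
    """Return the clusters without the one whose mean timestamp is oldest."""
    stalest = min(clusters, key=lambda c: c.mean_time())
    return [c for c in clusters if c is not stalest]
```

When the microcluster budget is exhausted, a call such as `clusters = drop_stalest(clusters)` frees one slot for new data.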
Grids • D-Stream • Assign the data to grid cells • Each cell is weighted by the recency of the points added to it • Each cell is associated with a label • DGClust • Distributed clustering of sensor data • Sensors maintain local copies of the grid and communicate updates to the grid to a central site
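The recency weighting of grid cells can be sketched with an exponentially decayed density. The update rule used here (multiply the old density by `decay ** elapsed`, then add 1 for the new point) and the names `DecayedGrid`, `cell_size`, and `decay` are assumptions for illustration, not a transcription of D-Stream.

```python
import math


class DecayedGrid:
    """Maps points to grid cells and keeps a recency-weighted density per cell."""

    def __init__(self, cell_size, decay):
        self.cell_size = cell_size
        self.decay = decay  # assumed to lie strictly between 0 and 1
        self.cells = {}     # cell index -> (density, time of last update)

    def cell_of(self, x):
        return tuple(math.floor(v / self.cell_size) for v in x)

    def add(self, x, t):
        """Decay the cell's density to time t, then count the new point."""
        key = self.cell_of(x)
        density, last = self.cells.get(key, (0.0, t))
        density = density * self.decay ** (t - last) + 1.0
        self.cells[key] = (density, t)
        return key

    def density(self, key, t):
        """Density of a cell as seen at time t, without modifying it."""
        density, last = self.cells[key]
        return density * self.decay ** (t - last)
```

A cell's label (for instance dense or sparse) could then be derived by thresholding `density`; the thresholds are left out because the slide does not state them.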
StreamKM++ (Coresets) • A weighted set S is a coreset for a data set D if the clustering of S approximates the clustering of D with an error margin of $\epsilon$ • Maintain data in buckets • The first bucket can hold any number of points between 0 and m • Every other bucket contains either exactly 0 or exactly m points • Merge data in buckets using a coreset tree • StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012
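The bucket bookkeeping can be sketched as a merge-and-reduce scheme. In the sketch below, `reduce_stub` is a hypothetical stand-in that keeps m sampled points and rescales their weights; it is not the coreset-tree reduction used by StreamKM++, and it is there only so the bucket invariants can be exercised.

```python
import random


def reduce_stub(points, m, rng):
    """Stand-in reduction: keep m weighted points, preserving total weight."""
    total = sum(w for _, w in points)
    kept = rng.sample(points, m)
    scale = total / sum(w for _, w in kept)
    return [(x, w * scale) for x, w in kept]


class BucketStream:
    """Bucket 0 holds 0..m raw points; higher buckets hold 0 or exactly m."""

    def __init__(self, m, seed=0):
        self.m = m
        self.buckets = [[]]
        self.rng = random.Random(seed)

    def insert(self, x):
        if len(self.buckets[0]) == self.m:
            self._carry_up(self.buckets[0])
            self.buckets[0] = []
        self.buckets[0].append((x, 1.0))

    def _carry_up(self, carry):
        """Push a full bucket upward, merging and reducing on collisions."""
        i = 1
        while True:
            if i == len(self.buckets):
                self.buckets.append([])
            if not self.buckets[i]:
                self.buckets[i] = carry
                return
            carry = reduce_stub(self.buckets[i] + carry, self.m, self.rng)
            self.buckets[i] = []
            i += 1

    def total_weight(self):
        return sum(w for b in self.buckets for _, w in b)
```

The carry step behaves like incrementing a binary counter, so the number of buckets grows only logarithmically with the number of points seen.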
Kernel-based Stream Clustering • Use non-linear distance measures to define similarity between data points in the stream • Challenges • Quadratic running time complexity • Computationally expensive to compute centers using linear sums and squared sums (CF-vector approach will not work)
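To make the two challenges concrete, the sketch below computes the squared feature-space distance from a point to a cluster center using only kernel values. The center is never formed explicitly, so linear and squared sums of the raw points do not help, and the within-cluster term touches every pair of members. The function names and the choice of an RBF kernel are illustrative assumptions.

```python
import math


def rbf_kernel_matrix(points, gamma):
    """Full n x n kernel matrix; building it already costs O(n^2)."""
    n = len(points)
    return [[math.exp(-gamma * sum((a - b) ** 2
                                   for a, b in zip(points[i], points[j])))
             for j in range(n)]
            for i in range(n)]


def sq_dist_to_center(K, i, members):
    """Squared distance from point i to the implicit mean of `members`.

    Expands ||phi(x_i) - mean||^2 into kernel evaluations only.
    """
    m = len(members)
    cross = sum(K[i][j] for j in members) / m
    within = sum(K[j][l] for j in members for l in members) / (m * m)
    return K[i][i] - 2.0 * cross + within
```

With a linear kernel the result reduces to the ordinary squared Euclidean distance to the centroid, which is a convenient sanity check.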
Stream Kernel k-means (sKKM) • Kernel k-means • Weighted Kernel k-means • History from only the preceding data chunk retained • Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012
Statistical Leverage Scores • Measure the influence of a point in the low-rank approximation • Leverage score
Statistical Leverage Scores • Used to characterize the matrices that can be approximated accurately with a sample of the entries • Leverage scores are 1, 1, 1: all rows are equally important, so all the entries of the matrix need to be sampled • If the singular vectors/eigenvectors are spread out (uncorrelated with the standard basis), then we can approximate the matrix with a small number of samples
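As a small worked illustration, the sketch below computes leverage scores as the squared row norms of an orthonormal basis for a matrix's column space. This corresponds to the rank-k definition with k equal to the rank of the matrix, and it uses modified Gram-Schmidt instead of an SVD to stay dependency-free; both choices are assumptions made for this example, and the function names are hypothetical.

```python
import math


def orthonormal_columns(A, tol=1e-12):
    """Modified Gram-Schmidt on the columns of A, skipping dependent ones."""
    n = len(A)
    basis = []
    for j in range(len(A[0])):
        v = [A[i][j] for i in range(n)]
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > tol:
            basis.append([a / norm for a in v])
    return basis


def leverage_scores(A):
    """Score of row i = squared norm of row i of the orthonormal basis."""
    basis = orthonormal_columns(A)
    return [sum(q[i] ** 2 for q in basis) for i in range(len(A))]
```

For the 3 x 3 identity matrix every score is 1, so no row can be skipped. For a 3 x 3 matrix of all ones the single basis vector is spread evenly and every score is 1/3, so a small sample of rows represents the matrix well.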
Approximate Stream kernel k-means • Uses statistical leverage score to determine which data points in the stream are potentially “important” • Retain the important points and discard the rest • Use an approximate version of kernel k-means to obtain the clusters – Linear time complexity • Bounded amount of memory
Importance Sampling • Sampling probability • Kernel matrix construction
Clustering • Using kernel k-means to recluster M each time a point is added will be expensive • Reduce complexity by employing a low-dimensional representation of the data • Constrain the cluster centers to the subspace spanned by the top k eigenvectors of the kernel matrix • Kernel k-means • “Approximate” Kernel k-means
Clustering • “Approximate” Kernel k-means • Solve by running k-means on - running time complexity
Updating eigenvectors • Only the eigenvectors and eigenvalues of the kernel matrix are required for both sampling and clustering • Update the eigenvectors and eigenvalues incrementally • [Equation residue: running time complexity; contains the eigenvalues of sparse matrix; component orthogonal to]
Network Traffic Monitoring • Clustering used to detect intrusions in the network • Network Intrusion Data set • TCP dump data from seven weeks of LAN traffic • 10 classes: 9 types of intrusions, 1 class of legitimate traffic • Around 200 points clustered per second
Summary • Efficient kernel-based stream clustering algorithm - linear running time complexity • Memory required is bounded • Real-time clustering is possible • Limitation: does not account for data evolution