Data Stream Clustering. Jiaqi Gu and Hsuan Chiu
Outline • STREAM • Streaming-Data Algorithms for High-Quality Clustering, by Liadan O'Callaghan, Nina Mishra, Adam Meyerson, Sudipto Guha, Rajeev Motwani, (ICDE) 2001. • CluStream • A Framework for Clustering Evolving Data Streams, by Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu, (VLDB) 2003. • DenStream • Density-based Clustering over an Evolving Data Stream with Noise, by F. Cao, M. Ester, W. Qian, and A. Zhou, (SDM) 2006.
Data Stream (DS) Properties • No random access; one pass only. • Limited memory: only summary information can be kept. • Big data set; linear scan only. • Ex: router packet statistics, sensor network data.
Challenges of Clustering on DS • Target: the k-median clustering problem. • Given the DS properties, what are the new challenges for clustering? Traditional methods need random access and are not scalable.
Goal of Clustering on DS • Efficient memory usage. • Linear data access.
STREAM Algorithm (diagram: the stream is split into chunks X1, X2, X3, ...; each chunk is clustered with LSEARCH, and the resulting weighted cluster centers are then clustered again with LSEARCH to produce the final output.)
LSEARCH Algorithm • A variant of k-median. • Given N points, k clusters expected. • Set an initial cluster solution C. • Randomly pick a set of sample points S from N as feasible medians (FM). • For each point p in S, replace a cluster center c in the current solution C with p if p gives a "better" solution. • What does a "better" solution mean? • Lower cost.
Facility cost minimization • Instead of the SSQ minimization problem, solve the facility clustering (FC) cost minimization problem. • The number of medians is not restricted to at most k; a facility cost z is charged per median instead. • Why? k need not be held fixed in the intermediate stages of STREAM, and relaxing this constraint reduces the running time. Intuition: it is enough to cluster into roughly k groups. (See the cost sketch below.)
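To make the objective concrete, here is a minimal Python sketch of the facility clustering cost (not code from the paper; function and variable names are illustrative): each open median pays the facility cost z, and each point pays its distance to the nearest median.

```python
import math

def facility_cost(points, medians, z):
    """Facility clustering (FC) cost sketch: z per open median plus each
    point's distance to its nearest median. (The SSQ variant would
    square the distance instead.)"""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    assignment_cost = sum(min(dist(p, m) for m in medians) for p in points)
    return z * len(medians) + assignment_cost
```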
LSEARCH Algorithm • Step 1. Set the initial solution. InitialSolution(data set N, facility cost z) 1. Reorder the data points randomly. 2. The first point becomes the first cluster center, c1. 3. For each remaining point pi in N: with probability distance(pi, nearest existing center)/z, create a new cluster centered at pi; otherwise assign pi to its nearest center. 4. Output: initial medians set C. (A runnable sketch follows.)
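A hedged Python sketch of InitialSolution as summarized above; `initial_solution` and its helpers are illustrative names, and the probability rule follows the slide's description (distance to the nearest existing center divided by z).

```python
import math
import random

def initial_solution(points, z):
    """Sketch of InitialSolution: probabilistically open facilities,
    so far-away points are more likely to become new centers."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    points = list(points)
    random.shuffle(points)                 # 1. reorder data points
    centers = [points[0]]                  # 2. first point opens the first cluster
    for p in points[1:]:                   # 3. remaining points
        d = min(dist(p, c) for c in centers)
        if random.random() < d / z:
            centers.append(p)              # open a new facility at p
        # otherwise p is simply assigned to its nearest center
    return centers
```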
LSEARCH Algorithm • Step 2. Randomly pick a set of sample points S as feasible medians (FM), where the sample size |S| is prescribed in the paper. • They proved that, using only this randomly chosen set S, LSEARCH is still likely to choose good medians but finishes the computation sooner. • Similar idea: CLARA.
LSEARCH Algorithm • Step 3. For each point p in S, replace a cluster center c in the current solution C with p if p contributes a lower facility cost than c.
LSEARCH Algorithm • Step 4. So far we have not said how to pick the facility cost z. • The number of clusters produced by the previous steps is generally not the desired k. If the number of clusters |C| >> k, increase z; else decrease z (binary search on z; a sketch follows).
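A sketch of this outer binary search, assuming a `local_search(points, z)` callable that performs Steps 1-3 and returns the chosen medians (the name and the iteration cap are illustrative):

```python
def search_facility_cost(points, k, local_search, z_max, iters=30):
    """Binary search on the facility cost z (sketch): a larger z makes
    opening facilities more expensive, so fewer clusters are produced."""
    z_lo, z_hi = 0.0, z_max
    best = None
    for _ in range(iters):
        z = (z_lo + z_hi) / 2.0
        medians = local_search(points, z)  # Steps 1-3 with this z
        best = medians
        if len(medians) > k:
            z_lo = z   # too many clusters: make facilities pricier
        elif len(medians) < k:
            z_hi = z   # too few clusters: make facilities cheaper
        else:
            break
    return best
```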
LSEARCH Algorithm (summary diagram: Step 1 initial solution → Step 2 sample feasible medians → Step 3 replace centers with better solutions → Step 4 adjust z until about k clusters remain.)
Pros of STREAM • Guaranteed cluster quality. • One-pass data scan. • Running time: see the analysis in the paper. • Memory efficient: after the i-th chunk of the stream, only O(ik) intermediate medians need to be retained.
What about data evolution? • What other property do data streams have? • Temporal locality. • How does it influence the underlying clusters? • The clusters may change over time. • How can we cluster over different time windows?
Cons of STREAM • Clustering results are dominated by the outdated history of the stream. • Answering queries over arbitrary horizons would require simultaneously maintaining intermediate clustering results over all possible time horizons. • The computational burden increases.
CluStream Clustering Framework • Online micro-clustering • Stores summary statistics of the fast data stream. • Offline macro-clustering • Uses the summary statistics to cluster the data over a user-specified time horizon.
Online micro-clustering • How is cluster information stored? • As micro-clusters: additive cluster-feature vectors extended with temporal statistics (a sketch follows).
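A minimal sketch of such a micro-cluster summary, assuming the usual additive cluster-feature layout (per-dimension sums and squared sums of the data values plus sums of the timestamps); the class and field names are illustrative, not the paper's code.

```python
class MicroCluster:
    """Sketch of a CluStream micro-cluster: additive summary statistics,
    so two micro-clusters can be merged (or subtracted) component-wise."""
    def __init__(self, dim):
        self.cf1x = [0.0] * dim   # per-dimension sum of data values
        self.cf2x = [0.0] * dim   # per-dimension sum of squared values
        self.cf1t = 0.0           # sum of timestamps
        self.cf2t = 0.0           # sum of squared timestamps
        self.n = 0                # number of points absorbed

    def absorb(self, point, timestamp):
        for i, x in enumerate(point):
            self.cf1x[i] += x
            self.cf2x[i] += x * x
        self.cf1t += timestamp
        self.cf2t += timestamp * timestamp
        self.n += 1

    def center(self):
        return [s / self.n for s in self.cf1x]   # assumes at least one point
```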
Online micro-clustering • When should the micro-clusters be stored as a snapshot? • It is impossible to store a snapshot at each and every moment in time. • When a user issues a clustering request over the past h time units, the stored snapshots should provide enough information for clustering. • Solution: the Pyramidal Time Frame.
Online micro-clustering • Pyramidal Time Frame (diagram: snapshots are stored at different orders of granularity; a snapshot of order i is taken at clock times divisible by 2^i, so recent history is kept at a finer granularity than the distant past.)
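A small sketch of the bookkeeping this implies, assuming base α = 2 and that only the most recent α^l + 1 snapshots of each order are retained (the function names and the exact retention constant are illustrative):

```python
def snapshot_order(t, alpha=2):
    """Order of the snapshot taken at clock time t: the largest i such
    that alpha**i divides t (order 0 if alpha does not divide t)."""
    if t <= 0:
        return 0
    order = 0
    while t % alpha == 0:
        t //= alpha
        order += 1
    return order

def store_snapshot(frames, t, snapshot, alpha=2, l=1):
    """Keep only the most recent alpha**l + 1 snapshots of each order
    (illustrative sketch of the pyramidal time frame bookkeeping)."""
    order = snapshot_order(t, alpha)
    bucket = frames.setdefault(order, [])
    bucket.append((t, snapshot))
    if len(bucket) > alpha ** l + 1:
        bucket.pop(0)   # discard the oldest snapshot of this order
```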
Online micro-clustering • Initialize q micro-clusters by an offline k-means pass over the first points of the stream. • For each new data point p that arrives: • If p falls within the boundary of its closest micro-cluster Mp, add p to Mp. • Else create a new micro-cluster Mnew: • If deleting an old micro-cluster is safe, delete the micro-cluster with the least recent relevance time stamp. • Else merge the two closest micro-clusters and denote the merged cluster with an id list. (A sketch of this loop follows.)
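A hedged sketch of this maintenance loop; the callables passed in (`max_boundary`, `relevance_stamp`, `is_safe_to_delete`, `merge_closest_pair`, `new_micro_cluster`) are placeholders for the paper's boundary, relevance-stamp, deletion, and merging rules.

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def process_point(p, t, micro_clusters, max_clusters,
                  max_boundary, relevance_stamp, is_safe_to_delete,
                  merge_closest_pair, new_micro_cluster):
    """Sketch of CluStream's online step for one arriving data point p."""
    nearest = min(micro_clusters, key=lambda m: _dist(p, m.center()))
    if _dist(p, nearest.center()) <= max_boundary(nearest):
        nearest.absorb(p, t)                     # p joins the closest micro-cluster
        return
    fresh = new_micro_cluster(p, t)              # otherwise open a new micro-cluster
    if len(micro_clusters) >= max_clusters:
        stale = min(micro_clusters, key=relevance_stamp)
        if is_safe_to_delete(stale):
            micro_clusters.remove(stale)         # drop the least relevant cluster
        else:
            merge_closest_pair(micro_clusters)   # merge two clusters, keep an id list
    micro_clusters.append(fresh)
```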
Offline macro clustering • No one-pass constraint. • Clustering is allowed over different time horizons. • Input: the number of clusters k and the time horizon h. • The approximate micro-clusters for a pre-specified horizon can be determined by subtracting the snapshot stored just before tc - h from the current snapshot; the statistics are additive, so subtraction is well defined (a sketch follows).
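A sketch of that subtraction, reusing the `MicroCluster` summary sketched earlier and matching micro-clusters by id (the id-list handling of merged clusters is omitted):

```python
def horizon_micro_clusters(current, older):
    """Approximate the micro-clusters over a time horizon by subtracting
    the older snapshot's additive statistics from the current ones.
    `current` and `older` map micro-cluster id -> MicroCluster."""
    result = {}
    for mc_id, mc in current.items():
        old = older.get(mc_id)
        diff = MicroCluster(dim=len(mc.cf1x))
        diff.cf1x = [a - (old.cf1x[i] if old else 0.0) for i, a in enumerate(mc.cf1x)]
        diff.cf2x = [a - (old.cf2x[i] if old else 0.0) for i, a in enumerate(mc.cf2x)]
        diff.cf1t = mc.cf1t - (old.cf1t if old else 0.0)
        diff.cf2t = mc.cf2t - (old.cf2t if old else 0.0)
        diff.n = mc.n - (old.n if old else 0)
        if diff.n > 0:
            result[mc_id] = diff   # keep only micro-clusters active in the horizon
    return result
```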
Experiment • STREAM vs. CluStream • Datasets • KDD-CUP'99 Network Intrusion Detection stream dataset (real-time intrusion detection; a rapidly evolving data stream). • KDD-CUP'98 Charitable Donation dataset (a relatively stable distribution over time). • Metrics • Accuracy: average SSQ. • Efficiency: processing rate.
Accuracy on evolving data • STREAM clustering is based on the entire history of the data stream. • STREAM retains less summary information.
Accuracy on stable data • STREAM clustering is based on the entire history of the data stream. • STREAM retains less summary information.
Efficiency on evolving data • CluStream uses k-means only initially to compute the micro-clusters. • STREAM needs several iterations for its clustering to converge on each chunk. • CluStream updates the existing micro-clusters to do further clustering.
Problems of CluStream • Distance-based measurement • leads to spherical clusters only. • Clusters with arbitrary shape • would have to be represented by all the points in the clusters • unrealistic in a streaming setting. • Maintaining a fixed number of micro-clusters • risky due to noise in data streams.
Problems of DBSCAN • Maintains all points in memory • but memory is limited in streaming applications. • Only handles a relatively stable environment • not fast-changing streams (one-pass, evolving).
Cluster Analysis in a Streaming Environment • Requirements for stream clustering: • No assumption on the number of clusters • the number of clusters is unknown and changing. • Discovery of clusters with arbitrary shape • needed for different applications. • Ability to handle outliers • which arise from the influence of various factors.
Overview of DenStream • Adopts two aspects of the CluStream framework: • stores summary information, weighted by a fading function; • separates the clustering process into • online micro-cluster maintenance and • offline cluster generation by a variant of DBSCAN. • Contributions: • summarizes clusters with arbitrary shape; • adopts a novel pruning strategy; • uses an outlier-buffer.
Ideas in DBSCAN • Core object: a point whose epsilon-neighborhood contains data points with overall weight at least mu. • Density area: the union of the epsilon-neighborhoods of core objects. • Maintaining such a precise result is unrealistic on a stream • approximate it using core-micro-clusters.
How to approximate? • Maintain a number of c-micro-clusters that is • much larger than the number of "natural" clusters, yet • significantly smaller than the number of points in the data stream. • Each cluster is denoted by a set of non-redundant c-micro-clusters. • In an evolving environment, use potential c-micro-clusters and outlier micro-clusters.
Point Weight • Fading function f(t) = 2^(-λt), where λ > 0 denotes the decay rate. • Each point is weighted by f(t) according to its age. • The larger λ, the lower the importance of the historical data. • Overall weight of the data stream: W = v / (1 - 2^(-λ)) as tc → ∞, where • tc is the current time, and • v is the speed of the stream, i.e., the number of points arriving in one unit time. (See the code transcription below.)
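In code, the fading function and the limiting stream weight stated above look like this (a direct transcription of the two formulas):

```python
def fading(t, lam):
    """Fading function f(t) = 2^(-lambda * t): older points weigh less."""
    return 2.0 ** (-lam * t)

def overall_stream_weight(v, lam):
    """Limit of the total stream weight W = v / (1 - 2^(-lambda)) as
    tc -> infinity, where v is the number of points per unit time."""
    return v / (1.0 - 2.0 ** (-lam))
```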
Core-micro-cluster • A c-micro-cluster CMC(w, c, r) at time t summarizes a group of close points p_1, ..., p_n with time stamps T_1, ..., T_n: • Weight w = Σ_j f(t - T_j), required to satisfy w ≥ μ. • Center c = Σ_j f(t - T_j) p_j / w. • Radius r = Σ_j f(t - T_j) dist(p_j, c) / w, required to satisfy r ≤ ε, where dist is the Euclidean distance. • The number of c-micro-clusters is much larger than the number of natural clusters, yet far smaller than the number of stream points.
Incremental Mining • The roles of clusters and outliers often exchange in an evolving data stream. • Introduce potential c-micro-clusters and outlier micro-clusters for incremental computation • similar to the micro-clusters used in CluStream; • the main difference is the weight constraint. • Maintain all outlier micro-clusters in a separate memory space, the outlier-buffer.
Potential c-micro-cluster • A p-micro-cluster at time t summarizes a group of close points p_1, ..., p_n with time stamps T_1, ..., T_n by {CF1, CF2, w}: • Weight w = Σ_j f(t - T_j), with w ≥ βμ (0 < β ≤ 1). • Weighted linear sum CF1 = Σ_j f(t - T_j) p_j and weighted squared sum CF2 = Σ_j f(t - T_j) p_j². • Center c = CF1 / w. • Radius r = sqrt(|CF2| / w - (|CF1| / w)²), with r ≤ ε. (A maintenance sketch follows.)
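A hedged sketch of maintaining a p-micro-cluster with these weighted sums: all statistics decay by the same factor when time advances, and an inserted point contributes weight 1. The class and method names are illustrative, and the radius uses per-dimension sums as an approximation of the paper's formula.

```python
import math

class PMicroCluster:
    """Sketch of a potential c-micro-cluster: weight, weighted linear sum
    (CF1) and weighted squared sum (CF2), all fading over time."""
    def __init__(self, point, t, lam):
        self.lam = lam
        self.last_update = t
        self.w = 1.0
        self.cf1 = list(point)
        self.cf2 = [x * x for x in point]

    def decay(self, t):
        """Fade all statistics from the last update time to time t."""
        factor = 2.0 ** (-self.lam * (t - self.last_update))
        self.w *= factor
        self.cf1 = [x * factor for x in self.cf1]
        self.cf2 = [x * factor for x in self.cf2]
        self.last_update = t

    def insert(self, point, t):
        self.decay(t)
        self.w += 1.0
        for i, x in enumerate(point):
            self.cf1[i] += x
            self.cf2[i] += x * x

    def center(self):
        return [x / self.w for x in self.cf1]

    def radius(self):
        # radius from the weighted second moment (approximate sketch)
        var = sum(self.cf2[i] / self.w - (self.cf1[i] / self.w) ** 2
                  for i in range(len(self.cf1)))
        return math.sqrt(max(var, 0.0))
```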
Outlier micro-cluster • An o-micro-cluster at time t also summarizes a group of close points with time stamps. • The definitions of w, CF1, CF2, center, and radius are the same as for a p-micro-cluster. • In addition, to = T_1 records the creation time. • The weight satisfies w < βμ.
Memory Limit • Need a way to control the number of o-micro-clusters, • which would otherwise increase continuously. • Provide an opportunity for o-micro-clusters to grow into p-micro-clusters, • while controlling the number of both • via a pruning strategy.
Clustering Algorithm • Online: incrementally maintain the micro-clusters. • Initialization: apply the DBSCAN algorithm to the first InitN points to create an initial group of p-micro-clusters. • Prune p- and o-micro-clusters periodically. • Offline: generate clusters by a variant of DBSCAN.
Pruning Strategy for cp • If no new point joins cp, the weight of cp decays. • If the weight of cp drops below βμ, delete cp from memory. • Check every cp once every Tp time periods, • where Tp = ⌈(1/λ) log2(βμ / (βμ - 1))⌉ is the minimal time span for a cp to fade into an outlier. • A similar strategy (with a lower weight limit based on the creation time) is used for each co.
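A sketch of the check interval and the periodic pruning test, reusing the `PMicroCluster` sketch above; the Tp formula is transcribed from the reconstruction on this slide and should be checked against the paper.

```python
import math

def check_interval(lam, beta, mu):
    """Tp = ceil((1/lambda) * log2(beta*mu / (beta*mu - 1))): pruning every
    Tp time units is assumed sufficient (requires beta * mu > 1)."""
    return math.ceil((1.0 / lam) * math.log2(beta * mu / (beta * mu - 1.0)))

def prune_p_micro_clusters(p_clusters, t, beta, mu):
    """Keep only p-micro-clusters whose decayed weight is still >= beta*mu."""
    for mc in p_clusters:
        mc.decay(t)            # bring each weight up to the current time
    return [mc for mc in p_clusters if mc.w >= beta * mu]
```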
Generate Clusters • Each p-micro-cluster cp is regarded as a virtual point • located at the center of cp and associated with its weight w. • A variant of the DBSCAN algorithm is applied to these virtual points to obtain the final clusters. (A sketch follows.)
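A hedged sketch of this offline step, using scikit-learn's DBSCAN with sample weights as a stand-in for the paper's DBSCAN variant (the weights let each virtual point count with its micro-cluster weight toward the density requirement):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def offline_clustering(p_clusters, eps, mu):
    """Cluster p-micro-cluster centers as weighted virtual points.
    Returns one label per p-micro-cluster (-1 marks noise)."""
    centers = np.array([mc.center() for mc in p_clusters])
    weights = np.array([mc.w for mc in p_clusters])
    # min_samples plays the role of mu: enough fading weight in the
    # eps-neighborhood is required to form a dense region
    labels = DBSCAN(eps=eps, min_samples=int(mu)).fit(
        centers, sample_weight=weights).labels_
    return labels
```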
Experimental Results (cluster-quality plots; panels: EDS with h=2, v=2000; EDS with h=10, v=1000; Network Intrusion data with h=1, v=1000; Network Intrusion data with h=5, v=1000.)