A Framework for Clustering Evolving Data Streams. Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu. Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03). Presenter: 吳建良. Outline. Cluster analysis: A general overview. Developed methodology. Micro-cluster analysis and maintenance.
A Framework for Clustering Evolving Data Streams • Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu • Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03) • Presenter: 吳建良
Outline • Cluster analysis: A general overview • Developed methodology • Micro-cluster analysis and maintenance • Macro-cluster analysis • Evolution analysis • Empirical results
Cluster analysis: A general overview • What is cluster analysis? Grouping a set of data objects into clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low • New requirements in stream clustering • Generate high-quality clusters in one scan • High-quality, efficient incremental clustering • Analysis should handle multi-dimensional data • Provide flexibility to compute clusters over user-defined time periods
Developed methodology: Outline • Methodology • Divide the clustering process into online and offline components • Online: periodically stores summary statistics about the stream data • Micro-clustering: better quality than k-means • Online processing and maintenance • Pyramidal time window: register dynamic changes • Offline: answers various user queries based on the stored summary statistics
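Clustering Feature Vector • Clustering Feature: CF = (N, LS, SS) • N: number of data points • LS: $\sum_{i=1}^{N} X_i$ (linear sum of the points) • SS: $\sum_{i=1}^{N} X_i^2$ (sum of the squares of the points) • Originated from BIRCH • Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16, 30), (54, 190))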
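The CF example on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names `clustering_feature` and `merge_cf` are illustrative and do not come from BIRCH or the paper.

```python
def clustering_feature(points):
    """Return (N, LS, SS) for a non-empty list of equal-length numeric tuples."""
    if not points:
        raise ValueError("need at least one point")
    d = len(points[0])
    n = len(points)
    ls = tuple(sum(p[j] for p in points) for j in range(d))       # linear sum per dimension
    ss = tuple(sum(p[j] ** 2 for p in points) for j in range(d))  # squared sum per dimension
    return n, ls, ss


def merge_cf(a, b):
    """Additivity: the CF of two disjoint point sets is the component-wise sum."""
    n = a[0] + b[0]
    ls = tuple(x + y for x, y in zip(a[1], b[1]))
    ss = tuple(x + y for x, y in zip(a[2], b[2]))
    return n, ls, ss


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))  # (5, (16, 30), (54, 190))
```

Because a CF is only a set of sums, two CFs can be merged without revisiting the raw points, which is what makes the summary usable in a single pass.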
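Micro-Clusters: Design Methodology • Data streams • Multi-dimensional points $\overline{X_1}, \ldots, \overline{X_k}, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$ • Each point contains d dimensions, i.e., $\overline{X_i} = (x_i^1, \ldots, x_i^d)$ • A micro-cluster for n points is defined as a (2*d + 3) tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$ • $\overline{CF2^x}$: the sum of the squares of the data values (d entries) • $\overline{CF1^x}$: the sum of the data values (d entries) • $CF2^t$: the sum of the squares of the time stamps • $CF1^t$: the sum of the time stamps • n: the number of data points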
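The (2*d + 3) tuple can be pictured as a small record that is updated one point at a time. The sketch below is only an illustration; the class name `MicroCluster` and its methods are assumptions made for this example, not an implementation from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroCluster:
    cf2x: List[float]  # per-dimension sum of squared data values (d entries)
    cf1x: List[float]  # per-dimension sum of data values (d entries)
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points

    @classmethod
    def empty(cls, d):
        return cls([0.0] * d, [0.0] * d, 0.0, 0.0, 0)

    def add(self, x, t):
        """Fold one d-dimensional point x with time stamp t into the summary."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def as_tuple(self):
        """Flatten to the (2*d + 3) tuple described on the slide."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)

    def centroid(self):
        return [s / self.n for s in self.cf1x]


mc = MicroCluster.empty(2)
mc.add((3, 4), t=1)
mc.add((2, 6), t=2)
print(len(mc.as_tuple()))  # 7, i.e. 2*d + 3 for d = 2
```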
Pyramidal Time Frame • Snapshots • The micro-clusters are also stored at particular moments in the stream • Snapshots are classified into frame numbers that range from 0 to log(T), where T is the clock time elapsed since the beginning of the stream • The frame number of a class of snapshots defines the level of time granularity at which those snapshots are maintained
Maintain Snapshot Frame Table • Rules for inserting a snapshot taken at time t into the frame table • If $t \bmod \alpha^{i} = 0$ but $t \bmod \alpha^{i+1} \neq 0$, t is inserted into frame number i • Each slot has a max_capacity. If the slot has already reached its max_capacity, the oldest snapshot is removed and the new snapshot is inserted • Example: • α = 2 • max_capacity = 3
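The insertion rule can be traced with a short Python sketch that uses the example values α = 2 and max_capacity = 3. The helper names `frame_number` and `build_frame_table` are hypothetical, and the sketch stores only the clock times, not the micro-cluster snapshots themselves.

```python
from collections import deque


def frame_number(t, alpha=2):
    """Largest i such that alpha**i divides t (t must be a positive clock tick)."""
    if t < 1:
        raise ValueError("clock time must be >= 1")
    i = 0
    while t % (alpha ** (i + 1)) == 0:
        i += 1
    return i


def build_frame_table(t_max, alpha=2, max_capacity=3):
    """Replay clock ticks 1..t_max and return {frame number: retained snapshot times}."""
    table = {}
    for t in range(1, t_max + 1):
        slot = table.setdefault(frame_number(t, alpha), deque(maxlen=max_capacity))
        slot.append(t)  # a full deque silently drops its oldest entry
    return {i: list(slot) for i, slot in table.items()}


print(build_frame_table(8))  # {0: [3, 5, 7], 1: [2, 6], 2: [4], 3: [8]}
```

After 8 ticks, frame 0 has already evicted the snapshot taken at t = 1, while the coarser frames still hold every snapshot they received.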
Micro-cluster Maintenance • The micro-clustering stage is the online, statistical data collection part, and it does not depend on user input • Initial creation of q micro-clusters M1 … Mq • Use the k-means clustering algorithm • q is usually significantly larger than the number of natural clusters • q is determined by the amount of available memory • Each micro-cluster is assigned a unique id when it is created
Incremental Update of Micro-clusters • When a new data point $X_{i_k}$ arrives, it is either added to an existing micro-cluster or a new micro-cluster is created • If $X_{i_k}$ falls within the maximum boundary of its closest micro-cluster $M_p$, $X_{i_k}$ is added to $M_p$ • Maximum boundary: the RMS deviation of the data points in $M_p$ from its centroid • RMS deviation: $\sqrt{\frac{1}{n}\sum_{j=1}^{n} \lVert X_{i_j} - c_p \rVert^2}$, where $c_p$ is the centroid of $M_p$ and n is its number of points • Otherwise, a new micro-cluster is created for $X_{i_k}$
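The absorb-or-create decision can be sketched directly on the summary statistics, since the RMS deviation follows from N, LS and SS alone. The sketch below assumes each micro-cluster is a plain `(n, ls, ss)` tuple; the function names and the `boundary_factor` parameter are hypothetical, and a singleton cluster (RMS deviation 0) would need an extra rule that is not shown here.

```python
import math


def centroid(n, ls):
    return [s / n for s in ls]


def rms_deviation(n, ls, ss):
    """RMS distance of the summarized points from their centroid, from (n, LS, SS)."""
    var = sum(ss[j] / n - (ls[j] / n) ** 2 for j in range(len(ls)))
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding errors


def absorb_or_create(x, clusters, boundary_factor=1.0):
    """Add x to its closest micro-cluster if inside the boundary, else start a new one.

    clusters is a list of (n, ls, ss) tuples and is modified in place.
    Returns the index of the micro-cluster that received x.
    """
    p = min(range(len(clusters)),
            key=lambda k: math.dist(x, centroid(clusters[k][0], clusters[k][1])))
    n, ls, ss = clusters[p]
    if math.dist(x, centroid(n, ls)) <= boundary_factor * rms_deviation(n, ls, ss):
        clusters[p] = (n + 1,
                       [a + b for a, b in zip(ls, x)],
                       [a + b * b for a, b in zip(ss, x)])
        return p
    clusters.append((1, list(x), [v * v for v in x]))
    return len(clusters) - 1


clusters = [(5, [16, 30], [54, 190])]   # the five example points, centroid (3.2, 6)
print(absorb_or_create((3, 6), clusters))    # 0: inside the boundary, absorbed
print(absorb_or_create((10, 10), clusters))  # 1: too far away, new micro-cluster
```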
Incremental Update of Micro-clusters (Contd.) • To make room for a new micro-cluster: delete an old cluster or merge the two closest clusters? • A micro-cluster is deleted when the average time stamp of its last m points is less than a given threshold • Otherwise, the two closest micro-clusters are merged by adding their corresponding cluster feature vectors • An idlist holding the ids of the two merged micro-clusters is created for the merged micro-cluster
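The merge step can be sketched as follows. This is a rough illustration that covers only merging, not the deletion test; it assumes each micro-cluster is stored as `(n, ls, ss)` under a key that is a tuple of ids standing in for the idlist, and the name `merge_closest` is hypothetical.

```python
import math
from itertools import combinations


def merge_closest(clusters):
    """Merge the two micro-clusters with the nearest centroids, in place.

    clusters maps an id tuple (the idlist) to (n, ls, ss).
    Returns the idlist of the merged micro-cluster.
    """
    def centroid(cf):
        n, ls, _ = cf
        return [s / n for s in ls]

    a, b = min(combinations(clusters, 2),
               key=lambda pair: math.dist(centroid(clusters[pair[0]]),
                                          centroid(clusters[pair[1]])))
    (na, lsa, ssa), (nb, lsb, ssb) = clusters.pop(a), clusters.pop(b)
    merged_ids = a + b  # union of the two idlists
    clusters[merged_ids] = (na + nb,
                            [x + y for x, y in zip(lsa, lsb)],
                            [x + y for x, y in zip(ssa, ssb)])
    return merged_ids


clusters = {
    (1,): (2, [2.0, 2.0], [2.0, 2.0]),        # centroid (1, 1)
    (2,): (1, [1.5, 1.0], [2.25, 1.0]),       # centroid (1.5, 1)
    (3,): (1, [10.0, 10.0], [100.0, 100.0]),  # centroid (10, 10)
}
print(merge_closest(clusters))  # (1, 2)
```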
Macro-Cluster Creation • Macro-clusters are created over a user-specified time horizon h • Let $S(t_c)$ be the set of micro-clusters at time $t_c$ and $S(t_c - h)$ the set of micro-clusters at time $t_c - h$ • The new set of micro-clusters $N(t_c - h)$ is created by subtracting $S(t_c - h)$ from $S(t_c)$ • Subtractive property • Let $C_1$ and $C_2$ be two sets of points such that $C_1 \supseteq C_2$. Then $CF(C_1 - C_2) = CF(C_1) - CF(C_2)$
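The subtraction over a horizon can be sketched on two snapshots. This minimal example assumes each snapshot is a dict from a single micro-cluster id to `(n, ls, ss)` and ignores idlists produced by merges; the name `subtract_snapshots` is hypothetical.

```python
def subtract_snapshots(current, earlier):
    """Return the micro-clusters that summarize only the points of the last horizon.

    current and earlier map micro-cluster id -> (n, ls, ss). Statistics already
    present in the earlier snapshot are subtracted component-wise.
    """
    result = {}
    for cid, (n, ls, ss) in current.items():
        if cid in earlier:
            n0, ls0, ss0 = earlier[cid]
            n = n - n0
            ls = [a - b for a, b in zip(ls, ls0)]
            ss = [a - b for a, b in zip(ss, ss0)]
        if n > 0:  # drop micro-clusters that received no points within the horizon
            result[cid] = (n, ls, ss)
    return result


current = {1: (5, [16, 30], [54, 190]), 2: (2, [20, 20], [200, 200])}
earlier = {1: (2, [5, 10], [13, 52])}
print(subtract_snapshots(current, earlier))
# {1: (3, [11, 20], [41, 138]), 2: (2, [20, 20], [200, 200])}
```

Micro-cluster 1 keeps only the three points that arrived after the earlier snapshot, while micro-cluster 2, created within the horizon, is kept unchanged.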
Macro-Cluster Creation (Contd.) • Each micro-cluster in $N(t_c - h)$ is treated as a pseudo-point • Each pseudo-point has a weight proportional to the number of points inside it • A k-means clustering approach is applied to this set of pseudo-points to create higher-level macro-clusters
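One way to picture the offline step is a weighted k-means over the micro-cluster centroids. The sketch below is not the paper's exact procedure: it assumes the initial centers are supplied by the caller, and the name `weighted_kmeans` is hypothetical.

```python
import math


def weighted_kmeans(points, weights, centers, iterations=20):
    """Cluster pseudo-points so that each counts with its weight in the mean."""
    def assign(cs):
        return [min(range(len(cs)), key=lambda k: math.dist(p, cs[k])) for p in points]

    centers = [list(map(float, c)) for c in centers]
    for _ in range(iterations):
        labels = assign(centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [i for i, lab in enumerate(labels) if lab == k]
            total = sum(weights[i] for i in members)
            if total == 0:  # keep an empty cluster's center where it was
                new_centers.append(old)
                continue
            new_centers.append([
                sum(weights[i] * points[i][j] for i in members) / total
                for j in range(len(old))
            ])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign(centers)


pseudo_points = [(0, 0), (2, 0), (10, 10), (10, 12)]  # micro-cluster centroids
weights = [1, 3, 2, 2]                                # points per micro-cluster
centers, labels = weighted_kmeans(pseudo_points, weights, [(0, 0), (10, 10)])
print(centers, labels)  # [[1.5, 0.0], [10.0, 11.0]] [0, 0, 1, 1]
```

The first macro-cluster center lands at x = 1.5 rather than 1.0 because the pseudo-point at (2, 0) carries three times the weight of the one at (0, 0).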
Evolution Analysis of Micro-Clusters • In many cases, it is desirable to find how the micro-clusters have changed over time • Given a user-specified time horizon h and two clock times, t1 and t2 (where t1 < t2) • Compare the data arriving in (t2–h, t2) with the data arriving in (t1–h, t1) to analyze how the stream has evolved
Evolution Analysis of Micro-Clusters (Contd.) • The analysis answers the following questions • Are there new clusters in the data at time t2 which were not present at time t1? • Find micro-clusters in N(t2-h) which are not present in N(t1-h) • Have some of the original clusters been lost? • Find micro-clusters in N(t1-h) which are not present in N(t2-h) • Have some of the original clusters at time t1 shifted in position and nature?
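The first two questions reduce to set differences over micro-cluster ids, given t1 < t2. A minimal sketch, assuming the ids of the micro-clusters in each horizon are available as plain collections; the name `compare_horizons` and the result keys are hypothetical.

```python
def compare_horizons(ids_t1, ids_t2):
    """Classify micro-cluster ids seen in the horizons ending at t1 and t2 (t1 < t2)."""
    ids_t1, ids_t2 = set(ids_t1), set(ids_t2)
    return {
        "added": ids_t2 - ids_t1,     # new at t2, absent at t1
        "removed": ids_t1 - ids_t2,   # present at t1, lost by t2
        "retained": ids_t1 & ids_t2,  # candidates for having shifted in position
    }


print(compare_horizons({1, 2, 3}, {2, 3, 4, 5}))
```

Answering the third question would additionally require comparing the statistics of the retained micro-clusters, which this sketch does not attempt.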
Empirical Result • Data sets • Real Data Sets: Network Intrusion and KDD Cup 98 data set (Charitable Donation) • Synthetic Data Sets: • Gaussian Distribution • Base Size: 100k ~ 1000k points • # Cluster: 4 ~ 64 • Dimensionality: 10 ~ 100
Cluster Quality (Network Intrusion) Horizon H=256, Stream_speed=200 Horizon H=1, Stream_speed=2000
Cluster Quality (Charitable Donation) Horizon H=16, Stream_speed=200 Horizon H=4, Stream_speed=2000
Scalability Stream_speed=2000
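Sum of Square Distance (SSQ) • Assume there are a total of N points $p_1, \ldots, p_N$ in the past horizon H at current time $T_c$ • $SSQ(T_c, H) = \sum_{i=1}^{N} d^2(p_i, c_{p_i})$, where $c_{p_i}$ is the centroid of the macro-cluster closest to $p_i$ and d is the distance between a point and a centroid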
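The SSQ measure can be computed in a few lines. This sketch assumes Euclidean distance and plain tuples for points and centroids; the function name `ssq` is illustrative.

```python
import math


def ssq(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


points = [(0, 0), (2, 0), (10, 10)]
centroids = [(1, 0), (10, 10)]
print(ssq(points, centroids))  # 2.0
```

A lower SSQ means the points in the horizon lie closer to their macro-cluster centroids.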
K-means clustering algorithm (figure: scatter plots on 0 to 10 axes illustrating the steps for K=2) • Arbitrarily choose K points as initial cluster centers • Assign each point to the closest center • Update the cluster means • Reassign the points and update the cluster means again