A Framework for Clustering Evolving Data Streams

Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu • Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03) • Presenter: 吳建良

Outline • Cluster analysis: A general overview • Developed methodology • Micro-cluster analysis and maintenance • Macro-cluster analysis • Evolution analysis • Empirical results

Cluster analysis: A general overview • What is cluster analysis? Grouping a set of data objects into a set of clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low • New requirements in stream clustering • Generate high-quality clusters in one scan • High-quality, efficient incremental clustering • Analysis should handle multi-dimensional space • Provide flexibility to compute clusters over user-defined time periods

Developed methodology: Outline • Methodology • Divide the clustering process into online and offline components • Online: periodically stores summary statistics about the stream data • Micro-clustering: better quality than k-means • Online processing and maintenance • Pyramidal time window: register dynamic changes • Offline: answers various user queries based on the stored summary statistics

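Clustering Feature Vector • CF = (N, LS, SS) • N: number of data points • LS: linear sum of the points, $LS = \sum_{i=1}^{N} X_i$ • SS: sum of squares of the points, $SS = \sum_{i=1}^{N} X_i^2$ (taken per dimension) • Originated from BIRCH • Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16, 30), (54, 190))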
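
The example on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names `clustering_feature` and `add_cf` are made up for this note and do not come from BIRCH or the paper.

```python
from typing import List, Sequence, Tuple

CF = Tuple[int, List[float], List[float]]  # (N, LS, SS)


def clustering_feature(points: Sequence[Sequence[float]]) -> CF:
    """Return (N, LS, SS) for a non-empty list of d-dimensional points."""
    d = len(points[0])
    ls = [sum(p[j] for p in points) for j in range(d)]       # per-dimension linear sum
    ss = [sum(p[j] ** 2 for p in points) for j in range(d)]  # per-dimension sum of squares
    return len(points), ls, ss


def add_cf(a: CF, b: CF) -> CF:
    """CF of the union of two disjoint point sets: the component-wise sum."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            [x + y for x, y in zip(a[2], b[2])])


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, [16, 30], [54, 190])
```

Because every component is a plain sum, two CF vectors can be combined without revisiting the underlying points, which is what makes them usable as stream summaries.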

Micro-Clusters: Design Methodology • Data streams • Multi-dimensional points $X_1, \ldots, X_k, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$ • Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$ • A micro-cluster for n points is defined as a (2*d + 3) tuple: - the sum of the squares of the data values (d entries, one per dimension) - the sum of the data values (d entries, one per dimension) - the sum of the squares of the time stamps - the sum of the time stamps - the number of data points
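
As a rough sketch of the (2*d + 3) tuple, a micro-cluster can be held in a small Python class. The class name `MicroCluster` and the field names `cf2x`, `cf1x`, `cf2t`, `cf1t` are labels chosen for this illustration, not identifiers taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MicroCluster:
    cf2x: List[float]   # per-dimension sum of squared data values (d entries)
    cf1x: List[float]   # per-dimension sum of data values (d entries)
    cf2t: float = 0.0   # sum of squared time stamps
    cf1t: float = 0.0   # sum of time stamps
    n: int = 0          # number of data points

    @classmethod
    def empty(cls, d: int) -> "MicroCluster":
        return cls(cf2x=[0.0] * d, cf1x=[0.0] * d)

    def insert(self, x: Sequence[float], t: float) -> None:
        """Fold one point x with time stamp t into the summary."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def as_tuple(self) -> Tuple[float, ...]:
        """Flatten to the (2*d + 3) tuple described on the slide."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)


mc = MicroCluster.empty(d=2)
mc.insert((3, 4), t=1)
mc.insert((2, 6), t=2)
print(len(mc.as_tuple()))  # 7, i.e. 2*d + 3 with d = 2
```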

Pyramidal Time Frame • Snapshots • The micro-clusters are also stored as snapshots at particular moments in the stream • Snapshots are classified into frame numbers, which vary from 0 to log(T), where T is the clock time elapsed since the beginning of the stream • The frame number of a class of snapshots defines the level of granularity in time at which those snapshots are maintained

Maintain Snapshot Frame Table • Rules for inserting a snapshot taken at time t into the frame table • If $(t \bmod \alpha^{i}) = 0$ but $(t \bmod \alpha^{i+1}) \neq 0$, t is inserted into frame number i • Each slot has a max_capacity. If the slot has already reached its max_capacity, the oldest snapshot is removed and the new snapshot is inserted • Example: • α = 2 • max_capacity = 3
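
The insertion rule and the example parameters (α = 2, max_capacity = 3) can be simulated as follows. This is a minimal sketch that stores only the snapshot times, not the micro-clusters themselves; `frame_number` and `build_frame_table` are hypothetical helper names.

```python
from collections import deque
from typing import Deque, Dict


def frame_number(t: int, alpha: int = 2) -> int:
    """Return i such that t mod alpha**i == 0 but t mod alpha**(i+1) != 0 (t >= 1)."""
    i = 0
    while t % (alpha ** (i + 1)) == 0:
        i += 1
    return i


def build_frame_table(t_max: int, alpha: int = 2,
                      max_capacity: int = 3) -> Dict[int, Deque[int]]:
    """Insert snapshots for t = 1..t_max; each frame keeps its newest max_capacity."""
    table: Dict[int, Deque[int]] = {}
    for t in range(1, t_max + 1):
        i = frame_number(t, alpha)
        # A deque with maxlen drops the oldest entry when a new one is appended.
        table.setdefault(i, deque(maxlen=max_capacity)).append(t)
    return table


table = build_frame_table(16)
for i in sorted(table, reverse=True):
    print(i, list(table[i]))
# 4 [16]
# 3 [8]
# 2 [4, 12]
# 1 [6, 10, 14]
# 0 [11, 13, 15]
```

Recent times are covered densely by the low frame numbers, while older times survive only in the coarser, higher frames.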

Micro-clusters Maintenance • The micro-clustering stage is online, statistical data collection that does not depend on user input • Initial creation of q micro-clusters M1 … Mq • Use the k-means clustering algorithm • q is usually significantly larger than the number of natural clusters • q is determined by the amount of available memory • Each micro-cluster is associated with a unique id when it is created

Incremental Update of Micro-clusters • When a new data point $X_{i_k}$ arrives, it is either added to an existing micro-cluster or used to create a new one • If $X_{i_k}$ falls within the maximum boundary of its closest micro-cluster $M_p$, it is added to $M_p$ • Maximum boundary: the RMS deviation of the data points in $M_p$ from its centroid • RMS deviation, in terms of the count N and the per-dimension sums $LS_j$ and $SS_j$ of $M_p$: $\sqrt{\sum_{j=1}^{d} \left( SS_j/N - (LS_j/N)^2 \right)}$ • Otherwise, a new micro-cluster is created for $X_{i_k}$
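
The absorb-or-create decision can be sketched as below, with each micro-cluster reduced to a dict holding the count `n`, linear sums `ls` and squared sums `ss`. The dict layout, the function names and the `factor` parameter are assumptions made for this illustration; the slide uses the RMS deviation itself as the boundary, which corresponds to `factor=1.0`. A single-point cluster has zero RMS deviation and would need special handling that this sketch omits.

```python
import math
from typing import Dict, List, Sequence


def centroid(mc: Dict) -> List[float]:
    return [s / mc["n"] for s in mc["ls"]]


def rms_deviation(mc: Dict) -> float:
    """RMS distance of the summarized points from their centroid, from sums only."""
    n = mc["n"]
    var = sum(ss / n - (ls / n) ** 2 for ls, ss in zip(mc["ls"], mc["ss"]))
    return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


def absorb_or_create(clusters: List[Dict], x: Sequence[float],
                     factor: float = 1.0) -> int:
    """Add x to its closest cluster if within the boundary, else start a new one.

    Returns the index of the cluster that received x. Assumes clusters is non-empty.
    """
    dists = [math.dist(x, centroid(mc)) for mc in clusters]
    p = min(range(len(clusters)), key=dists.__getitem__)
    if dists[p] <= factor * rms_deviation(clusters[p]):
        mc = clusters[p]
        mc["n"] += 1
        mc["ls"] = [a + b for a, b in zip(mc["ls"], x)]
        mc["ss"] = [a + b * b for a, b in zip(mc["ss"], x)]
        return p
    clusters.append({"n": 1, "ls": list(x), "ss": [v * v for v in x]})
    return len(clusters) - 1


# One cluster summarizing (3,4), (2,6), (4,5), (4,7), (3,8): centroid (3.2, 6.0).
clusters = [{"n": 5, "ls": [16, 30], "ss": [54, 190]}]
boundary = rms_deviation(clusters[0])             # about 1.6
near = absorb_or_create(clusters, (3, 7))         # inside the boundary: absorbed
far = absorb_or_create(clusters, (10, 10))        # outside: new micro-cluster
```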

Incremental Update of Micro-clusters (Contd.) • To make room for a new micro-cluster: delete an old cluster or merge the two closest clusters? • A micro-cluster is deleted whenever the average time stamp of its last m points is less than a given threshold • Otherwise, the two closest micro-clusters are merged by adding their corresponding cluster feature vectors • An idlist is created for the merged micro-cluster, recording the ids of the two micro-clusters it came from
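
A simplified version of the delete-or-merge step might look like this. It assumes the average time stamp of the last m points has already been estimated for every micro-cluster and is passed in as the `recency` list; how that estimate is obtained is not shown on the slide and is not modeled here. The names `make_room`, `merge` and the dict layout are illustrative only.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def centroid(mc: Dict) -> List[float]:
    return [s / mc["n"] for s in mc["ls"]]


def merge(a: Dict, b: Dict) -> Dict:
    """Add the cluster feature vectors and keep both id lists."""
    return {"ids": a["ids"] + b["ids"],
            "n": a["n"] + b["n"],
            "ls": [x + y for x, y in zip(a["ls"], b["ls"])],
            "ss": [x + y for x, y in zip(a["ss"], b["ss"])]}


def make_room(clusters: List[Dict], recency: List[float],
              threshold: float) -> Tuple[List[Dict], str]:
    """Free one slot: delete the stalest cluster if it is old enough, else merge."""
    stalest = min(range(len(clusters)), key=recency.__getitem__)
    if recency[stalest] < threshold:
        return [mc for i, mc in enumerate(clusters) if i != stalest], "deleted"
    # Otherwise merge the pair of clusters whose centroids are closest.
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                        centroid(clusters[ij[1]])))
    rest = [mc for k, mc in enumerate(clusters) if k not in (i, j)]
    return rest + [merge(clusters[i], clusters[j])], "merged"


clusters = [
    {"ids": [1], "n": 2, "ls": [2, 2], "ss": [2, 2]},        # centroid (1, 1)
    {"ids": [2], "n": 2, "ls": [4, 4], "ss": [8, 8]},        # centroid (2, 2)
    {"ids": [3], "n": 1, "ls": [10, 10], "ss": [100, 100]},  # centroid (10, 10)
]
recency = [50.0, 60.0, 70.0]  # assumed, precomputed recency per cluster

merged_result, merged_action = make_room(clusters, recency, threshold=40.0)
deleted_result, deleted_action = make_room(clusters, recency, threshold=55.0)
```

With the lower threshold no cluster is stale, so clusters 1 and 2 are merged; with the higher threshold cluster 1 is dropped instead.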

Macro-Cluster Creation • Macro-clusters are created over a user-specified time horizon h • Let $S(t_c)$ be the set of micro-clusters at time $t_c$ and $S(t_c - h)$ the set of micro-clusters at time $t_c - h$ • The new set of micro-clusters $N(t_c - h)$ is created by subtracting $S(t_c - h)$ from $S(t_c)$ • Subtractive property • Let $C_1$ and $C_2$ be two sets of points such that $C_1 \supseteq C_2$. Then $CF(C_1 - C_2) = CF(C_1) - CF(C_2)$
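
The subtraction of two snapshots can be sketched as a component-wise difference of cluster feature vectors matched by micro-cluster id. This is a simplification that assumes each micro-cluster carries a single id, so merged clusters with idlists are not handled; the function name `subtract_snapshots` and the snapshot layout (id mapped to an `(n, ls, ss)` tuple) are assumptions for this example.

```python
from typing import Dict, List, Tuple

CF = Tuple[int, List[float], List[float]]  # (n, ls, ss)


def subtract_snapshots(current: Dict[int, CF], past: Dict[int, CF]) -> Dict[int, CF]:
    """Summaries of only the points that arrived between the two snapshots."""
    result: Dict[int, CF] = {}
    for cid, (n, ls, ss) in current.items():
        if cid in past:
            pn, pls, pss = past[cid]
            n = n - pn
            ls = [a - b for a, b in zip(ls, pls)]
            ss = [a - b for a, b in zip(ss, pss)]
        if n > 0:  # drop clusters that received no points within the horizon
            result[cid] = (n, ls, ss)
    return result


past = {1: (2, [5, 10], [13, 52])}          # summarizes (3,4), (2,6)
current = {1: (5, [16, 30], [54, 190]),     # the same cluster after 3 more points
           2: (1, [9, 9], [81, 81])}        # a cluster created within the horizon
horizon = subtract_snapshots(current, past)
print(horizon)
# {1: (3, [11, 20], [41, 138]), 2: (1, [9, 9], [81, 81])}
```

The result for id 1 matches the summary of the three later points (4,5), (4,7), (3,8) computed directly.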

Macro-Cluster Creation (Contd.) • Each micro-cluster in $N(t_c - h)$ is treated as a pseudo-point • Each pseudo-point has a weight proportional to the number of points inside it • A k-means clustering approach is applied to this set of pseudo-points to create the higher-level macro-clusters
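
A weighted k-means over pseudo-points could be sketched as below. To keep the example deterministic, the first k pseudo-points are used as initial centers; that seeding choice, the fixed iteration count and the function name `weighted_kmeans` are assumptions of this sketch rather than details given on the slide.

```python
import math
from typing import List, Sequence, Tuple


def weighted_kmeans(points: Sequence[Sequence[float]], weights: Sequence[float],
                    k: int, iters: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Lloyd-style k-means where each pseudo-point counts with its weight."""
    d = len(points[0])
    centers = [list(map(float, p)) for p in points[:k]]  # assumed seeding
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each pseudo-point goes to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the weighted mean of its members.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                centers[c] = [sum(weights[i] * points[i][j] for i in members) / total
                              for j in range(d)]
    return centers, assign


# Pseudo-points are micro-cluster centroids; weights are their point counts.
pseudo_points = [(1, 1), (9, 9), (2, 2), (10, 10)]
weights = [3, 1, 1, 3]
centers, assign = weighted_kmeans(pseudo_points, weights, k=2)
print(centers)  # [[1.25, 1.25], [9.75, 9.75]]
print(assign)   # [0, 1, 0, 1]
```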

Evolution Analysis of Micro-Clusters • In many cases, it is desirable to find how the micro-clusters have changed over time • Given a user-specified time horizon h and two clock times t1 and t2 (where t1 < t2) • Analyze how the data arriving in (t2-h, t2) has evolved relative to the data arriving in (t1-h, t1)

Evolution Analysis of Micro-Clusters (Contd.) • The following questions are of interest • Are there new clusters in the data at time t2 which were not present at time t1? • Find micro-clusters in N(t2-h) which are not present in N(t1-h) • Have some of the original clusters been lost? • Find micro-clusters in N(t1-h) which are not present in N(t2-h) • Have some of the original clusters at time t1 shifted in position and nature?
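
If each horizon's micro-clusters are keyed by id, the three questions reduce to set operations plus a centroid comparison. The sketch below assumes each entry stores only `(n, ls)` and measures "shift" as the Euclidean distance between centroids; both are choices made for this illustration.

```python
import math
from typing import Dict, List, Tuple


def centroid(n: int, ls: List[float]) -> List[float]:
    return [s / n for s in ls]


def compare_horizons(n_t1: Dict[int, Tuple[int, List[float]]],
                     n_t2: Dict[int, Tuple[int, List[float]]]) -> Dict:
    """Compare micro-cluster sets N(t1-h) and N(t2-h), with t1 < t2."""
    ids1, ids2 = set(n_t1), set(n_t2)
    shift = {cid: math.dist(centroid(*n_t1[cid]), centroid(*n_t2[cid]))
             for cid in ids1 & ids2}
    return {"new": sorted(ids2 - ids1),    # present at t2 only
            "lost": sorted(ids1 - ids2),   # present at t1 only
            "shift": shift}                # centroid movement of retained clusters


n_t1 = {1: (4, [4, 4]), 2: (2, [10, 10])}
n_t2 = {1: (4, [4, 16]), 3: (5, [0, 5])}
report = compare_horizons(n_t1, n_t2)
print(report)  # {'new': [3], 'lost': [2], 'shift': {1: 3.0}}
```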

Empirical Results • Data sets • Real Data Sets: Network Intrusion and KDD Cup 98 data set (Charitable Donation) • Synthetic Data Sets: • Gaussian Distribution • Base Size: 100k ~ 1000k points • Number of clusters: 4 ~ 64 • Dimensionality: 10 ~ 100

Cluster Quality (Network Intrusion) • [Figure panels: Horizon H=256, Stream_speed=200; Horizon H=1, Stream_speed=2000]

Cluster Quality (Charitable Donation) • [Figure panels: Horizon H=16, Stream_speed=200; Horizon H=4, Stream_speed=2000]

Scalability • [Figure: Stream_speed=2000]

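Sum of Square Distance (SSQ) • Assume there are a total of N points in the past horizon H at current time $T_c$ • $SSQ = \sum_{i=1}^{N} d(p_i, c_i)^2$, where $c_i$ is the centroid of the macro-cluster closest to $p_i$ and $d(p_i, c_i)$ is the distance between them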
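
Read as a sum of squared distances to the nearest macro-cluster centroid, the measure takes only a few lines to compute. The function name `ssq` and the use of Euclidean distance are assumptions of this sketch.

```python
import math
from typing import Sequence


def ssq(points: Sequence[Sequence[float]],
        centroids: Sequence[Sequence[float]]) -> float:
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


points = [(0, 0), (0, 2), (10, 10)]
centroids = [(0, 1), (10, 10)]
print(ssq(points, centroids))  # 2.0
```

A lower SSQ means the points in the horizon sit closer to their macro-cluster centroids.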

K-means clustering algorithm (K=2) • Arbitrarily choose K points as initial cluster centers • Assign each point to the closest center • Update the cluster means • Reassign the points and update the cluster means again • [Figure: scatter plots on 0 to 10 axes illustrating each step]
