Clustering over Multiple Evolving Streams by Events and Correlations. Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen. Electrical Engineering, National Taiwan University. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007.
Outline • Introduction • Data Summarization • Similarity Measurement • COMET-CORE Framework • Empirical Studies • Conclusion
Introduction (1) • Good clustering puts similar objects together and separates dissimilar ones into different clusters. • Useful information from clusters • Data collection in sensor networks • Stock market trades [Figure: example objects A to G]
Introduction (2) • Online data summarization with offline clustering • Periodic online clustering [Figure annotations: "User", "Waste!!", "Lose Information!!"]
Introduction (3) • COMET-CORE • Use online piecewise linear segments to approximate the original data • Update correlations when a stream encounters a new end point • Update clusters using the updated correlations [Figure: data points and end points of a stream; a new end point triggers an update of the stream correlations]
Data Summarization (1) • Problem Model • Γ = {S1, S2, …, Sn} • Si = Si[1, …, t, …] : the i-th stream • Si[t] : data of Si arriving at time t • Siapp[t] : approximated data of Si at time t • : end points summary of stream Si • Objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.
Data Summarization (2) • Approximation Line Formulation • For a sub-stream Si[ts, …, te] • The parameters of the approximation line are determined by the two end points (ts, Si[ts]) and (te, Si[te])
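The formula on this slide did not survive extraction. As a hedged reconstruction, assuming the approximation line is simply the straight line joining the two end points of the sub-stream, with slope a and intercept b as illustrative names, it can be written as:

```latex
S_i^{app}[t] = a\,t + b, \qquad t_s \le t \le t_e,
\qquad
a = \frac{S_i[t_e] - S_i[t_s]}{t_e - t_s},
\qquad
b = S_i[t_s] - a\,t_s .
```

Under this assumption the approximation matches the original data exactly at both end points, so only the end points need to be stored to rebuild each segment.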
Data Summarization (3) • Error Function • Error Threshold • It may not be easy to choose a proper absolute error threshold • Relative error threshold (e.g., 2% of the square sum of the original data stream)
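Data Summarization (4) • Online Linear Line Segment Approximation • While the error stays below the threshold δl, the current segment keeps growing • When the error exceeds δl, a new end point is generated [Figure: value versus time, with end points at tv1, …, tvk]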
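A minimal sketch of the online segmentation idea follows. It rests on assumptions not confirmed by the slides: each segment is the straight line joining its first and last points, the error is the sum of squared residuals, and the threshold is a fraction (`rel_threshold`, a hypothetical parameter name) of the square sum of the segment's original data. The function names are illustrative only.

```python
def segment_error(stream, start, end):
    """Sum of squared residuals between stream[start..end] and the straight
    line joining its first and last points (an assumed error function)."""
    slope = (stream[end] - stream[start]) / (end - start)
    return sum(
        (stream[t] - (stream[start] + slope * (t - start))) ** 2
        for t in range(start, end + 1)
    )


def online_end_points(stream, rel_threshold=0.02):
    """Return the indices of the end points found while scanning the stream.

    A new end point is generated at t - 1 as soon as extending the current
    segment to t would push the error above the relative threshold.
    """
    if not stream:
        return []
    end_points = [0]
    start = 0
    for t in range(2, len(stream)):
        square_sum = sum(v * v for v in stream[start:t + 1])
        if segment_error(stream, start, t) > rel_threshold * square_sum:
            end_points.append(t - 1)
            start = t - 1
    if len(stream) > 1:
        end_points.append(len(stream) - 1)  # close the last open segment
    return end_points


print(online_end_points([0, 1, 2, 3, 4, 3, 2, 1, 0]))  # [0, 4, 8]
```

For clarity the sketch recomputes the error over the whole current segment at every step; a real streaming version would maintain running sums instead.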
Similarity Measurement (1) • Use Pearson correlation as the similarity measure • Regard two streams as two different random variables
Similarity Measurement (2) • Definition 4.2. Given two streams Si and Sj, and a weight function w(t), the weighted correlation coefficient between these two streams is defined as :
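The formula for Definition 4.2 is missing from the extracted slide. The sketch below uses the standard weighted Pearson correlation as a stand-in, which may differ in detail from the paper's definition; `weighted_corr` and its arguments are hypothetical names, and returning 0.0 for a constant stream is a choice made here, not something stated in the source.

```python
import math


def weighted_corr(x, y, w):
    """Standard weighted Pearson correlation of two equal-length sequences.

    x, y: values of two streams at the same time steps
    w:    non-negative weights, one per time step (e.g. favoring recent data)
    """
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant stream as uncorrelated
    return cov / math.sqrt(var_x * var_y)
```

With all weights equal this reduces to the ordinary Pearson correlation; increasing the weights of recent time steps makes the measure follow the current behavior of evolving streams more closely.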
Similarity Measurement (3) • Definition 4.3. Given two streams Si and Sj, and a weight function w(t), the WC vector of Si and Sj is defined as :
Similarity Measurement (4) • Similarity Update • Update the WC vector when a new end point is generated • Linear scan of data streams → incremental update
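The contents of the WC vector are not visible in the extracted slides. One way to get an incremental update instead of a linear scan, sketched here as an assumption, is to keep six weighted running sums from which the weighted correlation can be recomputed at any time. The class and field names are hypothetical, and the sketch updates once per data point, whereas the slides describe updates when a new end point is generated.

```python
import math


class WCVector:
    """Running weighted sums for one pair of streams (assumed layout)."""

    def __init__(self):
        self.sw = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y, w=1.0):
        """Fold one pair of values into the sums in constant time."""
        self.sw += w
        self.sx += w * x
        self.sy += w * y
        self.sxx += w * x * x
        self.syy += w * y * y
        self.sxy += w * x * y

    def corr(self):
        """Weighted Pearson correlation computed from the sums alone."""
        num = self.sw * self.sxy - self.sx * self.sy
        var_x = self.sw * self.sxx - self.sx ** 2
        var_y = self.sw * self.syy - self.sy ** 2
        if var_x <= 0 or var_y <= 0:
            return 0.0  # assumption: undefined correlation reported as 0
        return num / math.sqrt(var_x * var_y)


wc = WCVector()
for x, y in zip([1, 2, 3], [1, 3, 2]):
    wc.update(x, y)
print(wc.corr())  # 0.5
```

Because each update touches only six numbers, the cost of refreshing a pair's correlation does not depend on how much of the streams has already been seen.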
COMET-CORE Framework (1) • Definition 5.1. Assume that the centers of two clusters Ci and Cj are the streams Si and Sj, each represented by its end point sequence. Then the WC vector of the two clusters is equal to the WC vector of Si and Sj, and the weighted correlation between Ci and Cj, denoted by wcorr(Ci, Cj), is equal to wcorr(Si, Sj). • COMET-CORE: when a stream encounters a new end point, clusters are split and then merged.
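COMET-CORE Framework (2) • Split cluster • The streams of cluster Ck that encounter a new end point are the trigger streams; the others are non-trigger streams • Update the weighted correlations of the trigger streams and compare them with δa to form new trigger groups • Compare the correlation between each non-trigger stream and the representative stream with δa [Figure: cluster Ck is split through a temporary cluster Ctmp into three new groups Cnew1, Cnew2, Cnew3]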
COMET-CORE Framework (3) • Assign WC vectors to newly generated clusters • Type 1: Ci and Cj originally belonged to the same cluster. • Type 2: Ci and Cj originally belonged to different clusters. • Type 3: Ci is a newly generated cluster and Coo is an originally existing one. [Figure: panels (a) Type 1, (b) Type 2, (c) Type 3; cluster Cx (S1 to S7) is split into C1 (S1, S2, S3), C4 (S4, S5) and C6 (S6, S7), and cluster Cy (S11 to S14) into C11 (S11, S12) and C14 (S13, S14)]
COMET-CORE Framework (4) • Merge Cluster • Performed after splitting and updating the inter-cluster correlations • Two clusters are merged if their correlation ≥ δe; merging repeats until no such cluster pair exists. • wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck)) [Figure: C1 and C2 with wcorr(C1, C2) ≥ δe are merged into Cnew, whose correlation with another cluster Ck is derived from wcorr(C1, Ck) and wcorr(C2, Ck)]
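The merge step can be sketched as follows, taking the merged cluster's correlation with any other cluster Ck as the smaller of the two original correlations. The data layout (a dict of clusters plus a dict of pairwise correlations keyed by `frozenset`), the naming of merged clusters, and the choice to merge the most correlated pair first are assumptions made for illustration; the slides do not specify a merge order.

```python
def merge_clusters(clusters, wcorr, delta_e):
    """Merge cluster pairs whose correlation is at least delta_e.

    clusters: dict mapping cluster name -> set of stream ids
    wcorr:    dict mapping frozenset({name_a, name_b}) -> correlation,
              with an entry for every pair of clusters
    Returns new (clusters, wcorr) dicts; the inputs are not modified.
    """
    clusters = {name: set(members) for name, members in clusters.items()}
    wcorr = dict(wcorr)
    while True:
        candidates = [(c, pair) for pair, c in wcorr.items() if c >= delta_e]
        if not candidates:
            return clusters, wcorr
        # Assumed order: merge the most strongly correlated pair first.
        _, pair = max(candidates, key=lambda item: item[0])
        a, b = sorted(pair)
        new = f"{a}+{b}"
        clusters[new] = clusters.pop(a) | clusters.pop(b)
        for k in clusters:
            if k != new:
                # Correlation with the merged cluster: min of the two originals.
                wcorr[frozenset((new, k))] = min(
                    wcorr.pop(frozenset((a, k))), wcorr.pop(frozenset((b, k)))
                )
        del wcorr[pair]


clusters = {"C1": {"S1"}, "C2": {"S2"}, "C3": {"S3"}}
wcorr = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "C3")): 0.6,
    frozenset(("C2", "C3")): 0.3,
}
merged, corr = merge_clusters(clusters, wcorr, delta_e=0.8)
print(merged)  # {'C3': {'S3'}, 'C1+C2': {'S1', 'S2'}} (set order may vary)
```

Using the minimum is conservative: the merged cluster is only considered close to Ck if both of its parts were.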
Empirical Studies (1) • Clustering algorithms • Basic: periodic agglomerative clustering • ODAC: periodic hierarchical clustering • COMET-CORE [Figure: from all streams to the clustering result; conditions shown: 2Dis(P) - (Dis(C1) + Dis(C2)) < Threshold, and Dissimilarity > Threshold]
Empirical Studies (2) • Clustering quality measurement • Silhouette Validation: a(Si) is the average dissimilarity of stream Si to all other streams in the same cluster; b(Si) is the average dissimilarity of stream Si to all streams in the closest other cluster • Cluster Silhouette • Global Silhouette
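The silhouette formulas are not present in the extracted text. The sketch below uses the standard silhouette definition, s = (b - a) / max(a, b), averaged per cluster and then over clusters; this is assumed to match the slide. The dissimilarity matrix is supplied by the caller, since the slides do not show which dissimilarity is used, and giving singleton clusters a silhouette of 0 is a common convention adopted here.

```python
def silhouettes(dis, clusters):
    """Return (per-cluster silhouettes, global silhouette).

    dis:      square matrix (list of lists), dis[i][j] = dissimilarity of
              streams i and j
    clusters: list of lists of stream indices; at least two clusters
    """
    cluster_scores = []
    for ci, members in enumerate(clusters):
        scores = []
        for i in members:
            others = [j for j in members if j != i]
            if not others:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(dis[i][j] for j in others) / len(others)
            # b: average dissimilarity to the closest other cluster
            b = min(
                sum(dis[i][j] for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
        cluster_scores.append(sum(scores) / len(scores))
    return cluster_scores, sum(cluster_scores) / len(cluster_scores)
```

Values close to 1 indicate compact, well separated clusters, while values near 0 or below suggest that streams sit between clusters or are assigned to the wrong one.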
Empirical Studies (3) • Evaluation on Real Data • δa = δe = 0.5 [Table: data sets]
Empirical Studies (4) • Evaluation on Cylinder-Bell-Funnel Data Set • δa = δe = 0.8 • 6 types of streams, each 128 points long • 100 streams for each type (600 streams in total) • Normally distributed random numbers ranging from 0 to 1 are added to each stream
Empirical Studies (5) • Evaluation on Random Walk Data Set • δa = δe = 0.7 • Period = 200 data points (Basic & ODAC) • 1. Varying the number of streams (20000 points in each stream) • 2. Varying the number of clusters (fixed 500 streams): almost independent of the number of clusters
Conclusion • The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple streams. • COMET-CORE uses efficient split and merge algorithms to update clusters while keeping good clustering quality.