
Clustering over Multiple Evolving Streams by Events and Correlations

Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen. Electrical Engineering, National Taiwan University. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007.


Presentation Transcript


  1. Clustering over Multiple Evolving Streams by Events and Correlations • Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen • Electrical Engineering, National Taiwan University • IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007

  2. Outline • Introduction • Data Summarization • Similarity Measurement • COMET-CORE Framework • Empirical Studies • Conclusion

  3. Introduction (1) • Good clustering puts similar objects together and separates dissimilar ones into different clusters. • Clusters yield useful information, for example in: • Data collection in sensor networks • Stock market trades [Figure: streams A to G grouped into clusters]

  4. Introduction (2) • Online data summarization with offline clustering. • Periodic online clustering. [Figure: streams A to G and a user; annotations "Waste!!" and "Lose Information!!"]

  5. Introduction (3) • COMET-CORE • Approximates the original data online with piecewise linear line segments • Updates correlations when a stream encounters a new end point • Updates clusters according to the updated correlations [Figure labels: data point; end point; update stream correlations]

  6. Data Summarization (1) • Problem Model • Γ = {S1, S2, …, Sn}: the set of data streams • Si = Si[1, …, t, …]: the i-th stream • Si[t]: data arriving on Si at time t • Siapp[t]: approximated value of Si at time t • End-point summary of stream Si • Objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.

  7. Data Summarization (2) • Approximation Line Formulation • For a sub-stream Si[ts, …, te] • The parameters: [Figure labels: end points (ts, Si[ts]) and (te, Si[te])]
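The formula on this slide did not survive in the transcript. As a hedged sketch only, assuming the approximation line is the straight line joining the two labelled end points (the symbols a and b are introduced here for illustration), the line for a sub-stream Si[ts, …, te] could be written as:

```latex
S_i^{app}[t] = a\,t + b, \qquad
a = \frac{S_i[t_e] - S_i[t_s]}{t_e - t_s}, \qquad
b = S_i[t_s] - a\,t_s, \qquad t_s \le t \le t_e
```

Under this assumption the line is fully determined by the two end points, which is why only end points need to be stored in the summary.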

  8. Data Summarization (3) • Error Function • Error Threshold • It may not be easy to choose a proper absolute error threshold • Use a relative error threshold instead (e.g., 2% of the square sum of the original data stream)

  9. Data Summarization (4) • Online Linear Line Segment Approximation • Error < threshold δl: keep the current segment • Error > threshold δl: generate a new end point [Figure: value vs. time, with end points at tv1, …, tvk]
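The online approximation can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the error is the sum of squared differences between the data and the line joining the segment's end points, and that the threshold δl is relative (a fraction of the square sum of the segment's data, as in the 2% example). The names `segment_error`, `online_end_points`, and `rel_threshold` are hypothetical.

```python
def segment_error(values, start, end):
    """Sum of squared differences between values[start..end] and the
    straight line joining (start, values[start]) and (end, values[end]).
    Assumed error function; the slide's own formula is not in the transcript."""
    if end == start:
        return 0.0
    slope = (values[end] - values[start]) / (end - start)
    return sum(
        (values[t] - (values[start] + slope * (t - start))) ** 2
        for t in range(start, end + 1)
    )


def online_end_points(stream, rel_threshold=0.02):
    """Consume a stream point by point and return the list of end points
    (time, value). A new end point is generated when the error of the
    current segment exceeds rel_threshold times the square sum of its data."""
    values = []
    end_points = []
    start = 0  # index where the current segment begins
    for t, v in enumerate(stream):
        values.append(v)
        if t == 0:
            end_points.append((0, v))
            continue
        energy = sum(x * x for x in values[start:t + 1])
        if segment_error(values, start, t) > rel_threshold * energy:
            # The segment ending at t - 1 was still acceptable: close it there.
            start = t - 1
            end_points.append((start, values[start]))
    last = len(values) - 1
    if values and end_points[-1][0] != last:
        end_points.append((last, values[last]))  # close the final segment
    return end_points


if __name__ == "__main__":
    print(online_end_points([0, 1, 2, 3, 2, 1, 0]))
```

This sketch recomputes the segment error from scratch at every arrival to stay short; a streaming implementation would maintain running sums instead.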

  10. Similarity Measurement (1) • Use the Pearson correlation as the similarity measure • Regard two streams as two different random variables
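For reference, the standard Pearson correlation of two streams observed over times 1, …, T is shown below. The notation (the bar for the mean, and T for the number of observed points) is introduced here and is not taken from the slide.

```latex
corr(S_i, S_j) =
\frac{\sum_{t=1}^{T} \left(S_i[t] - \bar{S_i}\right)\left(S_j[t] - \bar{S_j}\right)}
     {\sqrt{\sum_{t=1}^{T} \left(S_i[t] - \bar{S_i}\right)^2}\;
      \sqrt{\sum_{t=1}^{T} \left(S_j[t] - \bar{S_j}\right)^2}},
\qquad \bar{S_i} = \frac{1}{T}\sum_{t=1}^{T} S_i[t]
```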

  11. Similarity Measurement (2) • Definition 4.2. Given two streams Si and Sj and a weight function w(t), the weighted correlation coefficient between these two streams is defined as:
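The formula of Definition 4.2 is missing from the transcript. The sketch below computes the textbook weighted Pearson correlation, which is assumed here to be the intended form; the function name `weighted_correlation` and the convention of returning 0.0 for a constant stream are choices made for this illustration.

```python
import math


def weighted_correlation(x, y, w):
    """Textbook weighted Pearson correlation of two equally long sequences.
    w[t] plays the role of the weight function w(t)."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant stream; 0.0 by convention here
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical weights that emphasize recent points.
    print(weighted_correlation([1, 2, 3, 4], [2, 4, 6, 8], [0.25, 0.5, 1.0, 2.0]))
```

With all weights equal, this reduces to the ordinary Pearson correlation.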

  12. Similarity Measurement (3) • Definition 4.3. Given two streams Si and Sj and a weight function w(t), the WC vector of Si and Sj is defined as:

  13. Similarity Measurement (4) • Similarity Update • Update the WC vector when a new end point is generated • Linear scan of data streams → incremental update
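The move from a linear scan to an incremental update can be sketched by keeping a small vector of weighted running sums from which the correlation is recomputed in constant time. The exact contents of the WC vector are not preserved in the transcript, so the six sums below are an assumption, and the class `WCVector` with its field names is hypothetical. This sketch also updates once per data point, whereas the slides trigger updates at new end points.

```python
import math
from dataclasses import dataclass


@dataclass
class WCVector:
    """Assumed sufficient statistics for a weighted correlation."""
    sw: float = 0.0    # sum of weights
    sx: float = 0.0    # sum of w * x
    sy: float = 0.0    # sum of w * y
    sxx: float = 0.0   # sum of w * x^2
    syy: float = 0.0   # sum of w * y^2
    sxy: float = 0.0   # sum of w * x * y

    def update(self, x, y, w=1.0):
        """Fold one new pair of values into the running sums in O(1)."""
        self.sw += w
        self.sx += w * x
        self.sy += w * y
        self.sxx += w * x * x
        self.syy += w * y * y
        self.sxy += w * x * y

    def correlation(self):
        """Weighted correlation computed from the sums only."""
        var_x = self.sw * self.sxx - self.sx ** 2
        var_y = self.sw * self.syy - self.sy ** 2
        if var_x <= 0 or var_y <= 0:
            return 0.0  # undefined for constant or empty input; 0.0 by convention here
        return (self.sw * self.sxy - self.sx * self.sy) / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    wc = WCVector()
    for x, y, w in [(1, 3, 0.5), (2, 5, 1.0), (3, 7, 1.0), (4, 9, 2.0)]:
        wc.update(x, y, w)
    print(wc.correlation())
```

Because only the sums are stored, no past data has to be rescanned when new values arrive.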

  14. Similarity Measurement (5) . . .

  15. COMET-CORE Framework (1) • Definition 5.1. Assume that the centers of two clusters Ci and Cj are represented by the end-point sequences of streams Si and Sj, respectively. Then the WC vector of the two clusters equals the WC vector of Si and Sj, and the weighted correlation between Ci and Cj, denoted wcorr(Ci, Cj), equals wcorr(Si, Sj). • COMET-CORE flow: a stream encounters a new end point → split clusters → merge clusters

  16. COMET-CORE Framework (2) • Split cluster • Update weighted correlation • Trigger streams: compare correlation with δa to form new trigger groups • Non-trigger streams: compare the correlation between each non-trigger stream and the representative stream with δa [Figure: cluster Ck with trigger and non-trigger streams; temporary cluster Ctmp; three new groups Cnew1, Cnew2, Cnew3]

  17. COMET-CORE Framework (3) • Assign WC vectors to newly generated clusters • Type 1: Ci and Cj originally belong to the same cluster. • Type 2: Ci and Cj originally belong to different clusters. • Type 3: Ci is a newly generated cluster and Coo is an originally existing one. [Figure, panels (a) Type 1, (b) Type 2, (c) Type 3: original clusters Cx (S1, S2, S3, S4, S5, S6, S7) and Cy (S11, S12, S13, S14); stream groups S1,S2,S3; S4,S5; S6,S7; S11,S12; S13,S14; clusters C1, C4, C6, C11, C14]

  18. COMET-CORE Framework (4) • Merge Cluster • Performed after splitting and updating the inter-cluster correlations • Two clusters are merged if their correlation ≥ δe; merging repeats until no such cluster pair exists. • If wcorr(C1, C2) ≥ δe, merge C1 and C2 into Cnew • wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck)) [Figure: C1 and C2 merged into Cnew, with a third cluster Ck]
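The merge step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: clusters are a dict of names to stream sets, inter-cluster correlations are stored per unordered pair, and the pair with the highest correlation is merged first (the slide does not specify an order). The function name `merge_clusters` and the "a+b" naming of merged clusters are hypothetical. The correlation of a merged cluster to any other cluster is the minimum of the two previous correlations, as on the slide.

```python
def merge_clusters(clusters, wcorr, delta_e):
    """Repeatedly merge cluster pairs whose correlation is >= delta_e.

    clusters: dict mapping cluster name -> set of stream names
    wcorr:    dict mapping frozenset({name_a, name_b}) -> correlation
    Returns new (clusters, wcorr) dicts; the inputs are not modified.
    """
    clusters = {name: set(members) for name, members in clusters.items()}
    wcorr = dict(wcorr)
    while True:
        candidates = [(c, pair) for pair, c in wcorr.items() if c >= delta_e]
        if not candidates:
            break  # no remaining pair reaches the threshold
        _, pair = max(candidates, key=lambda item: item[0])
        a, b = sorted(pair)
        new_name = f"{a}+{b}"
        members = clusters.pop(a) | clusters.pop(b)
        del wcorr[pair]
        for k in clusters:
            # wcorr(Cnew, Ck) = min(wcorr(C1, Ck), wcorr(C2, Ck))
            wcorr[frozenset((new_name, k))] = min(
                wcorr.pop(frozenset((a, k))), wcorr.pop(frozenset((b, k)))
            )
        clusters[new_name] = members
    return clusters, wcorr


if __name__ == "__main__":
    clusters = {"C1": {"S1"}, "C2": {"S2"}, "Ck": {"S3"}}
    wcorr = {
        frozenset(("C1", "C2")): 0.9,
        frozenset(("C1", "Ck")): 0.3,
        frozenset(("C2", "Ck")): 0.6,
    }
    print(merge_clusters(clusters, wcorr, delta_e=0.8))
```

Taking the minimum is conservative: a merged cluster is only considered correlated with Ck to the degree that both of its parts were.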

  19. Empirical Studies (1) • Clustering algorithms • Basic: periodic agglomerative clustering • ODAC: periodic hierarchical clustering • COMET-CORE [Figure labels: all streams; 2Dis(P) - (Dis(C1) + Dis(C2)) < Threshold; Dissimilarity > Threshold; clustering result]

  20. Empirical Studies (2) • Clustering quality measurement • Silhouette Validation • a(Si) is the average dissimilarity of stream Si to all other streams in the same cluster • b(Si) is the average dissimilarity of stream Si to all streams in the closest other cluster • Cluster Silhouette • Global Silhouette
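The silhouette formulas are not preserved in the transcript. The sketch below uses the standard silhouette definition, s(Si) = (b(Si) - a(Si)) / max(a(Si), b(Si)), and assumes that the cluster silhouette is the mean over a cluster's members and the global silhouette is the mean over clusters. The function names and the `dissim` callback are hypothetical; the dissimilarity measure itself is left to the caller.

```python
def silhouette(stream, clusters, dissim):
    """Standard silhouette width of one stream.

    clusters: list of lists of stream ids; dissim(a, b) -> float.
    """
    own = next(c for c in clusters if stream in c)
    others = [s for s in own if s != stream]
    if not others:
        return 0.0  # usual convention for singleton clusters
    a = sum(dissim(stream, s) for s in others) / len(others)
    b = min(
        sum(dissim(stream, s) for s in c) / len(c)
        for c in clusters
        if c is not own
    )
    return (b - a) / max(a, b)


def cluster_silhouette(cluster, clusters, dissim):
    """Mean silhouette width of the members of one cluster."""
    return sum(silhouette(s, clusters, dissim) for s in cluster) / len(cluster)


def global_silhouette(clusters, dissim):
    """Mean of the cluster silhouettes."""
    return sum(cluster_silhouette(c, clusters, dissim) for c in clusters) / len(clusters)


if __name__ == "__main__":
    # Toy example: one-dimensional "streams" with absolute difference as dissimilarity.
    values = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}
    groups = [["a", "b"], ["c", "d"]]
    print(global_silhouette(groups, lambda p, q: abs(values[p] - values[q])))
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned streams. The sketch assumes at least two clusters.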

  21. Empirical Studies (3) • Evaluation on Real Data • δa = δe = 0.5 [Table: Data Sets]

  22. Empirical Studies (4) • Evaluation on Cylinder-Bell-Funnel Data Set • δa = δe = 0.8 • 6 types of streams, each 128 points long • 100 streams for each type (600 streams in total) • Normally distributed noise, with values ranging from 0 to 1, is randomly added to each stream

  23. Empirical Studies (5) • Evaluation on Random Walk Data Set • δa = δe = 0.7 • Period = 200 data points (Basic & ODAC) • Varied: (1) number of streams, (2) number of clusters • Almost independent of the number of clusters [Figure labels: 20,000 points in each stream; fixed 500 streams]

  24. Conclusion • The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple streams. • COMET-CORE uses an efficient split-and-merge algorithm to modify clusters while maintaining good clustering quality.
