
SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes

The SCLOPE algorithm is presented for clustering data streams with categorical attributes, addressing challenges like high dimensions and sparsity. It leverages the CluStream framework, utilizing pyramidal time frames and a distinct micro-macro clustering process.



Presentation Transcript


  1. SCLOPE: An Algorithm for Clustering Data Streams of Categorical Attributes K.L. Ong, W. Li, W.K. Ng, and E.P. Lim Proc. of the 6th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 2004), Zaragoza, Spain, September 2004 Presenter: 吳建良

  2. Outline • Clustering a Data Stream of Categorical Attributes • SCLOPE Algorithm • CluStream Framework • CLOPE Algorithm • FP-Tree-Like Structure • Experimental Results

  3. Clustering a Data Stream of Categorical Attributes • Technical challenges • High dimensionality • Sparsity in categorical datasets • Additional stream constraints • One-pass I/O • Low CPU consumption

  4. SCLOPE Algorithm • Adopts two aspects of the CluStream framework • Pyramidal time frame: stores summary statistics for different time periods • Separation of the clustering process into • Online micro-clustering component • Offline macro-clustering component • SCLOPE • Online: pyramidal time frame, FP-tree-like structure • Offline: CLOPE clustering algorithm

  5. CLOPE Clustering Algorithm • Cluster quality measure • Histogram similarity • H = S/W, where S is the histogram size (total item occurrences) and W its width (number of distinct items) • A larger height-to-width ratio → better intra-cluster similarity

  6. CLOPE Clustering Algorithm (cont.) Suppose a clustering C = {C1, C2, …, Ck} • Height-to-width ratio H(Ci) = S(Ci)/W(Ci) • Gradient G(Ci) = H(Ci)/W(Ci) = S(Ci)/W(Ci)² • Criterion function (profit): Profit_r(C) = [Σi S(Ci)·|Ci| / W(Ci)^r] / Σi |Ci| • r: repulsion, controls the level of intra-cluster similarity
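The criterion above can be sketched in Python (an illustrative implementation, not the authors' code; `profit` and its parameters are invented names, and clusters are assumed to be lists of transactions, each a set of categorical items):

```python
from collections import Counter

def profit(clusters, r=2.0):
    """CLOPE-style criterion sketch: each cluster contributes
    S(Ci)/W(Ci)^r weighted by its size |Ci|, normalized by the
    total number of transactions."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    score = 0.0
    for c in clusters:
        hist = Counter(item for txn in c for item in txn)
        S = sum(hist.values())   # histogram size: total item occurrences
        W = len(hist)            # histogram width: distinct items
        if W:
            score += S / (W ** r) * len(c)
    return score / total
```

A larger repulsion r penalizes wide histograms more strongly, favoring tighter clusters.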

  7. CLOPE Clustering Algorithm (cont.) Initial phase (worked example from the figure; transactions contain items a–f): • Start with C1 = {T1} • T2 is temporarily added to C1 (profit = 0.55) or placed in a new cluster C2 (profit = 0.41) → T2 joins C1, so C1 = {T1, T2} • T3 is temporarily added to C1 (profit = 0.5) or placed in a new cluster C2 (profit = 0.41) → T3 joins C1 • Final result: C1 = {T1, T2, T3}, C2 = {T4, T5}

  8. CLOPE Clustering Algorithm (cont.) Iteration phase: repeat: moved = false; for each transaction t in the database, move t to the existing or new cluster Cj that maximizes profit; if Ci ≠ Cj then write <t, j> and set moved = true; until not moved
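The two phases can be sketched together as follows (an illustrative implementation under assumed names, not the authors' code; `delta_add` computes the change in one cluster's profit term when a transaction joins it, so each move only needs local bookkeeping):

```python
from collections import Counter

def delta_add(hist, n, txn, r):
    """Change in a cluster's term S*|C|/W^r when `txn` joins it."""
    S, W = sum(hist.values()), len(hist)
    S_new = S + len(txn)
    W_new = W + sum(1 for item in txn if item not in hist)
    old = S / (W ** r) * n if W else 0.0
    return S_new / (W_new ** r) * (n + 1) - old

def clope(transactions, r=2.0, max_iters=20):
    """Sketch of CLOPE's initial pass plus iteration phase: move each
    transaction to the cluster (existing or new) that maximizes
    profit, repeating until no transaction moves."""
    clusters = []                      # list of (histogram, member indices)
    assign = [None] * len(transactions)
    for _ in range(max_iters):
        moved = False
        for i, t in enumerate(transactions):
            if assign[i] is not None:  # take t out of its current cluster
                hist, members = clusters[assign[i]]
                hist.subtract(t)
                for item in t:
                    if hist[item] == 0:
                        del hist[item]
                members.remove(i)
            gains = [delta_add(h, len(m), t, r) for h, m in clusters]
            gains.append(delta_add(Counter(), 0, t, r))  # option: new cluster
            j = max(range(len(gains)), key=gains.__getitem__)
            if j == len(clusters):
                clusters.append((Counter(), []))
            clusters[j][0].update(t)
            clusters[j][1].append(i)
            if j != assign[i]:
                assign[i], moved = j, True
        if not moved:
            break
    return [members for _, members in clusters if members]
```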

  9. Maintain Summary Statistics • Data streams • A set of records R1, …, Ri, … arriving at time stamps t1, …, ti, … • Each record R contains attributes A = {a1, a2, …, aj} • A micro-cluster within a time window (time tp ~ tq) is defined as a tuple of: a vector of record identifiers, and a cluster histogram with • width W: number of distinct attribute values • size S: total occurrences of all values • height H = S/W: size-to-width ratio
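A possible in-memory representation of this summary tuple (the class name and layout are assumptions for illustration, not the paper's exact definitions):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MicroCluster:
    """Micro-cluster summary for a time window: a vector of record
    identifiers plus the cluster histogram (illustrative structure)."""
    record_ids: list = field(default_factory=list)
    hist: Counter = field(default_factory=Counter)

    def add(self, rid, record):
        self.record_ids.append(rid)
        self.hist.update(record)

    @property
    def width(self):   # number of distinct attribute values
        return len(self.hist)

    @property
    def size(self):    # total occurrences of all values
        return sum(self.hist.values())

    @property
    def height(self):  # size-to-width ratio
        return self.size / self.width if self.width else 0.0
```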

  10. FP-Tree-Like Structure • Drawbacks of CLOPE • Multiple scans of the dataset • Multiple evaluations of the criterion function for each record • FP-tree-like structure • Requires only two scans of the dataset • First scan: determine the singleton frequency of each attribute value • Second scan: insert each record into the FP-tree after arranging its attributes in descending order of singleton frequency • Records share common prefixes • No need to compute the clustering criterion online

  11. Construct FP-Tree-Like Structure (figure) • Scan the database once to obtain singleton frequencies, arranged in descending order: a:3, d:3, b:2, c:2, e:2, f:1 • Records are then inserted under the null root in this order, sharing common prefixes (node counts in the figure: a:3, b:2, c:1, d:1, c:1, d:2, e:2, f:1)
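The two-scan construction can be sketched as follows (`Node` and `build_tree` are invented names; this is an illustrative FP-tree-style build, not the paper's code):

```python
from collections import Counter

class Node:
    """Tree node: item label, support count, children keyed by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_tree(records):
    """First scan counts singleton frequencies; the second inserts each
    record with its attribute values sorted by descending frequency,
    so records with common high-frequency values share tree prefixes."""
    freq = Counter(item for rec in records for item in rec)
    root = Node(None)                      # the null root
    for rec in records:
        ordered = sorted(rec, key=lambda it: (-freq[it], it))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq
```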

  12. FP-Tree-Like Structure • Each path (from the root to a leaf node) is a micro-cluster • The number of micro-clusters depends on the available memory space • Merge strategy • Select the node with the longest common prefix • Select any two paths passing through that node • Merge their corresponding micro-clusters and cluster histograms

  13. Online Micro-clustering Component of SCLOPE On beginning of (window wi) do 1: if (i = 0) then Q′ ← {a random order of v1, …, v|A|} 2: T ← new FP-tree and Q ← Q′ 3: for all (incoming record R) do 4: order R according to Q 5: if (R can be inserted completely along an existing path Pi in T) then 6: add R's identifier to Pi's micro-cluster and update its cluster histogram 7: else 8: Pj ← new path in T and H(Pj) ← new cluster histogram for Pj 9: add R's identifier to Pj's micro-cluster and update H(Pj)
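A simplified sketch of this per-window step (an assumption-laden approximation: records whose ordered attribute tuples are identical are treated as following the same path; `micro_cluster_window` and `Q` as a plain list are illustrative names):

```python
from collections import Counter

def micro_cluster_window(records, Q):
    """Per-window online step sketch: order each incoming record by the
    fixed attribute order Q and route it to a path; records on the same
    path share a micro-cluster (record-id list plus cluster histogram)."""
    paths = {}  # ordered value tuple -> (record_ids, histogram)
    for rid, rec in enumerate(records):
        key = tuple(sorted(rec, key=Q.index))
        ids, hist = paths.setdefault(key, ([], Counter()))
        ids.append(rid)
        hist.update(rec)
    return paths
```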

  14. Online Micro-clustering Component of SCLOPE (cont.) On end of (window wi) do 10: L ← {<n, height(n)>: n is a node in T with ≥ 2 children} 11: order L according to height(n) 12: while (more micro-clusters remain than memory allows) do 13: select <n, height(n)> with the largest value 14: select paths Pi, Pj that both pass through n 15: merge Pj into Pi, combining their micro-clusters and cluster histograms 16: delete Pj 17: output micro-clusters and cluster histograms
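A hedged sketch of this end-of-window reduction (this greedy version merges the pair of paths sharing the longest common prefix until a path budget is met, approximating the height-based node selection in the pseudocode; all names are illustrative):

```python
from collections import Counter

def common_prefix_len(p, q):
    """Length of the shared prefix of two path tuples."""
    n = 0
    for x, y in zip(p, q):
        if x != y:
            break
        n += 1
    return n

def reduce_paths(paths, budget):
    """While more micro-clusters (paths) exist than the memory budget
    allows, merge the pair with the longest common prefix. Each entry
    is (path_items, record_ids, histogram)."""
    paths = list(paths)
    while len(paths) > budget:
        i, j = max(
            ((a, b) for a in range(len(paths)) for b in range(a + 1, len(paths))),
            key=lambda ab: common_prefix_len(paths[ab[0]][0], paths[ab[1]][0]),
        )
        pi, ids_i, hi = paths[i]
        pj, ids_j, hj = paths[j]
        # merged micro-cluster keeps the common prefix, pooled ids, summed histogram
        paths[i] = (pi[:common_prefix_len(pi, pj)], ids_i + ids_j, hi + hj)
        del paths[j]
    return paths
```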

  15. Offline Macro-clustering Component of SCLOPE • Inputs: time horizon h and repulsion r • h: spans one or more windows • r: controls the intra-cluster similarity • Profit function: clustering criterion • Each micro-cluster is treated as a pseudo-record • #micro-clusters ≪ #physical records, so it takes less time to converge on the clustering criterion

  16. Experimental Results • Environment • CPU: Pentium 4, 2 GHz • RAM: 1 GB • OS: Windows 2000 • Aspects evaluated • Performance, scalability, cluster accuracy • Datasets • Real-world and synthetic data

  17. Performance and Scalability • Real-life data • FIMI repository (http://fimi.cs.helsinki.fi/data/)

  18. Performance and Scalability (cont.) • Synthetic data • IBM synthetic data generator • Dataset: 50K records; figure panels: (a) 50 clusters, (b) 100 clusters, (c) 500 clusters

  19. #Attributes: 1000

  20. Cluster Accuracy • Mushroom data set • 117 distinct attribute values and 8124 records • Two classes • 4208 edible and 3916 poisonous • Purity metric • The average percentage of the dominant class label in each cluster
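The purity metric can be sketched as follows (an unweighted average across clusters, matching the slide's wording; this is an assumption, since some purity definitions instead weight by cluster size):

```python
from collections import Counter

def purity(clusters):
    """For each cluster, the fraction of records carrying the dominant
    class label, averaged over clusters. `clusters` is a list of
    per-cluster class-label lists."""
    fracs = [Counter(c).most_common(1)[0][1] / len(c) for c in clusters if c]
    return sum(fracs) / len(fracs)
```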

  21. Cluster Accuracy (cont.) • Figures: online micro-clustering component of SCLOPE; SCLOPE vs. CLOPE
