

Presentation Transcript


  1. Deepak Turaga¹, Michalis Vlachos², Olivier Verscheure¹ (¹IBM T.J. Watson Research Center, NY, USA; ²IBM Zürich Research Laboratory, Switzerland). On K-Means Cluster Preservation using Quantization Schemes

  2. overview – what we want to do… [Diagram: k-Means on the original data and k-Means on the quantized data yield identical clustering results (cluster 1, cluster 2, cluster 3)] • Examine under what conditions compression methodologies retain the clustering outcome • We focus on the K-Means algorithm

  3. why we want to do that… • Reduced Storage • The quantized data will take up less space

  4. why we want to do that… • Reduced Storage • The quantized data will take up less space • Faster execution • Since the data can be represented in a more compact form, the clustering algorithm will require less runtime

  5. why we want to do that… • Reduced Storage • The quantized data will take up less space • Faster execution • Since the data can be represented in a more compact form, the clustering algorithm will require less runtime • Anonymization/Privacy Preservation • The original values are not disclosed

  6. why we want to do that… • Reduced Storage • The quantized data will take up less space • Faster execution • Since the data can be represented in a more compact form, the clustering algorithm will require less runtime • Anonymization/Privacy Preservation • The original values are not disclosed • Authentication • Encode some message with the quantization. We will achieve the above and still guarantee the same results.

  7. other cluster preservation techniques • We do not transform into another space • Space requirements stay the same – no data simplification • Shape preservation [Figure: original vs. quantized series] [Oliveira04] S. R. M. Oliveira and O. R. Zaane. Privacy Preservation When Sharing Data For Clustering, 2004. [Parameswaran05] R. Parameswaran and D. Blough. A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data, 2005.

  8. k-means overview • K-Means Algorithm: • Initialize k cluster centers (k specified by user) randomly. • Repeat until convergence: • Assign each object to the nearest cluster center. • Re-estimate cluster centers.
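For reference, a minimal sketch of the Lloyd-style k-Means loop the slide outlines (plain NumPy; the function name and defaults are mine, not the paper's):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-Means on an (n_objects, n_dims) array, as outlined on slide 8."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assign each object to the nearest cluster center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # re-estimate cluster centers as the mean of their members
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):                                  # keep old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return labels, centers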

  9. k-means example

  10. k-means applications/usage • Fast pre-clustering

  11. k-means applications/usage • Fast pre-clustering • Real-time clustering (e.g., image and video effects) • Color/Image segmentation

  12. k-means objective function • Objective: minimize the sum of intra-class variance. [Equation on slide: after some algebraic manipulation, the objective becomes a sum over clusters and over dimensions/time instances of terms involving only the 1st and 2nd moments of the cluster's objects and the cluster centroid.]
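A reconstruction of the objective the slide decomposes (the symbols C_k, mu_{k,d}, x_{i,d} and D are my notation, not taken from the slide):

\[
J \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
  \;=\; \sum_{k=1}^{K} \sum_{d=1}^{D}
        \Big( \underbrace{\textstyle\sum_{x_i \in C_k} x_{i,d}^{2}}_{\text{2nd moment}}
              \;-\; |C_k|\, \mu_{k,d}^{2} \Big),
\qquad
\mu_{k,d} \;=\; \frac{1}{|C_k|} \sum_{x_i \in C_k} x_{i,d} \;\;(\text{1st moment}),
\]

so, for a fixed assignment, the objective depends on the cluster objects only through their per-dimension sums and sums of squares.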

  13. k-means objective function • We maintain the cluster assignment. [Equation on slide: the same sum over clusters and dimensions/time instances of 1st- and 2nd-moment terms.] So we can preserve the k-Means outcome if we preserve the 1st and 2nd moments of the cluster objects.
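In the same (assumed) notation, the slide's preservation condition reads: if the quantized objects \tilde{x} satisfy, within every cluster C_k and for every dimension d,

\[
\sum_{x_i \in C_k} \tilde{x}_{i,d} \;=\; \sum_{x_i \in C_k} x_{i,d}
\quad\text{and}\quad
\sum_{x_i \in C_k} \tilde{x}_{i,d}^{2} \;=\; \sum_{x_i \in C_k} x_{i,d}^{2},
\]

then the objective value is unchanged under that assignment, so the k-Means outcome is preserved.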

  14. moment preserving quantization • 1st moment: average • 2nd (central) moment: variance • 3rd moment: skewness • 4th moment: kurtosis
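As a reminder (standard definitions, not spelled out on the slide), for a variable X with mean \mu and standard deviation \sigma:

\[
\mu = \mathbb{E}[X], \qquad
\sigma^{2} = \mathbb{E}[(X-\mu)^{2}], \qquad
\text{skewness} = \frac{\mathbb{E}[(X-\mu)^{3}]}{\sigma^{3}}, \qquad
\text{kurtosis} = \frac{\mathbb{E}[(X-\mu)^{4}]}{\sigma^{4}}.
\]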

  15. In order to preserve the first and second moments we use the following quantizer: everything below the mean value is 'snapped' to a low output level, and everything above the mean value is 'snapped' to a high output level.
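The slide does not spell out the two output levels; a standard moment-preserving choice, consistent with the worked example on the next slide, is (with p_0 the fraction of values below the mean \mu, p_1 = 1 - p_0, and \sigma the standard deviation):

\[
q_{\text{low}} = \mu - \sigma\sqrt{p_1/p_0}, \qquad
q_{\text{high}} = \mu + \sigma\sqrt{p_0/p_1},
\]

which satisfy p_0 q_low + p_1 q_high = \mu and p_0 (q_low - \mu)^2 + p_1 (q_high - \mu)^2 = \sigma^2, i.e. the first two moments are preserved exactly.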

  16. Example (one dimension of one cluster; mean = -0.4689). Everything below the mean value is 'snapped' to -1.4795, everything above the mean value is 'snapped' to 0.2049:
      original:  -2.4240 -0.2238  0.0581 -0.4246 -0.2029 -1.5131 -1.1264 -0.8150  0.3666 -0.5861  1.5374  0.1401 -1.8628 -0.4542 -0.6521  0.1033 -0.2206 -0.2790 -0.7337 -0.0645
      quantized: -1.4795  0.2049  0.2049  0.2049  0.2049 -1.4795 -1.4795 -1.4795  0.2049 -1.4795  0.2049  0.2049 -1.4795  0.2049 -1.4795  0.2049  0.2049  0.2049 -1.4795  0.2049
      Both rows have the same average (-0.4689) and, by construction, the same variance.
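A small Python sketch of the 1-bit moment-preserving quantizer described on slides 15-16 (the function name and the output-level formula are my reconstruction; they reproduce the numbers above):

import numpy as np

def mpq_1bit(x):
    """1-bit moment-preserving quantizer for one dimension of one cluster.
    Values below the mean snap to a low level, values above to a high level;
    the levels are chosen so the mean and variance are preserved exactly."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()              # 1st moment and 2nd (central) moment
    below = x < mu
    p0 = below.mean()                          # fraction of values below the mean
    p1 = 1.0 - p0                              # fraction of values above the mean
    q_low = mu - sigma * np.sqrt(p1 / p0)      # assumes both sides are non-empty
    q_high = mu + sigma * np.sqrt(p0 / p1)
    return np.where(below, q_low, q_high)

x = [-2.4240, -0.2238, 0.0581, -0.4246, -0.2029, -1.5131, -1.1264, -0.8150,
     0.3666, -0.5861, 1.5374, 0.1401, -1.8628, -0.4542, -0.6521, 0.1033,
     -0.2206, -0.2790, -0.7337, -0.0645]
q = mpq_1bit(x)
print(np.round(np.unique(q), 4))                   # [-1.4795  0.2049]
print(round(np.mean(x), 4), round(np.mean(q), 4))  # -0.4689 -0.4689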

  17. These are the points for one dimension d (or time instance d) and for one cluster of objects. The process is repeated for all dimensions and for all clusters: we have one quantizer per class.

  18. our quantization • One quantizer per class • The quantized data are binary

  19. our quantization • The fact that we have one quantizer per class suggests that we need to run k-Means once before we quantize • This is not a shortcoming of the technique, as we need to know the cluster boundaries in order to know how much we can simplify the data.
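Putting slides 17-19 together, a hedged end-to-end sketch (reusing the kmeans and mpq_1bit helpers sketched above; all names are mine, not the paper's):

def quantize_preserving_kmeans(X, k):
    """Run k-Means once, then quantize each (cluster, dimension) pair
    independently with the 1-bit moment-preserving quantizer."""
    X = np.asarray(X, dtype=float)
    labels, centers = kmeans(X, k)         # the single k-Means pass slide 19 refers to
    Xq = np.empty_like(X)
    for j in range(k):                     # one quantizer per class ...
        members = labels == j
        for d in range(X.shape[1]):        # ... and one per dimension/time instance
            Xq[members, d] = mpq_1bit(X[members, d])
    return Xq, labels

# Re-running k-Means on Xq (ideally from the same initial centers) should give
# the same assignment when the clusters are well formed, per slides 20-22.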

  20. why quantization works • Why does the clustering remain the same before and after quantization? • The centers do not change (the averages remain the same).

  21. why quantization works • Why does the clustering remain the same before and after quantization? • The centers do not change (the averages remain the same). • The cluster assignment does not change, because clusters 'shrink' due to quantization.

  22. will it always work? • The results will be the same for datasets with well-formed clusters • A discrepancy in the results means that the clusters were not that dense

  23. recap [Equation on slide: the objective in terms of per-cluster, per-dimension 1st and 2nd moments, as on slide 13] • Use moment-preserving quantization to preserve the objective function • Due to cluster shrinkage, cluster assignments will not change • Identical results for optimal k-Means • One quantizer per class • 1-bit quantizer per dimension

  24. example: shape preservation

  25. example: shape preservation

  26. example: shape preservation [Bagnall06] A. J. Bagnall, C. A. Ratanamahatana, E. J. Keogh, S. Lonardi, and G. J. Janacek. A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Min. Knowl. Discov., 13(1):11–40, 2006.

  27. example: cluster preservation • 3 years of Nasdaq stock ticker data • We cluster into k=8 clusters [Figure: confusion matrix]

  28. 3% mislabeled data after the moment-preserving quantization; with Binary Clipping: 80% mislabeled. [Figure: the 8 cluster centers]

  29. quantization levels indicate cluster spread

  30. example: label preservation • 2 datasets: contours of fish and contours of leaves (e.g. Acer platanoides, Salix fragilis, Tilia, Quercus robur) • For rotation invariance we use rotation-invariant features (space-time and frequency domain) • Clustering and then k-NN voting [Figures: space-time and frequency-domain representations of a contour]

  31. example: label preservation • Very low mislabeling error for MPQ • High error rate for Binary Clipping

  32. other nice characteristics • Low sensitivity to initial centers • Mismatch when starting from different centers is around 7%

  33. other nice characteristics • Low sensitivity to initial centers • Mismatch when starting from different centers is around 7% • Neighborhood preservation • even though we are not optimizing for that directly… • Good results because we are preserving the 'shape' of the objects [Figure: neighboring objects A and B]

  34. Size reduction by a factor of 3 when using the quantized scheme • The compression gain decreases for increasing K

  35. summary • A 1-bit quantizer per dimension is sufficient to preserve k-Means 'as well as possible' • Theoretically the results will be identical (under conditions) • Good 'shape' preservation. Future work: • Multi-bit quantization • Multi-dimensional quantization

  36. end..
