
Data Clustering Methods



  1. Data Clustering Methods Docent Xiao-Zhi Gao Department of Automation and Systems Technology

  2. Data Clustering • Data clustering is used for data organization, data compression, and model construction • Clustering partitions a data set into groups such that the similarity within a group is larger than that between groups • Similarity needs to be defined • A metric of the difference between two input vectors

  3. Clusters in Data Data need to be normalized into a hypercube beforehand

  4. Similarity?

  5. Similarity • Similarity can be defined through the distance between two vectors in the data space • There are a few choices • Euclidean distance (real-valued vectors) • Hamming distance (binary or symbolic vectors) • Manhattan distance (any type)

  6. Euclidean Distance • Euclidean distance between two vectors is defined as:
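The formula on this slide is shown only as an image; for two n-dimensional vectors x and y, the standard Euclidean distance is:

d(x, y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }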

  7. Hamming Distance • Hamming distance is the number of positions at which the corresponding symbols of two vectors are different • For example, • "toned" and "roses" is 3 • "1011101" and "1001001" is 2 • "2173896" and "2233796" is 3

  8. Manhattan Distance • Manhattan distance (city block distance) is the sum of the absolute differences of the coordinates of two vectors, i.e., the length of any shortest path connecting them along axis-parallel segments Taxicab geometry
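The three distances above can be illustrated with a short sketch; the function names are illustrative choices, not part of the original slides:

```python
import math

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences (real-valued vectors)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming_distance(x, y):
    # Number of positions at which the corresponding symbols differ
    return sum(1 for a, b in zip(x, y) if a != b)

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences (city block / taxicab distance)
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(hamming_distance("toned", "roses"))   # 3
print(manhattan_distance([1, 2], [4, 6]))   # 7
```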

  9. K-Means Clustering Method • K-means clustering method partitions a collection of n vectors into c groups Gi, i=1, 2, ..., c, and finds the cluster centers in these groups so as to minimize a given dissimilarity measure

  10. K-Means Clustering Method • The dissimilarity measure (cost function) can be calculated using the Euclidean distance in the K-means clustering method:
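The cost function appears only as an image in the original slides; with cluster centers c_i and groups G_i, its usual Euclidean form is:

J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \left( \sum_{x_k \in G_i} \lVert x_k - c_i \rVert^2 \right)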

  11. K-Means Clustering Method • The binary membership matrix U is a c×n matrix defined as follows: • xj belongs to group i if ci is the closest center among all the centers
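The definition shown on the slide is not reproduced in the transcript; the standard binary membership matrix is:

u_{ij} = \begin{cases} 1 & \text{if } \lVert x_j - c_i \rVert^2 \le \lVert x_j - c_k \rVert^2 \text{ for each } k \ne i \\ 0 & \text{otherwise} \end{cases}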

  12. K-Means Clustering Method • To minimize the cost function J, the optimal center of a group should be the mean of all the vectors in that group:
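The update formula on the slide is missing from the transcript; the mean-based center of group G_i with |G_i| members is:

c_i = \frac{1}{|G_i|} \sum_{x_k \in G_i} x_k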

  13. K-Means Clustering Method • K-means clustering method is an iterative algorithm to find cluster centers
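A minimal sketch of the iteration (center initialization, membership assignment, center update) is given below; the function name, the random initialization, and the convergence test are illustrative assumptions, not taken from the slides:

```python
import random

def k_means(data, c, max_iter=100, tol=1e-6):
    # Step 1: pick c initial centers (here simply c random data points)
    centers = random.sample(data, c)
    for _ in range(max_iter):
        # Step 2: assign every vector to its closest center (binary membership)
        groups = [[] for _ in range(c)]
        for x in data:
            nearest = min(range(c),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
            groups[nearest].append(x)
        # Step 3: recompute each center as the mean of the vectors in its group
        new_centers = []
        for i, g in enumerate(groups):
            if g:
                new_centers.append([sum(col) / len(g) for col in zip(*g)])
            else:
                new_centers.append(centers[i])  # keep the old center of an empty group
        # Step 4: stop when the centers no longer move
        shift = max(sum((a - b) ** 2 for a, b in zip(cn, co))
                    for cn, co in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```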

  14. K-Means Clustering Method • There is no guarantee that it converges to an optimal solution • Optimization methods might be used to minimize the cost function J • The performance of the k-means clustering method depends on the initial cluster centers • Front-end methods should be employed to find good initial centers

  15. K-Means Clustering Method • The k-means clustering method might have problems with clusters of • different densities • non-globular shapes • The k-means clustering method is a 'hard' data clustering approach • Data may belong to clusters to different degrees, which leads to the fuzzy k-means method

  16. Clusters of Different Densities c=3

  17. Clusters of Non-globular Shapes c=2

  18. Butterfly Data

  19. Mountain Clustering Method • The mountain clustering method (Yager, 1994) approximates clusters based on a density measure of the data • The mountain clustering method can be used either as a stand-alone algorithm or for obtaining the initial clusters for other data clustering approaches

  20. Mountain Clustering Method • Step 1: Form a grid in the data space; the intersections of the grid lines are considered as cluster center candidates, denoted as a set V • The grid need not be evenly spaced • A fine grid is needed, but it increases the computational burden

  21. Mountain Clustering Method • Step 2: Construct mountain functions representing data density measure. The height of the mountain function at v is:
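The expression on this slide is shown only as an image; a commonly used form of the mountain function at a candidate point v, with smoothing constant \sigma, is:

m(v) = \sum_{i=1}^{n} \exp\left( -\frac{\lVert x_i - v \rVert^2}{2 \sigma^2} \right)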

  22. Mountain Clustering Method • Each input vector x contributes to the height of the mountain function at every candidate point v • The contribution decreases as the distance d(x, v) grows • The mountain function is a measure of data density (it is higher if more data points are located nearby)

  23. Mountain Clustering Method • Step 3: Select the cluster centers and destruct the mountain functions • The point with the largest mountain height is selected as a cluster center

  24. Mountain Clustering Method • The just-identified center is often surrounded by input data with high density • The effect of the just-identified center should be eliminated • The mountain function is revised by subtracting a scaled Gaussian function centered at it
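The revision formula is not reproduced in the transcript; denoting the just-selected center by c_1 and its mountain height by m(c_1), the usual destruction step with constant \beta is:

m_{\text{new}}(v) = m(v) - m(c_1) \exp\left( -\frac{\lVert v - c_1 \rVert^2}{2 \beta^2} \right)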

  25. Mountain Functions • The smoothing constant (0.02, 0.1, and 0.2 in the original figure) may affect the smoothness of the mountain functions

  26. Mountain Destruction Cluster centers are selected, and mountains are destructed sequentially
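A compact sketch of steps 1-3 for two-dimensional data normalized to the unit square is given below; the grid resolution and the sigma and beta values are illustrative assumptions:

```python
import math
from itertools import product

def mountain_clustering(data, n_clusters, grid_steps=20, sigma=0.1, beta=0.1):
    # Step 1: form a grid; its intersections are the cluster center candidates V
    axis = [i / (grid_steps - 1) for i in range(grid_steps)]
    grid = [list(v) for v in product(axis, repeat=2)]

    def gauss(a, b, width):
        # Gaussian contribution that decreases with the distance between a and b
        d2 = sum((p - q) ** 2 for p, q in zip(a, b))
        return math.exp(-d2 / (2 * width ** 2))

    # Step 2: mountain function = data density measure at every grid point
    heights = [sum(gauss(x, v, sigma) for x in data) for v in grid]

    # Step 3: pick the highest point, then destruct the mountain around it
    centers = []
    for _ in range(n_clusters):
        best = max(range(len(grid)), key=lambda i: heights[i])
        c, m_c = grid[best], heights[best]
        centers.append(c)
        heights = [h - m_c * gauss(v, c, beta) for h, v in zip(heights, grid)]
    return centers
```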

  27. Subtractive Clustering • The mountain clustering method is simple but becomes time consuming as the dimension of the data grows • Replacing the grid points with the data points in mountain clustering yields subtractive clustering (Chiu, 1994) • Only the data points are considered as cluster center candidates

  28. Subtractive Clustering • The density measure of a data point is defined as follows: • The density measure of each data point is revised sequentially after a cluster center is selected
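The two expressions referred to above appear only as images in the original slides; the forms given by Chiu (1994), with neighborhood radii r_a and r_b, are:

D_i = \sum_{j=1}^{n} \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{(r_a / 2)^2} \right)

and, after the point x_{c_1} with density D_{c_1} has been selected as a center,

D_i \leftarrow D_i - D_{c_1} \exp\left( -\frac{\lVert x_i - x_{c_1} \rVert^2}{(r_b / 2)^2} \right)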

  29. Conclusions • Three typical off-line data clustering methods have been introduced • They often operate in batch mode • The prototypes found by these data clustering methods characterize the data sets and can be used as 'codebooks'

  30. An Application Example

  31. Computer Exercises I

  32. Computer Exercises II
