Data Clustering Methods Docent Xiao-Zhi Gao Department of Automation and Systems Technology
Data Clustering • Data clustering is used for data organization, data compression, and model construction • Clustering partitions a data set into groups such that the similarity within a group is larger than that between groups • Similarity needs to be defined • A metric of the difference between two input vectors is required
Clusters in Data • Data need to be normalized into a hypercube beforehand
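One way to carry out such a normalization is min-max scaling of each coordinate into [0, 1]. The sketch below assumes that choice (the slide does not name a specific method), and the function name `normalize_to_unit_hypercube` is hypothetical.

```python
def normalize_to_unit_hypercube(data):
    """Min-max scale each coordinate of `data` (a list of equal-length lists) into [0, 1]."""
    dims = len(data[0])
    lows = [min(row[d] for row in data) for d in range(dims)]
    highs = [max(row[d] for row in data) for d in range(dims)]
    scaled = []
    for row in data:
        new_row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant coordinate has zero span; map it to 0.0 to avoid dividing by zero.
            new_row.append((row[d] - lows[d]) / span if span > 0 else 0.0)
        scaled.append(new_row)
    return scaled


points = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
print(normalize_to_unit_hypercube(points))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Without this step, a coordinate with a large numeric range would dominate any distance-based similarity.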
Similarity • Similarity can be defined in terms of the distance between two vectors in the data space • There are a few choices • Euclidean distance (real values) • Hamming distance (binary values or symbols) • Manhattan distance (any)
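Euclidean Distance • Euclidean distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ with $p$ components is defined as: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{m=1}^{p} (x_m - y_m)^2}$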
Hamming Distance • Hamming distance is the number of positions at which the corresponding symbols of two equal-length vectors differ • For example, the Hamming distance between • "toned" and "roses" is 3 • "1011101" and "1001001" is 2 • "2173896" and "2233796" is 3
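The three examples on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `hamming_distance` is chosen here for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(hamming_distance("toned", "roses"))      # 3
print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2173896", "2233796"))  # 3
```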
Manhattan Distance • Manhattan distance (city block distance) is the sum of the absolute coordinate differences between the two vectors, i.e., the length of a shortest path connecting them along axis-parallel segments (taxicab geometry)
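To compare the two real-valued distances side by side, here is a small sketch using only the standard library. The function names are illustrative, not taken from the slides.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """City block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean_distance(x, y))  # 5.0
print(manhattan_distance(x, y))  # 7.0
```

For the same pair of vectors the Manhattan distance is never smaller than the Euclidean distance, since the taxicab path cannot cut diagonally.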
K-Means Clustering Method • K-means clustering method partitions a collection of n vectors into c groups G_i, i = 1, 2, ..., c, and finds the cluster centers in these groups so as to minimize a given dissimilarity measurement
K-Means Clustering Method • The dissimilarity measurement (cost function) can be calculated using Euclidean distance in K-means clustering method
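One common way to write this cost function, assuming squared Euclidean distance and using $\mathbf{x}_k$ for a data vector and $\mathbf{c}_i$ for the center of group $G_i$ (this notation is an assumption made here), is:

$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \; \sum_{k,\, \mathbf{x}_k \in G_i} \lVert \mathbf{x}_k - \mathbf{c}_i \rVert^2$$

Each inner sum $J_i$ is the within-group cost of group $G_i$, so $J$ grows whenever a vector sits far from the center of its own group.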
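K-Means Clustering Method • The binary membership matrix U is a c × n matrix defined as follows: $u_{ij} = 1$ if $\mathbf{x}_j$ belongs to group i, and $u_{ij} = 0$ otherwise • $\mathbf{x}_j$ belongs to group i if $\mathbf{c}_i$ is the closest center among all the centers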
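A minimal sketch of how such a membership matrix could be built for given centers follows. The function name `membership_matrix` and the tie-breaking rule (the lowest center index wins) are assumptions made for illustration.

```python
import math


def membership_matrix(data, centers):
    """Return U as a list of c rows and n columns with entries in {0, 1}."""
    u = [[0] * len(data) for _ in centers]
    for j, x in enumerate(data):
        # Index of the closest center; on a tie, min() keeps the lowest index.
        closest = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
        u[closest][j] = 1
    return u


data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
centers = [[0.0, 0.5], [5.0, 5.0]]
print(membership_matrix(data, centers))  # [[1, 1, 0], [0, 0, 1]]
```

Every column of U contains exactly one 1, because each vector belongs to exactly one group.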
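K-Means Clustering Method • To minimize the cost function J, the optimal center of a group should be the mean of all the vectors in that group: $\mathbf{c}_i = \frac{1}{|G_i|} \sum_{k,\, \mathbf{x}_k \in G_i} \mathbf{x}_k$, where $|G_i|$ is the number of vectors in group $G_i$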
K-Means Clustering Method • K-means clustering method is an iterative algorithm to find cluster centers
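The iteration can be sketched as alternating between assigning vectors to their closest center and moving each center to the mean of its group. The code below is an illustrative sketch rather than the slides' exact procedure: the random choice of initial centers, the stopping rule (memberships no longer change), and the handling of empty groups are all assumptions.

```python
import math
import random


def kmeans(data, c, max_iter=100, seed=0):
    """Return (centers, labels, cost) for a plain k-means run on a list of vectors."""
    rng = random.Random(seed)
    # Assumption: initial centers are c distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(data, c)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins the group of its closest center.
        new_labels = [
            min(range(c), key=lambda i: math.dist(x, centers[i])) for x in data
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so the centers are stable
        labels = new_labels
        # Update step: each center becomes the mean of the vectors in its group.
        for i in range(c):
            members = [x for x, lab in zip(data, labels) if lab == i]
            if members:  # an empty group keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    # Cost J: sum of squared Euclidean distances to the assigned centers.
    cost = sum(math.dist(x, centers[lab]) ** 2 for x, lab in zip(data, labels))
    return centers, labels, cost


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels, cost = kmeans(data, c=2)
print(centers, labels, cost)
```

Changing `seed` changes the initial centers, which is a simple way to observe how the result can depend on initialization.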
K-Means Clustering Method • There is no guarantee that it can converge to an optimal solution • Optimization methods might be used to deal with cost function J • The performance of k-means clustering method depends on the initial cluster centers • Front-end methods should be employed to find good initial centers
K-Means Clustering Method • K-means clustering method might have problems with clusters of • different densities • non-globular shapes • K-means clustering method is a ’hard’ data clustering approach: each vector belongs to exactly one cluster • In many cases, data should instead belong to clusters to varying degrees • This is the idea behind the fuzzy k-means method
Mountain Clustering Method • Mountain clustering method (Yager, 1994) approximates cluster centers based on a density measure of the data • Mountain clustering method can be used either as a stand-alone algorithm or for obtaining initial cluster centers for other data clustering approaches
Mountain Clustering Method • Step 1: Form a grid in the data space; the intersections of the grid lines are considered as candidate cluster centers, denoted as a set V • The grid is not necessarily evenly spaced • A fine grid is needed, but it increases the computational burden
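Mountain Clustering Method • Step 2: Construct mountain functions representing a data density measure. The height of the mountain function at a grid point $\mathbf{v} \in V$ is: $m(\mathbf{v}) = \sum_{i=1}^{N} \exp\left(-\frac{\lVert \mathbf{v} - \mathbf{x}_i \rVert^2}{2\sigma^2}\right)$, where $\mathbf{x}_i$ is the i-th of N data points and $\sigma$ is an application-specific constant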
Mountain Clustering Method • Each input vector x contributes to the height of the mountain function at every grid point v • The contribution decreases as the distance d(x, v) between them grows • The mountain function is a measure of data density (higher if more data points are located nearby)
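The density interpretation can be illustrated with a short sketch. It assumes a Gaussian-shaped contribution with a smoothing constant `sigma`, an evenly spaced grid on the unit square, and made-up data; all names and values below are hypothetical.

```python
import math


def mountain_height(v, data, sigma):
    """Sum of Gaussian-shaped contributions of all data points at grid point v."""
    return sum(math.exp(-math.dist(v, x) ** 2 / (2 * sigma ** 2)) for x in data)


def make_grid(steps):
    """Evenly spaced 2-D grid on the unit square (an assumption; grids may be uneven)."""
    ticks = [i / (steps - 1) for i in range(steps)]
    return [(a, b) for a in ticks for b in ticks]


# Illustrative data: three points near (0.2, 0.2) and two near (0.8, 0.8).
data = [(0.2, 0.2), (0.22, 0.18), (0.18, 0.21), (0.8, 0.8), (0.79, 0.82)]
grid = make_grid(11)
heights = {v: mountain_height(v, data, sigma=0.1) for v in grid}
peak = max(heights, key=heights.get)
print(peak, round(heights[peak], 3))
```

The highest grid point lies in the denser of the two groups, which is the behavior the bullets above describe.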
Mountain Clustering Method • Step 3: Select cluster centers and destruct mountain functions • The points with the largest mountain heights are selected as cluster centers
Mountain Clustering Method • The just-identified centers are often surrounded by input data with high density • The effects of the just-identified centers should be eliminated • The mountain functions are revised by subtracting a scaled Gaussian function
Mountain Functions • Figure: mountain functions for parameter values 0.02, 0.1, and 0.2 • The choice of this parameter may affect the smoothness of the mountain functions
Mountain Destruction • Cluster centers are selected, and mountains are destructed sequentially
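The select-then-destruct loop might be sketched as follows. It assumes Gaussian mountain functions with constant `sigma` and a subtracted Gaussian, scaled by the height of the selected peak, with constant `beta`. Both constants, the grid, and the data are illustrative assumptions.

```python
import math


def mountain_clustering(data, grid, n_centers, sigma=0.1, beta=0.1):
    """Pick n_centers grid points by repeated peak selection and mountain destruction."""
    heights = {
        v: sum(math.exp(-math.dist(v, x) ** 2 / (2 * sigma ** 2)) for x in data)
        for v in grid
    }
    centers = []
    for _ in range(n_centers):
        c = max(heights, key=heights.get)  # grid point with the largest height
        peak = heights[c]
        centers.append(c)
        # Destruction: subtract a Gaussian centered at c, scaled by the peak height.
        for v in heights:
            heights[v] -= peak * math.exp(-math.dist(v, c) ** 2 / (2 * beta ** 2))
    return centers


ticks = [i / 10 for i in range(11)]
grid = [(a, b) for a in ticks for b in ticks]
data = [(0.2, 0.2), (0.22, 0.18), (0.18, 0.21), (0.8, 0.8), (0.79, 0.82)]
print(mountain_clustering(data, grid, n_centers=2))  # [(0.2, 0.2), (0.8, 0.8)]
```

After the first destruction the height at the selected center drops to zero, so the next peak is found in a different dense region.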
Subtractive Clustering • Mountain clustering method is simple, but its computation time grows quickly with the dimension of the data, because the number of grid points grows with it • Replacing the grid points with the data points in mountain clustering gives subtractive clustering (Chiu, 1994) • Only data points are considered as cluster center candidates
Subtractive Clustering • A density measure is defined for each data point • The density measure of each data point is revised sequentially as cluster centers are selected
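A possible sketch of this procedure is given below. It assumes a commonly used form of the density measure, a sum of Gaussian-shaped terms with neighborhood radius `r_a`, and a revision step with a larger radius `r_b`. The specific formulas, the default radii, and the data are assumptions, not taken from the slides.

```python
import math


def subtractive_clustering(data, n_centers, r_a=0.5, r_b=0.75):
    """Pick n_centers data points by repeated density peak selection and revision."""
    # Density of each data point: high when many other points lie within about r_a.
    density = [
        sum(math.exp(-math.dist(xi, xj) ** 2 / (r_a / 2) ** 2) for xj in data)
        for xi in data
    ]
    centers = []
    for _ in range(n_centers):
        k = max(range(len(data)), key=lambda i: density[i])
        peak = density[k]
        centers.append(data[k])
        # Revision: reduce the density of points near the newly selected center.
        density = [
            d - peak * math.exp(-math.dist(x, data[k]) ** 2 / (r_b / 2) ** 2)
            for x, d in zip(data, density)
        ]
    return centers


data = [(0.2, 0.2), (0.22, 0.18), (0.18, 0.21), (0.8, 0.8), (0.79, 0.82)]
print(subtractive_clustering(data, n_centers=2))
```

The cost here scales with the square of the number of data points rather than with the number of grid points, which is why the data dimension matters far less than in mountain clustering.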
Conclusions • Three typical off-line data clustering methods are introduced • They often operate in the batch mode • The prototypes characterizing data sets found by the data clustering methods can be used as ’codebooks’