240 likes | 253 Views
Parameter Reduction for Density-based Clustering on Large Data Sets. Elizabeth Wang. Input Parameters of DBSCAN. The neighborhood radius, r. The minimum number, k, of neighbors to be a core point – the seed for expansion
E N D
Parameter Reduction for Density-based Clustering on Large Data Sets Elizabeth Wang
Input Parameters of DBSCAN • The neighborhood radius, r. • The minimum number, k, of neighbors to be a core point – the seed for expansion The clustering quality highly depends on the input parameters, especially r.
Content • Introduction • Related Work • Parameter Reduction for Density-based Clustering • Experiments and Observations • Determination of the neighborhood radii • Nonparametric Density-based Clustering • Performance Analysis • Conclusion
Introduction • Density-based clustering is widely used in various spatial applications such as geographical information analysis, medical applications, sky data observatories and satellite image analysis. • In density-based clustering, clusters are dense areas of points in the data space that are separated by areas of low density (noises) • One of the problems of density-based clustering is minimal domain knowledge to determine the input parameters.
Our Approach to Solve the Problem • We explore an automatic approach to determine the minimum neighborhood radii based on data distribution of the dataset. • The algorithm, MINR, is developed to determine the minimum neighborhood radii for different density clusters based on many experiments and observations. • We combine MINR with the enhanced DBCSCAN, e-DBCSCAN, into a nonparametric density-based clustering algorithm (NPDBC).
Problem with Input Parameter • In the density based clustering, the main input is the minimum neighborhood radius. • A dataset may consist of clusters with same density or different densities. Figure shows some possible distributions of X. • When clusters are in different densities, it is more difficult to determine the minimum neighborhood radii.
Related Work • Clustering methods • Attempts to reduce parameters • Enhanced DBSCAN clustering
Clustering Methods • There are mainly two clustering methods: • similarity-based partitioning methods • density-based clustering methods. • A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. • The major problems with partitioning methods are: • k has to be predetermined; • it is difficult to identify clusters with different sizes; • it only finds convex clusters.
Clustering Methods (Cont.) • Density-based clustering methods are used to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [1]. • DBSCAN is very sensitive to input parameters, which are the neighborhood radius (r) and a minimum number of neighbors (MinPts). • Another density-based algorithm DenClue uses a grid and is very efficient. This algorithm generalizes some other clustering approaches which, however, results in a large number of input parameters.
Attempts To Reduce Parameters • There have been many efforts to make clustering process parameter-free, such as OPTICS [8], CHAMELEON [7] and TURN*[2]. • OPTICS computes an augmented cluster ordering, which represents the density-based clustering structure of the data. This method is used for interactive cluster analysis. • CHAMELEON has been found to be very effective in clustering convex shapes. However, the algorithm cannot handle outliers and needs parameter setting to work effectively. • TURN* is a brute force approach. It first decreases the neighborhood radius to so small that every data point becomes noise. Then the radius is doubled each time to do clustering until it finds a “turn.” Even though it chooses big steps, the computation time is not promising for large datasets with various densities.
Parameter Reduction • DBSCAN have two input parameters: • the minimum number of neighbors, k, • the neighborhood radius, r. • k is the size of the smallest cluster. DBSCAN set k to 4 [1]. TURN* also treats it as a fixed value [2]. The clustering comparison between k= 4 and k=7 are shown in the next slide. • Therefore, the only input parameter is the minimum neighborhood radius, r. • r depends on data distribution of the dataset. It should be different for different density clusters.
Observation 1 • Define R as a distance between each point x and its 4th nearest neighbor. • The points are then sorted based on R in ascending order. DS1 is a dataset used by DBSCAN. The data size is 200. DS2 is reproduced from a dataset used by CHAMELEON. (a) DS1 (b) DS2 (c) R-x of DS1 (d) R-x of DS2
Observation 2 • Given a neighborhood radius r, we calculate the number of neighbors for each point within the given radius, denoted as K • Sort the points in descending order, and get the sorted K-x graph. • When the neighborhood radius is close to the maximum R, the K-x graph shows “knees” very clearly.
Observation 2 (Cont.) • In order to find the “knees” in the graph, we calculate the differentials of the graphs, Ks. • The knees are close to the points with peak differentials. • The number of “knees” is equal to the number of cluster densities in the dataset. • Intuitively, we infer that the points divided by “knees” belong to different density clusters or noise.
Observation 3 • Sort the dataset DS2 based on K, and then partition the sorted dataset into three subsets separated by two “knees.” • The two “knees” are at positions of 10000 and 15500. Therefore the three partitions are 0 – 10000, 10001 – 15500, and 15501-17500.
Determination of the neighborhood radii • Based on the experiments, we develop an algorithm to automatically determine the minimum neighborhood radii for mining clusters with different densities, MINR, based on the data distribution.
Nonparametric Density-based Clustering • Given a series of radii, start clustering using the enhance DBSCAN algorithm, e-DBSCAN, with the smallest r = r1. The densest cluster(s) would be formed. • Then set r = r2. Only process those unclustered points. The next sparser cluster(s) are formed.
Nonparametric Density-based Clustering • Calculates a series of neighborhood radii for different density clusters using MINR • Iterative clustering process using e-DBSCAN with the radii. • Merge any pair of clusters which share most of the boundary points of either cluster.
Performance Analysis • We compare our nonparametric density-based clustering algorithm (NPDBC) with the performance of TURN*. • We show the run time comparisons on the dataset, DS2, we discussed above. • In order to make the data contain the clusters in different densities, we artificially insert more data in some clusters to make them denser than the others. • The resulted datasets have the sizes from 10k to 200k.
Performance Analysis (Cont.) • NPDBC is more efficient than TURN* for large datasets. • The reason is that for NPDBC, the parameters are computed once at the beginning of the clustering process • TURN* algorithm tries different neighborhood radii until the first “turn” is found in case of two different densities. • We only compare NPDBC with TURN* on datasets with two different densities. If the density variety increases, NPDBC will outperform TURN* much more.
Conclusion and Future Work • In this paper, we explore an automatic approach to determine this parameter based on the distribution of datasets. • The algorithm, MINR, is developed to determine the minimum neighborhood radii for different density clusters. • We developed a nonparametric clustering method (NPDBC) by combining MINR with the enhanced DBCSCAN, e-DBCSCAN. • Experiments show our NPDBC is more efficient and scalable than TURN* for clusters in two different densities. • In our future work, we will implement our NPDBC using the vertical data structure, P-tree, the efficient data mining ready data representation.