CMune: A Clustering Using Mutual Nearest Neighbors Algorithm
Agenda
1. Introduction & Related Work
2. CMune (Clustering Using Mutual Nearest Neighbors) Algorithm, Underlying Data Model and Complexity Analysis
3. Experimentation & Validation (9 data sets, with feature space dimensionality varying from 4 up to 5000). Six well-known validity indices have been used.
4. Conclusion
Clustering
• Clustering of data is an important step in data analysis. The main goal of clustering is to partition data objects into well-separated groups so that objects lying in the same group are more similar to one another than to objects in other groups.
• Clusters can be described in terms of internal homogeneity and external separation.
Natural Clusters
A natural cluster is a cluster of any shape, size, and density. It should not be restricted to a globular shape, as many classical algorithms assume, or to a specific user-defined density, as some density-based algorithms require.
Cluster Prototype
• Many clustering algorithms adopt the notion of a prototype (i.e., a data point that is representative of a set of points).
• This prototype can be:
  • A single data point, as in K-means, K-medoids, DBScan [Martin Ester, 1996], and Cure [Guha et al., 1998].
  • A set of points representing tiny clusters to be merged/propagated, as in Chameleon [Karypis G, 1999] and CMune.
Cluster Representation
Algorithms shown in the figure: CMune (Abbas & Shoukry, 2012), labeled "Reference Block"; Chameleon (Karypis G, 1999); Cure (Guha et al., 1998); DBScan (Martin Ester, 1996); K-medoid (Kaufman & Rousseeuw, 1987); K-means (Forgy, 1965).
CMune relies on the principle of K-Mutual Nearest-Neighbor consistency.
K-Mutual Nearest Neighbors versus K-Nearest Neighbors Consistency
• Principle of K-NB consistency of a cluster [1]: “An object should be in the same cluster as its nearest neighbors”.
• Principle of K-Mutual Nearest-Neighbor consistency (K-MNB consistency): “An object should be in the same cluster as its mutual nearest neighbors”.
K-MNB consistency is stronger than K-NB consistency (i.e., K-MNB consistency implies K-NB consistency).
[1] Lee, J.-S. and Ólafsson, S. (2011). Data clustering by minimizing disconnectivity. Inf. Sci., 181(4):732-746.
CMune: Concept of Mutual Nearest Neighbors
“A” is in 4-NB of “B”; however, “B” is not in 4-NB of “A”. Therefore, “A” and “B” are not mutual nearest neighbors. The K-nearest-neighbor relation is not symmetric, so membership must hold in both directions for two points to be mutual nearest neighbors.
Reference Point and Reference Block/List
Reference point ‘A’ and the reference list RL(A) it represents. RL(A) consists of the points that are mutual nearest neighbors of point ‘A’. It is constructed as the intersection of the set of points in K-NB(A) and the set of points having A in their K-nearest neighborhood.
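The construction of a reference list can be illustrated with a minimal sketch. It assumes Euclidean distance and a brute-force neighbor search; the function names `k_nearest` and `reference_lists` are hypothetical and are not taken from the authors' implementation.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def reference_lists(points, k):
    """RL(i) = K-NB(i) intersected with the set of points that have i in their K-NB."""
    n = len(points)
    knb = [k_nearest(points, i, k) for i in range(n)]
    result = []
    for i in range(n):
        # Points whose own K-nearest neighborhood contains i.
        having_i = {j for j in range(n) if i in knb[j]}
        result.append(knb[i] & having_i)
    return result

# Three close points and one far-away point on a line (illustrative data).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
rl = reference_lists(points, k=2)
print(rl)  # the far point ends up with an empty reference list
```

In this toy example the isolated point has neighbors of its own, but no point lists it among its 2 nearest neighbors, so its reference list is empty.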
Role of Reference Blocks/Lists
• They are considered as dense regions/blocks.
• These blocks are the seeds from which clusters may grow. Therefore, CMune is not a point-to-point clustering algorithm; rather, it is a block-to-block clustering technique.
• Many of its advantages come from two facts: noise points and outliers correspond to blocks of small size, and homogeneous blocks overlap heavily.
Types of Representative Points
• Given a reference list RL(A), A is said to represent RL(A). There are three possible types of representative points:
1) Strong points, representing blocks of size greater than a pre-defined threshold parameter.
2) Noise points, representing empty reference lists (they can neither form clusters nor participate in cluster growing).
3) Weak points, which are neither strong points nor noise points. These points may be merged with other clusters if they are members of other strong points' blocks.
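The three point types can be expressed as a small classification rule. This is only a sketch of the definitions on this slide; the function name `classify_point` and the string labels are hypothetical.

```python
def classify_point(reference_list, threshold):
    """Classify a representative point by the size of its reference list."""
    size = len(reference_list)
    if size == 0:
        return "noise"   # empty reference list
    if size > threshold:
        return "strong"  # block larger than the pre-defined threshold
    return "weak"        # neither strong nor noise

# Illustrative reference lists with a threshold of 2.
examples = {"A": {"B", "C", "D"}, "B": {"A"}, "E": set()}
labels = {name: classify_point(rl, threshold=2) for name, rl in examples.items()}
print(labels)
```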
Clusters Merging/Propagation
C_i greedily chooses C_l to merge with, as they have the maximum mutual intersection among all overlapping reference blocks.
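The greedy choice can be sketched as picking, among the overlapping candidates, the one that shares the most points with C_i. This is an illustration under the assumption that blocks are plain sets of point identifiers; `best_merge_partner` is a hypothetical name, and the sketch ignores the homogeneity test used elsewhere in the algorithm.

```python
def best_merge_partner(cluster, candidates):
    """Return the label of the candidate with the largest intersection with
    `cluster`, or None if no candidate overlaps it."""
    best_label, best_overlap = None, 0
    for label, other in candidates.items():
        overlap = len(cluster & other)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

# Illustrative blocks: C_l shares two points with C_i, C_j shares one, C_m none.
c_i = {1, 2, 3, 4}
candidates = {"C_j": {4, 5, 6}, "C_l": {2, 3, 7}, "C_m": {8, 9}}
partner = best_merge_partner(c_i, candidates)
print(partner)
```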
How to Impose an Order on the Merging of Reference Blocks?
Answer: by cardinality (density) and homogeneity.
Homogeneity factor α = pi qj
Three cases with different homogeneity factors: (a) 0.98, (b) 0.80, and (c) 0.55.
CMune
• Initialize parameters:
  • K {size of the neighbourhood of a point}
  • T {noise threshold / minimum size of a reference list}
• Construct the similarity matrix based on Euclidean distance.
• Construct the reference list for each point p_i: RL(p_i) = K-NB(p_i) ∩ RB(p_i), where RB(p_i) is the set of points having p_i in their K-nearest neighborhood.
• Form a sorted list L based on the cardinality of the reference lists RL(p_i).
• Exclude weak points for which |RL(p_i)| < T.
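The last two steps above (sorting by cardinality and excluding small reference lists) can be sketched as follows. The sketch assumes the reference lists have already been computed and that the list L is ordered from largest to smallest; the descending order and the name `build_sorted_list` are assumptions, not details stated on the slide.

```python
def build_sorted_list(reference_lists, t):
    """Drop points whose reference list has fewer than t members, then sort
    the remaining points by reference-list cardinality (largest first)."""
    kept = [p for p, rl in reference_lists.items() if len(rl) >= t]
    # list.sort is stable, so points with equal cardinality keep their input order.
    kept.sort(key=lambda p: len(reference_lists[p]), reverse=True)
    return kept

# Illustrative reference lists keyed by point name.
reference_lists = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": set(),
}
sorted_l = build_sorted_list(reference_lists, t=2)
print(sorted_l)
```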
Figure panels: Sort w.r.t. homogeneity; Noise and weak points are excluded; First cluster; Test closeness to existing clusters; Cluster merging.
Experimental Results
• Results were assessed using six validity indices (100 experiments / data set / algorithm):
  1. V-measure
  2. F-measure
  3. Adjusted Rand Index
  4. Jaccard Coefficient
  5. Purity
  6. Entropy
• For indices 1 to 5, a higher index value indicates better accuracy; for entropy, a lower index value corresponds to better accuracy.
• Eight data sets:
  1. Iris data
  2. OCR data
  3. Time series data
  4. Letter recognition data
  5. E. coli data
  6. Yeast data
  7. Libras movement data
  8. Gisette data
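To make the direction of the indices concrete, the sketch below computes purity (higher is better) and entropy (lower is better) for a toy labeling. It uses one common textbook definition of each index, with entropy in bits and weighted by cluster size; the exact normalization used in the experiments is not stated on the slide, so treat these formulas as an assumption.

```python
import math
from collections import Counter

def _group_by_cluster(clusters, classes):
    """Map each cluster label to the list of true classes of its members."""
    groups = {}
    for c, t in zip(clusters, classes):
        groups.setdefault(c, []).append(t)
    return groups

def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    groups = _group_by_cluster(clusters, classes)
    majority = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return majority / len(classes)

def entropy(clusters, classes):
    """Size-weighted average of the class entropy (in bits) inside each cluster."""
    groups = _group_by_cluster(clusters, classes)
    total = 0.0
    for g in groups.values():
        h = -sum((n / len(g)) * math.log2(n / len(g)) for n in Counter(g).values())
        total += (len(g) / len(classes)) * h
    return total

# Toy example: cluster 0 mixes two classes, cluster 1 is pure.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes), entropy(clusters, classes))
```

A clustering that matches the true classes exactly gives a purity of 1 and an entropy of 0 under these definitions.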
Experimental Results
• 100 experiments / data set / algorithm are conducted to determine the best values of T and K.
• CMune is compared to 4 state-of-the-art clustering techniques:
  • K-means (Forgy, 1965)
  • DBScan (Martin Ester, 1996)
  • Mitosis (Noha Yousri, 2009)
  • Spectral Clustering (Chen, W.-Y., Song, Y., 2011)
Iris Dataset
This is a well-known database found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Time Series Dataset
The Synthetic Control Charts (SCC) data set includes 600 patterns, each of 60 dimensions (time points). In general, CMune has better index values.
Optical Character Recognition Dataset
Consists of 5620 patterns, each of 64 dimensions. The features describe the bitmaps of the digit characters. The aim is to properly classify the digit characters into 10 classes, from 0 to 9. In general, CMune has better index values.
Ecoli Dataset
This data set contains protein localization sites. Consists of 336 patterns, each of 8 dimensions. In general, CMune has better index values.
Libras Movement Dataset
The data set contains 15 classes of 24 instances each, where each class refers to a hand movement type. Consists of 360 patterns, each of 91 dimensions. In general, CMune has better index values.
Yeast Dataset
The Yeast protein localization sites data set, obtained from the UCI repository. Consists of 1484 patterns, each of 8 dimensions. In general, CMune has better index values.
Letter Recognition Dataset
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. Consists of 20000 patterns, each of 16 attributes. In general, CMune has better index values.
SPECT Heart Dataset*
The data set describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Consists of 267 patterns, each of 22 dimensions. Each patient is classified into one of two categories: normal and abnormal. In general, CMune has better index values.
Breast Cancer Diagnostic Dataset*
Consists of 569 patterns, each of 30 dimensions. The aim is to classify the data into two diagnoses (malignant and benign). In general, CMune has better index values.
Gisette Dataset
GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusable digits '4' and '9'. This data set is one of five data sets of the NIPS 2003 feature selection challenge. Consists of 13500 patterns, each of 5000 dimensions. In general, CMune has better index values.
Conclusion
• We present a novel clustering algorithm based on the mutual nearest neighbor concept. It can find clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers and in high-dimensional spaces.
• Clusters are represented by reference blocks (points + list). Two clusters can be merged if their link strength is maximal (i.e., their reference blocks have maximum intersection). Any data point not belonging to a cluster is considered noise.
• The results of our experimental study on several data sets are encouraging. CMune solutions have been found, in general, superior to those obtained by DBScan, K-means, and Mitosis, and competitive with the spectral clustering algorithm.
• We intend to parallelize our algorithm, as its clustering propagation is inherently parallel, and to determine T through statistical analysis.
• The algorithm is publicly available to other researchers at http://www.csharpclustering.com.
CSHARP Evolution
• CSHARP [Abbas, Shoukry & Kashef, 2012] was presented for the purpose of finding clusters of arbitrary shapes and arbitrary densities in high-dimensional feature spaces. It can be considered a variation of the Shared Nearest Neighbor (SNN) algorithm (Ertoz, 2003).
• A modified version of CSHARP was then presented [Abbas & Shoukry, 2012]. The modification incorporates a new measure of cluster homogeneity.
• In this paper, an enhanced version of Modified CSHARP is presented. Specifically, the number of parameters has been reduced from three to two, to reduce the effort needed to select the best parameters.
Algorithm Complexity
The overall time complexity of CMune is expressed in terms of N, the number of data points, and K, the number of nearest neighbors. The space complexity of CMune likewise depends on N and K, since only the K nearest neighbors of each data point are required.
Speed Performance
Figure: speed of CMune compared to (a) DBScan and (b) K-means and DBScan, using the letter recognition data set (speed performance versus data size). CMune's speed performance is “stable”.