Fuzzy Clustering with Multiple Kernels Naouel Baili Multimedia Research Laboratory Computer Engineering & Computer Science Dept. University of Louisville, USA April 2010
Outline • Introduction • Prototype-based Fuzzy Clustering • Proposed Fuzzy C-Means with Multiple Kernels • Preliminary results • Relational data Fuzzy Clustering • Proposed Relational Fuzzy C-Means with Multiple Kernels • Preliminary results • Conclusions
Introduction, what is clustering? • Clustering • The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures; • As a data mining task, data clustering aims at identifying clusters, or densely populated regions, according to some measurement or similarity function; • Intra-cluster distances are minimized while inter-cluster distances are maximized. • Studied and applied in many fields • Statistics; • Spatial databases; • Machine learning; • Data mining.
Introduction, Data Clustering Methods • Hierarchical clustering • Organize elements into a tree; leaves represent objects, and the length of the paths between leaves represents the distances between objects. Similar objects lie within the same sub-trees. • Partitional clustering • Organize elements into disjoint groups; • Hard vs. Fuzzy clustering • Kernel-based clustering • Spectral clustering • Object data vs. Relational data clustering
Kernel methods: the mapping • A kernel • is a similarity measure • defined by an implicit mapping φ from the original space to a vector space (feature space) • such that k(x, x′) = ⟨φ(x), φ(x′)⟩. (Figure: the mapping φ takes points from the original space to the feature (vector) space.)
Benefits from Kernels • Generalizes (nonlinearly) pattern recognition algorithms in clustering, classification, density estimation, … • When these algorithms are dot-product based, by replacing the dot product with k(x, y): e.g. linear discriminant analysis, logistic regression, perceptron, SOM, PCA, ICA, … • When these algorithms are distance-based, by replacing d²(x, y) with k(x,x) + k(y,y) − 2k(x,y) • Freedom in choosing the mapping φ implies a large variety of learning algorithms
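As a small illustration of the distance-based substitution above (a sketch; the helper names are mine, not from the slides), the kernel-induced squared distance can be computed from kernel evaluations alone:

```python
import numpy as np

def kernel_distance_sq(k, x, y):
    """Squared feature-space distance induced by a kernel k:
    ||phi(x) - phi(y)||^2 = k(x,x) + k(y,y) - 2*k(x,y)."""
    return k(x, x) + k(y, y) - 2.0 * k(x, y)

# With a plain linear kernel (dot product), the kernel-induced distance
# reduces to the ordinary squared Euclidean distance.
def linear(a, b):
    return float(np.dot(a, b))

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])
print(kernel_distance_sq(linear, x, y))  # = ||x - y||^2 = 8.0
```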
Gaussian Kernel • Probably the most popular kernel in practice: k(x, x′) = exp(−‖x − x′‖² / 2σ²) • This kernel requires tuning for the proper value of σ: • Manual tuning (trial and error); • Brute-force search: stepping through a range of values for σ, or running a gradient-ascent optimization, seeking optimal performance of a model on training data. • Although these approaches are feasible with supervised learning, it is much more difficult to tune σ for unsupervised learning methods.
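A minimal sketch of the Gaussian kernel and of why σ matters (illustrative only; the example points and values are my own, not from the slides):

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])  # ||x - y|| = 5
for sigma in (1.0, 5.0, 10.0):
    print(sigma, gaussian_kernel(x, y, sigma))
# A small sigma makes the two points look dissimilar (k close to 0), a large
# sigma makes them look similar (k close to 1) -- hence the need for tuning.
```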
Limitations, varying densities (1) • Kernel-based clustering remains feasible, but only after trying several choices of σ. (Figures: the original data set and the Gaussian-kernel clustering results for σ = 5 and σ = 8.)
Limitations, varying densities (2) • The success of kernel-based clustering relies on the choice of the kernel function; • It is often unclear which kernel is the most suitable for a particular task; • Kernel-based clustering maps all points using the same global similarity. (Figures: the original data set and the Gaussian-kernel clustering results for σ = 2 and σ = 5.)
Contributions (1) • Construct the kernel from a number of multi-resolution Gaussian kernels • And learn a resolution-specific weight for each kernel in each cluster • Better characterization; • Density fitting; • Adaptivity to each individual cluster. (Figure: the original data set.)
Contributions (2) • Fuzzy C-Means with Multiple Kernels (FCM-MK) • Unsupervised • Object data • Prototype defined in the input space • Clusters with varying sizes and densities • Relational Fuzzy C-Means with Multiple Kernels (RFCM-MK) • Unsupervised • Relational data • Clusters of different shapes with unbalanced densities • Multiple-resolution within the same cluster
Part 1 – FCM-MK Part 1 – Prototype-based Clustering • Fuzzy C-Means with Multiple Kernels
Part 1 – FCM-MK Input, Output • Input: unlabeled object data, the number of clusters, and a set of Gaussian kernels at multiple resolutions • Output: fuzzy membership matrix, cluster prototypes in the input space, and a resolution-specific weight for each kernel in each cluster
Part 1 – FCM-MK Kernel-based Similarity • We construct a new kernel-induced similarity defined as • The normalized kernel is given by • The distance between a point and a cluster center in feature space is
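The formulas on this slide did not survive extraction. As one plausible form, consistent with the idea of a resolution-specific weight for each kernel in each cluster (an assumption on my part, not necessarily the slide's exact definition):

```latex
% Sketch (assumed form): cluster-dependent kernel as a weighted combination
% of M Gaussian base kernels, with non-negative resolution-specific weights.
\kappa_i(\mathbf{x}_j, \mathbf{x}_k) \;=\; \sum_{l=1}^{M} w_{il}\, K_l(\mathbf{x}_j, \mathbf{x}_k),
\qquad w_{il} \ge 0 .
% The induced feature-space distance between point x_j and prototype a_i
% then follows the usual kernel trick:
D_{ij}^{2} \;=\; \kappa_i(\mathbf{x}_j, \mathbf{x}_j) \;-\; 2\,\kappa_i(\mathbf{x}_j, \mathbf{a}_i) \;+\; \kappa_i(\mathbf{a}_i, \mathbf{a}_i).
```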
Part 1 – FCM-MK Objective function • Optimization of an “objective function” or “performance index”
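The objective itself was not extracted; a sketch of the usual FCM-style form it would take with the multiple-kernel distance D_ij and a fuzzifier m > 1 (both standard FCM ingredients, assumed here rather than taken from the slide):

```latex
% FCM-style objective sketch with the cluster-specific, multiple-kernel distance:
J \;=\; \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{\,m}\, D_{ij}^{2},
\qquad \text{subject to } \sum_{i=1}^{C} u_{ij} = 1 \ \ \forall j .
```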
Part 1 – FCM-MK Minimizing objective function (1) • Zeroing the gradient of the objective function with respect to the memberships • Zeroing the gradient of the objective function with respect to the cluster prototypes
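For reference, zeroing the gradient with respect to the memberships under the sum-to-one constraint yields the familiar FCM-style update (shown in its generic form; the slide's exact expression may differ):

```latex
% Generic FCM membership update obtained from the Lagrangian of J:
u_{ij} \;=\; \left[ \sum_{k=1}^{C} \left( \frac{D_{ij}^{2}}{D_{kj}^{2}} \right)^{\frac{1}{m-1}} \right]^{-1}.
```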
Part 1 – FCM-MK Minimizing objective function (2) • We optimize with respect to the resolution-specific weights using the gradient descent method
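A generic sketch of such a gradient-descent step (the learning rate η and the projection back onto non-negative, normalized weights are assumptions, not taken from the slides):

```latex
% Gradient-descent update for the resolution-specific weights, followed by a
% projection that keeps the weights non-negative and summing to one per cluster:
w_{il} \;\leftarrow\; w_{il} \;-\; \eta\, \frac{\partial J}{\partial w_{il}},
\qquad
w_{il} \;\leftarrow\; \frac{\max(w_{il}, 0)}{\sum_{l'} \max(w_{il'}, 0)} .
```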
Part 1 – FCM-MK Experimental Evaluation
Part 1 – FCM-MK Experimental evaluation, Data set 1
Part 1 – FCM-MK Experimental evaluation, Data set 2
Part 1 – FCM-MK Experimental evaluation, Data set 3
Part 1 – FCM-MK Experimental evaluation, Data set 4
Part 1 – FCM-MK Object Data vs. Relational Data (Figures: distribution of σ for object data and distribution of σ for relational data.)
Part 2 – RFCM-MK Part 2 – Relational Data Clustering • Relational Fuzzy C-Means with Multiple Kernels
Part 2 – RFCM-MK Input, Output • Input: relational (pairwise dissimilarity) data, the number of clusters, and a set of Gaussian kernels at multiple resolutions • Output: fuzzy membership matrix and a resolution-specific weight for each kernel in each cluster
Part 2 – RFCM-MK Kernel-based Similarity • We construct a new kernel-induced similarity defined as • The relational data between two feature points with respect to a cluster can be defined as
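As background (not the slide's missing formula), relational fuzzy c-means in the style of Hathaway and Bezdek computes point-to-cluster distances directly from a relational matrix R; RFCM-MK would presumably use an analogous quantity built from the kernel-induced, cluster-specific relational data:

```latex
% Background sketch: relational FCM distance of object x_j to fuzzy cluster i,
% computed from the relational matrix R and the membership-derived weights v_i.
\mathbf{v}_i \;=\; \frac{\bigl(u_{i1}^{\,m}, \dots, u_{iN}^{\,m}\bigr)^{T}}{\sum_{j=1}^{N} u_{ij}^{\,m}},
\qquad
d_{ij}^{2} \;=\; (R\,\mathbf{v}_i)_j \;-\; \tfrac{1}{2}\,\mathbf{v}_i^{T} R\,\mathbf{v}_i .
% When R holds squared Euclidean distances, d_{ij}^2 equals the squared distance
% from x_j to the i-th fuzzy centroid, without ever forming the centroid explicitly.
```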
Part 2 – RFCM-MK Objective function • Optimization of an “objective function” or “performance index”
Part 2 – RFCM-MK Minimizing objective function (1) • Zeroing the gradient of the objective function with respect to the memberships
Part 2 – RFCM-MK Minimizing objective function (2) • We optimize with respect to the resolution-specific weights using the gradient descent method
Part 2 – RFCM-MK Experimental Evaluation
Part 2 – RFCM-MK Experimental Evaluation, Data set 1
Part 2 – RFCM-MK Experimental Evaluation, Data set 2
Part 2 – RFCM-MK Experimental Evaluation, Data set 3
Conclusions • Find the optimal kernel-induced feature map in a completely unsupervised way • Multiple Kernel Learning; • Resolution-specific weight for each base kernel in each cluster; • Fuzzy C-Means with Multiple Kernels approach • Object data; • Clusters of different densities; • Prototypes in the input space. • Relational Fuzzy C-Means with Multiple Kernels approach • Relational data; • Multiple resolutions within the same cluster or across different clusters; • Clusters of different shapes.
Future Work (1) • Improve the performance of the proposed algorithms by using supervision information • Must-link constraints: The penalty for violating a must-link constraint between distant points should be higher than that between nearby points • Cannot-link constraints: The penalty for violating a cannot-link constraint between two points that are nearby according to the current metric should be higher than for two distant points
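One common way to encode such distance-sensitive penalties, in the style of HMRF-based semi-supervised clustering (illustrative background, not the author's SS-FCM-MK formulation), is to add terms of the following form to the clustering objective, where M is the must-link set, C the cannot-link set, and ℓ_j the cluster label of x_j:

```latex
% Distance-sensitive constraint penalties (HMRF-style sketch; an assumption,
% not taken from the slides):
\sum_{(x_j,\,x_k)\in\mathcal{M}} d^{2}(x_j, x_k)\,\mathbb{1}\!\left[\ell_j \neq \ell_k\right]
\;+\;
\sum_{(x_j,\,x_k)\in\mathcal{C}} \bigl(d_{\max}^{2} - d^{2}(x_j, x_k)\bigr)\,\mathbb{1}\!\left[\ell_j = \ell_k\right].
% A must-link violation costs more when the two points are far apart, and a
% cannot-link violation costs more when they are close together, matching the
% intuition stated on this slide.
```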
Future Work (2) • The objective function of Semi-Supervised Fuzzy C-Means with Multiple Kernels (SS-FCM-MK) is given by
Future Work (3) • The objective function of Semi-Supervised Relational Fuzzy C-Means with Multiple Kernels (SS-RFCM-MK) is given by • Study the performance versus the amount of supervision; • Compare to other semi-supervised algorithms.
Future Work (4) • Automatically identify the optimal number of clusters and reduce the effect of noise and outliers • Competitive Agglomeration: starts with a large number of clusters to reduce the sensitivity to initialization, and determines the actual number of clusters by a process of competitive agglomeration; • Dave's Noise Clustering technique (NC): seeks to separate noisy data by clustering them all into a (c+1)-th conceptual cluster, based on the assumption that the center of such a cluster (called the noise cluster) is equidistant from all noise points in the data set.
Future Work (5) • The nearest neighbor classifier (NNC) is used in many pattern recognition applications where the underlying probability distribution of the data is unknown a priori. • Traditional NNC stores all the known data points as labeled prototypes • computationally prohibitive for very large databases; • limited computer storage; • cost of searching for the nearest neighbors of an input vector. • Apply our algorithms to real applications involving very large and high-dimensional data • Content-Based Image Retrieval (CBIR); • Land mine detection using GPR.