290 likes | 314 Views
This research introduces a Fuzzy k-Modes Algorithm for clustering large categorical data sets efficiently in data mining. The algorithm extends the k-means paradigm to handle categorical data, providing more detailed information for clustering. The study compares the efficiency and performance of the algorithm with conceptual k-means and hard k-modes algorithms using real and artificial data sets.
E N D
國立雲林科技大學National Yunlin University of Science and Technology A Fuzzy k-Modes Algorithm for Clustering Categorical Data Advisor:Dr. Hsu Graduate:Chien-Ming Hsiao Author:Zhexue Huang and Michael K. Ng
Outline • Motivation • Objective • Introduction • Notation • Hard and fuzzy k-means algorithms • Hard and fuzzy k-Modes algorithms • Experimental Results • Conclusions • Personal Opinion
Motivation • Working only on numeric data limits the use of these k-means-type algorithms in data mining. • Most algorithms for clustering categorical data suffer from a common efficiency problem when applied to massive categorical-only data sets.
Objective • To tackle the problem of clustering large categorical data sets in data mining
Introduction • Fuzzy versions of k-means algorithm • Each pattern is allowed to have membership functions to all clusters. • Working only on numeric data limits the use of these k-means-type algorithms in such areas data mining.
Introduction • To cluster categorical data methods • the k-means algorithm [Ralambondrainy, 1995] • hierarchical clustering methods [Gower, 1991] • the PAM algorithm [Kaufman et al, 1990] • the fuzzy-statistical algorithms [Woodbury, 1974] • The conceptual clustering methods [Michalski, 1983]
Notation • The set of objects to be clustered is stored in a database table T defined by a set of attributes A1, A2,…, Am.
Hard and fuzzy k-means algorithms • Let X be a set of n objects described by m numeric attributes.
Hard and fuzzy k-means algorithms • The usual method toward optimization of F is to use partial optimization for Z and W • fix Z and find necessary conditions on W to minimize F • Fix W and minimize F with respect to Z
Hard and fuzzy k-means algorithms • Theorem 1 • Let be fixed and consider Problem (P1)
Hard and fuzzy k-means algorithms • Theorem 2 • Let be fixed and consider Problem (P2)
Hard and fuzzy k-means algorithms • The complexity of the algorithm • O(tkmn) • The space of the algorithm • O(n(m+k) + km)
Hard and fuzzy k-Modes algorithms • Using a simple matching dissimilarity measure for categorical objects • Replacing the means of clusters with the modes • Using a frequency-based method to find the modes
Hard and fuzzy k-Modes algorithms • Let X and Y be two categorical objects • X = • Y = • The simple matching dissimilarity measure between X and Y is defined as follows:
Hard and fuzzy k-Modes algorithms • Using a frequency-based method to update Z • The Hard k-modes Update Method • The Fuzzy k-modes Update Method
Hard and fuzzy k-Modes algorithms • Theorem 3 : The Hard k-modes Update Method • The category of attribute Aj of the cluster mode Zl is determined by the mode of categories of attribute Aj in the set of objects belonging to cluster l • the quantity
Hard and fuzzy k-Modes algorithms • Theorem 4 : The Fuzzy k-modes Update Method • The category of attribute Aj of the cluster mode Zl is given by the category that achieves the maximum of the summation of wli to cluster l over all categories. • the quantity
Hard and fuzzy k-Modes algorithms • Theorem 5
Experimental Results • To evaluate the performance and efficiency of the fuzzy k-modes algorithm • To compare the fuzzy k-modes algorithm with the conceptual k-means algorithm and the hard k-modes algorithm • Use real and artificial data • Soybean disease data set.
Conclusions • Introduced the fuzzy k-modes algorithm for clustering categorical objects based on extensions to the fuzzy k-means algorithm. • The consequence of Theorem 4 that allows the k-means paradigm to be used in generating the fuzzy partition matrix from categorical data
Personal Opinion • The fuzzy partition matrix provides more information to help the user to determine the final clustering and to identify the boundary objects