Unsupervised pattern recognition models for mixed feature-type symbolic data Francisco de A.T. de Carvalho*, Renata M.C.R. de Souza Pattern Recognition Letters, Vol. 31, 2010, pp. 430–443. Presenter: Wei-Shen Tai, 2010/3/10
Outline • Introduction • Dynamic clustering algorithms for mixed feature-type symbolic data • Cluster interpretation • Experimental evaluation • Concluding remarks • Comments
Motivation • Several partitioning dynamic clustering algorithms exist for symbolic data. • None of these earlier dynamic clustering models can handle mixed feature-type symbolic data.
Objective • Dynamic clustering methods for mixed feature-type symbolic data, based on suitable adaptive squared Euclidean distances. • To homogenize the mixed feature-type symbolic data into histogram-valued symbolic data in a preprocessing step.
Partitioning dynamic clustering • Iterative two-step relocation algorithms • Construct clusters and identify a suitable representation, or prototype, for each cluster at each iteration. • Optimize a criterion that measures the fit between the clusters and their prototypes. • Adaptive dynamic clustering algorithm • The distances used to compare objects with prototypes may differ from one cluster to another.
Data homogenization pre-processing • Set-valued and list-valued variables • Ordered list-valued variables • Interval-valued variables
Interval-valued variables • X1 is the interval between the minimum and the maximum of the gross national product (in millions). • The set of elementary intervals is obtained from the bounds of all observed intervals. • Country 1: X1 = [10, 30], with elementary intervals I1 = [10, 25] and I2 = [25, 30] • I1 => l([10, 25] ∩ [10, 30]) / l([10, 30]) = 15/20 = 0.75 • I2 => l([25, 30] ∩ [10, 30]) / l([10, 30]) = 5/20 = 0.25 • Q2 = 0.75 + 0.25 = 1.0 (the weights sum to one)
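A minimal Python sketch of this fraction-of-overlap conversion; the function name and the example data are illustrative, not taken from the paper:

```python
def interval_weights(value, elementary):
    """Distribute an interval-valued observation over the elementary
    intervals: each weight is the fraction of the observation's length
    that falls inside that elementary interval."""
    lo, hi = value
    length = hi - lo
    weights = []
    for a, b in elementary:
        overlap = max(0.0, min(b, hi) - max(a, lo))
        weights.append(overlap / length)
    return weights

# Country 1: X1 = [10, 30]; elementary intervals I1 = [10, 25], I2 = [25, 30]
print(interval_weights((10, 30), [(10, 25), (25, 30)]))  # [0.75, 0.25]
```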
Set-valued and list-valued variables • Set A2 = {A=agriculture, C=chemistry, Co=commerce, E=engineering, En=energy, I=information} • Country 1: X2 = {A, Co} • => weights over {A, C, Co, E, En, I} are (½, 0, ½, 0, 0, 0) • Ordered list A9 = {worst, bad, fair, good, best} • Country 1: X9 = good • => (0, 0, 0, 1, 1)
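A similar sketch for the unordered set-valued case, spreading unit mass uniformly over the observed categories (the helper name is hypothetical; the ordered-list encoding shown above is not reproduced here):

```python
def set_weights(value, categories):
    """Spread unit mass uniformly over the categories present in a
    set-valued observation; absent categories get weight zero."""
    return [1.0 / len(value) if c in value else 0.0 for c in categories]

A2 = ["A", "C", "Co", "E", "En", "I"]
print(set_weights({"A", "Co"}, A2))  # [0.5, 0.0, 0.5, 0.0, 0.0, 0.0]
```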
Squared adaptive Euclidean distances • Single squared adaptive Euclidean distance (global) • One weight vector, shared by all clusters. • Cluster squared adaptive Euclidean distances (local) • A separate weight vector per cluster, differing from one cluster to another.
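In commonly used adaptive-distance notation (assumed here, not copied from the slides), with x_i the homogenized description of object i, g_k the prototype of cluster k, and p coordinates after homogenization, the two variants are:

```latex
% Global: one weight per coordinate, shared by all clusters,
% typically under the constraint \prod_{j=1}^{p} \lambda_j = 1
d(x_i, g_k) = \sum_{j=1}^{p} \lambda_j \, (x_{ij} - g_{kj})^2

% Local: one weight vector per cluster, \prod_{j=1}^{p} \lambda_{kj} = 1
d_k(x_i, g_k) = \sum_{j=1}^{p} \lambda_{kj} \, (x_{ij} - g_{kj})^2
```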
Algorithm schema • Pre-processing step: data homogenization • Initialization step: • Randomly choose a partition, or randomly choose K distinct objects of X as initial prototypes. • Step 1: definition of the best prototypes • Compute the prototype of each cluster. • Step 2: definition of the best vector(s) of weights • Global variant: one weight vector shared by all clusters; local variant: one weight vector per cluster. • Step 3: definition of the best partition • Assign each object to the cluster whose prototype is closest under the adaptive distance. • Stopping criterion • No object changes cluster. A runnable sketch of this loop follows below.
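A compact Python sketch of the local variant of this loop. The weight update with the product-one constraint follows the usual adaptive-distance solution and is an assumption here, as are all names; empty clusters are not handled:

```python
import numpy as np

def adaptive_dynamic_clustering(X, K, n_iter=100, seed=None):
    """Local (cluster-wise) adaptive dynamic clustering sketch.
    X: (n, p) array of homogenized coordinates; K: number of clusters."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(K, size=n)      # random initial partition
    lam = np.ones((K, p))                 # one weight vector per cluster
    for _ in range(n_iter):
        # Step 1: best prototypes = cluster means under squared distances
        G = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2: best weights per cluster, with prod(lam[k]) == 1
        for k in range(K):
            d = ((X[labels == k] - G[k]) ** 2).sum(axis=0) + 1e-12
            lam[k] = np.exp(np.log(d).mean()) / d   # geometric mean of d / d
        # Step 3: best partition = nearest prototype under adaptive distance
        dist = np.stack([((X - G[k]) ** 2 * lam[k]).sum(axis=1)
                         for k in range(K)], axis=1)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # stopping criterion
            break
        labels = new_labels
    return labels, G, lam
```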
Experimental results • Measures of the quality of the results • Overall error rate of classification (OERC) • Corrected Rand index (CR) • Let U = {u1, …, ui, …, uR} and V = {v1, …, vj, …, vC} be two partitions of the same data set with R and C clusters, respectively. The corrected Rand index is given below.
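The formula image did not survive extraction; the corrected Rand index of Hubert and Arabie (1985), which this slide introduces, is, with n_{ij} = |u_i ∩ v_j|, n_{i·} and n_{·j} the row and column sums of the contingency table, and n the total number of objects:

```latex
CR = \frac{\sum_{i,j} \binom{n_{ij}}{2}
           - \binom{n}{2}^{-1} \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2}}
          {\frac{1}{2}\Bigl[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\Bigr]
           - \binom{n}{2}^{-1} \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2}}
```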
Conclusions and remarks • Clustering methods for mixed feature-type symbolic data, based on the dynamic clustering methodology with adaptive distances. • They can recognize clusters of different shapes and sizes. • The algorithm finds the best prototype and the best adaptive distance for each cluster.
Comments • Advantage • The proposed framework provides a solution for clustering mixed feature-type symbolic data. • It also offers an alternative similarity measure between clusters and inputs for categorical data, via the dynamic adaptive distance. • Drawback • If a categorical attribute has a large value set, it can dominate the clustering after the transformation to histograms. • Hierarchical relationships among categorical values are not considered in this method. • Application • Clustering of mixed feature-type data.