This paper introduces a novel and efficient clustering algorithm called WeiSC for clustering categorical data. The algorithm maximizes intra-cluster similarity and minimizes inter-cluster similarity, providing robust and efficient results. Experimental results on real-life datasets demonstrate the algorithm's effectiveness.
A novel clustering algorithm based on weighted support and its application Authors: Xiang-Rong Yang, Jun-Yi Shen, Qiang Liu Graduate: Chien-Ming Hsiao
Outline • Motivation • Objective • Introduction • Description of some Terms • Algorithm and Analysis • Experimental results • Conclusions • Personal opinion
Motivation • Many efficient clustering algorithms have been proposed but most of these works focus on numerical data.
Objective • To present a novel and efficient algorithm WeiSC for clustering categorical data
Introduction • Clustering is an important KDD problem. • Objective: to group data into sets such that • Intra-cluster similarity is maximized • Inter-cluster similarity is minimized • Most existing work focuses on numerical data, whose inherent geometric properties can be exploited naturally to define distance functions between data points.
Introduction • The basic idea of WeiSC • It reads tuples from the dataset one by one • When the first tuple arrives, it forms a cluster by itself • Each subsequent tuple is either put into an existing cluster or, if rejected by all existing clusters, forms a new cluster; the decision uses a given similarity function defined between a tuple and a cluster. • It makes only one scan over the dataset
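The one-scan idea above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the slides do not reproduce the WeiSC similarity function, so `naive_similarity` is a hypothetical stand-in (the average, over attributes, of the fraction of cluster members that share the tuple's value). Choosing the best-matching cluster and treating "above the threshold" as `>=` are also assumptions.

```python
from typing import Callable, List, Sequence

Row = Sequence[str]          # one categorical tuple
Cluster = List[Row]          # a cluster is kept as a plain list of tuples here


def naive_similarity(t: Row, cluster: Cluster) -> float:
    """Hypothetical stand-in for the tuple-to-cluster similarity (NOT WeiSC's)."""
    total = 0.0
    for j, value in enumerate(t):
        matches = sum(1 for member in cluster if member[j] == value)
        total += matches / len(cluster)
    return total / len(t)


def one_pass_cluster(
    data: Sequence[Row],
    sigma: float,
    similarity: Callable[[Row, Cluster], float] = naive_similarity,
) -> List[Cluster]:
    """Read tuples one by one; join the best cluster or start a new one."""
    clusters: List[Cluster] = []
    for t in data:
        best_idx, best_sim = -1, -1.0
        for i, c in enumerate(clusters):
            s = similarity(t, c)
            if s > best_sim:
                best_idx, best_sim = i, s
        # Assumption: a similarity equal to sigma is enough to join.
        if best_idx >= 0 and best_sim >= sigma:
            clusters[best_idx].append(t)
        else:
            clusters.append([t])   # rejected by all clusters (or first tuple)
    return clusters


data = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]
print(len(one_pass_cluster(data, sigma=0.6)))  # 3 clusters with this toy data
```

Because the first tuple finds no existing cluster, it falls through to the `else` branch and forms a cluster by itself, matching the slide. Note that the result depends on the order in which tuples are read.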
Description of some Terms • DEFINITION 1 • DEFINITION 2 • DEFINITION 3
Description of some Terms • DEFINITION 4 • DEFINITION 5
Algorithm and Analysis • Overview • Initially, the first tuple in the database is read and a cluster is constructed from it. • The subsequent tuples are then read one at a time. • The similarity between the new tuple and each existing cluster is computed according to the similarity function defined between a tuple and a cluster • A tuple joins a cluster only if the similarity is above the threshold, denoted σ • The similarity is computed from the clusters' summaries instead of the clusters themselves, since the summaries contain all the information needed
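To illustrate why a summary can replace the cluster itself, here is a minimal Python sketch. The slides do not show what a WeiSC summary contains, so `ClusterSummary` is a hypothetical structure that keeps only the cluster size and per-attribute value counts, and its `similarity` method uses an unweighted support average as a placeholder for the paper's weighted-support function.

```python
from collections import Counter
from typing import List, Sequence


class ClusterSummary:
    """Hypothetical summary: cluster size plus value counts per attribute."""

    def __init__(self, num_attributes: int) -> None:
        self.size = 0
        self.counts: List[Counter] = [Counter() for _ in range(num_attributes)]

    def add(self, t: Sequence[str]) -> None:
        """Absorb a tuple; the tuple itself does not need to be stored."""
        self.size += 1
        for j, value in enumerate(t):
            self.counts[j][value] += 1

    def similarity(self, t: Sequence[str]) -> float:
        """Placeholder similarity: mean support of t's values in the cluster."""
        if self.size == 0:
            return 0.0
        support = sum(self.counts[j][v] / self.size for j, v in enumerate(t))
        return support / len(t)


summary = ClusterSummary(num_attributes=2)
for row in [("a", "x"), ("a", "x"), ("a", "y")]:
    summary.add(row)

print(summary.similarity(("a", "y")))  # (3/3 + 1/3) / 2
print(summary.similarity(("b", "z")))  # 0.0, no value is supported
```

With this layout, scoring a tuple against a cluster costs one lookup per attribute, regardless of how many tuples the cluster has absorbed.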
Computational complexities • The time and space complexities of the WeiSC algorithm depend on • The size of the dataset (|D|) • The number of attributes (m) • The number of clusters (p), a function of the threshold, f(σ) • The size of each cluster, g(σ) • Time complexity: O(|D| * m * f(σ)) • Space complexity: O(|D| + m * f(σ) * g(σ))
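As a rough worked example of the time bound, take the Mushroom dataset used in the experiments (|D| = 8124, m = 21) and assume, purely for illustration, that the chosen σ yields f(σ) = 20 clusters; the slides do not report this number. The dominant term then evaluates to:

$$
|D| \cdot m \cdot f(\sigma) = 8124 \times 21 \times 20 = 3{,}412{,}080
$$

That is on the order of a few million attribute-level comparisons for a single scan, under the stated assumption about the cluster count.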
Experimental results • Evaluate the performance of WeiSC experimentally • Compare its clustering results with ROCK's on the same dataset
Quality of clustering results with real-life datasets • Mushroom dataset (real-life) • Obtained from the UCI Machine Learning Repository • Corresponds to 23 species of gilled mushrooms • Each species is identified as either definitely edible or definitely poisonous • Has 21 attributes and 8124 tuples • The number of edible tuples is 4208 • The number of poisonous tuples is 3916
The effect of σ • The parameter σ • Is the only parameter needed by the WeiSC algorithm • Affects both the clustering results and the speed of the algorithm • The percentage of misclassified tuples can be used to measure its effect • This is possible because each tuple is already labeled "edible" or "poisonous"
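The misclassification measure can be sketched as follows. The slides do not say exactly how a tuple is counted as misclassified, so this Python sketch assumes a common convention: each cluster is assigned its majority label, and every tuple carrying a different label counts as misclassified. The function name `misclassified_percentage` is illustrative.

```python
from collections import Counter
from typing import List, Sequence


def misclassified_percentage(cluster_labels: Sequence[Sequence[str]]) -> float:
    """Percentage of tuples whose label differs from their cluster's majority.

    cluster_labels holds, for each cluster, the known labels of its tuples
    (for example "edible" or "poisonous").
    """
    total = sum(len(labels) for labels in cluster_labels)
    if total == 0:
        return 0.0
    wrong = 0
    for labels in cluster_labels:
        if labels:
            majority_count = Counter(labels).most_common(1)[0][1]
            wrong += len(labels) - majority_count
    return 100.0 * wrong / total


clusters = [["edible", "edible", "poisonous"], ["poisonous", "poisonous"]]
print(misclassified_percentage(clusters))  # 1 of 5 tuples -> 20.0
```

Running the clustering for several values of σ and plotting this percentage is one way to study the trade-off the slide describes. Under this convention the measure tends to fall as the number of clusters grows, so it is best read alongside the cluster count.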
Conclusions • The WeiSC algorithm is robust and efficient • Supported by both analysis and experiments • Reads the dataset only once • Has been used in IDS • Is fast and achieves good efficiency
Personal Opinion • We can compare WeiSC algorithm with our algorithm.