
A novel clustering algorithm based on weighted support and its application

This paper introduces a novel and efficient clustering algorithm called WeiSC for clustering categorical data. The algorithm maximizes intra-cluster similarity and minimizes inter-cluster similarity, providing robust and efficient results. Experimental results on real-life datasets demonstrate the algorithm's effectiveness.


Presentation Transcript


  1. A novel clustering algorithm based on weighted support and its application • Authors: Xiang-Rong Yang, Jun-Yi Shen, Qiang Liu • Graduate: Chien-Ming Hsiao

  2. Outline • Motivation • Objective • Introduction • Description of some Terms • Algorithm and Analysis • Experimental results • Conclusions • Personal opinion

  3. Motivation • Many efficient clustering algorithms have been proposed but most of these works focus on numerical data.

  4. Objective • To present a novel and efficient algorithm WeiSC for clustering categorical data

  5. Introduction • Clustering is an important KDD problem. • Objective: to group data into sets such that • Intra-cluster similarity is maximized • Inter-cluster similarity is minimized • Most existing work focuses on numerical data, whose inherent geometric properties can be exploited naturally to define distance functions between data points.

  6. Introduction • The basic idea of WeiSC • It reads tuples from the dataset one by one • When the first tuple arrives, it forms a cluster on its own • Each subsequent tuple is either put into an existing cluster or, if rejected by all existing clusters, forms a new cluster, according to a similarity function defined between a tuple and a cluster • Only one scan over the dataset is made
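
The one-scan loop described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not reproduce WeiSC's similarity function, so `overlap_similarity` is a hypothetical stand-in, and choosing the best-matching cluster (rather than the first acceptable one) is also an assumption.

```python
from typing import Callable, Hashable, List, Sequence

Row = Sequence[Hashable]


def overlap_similarity(row: Row, cluster: List[Row]) -> float:
    """Stand-in similarity (an assumption, NOT WeiSC's definition):
    the average fraction of attribute values that `row` shares with
    the members of `cluster`."""
    matches = sum(a == b for member in cluster for a, b in zip(row, member))
    return matches / (len(cluster) * len(row))


def one_pass_cluster(
    rows: Sequence[Row],
    sigma: float,
    similarity: Callable[[Row, List[Row]], float] = overlap_similarity,
) -> List[List[Row]]:
    """Cluster `rows` in a single scan using the threshold `sigma`."""
    clusters: List[List[Row]] = []
    for row in rows:  # each tuple is read exactly once
        best, best_sim = None, sigma
        for cluster in clusters:
            sim = similarity(row, cluster)
            if sim > best_sim:  # must be strictly above the threshold
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([row])  # rejected by every cluster: start a new one
        else:
            best.append(row)
    return clusters


rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r"), ("b", "y", "r")]
clusters = one_pass_cluster(rows, sigma=0.5)
```

With this toy input, raising `sigma` makes clusters harder to join, so the same rows split into more clusters.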

  7. Description of some Terms

  8. Description of some Terms • DEFINITION 1 • DEFINITION 2 • DEFINITION 3

  9. Description of some Terms • DEFINITION 4 • DEFINITION 5

  10. Algorithm and Analysis • Overview • Initially, the first tuple in the database is read and a cluster is constructed from it. • The subsequent tuples are then read one at a time. • The similarity between the new tuple and each existing cluster is computed according to the similarity function defined between a tuple and a cluster • A tuple joins a cluster only if the similarity is above the threshold, denoted σ • The similarity is computed from each cluster's summary rather than from the cluster itself, since the summary contains all the information needed
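
To illustrate why a summary can replace the cluster itself, here is a hedged sketch in which each cluster keeps only per-attribute value counts. The `ClusterSummary` class, its fields, and the mean-support similarity are all assumptions made for illustration; the transcript does not preserve the paper's definitions of the summary or of weighted support.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Row = Sequence[Hashable]


class ClusterSummary:
    """Per-attribute value counts for one cluster (hypothetical layout).
    The member tuples themselves are not stored."""

    def __init__(self, num_attributes: int) -> None:
        self.size = 0
        self.counts = [Counter() for _ in range(num_attributes)]

    def add(self, row: Row) -> None:
        """Fold one tuple into the summary in O(m) time."""
        self.size += 1
        for counter, value in zip(self.counts, row):
            counter[value] += 1

    def similarity(self, row: Row) -> float:
        """Stand-in similarity (an assumption, NOT WeiSC's formula):
        mean support of the tuple's values within this cluster."""
        support = sum(c[v] for c, v in zip(self.counts, row))
        return support / (self.size * len(self.counts))


def cluster_with_summaries(
    rows: Sequence[Row], sigma: float
) -> Tuple[List[int], List[ClusterSummary]]:
    """Single scan; returns a cluster index per tuple plus the summaries."""
    summaries: List[ClusterSummary] = []
    labels: List[int] = []
    for row in rows:
        best_index, best_sim = None, sigma
        for index, summary in enumerate(summaries):
            sim = summary.similarity(row)
            if sim > best_sim:
                best_index, best_sim = index, sim
        if best_index is None:  # below the threshold everywhere
            summaries.append(ClusterSummary(len(row)))
            best_index = len(summaries) - 1
        summaries[best_index].add(row)
        labels.append(best_index)
    return labels, summaries


rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "r"), ("b", "y", "r")]
labels, summaries = cluster_with_summaries(rows, sigma=0.5)
```

Because both `add` and `similarity` touch only the counters, the cost of handling a tuple depends on the number of attributes and clusters, not on how many tuples a cluster already holds.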

  11. Computational complexities • The time and space complexities of the WeiSC algorithm depend on • The size of the dataset, |D| • The number of attributes, m • The number of clusters, p = f(σ) • The size of each cluster, g(σ) • Time complexity: O(|D| * m * f(σ)) • Space complexity: O(|D| + m * f(σ) * g(σ))
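
One plausible reading of the time bound, offered as a sketch rather than the paper's own derivation: assume that comparing a tuple with one cluster summary costs O(m), and let p_i (a symbol introduced here) be the number of clusters that exist when the i-th tuple is read, so that p_i never exceeds f(σ).

```latex
T \;=\; \sum_{i=1}^{|D|} p_i \cdot O(m)
  \;\le\; |D| \cdot f(\sigma) \cdot O(m)
  \;=\; O\bigl(|D| \cdot m \cdot f(\sigma)\bigr)
```

For a sense of scale, with |D| = 8124 and m = 21 (the mushroom dataset figures given later in these slides) and a purely hypothetical f(σ) = 20 clusters, the bound is 8124 * 21 * 20 = 3,412,080 attribute comparisons.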

  12. Experimental results • The experimental results on the performance of WeiSC • Compare the clustering result with ROCK’s on the same data set

  13. Quality of clustering results with real-life datasets • Mushroom dataset (real-life) • Obtained from the UCI Machine Learning Repository • Corresponds to 23 species of gilled mushrooms • Each species is identified as either definitely edible or definitely poisonous • Has 21 attributes and 8124 tuples • The number of edible tuples is 4208 • The number of poisonous tuples is 3916

  14. The effect of σ • The parameter σ • Is the only parameter needed by the WeiSC algorithm • Affects both the clustering results and the speed of the algorithm • The percentage of misclassified tuples can be used to measure its effect • This is possible because each tuple is already labeled “edible” or “poisonous”
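
The misclassification measure mentioned here can be sketched as below. The slides do not say how a tuple is judged misclassified, so this assumes a common convention: each cluster takes the label of its majority class, and every tuple carrying a different label counts as misclassified. The function name and arguments are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def misclassified_percentage(
    cluster_ids: Sequence[Hashable], labels: Sequence[Hashable]
) -> float:
    """Percentage of tuples whose label differs from the majority
    label of the cluster they were assigned to (assumed convention)."""
    if not labels or len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must be non-empty and equal in length")
    label_counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        label_counts[cluster_id][label] += 1
    # In each cluster, everything outside the majority label is an error.
    wrong = sum(
        sum(counts.values()) - max(counts.values())
        for counts in label_counts.values()
    )
    return 100.0 * wrong / len(labels)


# Toy example: one "poisonous" tuple lands in a mostly "edible" cluster.
pct = misclassified_percentage(
    [0, 0, 0, 1, 1],
    ["edible", "edible", "poisonous", "poisonous", "poisonous"],
)
```

Running the clustering for several values of σ and computing this percentage each time would give the kind of comparison the slide describes.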

  15. Conclusions • The WeiSC algorithm is robust and efficient • Shown by both analysis and experiments • Reads the dataset only once • Used in intrusion detection systems (IDS) • Is fast and achieves good efficiency

  16. Personal Opinion • We can compare the WeiSC algorithm with our own algorithm.
