Entropy-based Subspace Clustering for Mining Numerical Data




Presentation Transcript


  1. Entropy-based Subspace Clustering for Mining Numerical Data Advisor: Dr. Hsu Graduate: Yu Cheng Chen Author: Chung-hung Cheng, Ada Wai-chee Fu, Yi Zhang ACM 1999

  2. Outline • Motivation • Objective • Introduction • Related Work • Criteria of Subspace Clustering • Entropy-based Method • Entropy vs the Clustering Criteria • Algorithm • Experiments • Conclusions

  3. Motivation • Real-life databases contain many attributes. • Most traditional clustering methods are only shown to handle problem sizes of several hundred to several thousand transactions.

  4. Objective • Propose a method that gives reasonable performance on high-dimensional and large data sets.

  5. Introduction • A good clustering algorithm needs to: • Handle arbitrary cluster shapes • Make no assumptions about the distribution of the data • Not be sensitive to outliers • Not require input parameters • Convey the resulting clusters to the users

  6. Introduction • A solution to the above problem consists of the following steps: (1) Find the subspaces with good clustering. (2) Identify the clusters in the selected subspaces. (3) Present the results to the users.

  7. Related Work • CLIQUE is the only previously published algorithm that can identify clusters embedded in subspaces of a dataset. • It takes two parameters, ξ and τ. • Every dimension is partitioned into ξ intervals of equal length (units). • A unit is dense if the fraction of data points contained in it is greater than τ.
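
The dense-unit test can be made concrete with a short sketch. This is not CLIQUE's actual implementation: the function name find_dense_units, the NumPy/Counter usage, and the assumption that every attribute has been pre-scaled to [0, 1) are choices made here only for illustration.

```python
import numpy as np
from collections import Counter

def find_dense_units(data, dims, xi, tau):
    """Mark the dense units of one subspace.

    data : (n_points, n_attrs) NumPy array, every attribute scaled to [0, 1)
    dims : tuple of attribute indices that define the subspace
    xi   : number of equal-length intervals per dimension
    tau  : density threshold, as a fraction of all data points
    """
    n = len(data)
    # Map each point to the grid cell (unit) it falls into.
    cells = Counter(
        tuple(min(int(data[i, d] * xi), xi - 1) for d in dims)
        for i in range(n)
    )
    # A unit is dense if the fraction of points it contains exceeds tau.
    return {cell for cell, count in cells.items() if count / n > tau}

# Example: a 2-dimensional subspace cut into a 10 x 10 grid.
rng = np.random.default_rng(0)
data = rng.random((1000, 3))
dense = find_dense_units(data, dims=(0, 1), xi=10, tau=0.02)
```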

  8. Related Work • Clusters are unions of connected dense units. • To reduce the search space, CLIQUE uses a bottom-up algorithm. • If a collection of points S is a cluster in a k-dimensional space, then S is also part of a cluster in every (k-1)-dimensional projection of that space.

  9. Related Work • Example

  10. Criteria of Subspace Clustering • High coverage • High density • Correlation of dimensions

  11. Entropy-based Method • Entropy is defined as H(X) = −Σx∈X d(x) log d(x), where d(x) is the density of a cell x, i.e. the percentage of the data contained in x. • A subspace with good clustering concentrates the data in a few cells and therefore has low entropy; a subspace where the data is spread out has high entropy.
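
Under the same grid assumptions as the sketch in the Related Work section, the entropy of a subspace can be computed directly from the cell densities. The helper name subspace_entropy and the choice of log base 2 are assumptions; empty cells are skipped because d(x) log d(x) tends to 0 as d(x) tends to 0.

```python
import numpy as np
from collections import Counter

def subspace_entropy(data, dims, xi):
    """H(X) = -sum over cells x of d(x) * log d(x), with d(x) the fraction
    of points falling in cell x.  Only occupied cells are iterated, since
    empty cells contribute 0.  Log base 2 is a choice made here."""
    n = len(data)
    cells = Counter(
        tuple(min(int(data[i, d] * xi), xi - 1) for d in dims)
        for i in range(n)
    )
    d = np.array([count / n for count in cells.values()])
    return float(-(d * np.log2(d)).sum())
```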

  12. Entropy vs the Clustering Criteria • Entropy and the coverage criterion.

  13. Entropy vs the Clustering Criteria • We want to establish the relationship that, under certain conditions, the entropy decreases as the coverage increases.

  14. Entropy vs the Clustering Criteria • Entropy and the density criterion. • Assume that the density of every dense unit equals α and the density of every non-dense unit equals p.
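
A small numerical experiment can illustrate this relationship. It assumes, beyond what the slide states, that the m − k non-dense units share the leftover mass uniformly; the function name and the concrete values of m, k and α are illustrative only.

```python
import numpy as np

def entropy_dense_nondense(m, k, alpha):
    """Entropy of a subspace with m units, of which k are dense with density
    alpha each; the remaining m - k units share the leftover mass uniformly
    (density p each).  The uniform-leftover assumption is made here only to
    get a concrete example."""
    p = (1 - k * alpha) / (m - k)
    densities = [alpha] * k + [p] * (m - k)
    return -sum(d * np.log2(d) for d in densities if d > 0)

# As the dense units get denser, the entropy drops:
for alpha in (0.02, 0.05, 0.08):
    print(alpha, round(entropy_dense_nondense(m=100, k=10, alpha=alpha), 3))
```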

  15. Entropy vs the Clustering Criteria • Entropy and variable correlation

  16. Algorithm • The overall strategy consists of three main steps, as listed in the Introduction: find the subspaces with good clustering, identify the clusters in the selected subspaces, and present the results to the users.

  17. Algorithm • A subspace whose entropy is below a threshold ω is considered to have good clustering. • We start by finding the 1-dimensional subspaces with good clustering, then use them to generate the candidate 2-dimensional subspaces, and so on (see the sketch below).
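
The level-wise, Apriori-style search described here might look roughly as follows. This is a sketch rather than the paper's pseudocode: entropy_fn is assumed to behave like the subspace_entropy helper above, and the join/prune steps implement the downward-closure pruning discussed on slide 20.

```python
from itertools import combinations

def find_good_subspaces(data, n_dims, xi, omega, entropy_fn):
    """Bottom-up, Apriori-style search for subspaces whose entropy is below
    the threshold omega.  Sketch only; entropy_fn is assumed to behave like
    the subspace_entropy helper above."""
    # Level 1: single dimensions with good clustering.
    current = [(d,) for d in range(n_dims) if entropy_fn(data, (d,), xi) < omega]
    result = list(current)
    k = 1
    while current:
        current_set = set(current)
        # Join step: merge two k-dim subspaces that share their first k-1 dims.
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a, b in combinations(current, 2) if a[:-1] == b[:-1]}
        # Prune step (downward closure): every k-dim projection of a candidate
        # must itself have good clustering, otherwise the candidate is dropped.
        candidates = [c for c in candidates
                      if all(s in current_set for s in combinations(c, k))]
        # Keep the candidates whose entropy is still below omega.
        current = [c for c in candidates if entropy_fn(data, c, xi) < omega]
        result.extend(current)
        k += 1
    return result
```

With the earlier helpers, a call such as find_good_subspaces(data, data.shape[1], xi=10, omega=8.0, entropy_fn=subspace_entropy) would return the low-entropy subspaces; the value 8.0 for ω is arbitrary here and has to be tuned per dataset.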

  18. Algorithm • Dimensions Correlation

  19. Algorithm • We define the term interest as interest(X1, …, Xk) = Σi H(Xi) − H(X1, …, Xk). • The higher the interest, the stronger the correlation among the dimensions.
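
Taking the definition above (interest as the gap between the sum of the one-dimensional entropies and the joint entropy), it can be computed by reusing the entropy helper; the function name and the entropy_fn parameter are assumptions made here.

```python
def interest(data, dims, xi, entropy_fn):
    """interest(X1, ..., Xk) = sum_i H(Xi) - H(X1, ..., Xk).
    Zero when the dimensions are independent; larger values indicate
    stronger correlation.  entropy_fn is assumed to be subspace_entropy."""
    individual = sum(entropy_fn(data, (d,), xi) for d in dims)
    joint = entropy_fn(data, dims, xi)
    return individual - joint
```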

  20. Algorithm • Downward closure is a pruning property: if a subspace does not satisfy the property, we can cross out all its super-spaces. • Upward closure is a constructive property: if a subspace satisfies the property, all its super-spaces also satisfy it.

  21. Algorithm

  22. Experiments • We use data with 10 dimensions and 300,000 transactions in the experiments.

  23. Experiments • Figure 10 & 11

  24. Conclusions • We establish relationships between entropy and the three clustering criteria. • We incorporate a pair of downward and upward closure properties, which is shown to be effective in reducing the search space.

  25. Personal Opinion • …
