250 likes | 321 Views
Entropy-based Subspace Clustering for Mining Numerical Data. Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Chung-hung Cheng, Ada Wai-chee Fu , Yi Zhang. ACM 1999. Outline. Motivation Objective Introduction Related Work Criteria of Subspace clustering Entropy-based Method
E N D
Entropy-based Subspace Clustering for Mining Numerical Data Advisor :Dr. Hsu Graduate: Yu Cheng Chen Author: Chung-hung Cheng, Ada Wai-chee Fu , Yi Zhang ACM 1999
Outline • Motivation • Objective • Introduction • Related Work • Criteria of Subspace clustering • Entropy-based Method • Entropy vs the Clustering Criteria • Algorithm • Experiments • Conclusions
Motivation • Real-life databases contain many attributes. • Most traditional clustering methods are shown to handle problem sizes of several hundreds to several thousands transactions.
Objective • Propose a method that gives reasonable performance on high dimensionality and large data sets.
Introduction • A good clustering algorithm need: • Handle arbitrary shapes for clusters • Do not make assumptions about distribution of data • Not be sensitive to the outliers • Not require input parameters • Convey the resulting clusters to the users
Introduction • A solution to the above problem would consist of the following steps: (1) Find the subspaces with good clustering (2) Identify the clusters in the selected subspaces. (3) present the result to the users
Related Work • CLIQUE is the only published algorithm that satisfied to identify clusters embedded in subspaces of datasets. • Two parameters, ξand τ. • Partition every dimension into ξintervals ofequal length (unit). • A unit is dense if data points contained in it is > τ
Related Work • Clusters are unions of connected dense units. • To reduce the search space we used a bottom-up algorithm. • If a collection of points S is a cluster in k dimensional space, then S is also part of a cluster in (k-1) dimensional projections of space.
Related Work • Example
High coverage. High density Correlation of dimensions Criteria of Subspace Clustering
Entropy-based Method • Entropy is defined as following: • Calculation of Entropy • where d(x) be the density of a cell x in terms of the percentage of data contained in x.
Entropy vs the Clustering Criteria • Entropy and the coverage criterion.
Entropy vs the Clustering Criteria • We want to establish the relationship that, under certain conditions, the entropy decreases as the coverage increases.
Entropy vs the Clustering Criteria • Entropy and the density criterion. • Assume that the density of dense units are all equal to α, the density of non-dense units are all equal to p.
Entropy vs the Clustering Criteria • Entropy and variable correlation
Algorithm • Overall strategy consists of three main steps:
Algorithm • A subspace whose entropy is below w is considered to have good clustering. • We start with finding 1 dimensional subspace with good clustering, then we use them to generate the can candidate 2 dimensional subspaces.
Algorithm • Dimensions Correlation
Algorithm • We define the term interest as below: • The higher the interest, the stronger the correlation
Algorithm • Downward closure is a pruning property. • If a subspace does not satisfy this property, we can cross out all its super-spaces • Upward closure is a constructive property. • If a subspace satisfies the property, all its Super-spaces also satisfy this property.
Experiments • We use data of 10 dimensions and 300,000 transaction in the experiment.
Experiments • Figure 10 & 11
Conclusions • We establish some relationship between entropy and the three criteria. • We incorporates the idea of using a pair of downward and upward closure properties which is shown effective in the reduction of the search space.
Personal Opinion • …