An Efficient Approach to Clustering in Large Multimedia Databases with Noise
Alexander Hinneburg and Daniel A. Keim
Outline • Multimedia data • Density-based clustering • Influence and density functions • Center-defined vs. Arbitrary-shape • Comparison with other algorithms • Algorithm • What can we learn / have we learned?
Multimedia Data • Examples • Images • CAD • Geographic • Molecular biology • High-dimensional feature vectors • Color histograms • Shape descriptors • Fourier vectors
Density-Based Clustering (loose definition) • Clusters are defined by a high density of points • Many points with the same combination of attribute values • Is density irrelevant for other methods? No! • Most methods look for dense areas • DENCLUE uses density directly
Density-Based Clustering (stricter definition) • Closeness to a dense area is the only criterion for cluster membership • DENCLUE has two variants • Arbitrary-shape clusters: similar to other density-based methods • Center-defined clusters: similar to distance-based methods
Idea • Influence function: each data point has an influence that extends over a range • Density function: add up the influence functions of all data points (see the sketch below)
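A minimal sketch of these two definitions, assuming the Gaussian influence function (one of the choices discussed in the paper); all names are illustrative:

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Influence of data point y felt at location x (Gaussian kernel)."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density function: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, np.asarray(y, float), sigma) for y in data)
```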
Definitions • Density attractor x*: a local maximum of the density function • Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (for a continuous and differentiable influence function); see the hill-climbing sketch below
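A sketch of gradient hill-climbing toward a density attractor, reusing gaussian_influence and density from above; the step size delta and the stopping tolerance are illustrative choices, not the paper's exact procedure:

```python
def density_gradient(x, data, sigma):
    """Gradient of the Gaussian density function at x."""
    g = np.zeros_like(x, dtype=float)
    for y in data:
        y = np.asarray(y, float)
        g += gaussian_influence(x, y, sigma) * (y - x) / sigma ** 2
    return g

def find_attractor(x, data, sigma, delta=0.1, max_steps=1000):
    """Climb the gradient from x until the density stops rising;
    the end point approximates the density attractor x*."""
    for _ in range(max_steps):
        g = density_gradient(x, data, sigma)
        norm = np.linalg.norm(g)
        if norm < 1e-9:                      # (numerically) at a local maximum
            break
        x_next = x + delta * g / norm        # fixed-size step uphill
        if density(x_next, data, sigma) <= density(x, data, sigma):
            break                            # the step would overshoot the peak
        x = x_next
    return x
```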
Center-Defined Clusters • All points that are density-attracted to a given density attractor x* • The density function at the maximum must exceed ξ • Points that are attracted to smaller maxima are considered outliers
Arbitrary-Shape Clusters • Merges center-defined clusters if a path exists along which the density function continuously exceeds ξ (see the sketch below)
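One way to test the merge condition, reusing density from above: sample the density along the straight segment between two attractors. The definition allows arbitrary paths, so this is a deliberate simplification; xi stands for the threshold ξ:

```python
def linked(a1, a2, data, sigma, xi, steps=20):
    """Does the density stay above xi along the straight segment
    between the attractors a1 and a2?  (Simplified: the paper's
    definition allows arbitrary paths, not just the segment.)"""
    return all(density(a1 + t * (a2 - a1), data, sigma) >= xi
               for t in np.linspace(0.0, 1.0, steps))
```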
Noise Invariance • The density distribution of the noise is constant • It has no influence on the number and location of the attractors
Claim • The number of density attractors with and without noise is the same • The probability that they are identical goes to 1 as the amount of noise grows
Parameter Choices
Choice of σ: • Try different σ and determine the largest interval with a constant number of clusters (see the sketch below)
Choice of ξ: • Greater than the noise level • Smaller than the smallest relevant maxima
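A rough sketch of the σ heuristic, reusing find_attractor from above; the attractor-merging radius and the "pick the middle of the stable interval" rule are my own illustrative choices, not prescribed by the paper:

```python
def count_clusters(data, sigma, delta=0.1):
    """Number of distinct density attractors found for one sigma."""
    attractors = []
    for p in data:
        a = find_attractor(np.asarray(p, float), data, sigma, delta)
        if not any(np.linalg.norm(a - b) < sigma for b in attractors):
            attractors.append(a)            # a previously unseen attractor
    return len(attractors)

def choose_sigma(data, sigmas):
    """Scan candidate sigmas and return one from the longest run that
    yields a constant number of clusters."""
    counts = [count_clusters(data, s) for s in sigmas]
    best_i, best_len, i = 0, 0, 0
    while i < len(counts):                  # find the longest constant run
        j = i
        while j < len(counts) and counts[j] == counts[i]:
            j += 1
        if j - i > best_len:
            best_i, best_len = i, j - i
        i = j
    return sigmas[best_i + best_len // 2]   # middle of the stable interval
```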
Comparison with DBSCAN
Corresponding setup • Square-wave influence function: its radius σ models the neighborhood ε in DBSCAN • The MinPts threshold in DBSCAN's definition of core objects corresponds to ξ • Density-reachable in DBSCAN becomes density-attracted in DENCLUE (!?)
Comparison with k-means
Corresponding setup • Gaussian influence function • Step size for hill-climbing: δ = σ/2
Claim • In DENCLUE, σ can be chosen such that k clusters are found • The DENCLUE result then corresponds to a global optimum of k-means
Comparison with Hierarchical Methods • Start with a very small σ to get the largest number of clusters • Increasing σ merges clusters • Finally only one density attractor remains
Algorithm • Step 1: Construct a map of the data points • Uses hypercubes with edge length 2σ • Only populated cubes are saved • Step 2: Determine the density attractors of all points using hill-climbing • Keeps track of the paths that have been taken and of points close to them
Local Density Function • The influence of "near" points contributes fully • Far-away points are ignored • For the Gaussian influence function the cut-off is chosen as 4σ (see the sketch below)
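A sketch of the local density function with the 4σ cut-off, reusing gaussian_influence from above:

```python
def local_density(x, data, sigma):
    """Local density: near points contribute in full; points farther
    away than the 4*sigma cut-off are ignored as negligible."""
    near = [np.asarray(y, float) for y in data
            if np.linalg.norm(x - np.asarray(y, float)) <= 4 * sigma]
    return sum(gaussian_influence(x, y, sigma) for y in near)
```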
Step 1: Constructing the map • Hypercubes store • the number of data points • pointers to the data points • the sum of the data values (for the mean) • Populated hypercubes are saved in a B+ tree • Neighboring populated cubes are connected for fast access • Connections are limited to highly populated cubes, as derived from the outlier criterion
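A dictionary-based sketch of the cube map; the paper stores populated cubes in a B+ tree keyed on the cube coordinates, which the Python dict below merely approximates:

```python
from collections import defaultdict

def build_cube_map(data, sigma):
    """Assign every point to the hypercube (edge length 2*sigma) that
    contains it, keeping per populated cube the point count, the points
    themselves, and the coordinate sum (for the mean)."""
    cubes = defaultdict(lambda: {"count": 0, "points": [], "sum": None})
    for p in data:
        p = np.asarray(p, float)
        key = tuple(np.floor(p / (2 * sigma)).astype(int))  # cube coordinates
        c = cubes[key]
        c["count"] += 1
        c["points"].append(p)
        c["sum"] = p.copy() if c["sum"] is None else c["sum"] + p
    return cubes
```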
Step 2: Clustering Step • Uses only highly populated cubes and the cubes connected to them • Hill-climbing based on the local density function and its gradient • Points within σ/2 of a hill-climbing path are attached to the same cluster (see the sketch below)
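A simplified sketch of the clustering step, reusing find_attractor from above; it records only the starting point of each climb (the paper keeps the whole path) and merges attractors within σ of each other, both illustrative choices:

```python
def cluster_points(data, sigma, delta=0.1):
    """Label every point with the attractor its climb reaches; points
    within sigma/2 of an already-recorded climb inherit its label."""
    attractors, path_points, labels = [], [], {}
    for i, p in enumerate(data):
        p = np.asarray(p, float)
        # Shortcut: reuse the label of a nearby, already-computed climb.
        near = next((aid for loc, aid in path_points
                     if np.linalg.norm(p - loc) <= sigma / 2), None)
        if near is not None:
            labels[i] = near
            continue
        a = find_attractor(p, data, sigma, delta)
        aid = next((k for k, b in enumerate(attractors)
                    if np.linalg.norm(a - b) < sigma), None)
        if aid is None:
            attractors.append(a)
            aid = len(attractors) - 1
        labels[i] = aid
        path_points.append((p, aid))   # the paper records the whole path
    return labels, attractors
```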
Time Complexity / Efficiency • Worst case, for N data points: O(N log(N)) • Average case (excluding construction of the data structure?): O(log(N)) • Explanation: only highly populated areas are considered • Up to 45 times faster than DBSCAN
Application to Molecular Biology • Simulation of a small but flexible peptide • Each conformation is a point in a 19-dimensional angle space • The pharmaceutical industry is interested in stable conformations • Non-stable conformations make up more than 50 percent of the data => noise
What can we learn? The algorithm is fast for two reasons • An efficient data structure • Data points that are close in attribute space are stored together • Similar to P-trees: fast access to the data based on attribute values • The optimization problem is inherently linear in the search space • The k-medoids problem is quadratic!
Why is k-medoids quadratic in the search space? Review: • The cost function is calculated as the sum of squared distances within each cluster • I.e., the cost associated with each cluster center depends on all the other cluster centers! • This can be viewed as an influence function that depends on the cluster boundaries (see the sketch below)
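A small sketch of the cost function under review, to make the coupling explicit: the min over medoids means each point's contribution, and hence each medoid's share of the cost, shifts whenever any other medoid moves:

```python
def kmedoids_cost(data, medoids):
    """k-medoids cost: the sum of squared distances from each point to
    its *nearest* medoid.  The min over medoids couples the centers:
    moving one medoid shifts the cluster boundaries of its neighbors."""
    return sum(min(np.linalg.norm(np.asarray(p, float)
                                  - np.asarray(m, float)) ** 2
                   for m in medoids)
               for p in data)
```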
[Figure: the cost functions of k-medoids and DENCLUE, side by side]
Motivating a Gaussian Influence Function • Why not use a parabola as the influence function? • It has only one minimum (the mean of the data set; see the derivation below) • We need a cut-off • The k-medoids cut-off depends on the cluster centers • A cut-off independent of the cluster centers? • The Gaussian function!
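The single-minimum claim in one line: with a parabolic influence f(x, xᵢ) = ‖x − xᵢ‖², setting the gradient of the summed influence to zero gives

∇ Σᵢ ‖x − xᵢ‖² = 2 Σᵢ (x − xᵢ) = 0, hence x = (1/N) Σᵢ xᵢ,

the mean of the data set, so no cluster structure can emerge.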
Is DENCLUE only an Approximation to k-medoids? Not necessarily • Minimizing squared distance is a fundamental measure, but not the only one • Why should "influence" depend on the density of points? • "Influence" may be determined by the system
If DENCLUE is so good, can we still improve it? • It needs a special data structure • The data structures map out all of the space • That is inherent to the density-based idea • A distance-based version could look for cluster centers only • It would allow using a promising starting point • and would define partitions by proximity
Conclusion • The DENCLUE paper contains many fundamentally valuable ideas • The data structure is efficient • The algorithm is related to, but much more efficient than, k-medoids