
An Efficient Approach to Clustering in Large Multimedia Databases with Noise


Presentation Transcript


  1. An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim

  2. Outline • Multimedia data • Density-based clustering • Influence and density functions • Center-defined vs. Arbitrary-shape • Comparison with other algorithms • Algorithm • What can we learn / have we learned?

  3. Multimedia Data • Examples • Images • CAD • Geographic • Molecular biology • High-dimensional feature vectors • Color histograms • Shape descriptors • Fourier vectors

  4. Density-Based Clustering (loose definition) • Clusters defined by a high density of points • Many points with the same combination of attribute values • Is density irrelevant for other methods? No! • Most methods look for dense areas • DENCLUE uses density directly

  5. Density-Based Clustering (stricter definition) • Closeness to a dense area is the only criterion for cluster membership • DENCLUE has two variants: • Arbitrary-shaped clusters • Similar to other density-based methods • Center-defined clusters • Similar to distance-based methods

  6. Idea • Each data point has an influence that extends over a range • Influence function • Add all influence functions • Density function

  7. Influence Functions
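The formulas from this slide are not preserved in the transcript. As a minimal sketch, the two influence functions discussed in the paper (square wave and Gaussian) and the density function (the sum of all influences) can be written roughly as below; the Python function names are illustrative, not from the slides.

```python
import numpy as np

def square_wave_influence(x, y, sigma):
    # Influence of data point y at location x: 1 inside radius sigma, 0 outside.
    return 1.0 if np.linalg.norm(x - y) <= sigma else 0.0

def gaussian_influence(x, y, sigma):
    # Gaussian influence: exp(-dist(x, y)^2 / (2 * sigma^2)).
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma, influence=gaussian_influence):
    # Density function at x: sum of the influences of all data points.
    return sum(influence(x, y, sigma) for y in data)
```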

  8. Definitions • Density Attractor x* • Local maximum of the density function • Density attracted points • Points from which a path to x* exists for which the gradient is continuously positive (case of continuous and differentiable influence function)
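A rough sketch of the hill-climbing that locates the density attractor x* of a point, assuming the Gaussian density function from the sketch above. The σ/2 step size follows slide 15, and the stopping rule ("stop when the density no longer increases") follows slide 20; the names and the iteration cap are assumptions.

```python
def density_gradient(x, data, sigma):
    # Gradient of the Gaussian density function at x.
    g = np.zeros_like(x, dtype=float)
    for y in data:
        g += (y - x) / sigma ** 2 * np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
    return g

def find_attractor(x, data, sigma, max_iter=200):
    # Follow the gradient uphill in steps of sigma / 2 until the density
    # stops increasing; the end point approximates the density attractor x*.
    current = np.asarray(x, dtype=float)
    current_density = density(current, data, sigma)
    for _ in range(max_iter):
        g = density_gradient(current, data, sigma)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break
        candidate = current + (sigma / 2) * g / norm
        candidate_density = density(candidate, data, sigma)
        if candidate_density <= current_density:
            break
        current, current_density = candidate, candidate_density
    return current, current_density
```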

  9. Center-Defined Clusters • All points that are density attracted to a given density attractor x* • Density function at the maximum must exceed ξ • Points that are attracted to smaller maxima are considered outliers

  10. Arbitrary-Shape Clusters • Merges center-defined clusters if a path exists along which the density function continuously exceeds ξ
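One way slides 9 and 10 could translate into code, reusing `density` and `find_attractor` from above: points are grouped by their attractor, attractors whose density stays below ξ produce outliers, and center-defined clusters are merged when the density between their attractors exceeds ξ. Checking only the midpoint of the straight line between two attractors is a simplification of the "path" condition, not the paper's exact procedure.

```python
def center_defined_clusters(data, sigma, xi):
    # Assign every point to its density attractor; points attracted to a
    # maximum below the threshold xi are labelled as outliers (-1).
    attractors, labels = [], []
    for x in data:
        x_star, d = find_attractor(x, data, sigma)
        if d < xi:
            labels.append(-1)
            continue
        for i, a in enumerate(attractors):
            if np.linalg.norm(a - x_star) <= sigma / 2:
                labels.append(i)       # same attractor as an earlier point
                break
        else:
            attractors.append(x_star)  # a new cluster center
            labels.append(len(attractors) - 1)
    return attractors, labels

def merge_arbitrary_shape(attractors, data, sigma, xi):
    # Merge center-defined clusters whose attractors are connected by a
    # region of density >= xi (approximated here by the midpoint only).
    parent = list(range(len(attractors)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(len(attractors)):
        for j in range(i + 1, len(attractors)):
            midpoint = (attractors[i] + attractors[j]) / 2
            if density(midpoint, data, sigma) >= xi:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(attractors))]
```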

  11. Examples

  12. Noise Invariance • Density distribution of the noise is constant • No influence on the number and location of the attractors Claim: • Number of density attractors is the same with or without noise • Probability that they are identical goes to 1 for a large amount of noise

  13. Parameter Choices Choice of σ: • Use different values of σ and determine the largest interval with a constant number of clusters Choice of ξ: • Greater than the noise level • Smaller than the smallest relevant maximum
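Reading the σ heuristic as code, assuming the `center_defined_clusters` sketch above: run the clustering for a range of σ values and pick one from the widest interval over which the number of clusters stays constant. The function name and interface are made up for illustration.

```python
def choose_sigma(data, sigma_candidates, xi):
    # Count clusters for each candidate sigma, then return a sigma from the
    # largest run of candidates that produce the same number of clusters.
    counts = [len(center_defined_clusters(data, s, xi)[0]) for s in sigma_candidates]

    best_start, best_len, start = 0, 1, 0
    for i in range(1, len(counts)):
        if counts[i] != counts[start]:
            start = i
        if i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return sigma_candidates[best_start + best_len // 2]
```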

  14. Comparison with DBSCAN Corresponding setup: • Square-wave influence function whose radius σ models the neighborhood Eps in DBSCAN • Definition of core objects in DBSCAN involves MinPts <=> ξ • Density-reachable in DBSCAN becomes density-attracted in DENCLUE (!?)

  15. Comparison with k-means Corresponding setup: • Gaussian influence function • Step size for hill-climbing equal to σ/2 Claim: • In DENCLUE, σ can be chosen such that k clusters are found • The DENCLUE result corresponds to the global optimum in k-means

  16. Comparison with Hierarchical Methods • Start with a very small σ to get the largest number of clusters • Increasing σ merges clusters • Finally a single density attractor remains

  17. Algorithm • Step 1: Construct a map of the data points • Uses hypercubes with edge length 2σ • Only populated cubes are saved • Step 2: Determine the density attractors of all points using hill-climbing • Keeps track of the paths that have been taken and of the points close to them

  18. Local Density Function • Influence function of “near” points contributes fully • Far-away points are ignored • For the Gaussian influence function: cut-off chosen as 4σ
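A minimal sketch of the local density function for the Gaussian case: only points within the 4σ cut-off mentioned on the slide contribute; the rest are ignored.

```python
def local_density(x, data, sigma, cutoff_factor=4.0):
    # Only "near" points (within cutoff_factor * sigma) contribute to the
    # local density; far-away points are ignored entirely.
    cutoff = cutoff_factor * sigma
    total = 0.0
    for y in data:
        dist = np.linalg.norm(x - y)
        if dist <= cutoff:
            total += np.exp(-dist ** 2 / (2 * sigma ** 2))
    return total
```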

  19. Step 1: Constructing the Map • Hypercubes contain • Number of data points • Pointers to data points • Sum of data values (for the mean) • Save populated hypercubes in a B+ tree • Connect neighboring populated cubes for fast access • Connections limited to highly populated cubes (threshold derived from the outlier criterion)
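A sketch of Step 1 under the assumption that a plain dict keyed by integer cube coordinates stands in for the B+ tree, and that the population threshold (derived from the outlier criterion in the paper) is simply passed in; names are illustrative.

```python
from collections import defaultdict

def build_cube_map(data, sigma):
    # Map every point to the hypercube (edge length 2*sigma) that contains it.
    # Each populated cube stores its point count, the point indices, and the
    # sum of the data values (so the cube mean can be computed later).
    cubes = defaultdict(lambda: {"count": 0, "points": [], "sum": None})
    for idx, x in enumerate(data):
        key = tuple(np.floor(np.asarray(x, dtype=float) / (2 * sigma)).astype(int))
        cube = cubes[key]
        cube["count"] += 1
        cube["points"].append(idx)
        cube["sum"] = np.array(x, dtype=float) if cube["sum"] is None else cube["sum"] + x
    return dict(cubes)

def highly_populated(cubes, min_points):
    # Keep only highly populated cubes; min_points plays the role of the
    # threshold derived from the outlier criterion.
    return {key: c for key, c in cubes.items() if c["count"] >= min_points}
```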

  20. Step 2: Clustering Step • Uses only highly populated cubes and cubes that are connected to them • Hill-climbing based on the local density function and its gradient • Points within σ/2 of each hill-climbing path are attached to clusters as well

  21. Time Complexity / Efficiency • Worst case, for N data points • O(N log(N)) • Average case (without building data structure?) • O(log(N)) • Explanation: Only highly populated areas are considered • Up to 45 times faster than DBSCAN

  22. Application to Molecular Biology • Simulation of a small but flexible peptide • Each conformation is a point in a 19-dimensional angle space • The pharmaceutical industry is interested in stable conformations • Non-stable conformations make up >50 percent => noise

  23. What can we learn? The algorithm is fast for two reasons • Efficient data structure • Data points that are close in attribute space are stored together • Similar to P-trees: fast access to data based on attribute values • Optimization problem is inherently linear in the search space • The k-medoids problem is quadratic!

  24. Why is k-medoids quadratic in the search space? Review: • The cost function is calculated as the sum of squared distances within each cluster • I.e., the cost associated with each cluster center depends on all other cluster centers! • This can be viewed as an influence function that depends on the cluster boundaries
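Written out, the cost function this slide refers to is the sum of squared distances of every point to its assigned medoid; because the assignment C_i is the nearest-medoid partition, moving one medoid changes the cluster boundaries and therefore potentially every term of the sum:

```latex
% k-medoids cost: squared distance of each point to its nearest medoid
E(m_1,\dots,m_k) \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(x, m_i)^2 ,
\qquad
C_i \;=\; \{\, x \;:\; \operatorname{dist}(x, m_i) \le \operatorname{dist}(x, m_j) \ \text{for all } j \,\}.
```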

  25. K-medoids vs. DENCLUE: Cost Functions

  26. Motivating a Gaussian Influence Function • Why not use a parabola as the influence function? • It has only one minimum (the mean of the data set) • We need a cut-off • The k-medoids cut-off depends on the cluster centers • A cut-off independent of the cluster centers? • The Gaussian function!

  27. Is DENCLUE only an Approximation to k-medoids? Not necessarily • Minimizing squared distance is a fundamental measure, but not the only one • Why should “influence” depend on the density of points? • “Influence” may be determined by the system

  28. If DENCLUE is so good, can we still improve it? • It needs a special data structure • The hypercubes map out all of the space • Density-based idea • A distance-based version could look for cluster centers only • Allows using a promising starting point • Define partitions by proximity

  29. Conclusion • The DENCLUE paper contains many fundamentally valuable ideas • Efficient data structure • Algorithm related to, but much more efficient than, k-medoids
