An Efficient Approach to Clustering in Large Multimedia Databases with Noise
Alexander Hinneburg and Daniel A. Keim
Outline • Multimedia data • Density-based clustering • Influence and density functions • Center-defined vs. Arbitrary-shape • Comparison with other algorithms • Algorithm • What can we learn / have we learned?
Multimedia Data • Examples • Images • CAD • Geographic • Molecular biology • High-dimensional feature vectors • Color histograms • Shape descriptors • Fourier vectors
Density-Based Clustering (loose definition) • Clusters are defined by a high density of points • Many points with the same combination of attribute values • Is density irrelevant for other methods? No! • Most methods look for dense areas • DENCLUE uses density directly
Density-Based Clustering (stricter definition) • Closeness to a dense area is the only criterion for cluster membership • DENCLUE has two variants • Arbitrary-shape clusters: similar to other density-based methods • Center-defined clusters: similar to distance-based methods
Idea • Influence function: each data point has an influence that extends over a range • Density function: add up the influence functions of all data points (see the sketch below)
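A minimal sketch of these two definitions, assuming the Gaussian influence function (one of the choices discussed in the paper); all names are illustrative:

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Influence of data point y felt at location x (Gaussian kernel)."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density function: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, np.asarray(y, float), sigma) for y in data)
```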
Definitions • Density attractor x*: a local maximum of the density function • Density-attracted points: points from which a path to x* exists along which the gradient is continuously positive (for a continuous and differentiable influence function); see the hill-climbing sketch below
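A sketch of gradient hill-climbing toward a density attractor, reusing gaussian_influence and density from above; the step size delta and the stopping tolerance are illustrative choices, not the paper's exact procedure:

```python
def density_gradient(x, data, sigma):
    """Gradient of the Gaussian density function at x."""
    g = np.zeros_like(x, dtype=float)
    for y in data:
        y = np.asarray(y, float)
        g += gaussian_influence(x, y, sigma) * (y - x) / sigma ** 2
    return g

def find_attractor(x, data, sigma, delta=0.1, max_steps=1000):
    """Climb the gradient from x until the density stops rising;
    the end point approximates the density attractor x*."""
    for _ in range(max_steps):
        g = density_gradient(x, data, sigma)
        norm = np.linalg.norm(g)
        if norm < 1e-9:                      # (numerically) at a local maximum
            break
        x_next = x + delta * g / norm        # fixed-size step uphill
        if density(x_next, data, sigma) <= density(x, data, sigma):
            break                            # the step would overshoot the peak
        x = x_next
    return x
```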
Center-Defined Clusters • All points that are density-attracted to a given density attractor x* • The density function at the maximum must exceed ξ • Points that are attracted to smaller maxima are considered outliers
Arbitrary-Shape Clusters • Merges center-defined clusters if a path exists along which the density function continuously exceeds ξ (see the sketch below)
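One way to test the merge condition, reusing density from above: sample the density along the straight segment between two attractors. The definition allows arbitrary paths, so this is a deliberate simplification; xi stands for the threshold ξ:

```python
def linked(a1, a2, data, sigma, xi, steps=20):
    """Does the density stay above xi along the straight segment
    between the attractors a1 and a2?  (Simplified: the paper's
    definition allows arbitrary paths, not just the segment.)"""
    return all(density(a1 + t * (a2 - a1), data, sigma) >= xi
               for t in np.linspace(0.0, 1.0, steps))
```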
Noise Invariance • The density distribution of the noise is constant • It has no influence on the number and location of the attractors
Claim • The number of density attractors with and without noise is the same • The probability that they are identical goes to 1 as the amount of noise grows
Parameter Choices
Choice of σ: • Try different σ and determine the largest interval with a constant number of clusters (see the sketch below)
Choice of ξ: • Greater than the noise level • Smaller than the smallest relevant maxima
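A rough sketch of the σ heuristic, reusing find_attractor from above; the attractor-merging radius and the "pick the middle of the stable interval" rule are my own illustrative choices, not prescribed by the paper:

```python
def count_clusters(data, sigma, delta=0.1):
    """Number of distinct density attractors found for one sigma."""
    attractors = []
    for p in data:
        a = find_attractor(np.asarray(p, float), data, sigma, delta)
        if not any(np.linalg.norm(a - b) < sigma for b in attractors):
            attractors.append(a)            # a previously unseen attractor
    return len(attractors)

def choose_sigma(data, sigmas):
    """Scan candidate sigmas and return one from the longest run that
    yields a constant number of clusters."""
    counts = [count_clusters(data, s) for s in sigmas]
    best_i, best_len, i = 0, 0, 0
    while i < len(counts):                  # find the longest constant run
        j = i
        while j < len(counts) and counts[j] == counts[i]:
            j += 1
        if j - i > best_len:
            best_i, best_len = i, j - i
        i = j
    return sigmas[best_i + best_len // 2]   # middle of the stable interval
```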
Comparison with DBSCAN
Corresponding setup • Square-wave influence function: its radius σ models the neighborhood ε in DBSCAN • The MinPts threshold in DBSCAN's definition of core objects corresponds to ξ • Density-reachable in DBSCAN becomes density-attracted in DENCLUE (!?)
Comparison with k-means
Corresponding setup • Gaussian influence function • Step size for hill-climbing: δ = σ/2
Claim • In DENCLUE, σ can be chosen such that k clusters are found • The DENCLUE result then corresponds to a global optimum of k-means
Comparison with Hierarchical Methods • Start with a very small σ to get the largest number of clusters • Increasing σ merges clusters • Finally only one density attractor remains
Algorithm • Step 1: Construct a map of the data points • Uses hypercubes with edge length 2σ • Only populated cubes are saved • Step 2: Determine the density attractors of all points using hill-climbing • Keeps track of the paths that have been taken and of points close to them
Local Density Function • The influence of "near" points contributes fully • Far-away points are ignored • For the Gaussian influence function the cut-off is chosen as 4σ (see the sketch below)
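A sketch of the local density function with the 4σ cut-off, reusing gaussian_influence from above:

```python
def local_density(x, data, sigma):
    """Local density: near points contribute in full; points farther
    away than the 4*sigma cut-off are ignored as negligible."""
    near = [np.asarray(y, float) for y in data
            if np.linalg.norm(x - np.asarray(y, float)) <= 4 * sigma]
    return sum(gaussian_influence(x, y, sigma) for y in near)
```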
Step 1: Constructing the map • Hypercubes store • the number of data points • pointers to the data points • the sum of the data values (for the mean) • Populated hypercubes are saved in a B+ tree • Neighboring populated cubes are connected for fast access • Connections are limited to highly populated cubes, as derived from the outlier criterion
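A dictionary-based sketch of the cube map; the paper stores populated cubes in a B+ tree keyed on the cube coordinates, which the Python dict below merely approximates:

```python
from collections import defaultdict

def build_cube_map(data, sigma):
    """Assign every point to the hypercube (edge length 2*sigma) that
    contains it, keeping per populated cube the point count, the points
    themselves, and the coordinate sum (for the mean)."""
    cubes = defaultdict(lambda: {"count": 0, "points": [], "sum": None})
    for p in data:
        p = np.asarray(p, float)
        key = tuple(np.floor(p / (2 * sigma)).astype(int))  # cube coordinates
        c = cubes[key]
        c["count"] += 1
        c["points"].append(p)
        c["sum"] = p.copy() if c["sum"] is None else c["sum"] + p
    return cubes
```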
Step 2: Clustering Step • Uses only highly populated cubes and the cubes connected to them • Hill-climbing based on the local density function and its gradient • Points within σ/2 of a hill-climbing path are attached to the same cluster (see the sketch below)
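A simplified sketch of the clustering step, reusing find_attractor from above; it records only the starting point of each climb (the paper keeps the whole path) and merges attractors within σ of each other, both illustrative choices:

```python
def cluster_points(data, sigma, delta=0.1):
    """Label every point with the attractor its climb reaches; points
    within sigma/2 of an already-recorded climb inherit its label."""
    attractors, path_points, labels = [], [], {}
    for i, p in enumerate(data):
        p = np.asarray(p, float)
        # Shortcut: reuse the label of a nearby, already-computed climb.
        near = next((aid for loc, aid in path_points
                     if np.linalg.norm(p - loc) <= sigma / 2), None)
        if near is not None:
            labels[i] = near
            continue
        a = find_attractor(p, data, sigma, delta)
        aid = next((k for k, b in enumerate(attractors)
                    if np.linalg.norm(a - b) < sigma), None)
        if aid is None:
            attractors.append(a)
            aid = len(attractors) - 1
        labels[i] = aid
        path_points.append((p, aid))   # the paper records the whole path
    return labels, attractors
```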
Time Complexity / Efficiency • Worst case, for N data points: O(N log(N)) • Average case (excluding construction of the data structure?): O(log(N)) • Explanation: only highly populated areas are considered • Up to 45 times faster than DBSCAN
Application to Molecular Biology • Simulation of a small but flexible peptide • Each conformation is a point in a 19-dimensional angle space • The pharmaceutical industry is interested in stable conformations • Non-stable conformations make up more than 50 percent of the data => noise
What can we learn? The algorithm is fast for two reasons • An efficient data structure • Data points that are close in attribute space are stored together • Similar to P-trees: fast access to the data based on attribute values • The optimization problem is inherently linear in the search space • The k-medoids problem is quadratic!
Why is k-medoids quadratic in the search space? Review: • The cost function is calculated as the sum of squared distances within each cluster • I.e., the cost associated with each cluster center depends on all the other cluster centers! • This can be viewed as an influence function that depends on the cluster boundaries (see the sketch below)
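A small sketch of the cost function under review, to make the coupling explicit: the min over medoids means each point's contribution, and hence each medoid's share of the cost, shifts whenever any other medoid moves:

```python
def kmedoids_cost(data, medoids):
    """k-medoids cost: the sum of squared distances from each point to
    its *nearest* medoid.  The min over medoids couples the centers:
    moving one medoid shifts the cluster boundaries of its neighbors."""
    return sum(min(np.linalg.norm(np.asarray(p, float)
                                  - np.asarray(m, float)) ** 2
                   for m in medoids)
               for p in data)
```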
[Figure: the cost functions of k-medoids and DENCLUE, side by side]
Motivating a Gaussian Influence Function • Why not use a parabola as the influence function? • It has only one minimum (the mean of the data set; see the derivation below) • We need a cut-off • The k-medoids cut-off depends on the cluster centers • A cut-off independent of the cluster centers? • The Gaussian function!
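The single-minimum claim in one line: with a parabolic influence f(x, xᵢ) = ‖x − xᵢ‖², setting the gradient of the summed influence to zero gives

∇ Σᵢ ‖x − xᵢ‖² = 2 Σᵢ (x − xᵢ) = 0, hence x = (1/N) Σᵢ xᵢ,

the mean of the data set, so no cluster structure can emerge.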
Is DENCLUE only an Approximation to k-medoids? Not necessarily • Minimizing squared distance is a fundamental measure, but not the only one • Why should "influence" depend on the density of points? • "Influence" may be determined by the system
If DENCLUE is so good, can we still improve it? • It needs a special data structure • The data structures map out all of the space • That is inherent to the density-based idea • A distance-based version could look for cluster centers only • It would allow using a promising starting point • and would define partitions by proximity
Conclusion • The DENCLUE paper contains many fundamentally valuable ideas • The data structure is efficient • The algorithm is related to, but much more efficient than, k-medoids