DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation Alexander Hinneburg Martin-Luther-University Halle-Wittenberg, Germany Hans-Henning Gabriel 101tec GmbH, Halle, Germany
Overview • Density-based clustering and DENCLUE 1.0 • Hill climbing as EM-algorithm • Identification of local maxima • Applications of general EM-acceleration • Experiments
Density-Based Clustering • Assumption • clusters are regions of high density in the data space • How to estimate density? • parametric models • mixture models • non-parametric models • histogram • kernel density estimation
Kernel Density Estimation • Idea • influence of a data point is modeled by a kernel • density is the normalized sum of all kernels • smoothing parameter h • Gaussian kernel • Density estimate
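The idea on this slide can be illustrated with a minimal sketch of a Gaussian kernel density estimate. The function name `gaussian_kde`, the tuple-based point representation, and the toy data are assumptions made for illustration; the slides do not prescribe an implementation.

```python
import math

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at point x.

    x    : sequence of floats (query point)
    data : list of points with the same dimension as x
    h    : smoothing parameter (bandwidth), h > 0
    """
    d = len(x)
    # Normalization: n points, bandwidth h in d dimensions, Gaussian constant.
    norm = len(data) * (h ** d) * (2.0 * math.pi) ** (d / 2.0)
    total = 0.0
    for x_t in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_t))
        total += math.exp(-sq_dist / (2.0 * h * h))  # influence of x_t on x
    return total / norm

# Toy 1-d data: two nearby points and one far away (hypothetical values).
data = [(0.0,), (0.2,), (5.0,)]
print(gaussian_kde((0.1,), data, h=0.5))  # dense region
print(gaussian_kde((2.5,), data, h=0.5))  # sparse region
```

A larger h smooths the estimate and tends to merge nearby maxima; a smaller h produces more, narrower maxima.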
DENCLUE 1.0 Framework • Clusters are defined by local maxima of the density estimate • find all maxima by gradient hill climbing • Problem • constant step size
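A gradient hill climbing with constant step size can be sketched as follows. This is an illustrative reading of the slide, not the original DENCLUE 1.0 code: the names `kernel_weights` and `fixed_step_climb`, the step size `delta`, and the stopping rule (stop as soon as a step no longer increases the density) are assumptions.

```python
import math

def kernel_weights(x, data, h):
    """Unnormalized Gaussian kernel values of every data point at x."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, x_t)) / (2.0 * h * h))
            for x_t in data]

def fixed_step_climb(x, data, h, delta, max_iter=10000):
    """Move x uphill in steps of constant length delta.

    Returns the last accepted position and the number of accepted steps.
    """
    x = list(x)
    steps = 0
    for _ in range(max_iter):
        w = kernel_weights(x, data, h)
        # Gradient direction of the density (constant factors dropped).
        grad = [sum(w_t * (x_t[j] - x[j]) for w_t, x_t in zip(w, data))
                for j in range(len(x))]
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        candidate = [xj + delta * g / norm for xj, g in zip(x, grad)]
        # Assumed stopping rule: reject a step that does not increase the density.
        if sum(kernel_weights(candidate, data, h)) <= sum(w):
            break
        x = candidate
        steps += 1
    return x, steps

x_end, n_steps = fixed_step_climb((-2.0,), [(0.0,), (1.0,)], h=1.0, delta=0.1)
print(x_end, n_steps)
```

With this scheme the end point is only guaranteed to lie within roughly one step length of a maximum, and the number of steps grows as `delta` shrinks.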
Problem of const. Step Size • Not efficient • many unnecessary small steps • Not effective • does not converge to a local maximum, just comes close • Example
New Hill Climbing Approach • General approach • differentiate the density estimate and set the gradient to zero • no closed-form solution, but the resulting equation can be used as an iteration
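For a Gaussian kernel, the step described on this slide can be written out as below. The notation is an assumption, since the slide's own formulas did not survive extraction: data points $x_1, \dots, x_n \in \mathbb{R}^d$, bandwidth $h$, and iterates $x^{(l)}$.

```latex
\hat{p}(x) = \frac{1}{n h^{d}} \sum_{t=1}^{n} K\!\left(\frac{x - x_t}{h}\right),
\qquad
K(u) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\lVert u \rVert^{2}\right)

\nabla \hat{p}(x) = \frac{1}{n h^{d+2}} \sum_{t=1}^{n}
  K\!\left(\frac{x - x_t}{h}\right) (x_t - x) = 0
\;\Longrightarrow\;
x = \frac{\sum_{t=1}^{n} K\!\left(\frac{x - x_t}{h}\right) x_t}
         {\sum_{t=1}^{n} K\!\left(\frac{x - x_t}{h}\right)}

x^{(l+1)} = \frac{\sum_{t=1}^{n} K\!\left(\frac{x^{(l)} - x_t}{h}\right) x_t}
                 {\sum_{t=1}^{n} K\!\left(\frac{x^{(l)} - x_t}{h}\right)}
```

Because $x$ appears on both sides of the second line, the equation cannot be solved directly; the third line uses it as a fixed-point update instead.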
New DENCLUE 2.0 Hill Climbing • Efficient • automatically adjusted step size at no extra cost • Effective • converges to a local maximum (proof follows) • Example
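A minimal sketch of a hill climbing with automatically adjusted step size, assuming a Gaussian kernel: each new position is the kernel-weighted mean of the data, so the step length follows from the weights rather than from a tuning parameter. The names `gaussian_weights` and `adaptive_climb`, and the stopping rule on the step length (`tol`), are assumptions made for illustration.

```python
import math

def gaussian_weights(x, data, h):
    """Unnormalized Gaussian kernel values of every data point at x."""
    return [math.exp(-math.dist(x, x_t) ** 2 / (2.0 * h * h)) for x_t in data]

def adaptive_climb(x, data, h, tol=1e-9, max_iter=1000):
    """Fixed-point hill climbing: move x to the kernel-weighted mean of the data.

    Returns the end point and the number of iterations performed.
    """
    x = list(x)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        w = gaussian_weights(x, data, h)
        total = sum(w)
        new_x = [sum(w_t * x_t[j] for w_t, x_t in zip(w, data)) / total
                 for j in range(len(x))]
        step = math.dist(new_x, x)  # step size is a by-product of the update
        x = new_x
        if step < tol:              # assumed stopping rule
            break
    return x, iteration

x_end, n_iter = adaptive_climb((-2.0,), [(0.0,), (1.0,)], h=1.0)
print(x_end, n_iter)
```

The kernel values needed for the update are the same ones needed to evaluate the density, which is one way to read "at no extra cost".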
Proof of Convergence • Cast the problem of maximizing the kernel density as maximizing the likelihood of a mixture model • Introduce a hidden variable
Proof of Convergence • Complete likelihood is maximized by the EM algorithm • this also maximizes the original likelihood, which is the kernel density estimate • When starting the EM at a data point, we do the hill climbing for that point • each iteration consists of an E-step and an M-step
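For a Gaussian kernel, the two EM steps can be sketched as follows. The symbols are assumptions, since the slide's formulas are missing: the hidden variable $z \in \{1, \dots, n\}$ names the kernel responsible for $x$, and $\theta_t^{(l)}$ is its posterior at iteration $l$.

```latex
\text{E-step:}\quad
\theta_t^{(l)} = P\!\left(z = t \mid x^{(l)}\right)
  = \frac{K\!\left(\frac{x^{(l)} - x_t}{h}\right)}
         {\sum_{s=1}^{n} K\!\left(\frac{x^{(l)} - x_s}{h}\right)}

\text{M-step:}\quad
x^{(l+1)} = \arg\max_{x} \sum_{t=1}^{n} \theta_t^{(l)}
  \log K\!\left(\frac{x - x_t}{h}\right)
  = \sum_{t=1}^{n} \theta_t^{(l)} x_t
```

Since the posteriors sum to one, the M-step is a weighted mean of the data points, and the standard EM guarantee that the likelihood never decreases carries over to the density at $x^{(l)}$.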
Identification of local Maxima • EM algorithm iterates until a stopping criterion is met • reached end point • sum of k last step sizes • Assumption • true local maximum lies in a ball around the end point • Points whose end points are close enough belong to the same maximum M • In case of non-unique assignment, do a few extra EM iterations
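The assignment of end points to maxima can be sketched as a grouping step. The exact distance criterion is not recoverable from the slide, so the `merge_radius` parameter below is a hypothetical stand-in, and the union-find grouping is one possible way to make "belongs to the same maximum" transitive.

```python
import math

def group_end_points(end_points, merge_radius):
    """Give the same label to end points that lie closer than merge_radius.

    merge_radius is a hypothetical threshold; the slides tie the real
    criterion to the step sizes of the hill climbing.
    """
    n = len(end_points)
    parent = list(range(n))

    def find(i):
        # Find the representative of i, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(end_points[i], end_points[j]) < merge_radius:
                parent[find(j)] = find(i)

    # Relabel representatives as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

labels = group_end_points([(0.0,), (0.01,), (5.0,), (5.02,)], merge_radius=0.1)
print(labels)  # [0, 0, 1, 1]
```

The pairwise loop is quadratic in the number of end points, which is acceptable for a sketch but would call for a spatial index on large data.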
Acceleration • Sparse EM • update only the p% of points with the largest posterior • saves (100-p)% of kernel computations after the first iteration • Data Reduction • use only p% of the data as representative points • random sampling • k-means
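One simple reading of the sparse EM idea is sketched below: the first iteration evaluates all kernels, keeps the fraction `p` of points with the largest kernel weight (equivalently, largest posterior), and all later iterations touch only those points. The function name `sparse_climb`, the parameter `p` given as a fraction in (0, 1], and the step-length stopping rule are assumptions; the authors' implementation may differ in detail.

```python
import math

def sparse_climb(x, data, h, p, tol=1e-9, max_iter=1000):
    """Hill climbing that restricts updates to the top fraction p of points.

    Returns the end point and the indices of the points that were kept.
    """
    x = list(x)
    active = list(range(len(data)))  # the first iteration looks at every point
    for iteration in range(max_iter):
        weighted = [(t, math.exp(-math.dist(x, data[t]) ** 2 / (2.0 * h * h)))
                    for t in active]
        if iteration == 0:
            # Keep only the points with the largest weights from now on.
            keep = max(1, math.ceil(p * len(data)))
            weighted.sort(key=lambda pair: pair[1], reverse=True)
            weighted = weighted[:keep]
            active = [t for t, _ in weighted]
        total = sum(w for _, w in weighted)
        new_x = [sum(w * data[t][j] for t, w in weighted) / total
                 for j in range(len(x))]
        step = math.dist(new_x, x)
        x = new_x
        if step < tol:
            break
    return x, active

data = [(0.0,), (0.2,), (10.0,), (10.2,)]
x_end, kept = sparse_climb((0.5,), data, h=0.5, p=0.5)
print(x_end, sorted(kept))
```

Data reduction by random sampling needs no special code: drawing the representatives with `random.sample(data, k)` and running the hill climbing on that subset is one straightforward option.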
Experiments • Comparison of DENCLUE 1.0 (FS) vs. 2.0 (SSA) • 16-dim. artificial data • both methods are tuned to find the correct clustering
Experiments • Comparison of acceleration methods
Experiments • Clustering quality (normalized mutual information, NMI) vs. sample size (RS)
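For readers unfamiliar with the quality measure, the following sketch computes normalized mutual information between two labelings. Several normalizations are in use and the slides do not say which one was applied; dividing by the geometric mean of the two entropies is an assumption made here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same points.

    Assumed normalization: geometric mean of the two entropies.
    Returns 0.0 when either labeling has a single cluster.
    """
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mi = sum((c / n) * math.log(c * n / (count_a[u] * count_b[v]))
             for (u, v), c in joint.items())
    denom = math.sqrt(entropy(labels_a) * entropy(labels_b))
    return mi / denom if denom > 0 else 0.0

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels renamed
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

The measure is invariant to renaming clusters, which is why it suits comparisons between a clustering and a reference partition.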
Experiments • Cluster quality (NMI) of DENCLUE 2.0 (SSA), acceleration methods, and k-means on real data • sample sizes 0.8, 0.4, 0.2
Conclusion • New hill climbing for DENCLUE • Automatic step size adjustment • Convergence proof by reduction to EM • Allows the application of general EM accelerations • Future work • automatic setting of smoothing parameter h (so far tuned manually)