This research paper presents a novel clustering algorithm based on information theoretic measures, specifically Renyi's entropy, to overcome the difficulty of evaluating metrics without imposing unrealistic assumptions about data distribution. The paper discusses the motivation, objective, and introduces the divergence measures used in the algorithm. It also presents the clustering evaluation function and optimization methods used. Experimental results and conclusions are provided, along with a personal opinion and review of the research.
國立雲林科技大學 National Yunlin University of Science and Technology
Information Theoretic Clustering
• Advisor: Dr. Hsu
• Graduate student: Ching-Lung Chen
• Author: Pabitra Mitra, Student Member, IEEE
• IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 2, February 2002
Outline
• Motivation
• Objective
• Introduction
• Divergence Measures
• The Clustering Evaluation Function
• Optimization
• Experimental Results
• Conclusions
• Personal Opinion
• Review
Motivation
• The major problem of clustering based on information theoretic measures has been the difficulty of evaluating the metric without imposing unrealistic assumptions about the data distribution.
• The kNN algorithm also suffers from a local-minima problem.
Objective
• To develop a novel clustering algorithm based on a sample-by-sample estimator of Renyi's entropy that will avoid this shortcoming.
Introduction 1/2
• There are two basic approaches to clustering:
• Parametric
  • Assume a predefined distribution for the data set and calculate the sufficient statistics (mean, covariance).
  • Use a mixture of distributions to describe the data.
• Nonparametric
  • Use a criterion function and seek the grouping that maximizes the criterion.
  • Need a cost function to evaluate how well the clustering fits the data, and an algorithm to minimize the cost function.
Introduction 2/2
• The majority of clustering metrics are based on a minimum variance criterion (e.g., merging and splitting, neighborhood dependent methods, hierarchical methods, ART).
• Valley seeking clustering is a different concept: it exploits not the regions of high sample density but the regions with less data.
• It attempts to divide the data in a way similar to supervised classifiers, that is, by positioning discriminant functions in data space.
Divergence Measures 1/3
• The clustering problem has been formulated as a distance between two distributions, but most of the proposed measures are limited to second order statistics (i.e., covariance).
• Cross-entropy is also called directed divergence since it is not symmetrical: if D(p,q) is not symmetric, a symmetric version can be formed as D'(p,q) = D(p,q) + D(q,p).
• Under certain conditions, the minimization of directed divergence is equivalent to the maximization of the entropy.
Divergence Measures 2/3
• The Kullback-Leibler (K-L) cross-entropy, where f(x) and g(x) are two probability density functions of the random variable x (formula below).
• The Bhattacharya divergence measure DB(f,g) (formula below).
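The equations on this slide were images; the standard definitions of these two measures are:

$$D_{KL}(f\,\|\,g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx$$

$$D_B(f,g) = -\log\int \sqrt{f(x)\,g(x)}\,dx$$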
Divergence Measures 3/3
• The Chernoff distance, or generalized Bhattacharya distance, is a nonsymmetric measure (formula below).
• Renyi's divergence measure (formula below).
• The Bhattacharya distance corresponds to s = 1/2 in (3), and the generalized Bhattacharya distance corresponds to a = 1 - s.
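These equations were also images; their standard forms, with (3) denoting the Chernoff distance as referenced above, are:

$$D_C(f,g) = -\log\int f^{s}(x)\,g^{1-s}(x)\,dx,\qquad 0 < s < 1 \qquad (3)$$

$$D_a(f\,\|\,g) = \frac{1}{a-1}\,\log\int f^{a}(x)\,g^{1-a}(x)\,dx$$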
The Clustering Evaluation Function 1/5
• Minimizing an incremental entropy cost is used as an evaluation function for clustering.
• Exploit the boundaries among the data clusters, as in valley seeking algorithms.
• Samples should be clustered when there are natural boundaries between them, which can be measured as the divergence or cross-entropy between the clustered subsets of the data.
The Clustering Evaluation Function 2/5
• The most utilized divergence measure is the Kullback-Leibler divergence.
• However, the problem is how to estimate the K-L divergence directly from data in a nonparametric fashion, as required by pattern recognition and machine learning applications.
• Alfred Renyi proposed in the 1960s a new information measure, which became known as Renyi's entropy and provided the starting point for an easier nonparametric estimator of entropy.
The Clustering Evaluation Function 3/5
• In order to use (5) in the calculations, we need a way to estimate the probability density function.
• We use the Parzen window method for simplicity and take full advantage of the properties of the multidimensional Gaussian function. The kernel used here is the multidimensional Gaussian G(x, Σ), where Σ is the covariance matrix (given below).
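From the context of the next slide, (5) is Renyi's entropy of order a; the equation images are missing from the slide, so the standard forms of (5) and of the Gaussian kernel are reproduced here for reference:

$$H_a(X) = \frac{1}{1-a}\,\log\int f^{a}(x)\,dx \qquad (5)$$

$$G(x,\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,\exp\!\left(-\tfrac{1}{2}\,x^{T}\Sigma^{-1}x\right)$$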
The Clustering Evaluation Function 4/5
• We will assume a spherical covariance, Σ = σ²I. For a data set of N samples the probability density function can be estimated as in (7) below.
• Substituting (7) into (5) with a = 2, we obtain (8).
• We call (8) Renyi's quadratic entropy.
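A reconstruction of (7) and (8) consistent with the Parzen window estimator and Renyi's entropy of order 2 (the slide's equation images are missing):

$$\hat f(x) = \frac{1}{N}\sum_{i=1}^{N} G(x - x_i,\ \sigma^{2}I) \qquad (7)$$

$$H_2(X) = -\log\int\left(\frac{1}{N}\sum_{i=1}^{N} G(x - x_i,\ \sigma^{2}I)\right)^{2} dx \qquad (8)$$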
The Clustering Evaluation Function 5/5
• The reason for using a = 2 is that the integral in (8) can be calculated exactly, directly from the samples (i.e., nonparametrically), as shown below.
• We call this quantity the information potential.
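Following the information-theoretic learning literature, the convolution of two Gaussians is again a Gaussian, $\int G(x - x_i, \sigma^{2}I)\,G(x - x_j, \sigma^{2}I)\,dx = G(x_i - x_j, 2\sigma^{2}I)$, so the quadratic entropy can be evaluated without any explicit integral:

$$H_2(X) = -\log\Bigg(\underbrace{\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} G(x_i - x_j,\ 2\sigma^{2}I)}_{V(X):\ \text{information potential}}\Bigg)$$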
Derivation
• To evaluate the information potential between two subgroups, use the clustering evaluation function (CEF), shown below.
• To be more explicit, the CEF is rewritten with a membership function M: M = 1 if the two samples are from different distributions, and M = 0 if they are from the same distribution.
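A hedged reconstruction of the CEF (the original equation was an image; the normalization constant is an assumption). For two clusters p and q of sizes N1 and N2,

$$CEF(p,q) = \frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} G\!\left(x_i^{(p)} - x_j^{(q)},\ 2\sigma^{2}I\right),$$

or equivalently, up to normalization, a sum over all sample pairs in which only pairs with M(i,j) = 1 (different clusters) contribute:

$$CEF \propto \sum_{i=1}^{N}\sum_{j=1}^{N} M(i,j)\,G\!\left(x_i - x_j,\ 2\sigma^{2}I\right).$$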
CEF as a Nonlinear Weighted Distance 1/2
• One conventional way of measuring the distance between two clusters is the average distance between their samples (formula below).
• This measure works well when the clusters are well separated and compact, but it fails when the clusters are close to each other, producing nonlinear boundaries.
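The average-distance formula referenced above (the slide's equation was an image) has the usual form

$$D_{avg}(p,q) = \frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\left\|x_i^{(p)} - x_j^{(q)}\right\|.$$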
CEF as a Nonlinear Weighted Distance 2/2
• When we replace the Euclidean distance with the kernel function, the average distance becomes the expression below, which is exactly the CEF function (11).
• The CEF has an additional parameter, σ, which controls the variance of the kernel.
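That is, with the Gaussian kernel in place of the Euclidean distance (again a reconstruction, since the equation was an image):

$$\frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} G\!\left(x_i^{(p)} - x_j^{(q)},\ 2\sigma^{2}I\right),$$

which matches the CEF reconstructed after the Derivation slide, up to the assumed normalization constant.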
CEF as a Distance 1/3
• We expect that the CEF measures some type of distance between the subgroupings. Define a normalized distance measure DCEFnorm(p,q) (the formula was omitted on the slide; see below).
• Certain properties must be verified for DCEFnorm(p,q) to be a distance:
  • DCEFnorm(p,q) = 0 when p(x) = q(x).
  • DCEFnorm(p,q) is symmetric.
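The slide's definition of DCEFnorm was an image. Purely as an illustration of a normalization with the stated properties, and not necessarily the paper's exact definition, a Cauchy-Schwarz-type divergence has this behavior:

$$D_{CS}(p,q) = -\log\frac{\left(\int p(x)\,q(x)\,dx\right)^{2}}{\int p(x)^{2}\,dx\ \int q(x)^{2}\,dx},$$

which is symmetric, nonnegative, and equals zero exactly when p(x) = q(x), by the Cauchy-Schwarz inequality.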
CEF as a Distance 2/3
• In optimization, the fact that the minimum distance is not zero is irrelevant, so we can write a pseudo-distance using the CEF alone.
• We are therefore optimizing DCEFnorm(p,q) when we find the extremes of CEF(p,q).
CEF as a Distance 3/3
Multiple Clusters
• To measure the divergence between different clusters, the measure should include the divergence from one cluster to all the others.
• M(i,j) = 0 if both samples are in the same cluster; M(i,j) = 1 if the samples are in different clusters.
• The function is calculated between points of different clusters, not between points inside the same cluster (see the sketch below).
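As an illustration only, a minimal Python sketch of a multi-cluster CEF computed from a label vector. The kernel variance 2σ² and the size-based normalization follow the reconstruction above and are assumptions, not the paper's exact constants.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Pairwise values of the spherical Gaussian kernel G(x_i - x_j, 2*sigma^2 I)."""
    d = X.shape[1]
    var = 2.0 * sigma ** 2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-sq_dists / (2.0 * var))

def cef(X, labels, sigma=1.0):
    """Multi-cluster CEF: size-normalized cross-cluster kernel sums over all
    cluster pairs; only pairs with different labels (M(i, j) = 1) contribute."""
    K = gaussian_kernel_matrix(X, sigma)
    clusters = np.unique(labels)
    total = 0.0
    for a in clusters:
        for b in clusters:
            if a < b:
                ia = np.flatnonzero(labels == a)
                ib = np.flatnonzero(labels == b)
                total += K[np.ix_(ia, ib)].sum() / (len(ia) * len(ib))
    return total

# Toy check: two well-separated blobs; the blob-aligned labeling yields a much
# smaller CEF than a random labeling of the same points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
blob_labels = np.repeat([0, 1], 20)
print(cef(X, blob_labels), cef(X, rng.integers(0, 2, 40)))
```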
Optimization-Grouping Algorithm
Optimization Algorithm
• The grouping and optimization steps were presented graphically on the slides; a schematic sketch is given below.
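Purely as an illustration of the idea, and not the authors' grouping plus K-change procedure, the following Python sketch starts from random labels and greedily accepts single-sample label changes that lower the reconstructed CEF, refusing moves that would empty a cluster. The size-based normalization is an assumption chosen so that the greedy search does not trivially collapse everything into one cluster.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Pairwise values of G(x_i - x_j, 2*sigma^2 I) for all samples."""
    d = X.shape[1]
    var = 2.0 * sigma ** 2
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-sq / (2.0 * var))

def cef_from_kernel(K, labels, n_clusters):
    """Reconstructed CEF: size-normalized cross-cluster kernel sums."""
    total = 0.0
    for a in range(n_clusters):
        for b in range(a + 1, n_clusters):
            ia = np.flatnonzero(labels == a)
            ib = np.flatnonzero(labels == b)
            if len(ia) and len(ib):
                total += K[np.ix_(ia, ib)].sum() / (len(ia) * len(ib))
    return total

def greedy_cef_clustering(X, n_clusters=2, sigma=1.0, n_sweeps=20, seed=0):
    """Greedy stand-in for the paper's optimization: sweep over samples and
    keep any single label change that decreases the CEF."""
    rng = np.random.default_rng(seed)
    K = gaussian_kernel_matrix(X, sigma)
    labels = rng.integers(0, n_clusters, X.shape[0])
    best = cef_from_kernel(K, labels, n_clusters)
    for _ in range(n_sweeps):
        improved = False
        for i in range(X.shape[0]):
            for c in range(n_clusters):
                old = labels[i]
                if c == old or np.sum(labels == old) == 1:
                    continue                      # skip no-ops and cluster-emptying moves
                labels[i] = c
                score = cef_from_kernel(K, labels, n_clusters)
                if score < best:
                    best, improved = score, True   # keep the improving move
                else:
                    labels[i] = old                # revert otherwise
        if not improved:
            break
    return labels, best
```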
Experimental Results
• Testing on Synthetic Data
Experimental Results
• Comparison with k-means, the EM algorithm, and supervised classification (result figures).
Experimental Results
• Testing on MRI Data
Conclusions
• The goal of this research was to develop a better clustering algorithm, because second order statistics are not sufficient to distinguish nonlinearly separable clusters.
• The cost function is developed using Renyi's quadratic entropy to evaluate a distance between probability density functions.
• The CEF algorithm can be improved further: it is possible to adapt the kernel shape and size dynamically according to the data.
• Running the algorithm on large data sets is still very time consuming.
Personal Opinion
• This paper is another application of information theoretic measures to clustering; from this method we can learn how to calculate the distance between two clusters.
Review
• Calculate the CEF (information potential).
• Use K-change to cluster the data, grouped with the valley seeking clustering algorithm.