Density Traversal Clustering and Generative Kernels

A generative framework for spectral clustering
Amos Storkey, Tom G Griffiths, University of Edinburgh
Amos Storkey, School of Informatics

Attribute Generalisation

Prior work
• Tishby and Slonim
• Meila and Shi
• Coifman et al.
• Nadler et al.

Example: Transition Matrix

Example: 20 Iterations

Argument
• A priori dependence on data.
• No generative model.
• Inconsistent with underlying density.
• Clusters are spatial characteristics that are properties of distributions.
• Clusters are properties of data sets only inasmuch as they inherit the property from the underlying distribution from which the data was generated.

But we do know
• The diffusion asymptotics are known, but the probabilistic formalism is inconsistent with the data density:
• With a finite time step, the equilibrium distribution in the infinite-data limit does not match the data distribution.
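The mismatch can be seen already on a finite sample. The following is a minimal sketch, assuming the usual row-normalised random walk on a Gaussian affinity matrix; the toy data, the kernel width and all variable names are illustrative and not taken from the slides. For a symmetric affinity the stationary distribution of this walk is proportional to the point degrees, whereas the empirical data distribution puts mass 1/n on every point.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: one dense and one sparse cluster in 1-D.
x = np.concatenate([rng.normal(0.0, 0.3, 40), rng.normal(3.0, 1.0, 10)])
n = x.size

# Usual Gaussian affinity (kernel width 0.5 is an arbitrary choice).
W = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.5 ** 2)
deg = W.sum(axis=1)

# Standard random walk: row-normalised transition matrix.
P = W / deg[:, None]

# For symmetric W the stationary distribution is proportional to the degrees.
pi = deg / deg.sum()

# The empirical data distribution is uniform over the sample points.
uniform = np.full(n, 1.0 / n)
print("max |pi - 1/n| =", np.abs(pi - uniform).max())
```

Points in the dense cluster receive more equilibrium mass than points in the sparse one, so the walk over-weights high-density regions relative to the data.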

Density Traversal Clustering
• Define a discrete-time, continuous-space, diffusing Markov chain.
• The definition depends on some latent distribution.
• Call this the Traversal Distribution.

The Markov chain
• Transition with probability
• $D(y, x)$ is Gaussian centred at $x$; $P^*$ is the Traversal Distribution.
• Here $S$ is given by the solution of

Generative procedure

Problems
• Random walk in continuous space.
• Each step involves many intractable integrals.
• Real Bayesians would...
• Finding good prior distributions over distributions is a hard problem, but a prior over traversal distributions is needed.

CHEAT
• Doing all the integrals is not possible, but...
• All integrals are with respect to the traversal distribution.
• Use the empirical data as a proxy.
• All the integrals now become sample estimates: sums over the data points.
• Everything is computable in the space of data points.
• WORKS! We never need to evaluate the probability at a point, only integrals over regions.
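The replacement of integrals by sums over data points can be sketched as follows. This is only an illustration under stated assumptions: the traversal distribution is taken to be a standard normal so that the exact integral is known, and the helper `gaussian`, the width `h` and the query point `x` are hypothetical names chosen for the example.

```python
import numpy as np

def gaussian(y, x, h):
    """Gaussian density in y, centred at x, with standard deviation h."""
    return np.exp(-0.5 * ((y - x) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

rng = np.random.default_rng(1)
h = 0.5                                  # width of the diffusion kernel (assumed)
data = rng.normal(0.0, 1.0, 20000)       # sample from the stand-in traversal distribution N(0, 1)

x = 0.3
# Integral of D(y, x) P*(y) dy, replaced by an average over the data points.
estimate = gaussian(data, x, h).mean()

# For this Gaussian stand-in the integral has a closed form: N(x; 0, 1 + h^2).
exact = gaussian(x, 0.0, np.sqrt(1.0 + h ** 2))
print(estimate, exact)
```

The density of the traversal distribution is never evaluated: only the kernel is evaluated at the sampled points.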

We get…
• Scaled likelihood: $P(x_i \mid \text{centre } x_j) / P(x_i) = n\,(A^D)_{ij}$
• $A = W S^{-1}$
• $W$ is the usual affinity.
• $S^{-1}$ is an extra consistency term.
• More generally, we have the out-of-sample scaled likelihood:
• $P(x \mid \text{centre } y) / P(x) = n\, a(x)^T A^{D-2}\, b(y)$, where $a(x)$ and $b(x)$ are the traversal probabilities to and from $x$.
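A minimal sketch of the in-sample computation follows. The equation that defines S is not reproduced in the slide text, so this example substitutes an assumption: S is taken to be the diagonal matrix of column sums of W, which makes A column-stochastic. The function name `scaled_likelihoods` and the parameters `width` and `n_steps` are hypothetical, and the toy data is invented for illustration.

```python
import numpy as np

def scaled_likelihoods(X, width, n_steps):
    """Return n * A^n_steps with A = W S^{-1}.

    ASSUMPTION: S is the diagonal of column sums of W (a stand-in,
    not the consistency term defined in the slides).
    """
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-0.5 * sq_dist / width ** 2)    # usual Gaussian affinity
    S = W.sum(axis=0)                          # assumed normaliser
    A = W / S[None, :]                         # A = W S^{-1}, columns sum to 1
    n = X.shape[0]
    return n * np.linalg.matrix_power(A, n_steps)

rng = np.random.default_rng(2)
# Two well-separated toy clusters of 15 points each.
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(4.0, 0.3, (15, 2))])
L = scaled_likelihoods(X, width=0.5, n_steps=20)

within = L[:15, :15].mean()    # point and centre in the same cluster
across = L[:15, 15:].mean()    # point and centre in different clusters
print(within, across)
```

Entry (i, j) plays the role of the scaled likelihood of point i given centre j, so each column sums to n. For well-separated clusters the within-cluster entries dominate the across-cluster ones.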

Example: Scaled likelihoods

Initial distribution
• Can consider other initial distributions.
• Specifically, can consider delta functions at mixture centres.
• Variational Bayesian mixture models…

Demo

Number of clusters
• Scaled likelihoods for a three-cluster problem.

Number of clusters
• Scaled likelihoods for a five-cluster problem.

Cluster allocations

Conclusion
• A priori formulation of spectral clustering.
• Can be used like any other spectral procedure.
• But it also provides scaled likelihoods, so it can be combined with Bayesian procedures.
• Variational Bayesian formalism.
• Small-sample approximation issues.
• Better to have a flexible density estimator.

Generative Kernels
• Related to Seeger: Covariance Kernels from Bayesian Generative Models.
• Gaussian process over X space.
• Density, and corresponding traversal process.
• Data is obtained by diffusing in X space using the traversal process, and then applying local averaging and additive noise.

Generative Kernels
• Covariance $K_{ij}$ is
• Again use sample estimates.
• Presume the measured target is a local average.
• Just the standard basis function derivation of a GP.

Motivation
• Generative model generates clustered data positions.
• Targets diffuse using the traversal process.
• Target values are subject to a local averaging influence:
• Diffused objects locally influence one another’s target values, so everyone becomes like their neighbours.
• E.g. accents.
• Can add local measurement noise.

Kernel Clustering
• Use sample estimates again to get the kernel.
• Can also incorporate a prior over the number of iterations and integrate it out.
• For example, can use the matrix exponential $\exp(A)$ instead of $A^D$.
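One prior for which integrating out the number of iterations gives a matrix exponential is a Poisson prior: the Poisson-weighted sum of powers of A equals exp(rate (A - I)), which is exp(A) up to a scalar factor when the rate is 1. Whether this is the prior intended in the slides is an assumption. The sketch below checks the identity numerically; the stand-in matrix A (column-normalised Gaussian affinity), the `rate` value and the helper `expm_series` are hypothetical choices for illustration.

```python
import numpy as np
from math import exp, factorial

def expm_series(M, terms=60):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=12)
W = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
A = W / W.sum(axis=0)[None, :]           # column-stochastic stand-in for A

rate = 3.0                               # hypothetical Poisson rate
# Explicit mixture over iteration counts d = 0, 1, 2, ...
mixture = sum(
    exp(-rate) * rate ** d / factorial(d) * np.linalg.matrix_power(A, d)
    for d in range(60)
)

# Closed form: sum_d Poisson(d; rate) A^d = exp(rate * (A - I)).
closed_form = expm_series(rate * (A - np.eye(A.shape[0])))
print(np.abs(mixture - closed_form).max())
```

In practice a library routine such as `scipy.linalg.expm` would normally replace the hand-written series; it is avoided here only to keep the example dependent on NumPy alone.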

Generating targets for rings data
• Can generate from the model:
• Across-cluster covariance is low.
• Within-cluster continuity.

The point?
• Density dependence matters in missing data problems.
• Gaussian process: data with missing targets has no influence.
• Density Traversal Kernel: data with missing targets affects the kernel, and hence has influence.
