Clustering on the Simplex
Morten Mørup, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark
EMMDS 2009, July 3rd, 2009
Joint work with Christian Walder (DTU Informatics, Intelligent Signal Processing, Technical University of Denmark) and Lars Kai Hansen (DTU Informatics, Intelligent Signal Processing, Technical University of Denmark).
Clustering
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. (Wikipedia)
Clustering approaches
K-means iterative refinement algorithm (Lloyd, 1982; Hartigan, 1979):
• Assignment step (S): assign each data point to the cluster with the closest mean value.
• Update step (C): calculate the new mean value for each cluster.
The problem is NP-complete (Megiddo and Supowit, 1984).
Relaxations of the hard assignment problem:
• Annealing approaches based on a temperature parameter (as T→0 the original clustering problem is recovered); see for instance Hofmann and Buhmann, 1997.
• Fuzzy clustering (Hathaway and Bezdek, 1988).
• Expectation Maximization (Mixture of Gaussians).
• Spectral clustering.
Guarantee of optimality: no single change in assignment is better than the current assignment (1-spin stability).
Drawbacks: previous relaxations are either not exact or depend on some problem-specific annealing parameter in order to recover the original binary combinatorial assignments.
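For reference, a minimal NumPy sketch of Lloyd's iterative refinement (not the talk's code; the initialization is illustrative and empty clusters are not handled):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    # X: M x N data matrix (features x observations).
    rng = np.random.default_rng(seed)
    mu = X[:, rng.choice(X.shape[1], k, replace=False)]    # k initial means
    for _ in range(n_iter):
        # Assignment step (S): assign each point to the closest mean.
        d2 = ((X[:, None, :] - mu[:, :, None]) ** 2).sum(axis=0)   # k x N
        assign = d2.argmin(axis=0)
        # Update step (C): recompute each cluster mean
        # (a cluster that loses all points would yield NaN here).
        mu = np.stack([X[:, assign == j].mean(axis=1) for j in range(k)], axis=1)
    return assign, mu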
From the K-means objective to Pairwise Clustering
K-means objective and the pairwise clustering objective (Buhmann and Hofmann, 1994). Here K is a similarity matrix; with K = X^T X the pairwise clustering objective is equivalent to the k-means objective.
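The two objectives on this slide were images and did not survive extraction; a reconstruction of the standard forms they refer to:
K-means, over binary assignments $s_{kn} \in \{0,1\}$ with $\sum_k s_{kn} = 1$:
\[ \min_{S,\,\mu} \; \sum_{k=1}^{K} \sum_{n=1}^{N} s_{kn} \, \lVert x_n - \mu_k \rVert^2 \]
Pairwise clustering, for a similarity matrix $K$ and columns $s_k$ of $S$:
\[ \max_{S} \; \sum_{k=1}^{K} \frac{s_k^{\top} K \, s_k}{s_k^{\top} \mathbf{1}} \]
With $K = X^{\top} X$ the two are equivalent up to the constant $\operatorname{tr}(K)$.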
Although clustering is hard, there is room to be simple(x) minded!
Binary Combinatorial (BC) vs. Simplicial Relaxation (SR).
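In symbols (a reconstruction; the slide's own formulas were images):
\[ \text{BC:}\quad s_{kn} \in \{0,1\}, \quad \sum_{k} s_{kn} = 1 \;\;\forall n
\qquad\longrightarrow\qquad
\text{SR:}\quad s_{kn} \ge 0, \quad \sum_{k} s_{kn} = 1 \;\;\forall n \]
That is, each column of S is relaxed from a vertex of the simplex to an arbitrary point on the simplex.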
The simplicial relaxation (SR) admits standard continuous optimization for solving the pairwise clustering problem, for instance by normalization-invariant projected gradient ascent:
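The slide's update rule was also lost in extraction. Below is a minimal NumPy sketch of one way to optimize the pairwise objective above on the simplex, using plain projected gradient ascent with a Euclidean simplex projection (Duchi et al., 2008) in place of the talk's normalization-invariant update; the function names, step size, and initialization are illustrative assumptions:

import numpy as np

def project_rows_to_simplex(S):
    # Euclidean projection of each row onto the probability simplex
    # (Duchi et al., 2008).
    N, K = S.shape
    U = np.sort(S, axis=1)[:, ::-1]               # rows sorted descending
    css = np.cumsum(U, axis=1) - 1.0
    ind = np.arange(1, K + 1)
    cond = U - css / ind > 0
    rho = K - np.argmax(cond[:, ::-1], axis=1)    # last index where cond holds
    theta = css[np.arange(N), rho - 1] / rho
    return np.maximum(S - theta[:, None], 0.0)

def sr_cluster(Kmat, k, n_iter=500, lr=1e-2, seed=0):
    # Kmat: N x N similarity matrix. S is stored N x k with each row on the
    # simplex (rows here correspond to the columns of S on the slides).
    rng = np.random.default_rng(seed)
    N = Kmat.shape[0]
    S = project_rows_to_simplex(rng.random((N, k)))
    for _ in range(n_iter):
        KS = Kmat @ S                              # (K s_k)_n, N x k
        num = np.einsum('nk,nk->k', S, KS)         # s_k^T K s_k
        den = S.sum(axis=0) + 1e-12                # 1^T s_k, guarded against 0
        grad = (2.0 * KS * den - num) / den**2     # quotient rule, per column
        S = project_rows_to_simplex(S + lr * grad) # ascent step + projection
    return S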
Synthetic data example: K-means vs. SR-clustering. The brown and grey clusters each contain 1000 data points in R², whereas the remaining clusters each have 250 data points.
The SR-clustering algorithm is driven by high-density regions.
Thus, solutions are in general substantially better than those of Lloyd's algorithm at the same computational complexity. [Figure: objective values for SR-clustering (init=1), SR-clustering (init=0.01), and Lloyd's K-means.]
[Figure: K-means vs. SR-clustering (init=1) and SR-clustering (init=0.01), for 10, 50, and 100 components.]
SR-clustering for kernel-based semi-supervised learning
Kernel-based semi-supervised learning based on pairwise clustering (Basu et al., 2004; Kulis et al., 2005; Kulis et al., 2009).
The simplicial relaxation admits solving the problem as a (non-convex) continuous optimization problem.
Class labels can be handled by explicit fixing; must-links and cannot-links can be absorbed into the kernel. Hence the problem reduces more or less to a standard SR-clustering problem for the estimation of S.
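A minimal sketch of how such links might be absorbed into the kernel, in the spirit of Kulis et al.; the weight w and the link lists are illustrative assumptions, not taken from the talk:

import numpy as np

def absorb_links(Kmat, must_links, cannot_links, w=1.0):
    # Reward clustering must-linked pairs together and penalize
    # clustering cannot-linked pairs together, directly in the kernel.
    Kss = Kmat.copy()
    for i, j in must_links:
        Kss[i, j] += w
        Kss[j, i] += w
    for i, j in cannot_links:
        Kss[i, j] -= w
        Kss[j, i] -= w
    return Kss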
At stationarity, the gradients of the elements in each column of S that are 1 are larger than those of the elements that are 0. Thus, the impact of the supervision can be evaluated by estimating the minimal Lagrange multipliers that guarantee stationarity of the solution obtained by the SR-clustering algorithm; this is a convex optimization problem. The Lagrange multipliers thus give a measure of conflict between the data and the supervision.
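In symbols (a reconstruction of the stated condition): for a column of S at a vertex with active cluster $k^\ast$, stationarity on the simplex requires
\[ \frac{\partial f}{\partial s_{k^\ast n}} \;\ge\; \frac{\partial f}{\partial s_{kn}} \quad \text{for all } k \neq k^\ast, \]
and the smallest Lagrange multipliers $\lambda_{kn} \ge 0$ that close the corresponding KKT conditions can be found by convex optimization; large multipliers flag supervised constraints that conflict with the data.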
Digit classification with one mislabeled data observation from each class.
Community Detection in Complex Networks
Communities/modules: natural divisions of network nodes into densely connected subgroups (Newman & Girvan, 2003). [Diagram: a graph G(V,E) and its adjacency matrix A; a community detection algorithm produces a clustering assignment S, from which a permutation P of the graph yields the permuted adjacency matrix PAP^T.]
Common community detection objectives
The Hamiltonian (Fu & Anderson, 1986; Reichardt & Bornholdt, 2004) and modularity (Newman & Girvan, 2004) are generic problems of the form sketched below.
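A reconstruction of the standard forms (assumed to match the slide's lost equations; $m$ is the number of edges and $k_i$ the degree of node $i$):
Modularity (Newman & Girvan, 2004):
\[ Q = \frac{1}{2m} \sum_{ij} \Bigl( A_{ij} - \frac{k_i k_j}{2m} \Bigr) \delta(c_i, c_j) \]
Generic form, over binary assignments $S$ with columns $s_k$ and a problem-specific interaction matrix $B$ (e.g. $B = A - \frac{1}{2m} k k^{\top}$ for modularity, up to scaling):
\[ \max_{S} \; \operatorname{tr}\bigl( S^{\top} B S \bigr) \]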
Again we can make an exact relaxation to the simplex!
SR-clustering of complex networks: the quality of the solutions is comparable to results obtained by extensive Gibbs sampling.
So far we have demonstrated how binary combinatorial constraints are recovered at stationarity when relaxing the problems to the simplex. However, simplex constraints also hold promising data mining properties of their own!
The Convex Hull and the Principal Convex Hull (PCH)
Def: The convex hull/convex envelope of X ∈ R^{M×N} is the minimal convex set containing X. (Informally it can be described as a rubber band wrapped around the data points.) Finding the convex hull is solvable in linear time, O(N) (McCallum and Avis, 1979). However, the size of the convex set grows exponentially with the dimensionality of the data, O(log^{M-1}(N)) (Dwyer, 1988).
Def: The PCH is the best convex set of size K according to some measure of distortion D(·|·) (Mørup et al., 2009). (Informally it can be described as a less flexible rubber band that wraps most of the data points.)
The mathematical formulation of the Principal Convex Hull (PCH) is given by two simplex constraints, with "principal" in terms of the Frobenius norm. C gives the fractions in which observations in X are used to form each feature (distinct aspects/"freaks"); in general C will be very sparse! S gives the fraction by which each observation resembles each distinct aspect in XC. (Note that when K is large enough, the PCH recovers the convex hull.)
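A reconstruction of the formulation (consistent with the description above; $X$ is $M \times N$, $C$ is $N \times K$, $S$ is $K \times N$):
\[ \min_{C,\,S} \; \lVert X - X C S \rVert_F^2
\quad \text{s.t.} \quad
c_{nk} \ge 0,\; \sum_n c_{nk} = 1, \qquad
s_{kn} \ge 0,\; \sum_k s_{kn} = 1. \]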
Relation between the PCH model, low-rank decomposition, and clustering approaches: PCH naturally bridges clustering and low-rank approximations!
Two important properties of the PCH model:
• The PCH model is invariant to affine transformation and scaling.
• The PCH model is unique up to permutation of the components.
A feature extraction example: more contrast in the features than obtained by clustering approaches. As such, PCH aims for distinct aspects/regions in the data; the PCH model strives to attain Platonic "Ideal Forms".
PCH model for PET data (Positron Emission Tomography). The data contain 3 components: high-binding regions, low-binding regions, and non-binding regions. Each voxel is given as a concentration fraction of these regions. [Figure: the extracted components XC and the fractions S.]
NMF spectroscopy of samples of mixtures of propanol, butanol, and pentanol.
Collaborative filtering example: medium-size and large-size MovieLens data (www.grouplens.org). Medium size: 1,000,209 ratings of 3,952 movies by 6,040 users. Large size: 10,000,054 ratings of 10,677 movies given by 71,567 users.
Conclusion
• The simplex offers unique data mining properties.
• Simplicial relaxations (SR) form exact relaxations of common hard assignment clustering problems, i.e. K-means, pairwise clustering, and community detection in graphs.
• SR enables solving binary combinatorial problems using standard solvers from continuous optimization.
• The proposed SR-clustering algorithm outperforms traditional iterative refinement algorithms.
• No need for an annealing parameter: hard assignments are guaranteed at stationarity (Theorems 1 and 2).
• Semi-supervised learning can be posed as a continuous optimization problem with associated Lagrange multipliers giving an evaluation measure of each supervised constraint.
Conclusion cont.
• The Principal Convex Hull (PCH) is formed by two types of simplex constraints.
• It extracts distinct aspects of the data.
• It is relevant for data mining in general, wherever low-rank approximation and clustering approaches have been invoked.
A reformulation of "Lex Parsimoniae"
"The simplest explanation is usually the best." - William of Ockham
"The simplex explanation is usually the best."
"Simplicity is the ultimate sophistication." - Leonardo da Vinci
"Simplexity is the ultimate sophistication."
The presented work is described in:
M. Mørup and L. K. Hansen, "An Exact Relaxation of Clustering", submitted to JMLR, 2009.
M. Mørup, C. Walder and L. K. Hansen, "Simplicial Semi-supervised Learning", submitted.
M. Mørup and L. K. Hansen, "Platonic Forms Revisited", submitted.