This paper presents a consensus framework for integrating distributed clusterings under limited knowledge sharing, with applications in knowledge reuse, distributed data mining, and cluster ensembles.
A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing
Joydeep Ghosh, Alex Strehl, Srujana Merugu
The University of Texas at Austin
Setting
• Given multiple clusterings
  • possibly distributed in time and space
  • possibly over non-identical sets of objects
• Obtain a single integrated clustering
  • without sharing algorithms or features (records)
Application Scenarios
• Knowledge reuse
  • Consolidate legacy clusterings without accessing detailed object descriptions
• Distributed data mining
  • Only some features available per clusterer
  • Only some objects available per clusterer
• Improve quality and robustness
  • Reduce variance
  • Good results on a wide range of data using a diverse portfolio of algorithms
  • Estimate a reasonable number of clusters k
Cluster Ensembles
• Given a set of provisional partitionings, we want to aggregate them into a single consensus partitioning, even without access to the original features.
[Figure: individual clusterers each produce cluster labels, which a consensus function combines into consensus labels]
Cluster Ensemble Problem
• Let there be r clusterings (labelings), with the q-th clustering containing k(q) clusters
• What is the integrated clustering that optimally summarizes the r given clusterings using k clusters?
• Much more difficult than classification ensembles
What is the “best” consensus?
• Maximize the average [0, 1]-normalized mutual information (ANMI) with all the individual labelings of the ensemble, given the number of clusters k.
• Normalized mutual information (NMI) between random variables X, Y:
  NMI(X, Y) = I(X, Y) / sqrt( H(X) H(Y) )
• Empirical validation
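As a concrete illustration, here is a minimal sketch (ours, not the authors' code) of the [0, 1]-normalized mutual information and the average-NMI (ANMI) objective; the function names nmi and anmi are our own.

```python
import numpy as np

def nmi(x, y):
    """[0,1]-normalized mutual information between two label vectors,
    using the geometric-mean normalization NMI = I(X,Y) / sqrt(H(X) H(Y))."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    # joint contingency table of the two labelings
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    cont = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(cont, (xi, yi), 1)
    pxy = cont / n
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

def anmi(consensus, labelings):
    """Average NMI of a candidate consensus labeling against all input labelings."""
    return float(np.mean([nmi(consensus, lab) for lab in labelings]))
```

NMI is invariant to relabeling, so, for example, anmi([0, 0, 1, 1], [[0, 0, 1, 1], [1, 1, 0, 0]]) evaluates to 1.0.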
Designing a Consensus Function
• Direct optimization is impractical
• Three efficient heuristics:
  • Cluster-based Similarity Partitioning Algorithm (CSPA): O(n² k r)
  • HyperGraph Partitioning Algorithm (HGPA): O(n k r)
  • Meta-CLustering Algorithm (MCLA): O(n k² r²)
• All three exploit a hypergraph representation of the sets of cluster labels (the input to the consensus function)
• Supra-consensus function: runs all three and picks the result with the highest ANMI (fully unsupervised), as sketched below
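A minimal sketch of the supra-consensus step, assuming cspa, hgpa, and mcla are implemented elsewhere and that anmi is the function from the previous sketch:

```python
def supra_consensus(labelings, k, consensus_fns):
    """Run each candidate consensus function (e.g. CSPA, HGPA, MCLA) on the
    same input labelings and keep the result with the highest ANMI; no
    ground-truth labels are needed, so the choice is fully unsupervised."""
    candidates = [fn(labelings, k) for fn in consensus_fns]
    return max(candidates, key=lambda cand: anmi(cand, labelings))

# hypothetical usage: supra_consensus(labelings, k=3, consensus_fns=[cspa, hgpa, mcla])
```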
Hypergraph Representation
• One hyperedge per cluster
• Example: see the sketch below
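The slide's example appears only as a figure; as a stand-in, the following sketch (our construction, not the authors' code) builds the binary hyperedge-membership matrix, with one indicator column per cluster:

```python
import numpy as np

def hyperedge_matrix(labelings):
    """Stack one binary indicator column per cluster, across all labelings.
    Row i tells which hyperedges (clusters) object i belongs to."""
    cols = []
    for lab in labelings:
        lab = np.asarray(lab)
        for c in np.unique(lab):
            cols.append((lab == c).astype(int))
    return np.column_stack(cols)   # shape: n_objects x total_number_of_clusters

# illustrative labelings: 3 clusterings of 7 objects (not the slide's exact example)
labelings = [[0, 0, 0, 1, 1, 2, 2],
             [1, 1, 0, 0, 2, 2, 2],
             [0, 0, 1, 1, 1, 2, 2]]
H = hyperedge_matrix(labelings)
```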
Cluster-based Similarity Partitioning (CSPA)
• Pairwise object similarity = number of shared hyperedges
• Cluster the objects based on this “consensus” similarity matrix, using e.g. graph partitioning (see the sketch below)
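A minimal CSPA-style sketch, reusing hyperedge_matrix from the previous sketch; dividing by the number of clusterings puts the similarities in [0, 1]. The induced similarity graph would then be handed to a graph partitioner such as METIS, which is not shown here:

```python
import numpy as np

def cspa_similarity(labelings):
    """n x n consensus similarity: the fraction of clusterings in which two
    objects end up in the same cluster (i.e. share a hyperedge)."""
    H = hyperedge_matrix(labelings)      # binary n x (total clusters) matrix
    return (H @ H.T) / len(labelings)
```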
HyperGraph Partitioning Algorithm (HGPA)
• Partition the hypergraph so that a minimum number of hyperedges is cut
• Hypergraph partitioning is a well-known problem from, e.g., VLSI design
• We use hMETIS
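A small sketch of how the hyperedges could be collected before handing them to a hypergraph partitioner such as hMETIS; writing hMETIS's actual input format is omitted, and the helper name is ours:

```python
def hyperedges(labelings):
    """One hyperedge per cluster: the list of object indices that cluster contains."""
    edges = []
    for lab in labelings:
        clusters = {}
        for obj, c in enumerate(lab):
            clusters.setdefault(c, []).append(obj)
        edges.extend(clusters.values())
    return edges
```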
Meta-CLustering Algorithm (MCLA)
Build a meta-graph such that
• each vertex is a cluster (vertex weight = cluster size)
• each edge weight is the similarity between the two clusters
  • similarity = intersection/union (Jaccard similarity between hyperedges h_a and h_b)
• Balanced partitioning of this r-partite meta-graph (METIS)
• Assign each object to its best-matching meta-cluster (see the sketch below)
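A sketch of the meta-graph construction under the same notation (hyperedge indicator columns h_a, h_b of the matrix H built earlier); the balanced partitioning of this graph (METIS in the paper) and the final object-to-meta-cluster assignment are not shown:

```python
import numpy as np
from itertools import combinations

def jaccard(ha, hb):
    """Binary Jaccard similarity |ha AND hb| / |ha OR hb| between two hyperedges."""
    ha, hb = np.asarray(ha, dtype=bool), np.asarray(hb, dtype=bool)
    union = np.logical_or(ha, hb).sum()
    return float(np.logical_and(ha, hb).sum()) / union if union else 0.0

def meta_graph_weights(H):
    """Edge weights of the meta-graph whose vertices are the clusters
    (columns of the hyperedge matrix H)."""
    m = H.shape[1]
    W = np.zeros((m, m))
    for a, b in combinations(range(m), 2):
        W[a, b] = W[b, a] = jaccard(H[:, a], H[:, b])
    return W
```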
MCLA Example 1/2
• Each vertex is a cluster
• Edge weight is the Jaccard similarity between h_a and h_b
MCLA Example 2/2
• In this illustrative example, CSPA, HGPA, and MCLA give the same result
Applications and Experiments
• Proprietary datasets
• Data-sets
  • 2-dimensional 2-Gaussian simulated data (k=2, d=2, n=1000)
  • 5 Gaussians in 8 dimensions (k=5, d=8, n=1000)
  • Pen digit data (k=10, d=16, n=7494)
  • Yahoo news web-document data (k=40, d=2903, n=2340)
• Application setups
  • Feature-Distributed Clustering (FDC)
  • Integrating clusterings of varying resolution
  • Robust consensus
  • Object-Distributed Clustering (ODC)
• Extrinsic evaluation
FDC Example
• Data: 5 Gaussians in 8 dimensions
• Experiment: 5 clusterings in 2-dimensional subspaces
• Result: average individual quality 0.70, best individual 0.77, ensemble 0.99
Experimental Results: FDC
• Reference clustering and consensus clustering
• Ensemble always equal to or better than the individual clusterings
  • More than double the average individual quality on YAHOO
Combining Clusterings of Different Resolutions
Motivation
• Robust combining of cluster solutions of different resolutions, as produced in real-life distributed-data scenarios
• The ensemble helps estimate the “natural” number of clusters
• Use ANMI to guide the choice of k (see the sketch below)
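A minimal sketch of using ANMI to guide the choice of k, reusing anmi from the earlier sketch and assuming some consensus function consensus_fn(labelings, k), e.g. the supra-consensus function:

```python
def pick_k(labelings, consensus_fn, k_candidates):
    """Sweep candidate numbers of clusters and keep the (k, consensus labeling)
    pair whose ANMI with the input ensemble is highest."""
    scored = [(k, consensus_fn(labelings, k)) for k in k_candidates]
    return max(scored, key=lambda kc: anmi(kc[1], labelings))
```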
Experiments
Table 1: Details of datasets and cluster ensembles with varying k.
Behavior of ANMI w.r.t. k (#clusters)
ANMI vs. NMI Correlation
• NMI (match with ground truth) vs. ANMI (match with consensus)
• Correlation coefficient 0.923, except YAHOO (0.61)
Results (1)
Table 2: Quality of clusterings in terms of NMI w.r.t. the corresponding original categorization.
Object-Distributed Clustering (ODC)
Scenario: the data is divided into p overlapping partitions; each object appears, on average, in a small number of partitions (the repetition factor).
Advantages:
• Distributed clustering
• Speed-up when the inner clustering algorithm has super-linear complexity and a fast consensus function (MCLA, HGPA) is used
  • For an O(n²) clustering algorithm and an O(n) consensus function, the asymptotic sequential speedup is p/2 (e.g., clustering the YAHOO data can be sped up 64-fold with 16 processors while retaining 80% of the full-data quality, assuming a repetition factor of 2)
• Readily admits p-fold parallelization
Experiments:
• Individual partitions were clustered using graph partitioning
• Results were combined using the consensus framework (a minimal partitioning sketch follows below)
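A toy sketch of forming p overlapping partitions with a repetition factor of 2 (our simplification; the paper's actual partitioning scheme may differ). Each partition would then be clustered locally and the per-partition labelings combined with the consensus framework:

```python
import numpy as np

def overlapping_partitions(n, p, repetition=2, seed=0):
    """Assign each of n objects to `repetition` of the p partitions at random,
    so every object appears `repetition` times across the partitions."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(p)]
    for obj in range(n):
        for j in rng.choice(p, size=repetition, replace=False):
            parts[j].append(obj)
    return parts
```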
ODC Results
Figure 3: Clustering quality (measured by relative mutual information) as a function of the number of partitions, p, on the various data sets: (a) 2D2K, (b) 8D5K, (c) PENDIG, (d) YAHOO. The repetition factor was set to 2 and graph partitioning was used for clustering the data.
Robust Consensus Clustering (RCC)
• Goal: create an “auto-focus” clusterer that works well across a wide variety of data-sets
• Diverse portfolio of 10 approaches:
  • SOM, HGP (hypergraph partitioning)
  • GP (graph partitioning) with Euclidean, correlation, cosine, and extended Jaccard similarities
  • KM (k-means) with Euclidean, correlation, cosine, and extended Jaccard similarities
• Each approach is run on the same subsample of the data, and the 10 clusterings are combined using our supra-consensus function (a stand-in portfolio is sketched below)
• Evaluation: increase in NMI of the supra-consensus results over random
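A stand-in portfolio built from scikit-learn (not the paper's SOM/HGP/GP/KM portfolio or its four similarity measures); the label vectors it returns would be combined with the supra-consensus sketch shown earlier:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

def diverse_portfolio(X, k):
    """Run a few structurally different clusterers on the same data sample X
    and return their label vectors for the consensus step."""
    models = [
        KMeans(n_clusters=k, n_init=10, random_state=0),
        AgglomerativeClustering(n_clusters=k, linkage="ward"),
        AgglomerativeClustering(n_clusters=k, linkage="average"),
        SpectralClustering(n_clusters=k, assign_labels="discretize", random_state=0),
    ]
    return [m.fit_predict(X) for m in models]
```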
Robustness Summary
• Average quality versus ensemble quality
• For several sample sizes n (50, 100, 200, 400, 800)
• 10-fold experiments
• ±1 standard-deviation bars
Remarks
• Cluster ensembles
  • Enable knowledge reuse
  • Work with distributed data under strong privacy constraints
  • Improve quality and robustness
  • Are as yet largely unexplored
• Future work
  • Combining soft clusterings
  • Preferential consensus
  • What if (some) features are known?
  • What if segments are ordered?
  • Applications and data sets: bioinformatics
• Papers, data, demos, and code at http://strehl.com/