Adaptive Cluster Ensemble Selection Javad Azimi, Xiaoli Fern {azimi, xfern}@eecs.oregonstate.edu Oregon State University Presenter: Javad Azimi.
Cluster Ensembles [Pipeline] Data Set → set up different clustering methods (Clustering 1, Clustering 2, …, Clustering n) → generate different results (Result 1, Result 2, …, Result n) → Consensus Function combines them to obtain the final clusters.
Cluster Ensembles: Challenge • One can easily generate hundreds or thousands of clustering results. • Is it good to always include all clustering results in the ensemble? • We may want to be selective. • Which subset is the best?
What makes a good ensemble? • Diversity • Members should be different from each other • Measured by Normalized Mutual Information (NMI) • Select a subset of ensemble members based on diversity: • Hadjitodorov et al. 2005: Ensemble with median diversity usually works better. • Fern and Lin 2008: Cluster ensemble members into distinct groups and then choose one from each group.
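Below is a minimal sketch of how diversity could be quantified with NMI, assuming scikit-learn is available; the helper name and the choice of averaging NMI over all member pairs are illustrative, not the exact measure used in the cited work.

```python
# Minimal sketch: ensemble diversity as the average pairwise NMI between member
# clusterings (higher NMI = members are more similar, i.e. less diverse).
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_pairwise_nmi(labelings):
    """labelings: list of 1-D arrays of cluster ids over the same data points."""
    n = len(labelings)
    scores = [normalized_mutual_info_score(labelings[i], labelings[j])
              for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(scores))
```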
Diversity in Cluster Ensembles: Drawback • Existing approaches design selection heuristics without considering the characteristics of the data sets and ensembles. • Our goal: select adaptively based on the behavior of the data set and the ensemble itself.
Our Approach • We empirically examined the behavior of the ensembles and the clustering performance on 4 different data sets. • We used these four training sets to learn an adaptive strategy. • We evaluated the learned strategy on separate test data sets. • The 4 training data sets: Iris, Soybean, Wine, Thyroid.
An Empirical Investigation • Generate a large ensemble • 100 independent runs of two different algorithms (K-means and MSF) • Analyze the diversity of the generated ensemble • Generate a final result P* based on all ensemble members • Compute the NMI between ensemble members and P* • Examine the distribution of the diversity • Consider different potential subsets selected based on diversity and evaluate their clustering performance
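An illustrative sketch of this investigation loop, shown for the K-means runs only (the MSF-based members are not sketched here); the function names are placeholders, and `p_star` can be any consensus partition over the full ensemble.

```python
# Sketch: generate a K-means ensemble and profile each member's NMI with P*.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def build_kmeans_ensemble(X, k, n_runs=100, seed=0):
    rng = np.random.RandomState(seed)
    return [KMeans(n_clusters=k, n_init=1,
                   random_state=rng.randint(1 << 30)).fit_predict(X)
            for _ in range(n_runs)]

def nmi_profile(members, p_star):
    # Diversity profile: NMI of every ensemble member against the consensus P*.
    return np.array([normalized_mutual_info_score(m, p_star) for m in members])
```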
Observation #1 • There are two distinct types of ensembles: • Stable: most ensemble members are similar to P*. • Unstable: most ensemble members are different from P*. [Histogram: x-axis = NMI with P*, y-axis = # of ensembles, showing the stable and unstable cases.]
Consider Different Subsets • Compute the NMI between each member and P*. • Sort the NMI values. • Consider 4 different subsets. [Members sorted by NMI with P*: low-diversity subset (L), medium-diversity subset (M), high-diversity subset (H).]
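One way this subset construction could look in code; the equal-thirds split is an assumption made for illustration, since the slides only state that members are sorted by their NMI with P*.

```python
# Sketch of the subset construction over members sorted by NMI with P*.
import numpy as np

def diversity_subsets(members, nmi_to_pstar):
    order = np.argsort(nmi_to_pstar)[::-1]          # decreasing NMI with P*
    third = len(members) // 3
    L = [members[i] for i in order[:third]]         # low diversity: most similar to P*
    M = [members[i] for i in order[third:2*third]]  # medium diversity
    H = [members[i] for i in order[2*third:]]       # high diversity: least similar to P*
    return L, M, H
```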
Observation #2 • Different subsets work best for stable and unstable data: • Stable: subsets F (the full ensemble) and L worked well. • Unstable: subset H worked well.
Our Final Strategy • Generate a large ensemble Π (200 solutions). • Obtain the consensus partition P*. • Compute the NMI between each ensemble member and P* and sort the members in decreasing order. • If the average NMI > 0.5, classify the ensemble as stable and output P* as the final partition. • Otherwise, classify the ensemble as unstable, select the H (high-diversity) subset, and output its consensus clustering.
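Putting the steps together, a hedged end-to-end sketch of the strategy; `consensus(members, k)` stands in for the co-association + average-link HAC step described on the next slide, and the size of the H subset is again an assumed detail.

```python
# End-to-end sketch of the adaptive selection strategy described above.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def adaptive_select(members, consensus, k, threshold=0.5):
    p_star = consensus(members, k)                        # consensus over the full ensemble
    nmi = np.array([normalized_mutual_info_score(m, p_star) for m in members])
    if nmi.mean() > threshold:                            # stable ensemble: keep P*
        return p_star
    order = np.argsort(nmi)                               # increasing NMI: most diverse first
    H = [members[i] for i in order[:len(members) // 3]]   # high-diversity subset (size assumed)
    return consensus(H, k)                                # consensus over H only
```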
Experimental Setup • 100 independent runs of k-means and MSF are used to generate the ensemble members. • Consensus function: average link HAC on the co-association matrix
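A sketch of such a consensus function using SciPy: build the co-association matrix from the ensemble members, convert it to a dissimilarity, and cut an average-link dendrogram at k clusters. The function name and the use of `fcluster` with the `maxclust` criterion are illustrative choices.

```python
# Sketch of the consensus function: co-association matrix + average-link HAC cut at k.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_consensus(members, k):
    n = len(members[0])
    co = np.zeros((n, n))
    for labels in members:
        labels = np.asarray(labels)
        co += (labels[:, None] == labels[None, :])   # 1 where two points share a cluster
    co /= len(members)
    dist = 1.0 - co                                  # co-association -> dissimilarity
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```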
Experimental Results: Selecting a Method vs. Selecting the Best Ensemble Members • Which members are selected for the final clustering? [Plot of NMI with P* for MSF and K-means members: in some cases only MSF members are selected; in others both MSF and K-means members are selected.]
Experimental Results: How Accurate Are the Selected Ensemble Members? • x-axis: members in decreasing order of NMI with P* (most similar to P* on the left, most dissimilar on the right). • y-axis: their corresponding NMI values with the ground-truth labels. [Plot: the selected ensemble members tend to be the more accurate ones.]
Conclusion • We empirically learned a simple ensemble selection strategy: • First classify a given ensemble as stable or unstable. • Then select a subset according to the classification result. • On separate test data sets, we achieve excellent results: • Sometimes significantly better than the best ensemble member. • Outperforms an existing selection method.