An Architecture and Algorithms for Multi-Run Clustering Rachsuda Jiamthapthaksin, Christoph F. Eick and Vadeerat Rinsurongkawong Computer Science Department, University of Houston, TX
Outline • Motivation • Goals • Overview • Related work • An architecture and algorithms for multi-run clustering • Experimental results • Conclusion and future work
1. Motivation
• Region discovery framework: a family of clustering algorithms combined with a family of plug-in fitness functions
• Current practice: domain experts manually select the parameters of the clustering algorithms
• Multi-run clustering: relies on active learning to automatically select the parameters of the clustering algorithms
• Cougar^2: Open Source Data Mining and Machine Learning Framework, https://cougarsquared.dev.java.net
2. Goals
• Given a spatial dataset O = {o1,…,on}, a clustering algorithm seeks a clustering X that maximizes a fitness function q(X), where X = {x1, x2,…,xk}, xi ∩ xj = ∅ (i ≠ j), and x1 ∪ … ∪ xk ⊆ O.
• The goal is to automatically find a set of distinct and high-quality clusters that originate from different runs.
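The definition above, reconstructed in display form (the disjointness and union-subset constraints follow the region discovery framework this work builds on; clusters need not cover all of O):

```latex
\text{Given } O=\{o_1,\dots,o_n\},\ \text{find } X=\{x_1,\dots,x_k\}\ \text{maximizing } q(X),
\quad\text{subject to } x_i \cap x_j = \emptyset\ (i \neq j)\ \text{and}\ \bigcup_{i=1}^{k} x_i \subseteq O.
```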
3. Overview of multi-run clustering – 1
• Key hypothesis: better clustering results can be obtained by combining clusters that originate from multiple runs of a clustering algorithm.
3. Overview of multi-run clustering – 2
• Challenges:
• Selecting appropriate parameters for an arbitrary clustering algorithm
• Determining which clusters should be stored as candidate clusters
• Generating a final clustering from the candidate clusters
• Finding alternative clusters, e.g. hotspots in spatial datasets at different granularities
4. Related work
• Meta clustering [Caruana et al. 2006]: first creates diverse clusterings, then groups them into clusters of clusterings, and finally lets users choose the group of clusterings that best fits their needs.
• Ensemble clustering [Gionis et al. 2005; Zeng et al. 2002]: aggregates different clusterings into one consolidated clustering.
Definition of a state
• A state s in a state space S (S ⊆ R^{2m}): s = {s1_min, s1_max, …, sm_min, sm_max}, i.e. a minimum and a maximum bound for each of the m parameters of the clustering algorithm
• A state s for CLEVER: s = {k'min, k'max, pmin, pmax, p'min, p'max}
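A minimal sketch of how such a state could be represented in code (the names and the interpretation of k', p and p' as CLEVER's representative count and neighborhood sampling sizes are illustrative assumptions, not taken from the Cougar^2 codebase):

```python
import random
from dataclasses import dataclass

@dataclass
class State:
    """A region of CLEVER's parameter space: one (min, max) pair per parameter."""
    k_min: int; k_max: int      # bounds for k', the initial number of representatives
    p_min: int; p_max: int      # bounds for p, the number of neighboring solutions sampled
    pp_min: int; pp_max: int    # bounds for p', the enlarged sampling size

    def sample_parameters(self):
        """Draw one concrete parameter setting uniformly from this state."""
        return {"k": random.randint(self.k_min, self.k_max),
                "p": random.randint(self.p_min, self.p_max),
                "pp": random.randint(self.pp_min, self.pp_max)}

s1 = State(1, 10, 1, 10, 11, 20)   # the state s1 used as an example later
print(s1.sample_parameters())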
5. An architecture of multi-run clustering system
• Components: the clustering algorithm (CLEVER), a state utility learning unit, a storage unit that maintains the cluster list M, and a cluster summarization unit that produces the final clustering M'.
• Steps in multi-run clustering:
• S1: Parameter selection
• S2: Run a clustering algorithm
• S3: Compute a state feedback
• S4: Update the state utility table
• S5: Update the cluster list M
• S6: Summarize the discovered clusters into M'
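A compact sketch of the S1–S6 loop; every step is passed in as a callable, since none of these helper names come from the paper, they just mirror the boxes in the architecture diagram:

```python
def multi_run_clustering(dataset, states, utilities, run_clever, select_state,
                         compute_rcq, update_utility, update_cluster_list,
                         summarize, n_runs=50):
    """Skeleton of the S1-S6 loop from the architecture slide."""
    M = []                                                 # stored candidate clusters
    for _ in range(n_runs):
        state = select_state(states, utilities)            # S1: pick a parameter region
        params = state.sample_parameters()                 # S1: draw concrete k', p, p'
        X = run_clever(dataset, params)                    # S2: one clustering run
        rcq = compute_rcq(X, M)                            # S3: state feedback RCQ(X, M)
        update_utility(utilities, states.index(state), rcq)  # S4: reinforce good states
        update_cluster_list(M, X)                          # S5: keep distinct, good clusters
    return summarize(M)                                    # S6: final clustering M'
```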
Pre-processing step (S0). Compute the statistics needed to set up the multi-run clustering system: we run m rounds of CLEVER with randomly selected k', p and p'.
Step 1. Select parameters of a clustering algorithm.
• Example: given states with selection probabilities P(1) = 0.2, P(2) = 0.6, P(3) = 0.2, where
• s1 = {k'min=1, k'max=10, pmin=1, pmax=10, p'min=11, p'max=20}
• s2 = {k'min=11, k'max=20, pmin=41, pmax=50, p'min=31, p'max=40}
• state s2 is drawn and the concrete parameters {k'=12, p=45, p'=40} are selected from it.
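One plausible way to turn learned utilities into the selection probabilities shown above is roulette-wheel sampling; this is an assumption, since the exact selection policy is not spelled out on this slide:

```python
import random

def select_state(states, utilities):
    """Pick a state with probability proportional to its learned utility;
    e.g. utilities (0.2, 0.6, 0.2) reproduce P(1), P(2), P(3) above."""
    total = sum(utilities)
    probs = [u / total for u in utilities]
    idx = random.choices(range(len(states)), weights=probs, k=1)[0]
    return states[idx]
```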
Step 2. Run CLEVER to generate a clustering X with respect to the given parameters k'=12, p=45, p'=40.
• CLEVER maximizes a plug-in fitness function q(X) (shown as a formula on the original slide).
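The slide's fitness-function formula is an image; in the region discovery framework this work builds on, q(X) is typically the sum of per-cluster rewards, roughly q(X) = Σ_{c∈X} i(c)·|c|^β with interestingness i and reward exponent β. A sketch under that assumption:

```python
def q(clustering, interestingness, beta=1.1):
    """Region-discovery-style fitness: sum of rewards i(c) * |c|**beta per cluster.
    (Assumed form; the exact formula on the slide is an image.)"""
    return sum(interestingness(c) * len(c) ** beta for c in clustering)
```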
Step 3. Compute a state utility using a relative clustering quality function (RCQ):
• RCQ(X, M) = Novelty(X, M) × ||Speed(X)|| × ||q(X)||
• Novelty(X, M) = (1 − similarity(X, M)) × Enhancement(X, M), where, for X = {x1,…,xk}, yi denotes the cluster in the stored cluster list M that is most similar to xi ∈ X.
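A sketch of the RCQ computation under illustrative definitions: similarity is taken as the mean best-match Jaccard overlap between X and M, and Speed and q are assumed to arrive already normalized to [0,1]; the slide does not define similarity, Enhancement or the normalizations.

```python
def jaccard(a, b):
    """Overlap between two clusters, each a set of object ids."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(X, M):
    """Mean best-match overlap: for each x in X, find its most similar
    stored cluster y in M (an illustrative choice of similarity)."""
    if not M:
        return 0.0
    return sum(max(jaccard(x, y) for y in M) for x in X) / len(X)

def rcq(X, M, q_norm, speed_norm, enhancement=1.0):
    """RCQ(X, M) = Novelty(X, M) * ||Speed(X)|| * ||q(X)||, where
    Novelty = (1 - similarity(X, M)) * Enhancement(X, M)."""
    novelty = (1.0 - similarity(X, M)) * enhancement
    return novelty * speed_norm * q_norm
```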
Step 4. Update the state utility table: the feedback RCQ(X, M) is folded into the utility U of the selected state, yielding the updated utility U'.
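A standard way to fold new feedback into a utility table is an exponential moving average, U'(s) = (1 − α)·U(s) + α·RCQ(X, M); this particular update rule is an assumption, as the slide shows only the update box, not a formula:

```python
def update_utility(utilities, state_idx, rcq_value, alpha=0.2):
    """Blend the new feedback into the state's utility (assumed EMA update)."""
    utilities[state_idx] = (1 - alpha) * utilities[state_idx] + alpha * rcq_value
```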
Step 5. Update the cluster list M with the clusters of the new clustering X, to maintain a set of distinct and high-quality clusters.
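One plausible admission rule, sketched here (the thresholds and the rule itself are illustrative; the 0.2 overlap threshold is borrowed from the experimental slides): a new cluster is stored if its reward is positive and it beats every stored cluster it overlaps too strongly, replacing those it beats.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def update_cluster_list(M, X, reward, max_overlap=0.2):
    """Keep M a set of distinct, high-quality clusters (illustrative rule)."""
    for x in X:
        if reward(x) <= 0:
            continue                      # discard non-positive-reward clusters
        rivals = [y for y in M if jaccard(x, y) > max_overlap]
        if all(reward(x) > reward(y) for y in rivals):
            for y in rivals:
                M.remove(y)               # x replaces the rivals it beats
            M.append(x)
```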
Step 6. Generate a final clustering M' with the Dominance-guided Cluster Reduction algorithm (DCR).
• DCR builds dominance graphs over the stored clusters in M: in the slide's example, the dominant cluster A (reward 0.8) dominates the overlapping clusters B and C, the dominant cluster D (reward 0.7) dominates E and F, and the final clustering M' keeps the dominant clusters A and D.
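A greedy sketch in the spirit of DCR, reconstructed from the description above rather than the paper's pseudocode: process stored clusters in descending reward order and admit a cluster into M' only if its overlap with every already admitted cluster stays below a threshold.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dcr(M, reward, max_overlap=0.2):
    """Greedy dominance-guided reduction: high-reward clusters dominate
    (and exclude) the stored clusters they overlap too strongly."""
    final = []
    for x in sorted(M, key=reward, reverse=True):
        if all(jaccard(x, y) <= max_overlap for y in final):
            final.append(x)
    return final
```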
6. Experimental evaluation – 1
• Evaluation of multi-run clustering on an earthquake dataset*
• Shows how multi-run clustering can discover interesting and alternative clusters in spatial data.
• We are interested in areas where deep earthquakes are in close proximity to shallow earthquakes.
• We use the High Variance function i(c) [Rinsurongkawong 2008] to find such regions.
*: the earthquake dataset is available on the website of the U.S. Geological Survey Earthquake Hazards Program, http://earthquake.usgs.gov/.
6. Experimental evaluation – 2
Fig. 6. Top 5 clusters of X_TheBestRun (ordered by reward). Fig. 7. Multi-run clustering results: clusters in M'.
6. Experimental evaluation – 3
• 70% of the clusters our system finds are new, high-quality clusters that do not exist in the best single run.
• With an overlap threshold of 0.2, 43% of the positive-reward clusters of the best run are not in M'.
6. Experimental evaluation – 4
Fig. 8. The multi-run clustering result (in color) overlaid with the top 5 reward clusters of the best run (in black).
7. Conclusion – 1
• We propose an architecture and a concrete system for multi-run clustering that copes with parameter selection for a clustering algorithm and obtains alternative clusters in a highly automated fashion.
• The system uses active learning to automate parameter selection, and various techniques to find both distinct and good clusters on the fly.
• We propose the Dominance-guided Cluster Reduction algorithm, which post-processes the clusters from the multiple runs to generate a final clustering by restricting cluster overlap.
7. Conclusion – 2
• The experimental results on the earthquake dataset support our claim that multi-run clustering outperforms single-run clustering with respect to clustering quality.
• Multi-run clustering can discover additional novel, alternative, high-quality clusters and enhance the quality of the clusters found by single-run clustering.
7. Future work
• Systematically evaluate the use of utility learning for choosing the parameters of a clustering algorithm.
• The ultimate goal is to construct multi-run, multi-objective clustering in one system.