
An Architecture and Algorithms for Multi-Run Clustering



Presentation Transcript


  1. An Architecture and Algorithms for Multi-Run Clustering Rachsuda Jiamthapthaksin, Christoph F. Eick and Vadeerat Rinsurongkawong Computer Science Department University of Houston, TX

  2. Outline • Motivation • Goals • Overview • Related work • An architecture and algorithms for multi-run clustering • Experimental results • Conclusion and future work

  3. 1. Motivation • The region discovery framework pairs a family of clustering algorithms with a family of plug-in fitness functions. • Today, domain experts manually select the parameters of the clustering algorithms; multi-run clustering instead relies on active learning to select these parameters automatically. • Cougar^2: Open Source Data Mining and Machine Learning Framework, https://cougarsquared.dev.java.net

  4. 2. Goals • Given a spatial dataset O = {o1,…,on}, a clustering algorithm seeks a clustering X that maximizes a fitness function q(X), where X = {x1, x2,…,xk}, xi ∩ xj = ∅ (i ≠ j), and x1 ∪ … ∪ xk ⊆ O. • The goal is to automatically find a set of distinct and high-quality clusters that originate from different runs. (A minimal encoding of these constraints is sketched below.)
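
The constraints on slide 4 translate directly into code. A minimal Python sketch (helper names are ours; the plug-in fitness function q is supplied by the framework):

from typing import List, Set

def is_valid_clustering(O: Set[int], X: List[Set[int]]) -> bool:
    """X = {x1,...,xk} is a valid clustering of O iff the clusters are
    pairwise disjoint (xi ∩ xj = ∅ for i ≠ j) and x1 ∪ ... ∪ xk ⊆ O."""
    seen: Set[int] = set()
    for x in X:
        if x & seen:          # overlaps an earlier cluster
            return False
        seen |= x
    return seen <= O          # union stays inside the dataset

# A clustering algorithm then searches for a valid X maximizing q(X),
# where q is a plug-in fitness function mapping a clustering to a float.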

  5. 3. Overview of multi-run clustering – 1 • Key hypothesis: better clustering results can be obtained by combining clusters that originate from multiple runs of a clustering algorithm.

  6. 3. Overview of multi-run clustering – 2 • Challenges: • Selecting appropriate parameters for an arbitrary clustering algorithm • Deciding which clusters should be stored as candidate clusters • Generating a final clustering from the candidate clusters • Finding alternative clusters, e.g. hotspots in spatial datasets at different granularities

  7. 4. Related work • Meta clustering [Caruana et al. 2006]: first creates diverse clusterings, then clusters them into groups, and finally lets users choose the group of clusterings that best fits their needs. • Ensemble clustering [Gionis et al. 2005; Zeng et al. 2002]: aggregates different clusterings into one consolidated clustering.

  8. Definition of a state • A state s in a state space S (S ⊆ ℝ^2m): s = {s1_min, s1_max,…, sm_min, sm_max}, si ∈ ℝ • A state s for CLEVER: s = {k'min, k'max, pmin, pmax, p'min, p'max} (encoded below)
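
A direct Python encoding of this state abstraction (names are ours, not the authors' code): a state bounds each CLEVER parameter, and a helper draws one concrete parameter setting from those bounds. The dataclass is frozen so states can serve as table keys.

import random
from dataclasses import dataclass

@dataclass(frozen=True)
class CleverState:
    k_min: int    # lower bound on k' (initial number of representatives)
    k_max: int    # upper bound on k'
    p_min: int    # lower bound on p
    p_max: int    # upper bound on p
    pp_min: int   # lower bound on p'
    pp_max: int   # upper bound on p'

    def sample_parameters(self) -> dict:
        """Draw one concrete (k', p, p') setting inside this state."""
        return {"k": random.randint(self.k_min, self.k_max),
                "p": random.randint(self.p_min, self.p_max),
                "pp": random.randint(self.pp_min, self.pp_max)}

# State s1 from slide 11: {k'min=1, k'max=10, pmin=1, pmax=10, p'min=11, p'max=20}
s1 = CleverState(1, 10, 1, 10, 11, 20)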

  9. 5. An architecture of the multi-run clustering system. Components: a State Utility Learning unit, a Clustering Algorithm, a Storage Unit holding the cluster list M, and a Cluster Summarization Unit producing the final clustering M'. Steps in multi-run clustering (a minimal driver loop is sketched below): S1: Parameter selection. S2: Run a clustering algorithm. S3: Compute a state feedback. S4: Update the state utility table. S5: Update the cluster list M. S6: Summarize the discovered clusters into M'.
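
A minimal driver loop for S1–S6, assuming the helpers sketched under the individual steps below; run_clever stands in for the authors' CLEVER implementation and is not shown here.

def multi_run_clustering(dataset, states, q, runs=50):
    """One pass of the architecture: repeat S1-S5, then summarize (S6)."""
    utilities = [0.0] * len(states)   # state utility table, one entry per state
    M = []                            # stored candidate clusters
    for _ in range(runs):
        i = select_state_index(states, utilities)          # S1: parameter selection
        params = states[i].sample_parameters()
        X = run_clever(dataset, params, q)                 # S2: run the clustering algorithm
        rcq = relative_clustering_quality(X, M, q)         # S3: state feedback RCQ(X, M)
        utilities[i] = updated_utility(utilities[i], rcq)  # S4: update the utility table
        M = update_cluster_list(M, X)                      # S5: keep distinct, high-quality clusters
    return dominance_guided_reduction(M)                   # S6: summarize M into M'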

  10. Pre-processing step. Compute the necessary statistics to set up the multi-run clustering system: we run m rounds of CLEVER with randomly selected k', p and p'.

  11. Step 1. Select parameters of a clustering algorithm. States are chosen according to their learned selection probabilities, e.g. P(1) = 0.2, P(2) = 0.6, P(3) = 0.2, with s1 = {k'min=1, k'max=10, pmin=1, pmax=10, p'min=11, p'max=20} and s2 = {k'min=11, k'max=20, pmin=41, pmax=50, p'min=31, p'max=40}. Selected parameter setting (drawn from s2): {k'=12, p=45, p'=40}. (See the sketch below.)
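
The slide only shows the resulting probabilities P(1)=0.2, P(2)=0.6, P(3)=0.2; how utilities map to probabilities is our assumption. A roulette-wheel sketch:

import random

def select_state_index(states, utilities):
    """Pick a state index with probability proportional to its utility,
    so higher-utility states are explored more often."""
    weights = [max(u, 1e-9) for u in utilities]   # keep every state reachable
    return random.choices(range(len(states)), weights=weights, k=1)[0]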

  12. Step 2. Run CLEVER to generate a clustering with respect to the given parameters, here k'=12, p=45, p'=40. [The slide shows the plug-in fitness function used; its formula is not recoverable from this transcript.]

  13. Step 3. Compute a state utility. A relative clustering quality function (RCQ) provides the feedback for the selected state: RCQ(X,M) = Novelty(X,M) × ||Speed(X)|| × ||q(X)||, where Novelty(X,M) = (1 − similarity(X,M)) × Enhancement(X,M). Here X = {x1,…,xk} is the new clustering and yi denotes the most similar cluster in the stored cluster list M to xi ∈ X. (A sketch of the computation follows.)
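
A sketch of the feedback computation. The transcript does not spell out similarity, Enhancement, or the normalizations ||·||, so Jaccard overlap stands in for similarity and Enhancement and Speed are left as plug-in assumptions.

def jaccard(a: set, b: set) -> float:
    """Overlap of two clusters viewed as sets of object ids."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def similarity(X, M):
    """Average similarity of each new cluster xi to its most similar
    stored cluster yi in M (Jaccard is a stand-in measure)."""
    if not M:
        return 0.0
    return sum(max(jaccard(x, y) for y in M) for x in X) / len(X)

def relative_clustering_quality(X, M, q,
                                enhancement=lambda X, M: 1.0, speed=1.0):
    """RCQ(X,M) = Novelty(X,M) x ||Speed(X)|| x ||q(X)||."""
    novelty = (1.0 - similarity(X, M)) * enhancement(X, M)
    return novelty * speed * q(X)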

  14. Step 4. Update the state utility: the utility U of the selected state is replaced by an updated value U' based on the feedback RCQ from Step 3 (one possible update rule is sketched below).
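
The slides only show U being replaced by U'; the blending rule below, a standard reinforcement-style update with learning rate α, is our assumption.

def updated_utility(u: float, rcq: float, learning_rate: float = 0.2) -> float:
    """U' = (1 - alpha)*U + alpha*RCQ(X,M): shift the state's utility
    toward the feedback from the latest run."""
    return (1.0 - learning_rate) * u + learning_rate * rcq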

  15. Step 5. Update the cluster list M to maintain a set of distinct and high-quality clusters from each new clustering X (see the sketch below).
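
A sketch of the storage-unit update, reusing jaccard from the Step 3 sketch. The reward function and both thresholds are assumptions of this sketch; the 0.2 echoes the overlap threshold reported on slide 19.

def update_cluster_list(M, X, reward=len, min_reward=1, max_sim=0.2):
    """Store a new cluster only if it is good enough (high reward) and
    distinct enough (low similarity to everything already stored)."""
    for x in X:
        if reward(x) >= min_reward and all(jaccard(x, y) <= max_sim for y in M):
            M.append(x)
    return M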

  16. Step 6. Generate a final clustering using the Dominance-guided Cluster Reduction algorithm (DCR); a greedy sketch follows. [Figure: dominance graphs over the stored clusters in M, e.g. a dominant cluster A (reward 0.8) dominating clusters B (0.7) and C (0.3); DCR reduces M to a final clustering M' of dominant clusters such as A, D, E and F.]
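
A greedy reading of DCR, again reusing jaccard from the Step 3 sketch: keep the highest-reward clusters and drop stored clusters they dominate, i.e. that overlap them beyond a threshold. This is our interpretation, not the authors' code.

def dominance_guided_reduction(M, reward=len, max_overlap=0.2):
    """Produce the final clustering M' by restricting cluster overlap."""
    final = []
    for x in sorted(M, key=reward, reverse=True):   # dominant clusters first
        if all(jaccard(x, y) <= max_overlap for y in final):
            final.append(x)                         # x is not dominated
    return final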

  17. 6. Experimental evaluation – 1 • Evaluation of multi-run clustering on an earthquake dataset*. • Shows how multi-run clustering can discover interesting and alternative clusters in spatial data. • We are interested in areas where deep earthquakes are in close proximity to shallow earthquakes. • We use the High Variance interestingness function i(c) [Rinsurongkawong 2008] to find such regions. *: the earthquake dataset is available on the website of the U.S. Geological Survey Earthquake Hazards Program, http://earthquake.usgs.gov/.

  18. 6. Experimental evaluation – 2 Fig. 6. Top 5 clusters of X_TheBestRun (ordered by reward). Fig. 7. Multi-run clustering results: clusters in M'.

  19. 6. Experimental evaluation – 3 • 70% of the clusters found by our system are new, high-quality clusters that do not exist in the best single run. • With an overlap threshold of 0.2, 43% of the positive-reward clusters of the best run are not in M'.

  20. 6. Experimental evaluation – 4 Fig. 8. The multi-run clustering result (in color) overlaid with the top 5 reward clusters of the best run (in black).

  21. 7. Conclusion – 1 • We propose an architecture and a concrete system for multi-run clustering that copes with parameter selection for a clustering algorithm and obtains alternative clusters in a highly automated fashion. • The system uses active learning to automate parameter selection, and various techniques to find both different and good clusters on the fly. • We propose the Dominance-guided Cluster Reduction algorithm, which post-processes the clusters from multiple runs to generate a final clustering by restricting cluster overlap.

  22. 7. Conclusion – 2 • The experimental results on the earthquake dataset support our claim that multi-run clustering outperforms single-run clustering with respect to clustering quality. • Multi-run clustering can discover additional novel, alternative, high-quality clusters and enhance the quality of the clusters found by single-run clustering.

  23. 7. Future work • Systematically evaluate the use of utility learning for choosing the parameters of a clustering algorithm. • Our ultimate goal is to combine multi-run and multi-objective clustering in one system.

  24. Thank you
