This paper introduces Correspondence Clustering, a method that analyzes related spatial datasets by clustering them together based on their correspondence. Traditional clustering algorithms, which cluster each dataset separately, are not well suited to such tasks. The paper defines the framework, discusses representative-based algorithms, and presents an experimental evaluation.
Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets
Vadeerat Rinsurongkawong and Christoph F. Eick
Department of Computer Science, University of Houston, USA

Organization
• Motivation
• Analyzing Related Datasets
• Correspondence Clustering: Definition and Frameworks
• Representative-based Correspondence Clustering Algorithms
• Assessing Agreement between Related Datasets
• Experimental Evaluation
• Conclusion and Future Work
1. Motivation
Clustering related datasets has many applications:
• Relating the habitats of animals and their sources of food
• Understanding changes in ozone concentrations due to industrial emissions of other pollutants
• Analyzing changes in water temperature
However, traditional clustering algorithms that cluster each dataset separately are not well suited to cluster related datasets:
• They do not consider correspondence between the datasets.
• The variance inherent in most clustering algorithms complicates analyzing related datasets.
2. Analyzing Related Datasets
Subtopics:
• Disparity analysis / emergent pattern discovery (“how do two groups differ with respect to their patterns?”)
• Change analysis (“what is new/different?”) in temporal datasets; e.g., [CKT06] and [CSZHT07] utilize a concept of temporal smoothness, which states that clustering results for data in two consecutive time frames should not differ dramatically.
• Relational clustering [BBM07] clusters different types of objects based on their properties and relationships.
• Co-clustering [DMM03] partitions the rows and columns of a data matrix simultaneously to create clusters for two sets of objects.
• Correspondence clustering centers on “mining good clusters in different datasets with interesting relationships between them”.
Clustering with Plug-in Fitness Functions
• Over the last five years, my research group has developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.
• This work is motivated by a mismatch between the evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.
• The presented paper generalizes this work to mine multiple datasets.
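To make the plug-in idea concrete, here is a minimal Python sketch of such a fitness function. It is illustrative only: the names depth_variance_fitness and clever are not from the paper, and a clustering is assumed to be a list of clusters of (longitude, latitude, depth) tuples.

    # Illustrative plug-in fitness function: the clustering algorithm only
    # sees a callable that scores a clustering; the domain-specific notion
    # of interestingness (here: variance of earthquake depth) is supplied
    # from outside.
    import statistics

    def depth_variance_fitness(clusters):
        """Score a clustering by the summed depth variance of its clusters."""
        score = 0.0
        for cluster in clusters:
            depths = [point[2] for point in cluster]
            if len(depths) > 1:
                score += statistics.variance(depths)
        return score

    # A representative-based algorithm would then be invoked as, e.g.,
    #   clustering, reps = clever(dataset, depth_variance_fitness)
    # without the algorithm knowing anything about earthquakes.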
3. Correspondence Clustering
Definition: A correspondence clustering algorithm clusters data in two or more datasets O = {O1, …, On} and generates clustering results X = {X1, …, Xn} such that, for 1 ≤ i ≤ n, Xi is created from Oi. The algorithm seeks clusterings Xi such that each Xi maximizes its interestingness i(Xi) with respect to Oi while also maximizing the correspondence measure Corr(X1, …, Xn) between itself and the other clusterings Xj, 1 ≤ j ≤ n, j ≠ i.
Example Correspondence Clustering
Example: Analyze changes with respect to regions of high variance of earthquake depth.
O1: earthquakes 1986-91; O2: earthquakes 1991-96
Find clusters X1 for O1 and X2 for O2 maximizing the following objective function:
    q̃(X1,X2) = α*(i(X1) + i(X2)) + (1-α)*Agreement(X1,X2)
where i(X) measures the interestingness of X based on the variance in earthquake depth within the clusters of X, and α determines the relative importance of per-dataset cluster quality versus agreement between the two clusterings.
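Written as code, the compound objective is a one-liner. The sketch below assumes interestingness and agreement are passed in as functions, a hypothetical decomposition chosen to mirror the formula above:

    def compound_fitness(X1, X2, interestingness, agreement, alpha):
        """q̃(X1,X2) = alpha*(i(X1)+i(X2)) + (1-alpha)*Agreement(X1,X2).
        alpha near 1 emphasizes per-dataset cluster quality; alpha near 0
        emphasizes agreement between the two clusterings."""
        quality = interestingness(X1) + interestingness(X2)
        return alpha * quality + (1.0 - alpha) * agreement(X1, X2)

The same weighted form reappears in C-CLEVER-I's compound fitness functions, with the single-dataset fitness q in place of i.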
What is unique about Correspondence Clustering?
• It relies on clustering algorithms that support plug-in fitness functions, allowing for non-distance-based notions of interestingness.
• It is geared towards analyzing spatial datasets; the spatial attributes serve as the glue that relates different spatial datasets.
• Correspondence clustering can be viewed as a multi-objective optimization problem in which we try to obtain good clusters in multiple datasets with a good fit with respect to a given correspondence relationship.
Algorithms for Correspondence Clustering
Two groups of algorithms can be distinguished:
• Iterative algorithms that improve the clustering of one dataset while keeping the clusters of the other datasets fixed.
• Concurrent algorithms that cluster all datasets in parallel.
In the following, an iterative representative-based correspondence clustering algorithm, C-CLEVER-I, will be briefly discussed.
Representative-based Clustering
[Figure: four example clusters, labeled 1-4, in Attribute1/Attribute2 space, each formed around a representative]
Objective: Find a set of objects OR such that the clustering X obtained by using the objects in OR as representatives minimizes q(X).
Characteristic: clusters are formed by assigning objects to the closest representative.
Popular algorithms: K-means, K-medoids, CLEVER, …
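The defining assignment step can be sketched in a few lines of Python (Euclidean distance; names illustrative); the later sketches reuse this helper:

    import math

    def assign_to_representatives(objects, representatives):
        """Form clusters by assigning each object to its closest representative."""
        clusters = [[] for _ in representatives]
        for obj in objects:
            distances = [math.dist(obj, rep) for rep in representatives]
            clusters[distances.index(min(distances))].append(obj)
        return clusters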
CLEVER [ACM-GIS’08]
• A representative-based clustering algorithm, similar to PAM.
• Searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination; uses randomized hill climbing and adaptive sampling to reduce complexity.
• New clusterings are generated in the neighborhood of the current solution by inserting, deleting, and replacing representatives.
• Searches for the optimal number of clusters.
C-CLEVER-I
Inputs: O1 and O2, TCond, k’, neighborhood-size, p, p’, α
Outputs: X1, X2, q(X1), q(X2), q̃(X1,X2), Corr(X1,X2)
Algorithm:
1. Run CLEVER on dataset O1 with fitness function q to obtain a clustering X1 and a set of representatives R1:
       (X1,R1) := Run CLEVER(O1, q);
2. Repeat until the termination condition TCond is met:
   a. Run CLEVER on dataset O2 with the compound fitness function q̃2, which uses the representatives R1 to calculate Corr(X1,X2):
       (X2,R2) := Run CLEVER(O2, R1, q̃2)
   b. Run CLEVER on dataset O1 with the compound fitness function q̃1, which uses the representatives R2 to calculate Corr(X1,X2):
       (X1,R1) := Run CLEVER(O1, R2, q̃1)
Outputs and fitness functions:
X1, X2 are clusterings of O1 and O2; q is a single-dataset fitness function.
    q̃(X1,X2) = α*(q(X1) + q(X2)) + (1-α)*Corr(X1,X2)
    q̃1(X1) = α*q(X1) + (1-α)*Corr(X1,X2)   (q̃ with X2 fixed)
    q̃2(X2) = α*q(X2) + (1-α)*Corr(X1,X2)   (q̃ with X1 fixed)
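The alternating structure of C-CLEVER-I reduces to a simple loop. The sketch below is an assumption-laden illustration, not the authors' implementation: clever(O, q) is assumed to return a (clustering, representatives) pair (a sketch of it accompanies the backup slide at the end), corr(X, R) is an assumed signature for the correspondence measure given fixed representatives R, and the termination condition TCond is simplified to a fixed iteration count.

    def c_clever_i(O1, O2, q, corr, alpha, max_iterations=10):
        """Iterative correspondence clustering: alternately recluster one
        dataset while the other dataset's representatives stay fixed."""
        # Step 1: cluster O1 on its own.
        X1, R1 = clever(O1, q)
        X2 = None
        # Step 2: alternate until the (simplified) termination condition.
        for _ in range(max_iterations):
            # Recluster O2 with compound fitness q~2; R1 stays fixed.
            X2, R2 = clever(O2, lambda X: alpha * q(X) + (1 - alpha) * corr(X, R1))
            # Recluster O1 with compound fitness q~1; R2 stays fixed.
            X1, R1 = clever(O1, lambda X: alpha * q(X) + (1 - alpha) * corr(X, R2))
        return X1, X2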
4. Assessing Agreement between Clusterings for Related Datasets
• We assume that the two datasets share the same spatial attributes. The challenge, however, is that we do not have object identity; otherwise, we could directly compare the co-occurrence matrices MX1 and MX2 of the two clusterings.
• Key idea: We can use the representatives of the clusters in one dataset to cluster the other dataset; then, for both datasets, we can compute the similarity of the original clustering with the clustering obtained using the other dataset's representatives. Finally, we assess agreement by averaging these two similarities:
    Agreement(X1,X2) = (Sim(X1,X1') + Sim(X2,X2')) / 2
where Xi' denotes the clustering of Oi obtained using the representatives of the other dataset's clustering.
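A possible realization of this idea, under the assumption that two clusterings of the same dataset are compared via pairwise co-occurrence (all helper names are illustrative):

    import math

    def nearest_rep_labels(objects, representatives):
        """Label each object with the index of its closest representative."""
        return [min(range(len(representatives)),
                    key=lambda r: math.dist(obj, representatives[r]))
                for obj in objects]

    def cooccurrence_similarity(labels_a, labels_b):
        """Fraction of object pairs on which two clusterings of the SAME
        dataset agree (co-clustered in both, or separated in both)."""
        n, agree, pairs = len(labels_a), 0, 0
        for i in range(n):
            for j in range(i + 1, n):
                agree += ((labels_a[i] == labels_a[j])
                          == (labels_b[i] == labels_b[j]))
                pairs += 1
        return agree / pairs if pairs else 1.0

    def agreement(O1, R1, O2, R2):
        """Average the similarities obtained by re-clustering each dataset
        with the other dataset's representatives."""
        sim1 = cooccurrence_similarity(nearest_rep_labels(O1, R1),
                                       nearest_rep_labels(O1, R2))
        sim2 = cooccurrence_similarity(nearest_rep_labels(O2, R2),
                                       nearest_rep_labels(O2, R1))
        return 0.5 * (sim1 + sim2)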
5. Experimental Evaluation
Example: Analyze changes with respect to regions of high variance of earthquake depth.
[Figures: earthquake datasets O1 and O2]
Find clusters X1 for O1 and X2 for O2 maximizing the following objective function:
    q̃(X1,X2) = α*(i(X1) + i(X2)) + (1-α)*Agreement(X1,X2)
What is done in the experimental evaluation?
• We compare running C-CLEVER-I with running CLEVER.
• We analyze the potential of using agreement as a correspondence function to reduce the variance of clustering results.
• We analyze different initialization strategies and parameter settings.
Comparing CLEVER and C-CLEVER-I
[Table 5-3: comparison of average results of CLEVER and C-CLEVER-I; table contents not reproduced here]
Different Initialization Strategies
The following initialization strategies for C-CLEVER-I have been explored:
(1) Use random initial representatives.
(2) Use the nearest neighbors of the final representatives of the last iteration for the other dataset.
(3) Use the final representatives from the previous iteration for the same dataset.
[Table: comparison between the different initialization strategies; table contents not reproduced here]
6. Conclusion
• A representative-based correspondence clustering framework has been introduced that relies on plug-in clustering and correspondence functions.
• Correspondence clustering algorithms that are generalizations of an algorithm called CLEVER were presented.
• Our experimental results suggest that correspondence clustering can reduce the variance inherent to representative-based clustering algorithms. Since the two datasets are related to each other, using one dataset to supervise the clustering of the other dataset can lead to more reliable clusterings.
• As a by-product, a novel agreement assessment method has been introduced to compare representative-based clusterings that originate from different datasets.
Future Work
• What about other correspondence measures besides agreement and disagreement? What about applications that look for other forms of correspondence between clusters originating from different spatial datasets?
• Several implementation strategies for concurrent correspondence clustering are possible:
  • Cluster each dataset for a few iterations and switch.
  • Find clusters for both datasets using some spatial traversal approach, creating clusters for subregions.
  • …
CLEVER
Inputs: dataset O, fitness function q, k’, neighborhood-size, p, p’
Outputs: clustering X, fitness q
Algorithm:
1. Create a current solution by randomly selecting k’ representatives from O.
2. Create p neighbors of the current solution randomly, using the given neighborhood definition.
3. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.
4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p’ more neighbors. If re-sampling does not lead to a better solution, terminate, returning the current solution; otherwise, go back to step 2, replacing the current solution with the best solution found by re-sampling.
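A compact Python sketch of this search loop, reusing assign_to_representatives from the representative-based clustering slide. Neighbor generation is simplified to replacing a single representative (the full algorithm also inserts and deletes representatives), so this illustrates the hill-climbing/re-sampling structure rather than being a faithful implementation:

    import random

    def clever(O, q, k_prime=10, p=20, p_prime=40):
        """Randomized hill climbing with re-sampling (steps 1-4 above)."""
        score = lambda reps: q(assign_to_representatives(O, reps))
        current = random.sample(list(O), k_prime)              # step 1

        def random_neighbor():
            candidate = list(current)
            pos = random.randrange(len(candidate))
            candidate[pos] = random.choice([o for o in O if o not in candidate])
            return candidate

        def best_of(count):                                    # sample neighbors
            return max((random_neighbor() for _ in range(count)), key=score)

        while True:
            best = best_of(p)                                  # step 2
            if score(best) > score(current):                   # step 3
                current = best
                continue
            best = best_of(p_prime)                            # step 4: re-sample
            if score(best) > score(current):
                current = best
            else:
                return assign_to_representatives(O, current), current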