Iterative Optimization and Simplification of Hierarchical Clusterings. Doug Fisher, Department of Computer Science, Vanderbilt University. Journal of Artificial Intelligence Research, 4 (1996) 147-179. Presented by: Biyu Liang.
Outline • Introduction • Generating Initial Hierarchical Clustering • Iterative Optimization Methods and Comparison • Simplification of Hierarchical Clustering • Conclusion
Introduction • Clustering is a process of unsupervised learning, which groups objects into clusters. • Major Clustering Methods • Partitioning • Hierarchical • Density-based • Grid-based • Model-based
Introduction (Continued) • Clustering systems differ in • objective function • control strategy • Usually a search strategy cannot be both computationally inexpensive and guarantee the quality of the clustering it finds.
Introduction (Continued) • This paper discusses the use of iterative optimization and simplification to construct clusterings that satisfy both conditions: • High quality • Computationally inexpensive • The suggested method involves 3 steps: • Constructing an initial clustering inexpensively • Iterative optimization to improve the clustering • Retrospective simplification of the clustering
Outline • Introduction • Generating Initial Hierarchical Clustering • Iterative Optimization Methods and Experiments • Simplification of Hierarchical Clustering • Conclusion
Category Utility • $CU(C_k) = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]$ • $PU(\{C_1, C_2, \ldots, C_N\}) = \sum_k CU(C_k) / N$, where an observation is a vector of values $V_{ij}$ along attributes (or variables) $A_i$ • This measure rewards clusters $C_k$ that increase the predictability of the values $V_{ij}$ within $C_k$ (i.e. $P(A_i = V_{ij} \mid C_k)$) relative to their predictability in the population as a whole (i.e. $P(A_i = V_{ij})$)
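The two formulas on this slide can be computed directly from value counts. The following is a minimal sketch, assuming nominal attributes and observations stored as equal-length tuples; the function names `value_probs`, `category_utility`, and `partition_utility` are illustrative and do not come from the paper.

```python
from collections import Counter


def value_probs(observations):
    """Return {(attribute_index, value): probability} for a list of tuples."""
    n = len(observations)
    counts = Counter((i, v) for obs in observations for i, v in enumerate(obs))
    return {key: c / n for key, c in counts.items()}


def category_utility(cluster, population):
    """CU(C_k) = P(C_k) * sum_ij [P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2]."""
    p_cluster = len(cluster) / len(population)
    # Values absent from the cluster have conditional probability 0,
    # so summing over the values that do occur is enough.
    within = sum(p * p for p in value_probs(cluster).values())
    overall = sum(p * p for p in value_probs(population).values())
    return p_cluster * (within - overall)


def partition_utility(clusters):
    """PU({C_1..C_N}) = sum_k CU(C_k) / N; the population is the union of clusters."""
    population = [obs for cluster in clusters for obs in cluster]
    total = sum(category_utility(cluster, population) for cluster in clusters)
    return total / len(clusters)


pure = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
mixed = [[("a", "x"), ("b", "y")], [("a", "x"), ("b", "y")]]
print(partition_utility(pure), partition_utility(mixed))
```

A partition whose clusters make every value fully predictable scores higher than one whose clusters mirror the population, which scores 0.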
Hierarchical Sorting • Given an observation and current partition, evaluate the quality of the clusterings that result from • Placing the observation in each of the existing clusters • Creating a new cluster that only covers the new observation • Select the option that yields the highest quality score (PU)
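The choice described on this slide can be sketched for a single level of the hierarchy. This is only an illustration under stated assumptions: nominal attributes, clusters stored as lists of tuples, and hypothetical names (`partition_utility`, `sort_observation`). The recursive descent into the chosen cluster's children, which builds the hierarchy, is left out.

```python
from collections import Counter


def partition_utility(clusters):
    """PU as defined on the Category Utility slide (clusters: lists of tuples)."""
    population = [obs for cluster in clusters for obs in cluster]

    def squared_probs(observations):
        counts = Counter((i, v) for obs in observations for i, v in enumerate(obs))
        return sum((c / len(observations)) ** 2 for c in counts.values())

    base = squared_probs(population)
    total = sum(
        len(cluster) / len(population) * (squared_probs(cluster) - base)
        for cluster in clusters
    )
    return total / len(clusters)


def sort_observation(clusters, obs):
    """Return the candidate partition with the highest PU after adding obs."""
    candidates = []
    # Option 1: place the observation in each existing cluster in turn.
    for k in range(len(clusters)):
        trial = [c + [obs] if j == k else list(c) for j, c in enumerate(clusters)]
        candidates.append(trial)
    # Option 2: create a new cluster that covers only the new observation.
    candidates.append([list(c) for c in clusters] + [[obs]])
    return max(candidates, key=partition_utility)


partition = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
print(sort_observation(partition, ("a", "x")))  # joins the first cluster
print(sort_observation(partition, ("c", "z")))  # starts a new cluster
```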
Outline • Introduction • Generating Initial Hierarchical Clustering • Iterative Optimization Methods and Comparison • Simplification of Hierarchical Clustering • Conclusion
Iterative Optimization Methods • Reorder-resort (Cluster/2): seed selection, reordering, and re-clustering. • Iterative redistribution of single observations: moving single observations one at a time. • Iterative hierarchical redistribution: moving a cluster together with its sub-tree.
Reorder-resort (k-means) • k random seeds are selected, and k clusters are grown around these attractors • The centroids of the clusters are picked as new seeds, and new clusters are grown around them • The process iterates until there is no further improvement in the quality of the generated clustering
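The seed-and-regrow loop on this slide is the familiar k-means pattern. Below is a minimal sketch for numeric data, assuming squared Euclidean distance and seeds supplied by the caller. As a simplification, it stops when the assignment no longer changes rather than when a quality score stops improving. The name `k_means` and its parameters are illustrative.

```python
def k_means(points, seeds, max_iter=100):
    """Cluster numeric tuples around seeds; return (assignment, centroids)."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Grow clusters: each point joins its nearest current seed.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        if new_assignment == assignment:
            break  # no change since the previous pass
        assignment = new_assignment
        # Pick the centroid of each cluster as its new seed.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old seed if a cluster ends up empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


points = [(0,), (1,), (10,), (11,)]
print(k_means(points, seeds=[(0,), (1,)]))
```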
Reorder-resort (k-means) (Continued) • Ordering the data so that consecutive observations are dissimilar leads to good clusterings • A biased “dissimilarity” ordering can be extracted from the hierarchical clustering • Procedure: initial sorting, extraction of a dissimilarity ordering, re-clustering
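The slide does not spell out how the ordering is extracted. One plausible way, shown here purely as an assumption, is to interleave the observations of sibling clusters round-robin, so that neighbours in the new ordering come from different clusters. The tree encoding (nested lists with observations at the leaves) and the name `dissimilarity_order` are hypothetical.

```python
from itertools import zip_longest


def dissimilarity_order(node):
    """Flatten a cluster tree so that consecutive observations tend to differ.

    A node is either a leaf observation (any non-list value) or a list of
    child nodes.
    """
    if not isinstance(node, list):
        return [node]
    child_orders = [dissimilarity_order(child) for child in node]
    # Assumption: visit the most populous sibling first.
    child_orders.sort(key=len, reverse=True)
    missing = object()  # sentinel that pads the shorter child lists
    ordered = []
    # Take one observation from each sibling in turn.
    for group in zip_longest(*child_orders, fillvalue=missing):
        ordered.extend(x for x in group if x is not missing)
    return ordered


tree = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
print(dissimilarity_order(tree))
```

Re-sorting the data in this order then produces the new clustering.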
Iterative Redistribution of Single Observations • Moves single observations from cluster to cluster • A cluster that contains only one observation is removed, and its single observation is resorted • Iterates until two consecutive iterations yield the same clustering
Single Observation Redistribution Variations • The ISODATA algorithm determines a target cluster for each observation but does not move any observation until targets for all observations have been determined • A sequential version moves each observation as soon as its target is identified through sorting
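The sequential variation can be sketched as follows, assuming nominal attributes, integer cluster labels, and PU as the objective. An observation moves only when the move strictly improves PU, and `max_passes` is an added safety bound. All names are illustrative.

```python
from collections import Counter


def partition_utility(clusters):
    """PU as defined on the Category Utility slide (clusters: lists of tuples)."""
    population = [obs for cluster in clusters for obs in cluster]

    def squared_probs(observations):
        counts = Counter((i, v) for obs in observations for i, v in enumerate(obs))
        return sum((c / len(observations)) ** 2 for c in counts.values())

    base = squared_probs(population)
    total = sum(
        len(c) / len(population) * (squared_probs(c) - base) for c in clusters
    )
    return total / len(clusters)


def group(observations, labels):
    """Turn a label list into a list of clusters; empty clusters vanish."""
    clusters = {}
    for obs, label in zip(observations, labels):
        clusters.setdefault(label, []).append(obs)
    return list(clusters.values())


def redistribute(observations, labels, max_passes=20):
    """Move observations one at a time until a full pass changes nothing."""
    labels = list(labels)
    for _ in range(max_passes):
        changed = False
        for i in range(len(observations)):
            best = labels[i]
            best_score = partition_utility(group(observations, labels))
            # Candidate targets: every existing cluster plus a brand-new one.
            for candidate in sorted(set(labels)) + [max(labels) + 1]:
                trial = labels[:i] + [candidate] + labels[i + 1:]
                score = partition_utility(group(observations, trial))
                if score > best_score + 1e-12:
                    best, best_score = candidate, score
            if best != labels[i]:
                labels[i] = best  # sequential variant: move immediately
                changed = True
        if not changed:
            break
    return labels


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("a", "x")]
print(redistribute(data, [0, 0, 1, 1, 1]))
```

An ISODATA-style variant would instead record the best target for every observation first and apply all the moves together at the end of the pass.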
Iterative Hierarchical Redistribution • Takes large steps in the search for a better clustering • Removes and resorts a sub-tree instead of a single observation • Requires updating the variable-value counts of the ancestor clusters and the host cluster
Scheme • Given an existing hierarchical clustering, a recursive loop examines sibling clusters in the hierarchy in a depth-first fashion. • An inner, iterative loop reclassifies each sibling based on the objective function, and repeats until two consecutive iterations lead to the same set of siblings.
Scheme (Continued) • The recursive loop then turns its attention to the children of each of the remaining siblings. • Eventually the leaves are reached and resorted. • The recursive loop is applied repeatedly until no changes occur from one pass to the next.
Main findings from the experiments • Hierarchical redistribution achieves the highest mean PU scores in most cases • Reordering and re-clustering comes closest to hierarchical redistribution’s performance in all cases • Single-observation redistribution modestly improves an initial sort, and is substantially worse than the other two optimization methods
Outline • Introduction • Generating Initial Hierarchical Clustering • Iterative Optimization Methods and Comparison • Simplification of Hierarchical Clustering • Conclusion
Simplifying Hierarchical Clustering • Simplify the hierarchical clustering and minimize classification cost • Minimize the error rate • A validation set is used to identify the frontier of clusters for prediction of each variable • A node that lies below the frontier of every variable is pruned
Validation • For each variable, Ai, the objects from the validation set are each classified through the hierarchical clustering with the value of variable Ai “masked” for purposes of classification. • At each cluster encountered during classification, a prediction is counted as correct if the observation’s value for Ai is equal to the most frequent value for Ai at the cluster. • A count of all correct predictions for each variable at a cluster is maintained. • A preferred frontier for each variable is identified that maximizes the number of correct predictions for the variable.
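Once the correct-prediction counts have been collected, the preferred frontier for one variable can be found bottom-up. This is a minimal sketch, assuming each node already stores its count for that variable. The dictionary layout and the name `preferred_frontier` are hypothetical, and the classification pass that produces the counts is left out.

```python
def preferred_frontier(node):
    """Return (best_correct_count, frontier_node_names) for one variable.

    node is a dict: {"name": str, "correct": int, "children": [node, ...]}.
    """
    if not node["children"]:
        return node["correct"], [node["name"]]
    child_total, child_frontier = 0, []
    for child in node["children"]:
        count, frontier = preferred_frontier(child)
        child_total += count
        child_frontier.extend(frontier)
    # Keep the node itself when predicting here is at least as accurate as
    # predicting from the best frontier below it; ties favour the shallower
    # node, so that more of the tree can be pruned.
    if node["correct"] >= child_total:
        return node["correct"], [node["name"]]
    return child_total, child_frontier


def leaf(name, correct):
    return {"name": name, "correct": correct, "children": []}


tree = {
    "name": "root", "correct": 6, "children": [
        {"name": "A", "correct": 5, "children": [leaf("A1", 2), leaf("A2", 2)]},
        {"name": "B", "correct": 3, "children": [leaf("B1", 2), leaf("B2", 2)]},
    ],
}
print(preferred_frontier(tree))
```

Running this once per variable gives one frontier per variable. A node is then a pruning candidate only if it lies below all of them.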
Concluding Remarks • There are three phases in searching the space of hierarchical clusterings: • Inexpensive generation of an initial clustering • Iterative optimization of the clustering • Retrospective simplification of the generated clustering • Experiments found that the new method, hierarchical redistribution, works well
Final Exam Questions • The main idea in this paper is to construct clusterings which satisfy two conditions. • Name the conditions (p.5) • Name the two steps to satisfy the conditions • Describe the three iterative methods for clustering optimization (p.12-20) • The cluster is better when the relative CU score is a) big, b) small, c) equal to 0 (p.7) • Which sorting method is better? a) random sorting, b) similarity sorting (p.14)