A Probabilistic Framework for Semi-Supervised Clustering Sugato Basu Mikhail Bilenko Raymond J. Mooney Department of Computer Sciences University of Texas at Austin Presented by Jingting Zeng
Outline • Introduction • Background • Algorithm • Experiments • Conclusion
What is Semi-Supervised Clustering? • Use human input to provide labels for some of the data • Improve existing naive clustering methods • Use labeled data to guide clustering of unlabeled data • End result is a better clustering of data
Motivation • Large amounts of unlabeled data exist • More is being produced all the time • Generating labels for data is expensive • It usually requires human intervention • We want to use limited amounts of labeled data to guide the clustering of the entire dataset and thereby obtain a better clustering
Semi-Supervised Clustering • Constraint-based: • Modify the objective function to include a penalty for violated constraints, or enforce constraints during the clustering or initialization process • Distance-based: • A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset
Method • Use both constraint-based and distance-based approaches in a unified framework • Use a Hidden Markov Random Field (HMRF) to model the constraints • Use constraints during initialization and when assigning points to clusters • Use an adaptive distance function to learn the distance measure • Cluster the data to minimize the resulting objective function
Main Points of Method • Improved initialization: • Initial clusters are formed based on constraints • Constraint-sensitive assignment of points to clusters • Points are assigned to clusters to minimize a distortion function while also minimizing the number of constraints violated • Iterative distance learning • The distortion measure is re-estimated after each iteration
Constraints • Pairwise constraints in the form of must-link and cannot-link labels • Set M of must-link constraints • Set C of cannot-link constraints • A list of associated costs for violating must-link or cannot-link requirements • Class labels do not have to be known; a user can still specify relationships between pairs of points
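As a concrete (hypothetical) illustration of how this supervision can be represented, the constraint sets can be kept as dictionaries mapping point-index pairs to their violation costs; the names and values below are placeholders, not the paper's notation:

```python
# Hypothetical representation of the supervision (indices and weights are made up):
# point-index pairs mapped to the cost of violating that constraint.
must_link = {(0, 3): 1.0, (2, 7): 1.0}     # set M with weights w_ij
cannot_link = {(0, 5): 1.0, (4, 7): 2.0}   # set C with weights w_bar_ij
```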
Posterior probability • This is an “incomplete-data problem” • Cluster representatives as well as cluster labels are unknown • A popular method for solving this type of problem is Expectation Maximization (EM) • K-Means is equivalent to an EM algorithm with hard clustering assignments
Must-Link Violation Cost Function • Ensures that the penalty for violating the must-link constraint between 2 points that are far apart is higher than between 2 points that are close • Punishes distance functions in which must-link points are far apart
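A schematic form of the must-link penalty (notation paraphrased, not the paper's exact formulation; w_ij is the violation cost and φ_D the current distortion measure):

$$ f_M(x_i, x_j) \;=\; w_{ij}\,\varphi_D(x_i, x_j)\;\mathbb{1}\!\left[l_i \neq l_j\right] $$

Because the penalty grows with φ_D, a must-link violation between distant points costs more than one between nearby points.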
Cannot-Link Violation Cost Function • Ensures that the penalty for violating cannot-link constraints between points that are nearby according to the current distance function is higher than between distant points • Punishes distance functions that place 2 cannot link points close together
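A schematic form of the cannot-link penalty, where φ_max denotes the maximum distance under the current measure (again a paraphrase of the paper's notation):

$$ f_C(x_i, x_j) \;=\; \bar{w}_{ij}\,\bigl(\varphi_{\max} - \varphi_D(x_i, x_j)\bigr)\;\mathbb{1}\!\left[l_i = l_j\right] $$

Placing two cannot-link points in the same cluster is penalized more the closer together they are.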
Objective Function • Goal: find the cluster assignment that minimizes the objective function • Supervised data are used in initialization • Constraints are used in cluster assignments • Distance learning takes place in the M-step
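Putting the pieces together, the objective has roughly the following schematic form (a paraphrase; the paper's formulation also carries a normalizing term from the HMRF posterior):

$$ J_{\mathrm{obj}} \;=\; \sum_{x_i \in X} \varphi_D(x_i, \mu_{l_i}) \;+\; \sum_{\substack{(x_i,x_j)\in M \\ l_i \neq l_j}} w_{ij}\,\varphi_D(x_i,x_j) \;+\; \sum_{\substack{(x_i,x_j)\in C \\ l_i = l_j}} \bar{w}_{ij}\bigl(\varphi_{\max} - \varphi_D(x_i,x_j)\bigr) $$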
Algorithm • EM framework • Initialization step: • Use constraints to guide initial cluster formation • E-step: • Minimizes the objective function over cluster assignments • M-step: • Minimizes the objective function over cluster representatives • Minimizes the objective function over the parameters of the distortion measure
Initialization • Form the transitive closure of the must-link constraints (see the sketch below) • This gives a set of connected components, each consisting of points connected by must-link constraints • Let y be the number of connected components (neighborhoods) • If y < K (the number of clusters), the y neighborhoods are used to create y initial clusters • The remaining clusters are initialized by random perturbations of the global centroid of the data
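A minimal union-find sketch (a hypothetical helper, not the authors' code) of forming the must-link neighborhoods as connected components of the must-link graph:

```python
# Group points into must-link neighborhoods via union-find.
def mustlink_neighborhoods(n_points, must_link_pairs):
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in must_link_pairs:
        parent[find(i)] = find(j)           # merge the two components

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())            # each list is one neighborhood
```

For instance, with must-link pairs (0, 3) and (3, 8), points 0, 3, and 8 end up in the same neighborhood.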
What If There Are More Neighborhoods Than Clusters? • If y > K (the number of clusters), K initial clusters are selected using the distance measure • Farthest-first traversal is a good heuristic • A weighted variant of farthest-first traversal is used • The distance between two centroids is multiplied by their corresponding weights • The weight of each centroid is proportional to the size of the corresponding neighborhood • Biased to select centroids that are relatively far apart and back reasonably large neighborhoods
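One possible reading of the weighted farthest-first traversal is sketched below; the exact weighting and starting rule are assumptions, not the paper's precise procedure:

```python
import numpy as np

# Pick K of the y neighborhood centroids, favoring centroids that are far apart
# and that back large neighborhoods. centroids: (y, d) array; weights: length-y.
def weighted_farthest_first(centroids, weights, k):
    chosen = [int(np.argmax(weights))]          # start from the largest neighborhood
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in range(len(centroids)):
            if i in chosen:
                continue
            # weighted distance to the closest already-chosen centroid
            score = min(np.linalg.norm(centroids[i] - centroids[j]) * weights[i] * weights[j]
                        for j in chosen)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```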
Initialization Continued • Assuming consistency of the data: • Augment set M with must-link constraints inferred from the transitive closure • For each pair of neighborhoods Np, Np’ that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Np’ (a sketch follows the two examples below) • Learn as much about the data from the constraints as possible
Augment Set M: with points a, b, c, the constraints must-link(a, b) and must-link(b, c) yield an inferred must-link(a, c)
Augment Set C: with points a, b, c, the constraints must-link(a, b) and cannot-link(b, c) yield an inferred cannot-link(a, c)
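A sketch of the cannot-link augmentation step under the consistency assumption (a hypothetical helper, not the authors' code): if any cannot-link exists between two neighborhoods, add cannot-links between every member of one and every member of the other.

```python
def augment_cannot_links(neighborhoods, cannot_link_pairs):
    where = {}                                   # point index -> neighborhood id
    for nid, members in enumerate(neighborhoods):
        for p in members:
            where[p] = nid
    augmented = set(cannot_link_pairs)
    for i, j in cannot_link_pairs:
        ni, nj = where.get(i), where.get(j)
        if ni is None or nj is None or ni == nj:
            continue
        # spread the cannot-link to every cross-neighborhood pair
        for a in neighborhoods[ni]:
            for b in neighborhoods[nj]:
                if a != b:
                    augmented.add((min(a, b), max(a, b)))
    return augmented
```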
E-step • Assigns data points to clusters • Since the model incorporates interactions between points, computing the assignment that minimizes the objective function is computationally intractable • Approximate inference options include iterated conditional modes (ICM), belief propagation, and linear programming relaxation • ICM: uses a greedy strategy to sequentially update the cluster assignment of each point while keeping the other points fixed
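A hedged sketch of one ICM sweep for the E-step; `dist` stands in for the current (possibly learned) distortion measure, and the penalty bookkeeping is simplified relative to the paper:

```python
import numpy as np

# One greedy pass: re-assign each point to the cluster minimizing its distortion
# plus the penalties of constraints it would violate, holding all others fixed.
def icm_sweep(X, centroids, labels, must_link, cannot_link, dist):
    for i in range(len(X)):
        costs = np.array([dist(X[i], mu) for mu in centroids])
        for (a, b), w in must_link.items():
            if i in (a, b):
                other = b if i == a else a
                # penalize every cluster choice that splits the must-link pair
                costs[np.arange(len(centroids)) != labels[other]] += w * dist(X[a], X[b])
        for (a, b), w in cannot_link.items():
            if i in (a, b):
                other = b if i == a else a
                # penalize joining a cannot-link pair (the paper scales this by
                # (phi_max - phi(x_a, x_b)); a constant cost is used here)
                costs[labels[other]] += w
        labels[i] = int(np.argmin(costs))
    return labels
```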
M-step • First, cluster representatives are re-estimated to decrease the objective function • Constraints do not factor into this step, so it is equivalent to the K-Means update • If a parameterized variant of the distance measure is used, it is updated here • The parameters are found from partial derivatives of the objective with respect to the distance parameters • This learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart
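A minimal sketch of the first M-step update: each representative becomes the mean of its assigned points, as in K-Means, since the constraint penalties do not depend on the centroids. The second update (re-estimating the parameterized distortion measure from the partial derivatives mentioned above) is omitted because it depends on the chosen measure.

```python
import numpy as np

# Re-estimate cluster representatives as the means of their assigned points.
def update_centroids(X, labels, k):
    # X: (n, d) array, labels: length-n integer array
    return np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
```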
Results • KMeans-I-C-D: • Complete HMRF-KMeans algorithm, with supervised data in initialization (I) and cluster assignment (C), and with distance learning (D) • KMeans-I-C: • HMRF-KMeans algorithm without distance learning • KMeans-I: • HMRF-KMeans algorithm without distance learning and without supervised cluster assignment
Conclusion • HMRF-KMeans performs well (compared to unsupervised K-Means) with a limited number of constraints • The goal of the algorithm was to provide a better clustering method using a limited number of constraints • HMRF-KMeans learns quickly from a limited number of constraints • It should be applicable to data sets where we want to limit the amount of human labeling, and where constraints can be specified as pairwise labels
Questions • Can all types of constraints be captured in pairwise associations? • What about hierarchical structure? • Could other types of labels be included in this model? • E.g., use class labels as well as pairwise constraints • How does this model handle noise in the data or labels? • E.g., point A has a must-link constraint to point B, point B has a must-link constraint to point C, but point A has a cannot-link constraint to point C
More Questions • How does this apply to other types of data? • The authors mention wanting to apply the method to other data types in the future, such as gene representations • Who provides the weights for constraint violations, and how are they determined? • The comparison is only with the unsupervised K-Means method • How does it compare with other semi-supervised clustering methods?
Reference • S. Basu, M. Bilenko, and R.J. Mooney, “A Probabilistic Framework for Semi-Supervised Clustering,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 2004.