A Probabilistic Framework for Semi-Supervised Clustering Sugato Basu Mikhail Bilenko Raymond J. Mooney Department of Computer Sciences University of Texas at Austin Presented by Jingting Zeng
Outline • Introduction • Background • Algorithm • Experiments • Conclusion
What is Semi-Supervised Clustering? • Use human input to provide labels for some of the data • Improve existing naive clustering methods • Use labeled data to guide clustering of unlabeled data • End result is a better clustering of data
Motivation • Large amounts of unlabeled data exist • More is being produced all the time • Generating labels for data is expensive • It usually requires human intervention • We want to use limited amounts of labeled data to guide the clustering of the entire dataset and thereby obtain a better clustering
Semi-Supervised Clustering • Constraint-based: • Modify the objective function to include a penalty for violated constraints, or enforce constraints during the clustering or initialization process • Distance-based: • A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset
Method • Use both constraint-based and distance-based approaches in a unified framework • Use a Hidden Markov Random Field (HMRF) to model the constraints • Use constraints during initialization and when assigning points to clusters • Use an adaptive distance function to learn the distance measure • Cluster the data to minimize the resulting objective function
Main Points of Method • Improved initialization: • Initial clusters are formed based on constraints • Constraint-sensitive assignment of points to clusters • Points are assigned to clusters to minimize a distortion function while also minimizing the number of constraints violated • Iterative distance learning • The distortion measure is re-estimated after each iteration
Constraints • Pairwise constraints in the form of must-link and cannot-link labels • Set M of must-link constraints • Set C of cannot-link constraints • A list of associated costs for violating must-link or cannot-link requirements • Class labels do not have to be known; a user can still specify relationships between pairs of points
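As a concrete (hypothetical) illustration of how this supervision can be represented, the constraint sets can be kept as dictionaries mapping point-index pairs to their violation costs; the names and values below are placeholders, not the paper's notation:

```python
# Hypothetical representation of the supervision (indices and weights are made up):
# point-index pairs mapped to the cost of violating that constraint.
must_link = {(0, 3): 1.0, (2, 7): 1.0}     # set M with weights w_ij
cannot_link = {(0, 5): 1.0, (4, 7): 2.0}   # set C with weights w_bar_ij
```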
Posterior probability • This is an “incomplete-data problem” • Cluster representatives as well as cluster labels are unknown • A popular method for solving this type of problem is Expectation Maximization (EM) • K-Means is equivalent to an EM algorithm with hard clustering assignments
Must-Link Violation Cost Function • Ensures that the penalty for violating the must-link constraint between 2 points that are far apart is higher than between 2 points that are close • Punishes distance functions in which must-link points are far apart
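A schematic form of the must-link penalty (notation paraphrased, not the paper's exact formulation; w_ij is the violation cost and φ_D the current distortion measure):

$$ f_M(x_i, x_j) \;=\; w_{ij}\,\varphi_D(x_i, x_j)\;\mathbb{1}\!\left[l_i \neq l_j\right] $$

Because the penalty grows with φ_D, a must-link violation between distant points costs more than one between nearby points.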
Cannot-Link Violation Cost Function • Ensures that the penalty for violating cannot-link constraints between points that are nearby according to the current distance function is higher than between distant points • Punishes distance functions that place 2 cannot link points close together
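A schematic form of the cannot-link penalty, where φ_max denotes the maximum distance under the current measure (again a paraphrase of the paper's notation):

$$ f_C(x_i, x_j) \;=\; \bar{w}_{ij}\,\bigl(\varphi_{\max} - \varphi_D(x_i, x_j)\bigr)\;\mathbb{1}\!\left[l_i = l_j\right] $$

Placing two cannot-link points in the same cluster is penalized more the closer together they are.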
Objective Function • Goal: find the cluster assignment that minimizes the objective function • Supervised data are used in initialization • Constraints are used in cluster assignments • Distance learning takes place in the M-step
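Putting the pieces together, the objective has roughly the following schematic form (a paraphrase; the paper's formulation also carries a normalizing term from the HMRF posterior):

$$ J_{\mathrm{obj}} \;=\; \sum_{x_i \in X} \varphi_D(x_i, \mu_{l_i}) \;+\; \sum_{\substack{(x_i,x_j)\in M \\ l_i \neq l_j}} w_{ij}\,\varphi_D(x_i,x_j) \;+\; \sum_{\substack{(x_i,x_j)\in C \\ l_i = l_j}} \bar{w}_{ij}\bigl(\varphi_{\max} - \varphi_D(x_i,x_j)\bigr) $$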
Algorithm • EM framework • Initialization step: • Use constraints to guide initial cluster formation • E-step: • Minimizes the objective function over cluster assignments • M-step: • Minimizes the objective function over cluster representatives • Minimizes the objective function over the parameters of the distortion measure
Initialization • Form the transitive closure of the must-link constraints (see the sketch below) • This gives a set of connected components, each consisting of points connected by must-link constraints • Let y be the number of connected components (neighborhoods) • If y < K (the number of clusters), the y neighborhoods are used to create y initial clusters • The remaining clusters are initialized by random perturbations of the global centroid of the data
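A minimal union-find sketch (a hypothetical helper, not the authors' code) of forming the must-link neighborhoods as connected components of the must-link graph:

```python
# Group points into must-link neighborhoods via union-find.
def mustlink_neighborhoods(n_points, must_link_pairs):
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in must_link_pairs:
        parent[find(i)] = find(j)           # merge the two components

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())            # each list is one neighborhood
```

For instance, with must-link pairs (0, 3) and (3, 8), points 0, 3, and 8 end up in the same neighborhood.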
What If There Are More Neighborhoods Than Clusters? • If y > K (the number of clusters), K initial clusters are selected using the distance measure • Farthest-first traversal is a good heuristic • A weighted variant of farthest-first traversal is used • The distance between two centroids is multiplied by their corresponding weights • The weight of each centroid is proportional to the size of the corresponding neighborhood • Biased to select centroids that are relatively far apart and back reasonably large neighborhoods
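One possible reading of the weighted farthest-first traversal is sketched below; the exact weighting and starting rule are assumptions, not the paper's precise procedure:

```python
import numpy as np

# Pick K of the y neighborhood centroids, favoring centroids that are far apart
# and that back large neighborhoods. centroids: (y, d) array; weights: length-y.
def weighted_farthest_first(centroids, weights, k):
    chosen = [int(np.argmax(weights))]          # start from the largest neighborhood
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in range(len(centroids)):
            if i in chosen:
                continue
            # weighted distance to the closest already-chosen centroid
            score = min(np.linalg.norm(centroids[i] - centroids[j]) * weights[i] * weights[j]
                        for j in chosen)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```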
Initialization Continued • Assuming consistency of the data: • Augment set M with must-link constraints inferred from the transitive closure • For each pair of neighborhoods Np, Np’ that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Np’ (a sketch follows the two examples below) • Learn as much about the data from the constraints as possible
Augment Set M: with points a, b, c, the constraints must-link(a, b) and must-link(b, c) yield an inferred must-link(a, c)
Augment Set C: with points a, b, c, the constraints must-link(a, b) and cannot-link(b, c) yield an inferred cannot-link(a, c)
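A sketch of the cannot-link augmentation step under the consistency assumption (a hypothetical helper, not the authors' code): if any cannot-link exists between two neighborhoods, add cannot-links between every member of one and every member of the other.

```python
def augment_cannot_links(neighborhoods, cannot_link_pairs):
    where = {}                                   # point index -> neighborhood id
    for nid, members in enumerate(neighborhoods):
        for p in members:
            where[p] = nid
    augmented = set(cannot_link_pairs)
    for i, j in cannot_link_pairs:
        ni, nj = where.get(i), where.get(j)
        if ni is None or nj is None or ni == nj:
            continue
        # spread the cannot-link to every cross-neighborhood pair
        for a in neighborhoods[ni]:
            for b in neighborhoods[nj]:
                if a != b:
                    augmented.add((min(a, b), max(a, b)))
    return augmented
```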
E-step • Assigns data points to clusters • Since the model incorporates interactions between points, computing the assignment that minimizes the objective function is computationally intractable • Approximate inference options include iterated conditional modes (ICM), belief propagation, and linear programming relaxation • ICM: uses a greedy strategy to sequentially update the cluster assignment of each point while keeping the other points fixed
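A hedged sketch of one ICM sweep for the E-step; `dist` stands in for the current (possibly learned) distortion measure, and the penalty bookkeeping is simplified relative to the paper:

```python
import numpy as np

# One greedy pass: re-assign each point to the cluster minimizing its distortion
# plus the penalties of constraints it would violate, holding all others fixed.
def icm_sweep(X, centroids, labels, must_link, cannot_link, dist):
    for i in range(len(X)):
        costs = np.array([dist(X[i], mu) for mu in centroids])
        for (a, b), w in must_link.items():
            if i in (a, b):
                other = b if i == a else a
                # penalize every cluster choice that splits the must-link pair
                costs[np.arange(len(centroids)) != labels[other]] += w * dist(X[a], X[b])
        for (a, b), w in cannot_link.items():
            if i in (a, b):
                other = b if i == a else a
                # penalize joining a cannot-link pair (the paper scales this by
                # (phi_max - phi(x_a, x_b)); a constant cost is used here)
                costs[labels[other]] += w
        labels[i] = int(np.argmin(costs))
    return labels
```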
M-step • First, cluster representatives are re-estimated to decrease the objective function • Constraints do not factor into this step, so it is equivalent to the K-Means update • If a parameterized variant of the distance measure is used, it is updated here • The parameters are found from partial derivatives of the objective with respect to the distance parameters • This learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart
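A minimal sketch of the first M-step update: each representative becomes the mean of its assigned points, as in K-Means, since the constraint penalties do not depend on the centroids. The second update (re-estimating the parameterized distortion measure from the partial derivatives mentioned above) is omitted because it depends on the chosen measure.

```python
import numpy as np

# Re-estimate cluster representatives as the means of their assigned points.
def update_centroids(X, labels, k):
    # X: (n, d) array, labels: length-n integer array
    return np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
```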
Results • KMeans-I-C-D: • Complete HMRF-KMeans algorithm, with supervised data in initialization (I) and cluster assignment (C), and with distance learning (D) • KMeans-I-C: • HMRF-KMeans algorithm without distance learning • KMeans-I: • HMRF-KMeans algorithm without distance learning and without supervised cluster assignment
Conclusion • HMRF-KMeans performs well (compared to unsupervised K-Means) with a limited number of constraints • The goal of the algorithm was to provide a better clustering method using a limited number of constraints • HMRF-KMeans learns quickly from a limited number of constraints • It should be applicable to data sets where we want to limit the amount of human labeling, and where constraints can be specified as pairwise labels
Questions • Can all types of constraints be captured in pairwise associations? • What about hierarchical structure? • Could other types of labels be included in this model? • E.g., use class labels as well as pairwise constraints • How does this model handle noise in the data or labels? • E.g., point A has a must-link constraint to point B, point B has a must-link constraint to point C, but point A has a cannot-link constraint to point C
More Questions • How does this apply to other types of data? • The authors mention wanting to apply the method to other data types in the future, such as gene representations • Who provides the weights for constraint violations, and how are they determined? • The comparison is only with the unsupervised K-Means method • How does it compare with other semi-supervised clustering methods?
Reference • S. Basu, M. Bilenko, and R.J. Mooney, “A Probabilistic Framework for Semi-Supervised Clustering,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), Aug. 2004.