Semi-Supervised Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu
Outline of lecture • Overview of clustering and classification • What is semi-supervised learning? • Semi-supervised clustering • Semi-supervised classification • Semi-supervised clustering • What is semi-supervised clustering? • Why semi-supervised clustering? • Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering • Unsupervised clustering • Group similar objects together to find clusters • Minimize intra-cluster distance • Maximize inter-cluster distance • Supervised classification • Class label for each training sample is given • Build a model from the training data • Predict class labels on unseen future data points
What is clustering? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized
Clustering algorithms • K-Means • Hierarchical clustering • Graph based clustering (Spectral clustering)
Classification algorithms • Decision Trees • Naïve Bayes classifier • Support Vector Machines (SVM) • K-Nearest-Neighbor classifiers • Logistic Regression • Neural Networks • Linear Discriminant Analysis (LDA)
Supervised Classification Example [figure: labeled training points]
Unsupervised Clustering Example [figure: unlabeled data points]
Semi-Supervised Learning [diagram: spectrum from supervised classification to unsupervised clustering, with semi-supervised learning between them] • Combines labeled and unlabeled data during training to improve performance: • Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. • Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
Semi-Supervised Classification Example [figure: a few labeled points among many unlabeled points]
Semi-Supervised Classification • Algorithms: • Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00] • Co-training [Blum:COLT98] • Transductive SVMs [Vapnik:98, Joachims:ICML99] • Graph-based algorithms • Assumptions: • Known, fixed set of categories given in the labeled data. • Goal is to improve classification of examples into these known categories. • More discussion next week
Semi-Supervised Clustering Example [figure: clustering guided by a few labeled points]
Second Semi-Supervised Clustering Example [figure]
Semi-supervised clustering: problem definition • Input: • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical) • A small amount of domain knowledge • Output: • A partitioning of the objects into k clusters (possibly with some discarded as outliers) • Objective: • Maximum intra-cluster similarity • Minimum inter-cluster similarity • High consistency between the partitioning and the domain knowledge
Why semi-supervised clustering? • Why not clustering alone? • The clusters produced may not be the ones required. • Sometimes there are multiple possible groupings. • Why not classification? • Sometimes there is insufficient labeled data. • Potential applications • Bioinformatics (gene and protein clustering) • Document hierarchy construction • News/email categorization • Image categorization
Semi-Supervised Clustering • Domain knowledge • Partial label information is given • Apply some constraints (must-links and cannot-links) • Approaches • Search-based semi-supervised clustering (this class): alter the clustering algorithm using the constraints • Similarity-based semi-supervised clustering (next class): alter the similarity measure based on the constraints • Combination of both
Search-Based Semi-Supervised Clustering • Alter the clustering algorithm that searches for a good partitioning by: • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. • Using the labeled data to initialize clusters in an iterative refinement algorithm (e.g., K-Means) [Basu:ICML02].
Overview of K-Means Clustering • K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: • Cluster Assignment Step: Assign each data point x to the cluster X_l whose center μ_l is nearest to x in L2 distance • Center Re-estimation Step: Re-estimate each cluster center μ_l as the mean of the points currently in X_l
K-Means Objective Function • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = Σ_{l=1..K} Σ_{x ∈ X_l} ||x − μ_l||² • Initialization of K cluster centers: • Totally random • Random perturbation from the global mean • Heuristic to ensure well-separated centers
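A minimal NumPy sketch of the loop described on the last two slides; the function and variable names are illustrative, and edge cases such as empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: random initialization, then iterative relocation."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by sampling K distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: each point joins the cluster whose
        # center is nearest in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its
        # points (empty clusters would produce NaN in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers
```

Each iteration can only decrease the objective J, which is why the algorithm converges, but only to a local minimum that depends on the initialization.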
Semi-Supervised K-Means • Partial label information is given • Seeded K-Means • Constrained K-Means • Constraints (Must-link, Cannot-link) • COP K-Means
Semi-Supervised K-Means for partially labeled data • Seeded K-Means: • Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i. • Seed points are used only for initialization, not in subsequent steps. • Constrained K-Means: • Labeled data provided by the user are used to initialize the K-Means algorithm. • Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of non-seed data are re-estimated.
Seeded K-Means: use the labeled data to find the initial centroids, then run K-Means. The labels of seed points may change.
Seeded K-Means Example: Initialize means using labeled data [figure]
Seeded K-Means Example: Assign points to clusters and converge [figure: the label of one seed point is changed]
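Reusing the `kmeans` sketch above, seeding changes only the initialization; `seed_idx` (indices of the labeled points) and `seed_labels` (their labels, 0..k−1) are assumed, illustrative names:

```python
def seeded_kmeans(X, k, seed_idx, seed_labels, max_iter=100):
    """Seeded K-Means: seeds fix the initial centers, then plain K-Means runs."""
    seed_idx, seed_labels = np.asarray(seed_idx), np.asarray(seed_labels)
    # Initial center for cluster i = mean of the seed points labeled i.
    centers = np.array([X[seed_idx[seed_labels == i]].mean(axis=0)
                        for i in range(k)])
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # seed points may switch clusters here
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```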
Constrained K-Means: use the labeled data to find the initial centroids, then run K-Means. The labels of seed points will not change.
Constrained K-Means Example: Initialize means using labeled data [figure]
Constrained K-Means Example: Re-estimate means and converge [figure: seed labels unchanged]
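Under the same assumed setup, the constrained variant differs from the seeded one only in the assignment step, which clamps the seed labels:

```python
def constrained_assign(dists, seed_idx, seed_labels):
    """Assignment step for Constrained K-Means."""
    labels = dists.argmin(axis=1)   # non-seed points: nearest center, as usual
    labels[seed_idx] = seed_labels  # seed points: cluster labels kept unchanged
    return labels
```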
COP K-Means • COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points. • Initialization: Cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so the linked points cannot later be chosen as the center of another cluster). • Algorithm: During the cluster assignment step, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort. A sketch of this assignment step follows.
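A sketch of the constraint-checked assignment step, under the same NumPy setup as before (`must` and `cannot` are assumed to be lists of index pairs; `labels[j] == -1` marks a point not yet assigned in the current pass; a full implementation would also take the transitive closure of the must-links):

```python
def violates(i, c, labels, must, cannot):
    """True if placing point i in cluster c breaks a constraint
    involving an already-assigned point."""
    for a, b in must:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] not in (-1, c):  # must-link partner sits elsewhere
                return True
    for a, b in cannot:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == c:            # cannot-link partner sits here
                return True
    return False

def cop_assign(i, dists_i, labels, must, cannot):
    """Assign point i to its nearest cluster that violates no constraint;
    abort if every cluster is infeasible."""
    for c in np.argsort(dists_i):             # clusters, nearest first
        if not violates(i, c, labels, must, cannot):
            return c
    raise RuntimeError("COP-K-Means: no valid assignment exists, abort")
```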
Illustration [figure: determining the label of a point that has a must-link to a point in the red class; the point is assigned to the red class]