Semi-Supervised Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu
Outline of lecture • Overview of clustering and classification • What is semi-supervised learning? • Semi-supervised clustering • Semi-supervised classification • Semi-supervised clustering • What is semi-supervised clustering? • Why semi-supervised clustering? • Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering • Unsupervised clustering • Group similar objects together to find clusters • Minimize intra-cluster distance • Maximize inter-cluster distance • Supervised classification • Class label for each training sample is given • Build a model from the training data • Predict class labels on unseen future data points
What is clustering? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized
Clustering algorithms • K-Means • Hierarchical clustering • Graph based clustering (Spectral clustering)
Classification algorithms • Decision Trees • Naïve Bayes classifier • Support Vector Machines (SVM) • K-Nearest-Neighbor classifiers • Logistic Regression • Neural Networks • Linear Discriminant Analysis (LDA)
Supervised Classification Example [figure: labeled training points]
Unsupervised Clustering Example [figure: unlabeled data points]
Semi-Supervised Learning [diagram: spectrum from supervised classification to unsupervised clustering, with semi-supervised learning between them] • Combines labeled and unlabeled data during training to improve performance: • Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. • Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
Semi-Supervised Classification Example [figure: a few labeled points among many unlabeled points]
Semi-Supervised Classification • Algorithms: • Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00] • Co-training [Blum:COLT98] • Transductive SVMs [Vapnik:98, Joachims:ICML99] • Graph-based algorithms • Assumptions: • Known, fixed set of categories given in the labeled data. • Goal is to improve classification of examples into these known categories. • More discussion next week
Semi-Supervised Clustering Example [figure: clustering guided by a few labeled points]
Second Semi-Supervised Clustering Example [figure]
Semi-supervised clustering: problem definition • Input: • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical) • A small amount of domain knowledge • Output: • A partitioning of the objects into k clusters (possibly with some discarded as outliers) • Objective: • Maximum intra-cluster similarity • Minimum inter-cluster similarity • High consistency between the partitioning and the domain knowledge
Why semi-supervised clustering? • Why not clustering alone? • The clusters produced may not be the ones required. • Sometimes there are multiple possible groupings. • Why not classification? • Sometimes there is insufficient labeled data. • Potential applications • Bioinformatics (gene and protein clustering) • Document hierarchy construction • News/email categorization • Image categorization
Semi-Supervised Clustering • Domain knowledge • Partial label information is given • Apply some constraints (must-links and cannot-links) • Approaches • Search-based semi-supervised clustering (this class): alter the clustering algorithm using the constraints • Similarity-based semi-supervised clustering (next class): alter the similarity measure based on the constraints • Combination of both
Search-Based Semi-Supervised Clustering • Alter the clustering algorithm that searches for a good partitioning by: • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. • Using the labeled data to initialize clusters in an iterative refinement algorithm (e.g., K-Means) [Basu:ICML02].
Overview of K-Means Clustering • K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: • Cluster Assignment Step: Assign each data point x to the cluster X_l whose center μ_l is nearest to x in L2 distance • Center Re-estimation Step: Re-estimate each cluster center μ_l as the mean of the points currently in X_l
K-Means Objective Function • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = Σ_{l=1..K} Σ_{x ∈ X_l} ||x − μ_l||² • Initialization of K cluster centers: • Totally random • Random perturbation from the global mean • Heuristic to ensure well-separated centers
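A minimal NumPy sketch of the loop described on the last two slides; the function and variable names are illustrative, and edge cases such as empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means: random initialization, then iterative relocation."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by sampling K distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: each point joins the cluster whose
        # center is nearest in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its
        # points (empty clusters would produce NaN in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers
```

Each iteration can only decrease the objective J, which is why the algorithm converges, but only to a local minimum that depends on the initialization.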
Semi-Supervised K-Means • Partial label information is given • Seeded K-Means • Constrained K-Means • Constraints (Must-link, Cannot-link) • COP K-Means
Semi-Supervised K-Means for partially labeled data • Seeded K-Means: • Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i. • Seed points are used only for initialization, not in subsequent steps. • Constrained K-Means: • Labeled data provided by the user are used to initialize the K-Means algorithm. • Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of non-seed data are re-estimated.
Seeded K-Means: use the labeled data to find the initial centroids, then run K-Means. The labels of seed points may change.
Seeded K-Means Example: Initialize means using labeled data [figure]
Seeded K-Means Example: Assign points to clusters and converge [figure: the label of one seed point is changed]
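Reusing the `kmeans` sketch above, seeding changes only the initialization; `seed_idx` (indices of the labeled points) and `seed_labels` (their labels, 0..k−1) are assumed, illustrative names:

```python
def seeded_kmeans(X, k, seed_idx, seed_labels, max_iter=100):
    """Seeded K-Means: seeds fix the initial centers, then plain K-Means runs."""
    seed_idx, seed_labels = np.asarray(seed_idx), np.asarray(seed_labels)
    # Initial center for cluster i = mean of the seed points labeled i.
    centers = np.array([X[seed_idx[seed_labels == i]].mean(axis=0)
                        for i in range(k)])
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # seed points may switch clusters here
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```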
Constrained K-Means: use the labeled data to find the initial centroids, then run K-Means. The labels of seed points will not change.
Constrained K-Means Example: Initialize means using labeled data [figure]
Constrained K-Means Example: Re-estimate means and converge [figure: seed labels unchanged]
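Under the same assumed setup, the constrained variant differs from the seeded one only in the assignment step, which clamps the seed labels:

```python
def constrained_assign(dists, seed_idx, seed_labels):
    """Assignment step for Constrained K-Means."""
    labels = dists.argmin(axis=1)   # non-seed points: nearest center, as usual
    labels[seed_idx] = seed_labels  # seed points: cluster labels kept unchanged
    return labels
```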
COP K-Means • COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points. • Initialization: Cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so the linked points cannot later be chosen as the center of another cluster). • Algorithm: During the cluster assignment step, a point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort. A sketch of this assignment step follows.
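A sketch of the constraint-checked assignment step, under the same NumPy setup as before (`must` and `cannot` are assumed to be lists of index pairs; `labels[j] == -1` marks a point not yet assigned in the current pass; a full implementation would also take the transitive closure of the must-links):

```python
def violates(i, c, labels, must, cannot):
    """True if placing point i in cluster c breaks a constraint
    involving an already-assigned point."""
    for a, b in must:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] not in (-1, c):  # must-link partner sits elsewhere
                return True
    for a, b in cannot:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == c:            # cannot-link partner sits here
                return True
    return False

def cop_assign(i, dists_i, labels, must, cannot):
    """Assign point i to its nearest cluster that violates no constraint;
    abort if every cluster is infeasible."""
    for c in np.argsort(dists_i):             # clusters, nearest first
        if not violates(i, c, labels, must, cannot):
            return c
    raise RuntimeError("COP-K-Means: no valid assignment exists, abort")
```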
Illustration [figure: determining the label of a point that has a must-link to a point in the red class; the point is assigned to the red class]