1 / 55

Semi-Supervised Clustering

Semi-Supervised Clustering. CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu. Outline of lecture. Overview of clustering and classification What is semi-supervised learning? Semi-supervised clustering Semi-supervised classification Semi-supervised clustering

vangundy
Download Presentation

Semi-Supervised Clustering

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Semi-Supervised Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu

  2. Outline of lecture • Overview of clustering and classification • What is semi-supervised learning? • Semi-supervised clustering • Semi-supervised classification • Semi-supervised clustering • What is semi-supervised clustering? • Why semi-supervised clustering? • Semi-supervised clustering algorithms

  3. Supervised classification versus unsupervised clustering • Unsupervised clustering • Group similar objects together to find clusters • Minimize intra-class distance • Maximize inter-class distance • Supervised classification • Class label for each training sample is given • Build a model from the training data • Predict class label on unseen future data points

  4. Inter-cluster distances are maximized Intra-cluster distances are minimized What is clustering? • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

  5. What is Classification?

  6. Clustering algorithms • K-Means • Hierarchical clustering • Graph based clustering (Spectral clustering)

  7. Classification algorithms • Decision Trees • Naïve Bayes classifier • Support Vector Machines (SVM) • K-Nearest-Neighbor classifiers • Logistic Regression • Neural Networks • Linear Discriminant Analysis (LDA)

  8. Supervised Classification Example . . . .

  9. Supervised Classification Example . . . . . . . . . . . . . . . . . . . .

  10. Supervised Classification Example . . . . . . . . . . . . . . . . . . . .

  11. Unsupervised Clustering Example . . . . . . . . . . . . . . . . . . . .

  12. Unsupervised Clustering Example . . . . . . . . . . . . . . . . . . . .

  13. Semi-supervised learning Supervised classification Unsupervised clustering Semi-Supervised Learning • Combines labeled and unlabeled data during training to improve performance: • Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. • Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of unlabeled data.

  14. Semi-Supervised Classification Example . . . . . . . . . . . . . . . . . . . .

  15. Semi-Supervised Classification Example . . . . . . . . . . . . . . . . . . . .

  16. Semi-Supervised Classification • Algorithms: • Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00]. • Co-training [Blum:COLT98]. • Transductive SVM’s [Vapnik:98,Joachims:ICML99]. • Graph based algorithms • Assumptions: • Known, fixed set of categories given in the labeled data. • Goal is to improve classification of examples into these known categories. • More discussions next week

  17. Semi-Supervised Clustering Example . . . . . . . . . . . . . . . . . . . .

  18. Semi-Supervised Clustering Example . . . . . . . . . . . . . . . . . . . .

  19. Second Semi-Supervised Clustering Example . . . . . . . . . . . . . . . . . . . .

  20. Second Semi-Supervised Clustering Example . . . . . . . . . . . . . . . . . . . .

  21. Semi-supervised clustering: problem definition • Input: • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical) • A small amount of domain knowledge • Output: • A partitioning of the objects into k clusters (possibly with some discarded as outliers) • Objective: • Maximum intra-cluster similarity • Minimum inter-cluster similarity • High consistency between the partitioning and the domain knowledge

  22. Why semi-supervised clustering? • Why not clustering? • The clusters produced may not be the ones required. • Sometimes there are multiple possible groupings. • Why not classification? • Sometimes there are insufficient labeled data. • Potential applications • Bioinformatics (gene and protein clustering) • Document hierarchy construction • News/email categorization • Image categorization

  23. This class Next class Semi-Supervised Clustering • Domain knowledge • Partial label information is given • Apply some constraints (must-links and cannot-links) • Approaches • Search-based Semi-Supervised Clustering • Alter the clustering algorithm using the constraints • Similarity-based Semi-Supervised Clustering • Alter the similarity measure based on the constraints • Combination of both

  24. Search-Based Semi-Supervised Clustering • Alter the clustering algorithm that searches for a good partitioning by: • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. • Use the labeled data to initialize clusters in an iterative refinement algorithm (kMeans,) [Basu:ICML02].

  25. Overview of K-Means Clustering • K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: • Cluster Assignment Step: Assign each data point x to the cluster Xl, such that L2 distance of x from (center of Xl) is minimum • Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster

  26. K-Means Objective Function • Locally minimizes sum of squared distance between the data points and their correspondingcluster centers: • Initialization of K cluster centers: • Totally random • Random perturbation from global mean • Heuristic to ensure well-separated centers

  27. K Means Example

  28. K Means ExampleRandomly Initialize Means x x

  29. K Means ExampleAssign Points to Clusters x x

  30. K Means ExampleRe-estimate Means x x

  31. K Means ExampleRe-assign Points to Clusters x x

  32. K Means ExampleRe-estimate Means x x

  33. K Means ExampleRe-assign Points to Clusters x x

  34. K Means ExampleRe-estimate Means and Converge x x

  35. Semi-Supervised K-Means • Partial label information is given • Seeded K-Means • Constrained K-Means • Constraints (Must-link, Cannot-link) • COP K-Means

  36. Semi-Supervised K-Means for partially labeled data • Seeded K-Means: • Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. • Seed points are only used for initialization, and not in subsequent steps. • Constrained K-Means: • Labeled data provided by user are used to initialize K-Means algorithm. • Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.

  37. Seeded K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points may change.

  38. Seeded K-Means Example

  39. Seeded K-Means ExampleInitialize Means Using Labeled Data x x

  40. Seeded K-Means ExampleAssign Points to Clusters x x

  41. Seeded K-Means ExampleRe-estimate Means x x

  42. Seeded K-Means ExampleAssign points to clusters and Converge x the label is changed x

  43. Constrained K-Means Use labeled data to find the initial centroids and then run K-Means. The labels for seeded points will not change.

  44. Constrained K-Means Example

  45. Constrained K-Means ExampleInitialize Means Using Labeled Data x x

  46. Constrained K-Means ExampleAssign Points to Clusters x x

  47. x x Constrained K-Means ExampleRe-estimate Means and Converge

  48. COP K-Means • COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. • Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that they cannot later be chosen as the center of another cluster). • Algorithm: During cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.

  49. COP K-Means Algorithm

  50. Illustration Determine its label Must-link x x Assign to the red class

More Related