
Clustering

Clustering. Rong Jin. What is clustering? Identify the underlying structure of given data points. Document clustering groups documents on the same topic into the same cluster. Improve IR by document clustering: cluster-based retrieval.


Presentation Transcript


  1. Clustering Rong Jin

  2. What is Clustering? • Identify the underlying structure of given data points • Doc. clustering: group documents on the same topic into the same cluster [Figure: scatter plot with axes "$$$" and "age"]

  3. Improve IR by Document Clustering • Cluster-based retrieval [Figure label: query]

  4. Improve IR by Document Clustering • Cluster-based retrieval • Cluster docs in the collection a priori • Only compute the relevance scores for docs in the cluster closest to the query • Improve retrieval efficiency by searching only a small portion of the document collection
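The retrieval scheme on slide 4 can be sketched as follows. This is a minimal illustration, not the slides' own implementation: it assumes documents and queries are sparse term-weight dictionaries, that "closest" means highest cosine similarity to the cluster centroid, and that every cluster is non-empty. The names `cosine`, `centroid`, and `cluster_based_search` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise average of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def cluster_based_search(query, docs, clusters):
    """docs: {doc_id: vector}; clusters: lists of doc_ids, built a priori.

    Centroids are recomputed here for brevity; in practice they would be
    computed once, when the collection is clustered.
    """
    centroids = [centroid([docs[d] for d in c]) for c in clusters]
    # Pick the cluster whose centroid is closest to the query.
    best = max(range(len(clusters)), key=lambda j: cosine(query, centroids[j]))
    # Score only the documents inside that cluster.
    scored = [(cosine(query, docs[d]), d) for d in clusters[best]]
    return sorted(scored, reverse=True)
```

Only the documents of one cluster are scored against the query, which is where the efficiency gain comes from. The price is that relevant documents in other clusters are never seen.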

  5. Application (I): Search Result Clustering

  6. Application (II): Navigation

  7. Application (III): Google News

  8. Application (IV): Visualization • Islands of Music (Pampalk et al., KDD ’03)

  9. How to Find Good Clusters? [Figure: data points x1 to x7]

  10. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters [Figure: data points x1 to x7]

  11. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters [Figure: data points x1 to x7]

  12. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters [Figure: data points x1 to x7 with cluster centers C1 and C2]

  13. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters [Figure: data points x1 to x7 with cluster centers C1 and C2]

  14. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters • Membership indicators: mi,j = 1 if xi is assigned to Cj, and 0 otherwise [Figure: data points x1 to x7 with cluster centers C1 and C2]

  15. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters • Membership indicators: mi,j = 1 if xi is assigned to Cj, and 0 otherwise [Figure: data points x1 to x7 with cluster centers C1 and C2]

  16. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters [Figure: data points x1 to x7 with cluster centers C1 and C2]

  17. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters • Find good clusters by minimizing the compactness measure • Cluster centers C1 and C2 • Membership mi,j [Figure: data points x1 to x7 with cluster centers C1 and C2]

  18. How to Find Good Clusters? • Measure the compactness by the sum of squared distances within clusters • Find good clusters by minimizing the compactness measure • Cluster centers C1 and C2 • Membership mi,j [Figure: data points x1 to x7 with cluster centers C1 and C2]

  19. How to Find Good Clusters? • Find good clusters by minimizing the compactness measure • Cluster centers C1 and C2 • Membership mi,j [Figure: data points x1 to x7 with cluster centers C1 and C2]
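The formula for the compactness measure did not survive in this transcript. A standard way to write it, assuming the usual k-means objective with n data points and K clusters (the symbols n, K, and J are notation introduced here, not taken from the slides), is:

```latex
J(m, C) = \sum_{i=1}^{n} \sum_{j=1}^{K} m_{i,j} \, \lVert x_i - C_j \rVert^2,
\qquad m_{i,j} \in \{0, 1\}, \qquad \sum_{j=1}^{K} m_{i,j} = 1 \ \text{for every } i .
```

Under this reading, finding good clusters means choosing the centers C_j and the memberships m_{i,j} so that J is as small as possible. The constraint on m_{i,j} says that each point belongs to exactly one cluster.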

  20. How to Efficiently Cluster Data? • Update mi,j: assign xi to the closest Cj

  21. How to Efficiently Cluster Data? • Update mi,j: assign xi to the closest Cj • Update Cj as the average of the xi assigned to Cj

  22. How to Efficiently Cluster Data? • Update mi,j: assign xi to the closest Cj • Update Cj as the average of the xi assigned to Cj • Alternating these two updates is the K-means algorithm
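The two alternating updates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: points are tuples of floats, distance is Euclidean, the initial centers are passed in by the caller, and a cluster that ends up empty simply keeps its old center. The function names are hypothetical.

```python
import math

def closest_center(point, centers):
    """Index j of the center C_j nearest to the point (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def kmeans(points, centers, max_iter=100):
    """Alternate the two updates until the memberships stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Update m_{i,j}: assign each x_i to the closest C_j.
        new_labels = [closest_center(p, centers) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update C_j as the average of the x_i assigned to C_j.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels
```

A call such as `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], [(0, 0), (10, 10)])` separates the two groups of three points. Here `labels[i]` plays the role of the membership indicators: `labels[i] == j` corresponds to mi,j = 1.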

  23. Example of k-means • Start with random cluster centers C1 and C2 [Figure: data points x1 to x7 with cluster centers C1 and C2]

  24. Example of k-means • Identify the points that are closer to C1 than to C2 [Figure: data points x1 to x7 with cluster centers C1 and C2]

  25. Example of k-means • Update C1 [Figure: data points x1 to x7 with cluster centers C1 and C2]

  26. Example of k-means • Identify the points that are closer to C2 than to C1 [Figure: data points x1 to x7 with cluster centers C1 and C2]

  27. Example of k-means • Identify the points that are closer to C2 than to C1 [Figure: data points x1 to x7 with cluster centers C1 and C2]

  28. Example of k-means • Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2 [Figure: data points x1 to x7 with cluster centers C1 and C2]

  29. Example of k-means • Identify the points that are closer to C2 than to C1, and the points that are closer to C1 than to C2 • Update C1 and C2 [Figure: data points x1 to x7 with cluster centers C1 and C2]
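The walk-through above can be replayed numerically. The coordinates below are hypothetical, since the slides show x1 to x7 only as a picture, and the starting centers are a deliberately poor guess. The script records the compactness (sum of squared distances to the assigned center) at each iteration, to illustrate that the alternating updates never increase it.

```python
import math

# Hypothetical coordinates for x1..x7 (not taken from the slides).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centers = [(1.0, 1.0), (1.5, 2.0)]  # hypothetical starting guess for C1, C2

def assign(points, centers):
    """Index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def average(members):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def compactness(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

history = []
for step in range(10):
    labels = assign(points, centers)
    history.append(compactness(points, centers, labels))
    new_centers = []
    for j in range(len(centers)):
        members = [p for p, l in zip(points, labels) if l == j]
        new_centers.append(average(members) if members else centers[j])
    if new_centers == centers:  # centers stopped moving
        break
    centers = new_centers

print(labels)
print(history)
```

With these particular coordinates the run settles on {x1, x2} and {x3, ..., x7}. A different starting guess can lead to a different final clustering, because k-means only finds a local minimum of the compactness.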

  30. K-means for Clustering • K-means • Start with a random guess of the cluster centers • Determine the membership of each data point • Adjust the cluster centers

  31. K-means for Clustering • K-means • Start with a random guess of the cluster centers • Determine the membership of each data point • Adjust the cluster centers

  32. K-means for Clustering • K-means • Start with a random guess of the cluster centers • Determine the membership of each data point • Adjust the cluster centers

  33. K-means • Ask user how many clusters they’d like. (e.g. k=5)

  34. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations

  35. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)

  36. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns

  37. K-means • Ask user how many clusters they’d like. (e.g. k=5) • Randomly guess k cluster Center locations • Each datapoint finds out which Center it’s closest to. • Each Center finds the centroid of the points it owns
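The slides do not say how the k Center locations are guessed. Two common choices are sketched below as assumptions: sampling k distinct data points, or drawing locations uniformly inside the bounding box of the data. The function names and the `seed` parameter are hypothetical and are there only to make runs repeatable.

```python
import random

def random_centers_from_data(points, k, seed=None):
    """Pick k distinct data points as the initial Center locations."""
    rng = random.Random(seed)
    return rng.sample(points, k)

def random_centers_in_box(points, k, seed=None):
    """Draw k locations uniformly inside the bounding box of the data."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in range(k)]
```

Sampling from the data guarantees that every Center starts out owning at least one point (itself), provided the data points are distinct. A Center drawn from the bounding box may start out owning nothing.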

  38. K-means • Any computational problem?

  39. K-means • Every iteration of k-means needs to go through each data point

  40. Improve K-means • Group nearby data points by region (e.g., KD tree, SR tree) • Try to update the membership of all the data points in the same region at once

  41. Improved K-means • Find the closest center for each rectangle • Assign all the points within a rectangle to one cluster
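The slides do not spell out how to decide that one center is the closest for an entire rectangle. One possible geometric test is sketched below as an assumption: take the center nearest to the rectangle's midpoint as the candidate, then check, for every other center, the corner of the rectangle that leans farthest toward that other center. If even that corner is closer to the candidate, the whole rectangle is. The function names are hypothetical.

```python
import math

def corner_toward(lo, hi, c_from, c_to):
    """Corner of the box [lo, hi] lying farthest in the direction c_from -> c_to."""
    return tuple(h if t > f else l for l, h, f, t in zip(lo, hi, c_from, c_to))

def owner_of_rectangle(lo, hi, centers):
    """Index of the center that is closest to every point in the box, or None."""
    mid = tuple((l + h) / 2 for l, h in zip(lo, hi))
    # Only the center nearest to the midpoint can own the whole box.
    best = min(range(len(centers)), key=lambda j: math.dist(mid, centers[j]))
    for j, other in enumerate(centers):
        if j == best:
            continue
        corner = corner_toward(lo, hi, centers[best], other)
        if math.dist(corner, centers[best]) >= math.dist(corner, other):
            return None  # part of the box may be closer to `other`
    return best
```

When `owner_of_rectangle` returns an index, every point stored in that rectangle can be assigned in one step without computing its individual distances. When it returns `None`, the rectangle would have to be split further (for example, by descending into the children of a KD-tree node) or its points handled one by one.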

  42. Document Clustering

  43. A Mixture Model for Document Clustering • Assume that data are generated from a mixture of multinomial distributions • Estimate the mixture distribution from the observed documents
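One common way to estimate such a mixture is the EM algorithm. The slides do not name the estimation method, so EM is an assumption here. The sketch below represents a document as a list of word ids, drops the multinomial coefficient (it is the same for every cluster and cancels in the posterior), and adds a small smoothing constant `alpha` so that no probability is exactly zero. All function and parameter names are hypothetical.

```python
import math

def e_step(docs, priors, word_probs):
    """Posterior P(cluster j | doc) for every document, computed in log space."""
    posteriors = []
    for doc in docs:
        log_joint = [math.log(priors[j]) + sum(math.log(word_probs[j][w]) for w in doc)
                     for j in range(len(priors))]
        top = max(log_joint)  # subtract the max to avoid underflow
        weights = [math.exp(v - top) for v in log_joint]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors

def m_step(docs, posteriors, vocab_size, alpha=0.01):
    """Re-estimate cluster priors and per-cluster word distributions."""
    k = len(posteriors[0])
    priors = [(sum(p[j] for p in posteriors) + alpha) / (len(docs) + k * alpha)
              for j in range(k)]
    word_probs = []
    for j in range(k):
        counts = [alpha] * vocab_size  # smoothing: assumption, not from the slides
        for doc, post in zip(docs, posteriors):
            for w in doc:
                counts[w] += post[j]
        total = sum(counts)
        word_probs.append([c / total for c in counts])
    return priors, word_probs

def fit_mixture(docs, vocab_size, init_posteriors, iterations=20):
    """Alternate M and E steps, starting from an initial soft assignment."""
    posteriors = init_posteriors
    for _ in range(iterations):
        priors, word_probs = m_step(docs, posteriors, vocab_size)
        posteriors = e_step(docs, priors, word_probs)
    return priors, word_probs, posteriors
```

Unlike k-means, each document receives a probability for every cluster rather than a single hard label. A hard clustering can be read off afterwards by taking the most probable cluster per document.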

  44. Gaussian Mixture Example: Start • Measure the probability that each data point is associated with each cluster

  45. After First Iteration

  46. After 2nd Iteration

  47. After 3rd Iteration

  48. After 4th Iteration

  49. After 5th Iteration

  50. After 6th Iteration
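The iterations in the Gaussian mixture example (slides 44 to 50) can be sketched for one-dimensional data. This is an illustration under stated assumptions, not the code behind the slides: the data are scalars, each cluster is a single Gaussian with its own mean and variance, and the variance is floored at a small value to avoid division by zero. The function names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(xs, weights, means, variances):
    """Probability that each data point is associated with each cluster."""
    result = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # assumes x is not so far out that every term underflows
        result.append([p / total for p in joint])
    return result

def update_parameters(xs, resp):
    """Re-estimate weights, means, and variances from the soft assignments."""
    k = len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, xs)) / nj
        weights.append(nj / len(xs))
        means.append(mean)
        variances.append(max(var, 1e-6))  # floor: assumption, not from the slides
    return weights, means, variances
```

One iteration is a call to `responsibilities` followed by `update_parameters`. Repeating the pair corresponds to the "After 1st, 2nd, ... Iteration" snapshots, with the cluster parameters shifting toward the groups in the data.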
