
AMCS/CS 340: Data Mining


Presentation Transcript


  1. Clustering III AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?

  3. Grid-Based Clustering Method • Especially useful for spatial data clustering • Spatial data: geographically referenced data, e.g., temperature and salinity of the Red Sea

  4. Grid-Based Clustering Method • Several interesting methods • STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97) • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using wavelets • CLIQUE: Agrawal et al. (SIGMOD'98), designed for high-dimensional data (thus covered in the section on clustering high-dimensional data)

  5. STING: A Statistical Information Grid Approach • The spatial area is hierarchically divided into rectangular cells, corresponding to different levels of resolution • Each cell stores a set of sufficient statistics (count, maximum, minimum, mean, standard deviation) summarizing the data points that fall in the cell • Efficiently processes "region-oriented" queries, i.e., queries for the set of regions satisfying a number of conditions such as area and density

  6. The STING Clustering Method • Statistical info of each cell is calculated and stored beforehand and is used to answer queries • Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells • count, maximum, minimum, mean, standard deviation • type of distribution (normal, uniform, etc.) • Uses a top-down approach to answer spatial data queries
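
To make the parent-from-child computation concrete, here is a minimal sketch (my own illustration, not code from the lecture) of how a parent cell's statistics for a single attribute can be aggregated from its child cells without revisiting the raw points; the Cell fields and the toy values are assumptions.

```python
# Sketch: derive a parent grid cell's sufficient statistics from its child cells.
from dataclasses import dataclass
from math import sqrt

@dataclass
class Cell:
    count: int
    minimum: float
    maximum: float
    mean: float
    std: float

def merge_cells(children):
    """Aggregate child-cell statistics into the parent cell (one attribute)."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # Combine second moments: E[x^2] for each child is std^2 + mean^2
    second_moment = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    return Cell(
        count=n,
        minimum=min(c.minimum for c in children),
        maximum=max(c.maximum for c in children),
        mean=mean,
        std=sqrt(max(second_moment - mean ** 2, 0.0)),
    )

# Example: four child cells of one parent cell (toy numbers)
children = [Cell(10, 0.1, 2.0, 1.0, 0.3), Cell(5, 0.5, 3.0, 1.5, 0.4),
            Cell(8, 0.2, 2.5, 1.2, 0.2), Cell(7, 0.0, 1.8, 0.9, 0.5)]
print(merge_cells(children))
```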

  7. The STING Clustering Method • Advantages: • Query-independent, easy to parallelize, supports incremental update • O(K) query cost, where K is the number of grid cells at the lowest level • Disadvantages: • All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected • Clustering quality depends on the grid granularity: if it is too fine, the computational cost increases greatly; if it is too coarse, the quality of query answering is poor

  8. Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?

  9. Model-Based Clustering • What is model-based clustering? • Attempt to optimize the fit (likelihood) between the given data and some mathematical model • Based on the assumption that the data are generated by a mixture of underlying probability distributions, with each component of the mixture corresponding to a cluster • E.g., a mixture of Gaussians

  10. Mixtures of Gaussians (1) [figure: the Old Faithful data set fitted with a single Gaussian vs. a mixture of two Gaussians]

  11. Mixtures of Gaussians (2) • Combine simple models into a complex model: p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) • Each Gaussian \mathcal{N}(x \mid \mu_k, \Sigma_k) is a component (the figure uses K = 3) and \pi_k is its mixing coefficient, with \pi_k \ge 0 and \sum_k \pi_k = 1

  12. Mixtures of Gaussians (3) [figure only]

  13. How to determine the parameters of a mixture model, e.g., \pi_k, \mu_k, \Sigma_k? • Maximize the log likelihood \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) • Because of the log of a sum, there is no closed-form maximum • Solution: iterative numerical optimization methods, in particular the Expectation-Maximization (EM) algorithm

  14. EM Algorithm • Expectation-Maximization (EM) is a general technique for estimating the ML parameters of a model with latent variables (Dempster et al., 1977) • Two steps: • E step: evaluate the posterior probabilities of the latent variables based on the current values of the parameters • M step: re-estimate the parameters using these posteriors • K-means is a particular limit of EM applied to mixtures of Gaussians
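
As a concrete illustration of the two steps, here is a minimal NumPy sketch of EM for a one-dimensional, two-component Gaussian mixture; the toy data, variable names, and stopping rule are my own assumptions, not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])  # toy data
K, N = 2, len(x)

# Initialization
pi = np.full(K, 1.0 / K)                        # mixing coefficients
mu = rng.choice(x, size=K, replace=False)       # component means
var = np.full(K, x.var())                       # component variances

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for it in range(100):
    # E step: responsibilities gamma[n, k] = p(component k | x_n)
    dens = pi * normal_pdf(x[:, None], mu, var)          # shape (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate parameters from the responsibilities
    Nk = gamma.sum(axis=0)
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    pi_new = Nk / N

    if np.allclose(mu, mu_new, atol=1e-6):               # convergence check
        break
    pi, mu, var = pi_new, mu_new, var_new

print("means:", mu, "variances:", var, "weights:", pi)
```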

  15. k-means and the EM algorithm • Initialize ↔ select k centroids • E-step: evaluate the responsibilities ↔ assign each point to its nearest centroid • M-step: re-estimate the parameters using the current responsibilities ↔ compute new centroids • Check the convergence of the parameters ↔ check the changes of the centroids

  16. Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?

  17. High-Dimensional Data • Application: gene expression clustering • Columns: samples, patients, conditions • Rows: genes • Clusters of patients • Clusters of genes • Clusters of certain genes and certain patients

  18. High-Dimensional Data • Application: text document mining • Columns: documents • Rows: words • Clusters of words • Clusters of documents • Clusters of certain words and certain documents

  19. Challenges • Major challenges of clustering high-dimensional data • Many irrelevant dimensions may mask clusters • Distance measures become less meaningful • Clusters may exist only in some subspaces

  20. The Curse of Dimensionality (graphs adapted from Parsons et al., SIGKDD Explorations 2004) • Data in only one dimension are relatively densely packed • Adding a dimension "stretches" the points across that dimension, pushing them further apart • Adding more dimensions pushes the points still further apart: high-dimensional data is extremely sparse • Distance measures become less meaningful
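
The effect can be observed with a small experiment (my own illustration, not from the slides): for random points in the unit hypercube, the ratio between the smallest and largest pairwise distance approaches 1 as the dimensionality grows, so "near" and "far" lose their contrast.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (1, 2, 10, 100, 1000):
    X = rng.random((500, d))             # 500 uniform random points in [0, 1]^d
    dist = pdist(X)                      # all pairwise Euclidean distances
    print(f"d={d:5d}  min/max distance ratio = {dist.min() / dist.max():.3f}")
# The ratio approaches 1 in high dimensions: all points look almost equally far apart.
```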

  21. Clusters in Subspace (adapted from Parsons et al., SIGKDD Explorations 2004) • Data points in 3 dimensions, with 4 clusters mixed together • Plotted in one dimension (histogram), points from multiple clusters are mixed together

  22. Clusters in Subspace (adapted from Parsons et al., SIGKDD Explorations 2004) • Clusters may exist only in some subspaces • Subspace clustering: find clusters in all the subspaces • Plotted in two dimensions, two clusters are separated in (a) and in (b)

  23. Clustering High-Dimensional Data • Feature transformation: only effective if most dimensions are relevant • PCA & SVD are useful only when features are highly correlated/redundant (see the sketch below) • Feature selection: wrapper or filter approaches • useful for finding a subspace where the data have nice clusters • Subspace clustering: find clusters in all the possible subspaces • ProClus (SIGMOD'99) • CLIQUE (SIGMOD'98) • Frequent pattern-based clustering (SIGMOD'02)
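
A brief scikit-learn sketch (my own example, not from the lecture) of the feature-transformation route: project the data onto a few principal components, then cluster in the reduced space. The toy data and parameter values are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))           # toy high-dimensional data
X[:150, :5] += 4.0                       # cluster structure living in the first 5 features

X_reduced = PCA(n_components=5).fit_transform(X)   # keep directions of largest variance
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))               # sizes of the two recovered clusters
```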

  24. Subspace Clustering • Subspace clustering seeks to find clusters in a dataset by selecting the most relevant dimensions for each cluster separately • There are three main approaches: • Top-down iterative approach (ProClus): find an initial approximation of the clusters in the full feature space with equally weighted dimensions; then assign each dimension a weight for each cluster • Bottom-up grid approach (CLIQUE): find dense units in one dimension, then merge them to find dense clusters in higher-dimensional subspaces • Frequent pattern-based clustering

  25. Top-Down Subspace Clustering • ProClus (Aggarwal et al., SIGMOD'99) • A modification of the k-medoids algorithm • A top-down algorithm that splits dense regions into different subspaces • User-specified number of clusters (K) and average cluster dimensionality (L): unrealistic for real-world data sets • Uses each cluster center and the points near it to compute statistics; these determine the relevant dimensions of the cluster

  26. Top-Down Subspace Clustering • ProClus, a modification of the k-medoids algorithm • Input: number of clusters K, average dimensionality L • Initialization: a greedy algorithm selects potential medoids that are far apart from each other [figure: the set of medoids]

  27. Top-Down Subspace Clustering • ProClus, a modification of the k-medoids algorithm • Input: number of clusters K, average dimensionality L • Iteration, step 1: find the neighbors of each medoid m_i, i.e., the points within radius d_i = min_{j \ne i} ||m_i - m_j||, the distance from m_i to its nearest other medoid [figure: the set of medoids and their neighbors]

  28. Top-Down Subspace Clustering • ProClus, a modification of the k-medoids algorithm • Input: number of clusters K, average dimensionality L • Iteration, step 2: find the dimensions (subspace) of each medoid m_i • For each medoid, compute X_{ij}, the average distance along dimension j between m_i and its neighbors • Sort the X_{ij} in increasing order and keep the K*L smallest values overall; each kept X_{ij} reports dimension j as relevant to medoid i

  29. Top-Down Subspace Clustering • ProClus, a modification of the k-medoids algorithm • Input: number of clusters K, average dimensionality L • Iteration, step 3: form clusters by assigning each point to the nearest medoid m_i under the Manhattan segmental distance (note: different medoids have different relevant dimensions, so each distance is computed in that medoid's own subspace)
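
A small sketch (my own illustration, not ProClus's original code) of the Manhattan segmental distance used in this step: the Manhattan distance restricted to a medoid's relevant dimensions, normalized by the number of those dimensions. The example vectors and dimension set are assumptions.

```python
import numpy as np

def manhattan_segmental(x, medoid, dims):
    """Average |x_j - medoid_j| over the medoid's relevant dimensions `dims`."""
    dims = np.asarray(dims)
    return np.abs(x[dims] - medoid[dims]).sum() / len(dims)

x = np.array([1.0, 5.0, 2.0, 8.0])
medoid = np.array([1.5, 0.0, 2.5, 9.0])
print(manhattan_segmental(x, medoid, dims=[0, 2, 3]))   # uses only dimensions 0, 2, 3
```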

  30. Top-Down Subspace Clustering • ProClus, a modification of the k-medoids algorithm • Input: number of clusters K, average dimensionality L • Iteration, step 4: replace the bad medoids with new medoids • Which medoids are bad? Those that attracted few points • They are swapped for new medoids and the iteration continues

  31. Top-Down Subspace Clustering • ProClus, a modification of the k-medoids algorithm • Input: number of clusters K, average dimensionality L • Initialization • Iteration: repeat until no change • Cluster refinement: compute new dimensions for each medoid (a procedure similar to "find the subspace for each medoid"), then reassign points to medoids, removing outliers

  32. Bottom-up Subspace Clustering • CLIQUE: Agrawal et al. (SIGMOD'98) • Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space • Partitions each dimension into the same number of equal-length intervals (user-specified grid size) • Identifies dense units from 1-d up to k-d (user-specified density threshold) • Identifies clusters by combining dense regions bottom-up across subspaces • Generates a minimal description of the clusters; the MDL principle is used for pruning (minimizing the description length)
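
A compact sketch of the bottom-up step (my own simplification, not the original CLIQUE implementation): grid each dimension, keep the dense 1-d units, then join dense 1-d units from different dimensions into candidate 2-d units and keep those that are dense. The grid size, density threshold, and toy data are assumptions.

```python
import numpy as np
from itertools import combinations
from collections import Counter

rng = np.random.default_rng(0)
X = rng.random((1000, 3))
X[:400, :2] = rng.normal([0.3, 0.7], 0.03, size=(400, 2)).clip(0, 0.999)  # a 2-d cluster

xi, tau = 10, 50           # xi intervals per dimension, density threshold tau points
cells = np.minimum((X * xi).astype(int), xi - 1)     # interval index of each point per dim

# Dense 1-d units: (dimension, interval) pairs containing at least tau points
dense1 = {u for u, c in Counter((d, cells[n, d])
          for n in range(len(X)) for d in range(X.shape[1])).items() if c >= tau}

# Candidate 2-d units: pairs of dense 1-d units from different dimensions
dense2 = []
for (d1, i1), (d2, i2) in combinations(sorted(dense1), 2):
    if d1 == d2:
        continue
    count = np.sum((cells[:, d1] == i1) & (cells[:, d2] == i2))
    if count >= tau:
        dense2.append(((d1, i1), (d2, i2)))
print("dense 2-d units:", dense2)
```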

  33. Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?

  34. The Quality of Clustering • For supervised classification we have a variety of measures to evaluate how good our model is • Accuracy, precision, recall • For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters • But "clusters are in the eye of the beholder"! • Then why do we want to evaluate them? • To avoid finding patterns in noise • To compare clustering algorithms • To compare two sets of clusters • To compare two clusters

  35. Measures of Cluster Validity • Numerical measures used to judge various aspects of cluster validity fall into two types: • External index: measures the extent to which cluster labels match externally supplied class labels (entropy, purity) • Internal index: measures the goodness of a clustering structure without reference to external information (sum of squared error (SSE), cophenetic correlation coefficient, silhouette)

  36. Cluster Validity: External Index • The class labels are externally supplied (q classes) • Entropy: smaller entropy values indicate better clustering solutions • Entropy of each cluster C_r of size n_r: E(C_r) = -\sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}, where n_r^i is the number of instances of the i-th class assigned to the r-th cluster • Entropy of the entire clustering of n points into k clusters: E = \sum_{r=1}^{k} \frac{n_r}{n} E(C_r)

  37. Cluster Validity: External Index • The class labels are externally supplied (q classes) • Purity: larger purity values indicate better clustering solutions • Purity of each cluster C_r of size n_r: P(C_r) = \frac{1}{n_r} \max_i n_r^i • Purity of the entire clustering: P = \sum_{r=1}^{k} \frac{n_r}{n} P(C_r)
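
A short Python sketch (my own, not from the slides) computing both external indices from supplied class labels; the toy labels at the end and the choice of log base 2 are assumptions.

```python
import numpy as np

def entropy_purity(class_labels, cluster_labels):
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(class_labels)
    total_entropy, total_purity = 0.0, 0.0
    for r in np.unique(cluster_labels):
        members = class_labels[cluster_labels == r]
        n_r = len(members)
        p = np.bincount(members) / n_r             # class proportions inside cluster r
        p = p[p > 0]
        total_entropy += (n_r / n) * -(p * np.log2(p)).sum()
        total_purity += (n_r / n) * p.max()
    return total_entropy, total_purity

# Example: 2 true classes, 3 clusters (toy labels)
classes  = [0, 0, 0, 1, 1, 1, 1, 0]
clusters = [0, 0, 1, 1, 2, 2, 2, 0]
print(entropy_purity(classes, clusters))
```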

  38. Internal Measures: SSE • Internal index: measures the goodness of a clustering structure without reference to external information • SSE is good for comparing two clustering results • average SSE • SSE curves w.r.t. various K • Can also be used to estimate the number of clusters
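
A quick scikit-learn sketch (my own example) of an SSE-vs-K curve; KMeans.inertia_ is the within-cluster sum of squared errors of the fitted model. The synthetic blob data and the range of K are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)
for k in range(1, 9):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k}: SSE={sse:.1f}")
# Look for the "elbow" where the decrease in SSE levels off (around K=4 here).
```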

  39. Internal Measures: Cophenetic Correlation Coefficient • A measure of how faithfully a dendrogram preserves the pairwise distances between the original data points • Can be used to compare two hierarchical clusterings of the data • Compute the correlation coefficient between the original pairwise distances (Dist) and the cophenetic distances (CP) read from the dendrogram [figure: example dendrogram over points A–F with merge heights 0.5, 0.71, 1.00, 1.41, 2.50] • Matlab function: cophenet
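
For reference, scipy provides the same computation as Matlab's cophenet; the sketch below (my own usage example with random data) computes the cophenetic correlation coefficient for an average-linkage dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

dist = pdist(X)                       # original pairwise distances
Z = linkage(dist, method="average")   # hierarchical clustering (dendrogram)
c, coph_dists = cophenet(Z, dist)     # correlation between dist and cophenetic distances
print(f"cophenetic correlation coefficient: {c:.3f}")
```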

  40. Internal Measures: Cohesion and Separation • Cluster cohesion measures how closely related the objects in a cluster are, e.g., the SSE or the sum of the weights of all links within the cluster • Cluster separation measures how distinct or well separated a cluster is from other clusters, e.g., the sum of the weights of links between nodes in the cluster and nodes outside the cluster [figure: cohesion vs. separation]

  41. Internal Measures: Silhouette Coefficient • The silhouette coefficient combines the ideas of cohesion and separation • For an individual point i: • calculate a = the average distance of i to the points in its own cluster • calculate b = the minimum, over other clusters, of the average distance of i to the points in that cluster • the silhouette coefficient of the point is then s = 1 - a/b (for the usual case a < b; in general s = (b - a) / max(a, b)) • Typically between 0 and 1; the closer to 1 the better • The average silhouette width can be computed for a cluster or for a whole clustering • Matlab function: silhouette

  42. Determining the Number of Clusters by the Silhouette Coefficient • Compare different clusterings by their average silhouette values, e.g.: K=3, mean silhouette = 0.526; K=4, mean silhouette = 0.640; K=5, mean silhouette = 0.527 (K=4 is the best of the three)
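
The same selection can be reproduced with scikit-learn; the sketch below (my own example, with synthetic blob data) picks the K with the largest mean silhouette value.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.2, random_state=0)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)      # mean silhouette over all points
    print(f"K={k}: mean silhouette = {scores[k]:.3f}")
print("best K:", max(scores, key=scores.get))
```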

  43. Determine the Number of Clusters • Select the number of clusters K as the one maximizing the average silhouette value over all points • Optimize an objective criterion • Gap statistic on the decrease of SSE w.r.t. K • Model-based methods: optimize a global criterion (e.g., the maximum likelihood of the data) • Use clustering methods that do not require setting K, e.g., DBSCAN • Prior knowledge, prior knowledge…

  44. Clustering vs. Classification [figure only]

  45. Problems and Challenges • Considerable progress has been made in scalable clustering methods • Partitioning: k-means, k-medoids, CLARANS • Hierarchical: BIRCH, ROCK, CHAMELEON • Density-based: DBSCAN, OPTICS, DenClue • Grid-based: STING, WaveCluster, CLIQUE • Model-based: EM, SOM • Spectral clustering • Affinity propagation • Frequent pattern-based: bi-clustering, pCluster • Current clustering techniques do not address all the requirements adequately; clustering is still an active area of research

  46. Cluster Analysis • Open issues in clustering • Clustering quality evaluation • How to decide the number of clusters?

  47. References • Kriegel et al., "Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering", TKDD 2009 • Parsons et al., "Evaluating Subspace Clustering Algorithms", SIGKDD Explorations 2004 • Aggarwal et al., "Fast Algorithms for Projected Clustering" (PROCLUS), SIGMOD 1999 • Agrawal et al., "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications" (CLIQUE), SIGMOD 1998 • Y. Cheng and G. Church, "Biclustering of Expression Data", International Conference on Intelligent Systems for Molecular Biology, 2000 • Haixun Wang et al., "Clustering by Pattern Similarity in Large Data Sets" (pCluster), SIGMOD 2002

  48. References • Ben-Hur A., Elisseeff A., Guyon I., "A Stability Based Method for Discovering Structure in Clustered Data", Proceedings of PSB 2002 • Shai Ben-David, Ulrike von Luxburg, and David Pal, "A Sober Look at Clustering Stability", Proceedings of the Annual Conference on Learning Theory (COLT) 2006 • Tibshirani R., Walther G., Hastie T., "Estimating the Number of Clusters in a Dataset via the Gap Statistic", Technical Report, Dept. of Biostatistics, Stanford University, 2001 • Dudoit S., Fridlyand J., "A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Data Set", Genome Biology 2002

  49. What you should know • General idea and application domain of grid-based clustering methods • General idea of model-based clustering methods • What are the main algorithms for subspace clustering? • How to evaluate clustering results? • How is the number of clusters usually decided? • What are the main differences between clustering and classification?
