Clustering III AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology
Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?
Grid-Based Clustering Method • Especially useful for clustering spatial data • Spatial data --- geographically referenced data, e.g., temperature and salinity of the Red Sea
Grid-Based Clustering Method • Several interesting methods • STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97) • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method • CLIQUE: Agrawal, et al. (SIGMOD'98): designed for high-dimensional data (thus covered in the section on clustering high-dimensional data)
STING: A Statistical Information Grid Approach • The spatial area is hierarchically divided into rectangular cells, corresponding to different levels of resolution • Each cell is attached a set of sufficient statistics (count, maximum, minimum, mean, standard deviation) summarizing the data points falling in the cell • Efficiently processes "region-oriented" queries, i.e., finds the set of regions satisfying a number of conditions including area and density
The STING Clustering Method • Statistical info of each cell is calculated and stored beforehand and is used to answer queries • Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells • count, maximum, minimum, mean, standard deviation • type of distribution (normal, uniform, etc.) • A top-down approach is used to answer spatial data queries
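To make the bottom-up parameter computation concrete, here is a minimal Python sketch of how a parent cell's statistics could be assembled from the statistics of its child cells; the dictionary field names are illustrative choices, not taken from the STING paper.

```python
import math

def merge_cells(children):
    """Aggregate the sufficient statistics of child cells into a parent cell.

    Each child is a dict with keys: count, min, max, mean, std.
    A minimal sketch of STING's bottom-up parameter computation.
    """
    n = sum(c["count"] for c in children)
    if n == 0:
        return {"count": 0, "min": None, "max": None, "mean": 0.0, "std": 0.0}
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # parent E[x^2] recovered from each child's mean and variance
    ex2 = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    std = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return {
        "count": n,
        "min": min(c["min"] for c in children if c["count"] > 0),
        "max": max(c["max"] for c in children if c["count"] > 0),
        "mean": mean,
        "std": std,
    }
```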
The STING Clustering Method • Advantages: • Query-independent, easy to parallelize, supports incremental updates • O(K) query cost, where K is the number of grid cells at the lowest level • Disadvantages: • All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected • The clustering quality depends on the grid granularity: if too fine, the computational cost increases sharply; if too coarse, the quality of query answering is poor
Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?
Model-Based Clustering • What is model-based clustering? • Attempt to optimize the fit (likelihood) between the given data and some mathematical model • Based on the assumption that data are generated by a mixture of underlying probability distributions; each component of the mixture corresponds to a cluster • E.g., Mixture of Gaussians
Mixtures of Gaussians (1) • Figure: Old Faithful data set, modeled with a single Gaussian vs. a mixture of two Gaussians
Mixtures of Gaussians (2) • Combine simple models into a complex model: a weighted sum of K components (here K = 3), where each component is a Gaussian and the weights are the mixing coefficients
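The combined density referred to above is the standard Gaussian mixture: a weighted sum of K Gaussian components whose mixing coefficients are non-negative and sum to one.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1
```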
Mixtures of Gaussians (3)
How to determine the parameters of mixture models? • Maximize the log likelihood of the data • The log likelihood contains the log of a sum, so there is no closed-form maximum • Solution: iterative numerical optimization methods, in particular the Expectation-Maximization (EM) algorithm
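For a data set X = {x_1, ..., x_N}, the log likelihood being maximized has exactly this log-of-a-sum form:

```latex
\ln p(X \mid \pi, \mu, \Sigma)
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}
```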
EM Algorithm • Expectation-Maximization (EM) is a general technique for estimating maximum-likelihood parameters of a model with latent variables (Dempster et al., 1977) • Two steps: • E step: evaluate the posterior probabilities (responsibilities) based on the current values of the parameters • M step: re-estimate the parameters • K-means is a particular limit of EM applied to mixtures of Gaussians
k-means and EM algorithm • Initialize: select K centroids (k-means) / initialize the parameters (EM) • E-step: assign each point to its nearest centroid (k-means) / evaluate the responsibilities (EM) • M-step: compute new centroids (k-means) / re-estimate the parameters using the current responsibilities (EM) • Convergence: check the changes of the centroids (k-means) / check the convergence of the parameters (EM)
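As a concrete illustration of the E and M steps, here is a minimal NumPy/SciPy sketch of EM for a Gaussian mixture; the variable names, initialization scheme, and the small regularization term are illustrative choices, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture. X: (N, D) data array, K: number of components."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: random points as means, shared covariance, uniform weights
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = posterior p(component k | x_n)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate mixing weights, means and covariances
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, sigma, r
```

Hardening the responsibilities (assigning each point entirely to its closest component) and fixing the covariances recovers the k-means updates, which is the limit mentioned on the previous slide.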
Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?
High-Dimensional Data • Applications: gene expression clustering • Column: samples, patients, conditions • Row: genes • Clusters of patients • Clusters of genes • Clusters of certain genes and certain patients
High-Dimensional Data • Applications: text document mining • Column: documents • Row: words • Clusters of words • Clusters of documents • Clusters of certain words and certain documents
Challenges • Major challenges of clustering high-dimensional data • Many irrelevant dimensions may mask clusters • Distance measures become less meaningful • Clusters may exist only in some subspaces
The Curse of Dimensionality (graphs adapted from Parsons et al., SIGKDD Explorations 2004) • Data in only one dimension is relatively packed • Adding a dimension "stretches" the points across that dimension, pushing them further apart • Adding more dimensions moves the points still further apart: high-dimensional data is extremely sparse • Distance measures become less meaningful
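A quick way to see the loss of distance contrast is to compare a point's nearest and farthest neighbor in uniformly random data as the dimensionality grows; the ratio of the two distances shrinks toward 1. A small hypothetical Python experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
for d in (1, 2, 10, 100, 1000):
    X = rng.random((n, d))                       # n uniform points in d dimensions
    dist = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point
    # as d grows, the farthest point is barely farther than the nearest one
    print(f"d={d:5d}  max/min distance = {dist.max() / dist.min():.2f}")
```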
Clusters in Subspace (adapted from Parsons et al., SIGKDD Explorations 2004) • Data points in 3 dimensions • 4 clusters mixed together • Plotted in one dimension with a histogram: points from multiple clusters are mixed together
Clusters in Subspace (adapted from Parsons et al., SIGKDD Explorations 2004) • Clusters may exist only in some subspaces • Subspace clustering: find clusters in all the subspaces • Plotted in two dimensions: two clusters are separated in (a) and (b)
Clustering High-Dimensional Data • Feature transformation: only effective if most dimensions are relevant • PCA & SVD are useful only when features are highly correlated/redundant • Feature selection: wrapper or filter approaches • useful to find a subspace where the data have nice clusters • Subspace clustering: find clusters in all the possible subspaces • ProClus (SIGMOD'99) • CLIQUE (SIGMOD'98) • Frequent pattern-based clustering (SIGMOD'02)
Subspace Clustering • Subspace clustering seeks to find clusters in a dataset by selecting the most relevant dimensions for each cluster separately • There are 3 main approaches: • Top-down iterative approach (ProClus): find an initial approximation of the clusters in the full feature space with equally weighted dimensions; then each dimension is assigned a weight for each cluster • Bottom-up grid approach (CLIQUE): find dense units in one dimension, then merge them to find dense clusters in higher-dimensional subspaces • Frequent pattern-based clustering
Top-Down Subspace Clustering • ProClus (Aggarwal et al., SIGMOD'99) • Modification of the k-medoids algorithm • A top-down algorithm that splits dense regions into different subspaces • User-specified number of clusters (K) and average cluster dimensionality (L): often unrealistic for real-world data sets • Uses cluster centers and the points near them to compute statistics, which determine the relevant dimensions of each cluster
Top-Down Subspace Clustering • ProClus -- modification of the k-medoids algorithm • Input: # clusters K, average dimension L • Initialization • A greedy algorithm selects potential medoids that are far apart from each other (forming the set of medoids)
Top-Down Subspace Clustering • ProClus -- modification of the k-medoids algorithm • Input: # clusters K, average dimension L • Iteration 1. Find the neighbors of each medoid mi: the points within radius d_i = min ||mi - mj||, j = 1…K, j ≠ i, i.e., the distance to the nearest other medoid
Top-Down Subspace Clustering • ProClus -- modification of the k-medoids algorithm • Input: # clusters K, average dimension L • Iteration 2. Find the dimensions (subspace) for each medoid mi: compute Xij, the average distance along dimension j from the neighbors of mi to mi; sort all Xij in increasing order and keep in total K*L smallest values, reporting dimension j to medoid i
Top-Down Subspace Clustering • ProClus -- modification of the k-medoids algorithm • Input: # clusters K, average dimension L • Iteration 3. Form clusters: assign each point to the nearest medoid mi by computing the Manhattan segmental distance over the dimensions reported for mi (NOTE: the distance to each medoid is computed in that medoid's own dimension subset, hence the normalization by the number of dimensions)
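A sketch of this assignment step in Python, assuming each medoid carries its own list of relevant dimensions; function and variable names are illustrative, not taken from the ProClus paper.

```python
import numpy as np

def manhattan_segmental(x, m, dims):
    """Manhattan segmental distance between point x and medoid m,
    restricted to the medoid's relevant dimensions `dims` and
    normalized by their number."""
    dims = np.asarray(dims)
    return np.abs(x[dims] - m[dims]).sum() / len(dims)

def assign_points(X, medoids, medoid_dims):
    """Assign every point to the medoid with the smallest segmental distance."""
    labels = []
    for x in X:
        d = [manhattan_segmental(x, m, dims) for m, dims in zip(medoids, medoid_dims)]
        labels.append(int(np.argmin(d)))
    return np.array(labels)
```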
Top-Down Subspace Clustering • ProClus -- modification of the k-medoids algorithm • Input: # clusters K, average dimension L • Iteration 4. Replace the bad medoids with new medoids • Which are the bad medoids? The medoids that attracted few points • Who are the new medoids? Replacements drawn from the candidate medoid set
Top-Down Subspace Clustering • ProClus -- modification of the k-medoids algorithm • Input: # clusters K, average dimension L • Initialization • Iteration: repeat until no change • Cluster refinement • Compute new dimensions for each medoid, with a procedure similar to "find the subspace for each medoid" • Reassign points to medoids, removing outliers
Bottom-up Subspace Clustering • CLIQUE: Agrawal, et al. (SIGMOD'98) • Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space • Partitions each dimension into the same number of equal-length intervals (user-specified grid size) • Identifies dense units from 1-d to k-d (user-specified density threshold) • Identifies clusters by combining dense regions (bottom-up) in different subspaces • Generates a minimal description of the clusters • The MDL principle is used as a pruning method (minimize the description length)
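To illustrate the bottom-up idea, here is a simplified Python sketch that finds dense 1-d units on an equal-width grid and then joins pairs of them into dense 2-d units. The grid size and density threshold are the two user parameters named above; the full algorithm continues to higher dimensions and merges adjacent dense units into clusters. Parameter names and thresholds are illustrative.

```python
import numpy as np
from itertools import combinations

def clique_dense_units(X, n_bins=10, tau=0.05):
    """Bottom-up search for dense units in the spirit of CLIQUE."""
    N, D = X.shape
    # discretize every dimension into n_bins equal-width intervals
    bins = np.empty((N, D), dtype=int)
    for j in range(D):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        bins[:, j] = np.clip(np.digitize(X[:, j], edges) - 1, 0, n_bins - 1)

    # 1-d dense units: (dimension, interval) cells containing more than tau*N points
    dense_1d = {(j, b)
                for j in range(D)
                for b in range(n_bins)
                if np.sum(bins[:, j] == b) > tau * N}

    # 2-d candidates: join dense 1-d units from two different dimensions (Apriori-style)
    dense_2d = set()
    for (j1, b1), (j2, b2) in combinations(sorted(dense_1d), 2):
        if j1 == j2:
            continue
        inside = (bins[:, j1] == b1) & (bins[:, j2] == b2)
        if inside.sum() > tau * N:
            dense_2d.add(((j1, b1), (j2, b2)))
    return dense_1d, dense_2d
```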
Cluster Analysis • What is Cluster Analysis? • Partitioning Methods • Hierarchical Methods • Density-Based Methods • Grid-Based Methods • Model-Based Methods • Clustering High-Dimensional Data • How to decide the number of clusters?
The Quality of Clustering • For supervised classification we have a variety of measures to evaluate how good our model is • Accuracy, precision, recall • For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters • But "clusters are in the eye of the beholder"! • Then why do we want to evaluate them? • To avoid finding patterns in noise • To compare clustering algorithms • To compare two sets of clusters • To compare two clusters
Measures of Cluster Validity • Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types: • External Index: used to measure the extent to which cluster labels match externally supplied class labels • Entropy, purity • Internal Index: used to measure the goodness of a clustering structure without respect to external information • Sum of Squared Error (SSE), cophenetic correlation coefficient, silhouette
Cluster Validity: External Index • The class labels are externally supplied (q classes) • Entropy: • Smaller entropy values indicate better clustering solutions • Entropy of each cluster Cr of size nr, where nri is the number of instances of the i-th class assigned to the r-th cluster • Entropy of the entire clustering
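Written out with this notation, where n_r^i is the number of class-i instances in cluster C_r, n_r = |C_r|, and n is the total number of points:

```latex
E(C_r) = -\sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r},
\qquad
\text{Entropy} = \sum_{r=1}^{K} \frac{n_r}{n}\, E(C_r)
```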
Cluster Validity: External Index • The class labels are externally supplied (q classes) • Purity: • Larger purity values indicate better clustering solutions • Purity of each cluster Cr of size nr • Purity of the entire clustering
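The corresponding formulas, using the same notation as for entropy:

```latex
P(C_r) = \frac{1}{n_r} \max_{1 \le i \le q} n_r^i,
\qquad
\text{Purity} = \sum_{r=1}^{K} \frac{n_r}{n}\, P(C_r)
```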
Internal Measures: SSE • Internal Index: used to measure the goodness of a clustering structure without respect to external information • SSE is good for comparing two clustering results • average SSE • SSE curves w.r.t. various K • It can also be used to estimate the number of clusters
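For reference, the SSE of a clustering with clusters C_1, ..., C_K and centroids m_1, ..., m_K is:

```latex
\mathrm{SSE} = \sum_{r=1}^{K} \sum_{x \in C_r} \lVert x - m_r \rVert^2
```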
Internal Measures: Cophenetic Correlation Coefficient • Cophenetic correlation coefficient: • a measure of how faithfully a dendrogram preserves the pairwise distances between the original data points • Compare two hierarchical clusterings of the data: compute the correlation coefficient between the original distance matrix (Dist) and the cophenetic distance matrix (CP) • (Figure: example dendrogram over points A–F) • Matlab function: cophenet
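Besides Matlab's cophenet, the same coefficient is available in SciPy; a small hypothetical example on random data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((30, 2))            # toy data: 30 points in 2-d

dist = pdist(X)                    # original pairwise distances (Dist)
Z = linkage(dist, method='single') # hierarchical clustering
c, coph_dists = cophenet(Z, dist)  # correlation coefficient and cophenetic distances (CP)
print(f"cophenetic correlation coefficient = {c:.3f}")
```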
Internal Measures: Cohesion and Separation • Cluster cohesion measures how closely related the objects in a cluster are, e.g., the SSE or the sum of the weights of all links within a cluster • Cluster separation measures how distinct or well-separated a cluster is from other clusters, e.g., the sum of the weights of the links between nodes in the cluster and nodes outside the cluster
Internal Measures: Silhouette Coefficient • Silhouette Coefficient combines ideas of both cohesion and separation • For an individual point i • Calculate a = average distance of i to the points in its cluster • Calculate b = min (average distance of i to points in another cluster) • The silhouette coefficient for the point is then s = 1 - a/b (assuming a < b; in general s = (b - a) / max(a, b)) • Typically between 0 and 1; the closer to 1 the better • Can calculate the average silhouette width for a cluster or a clustering • Matlab function: silhouette
Determine the number of clusters by the Silhouette Coefficient • Compare different clusterings by their average silhouette values • K=3: mean(silh) = 0.526 • K=4: mean(silh) = 0.640 • K=5: mean(silh) = 0.527 • K=4 gives the largest average silhouette
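A sketch of the same procedure with scikit-learn, sweeping K and reporting the mean silhouette; the data set and parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data with 4 true clusters

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"K={k}  mean silhouette = {silhouette_score(X, labels):.3f}")
```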
Determine the number of clusters • Select the number K of clusters as the one maximizing the average silhouette value of all points • Optimize an objective criterion • Gap statistic of the decrease of SSE w.r.t. K • Model-based method: optimize a global criterion (e.g., the maximum likelihood of the data) • Use clustering methods that do not need K to be set, e.g., DBSCAN • Prior knowledge, prior knowledge…
Clustering vs. Classification
Problems and Challenges • Considerable progress has been made in scalable clustering methods • Partitioning: k-means, k-medoids, CLARANS • Hierarchical: BIRCH, ROCK, CHAMELEON • Density-based: DBSCAN, OPTICS, DenClue • Grid-based: STING, WaveCluster, CLIQUE • Model-based: EM, SOM • Spectral clustering • Affinity Propagation • Frequent pattern-based: Bi-clustering, pCluster • Current clustering techniques do not address all the requirements adequately; clustering is still an active area of research
Cluster Analysis • Open issues in clustering • Clustering quality evaluation • How to decide the number of clusters?
References • Kriegel et al., "Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering", TKDD 2009 • Lance Parsons et al., Evaluating Subspace Clustering Algorithms, SIGKDD 2004 • Aggarwal et al., Fast Algorithms for Projected Clustering (PROCLUS), SIGMOD 1999 • Agrawal et al., Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications (CLIQUE), SIGMOD 1998 • Y. Cheng and G. Church, Biclustering of Expression Data, International Conference on Intelligent Systems for Molecular Biology, 2000 • Haixun Wang et al., Clustering by Pattern Similarity in Large Data Sets (p-clustering), SIGMOD 2002
References • Ben-Hur A., Elisseeff A., Guyon I., A Stability Based Method for Discovering Structure in Clustered Data, Proceedings of PSB 2002 • Shai Ben-David, Ulrike von Luxburg, and David Pal, A Sober Look at Clustering Stability, Proceedings of the Annual Conference on Learning Theory (COLT), 2006 • Tibshirani R., Walther G., Hastie T., Estimating the Number of Clusters in a Dataset via the Gap Statistic, Technical report, Dept. of Biostatistics, Stanford University, 2001 • Dudoit S., Fridlyand J., A Prediction-Based Resampling Method for Estimating the Number of Clusters in a Data Set, Genome Biology 2002
What you should know • General idea and application domain of grid-based clustering methods • General idea of model-based clustering methods • What are the main algorithms for subspace clustering? • How to evaluate the clustering results? • Usually, how to decide the number of clusters? • What are the main differences between clustering and classification?