An overview of the fundamentals and algorithms of clustering, treated as unsupervised pattern classification: taxonomy, clustering decisions, proximity measures, and major techniques such as hierarchical clustering and k-means.
Clustering… in General • Clustering is unsupervised pattern classification. • Unsupervised means there is no correct answer or feedback. • Patterns typically are samples of feature vectors or matrices. • Classification means collecting the samples into groups of similar members. • In vector space, a cluster is the set of vectors found within ε of a cluster vector; techniques differ in how they determine the cluster vector and ε.
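A minimal sketch of the vector-space view: points are assigned to a cluster when they lie within ε of its cluster vector (the data, names, and ε value below are illustrative, not from the slides).

```python
import numpy as np

def within_epsilon(points, cluster_vector, eps):
    """Return the points lying within distance eps of cluster_vector."""
    dists = np.linalg.norm(points - cluster_vector, axis=1)
    return points[dists <= eps]

# Illustrative data: three points, one cluster vector, and a hand-picked eps.
points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
cluster_vector = np.array([1.0, 1.0])
print(within_epsilon(points, cluster_vector, eps=0.5))  # keeps the first two points
```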
Clustering Decisions • Pattern representation: feature selection (e.g., stop word removal, stemming), number of categories. • Pattern proximity: distance measure on pairs of patterns. • Grouping: characteristics of clusters (e.g., fuzzy, hierarchical). Clustering algorithms embody different assumptions about these decisions and the form of clusters.
Formal Definitions • Feature vector x is a single datum of d measurements. • Hard clustering techniques assign each pattern a single class label; clusters are mutually exclusive. • Fuzzy clustering techniques assign a fractional degree of membership in each label to each x.
Proximity Measures • Generally, use Euclidean distance or mean squared distance. • In IR, use a similarity measure from retrieval (e.g., the cosine measure on TF-IDF vectors).
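A minimal sketch of the cosine measure on TF-IDF-style vectors (the document vectors here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative TF-IDF-like vectors for two documents.
doc1 = np.array([0.1, 0.0, 0.5, 0.2])
doc2 = np.array([0.0, 0.3, 0.4, 0.1])
print(cosine_similarity(doc1, doc2))
```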
[Jain, Murty & Flynn] Taxonomy of Clustering • Hierarchical: single link, complete link (e.g., HAC). • Partitional: square error (e.g., k-means), graph theoretic, mixture resolving (e.g., Expectation Maximization), mode seeking.
Hierarchical Algorithms • Produce a hierarchy of classes (a taxonomy) ranging from singleton clusters up to a single all-inclusive cluster. • Select a level of the hierarchy at which to extract the cluster set. • The representation is a dendrogram.
Complete-Link Revisited • Used to create a statistical thesaurus. • Agglomerative, hard, deterministic, batch. • Algorithm: 1. Start with one cluster per sample. 2. Find the two clusters with the lowest distance. 3. Merge the two clusters and add the merge to the hierarchy. 4. Repeat from 2 until a termination criterion is met or all clusters have merged.
Single-Link • Like complete-link except… • use the minimum of the distances between all pairs of samples in the two clusters (complete-link uses the maximum). • Single-link has a chaining effect and tends to produce elongated clusters, but it can construct more complex shapes.
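A small sketch of both linkage criteria using SciPy's hierarchical clustering (the sample points are illustrative, not the 16 points from the example figures below):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D points: two compact groups plus a "bridge" point between them.
points = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [3.5, 3.5]])

for method in ("single", "complete"):
    Z = linkage(points, method=method)                # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 clusters
    print(method, labels)
# In general, single-link chains along nearby points while complete-link favors compact clusters.
```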
Complete-Link Solution: dendrogram (merges C1–C15) over the 16 sample points (29,26), (1,28), (23,32), (46,30), (45,42), (9,16), (21,15), (29,22), (21,27), (33,21), (35,35), (4,9), (13,18), (26,25), (31,15), (42,45).
Single-Link Solution: dendrogram (merges C1–C15) over the same 16 points, showing a different merge structure from the complete-link result.
Hierarchical Agglomerative Clustering (HAC) • Agglomerative, hard, deterministic, batch. • Algorithm: 1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters. 2. Merge the most similar pair of clusters and update the proximity matrix. 3. Repeat 2 until all clusters are merged. • Variants differ in how the proximity matrix is updated, which allows them to combine the benefits of the single-link and complete-link algorithms.
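A schematic HAC loop with the cluster-to-cluster proximity factored out as a pluggable function (names are mine; a practical implementation would update a proximity matrix incrementally, e.g., via the Lance–Williams formula, rather than recomputing it each round):

```python
import numpy as np

def hac(dist, linkage_dist):
    """dist: symmetric n x n distance matrix; linkage_dist(dist, X, Y): distance
    between clusters X and Y (lists of sample indices). Returns the merge sequence."""
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage_dist(dist, clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges

# Single-link uses the minimum pairwise distance; complete-link uses the maximum.
single = lambda d, X, Y: min(d[i][j] for i in X for j in Y)
complete = lambda d, X, Y: max(d[i][j] for i in X for j in Y)

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
print(hac(dist, single))
```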
HAC for IR: Intra-cluster Similarity • Sim(X) = Σ_{d∈X} cos(d, c), where X is a cluster of TF-IDF document vectors, c is the centroid of cluster X, and d is a document. • Proximity is the similarity of all documents to the cluster centroid. • Select the pair of clusters that produces the smallest decrease in similarity, e.g., if merge(X,Y) => Z, then choose the pair maximizing Sim(Z) − (Sim(X) + Sim(Y)).
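A sketch of that merge criterion, assuming documents are rows of a NumPy TF-IDF matrix (function names and the toy vectors are illustrative):

```python
import numpy as np

def intra_sim(X):
    """Sim(X): sum of cosine similarities between each document in X and X's centroid."""
    c = X.mean(axis=0)
    c = c / np.linalg.norm(c)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return float((Xn @ c).sum())

def merge_score(X, Y):
    """Change in similarity if X and Y are merged; pick the candidate pair with the largest score."""
    Z = np.vstack([X, Y])
    return intra_sim(Z) - (intra_sim(X) + intra_sim(Y))

# Illustrative TF-IDF-like document vectors split into two candidate clusters.
X = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
Y = np.array([[0.0, 0.1, 0.9]])
print(merge_score(X, Y))
```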
HAC for IR – Alternatives • Centroid similarity: the cosine similarity between the centroids of the two clusters. • UPGMA: the average pairwise similarity between the documents of the two clusters (unweighted pair group method with arithmetic mean).
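A short sketch of these two alternative cluster-to-cluster similarities, under the same TF-IDF-matrix assumptions as above:

```python
import numpy as np

def _norm_rows(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def centroid_similarity(X, Y):
    """Cosine similarity between the centroids of two clusters of document vectors."""
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    return float(np.dot(cx, cy) / (np.linalg.norm(cx) * np.linalg.norm(cy)))

def upgma_similarity(X, Y):
    """Average pairwise cosine similarity between documents in X and documents in Y."""
    return float((_norm_rows(X) @ _norm_rows(Y).T).mean())

X = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
Y = np.array([[0.7, 0.3, 0.1], [0.0, 0.1, 0.9]])
print(centroid_similarity(X, Y), upgma_similarity(X, Y))
```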
Partitional Algorithms • Produce a flat set of unrelated clusters. • Issues: • how many clusters are enough? • how to search the space of possible partitions? • what is an appropriate clustering criterion?
k-Means • The number of clusters k is set by the user. • Non-deterministic (results depend on the initial centroids). • The clustering criterion is squared error: e²(S, L) = Σ_{j=1..K} Σ_{i=1..n_j} ‖x_i^(j) − c_j‖², where S is the document set, L is a clustering, K is the number of clusters, x_i^(j) is the i-th document in the j-th cluster, and c_j is the centroid of the j-th cluster.
k-Means Clustering Algorithm 1. Randomly select k samples as cluster centroids. 2. Assign each pattern to the closest cluster centroid. 3. Recompute centroids. 4. If the convergence criterion (e.g., minimal decrease in error or no change in cluster composition) is not met, return to 2.
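A compact NumPy sketch of those steps (initialisation scheme, convergence check, and sample data are illustrative choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd-style k-means: returns (centroids, labels) for data matrix X (n x d)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # step 1: random samples
    for _ in range(iters):
        # Step 2: assign each pattern to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute centroids (keep the old one if a cluster goes empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):                             # step 4: convergence check
            break
        centroids = new
    return centroids, labels

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
print(kmeans(X, k=2)[0])
```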
k-Means Sensitivity to Initialization: with K=3 on seven points A–G, the clustering found when the centroids start at A, D, F (red) differs from the one found when they start at A, B, C (yellow).
k-Means for IR • Update centroids incrementally. • Calculate centroids as in the hierarchical methods. • Can be refined into a divisive hierarchical method by starting with a single cluster and repeatedly splitting with k-means, keeping the split with the highest summed similarities, until k clusters are formed (bisecting k-means).
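A sketch of the bisecting idea using scikit-learn's KMeans (splitting the largest cluster each round is my simplifying assumption; a fuller version would try several splits and keep the one with the highest summed similarity, as the slide describes):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Start with one cluster and repeatedly bisect with 2-means until k clusters exist."""
    clusters = [X]
    while len(clusters) < k:
        # Pick a cluster to split; the largest is used here as a simple heuristic.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        target = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

X = np.vstack([np.random.randn(10, 2) + m for m in ([0, 0], [5, 0], [0, 5])])
print([len(c) for c in bisecting_kmeans(X, 3)])
```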
Other Types of Clustering Algorithms • Graph theoretic: construct the minimal spanning tree and delete the edges with the largest lengths. • Expectation Maximization (EM): assume the clusters are drawn from distributions; use maximum likelihood to estimate the parameters of those distributions. • Nearest neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
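A sketch of the graph-theoretic approach using SciPy's minimum spanning tree; deleting the (k − 1) longest edges to get k clusters is one simple way to realize "delete the edges with the largest lengths":

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(points, n_clusters):
    """Build the minimal spanning tree, delete the (n_clusters - 1) longest edges,
    and return the connected components as cluster labels."""
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist).toarray()
    # Zero out the largest edges so the tree falls apart into n_clusters pieces.
    longest = np.argsort(mst, axis=None)[::-1][:n_clusters - 1]
    mst.flat[longest] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels

points = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])
print(mst_clusters(points, 2))
```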
Comparison of Clustering Algorithms [Steinbach et al.] • Implement 3 versions of HAC and 2 versions of k-means. • Compare performance on documents hand-labelled as relevant to one of a set of classes. • Use well-known data sets (TREC). • Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better when considered over many runs. M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques, KDD Workshop on Text Mining, 2000.
Evaluation Metrics 1 • Evaluation: how to measure cluster quality? • Entropy: for cluster j, E_j = −Σ_i p_ij log p_ij, and for a clustering solution CS, E_CS = Σ_{j=1..m} (n_j / n) E_j, where p_ij is the probability that a member of cluster j belongs to class i, n_j is the size of cluster j, m is the number of clusters, and n is the number of docs.
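A sketch computing that entropy from hand-labelled classes and a clustering solution (the label arrays are illustrative):

```python
import numpy as np

def clustering_entropy(classes, clusters):
    """Size-weighted average entropy E_CS over all clusters; lower is better."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, total = len(classes), 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        p = np.bincount(members) / len(members)   # p_ij over classes i
        p = p[p > 0]
        e_j = -(p * np.log(p)).sum()              # entropy of cluster j
        total += (len(members) / n) * e_j         # weight by cluster size n_j / n
    return total

classes  = [0, 0, 1, 1, 2, 2]   # hand-labelled classes
clusters = [0, 0, 0, 1, 1, 1]   # clustering solution being evaluated
print(clustering_entropy(classes, clusters))
```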
Comparison Measure 2 • F measure: combines precision and recall. • Treat each cluster as the result of a query and each class as the relevant set of docs: Recall(i, j) = n_ij / n_i, Precision(i, j) = n_ij / n_j, F(i, j) = 2 · Precision(i, j) · Recall(i, j) / (Precision(i, j) + Recall(i, j)), and overall F = Σ_i (n_i / n) max_j F(i, j), where n_ij is the number of members of class i in cluster j, n_j is the number in cluster j, n_i is the number in class i, and n is the number of docs.
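A sketch of the overall F measure under the same conventions (illustrative labels again):

```python
import numpy as np

def clustering_f_measure(classes, clusters):
    """Overall F: for each class take the best F over clusters, weighted by class size."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, total = len(classes), 0.0
    for i in np.unique(classes):
        n_i, best = np.sum(classes == i), 0.0
        for j in np.unique(clusters):
            n_j = np.sum(clusters == j)
            n_ij = np.sum((classes == i) & (clusters == j))
            if n_ij == 0:
                continue
            recall, precision = n_ij / n_i, n_ij / n_j
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best   # weight each class's best F by n_i / n
    return total

classes  = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 0, 1, 1, 1]
print(clustering_f_measure(classes, clusters))
```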