Clustering Methods Moses Charikar Computer Science
Clustering • Partition data items into groups (clusters) • Similar data items in the same group • Dissimilar items in different groups • Unsupervised learning: groups not known a priori Clustering Methods, Moses Charikar
Issues • Data representation • Definition of similarity/distance measure • Clustering procedure • Data abstraction • Cluster validation
Data Representation • Quantitative features • Continuous values • Discrete values • Qualitative features • Nominal (unordered) • Ordinal
Data Representation • d-dimensional points
Distance Measures • Measure of dissimilarity • l1, l2, lp norms
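A minimal sketch of the lp distance between two d-dimensional points, to make the norms above concrete. The function name `lp_distance` and the handling of `p = inf` (maximum coordinate difference) are assumptions made for this illustration, not part of the slides.

```python
def lp_distance(x, y, p=2.0):
    """Return the l_p distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1 for l_p to be a norm")
    if p == float("inf"):
        # Limit case: the largest coordinate-wise difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(lp_distance(a, b, p=1))  # l1 (Manhattan) distance
print(lp_distance(a, b, p=2))  # l2 (Euclidean) distance
```

Larger values of p put more weight on the coordinate with the biggest difference.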
An Impossibility Result • [Kleinberg] • There is no clustering function that satisfies all three of: • scale-invariance: scaling distances does not change result • richness: all partitions possible • (refinement) consistency: shrinking distances inside clusters, expanding distances across clusters
Clustering techniques • Agglomerative vs. Divisive • Hard vs. Fuzzy • Incremental vs. Non-incremental
Hierarchical Agglomerative Clustering • Initially, all points in distinct clusters • Maintain distance matrix on clusters • Merge most similar pair of clusters, update distance matrix • Repeat until all points in one cluster • Produces hierarchy of clusters (dendrogram)
Classical hierarchical methods • Single linkage clustering • Distance between clusters is minimum distance between points in clusters • Complete linkage clustering • Distance is maximum distance between points in clusters • Produces more compact clusters
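The merge loop and the two linkage rules can be sketched as follows. This is a naive illustration under stated assumptions: it recomputes cluster distances on every round instead of maintaining a distance matrix, it uses Euclidean distance, and it stops when `k` clusters remain (equivalent to cutting the dendrogram at that level). The names `cluster_distance` and `agglomerate` are hypothetical.

```python
import math
from itertools import combinations


def cluster_distance(a, b, linkage):
    """Distance between two clusters, each a list of points."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pairwise)      # closest pair of points
    if linkage == "complete":
        return max(pairwise)      # farthest pair of points
    raise ValueError("linkage must be 'single' or 'complete'")


def agglomerate(points, k, linkage="single"):
    """Repeatedly merge the closest pair of clusters until k remain."""
    clusters = [[p] for p in points]   # every point starts alone
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(
                clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i (i < j)
        del clusters[j]
    return clusters


pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerate(pts, 3, linkage="single"))
print(agglomerate(pts, 3, linkage="complete"))
```

Recording each merge and its distance, rather than only the final clusters, would give the full hierarchy.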
Other classical variants • Group Average linkage • Median linkage • Centroid linkage
Modern variants • CURE, ROCK, Chameleon, BIRCH • CURE • find clusters of arbitrary shapes • maintain multiple cluster representatives • initially selected as scattered points • shrunk toward the cluster centroid by a parameter (suppresses the effect of outliers)
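A rough sketch of the representative-point idea in CURE, covering only the two steps named above: choosing scattered points and shrinking them toward the centroid. The farthest-first selection rule, the shrink factor name `alpha`, and all function names are assumptions for illustration; this is not a full CURE implementation.

```python
import math


def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))


def scattered_points(cluster, m):
    """Pick up to m well-scattered points by farthest-first selection."""
    distinct = list(dict.fromkeys(cluster))   # drop duplicates, keep order
    c = centroid(cluster)
    # Start with the point farthest from the centroid.
    chosen = [max(distinct, key=lambda p: math.dist(p, c))]
    while len(chosen) < min(m, len(distinct)):
        rest = [p for p in distinct if p not in chosen]
        # Next: the point farthest from all points chosen so far.
        chosen.append(
            max(rest, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen


def shrink(reps, center, alpha):
    """Move each representative a fraction alpha of the way to the center."""
    return [tuple(r_i + alpha * (c_i - r_i) for r_i, c_i in zip(r, center))
            for r in reps]


cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
reps = scattered_points(cluster, 4)
print(shrink(reps, centroid(cluster), alpha=0.5))
```

With `alpha = 0` the representatives stay on the cluster boundary, and with `alpha = 1` they all collapse onto the centroid, so the parameter trades shape sensitivity against robustness to outliers.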
Divisive methods • Divide points into k clusters
Graph-theoretic • Build minimum spanning tree on points • Remove longest k-1 edges to produce k disjoint connected components • Clusters identical to those produced by single link clustering
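One way to sketch the spanning-tree approach is Kruskal's algorithm stopped early: adding edges in increasing order until k components remain gives the same components as building the full minimum spanning tree and removing its k-1 longest edges. The function name `mst_clusters` and the use of Euclidean distance on the complete graph are assumptions for this example.

```python
import math
from itertools import combinations


def mst_clusters(points, k):
    """Cluster points by stopping Kruskal's algorithm at k components."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs of point indices, shortest edge first.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    components = len(points)
    for i, j in edges:
        if components <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:              # edge joins two different components
            parent[ri] = rj
            components -= 1

    groups = {}
    for idx, p in enumerate(points):
        groups.setdefault(find(idx), []).append(p)
    return list(groups.values())


pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(mst_clusters(pts, 3))
```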
k-means • $k$: number of clusters • $n_j$: number of points in jth cluster • $x_{ij}$: ith point in jth cluster • $c_j$: center of jth cluster • $\min \sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert x_{ij} - c_j \rVert^2$ • Note: If clusters are fixed, best choice of center is centroid of cluster
k-means • Pick k cluster centers at random • Assign each point to closest center • Recompute cluster centers • Repeat until convergence • Finds local minimum • Initial choice of cluster centers is important
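The iteration above can be sketched in a few lines. Assumptions in this illustration: initial centers are sampled from the data points with a fixed `seed`, an empty cluster keeps its previous center, and `max_iter` caps the number of rounds. The names `mean` and `k_means` are hypothetical.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))


def k_means(points, k, seed=0, max_iter=100):
    """Alternate assignment and centroid steps until centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the centroid of its cluster.
        new_centers = [mean(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:           # converged
            break
        centers = new_centers
    return centers, clusters


pts = [(0.0,), (1.0,), (100.0,), (101.0,)]
centers, clusters = k_means(pts, k=2)
print(centers, clusters)
```

Because the result is only a local minimum, a common practice is to run the procedure from several seeds and keep the solution with the lowest objective value.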
k-means tutorial slides by Andrew Moore
Mixture models • Hypothesize that points are generated from a mixture of k Gaussians • Attempt to learn the best mixture of k Gaussians that explains the data • Apply EM (Expectation Maximization) • Informally, similar to k-means with fuzzy assignment of points to clusters
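A small one-dimensional sketch of EM for a Gaussian mixture shows the "fuzzy assignment" idea: each point gets a responsibility for every component instead of a single label. The initial means passed in by the caller, the unit initial variances, the fixed iteration count, and the variance floor `min_var` are all assumptions made for this illustration.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, init_means, iters=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    k = len(init_means)
    mus = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft (fuzzy) assignment of every point to every component.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, mus[j], variances[j])
                    for j in range(k)]
            total = sum(dens) or 1e-300      # guard against underflow to 0
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue                     # component owns no mass; skip
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)  # avoid collapsing to zero width
    return weights, mus, variances


data = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
print(em_gmm_1d(data, init_means=[1.0, 9.0]))
```

Replacing the responsibilities with a hard 0/1 choice of the most likely component recovers a procedure very close to k-means.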
Mixture models • Gaussian mixture models tutorial slides by Andrew Moore
Density Based Partitioning • DBSCAN • identify dense connected regions of the space • eps-neighborhood: points within distance eps • core object: point with number of points in its eps-neighborhood > threshold • y is density reachable from x if there exists a path of core objects from x to y, each within distance eps of the previous one
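The definitions above translate into a short breadth-first expansion from core objects. This sketch follows the slide's strict `> threshold` test (the usual DBSCAN formulation compares the neighborhood size against a minimum point count instead), counts a point as part of its own neighborhood, and labels unreachable points `-1`; those conventions and the function name are assumptions of the example.

```python
import math
from collections import deque


def dbscan(points, eps, threshold):
    """Return a cluster label per point; -1 marks noise."""
    n = len(points)
    # eps-neighborhood of each point (includes the point itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) > threshold for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from an unvisited core object.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if labels[v] == -1:
                    labels[v] = cluster
                    if is_core[v]:
                        queue.append(v)   # only core objects extend the path
        cluster += 1
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
print(dbscan(pts, eps=1.5, threshold=3))
```

Unlike k-means, the number of clusters is not given in advance; it falls out of `eps` and `threshold`.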
Optimization approaches • Formulate objective function for clustering • View clustering as optimization problem • Find clustering so as to minimize/maximize objective function • Finding optimum solutions is hard! • Design algorithm with approximation guarantee • α-approximation: solution returned is within a factor α of the optimum solution
Factors affecting complexity • number of clusters • distance function on points • Euclidean distances (dimension matters) • arbitrary metric • no triangle inequality • Objective function
Clustering objective functions • k-center • max distance to cluster center • k-median • sum of distances to cluster centers • compare to k-means • minsum k-clustering • sum of distances within clusters
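To compare these objectives side by side, the sketch below evaluates each one for a fixed solution. Assumptions: Euclidean distance, every point is served by its nearest center, and the minsum cost counts each unordered pair inside a cluster once. The function names are hypothetical.

```python
import math
from itertools import combinations


def nearest_distances(points, centers):
    """Distance from each point to its closest center."""
    return [min(math.dist(p, c) for c in centers) for p in points]


def k_center_cost(points, centers):
    """Maximum distance from any point to its center."""
    return max(nearest_distances(points, centers))


def k_median_cost(points, centers):
    """Sum of distances from points to their centers."""
    return sum(nearest_distances(points, centers))


def k_means_cost(points, centers):
    """Sum of squared distances from points to their centers."""
    return sum(d ** 2 for d in nearest_distances(points, centers))


def min_sum_cost(clusters):
    """Sum of pairwise distances within each cluster (no centers needed)."""
    return sum(math.dist(p, q)
               for cluster in clusters
               for p, q in combinations(cluster, 2))


pts = [(0, 0), (2, 0), (10, 0)]
ctrs = [(1, 0), (10, 0)]
print(k_center_cost(pts, ctrs), k_median_cost(pts, ctrs), k_means_cost(pts, ctrs))
print(min_sum_cost([[(0, 0), (2, 0)], [(10, 0)]]))
```

The k-center cost is driven entirely by the worst-served point, while k-median and k-means average over all points, with k-means penalizing large distances more heavily.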
Graph based clustering • Given graph on items • weights on edges represent similarity (dissimilarity) • Graph partitioning • Divide graph into k pieces, minimize (maximize) weight of edges cut
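The quantity being optimized, the weight of edges cut, is simple to evaluate for a given partition; the hard part is searching over partitions. In this sketch the graph representation (a dict from node pairs to weights) and the name `cut_weight` are assumptions for illustration.

```python
def cut_weight(edges, piece_of):
    """Total weight of edges whose endpoints lie in different pieces.

    edges:    dict mapping (u, v) node pairs to edge weights
    piece_of: dict mapping each node to the index of its piece
    """
    return sum(w for (u, v), w in edges.items()
               if piece_of[u] != piece_of[v])


# Similarity graph: two tight pairs joined by one weak edge.
edges = {("a", "b"): 5.0, ("c", "d"): 4.0, ("b", "c"): 1.0}
print(cut_weight(edges, {"a": 0, "b": 0, "c": 1, "d": 1}))  # cuts only b-c
print(cut_weight(edges, {"a": 0, "b": 1, "c": 1, "d": 0}))  # cuts a-b and c-d
```

With similarity weights a good partition minimizes this value; with dissimilarity weights it maximizes it.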
Correlation clustering • Given judgements of similarity/dissimilarity between pairs of items, i.e. graph with edges labeled + and - • Find partitioning into clusters so that + edges are inside clusters and - edges are across clusters • If labeling is perfect, problem is easy • maximize agreements with labeling (suitable when optimal solution disagrees with large number of labels) • minimize disagreements with labeling (suitable when optimal solution agrees with almost all labels)
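Counting agreements and disagreements for a candidate clustering makes the two objectives concrete. The input format (a dict from node pairs to `'+'` or `'-'`) and the function name are assumptions for this example; the triangle below is a small inconsistent labeling for which no clustering achieves zero disagreements.

```python
def count_disagreements(labels, cluster_of):
    """Number of labeled pairs that the clustering contradicts.

    labels:     dict mapping (u, v) pairs to '+' (similar) or '-' (dissimilar)
    cluster_of: dict mapping each node to its cluster id
    """
    disagreements = 0
    for (u, v), sign in labels.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        # A '+' edge wants its endpoints together, a '-' edge wants them apart.
        if (sign == "+") != same_cluster:
            disagreements += 1
    return disagreements


# Inconsistent triangle: a~b and b~c are similar, but a and c are not.
labels = {("a", "b"): "+", ("b", "c"): "+", ("a", "c"): "-"}
for clustering in ({"a": 0, "b": 0, "c": 0},
                   {"a": 0, "b": 0, "c": 1},
                   {"a": 0, "b": 1, "c": 2}):
    bad = count_disagreements(labels, clustering)
    print(clustering, "disagreements:", bad, "agreements:", len(labels) - bad)
```

Agreements and disagreements always sum to the number of labeled pairs, so the two objectives share the same optimal clustering; they differ in how meaningful an approximation guarantee is.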
Other issues • Cluster abstraction • assigning meaning to clusters • Outliers • High dimensional data • Large data set size
Handling Large Data Sets • Random Sampling • Sample points and cluster sample • Streaming Algorithms • Cluster in one pass over data • Compact data summaries • maintain sketches of clusters • sketches of data points • Dimension reduction • singular value decomposition