Explore clustering processes, proximity measures, hierarchical clustering, K-means, and partitioning methods for data structuring and pattern recognition. Understand algorithms, distance functions, and advantages/disadvantages of clustering approaches.
Clustering algorithms and methods Andreas Held - Review and usage - 28 June 2007
Content • What is a cluster and the clustering process • Proximity measures • Hierarchical clustering • Agglomerative • Divisive • Partitioning clustering • K-means • Density-based Clustering • DBSCAN
The Cluster • A cluster is a group or accumulation of objects with similar attributes • Conditions for clusters: (i) homogeneity within a cluster (ii) heterogeneity towards other clusters • Possible objects in biology: - genes (transcriptomics) - individuals (plant systematics) - sequences (sequence analysis) Ruspini dataset: artificially generated dataset
Objectives of Clustering • Generation of clusters that are as homogeneous (internally) and heterogeneous (towards each other) as possible • Identification of categories, classes or groups in the data • Recognition of relations within the data • Concise structuring of the data (e.g. dendrogram)
The clustering process • Experimental data: the expression levels of genes under different conditions • Preprocessing: take only the expression levels for the conditions of interest => attribute vectors xi = (y1, …, ym) • Raw-data matrix: created by stacking the attribute vectors row by row • Proximity measures: define the distance or similarity functions and build the distance matrix, whose rows and columns confront the objects with each other • Clustering algorithm: choose a clustering algorithm and apply it to the data (a small sketch of these steps follows below)
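A minimal sketch of the preprocessing pipeline described above, assuming expression data as input; the gene names and expression values are invented purely for illustration:

```python
import numpy as np

# attribute vectors x_i = (y_1, ..., y_m): expression levels per condition (illustrative values)
expression = {
    "gene_A": [2.1, 0.4, 1.7],
    "gene_B": [1.9, 0.5, 1.6],
    "gene_C": [0.2, 3.3, 0.1],
}

# raw-data matrix: one row per object (gene), one column per condition
raw_matrix = np.array(list(expression.values()))

# distance matrix: objects are confronted with each other on rows and columns
n = raw_matrix.shape[0]
dist_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist_matrix[i, j] = np.linalg.norm(raw_matrix[i] - raw_matrix[j])  # Euclidean distance

print(dist_matrix)
```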
Distance functions for objects • d(x, y) calculates the distance between the two objects x and y • Distance measures: - Euclidean distance: d(x, y) = √( Σi (xi - yi)² ) - Manhattan distance: d(x, y) = Σi | xi - yi | - Maximum distance: d(x, y) = maxi ( | xi - yi | ) • Example: (plot of the three distances between two objects in the condition-1 / condition-2 plane)
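A short sketch of the three distance functions named above, operating on two attribute vectors of equal length; the example points are invented:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def maximum(x, y):
    return np.max(np.abs(np.asarray(x) - np.asarray(y)))

x, y = [1, 2], [4, 3]          # two objects measured under two conditions
print(euclidean(x, y))         # ~3.16
print(manhattan(x, y))         # 4
print(maximum(x, y))           # 3
```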
Distance measures for clusters • Calculating the distance between two clusters is important for some algorithms (e.g. hierarchical algorithms) • Single linkage: min { d(a, b) : a ∈ A, b ∈ B } • Complete linkage: max { d(a, b) : a ∈ A, b ∈ B } • Average linkage: (1 / (|A|·|B|)) Σ { d(a, b) : a ∈ A, b ∈ B } • (Plot: two clusters X and Y in the condition-1 / condition-2 plane illustrating the three linkages)
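The three linkage criteria can be sketched directly from their definitions, here with Euclidean distance between objects; the two example clusters are illustrative:

```python
import numpy as np
from itertools import product

def d(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def single_linkage(A, B):    # minimum over all object pairs
    return min(d(a, b) for a, b in product(A, B))

def complete_linkage(A, B):  # maximum over all object pairs
    return max(d(a, b) for a, b in product(A, B))

def average_linkage(A, B):   # mean over all object pairs
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

cluster_x = [(1, 1), (2, 2)]
cluster_y = [(4, 3), (5, 4), (5, 5)]
print(single_linkage(cluster_x, cluster_y), complete_linkage(cluster_x, cluster_y))
```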
Hierarchical Clustering • Two methods of hierarchical clustering: • agglomerative (bottom-up) • divisive (top-down) • Agglomerative vs. divisive: • divisive and agglomerative methods typically produce similar results • divisive algorithms need much more computing power, so in practice almost only agglomerative methods are used • Agglomerative algorithm example: UPGMA, used in phylogenetics • Conditions: • a given distance or similarity measure for objects • a given distance measure for clusters • the result is a dendrogram
Agglomerative hierarchical clustering Algorithm: • Start with the objects and a given distance measure between clusters • Construct the finest partition (every object is its own cluster) and compute the distance matrix D • Until all clusters are agglomerated: find the two clusters with the closest distance, merge them into one, and compute the new distance matrix • Distance measures used in the example: Manhattan distance (objects), single linkage (clusters) • (Figure: example objects A–E, the resulting clusters, and the corresponding dendrogram and distance matrix)
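A naive sketch of this agglomerative loop, using Manhattan distance between objects and single linkage between clusters; the points are invented. Each iteration merges the two closest clusters and records the merge height, which is the information a dendrogram is drawn from:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def single_linkage(A, B):
    return min(manhattan(a, b) for a in A for b in B)

points = {"A": (1, 1), "B": (2, 3), "C": (5, 6), "D": (6, 8), "E": (3, 7)}
clusters = {name: [p] for name, p in points.items()}   # finest partition: one object per cluster

merges = []
while len(clusters) > 1:
    # find the two clusters with the closest distance
    names = list(clusters)
    i, j = min(((a, b) for a in names for b in names if a < b),
               key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]))
    dist = single_linkage(clusters[i], clusters[j])
    merges.append((i, j, dist))
    clusters[i + j] = clusters.pop(i) + clusters.pop(j)  # agglomerate the two clusters

print(merges)   # merge order and heights, i.e. the dendrogram structure
```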
Hierarchical clustering - conclusions - • Advantages: • the dendrogram allows interpretation • depending on the level at which the dendrogram is cut, different clustering granularities can be explored • usable on any data space for which a distance measure can be defined • Disadvantages: • the user has to identify the clusters himself • repeated recalculation of the large distance matrix makes the algorithm resource-intensive • higher runtimes than non-hierarchical methods
Partitioning Clustering - k-means algorithm - • Partition n objects into k clusters • Calculate centroids from a given clustering: ci = (1 / |Ci|) Σx∈Ci x, where ci is the centroid of cluster Ci • Calculate a clustering from given centroids: assign each object to the cluster whose centroid is nearest
k-means algorithm principle • In general neither the centroids nor the clustering is known • The algorithm therefore starts by guessing cluster centres (centroids), derives a clustering from them, and then alternates between the two steps
k-means algorithm - Euclidean distance - Example: k = 3 0) Init: place 3 cluster centroids at random 1) Assign every object to the cluster with the nearest centroid 2) Compute the new cluster centroids from the resulting clustering 3) Repeat 1) and 2) until the centroids stop moving • In each step both the centroids and the clustering improve • (Figure: objects in the unit square being partitioned into 3 clusters)
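A minimal k-means sketch following these four steps with Euclidean distance; the data points, k, and the iteration limit are illustrative assumptions:

```python
import random
import numpy as np

def kmeans(data, k, max_iter=100, seed=None):
    rng = random.Random(seed)
    data = np.asarray(data, dtype=float)
    centroids = data[rng.sample(range(len(data)), k)]          # 0) random initialisation
    for _ in range(max_iter):
        # 1) assign every object to the nearest centroid
        labels = np.array([np.argmin([np.linalg.norm(x - c) for c in centroids])
                           for x in data])
        # 2) recompute centroids from the current clustering (keep old centroid if a cluster is empty)
        new_centroids = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):               # 3) centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

data = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.95), (0.5, 0.1)]
labels, centroids = kmeans(data, k=3, seed=0)
print(labels, centroids)
```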
k-means algorithm - problems - • Not every run achieves the same result, because the result depends on the random initialisation of the centroids => run the algorithm several times and take the best result • A fixed number of clusters k has to be known before starting the algorithm => try different values of k and take the best result • Computing the optimal number of clusters is not trivial; one approach is the elbow criterion (see the sketch below)
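A brief sketch of the elbow criterion, assuming scikit-learn is available: k-means is run for several values of k and the within-cluster sum of squares (inertia) is inspected; the "elbow" of the resulting curve suggests a reasonable k. The data here is random and purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(200, 2)           # illustrative 2-dimensional data
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
    print(k, round(inertia, 2))         # look for the k where the decrease flattens out
```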
k-means algorithm - advantages - • easy to implement • (roughly) linear runtime allows execution on large databases • for example the clustering of microarray data: depending on the experiment, 20,000-dimensional vectors
Partitioning Clustering - density-based method - • Assumption about the data space: regions where objects lie densely together are separated by regions where objects are sparse => clusters of arbitrary shape can be found
Density-based clustering - parameters - • : the environment around an object (o): all objects in the -environment of object o • MinPts: minimum number of objects, that have to be in an object-environment, so that this object is core object o o
Density-based clustering - definitions - • An object o ∈ O is a core object if |Nε(o)| ≥ MinPts • An object p ∈ O is directly density-reachable from q ∈ O if p ∈ Nε(q) and q is a core object • An object p is density-reachable from an object q if there is a chain of directly density-reachable objects between p and q
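These definitions translate almost directly into code; a small sketch with illustrative eps and min_pts values (the function and variable names are my own):

```python
import numpy as np

def neighbourhood(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (including itself)."""
    p = points[idx]
    return [j for j, q in enumerate(points) if np.linalg.norm(p - q) <= eps]

def is_core(points, idx, eps, min_pts):
    return len(neighbourhood(points, idx, eps)) >= min_pts

def directly_density_reachable(points, p_idx, q_idx, eps, min_pts):
    # p is directly density-reachable from q if p lies in N_eps(q) and q is a core object
    return is_core(points, q_idx, eps, min_pts) and p_idx in neighbourhood(points, q_idx, eps)

pts = np.array([(1.0, 1.0), (1.1, 1.0), (1.0, 1.2), (4.0, 4.0)])
print(is_core(pts, 0, eps=0.5, min_pts=3))   # True: three points lie within eps of pts[0]
```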
Density-based clustering - example: DBSCAN - Parameters: ε (see figure), MinPts = 4 Algorithm: 1) Iterate incrementally over all objects 2) Find a core object o (|Nε(o)| ≥ MinPts = 4) 3) Start a new cluster and assign the object to this cluster 4) Search for all objects density-reachable from it and assign them to the cluster as well
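A compact DBSCAN sketch following these steps; eps, min_pts and the example points are illustrative assumptions, not the values from the slide's figure:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    points = np.asarray(points, dtype=float)
    labels = [None] * len(points)                   # None = unvisited, -1 = noise
    def neighbours(i):
        return [j for j in range(len(points))
                if np.linalg.norm(points[i] - points[j]) <= eps]
    cluster_id = 0
    for i in range(len(points)):                    # 1) iterate over all objects
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:                    # 2) not a core object: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1                             # 3) start a new cluster with this core object
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:                                # 4) collect all density-reachable objects
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id              # former noise point becomes a border object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neigh = neighbours(j)
            if len(j_neigh) >= min_pts:             # j is itself a core object: expand further
                queue.extend(j_neigh)
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (9, 9)]
print(dbscan(pts, eps=0.5, min_pts=3))              # one dense cluster, the rest noise
```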
Density-based clustering - conclusions - • Advantages: • minimal domain knowledge required to determine the input parameters • discovery of clusters with arbitrary shape • good efficiency on large databases • Disadvantages: • problems on data spaces with strongly varying densities in different regions • poor efficiency on high-dimensional databases
More clustering methods • Hierarchical methods (agglomerative, divisive) • Partitioning methods (e.g. k-means) • Density-based methods (e.g. DBSCAN) • Fuzzy clustering • Grid-based methods • Constraint-based methods • High-dimensional clustering
Clustering algorithms - conclusions - • Choosing a clustering algorithm for a particular problem is not trivial • Individual algorithms cover only part of the given requirements (runtime behaviour, precision, influence of outliers, ...) • => no algorithm has been found (yet) that is optimally usable for every purpose