Explore clustering processes, proximity measures, hierarchical clustering, K-means, and partitioning methods for data structuring and pattern recognition. Understand algorithms, distance functions, and advantages/disadvantages of clustering approaches.
Clustering algorithms and methods Andreas Held - Review and usage - 28 June 2007
Content • What is a cluster and the clustering process • Proximity measures • Hierarchical clustering • Agglomerative • Divisive • Partitioning clustering • K-means • Density-based Clustering • DBSCAN
The Cluster • A cluster is a group or accumulation of objects with similar attributes • Conditions for clusters: (i) homogeneity within a cluster (ii) heterogeneity towards other clusters • Possible objects in biology: - genes (transcriptomics) - individuals (plant systematics) - sequences (sequence analysis) Ruspini dataset: artificially generated dataset
Objectives of Clustering • Generation of clusters that are as homogeneous (internally) and heterogeneous (towards each other) as possible • Identification of categories, classes or groups in the data • Recognition of relations within the data • Concise structuring of the data (e.g. dendrogram)
The clustering process • Experimental data: the expression levels of genes under different conditions • Preprocessing: take only the expression levels for the conditions of interest => attribute vectors xi = (y1, …, ym) • Raw-data matrix: created by stacking the attribute vectors row by row • Proximity measures: define the distance or similarity functions and build the distance matrix, whose rows and columns confront the objects with each other • Clustering algorithm: choose a clustering algorithm and apply it to the data (a small sketch of these steps follows below)
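A minimal sketch of the preprocessing pipeline described above, assuming expression data as input; the gene names and expression values are invented purely for illustration:

```python
import numpy as np

# attribute vectors x_i = (y_1, ..., y_m): expression levels per condition (illustrative values)
expression = {
    "gene_A": [2.1, 0.4, 1.7],
    "gene_B": [1.9, 0.5, 1.6],
    "gene_C": [0.2, 3.3, 0.1],
}

# raw-data matrix: one row per object (gene), one column per condition
raw_matrix = np.array(list(expression.values()))

# distance matrix: objects are confronted with each other on rows and columns
n = raw_matrix.shape[0]
dist_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist_matrix[i, j] = np.linalg.norm(raw_matrix[i] - raw_matrix[j])  # Euclidean distance

print(dist_matrix)
```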
Distance functions for objects • d(x, y) calculates the distance between the two objects x and y • Distance measures: - Euclidean distance: d(x, y) = √( Σi (xi - yi)² ) - Manhattan distance: d(x, y) = Σi | xi - yi | - Maximum distance: d(x, y) = maxi ( | xi - yi | ) • Example: (plot of the three distances between two objects in the condition-1 / condition-2 plane)
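A short sketch of the three distance functions named above, operating on two attribute vectors of equal length; the example points are invented:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def maximum(x, y):
    return np.max(np.abs(np.asarray(x) - np.asarray(y)))

x, y = [1, 2], [4, 3]          # two objects measured under two conditions
print(euclidean(x, y))         # ~3.16
print(manhattan(x, y))         # 4
print(maximum(x, y))           # 3
```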
Distance measures for clusters • Calculating the distance between two clusters is important for some algorithms (e.g. hierarchical algorithms) • Single linkage: min { d(a, b) : a ∈ A, b ∈ B } • Complete linkage: max { d(a, b) : a ∈ A, b ∈ B } • Average linkage: (1 / (|A|·|B|)) Σ { d(a, b) : a ∈ A, b ∈ B } • (Plot: two clusters X and Y in the condition-1 / condition-2 plane illustrating the three linkages)
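The three linkage criteria can be sketched directly from their definitions, here with Euclidean distance between objects; the two example clusters are illustrative:

```python
import numpy as np
from itertools import product

def d(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def single_linkage(A, B):    # minimum over all object pairs
    return min(d(a, b) for a, b in product(A, B))

def complete_linkage(A, B):  # maximum over all object pairs
    return max(d(a, b) for a, b in product(A, B))

def average_linkage(A, B):   # mean over all object pairs
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

cluster_x = [(1, 1), (2, 2)]
cluster_y = [(4, 3), (5, 4), (5, 5)]
print(single_linkage(cluster_x, cluster_y), complete_linkage(cluster_x, cluster_y))
```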
Hierarchical Clustering • Two methods of hierarchical clustering: • agglomerative (bottom-up) • divisive (top-down) • Agglomerative vs. divisive: • divisive and agglomerative methods typically produce similar results • divisive algorithms need much more computing power, so in practice almost only agglomerative methods are used • Agglomerative algorithm example: UPGMA, used in phylogenetics • Conditions: • a given distance or similarity measure for objects • a given distance measure for clusters • the result is a dendrogram
Agglomerative hierarchical clustering Algorithm: • Start with the objects and a given distance measure between clusters • Construct the finest partition (every object is its own cluster) and compute the distance matrix D • Until all clusters are agglomerated: find the two clusters with the closest distance, merge them into one, and compute the new distance matrix • Distance measures used in the example: Manhattan distance (objects), single linkage (clusters) • (Figure: example objects A–E, the resulting clusters, and the corresponding dendrogram and distance matrix)
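A naive sketch of this agglomerative loop, using Manhattan distance between objects and single linkage between clusters; the points are invented. Each iteration merges the two closest clusters and records the merge height, which is the information a dendrogram is drawn from:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def single_linkage(A, B):
    return min(manhattan(a, b) for a in A for b in B)

points = {"A": (1, 1), "B": (2, 3), "C": (5, 6), "D": (6, 8), "E": (3, 7)}
clusters = {name: [p] for name, p in points.items()}   # finest partition: one object per cluster

merges = []
while len(clusters) > 1:
    # find the two clusters with the closest distance
    names = list(clusters)
    i, j = min(((a, b) for a in names for b in names if a < b),
               key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]))
    dist = single_linkage(clusters[i], clusters[j])
    merges.append((i, j, dist))
    clusters[i + j] = clusters.pop(i) + clusters.pop(j)  # agglomerate the two clusters

print(merges)   # merge order and heights, i.e. the dendrogram structure
```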
Hierarchical clustering - conclusions - • Advantages: • the dendrogram allows interpretation • depending on the level at which the dendrogram is cut, different clustering granularities can be explored • usable on any data space for which a distance measure can be defined • Disadvantages: • the user has to identify the clusters himself • repeated recalculation of the large distance matrix makes the algorithm resource-intensive • higher runtimes than non-hierarchical methods
Partitioning Clustering - k-means algorithm - • Partition n objects into k clusters • Calculate centroids from a given clustering: ci = (1 / |Ci|) Σx∈Ci x, where ci is the centroid of cluster Ci • Calculate a clustering from given centroids: assign each object to the cluster whose centroid is nearest
k-means algorithm principle • In general neither the centroids nor the clustering is known • The algorithm therefore starts by guessing cluster centres (centroids), derives a clustering from them, and then alternates between the two steps
k-means algorithm - Euclidean distance - Example: k = 3 0) Init: place 3 cluster centroids at random 1) Assign every object to the cluster with the nearest centroid 2) Compute the new cluster centroids from the resulting clustering 3) Repeat 1) and 2) until the centroids stop moving • In each step both the centroids and the clustering improve • (Figure: objects in the unit square being partitioned into 3 clusters)
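A minimal k-means sketch following these four steps with Euclidean distance; the data points, k, and the iteration limit are illustrative assumptions:

```python
import random
import numpy as np

def kmeans(data, k, max_iter=100, seed=None):
    rng = random.Random(seed)
    data = np.asarray(data, dtype=float)
    centroids = data[rng.sample(range(len(data)), k)]          # 0) random initialisation
    for _ in range(max_iter):
        # 1) assign every object to the nearest centroid
        labels = np.array([np.argmin([np.linalg.norm(x - c) for c in centroids])
                           for x in data])
        # 2) recompute centroids from the current clustering (keep old centroid if a cluster is empty)
        new_centroids = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):               # 3) centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

data = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.95), (0.5, 0.1)]
labels, centroids = kmeans(data, k=3, seed=0)
print(labels, centroids)
```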
k-means algorithm - problems - • Not every run achieves the same result, because the result depends on the random initialisation of the centroids => run the algorithm several times and take the best result • A fixed number of clusters k has to be known before starting the algorithm => try different values of k and take the best result • Computing the optimal number of clusters is not trivial; one approach is the elbow criterion (see the sketch below)
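A brief sketch of the elbow criterion, assuming scikit-learn is available: k-means is run for several values of k and the within-cluster sum of squares (inertia) is inspected; the "elbow" of the resulting curve suggests a reasonable k. The data here is random and purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(200, 2)           # illustrative 2-dimensional data
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
    print(k, round(inertia, 2))         # look for the k where the decrease flattens out
```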
k-means algorithm - advantages - • easy to implement • (roughly) linear runtime allows execution on large databases • for example the clustering of microarray data: depending on the experiment, 20,000-dimensional vectors
Partitioning Clustering - density-based method - • Assumption about the data space: regions where objects lie densely together are separated by regions where objects are sparse => clusters of arbitrary shape can be found
Density-based clustering - parameters - • : the environment around an object (o): all objects in the -environment of object o • MinPts: minimum number of objects, that have to be in an object-environment, so that this object is core object o o
Density-based clustering - definitions - • An object o ∈ O is a core object if |Nε(o)| ≥ MinPts • An object p ∈ O is directly density-reachable from q ∈ O if p ∈ Nε(q) and q is a core object • An object p is density-reachable from an object q if there is a chain of directly density-reachable objects between p and q
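These definitions translate almost directly into code; a small sketch with illustrative eps and min_pts values (the function and variable names are my own):

```python
import numpy as np

def neighbourhood(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (including itself)."""
    p = points[idx]
    return [j for j, q in enumerate(points) if np.linalg.norm(p - q) <= eps]

def is_core(points, idx, eps, min_pts):
    return len(neighbourhood(points, idx, eps)) >= min_pts

def directly_density_reachable(points, p_idx, q_idx, eps, min_pts):
    # p is directly density-reachable from q if p lies in N_eps(q) and q is a core object
    return is_core(points, q_idx, eps, min_pts) and p_idx in neighbourhood(points, q_idx, eps)

pts = np.array([(1.0, 1.0), (1.1, 1.0), (1.0, 1.2), (4.0, 4.0)])
print(is_core(pts, 0, eps=0.5, min_pts=3))   # True: three points lie within eps of pts[0]
```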
Density-based clustering - example: DBSCAN - Parameters: ε (see figure), MinPts = 4 Algorithm: 1) Iterate incrementally over all objects 2) Find a core object o (|Nε(o)| ≥ MinPts = 4) 3) Start a new cluster and assign the object to this cluster 4) Search for all objects density-reachable from it and assign them to the cluster as well
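A compact DBSCAN sketch following these steps; eps, min_pts and the example points are illustrative assumptions, not the values from the slide's figure:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    points = np.asarray(points, dtype=float)
    labels = [None] * len(points)                   # None = unvisited, -1 = noise
    def neighbours(i):
        return [j for j in range(len(points))
                if np.linalg.norm(points[i] - points[j]) <= eps]
    cluster_id = 0
    for i in range(len(points)):                    # 1) iterate over all objects
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:                    # 2) not a core object: mark as noise for now
            labels[i] = -1
            continue
        cluster_id += 1                             # 3) start a new cluster with this core object
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:                                # 4) collect all density-reachable objects
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id              # former noise point becomes a border object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neigh = neighbours(j)
            if len(j_neigh) >= min_pts:             # j is itself a core object: expand further
                queue.extend(j_neigh)
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (9, 9)]
print(dbscan(pts, eps=0.5, min_pts=3))              # one dense cluster, the rest noise
```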
Density-based clustering - conclusions - • Advantages: • minimal domain knowledge required to determine the input parameters • discovery of clusters with arbitrary shape • good efficiency on large databases • Disadvantages: • problems on data spaces with strongly varying densities in different regions • poor efficiency on high-dimensional databases
More clustering methods • Hierarchical methods (agglomerative, divisive) • Partitioning methods (e.g. k-means) • Density-based methods (e.g. DBSCAN) • Fuzzy clustering • Grid-based methods • Constraint-based methods • High-dimensional clustering
Clustering algorithms - conclusions - • Choosing a clustering algorithm for a particular problem is not trivial • Individual algorithms cover only part of the given requirements (runtime behaviour, precision, influence of outliers, ...) • => no algorithm has been found (yet) that is optimally usable for every purpose