Clustering An overview of clustering algorithms Dènis de Keijzer GIA 2004
Overview • Algorithms • GRAVIclust • AUTOCLUST • AUTOCLUST+ • 3D Boundary-based Clustering • SNN
Gravity based spatial clustering • GRAVIclust • Initialisation Phase • calculate the initial centre clusters • Optimisation Phase • improve the position of the cluster centres so as to achieve a solution which minimizes the distance function
GRAVIclust: Initialisation Phase • Input: • set of points P • matrix of distances between all pairs of points • assumption: actual access-path distance • exists in GIS maps • e.g. http://www.transinfo.qld.gov.au • very versatile • footpath • road map • rail map • # of required clusters k
GRAVIclust: Initialisation Phase • Step 1: • calculate first initial centre • the point with the largest number of points within radius r • remove first initial centre & all points within radius r from further consideration • Step 2: • repeat Step 1 until k initial centres have been chosen • Step 3: • create initial clusters by assigning all points to the closest cluster centre
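The initialisation steps above can be sketched in Python. This is a minimal, illustrative version (the function and variable names are my own, not from the paper), assuming the distance matrix is supplied as a dict keyed by point pairs:

```python
def initial_centres(points, dist, k, r):
    """GRAVIclust initialisation (sketch): greedily pick k centres.

    points : list of point ids
    dist   : dict (p, q) -> distance (in GRAVIclust this would be the
             precomputed access-path distance matrix)
    r      : radius used to count neighbours
    """
    remaining = set(points)
    centres = []
    while len(centres) < k and remaining:
        # Step 1: the point with the largest number of points within r
        best = max(sorted(remaining),
                   key=lambda p: sum(1 for q in remaining
                                     if q != p and dist[p, q] <= r))
        centres.append(best)
        # remove the centre and all points within radius r
        remaining -= {q for q in remaining if dist[best, q] <= r}
    return centres

def initial_clusters(points, centres, dist):
    # Step 3: assign every point to the closest cluster centre
    return {p: min(centres, key=lambda c: dist[p, c]) for p in points}
```

With two well-separated groups of points and `k = 2`, the two chosen centres fall one in each group, and the initial clusters recover the groups.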
GRAVIclust: radius calculation • Radius r • calculated based on the area of the region considered for clustering • static radius • based on the assumption that all clusters are of the same size • dynamic radius • recalculated after each initial cluster centre is chosen
GRAVIclust: Static vs. Dynamic • Static • reduced computation • the # of points within radius r has to be calculated only once • not suitable for problems where the points are separated by large empty areas • Dynamic • increases computation time • ensures the radius is adjusted as points are removed • The two differ only when the point distribution is non-uniform
GRAVIclust: Optimisation Phase • Step 1: • for each cluster, calculate a new centre • based on the point closest to the cluster's centre of gravity • Step 2: • re-assign points to the new cluster centres • Step 3: • recalculate the distance function • never greater than the previous value • Step 4: • repeat Steps 1 to 3 until the value of the distance function equals the previous value
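A sketch of the optimisation loop (again with my own naming; `coords` is assumed to hold planar coordinates, used only for the centre-of-gravity step, and the empty-cluster corner case is ignored):

```python
def optimise(points, centres, dist, coords):
    """GRAVIclust optimisation (sketch): move each centre to the member
    point closest to the cluster's centre of gravity, re-assign, and
    stop once the distance function stops decreasing."""
    assignment = {p: min(centres, key=lambda c: dist[p, c]) for p in points}
    prev = float('inf')
    while True:
        new_centres = []
        for c in centres:
            members = [p for p in points if assignment[p] == c]
            gx = sum(coords[p][0] for p in members) / len(members)
            gy = sum(coords[p][1] for p in members) / len(members)
            # Step 1: new centre = member closest to the centre of gravity
            new_centres.append(min(members, key=lambda p:
                (coords[p][0] - gx) ** 2 + (coords[p][1] - gy) ** 2))
        centres = new_centres
        # Step 2: re-assign points to the new centres
        assignment = {p: min(centres, key=lambda c: dist[p, c]) for p in points}
        # Steps 3-4: the cost never increases; stop when it stops falling
        cost = sum(dist[p, assignment[p]] for p in points)
        if cost >= prev:
            return centres, assignment
        prev = cost
```

Because the cost is monotonically non-increasing and there are finitely many centre configurations, the loop terminates at a stable point.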
GRAVIclust • Deterministic • Can handle obstacles • Monotonic convergence of the distance function to a stable point
AUTOCLUST • Definitions
AUTOCLUST • Phase 1: • finding boundaries • Phase 2: • restoring and re-attaching • Phase 3: • detecting second-order inconsistency
AUTOCLUST: Phase 1 • Finding boundaries • Calculate • Delaunay Diagram • for each point pi • ShortEdges(pi) • LongEdges(pi) • OtherEdges(pi) • Remove • ShortEdges(pi) and LongEdges(pi)
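Phase 1 can be sketched as follows. The definitions slide is not reproduced above, so the classification criteria here are an assumption based on the published AUTOCLUST algorithm: Local_Mean(pi) is the mean length of the Delaunay edges incident to pi, MeanStDev(P) is the average over all points of the per-point standard deviation, and an edge at pi counts as "short" below Local_Mean(pi) − MeanStDev(P) and "long" above Local_Mean(pi) + MeanStDev(P). The Delaunay diagram itself is assumed precomputed and passed in as an edge list:

```python
import math
from collections import defaultdict

def classify_edges(edges, coords):
    """AUTOCLUST Phase 1 (sketch): split each point's incident Delaunay
    edges into ShortEdges, LongEdges and OtherEdges."""
    def length(p, q):
        (x1, y1), (x2, y2) = coords[p], coords[q]
        return math.hypot(x1 - x2, y1 - y2)

    incident = defaultdict(list)
    for p, q in edges:
        incident[p].append(q)
        incident[q].append(p)

    local_mean = {p: sum(length(p, q) for q in ns) / len(ns)
                  for p, ns in incident.items()}
    local_stdev = {p: math.sqrt(sum((length(p, q) - local_mean[p]) ** 2
                                    for q in ns) / len(ns))
                   for p, ns in incident.items()}
    mean_stdev = sum(local_stdev.values()) / len(local_stdev)

    short, long_, other = defaultdict(list), defaultdict(list), defaultdict(list)
    for p, ns in incident.items():
        for q in ns:
            d = length(p, q)
            if d < local_mean[p] - mean_stdev:
                short[p].append(q)
            elif d > local_mean[p] + mean_stdev:
                long_[p].append(q)
            else:
                other[p].append(q)
    return short, long_, other
```

On two tight pairs of points joined by one long bridge edge, the bridge is classified as long at both of its endpoints, so removing ShortEdges and LongEdges separates the clusters.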
AUTOCLUST: Phase 2 • Restoring and re-attaching • for each point pi where ShortEdges(pi) ≠ ∅ • Determine a candidate connected component C for pi • If there are 2 edges ej = (pi, pj) and ek = (pi, pk) in ShortEdges(pi) with CC[pj] ≠ CC[pk], then • Compute, for each edge e = (pi, pj) ∈ ShortEdges(pi), the size ||CC[pj]|| and let M = max over e = (pi, pj) ∈ ShortEdges(pi) of ||CC[pj]|| • Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, let C be the one with the shortest edge to pi)
AUTOCLUST: Phase 2 • Restoring and re-attaching • for each point pi where ShortEdges(pi) ≠ ∅ • Determine a candidate connected component C for pi • If … • Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(pi) connect pi to
AUTOCLUST: Phase 2 • Restoring and re-attaching • for each point pi where ShortEdges(pi) ≠ ∅ • If the edges in OtherEdges(pi) connect to a connected component different from C, remove them. Note that • all edges in OtherEdges(pi) are removed, and • only in this case will pi swap connected components • Add all edges e ∈ ShortEdges(pi) that connect to C
AUTOCLUST: Phase 3 • Detecting second-order inconsistency • compute the LocalMean for 2-neighbourhoods • remove all edges in N2,G(pi) that are long edges
AUTOCLUST • No user-supplied arguments • eliminates expensive human-based exploration time for finding best-fit arguments • Robust to noise, outliers, bridges and the type of distribution • Able to detect clusters with arbitrary shapes, different sizes and different densities • Can handle multiple bridges • Runs in O(n log n) time
AUTOCLUST+ • Construct Delaunay Diagram • Calculate MeanStDev(P) • For all edges e, remove e if it intersects some obstacles • Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
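The obstacle step can be sketched with a standard segment-intersection (orientation) test. The helper names are my own, and obstacles are assumed to be given as line segments:

```python
def crosses(a, b, c, d):
    """True if segment a-b properly intersects segment c-d
    (orientation test; shared endpoints are not treated as crossing)."""
    def orient(p, q, r):
        return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (orient(a, b, c) * orient(a, b, d) < 0 and
            orient(c, d, a) * orient(c, d, b) < 0)

def remove_obstructed(edges, coords, obstacles):
    """AUTOCLUST+ pre-step (sketch): drop every Delaunay edge that
    intersects some obstacle segment; AUTOCLUST's three phases are
    then applied to the remaining planar graph."""
    return [(p, q) for p, q in edges
            if not any(crosses(coords[p], coords[q], c, d)
                       for c, d in obstacles)]
```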
3D Boundary-based Clustering • Benefits from 3D Clustering • more accurate spatial analysis • distinguish • positive clusters: • clusters in higher dimensions but not in lower dimensions • negative clusters: • clusters in lower dimensions but not in higher dimensions
3D Boundary-based Clustering • Based on AUTOCLUST • Uses Delaunay Tetrahedrizations • Definitions: • ej potential inter-cluster edge if:
3D Boundary-based Clustering • Phase I • For all pi ∈ P, classify each edge ej incident to pi into one of three groups • ShortEdges(pi) when the length of ej is below the range in AI(pi) • LongEdges(pi) when the length of ej is above the range in AI(pi) • OtherEdges(pi) when the length of ej is within AI(pi) • For all pi ∈ P, remove all edges in ShortEdges(pi) and LongEdges(pi)
3D Boundary-based Clustering • Phase II • Recover ShortEdges(pi) incident to border points using connected-component analysis • Phase III • Remove exceptionally long edges in local regions
Shared Nearest Neighbour • Clustering in higher dimensions • Distances or similarities between points become more uniform, making clustering more difficult • Also, similarity between points can be misleading • e.g. a point can be more similar to a point that “actually” belongs to a different cluster • Solution • shared nearest neighbour approach to similarity
SNN: An alternative definition of similarity • Euclidean distance • most common distance metric used • while useful in low dimensions, it doesn’t work well in high dimensions
SNN: An alternative definition of similarity • Define similarity in terms of their shared nearest neighbours • the similarity of the points is “confirmed” by their common shared nearest neighbours
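A minimal sketch of this similarity, assuming each point's k-nearest-neighbour set has already been computed (some variants additionally require the two points to appear in each other's lists before counting):

```python
def snn_similarity(knn):
    """Shared-nearest-neighbour similarity (sketch): the similarity of
    two points is the number of neighbours their k-NN lists share.

    knn: dict point -> set of its k nearest neighbours
    """
    sim = {}
    for p in knn:
        for q in knn:
            if p < q:
                # similarity is "confirmed" by common nearest neighbours
                sim[p, q] = len(knn[p] & knn[q])
    return sim
```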
SNN: An alternative definition of density • SNN similarity, with the k-nearest-neighbour approach • if the k nearest neighbours of a point, with respect to SNN similarity, are close, then we say that there is a high density at this point • since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space
SNN: Algorithm • Compute the similarity matrix • corresponds to a similarity graph with data points as nodes and edges whose weights are the similarities between data points • Sparsify the similarity matrix by keeping only the k most similar neighbours • corresponds to keeping only the k strongest links of the similarity graph • Construct the shared nearest neighbour graph from the sparsified similarity matrix • Find the SNN density of each point • Find the core points • Form clusters from the core points • Discard all noise points • Assign all non-noise, non-core points to clusters
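The pipeline above can be sketched compactly. The parameter names `eps` (minimum shared-neighbour count for a "strong" link) and `min_pts` (minimum SNN density of a core point) are my own, DBSCAN-style choices, not fixed by the slides:

```python
def snn_cluster(knn, eps, min_pts):
    """SNN clustering sketch. knn: dict point -> set of its k nearest
    neighbours. Returns point -> cluster label (None = noise)."""
    pts = list(knn)
    # Steps 1-3: sparsified SNN graph - keep links between points that
    # are in each other's k-NN lists, weighted by shared-neighbour count
    strength = {(p, q): len(knn[p] & knn[q])
                for p in pts for q in pts
                if p < q and q in knn[p] and p in knn[q]}
    # Step 4: SNN density = number of strong links at each point
    density = {p: sum(1 for (a, b), w in strength.items()
                      if w >= eps and p in (a, b)) for p in pts}
    # Step 5: core points
    core = {p for p in pts if density[p] >= min_pts}
    # Step 6: merge core points that share a strong link (union-find)
    parent = {p: p for p in pts}
    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
    for (a, b), w in strength.items():
        if w >= eps and a in core and b in core:
            parent[find(a)] = find(b)
    labels = {p: find(p) for p in core}
    # Steps 7-8: attach non-core points to a strongly linked core
    # cluster; otherwise discard them as noise (label None)
    for p in pts:
        if p in core:
            continue
        linked = [c for c in core
                  if strength.get((min(p, c), max(p, c)), 0) >= eps]
        labels[p] = find(linked[0]) if linked else None
    return labels
```

On two mutually k-NN triangles, every point is a core point and the two triangles come out as two clusters.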
Shared Nearest Neighbour • Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers • Handles data of high dimensionality and varying densities • Automatically detects the # of clusters