How to cluster data: Algorithm review
Extra material for DAA++ 18.2.2016
Prof. Pasi Fränti
Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND
University of Eastern Finland, Joensuu
Joki = a river; Joen = of a river; Suu = mouth; Joensuu = mouth of a river
Research topics
- Voice biometrics: speaker recognition, voice activity detection, applications
- Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engines
- Clustering methods: clustering algorithms, clustering validity, graph clustering, Gaussian mixture models
- Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging
Research achievements
- Voice biometrics: submission ranked #2 in four categories in NIST SRE 2006; top-1 most downloaded publication in Speech Communication Oct-Dec 2009; results used in forensics
- Location-based applications: results used by companies in Finland
- Clustering methods: state-of-the-art algorithms; 4 PhD degrees; 5 top publications
- Image processing & compression: state-of-the-art algorithms in niche areas; 6 PhD degrees; 8 top publications
Application example 1: Color reconstruction
[Figure: image with original colors vs. image with compression artifacts]
Application example 2: Speaker modeling for voice biometrics
[Figure: training data from speakers Tomi, Mikko and Matti goes through feature extraction and clustering to produce speaker models; an unknown sample goes through feature extraction and is compared against the models. Best match: Matti]
Speaker modeling
[Figure: speech data and the result of clustering]
Application example 3: Image segmentation
[Figure: image with 4 color clusters; normalized color plot according to the red and green components]
Application example 4: Quantization
Approximation of a continuous range of values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values.
[Figure: original signal vs. quantized signal]
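As a rough illustration, uniform scalar quantization can be sketched in a few lines of numpy; the function name and the choice of 8 levels below are illustrative, not from the slides:

```python
import numpy as np

def uniform_quantize(signal, levels=8):
    """Approximate a continuous-valued signal by `levels` equally spaced values."""
    lo, hi = signal.min(), signal.max()
    step = (hi - lo) / levels
    # Index of the quantization cell for each sample, clipped to the valid range.
    idx = np.clip(((signal - lo) / step).astype(int), 0, levels - 1)
    # Reconstruct each sample as the midpoint of its cell.
    return lo + (idx + 0.5) * step

t = np.linspace(0, 1, 200)
original = np.sin(2 * np.pi * t)
quantized = uniform_quantize(original)   # 8 discrete output values
```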
Color quantization of images
[Figure: color image → RGB samples → clustering]
Timeline clustering
[Figure: clustering of photos; clustered locations of users]
Clustering GPS trajectories: Mobile users, taxi routes, fleet management
Conclusions from clusters
[Figure: cluster 1 = office, cluster 2 = home]
Definitions and data
Set of N data points: X = {x1, x2, …, xN}
Partition of the data: P = {p1, p2, …, pN}, where pi ∈ [1, M] is the cluster index of data point xi
Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}
K-means algorithm
X = data set, C = cluster centroids, P = partition

KMeans(X, C) → (C, P)
REPEAT
  Cprev ← C
  FOR all i ∈ [1, N] DO pi ← FindNearest(xi, C)      (optimal partition)
  FOR all j ∈ [1, M] DO cj ← average of {xi | pi = j}   (optimal centroids)
UNTIL C = Cprev
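A minimal numpy sketch of this pseudocode; the vectorized distance computation and the guard for empty clusters are implementation choices, not part of the slides:

```python
import numpy as np

def kmeans(X, C):
    """K-means as in the pseudocode: X is (N, d) data, C is (M, d) initial
    centroids. Returns the final centroids and the partition P."""
    while True:
        C_prev = C.copy()
        # Optimal partition: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = dists.argmin(axis=1)
        # Optimal centroids: average of the points assigned to each cluster
        # (an empty cluster keeps its previous centroid).
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(len(C))])
        if np.allclose(C, C_prev):
            return C, P
```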
Distance and cost function
Euclidean distance of data vectors: d(xi, xj) = sqrt( Σ_{k=1..d} (xi,k - xj,k)² )
Mean square error: MSE(C, P) = (1/N) Σ_{i=1..N} ||xi - c_{pi}||²
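Coded directly from these definitions (a sketch; note that some texts additionally normalize the MSE by the dimensionality d):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance of two data vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def mse(X, C, P):
    """Mean square error: average squared distance from each point
    to the centroid c[P[i]] of its cluster."""
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))
```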
Clustering result as partition
[Figure: partition of the data, illustrated by convex hulls, and cluster prototypes, illustrated by a Voronoi diagram]
Duality of partition and centroids
The two representations determine each other: the partition follows from the prototypes by nearest-prototype mapping, and the centroids of the partition serve as the prototypes.
[Figure: partition of data and cluster prototypes]
Challenges in clustering
- Incorrect cluster allocation
- Incorrect number of clusters (too many clusters, or clusters missing)
How to solve?
- Algorithmic problem: solve the clustering. Given input data X of N data vectors and the number of clusters M, find the clusters; the result is given as a set of prototypes, or as a partition.
- Mathematical problem: solve the number of clusters. Define an appropriate cluster validity function f, repeat the clustering algorithm for several values of M, and select the best result according to f.
- Computer science problem: solve the problem efficiently.
Algorithm 1: Split
P. Fränti, T. Kaukoranta and O. Nevalainen, "On the splitting method for vector quantization codebook generation", Optical Engineering, 36 (11), 3043-3051, November 1997.
Divisive approach
Motivation:
- Efficiency of the divide-and-conquer approach
- Hierarchy of clusters as a result
- Useful when solving the number of clusters
Challenges:
- Design problem 1: which cluster to split?
- Design problem 2: how to split?
- Sub-optimal: local optimization at best
Select cluster to be split
Heuristic choices:
- Cluster with the highest variance (MSE)
- Cluster with the most skewed distribution (3rd moment)
Locally optimal choice (use this one! see the sketch below):
- Tentatively split all clusters
- Select the one that decreases MSE the most
Complexity of the choice:
- The heuristics take the time needed to compute their measure
- The optimal choice takes only twice (2x) as much time, since the measures can be stored and only the two new clusters created at each step need to be evaluated
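A minimal sketch of the locally optimal choice; the median split along the principal axis used here is a simple stand-in for the actual splitting step, and both function names are illustrative:

```python
import numpy as np

def tentative_split_gain(X):
    """Decrease in squared error if this cluster were split at the median
    of its principal-axis projections (assumes both halves are non-empty)."""
    c = X.mean(axis=0)
    before = np.sum((X - c) ** 2)
    _, _, Vt = np.linalg.svd(X - c, full_matrices=False)
    proj = (X - c) @ Vt[0]
    mask = proj <= np.median(proj)
    after = sum(np.sum((X[m] - X[m].mean(axis=0)) ** 2) for m in (mask, ~mask))
    return before - after

def select_cluster_to_split(clusters):
    """Tentatively split every cluster; pick the one that decreases MSE most."""
    return max(range(len(clusters)), key=lambda j: tentative_split_gain(clusters[j]))
```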
Selection example
[Figure: six clusters with MSE values 11.6, 6.5, 7.5, 4.3, 11.2 and 8.2. The cluster with the biggest MSE (11.6) is not the best choice: dividing the cluster with MSE 11.2 decreases the overall MSE more]
Selection example (continued)
[Figure: after the split the values are 11.6, 6.5, 7.5, 4.3, 6.3, 8.2 and 4.1; the split cluster was replaced by two sub-clusters (MSE 6.3 and 4.1), so only these two new values need to be calculated]
How to split
Centroid methods:
- Heuristic 1: replace C by C- and C+
- Heuristic 2: two furthest vectors
- Heuristic 3: two random vectors
Partition according to the principal axis (sketched after the pseudocode below):
- Calculate the principal axis
- Select a dividing point along the axis
- Divide by a hyperplane
- Calculate the centroids of the two sub-clusters
Splitting along the principal axis (pseudocode)
Step 1: Calculate the principal axis.
Step 2: Select a dividing point.
Step 3: Divide the points by a hyperplane.
Step 4: Calculate the centroids of the new clusters.
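A numpy sketch of these four steps, using the SVD for the principal axis and, for simplicity, the mean projection as the dividing point (the optimal choice of dividing point follows on the next slides):

```python
import numpy as np

def split_along_principal_axis(X):
    """Split one cluster (N, d) into two sub-clusters along its principal axis."""
    centered = X - X.mean(axis=0)
    # Step 1: principal axis = direction of largest variance (first right
    # singular vector of the centered data).
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    axis = Vt[0]
    # Step 2: project on the axis and select a dividing point (the mean here).
    proj = centered @ axis
    mask = proj <= proj.mean()
    # Step 3: the hyperplane through the dividing point, orthogonal to the
    # axis, divides the points into the two subsets `mask` and `~mask`.
    # Step 4: centroids of the two new clusters.
    return X[mask].mean(axis=0), X[~mask].mean(axis=0), mask
```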
Example of dividing
[Figure: the principal axis and the dividing hyperplane]
Optimal dividing point (pseudocode of Step 2)
Step 2.1: Calculate the projections on the principal axis.
Step 2.2: Sort the vectors according to the projection.
Step 2.3: FOR each vector xi DO:
  - Divide using xi as the dividing point.
  - Calculate the distortions D1 and D2 of the two subsets.
Step 2.4: Choose the point minimizing D1 + D2.
Finding the dividing point
When the dividing point moves past one vector x (moving x from subset 2 to subset 1, of sizes n2 and n1), the centroids are updated incrementally:
  c1 ← (n1·c1 + x) / (n1 + 1)
  c2 ← (n2·c2 - x) / (n2 - 1)
so the error of the next dividing point can be calculated in O(1) time (see the sketch below).
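A sketch of the scan over sorted projections with incremental centroid updates; for clarity the distortions D1 and D2 are recomputed in full here, whereas maintaining running sums of squares would give the O(1) per-step update stated above:

```python
import numpy as np

def optimal_dividing_point(X, axis):
    """Try every dividing point along the principal axis and return the
    index arrays of the two subsets minimizing D1 + D2."""
    order = np.argsort(X @ axis)              # Step 2.2: sort by projection
    n, d = X.shape
    sum_left, sum_right = np.zeros(d), X.sum(axis=0)
    best_cost, best_k = np.inf, 1
    for k in range(1, n):                     # first k vectors form subset 1
        x = X[order[k - 1]]
        sum_left += x                         # O(1) centroid updates
        sum_right -= x
        c1, c2 = sum_left / k, sum_right / (n - k)
        cost = (np.sum((X[order[:k]] - c1) ** 2) +
                np.sum((X[order[k:]] - c2) ** 2))
        if cost < best_cost:
            best_cost, best_k = cost, k
    return order[:best_k], order[best_k:]
```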
Example of splitting process
[Figures: the split process from 2 to 15 clusters, showing the principal axis and the dividing hyperplane at each step; final result MSE = 1.94]
K-means refinement
Result directly after the split: MSE = 1.94
Result after re-partition: MSE = 1.39
Result after K-means: MSE = 1.33
Time complexity
Number of processed vectors, assuming that clusters are always split into two equal halves: each level of the split hierarchy processes all N vectors and there are log2 M levels, giving N·log2 M in total.
Assuming an unequal split into sizes nmax and nmin, the worst case (nmin = 1) re-processes the large cluster at every step, giving O(NM).
Time complexity (continued)
At each step, sorting the vectors is the bottleneck: splitting a cluster of n vectors costs O(n log n), so with equal halves the total becomes O(N log N · log M).
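Written out, the sums behind these bounds are (assuming M clusters and, in the first case, equal halves at every split):

```latex
% Equal halves: each level of the split hierarchy processes all N vectors,
% and there are \log_2 M levels:
N + 2\cdot\tfrac{N}{2} + 4\cdot\tfrac{N}{4} + \dots = N\log_2 M
% With O(n \log n) sorting at each split as the bottleneck:
O(N \log N \cdot \log M)
% Worst-case unequal split (n_{\min} = 1): the large cluster is re-processed
% at each of the M-1 steps:
N + (N-1) + (N-2) + \dots = O(NM)
```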
Algorithm 2: Pairwise Nearest Neighbor
P. Fränti, T. Kaukoranta, D-F. Shen and K-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.
Agglomerative clustering
- Single link: minimize the distance of the nearest vectors
- Complete link: minimize the distance of the two furthest vectors
- Ward's method: minimize the mean square error; in vector quantization, known as the pairwise nearest neighbor (PNN) method
PNN algorithm [Ward 1963: Journal of the American Statistical Association]
Merge cost: the increase in total squared error when merging clusters a and b,
  d(a, b) = (na·nb / (na + nb)) · ||ca - cb||²
Local optimization strategy: always merge the pair with the smallest merge cost.
Nearest neighbor search is needed for:
- finding the cluster pair to be merged
- updating the NN pointers
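The merge cost above is Ward's criterion; a direct numpy sketch (the function names are illustrative):

```python
import numpy as np

def merge_cost(c_a, n_a, c_b, n_b):
    """Ward / PNN merge cost: increase in total squared error caused by
    merging clusters a and b (sizes n_a, n_b; centroids c_a, c_b)."""
    return (n_a * n_b) / (n_a + n_b) * np.sum((c_a - c_b) ** 2)

def merge(c_a, n_a, c_b, n_b):
    """Centroid (size-weighted average) and size of the merged cluster."""
    return (n_a * c_a + n_b * c_b) / (n_a + n_b), n_a + n_b
```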