How to cluster data: Algorithm review
Extra material for DAA++ 18.2.2016
Prof. Pasi Fränti
Speech & Image Processing Unit
School of Computing
University of Eastern Finland
Joensuu, FINLAND
University of Eastern Finland, Joensuu
Joki = a river; Joen = of a river; Suu = mouth; Joensuu = mouth of a river
Research topics
- Voice biometrics: speaker recognition, voice activity detection, applications
- Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engines
- Clustering methods: clustering algorithms, clustering validity, graph clustering, Gaussian mixture models
- Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging
Research achievements
- Voice biometrics: submission ranked #2 in four categories in NIST SRE 2006; top-1 most downloaded publication in Speech Communication Oct-Dec 2009; results used in forensics
- Location-based applications: results used by companies in Finland
- Clustering methods: state-of-the-art algorithms; 4 PhD degrees; 5 top publications
- Image processing & compression: state-of-the-art algorithms in niche areas; 6 PhD degrees; 8 top publications
Application example 1: Color reconstruction
[Figure: image with original colors vs. image with compression artifacts]
Application example 2: Speaker modeling for voice biometrics
[Figure: training data from speakers Tomi, Mikko and Matti goes through feature extraction and clustering to produce speaker models; an unknown sample goes through feature extraction and is compared against the models. Best match: Matti]
Speaker modeling
[Figure: speech data and the result of clustering]
Application example 3: Image segmentation
[Figure: image with 4 color clusters; normalized color plot according to the red and green components]
Application example 4: Quantization
Approximation of a continuous range of values (or a very large set of possible discrete values) by a small set of discrete symbols or integer values.
[Figure: original signal vs. quantized signal]
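As a rough illustration, uniform scalar quantization can be sketched in a few lines of numpy; the function name and the choice of 8 levels below are illustrative, not from the slides:

```python
import numpy as np

def uniform_quantize(signal, levels=8):
    """Approximate a continuous-valued signal by `levels` equally spaced values."""
    lo, hi = signal.min(), signal.max()
    step = (hi - lo) / levels
    # Index of the quantization cell for each sample, clipped to the valid range.
    idx = np.clip(((signal - lo) / step).astype(int), 0, levels - 1)
    # Reconstruct each sample as the midpoint of its cell.
    return lo + (idx + 0.5) * step

t = np.linspace(0, 1, 200)
original = np.sin(2 * np.pi * t)
quantized = uniform_quantize(original)   # 8 discrete output values
```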
Color quantization of images
[Figure: color image → RGB samples → clustering]
Timeline clustering
[Figure: clustering of photos; clustered locations of users]
Clustering GPS trajectories: Mobile users, taxi routes, fleet management
Conclusions from clusters
[Figure: cluster 1 = office, cluster 2 = home]
Definitions and data
Set of N data points: X = {x1, x2, …, xN}
Partition of the data: P = {p1, p2, …, pN}, where pi ∈ [1, M] is the cluster index of data point xi
Set of M cluster prototypes (centroids): C = {c1, c2, …, cM}
K-means algorithm
X = data set, C = cluster centroids, P = partition

KMeans(X, C) → (C, P)
REPEAT
  Cprev ← C
  FOR all i ∈ [1, N] DO pi ← FindNearest(xi, C)      (optimal partition)
  FOR all j ∈ [1, M] DO cj ← average of {xi | pi = j}   (optimal centroids)
UNTIL C = Cprev
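A minimal numpy sketch of this pseudocode; the vectorized distance computation and the guard for empty clusters are implementation choices, not part of the slides:

```python
import numpy as np

def kmeans(X, C):
    """K-means as in the pseudocode: X is (N, d) data, C is (M, d) initial
    centroids. Returns the final centroids and the partition P."""
    while True:
        C_prev = C.copy()
        # Optimal partition: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        P = dists.argmin(axis=1)
        # Optimal centroids: average of the points assigned to each cluster
        # (an empty cluster keeps its previous centroid).
        C = np.array([X[P == j].mean(axis=0) if np.any(P == j) else C[j]
                      for j in range(len(C))])
        if np.allclose(C, C_prev):
            return C, P
```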
Distance and cost function
Euclidean distance of data vectors: d(xi, xj) = sqrt( Σ_{k=1..d} (xi,k - xj,k)² )
Mean square error: MSE(C, P) = (1/N) Σ_{i=1..N} ||xi - c_{pi}||²
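Coded directly from these definitions (a sketch; note that some texts additionally normalize the MSE by the dimensionality d):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance of two data vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

def mse(X, C, P):
    """Mean square error: average squared distance from each point
    to the centroid c[P[i]] of its cluster."""
    return np.mean(np.sum((X - C[P]) ** 2, axis=1))
```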
Clustering result as partition
[Figure: partition of the data, illustrated by convex hulls, and cluster prototypes, illustrated by a Voronoi diagram]
Duality of partition and centroids
The two representations determine each other: the partition follows from the prototypes by nearest-prototype mapping, and the centroids of the partition serve as the prototypes.
[Figure: partition of data and cluster prototypes]
Challenges in clustering
- Incorrect cluster allocation
- Incorrect number of clusters (too many clusters, or clusters missing)
How to solve?
- Algorithmic problem: solve the clustering. Given input data X of N data vectors and the number of clusters M, find the clusters; the result is given as a set of prototypes, or as a partition.
- Mathematical problem: solve the number of clusters. Define an appropriate cluster validity function f, repeat the clustering algorithm for several values of M, and select the best result according to f.
- Computer science problem: solve the problem efficiently.
Algorithm 1: Split
P. Fränti, T. Kaukoranta and O. Nevalainen, "On the splitting method for vector quantization codebook generation", Optical Engineering, 36 (11), 3043-3051, November 1997.
Divisive approach
Motivation:
- Efficiency of the divide-and-conquer approach
- Hierarchy of clusters as a result
- Useful when solving the number of clusters
Challenges:
- Design problem 1: which cluster to split?
- Design problem 2: how to split?
- Sub-optimal: local optimization at best
Select cluster to be split
Heuristic choices:
- Cluster with the highest variance (MSE)
- Cluster with the most skewed distribution (3rd moment)
Locally optimal choice (use this one! see the sketch below):
- Tentatively split all clusters
- Select the one that decreases MSE the most
Complexity of the choice:
- The heuristics take the time needed to compute their measure
- The optimal choice takes only twice (2x) as much time, since the measures can be stored and only the two new clusters created at each step need to be evaluated
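A minimal sketch of the locally optimal choice; the median split along the principal axis used here is a simple stand-in for the actual splitting step, and both function names are illustrative:

```python
import numpy as np

def tentative_split_gain(X):
    """Decrease in squared error if this cluster were split at the median
    of its principal-axis projections (assumes both halves are non-empty)."""
    c = X.mean(axis=0)
    before = np.sum((X - c) ** 2)
    _, _, Vt = np.linalg.svd(X - c, full_matrices=False)
    proj = (X - c) @ Vt[0]
    mask = proj <= np.median(proj)
    after = sum(np.sum((X[m] - X[m].mean(axis=0)) ** 2) for m in (mask, ~mask))
    return before - after

def select_cluster_to_split(clusters):
    """Tentatively split every cluster; pick the one that decreases MSE most."""
    return max(range(len(clusters)), key=lambda j: tentative_split_gain(clusters[j]))
```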
Selection example
[Figure: six clusters with MSE values 11.6, 6.5, 7.5, 4.3, 11.2 and 8.2. The cluster with the biggest MSE (11.6) is not the best choice: dividing the cluster with MSE 11.2 decreases the overall MSE more]
Selection example (continued)
[Figure: after the split the values are 11.6, 6.5, 7.5, 4.3, 6.3, 8.2 and 4.1; the split cluster was replaced by two sub-clusters (MSE 6.3 and 4.1), so only these two new values need to be calculated]
How to split
Centroid methods:
- Heuristic 1: replace C by C- and C+
- Heuristic 2: two furthest vectors
- Heuristic 3: two random vectors
Partition according to the principal axis (sketched after the pseudocode below):
- Calculate the principal axis
- Select a dividing point along the axis
- Divide by a hyperplane
- Calculate the centroids of the two sub-clusters
Splitting along the principal axis (pseudocode)
Step 1: Calculate the principal axis.
Step 2: Select a dividing point.
Step 3: Divide the points by a hyperplane.
Step 4: Calculate the centroids of the new clusters.
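A numpy sketch of these four steps, using the SVD for the principal axis and, for simplicity, the mean projection as the dividing point (the optimal choice of dividing point follows on the next slides):

```python
import numpy as np

def split_along_principal_axis(X):
    """Split one cluster (N, d) into two sub-clusters along its principal axis."""
    centered = X - X.mean(axis=0)
    # Step 1: principal axis = direction of largest variance (first right
    # singular vector of the centered data).
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    axis = Vt[0]
    # Step 2: project on the axis and select a dividing point (the mean here).
    proj = centered @ axis
    mask = proj <= proj.mean()
    # Step 3: the hyperplane through the dividing point, orthogonal to the
    # axis, divides the points into the two subsets `mask` and `~mask`.
    # Step 4: centroids of the two new clusters.
    return X[mask].mean(axis=0), X[~mask].mean(axis=0), mask
```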
Example of dividing
[Figure: the principal axis and the dividing hyperplane]
Optimal dividing point (pseudocode of Step 2)
Step 2.1: Calculate the projections on the principal axis.
Step 2.2: Sort the vectors according to the projection.
Step 2.3: FOR each vector xi DO:
  - Divide using xi as the dividing point.
  - Calculate the distortions D1 and D2 of the two subsets.
Step 2.4: Choose the point minimizing D1 + D2.
Finding the dividing point
When the dividing point moves past one vector x (moving x from subset 2 to subset 1, of sizes n2 and n1), the centroids are updated incrementally:
  c1 ← (n1·c1 + x) / (n1 + 1)
  c2 ← (n2·c2 - x) / (n2 - 1)
so the error of the next dividing point can be calculated in O(1) time (see the sketch below).
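A sketch of the scan over sorted projections with incremental centroid updates; for clarity the distortions D1 and D2 are recomputed in full here, whereas maintaining running sums of squares would give the O(1) per-step update stated above:

```python
import numpy as np

def optimal_dividing_point(X, axis):
    """Try every dividing point along the principal axis and return the
    index arrays of the two subsets minimizing D1 + D2."""
    order = np.argsort(X @ axis)              # Step 2.2: sort by projection
    n, d = X.shape
    sum_left, sum_right = np.zeros(d), X.sum(axis=0)
    best_cost, best_k = np.inf, 1
    for k in range(1, n):                     # first k vectors form subset 1
        x = X[order[k - 1]]
        sum_left += x                         # O(1) centroid updates
        sum_right -= x
        c1, c2 = sum_left / k, sum_right / (n - k)
        cost = (np.sum((X[order[:k]] - c1) ** 2) +
                np.sum((X[order[k:]] - c2) ** 2))
        if cost < best_cost:
            best_cost, best_k = cost, k
    return order[:best_k], order[best_k:]
```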
Example of splitting process
[Figures: the split process from 2 to 15 clusters, showing the principal axis and the dividing hyperplane at each step; final result MSE = 1.94]
K-means refinement
Result directly after the split: MSE = 1.94
Result after re-partition: MSE = 1.39
Result after K-means: MSE = 1.33
Time complexity
Number of processed vectors, assuming that clusters are always split into two equal halves: each level of the split hierarchy processes all N vectors and there are log2 M levels, giving N·log2 M in total.
Assuming an unequal split into sizes nmax and nmin, the worst case (nmin = 1) re-processes the large cluster at every step, giving O(NM).
Time complexity (continued)
At each step, sorting the vectors is the bottleneck: splitting a cluster of n vectors costs O(n log n), so with equal halves the total becomes O(N log N · log M).
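Written out, the sums behind these bounds are (assuming M clusters and, in the first case, equal halves at every split):

```latex
% Equal halves: each level of the split hierarchy processes all N vectors,
% and there are \log_2 M levels:
N + 2\cdot\tfrac{N}{2} + 4\cdot\tfrac{N}{4} + \dots = N\log_2 M
% With O(n \log n) sorting at each split as the bottleneck:
O(N \log N \cdot \log M)
% Worst-case unequal split (n_{\min} = 1): the large cluster is re-processed
% at each of the M-1 steps:
N + (N-1) + (N-2) + \dots = O(NM)
```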
Algorithm 2: Pairwise Nearest Neighbor
P. Fränti, T. Kaukoranta, D-F. Shen and K-S. Chang, "Fast and memory efficient implementation of the exact PNN", IEEE Trans. on Image Processing, 9 (5), 773-777, May 2000.
Agglomerative clustering
- Single link: minimize the distance of the nearest vectors
- Complete link: minimize the distance of the two furthest vectors
- Ward's method: minimize the mean square error; in vector quantization, known as the pairwise nearest neighbor (PNN) method
PNN algorithm [Ward 1963: Journal of the American Statistical Association]
Merge cost: the increase in total squared error when merging clusters a and b,
  d(a, b) = (na·nb / (na + nb)) · ||ca - cb||²
Local optimization strategy: always merge the pair with the smallest merge cost.
Nearest neighbor search is needed for:
- finding the cluster pair to be merged
- updating the NN pointers
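The merge cost above is Ward's criterion; a direct numpy sketch (the function names are illustrative):

```python
import numpy as np

def merge_cost(c_a, n_a, c_b, n_b):
    """Ward / PNN merge cost: increase in total squared error caused by
    merging clusters a and b (sizes n_a, n_b; centroids c_a, c_b)."""
    return (n_a * n_b) / (n_a + n_b) * np.sum((c_a - c_b) ** 2)

def merge(c_a, n_a, c_b, n_b):
    """Centroid (size-weighted average) and size of the merged cluster."""
    return (n_a * c_a + n_b * c_b) / (n_a + n_b), n_a + n_b
```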