PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency
Hwanjun Song¹, Jae-Gil Lee¹, and Wook-Shin Han²
¹ Korea Advanced Institute of Science and Technology (KAIST)
² Pohang University of Science and Technology (POSTECH)
Content • Introduction • Related Work • Algorithm Proposal • Algorithm Implementation • Experiment • Summary
Clustering • Definition • Clustering is the grouping of a set of objects based on their characteristics, so that similar objects end up in the same group • Application • Customer segmentation and city planning
Representative Clustering Method (1/2) • k-Means: commonly used clustering method • Problem 1: sensitive to outliers • The new centroid is the average point of the cluster, so outliers drag it away (see the sketch below) • Problem 2: convergence to a local optimum • The new centroid is found only inside the cluster [Figure: (a) ideal case vs. (b) real case, with an outlier and a bad seed]
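To make Problem 1 concrete, here is a tiny numeric sketch (ours, not from the slides) contrasting the mean, which an outlier drags away, with the medoid, which stays at an actual representative point:

```python
# A 1-D toy example: 100.0 is an outlier.
points = [1.0, 2.0, 3.0, 4.0, 100.0]

# k-Means-style center: the average, dragged toward the outlier.
mean = sum(points) / len(points)  # 22.0

# k-Medoids-style center: the actual data point minimizing the total
# distance to all other points, robust to the outlier.
medoid = min(points, key=lambda c: sum(abs(c - p) for p in points))  # 3.0

print(mean, medoid)  # 22.0 3.0
```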
Representative Clustering Method (2/2) • PAM [1]: representative k-Medoids algorithm for clustering • Advantage 1: robust to outliers • The medoid is a real data point that minimizes the clustering error • Advantage 2: robust to local minima • The next medoid can be found outside the corresponding cluster • Problem: high computational complexity • Needs to evaluate distances between candidate medoids and all points • Computational complexity: O(k(n−k)²) per iteration (see the swap sketch below)
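For reference, a minimal sketch (our own naming, not the authors' code) of PAM's global swap step, which is where the O(k(n−k)²) cost comes from: every (medoid, non-medoid) swap is evaluated against the whole data set.

```python
import math

def total_error(points, medoids):
    # Clustering error: sum of distances from each point to its nearest medoid.
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam_swap_once(points, medoids):
    # One pass of PAM's global search: try swapping every current medoid
    # with every non-medoid point and keep the best improvement. There are
    # about k*(n-k) candidate swaps, each evaluated over all n points --
    # the source of PAM's high cost on large data.
    best, best_err = medoids, total_error(points, medoids)
    for i in range(len(medoids)):
        for cand in points:
            if cand in medoids:
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            err = total_error(points, trial)
            if err < best_err:
                best, best_err = trial, err
    return best, best_err
```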
Medoid Search Method • Global search ← k-Medoids (PAM) • Every point can be a candidate for the next center • The reason for the high computational complexity • Local search ← k-Means • Only points inside the cluster can be candidates for the next center • The reason for the local optimum [Figure: global search considers all points as candidate centers; local search considers only points within each cluster]
Content • Introduction • Related Work • Algorithm Proposal • Algorithm Implementation • Experiment • Summary
Non-parallel k-Medoids Algorithms • CLARA [1] • Runs PAM on several small samples and selects the k-medoids set minimizing the clustering error on the full data (see the sketch below) • The authors chose 5 random samples of 40 + 2k objects according to their experiments • CLARA ends quickly because PAM runs only on samples • CVI [2], IFART [3] • Use density-based and fuzzy-ART methods to find initial medoids • Reduce the number of iterations until convergence • MEM [4] • Stores the distances for all possible pairs of data objects in advance and performs local search to update medoids • CLARANS [5] • Considers a graph in which every node is a potential solution and dynamically draws a sample of neighbors to confine the search space in the graph • FAMES [6] • Quickly finds suitable medoids using geometric computation rather than random selection
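A rough sketch of how CLARA operates, reusing `total_error` and `pam_swap_once` from the PAM sketch above; the driver loop is our own simplification, not the original implementation:

```python
import random

def run_pam(sample, k):
    # Naive PAM driver: random initialization, then swap until no improvement.
    medoids = random.sample(sample, k)
    err = total_error(sample, medoids)
    while True:
        new_medoids, new_err = pam_swap_once(sample, medoids)
        if new_err >= err:
            return medoids
        medoids, err = new_medoids, new_err

def clara(points, k, num_samples=5):
    # CLARA: run PAM on several small random samples and keep the medoid
    # set with the lowest error on the FULL data set.
    sample_size = min(40 + 2 * k, len(points))  # size suggested in [1]
    best, best_err = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(points, sample_size)
        medoids = run_pam(sample, k)
        err = total_error(points, medoids)  # evaluated on all points
        if err < best_err:
            best, best_err = medoids, err
    return best
```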
Parallel k-Medoids Algorithms (1/2) • PAM-MR [7] • Distributed algorithm using the MapReduce framework • Uses local search in the medoid-search phase • FAMES-MR [8] • Similar to PAM-MR • Uses a geometric method to search for new medoids [Flow chart: select initial medoids → mappers assign each point to the nearest medoid → reducers search for new medoids → update medoids; repeat until convergence, then end]
Parallel k-Medoids Algorithms (2/2) • CLARA-MR [9] • An extension of CLARA to MapReduce; executes two MapReduce jobs that run PAM on samples and select the best set in parallel • The authors chose 5 random samples according to their experiments • MR-KMEDIAN [10] • Constructs a single best sample by parallel, iterative sampling and runs a weighted k-Medoids algorithm over the sample • GREEDI [11] • Runs a greedy algorithm to find a candidate set of k medoids from each partition of the entire data, where a partition can be considered a sample • Then runs the greedy algorithm over the union of those medoids to find the final set of k medoids
Limitation of Previous Algorithms (1/3) • High elapsed time (low efficiency) • Only CLARA-MR, FAMES-MR, GREEDI, and MR-KMEDIAN remain feasible when dealing with large amounts of data [Diagram: classification of k-Medoids algorithms by approach (graph-based: CLARANS; memory-based: MEM; seed-based: CVI, IFART; geometric: FAMES, FAMES-MR; MapReduce: PAM-MR; sampling-based: CLARA, CLARA-MR, GREEDI, MR-KMEDIAN), split into non-parallel vs. parallel and marked as possible or impossible on large data]
Limitation of Previous Algorithms (2/3) • High clustering error (low accuracy) • Problems of local-search-based k-Medoids algorithms • MEM, PAM-MR, FAMES, and FAMES-MR
Limitation of Previous Algorithms (3/3) • High clustering error (low accuracy) • Problems of the sampling technique • CLARA, CLARA-MR, GREEDI, and MR-KMEDIAN • The error increases if the sampled points are far from the optimal medoids • The error depends on the number of samples and the sample size [Figure: (a) ideal result vs. (b) real result, where a sampled medoid point misses the optimal location]
Content • Introduction • Related Work • Algorithm Proposal • Algorithm Implementation • Experiment • Summary
Goal of Research • Propose a new parallel k-Medoids algorithm, PAMAE, that achieves both high accuracy and high efficiency
Overview of PAMAE (1/2) • Flow chart of PAMAE
Overview of PAMAE (2/2) • PAMAE consists of two phases • The result of Phase 1 is used as the seed of Phase 2 [Figure: samples 1-3 → Phase 1 (Parallel Seeding) → best seed → Phase 2 (Parallel Refinement) → final result]
Phase 1: Parallel Seeding • Theoretical assurance for local refinement (Phase 2) • The Phase 1 result, the seed, is guaranteed to • be close to the optimal medoids • not fall into a local optimum • Sampling error depending on parameter values • Analyze the effect of sampling on the clustering error • Guarantee that the Phase 1 seed is very close to the optimal medoids [Figure: Phase 1 error vs. optimal error]
Phase 2: Parallel Refinement • Sampling probability • Analyze the probability that each seed is an instance of each optimal cluster • Guarantee that the result does not fall into a local optimum when this probability is high enough • Only one iteration of refinement is enough to improve the quality of clustering (see the sketch below)
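Putting the two phases together, here is a sequential sketch of PAMAE's overall flow. The real implementation runs both phases in parallel on Spark/Hadoop; the structure below is our simplification, reusing `run_pam` and `total_error` from the sketches above, and the CLARA-style sample size is our assumption:

```python
import math, random

def pamae(points, k, num_samples=5):
    # --- Phase 1: parallel seeding ---
    # Run k-medoids on several random samples (independent -> parallelizable)
    # and keep the best seed, judged on the full data set.
    sample_size = min(40 + 2 * k, len(points))  # assumption: CLARA-style size
    seed, seed_err = None, float("inf")
    for _ in range(num_samples):
        sample = random.sample(points, sample_size)
        medoids = run_pam(sample, k)
        err = total_error(points, medoids)
        if err < seed_err:
            seed, seed_err = medoids, err

    # --- Phase 2: parallel refinement, one iteration ---
    # Assign every point to its nearest seed, then replace each seed with
    # the exact medoid of its cluster (each cluster -> one parallel task).
    clusters = {s: [] for s in seed}
    for p in points:
        clusters[min(seed, key=lambda s: math.dist(p, s))].append(p)
    return [min(members, key=lambda c: sum(math.dist(c, q) for q in members))
            for members in clusters.values()]
```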
Content • Introduction • Related Work • Algorithm Proposal • Algorithm Implementation • Experiment • Summary
Algorithm Implementation Issue • Distributed processing platform: Hadoop vs. Spark? • Hadoop is not good at processing iterative algorithms • Each job incurs startup latency and redundant reading of the input data set • Shuffle and sort are required after every map phase • Spark is more efficient for iterative algorithms (see the sketch below) • Low data parallelism • Previous MapReduce k-Medoids algorithms did not guarantee high data parallelism [12]
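A hypothetical PySpark fragment (not the authors' code; the input path and candidate medoids are placeholders) illustrating the point: caching the parsed data in memory lets each iteration reuse it, whereas each Hadoop MapReduce job would re-read the input from disk:

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-clustering-sketch")

# Parse the input once and cache it; later actions reuse the in-memory copy.
points = (sc.textFile("hdfs:///data/points.csv")  # placeholder path
            .map(lambda line: tuple(float(x) for x in line.split(",")))
            .cache())

medoids = [(0.0, 0.0), (10.0, 10.0)]  # toy candidate medoids
for _ in range(5):  # each pass of an iterative algorithm
    error = (points
             .map(lambda p: min(sum((a - b) ** 2 for a, b in zip(p, m)) ** 0.5
                                for m in medoids))
             .reduce(lambda x, y: x + y))
    # No re-read of the input here: the cached RDD is reused every iteration.
    print(error)
```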
Fine-Granularity Refinement • For high data parallelism • With coarse granularity (one task per cluster), processors remain idle whenever there are fewer refinement tasks than processors • PAMAE divides a cluster into multiple smaller partitions and then performs local refinement on each partition • The medoid of a cluster can be recovered from the medoids of its partitions if random partitioning is performed on a sufficiently large cluster (see the sketch below)
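A minimal sketch of the fine-granularity idea as we read it (function names are ours, and taking the medoid of the partition medoids is our interpretation of combining partition results):

```python
import math, random

def refine_cluster_fine_grained(members, num_partitions):
    def medoid(pts):
        # Exact medoid: the point minimizing total distance to the others.
        return min(pts, key=lambda c: sum(math.dist(c, q) for q in pts))

    # Randomly split the cluster into smaller partitions so that one large
    # cluster becomes several independent (parallelizable) refinement tasks.
    shuffled = random.sample(members, len(members))
    partitions = [shuffled[i::num_partitions] for i in range(num_partitions)]

    # Medoid of each partition, then the medoid of those partition medoids
    # approximates the cluster's medoid for sufficiently large clusters.
    partition_medoids = [medoid(part) for part in partitions if part]
    return medoid(partition_medoids)
```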
Content • Introduction • Related Work • Algorithm Proposal • Algorithm Implementation • Experiment • Summary
Experiment Configuration • Computing cluster • 12 Microsoft Azure D12v2 instances located in Japan • 2.4 GHz 4-core CPU, 28 GB RAM, 200 GB SSD per instance • 2 instances are master nodes and 10 instances are worker nodes • Distributed processing platform • Hadoop 2.7.1 and Spark 1.6.1 for distributed parallel processing • Real-world data sets used for experiments
Accuracy Result (1/3) • Meaningful differences between algorithms would not show up if we compared only absolute errors • The optimal clustering error takes a significant proportion of the total error in large-scale data sets • We therefore report the relative error of PAMAE separately before and after Phase 2 to examine the effect of each phase
Accuracy Result (2/3) • Accuracy comparison of eight parallel k-Medoids algorithms • PAMAE reduced the relative error to as low as • 7-24% of that of PAM-MR • 1.9-13% of that of CLARA-MR • 18-22% of that of GREEDI (Fig. (a)) • 1.0-1.6% of that of MR-KMEDIAN [Figure: relative errors on (a) Covertype, (b) Census1990, (c) TeraClickLog150]
Accuracy Result (3/3) • Convergence in Phase 2 depending on the seeding strategy • With our seeding strategy, the relative error converged immediately after the first iteration • Other strategies did not guarantee high-quality seeds [Figure: convergence on (a) Covertype, (b) Census1990, (c) TeraClickLog150]
Efficiency Result (1/3) • Efficiency comparison of eight parallel k-Medoids algorithms • PAMAE-Spark is the best and outperformed CLARA-MR by up to 5.4 times • PAMAE-Spark is much faster than PAMAE-Hadoop because of the advantage of Spark over Hadoop [Figure: elapsed times on (a) Covertype, (b) Census1990, (c) TeraClickLog150]
Efficiency Result (2/3) • Proportion of the elapsed time per step in PAMAE-Spark (k = 50) • When a data set is small, Steps 1 and 2 dominate since Step 2 has the highest computational complexity • The proportion of Phase 2 (Steps 4 and 5) increased with the data size, because Phase 2 is sensitive to the data size
Efficiency Result (3/3) • Scalability test on Spark • PAMAE-Spark achieves near-linear scalability: the total elapsed time increased by only 9.1 times when the data size increased 10 times, from 30 GB to 300 GB
Content • Introduction • Related Work • Algorithm Proposal • Algorithm Implementation • Experiment • Summary
Summary • The k-Medoids algorithm is not appropriate for handling large amounts of data • Although many studies succeeded in reducing the high complexity of the k-Medoids algorithm, they did not consider the accuracy of the algorithm • We propose PAMAE, which consists of two phases: Parallel Seeding and Parallel Refinement • PAMAE significantly outperforms most recent parallel algorithms in efficiency and, at the same time, produces clustering quality comparable to the previous most-accurate algorithm
Reference
[1] Kaufman, Leonard., and Rousseeuw, Peter J. (1987). Clustering by Means of Medoids, North-Holland
[2] Pardeshi, Bharat., and Toshniwal, Durga. (2010). "Improved k-medoids Clustering Based on Cluster Validity Index and Object Density." 2nd IEEE International Conference on Advance Computing (IACC) 2010, Patiala, India, pp. 379-384
[3] Omurca, Sevinc Ilhan., and Duru, Nevcihan. (2011). "Decreasing Iteration Number of k-medoids Algorithm with IFART." 7th IEEE International Conference on Electrical and Electronics Engineering (ELECO) 2011, Bursa, Turkey, pp. 454-456
[4] Park, Hae-Sang., and Jun, Chi-Hyuck. (2009). "A Simple and Fast Algorithm for k-medoids Clustering." Expert Systems with Applications, 36(2), pp. 3336-3341
[5] Ng, Raymond T., and Han, Jiawei. (2002). "CLARANS: A Method for Clustering Objects for Spatial Data Mining." IEEE Transactions on Knowledge and Data Engineering, 14(5), pp. 1003-1016
[6] Paterlini, Adriano Arantes., Nascimento, Mario A., and Traina Jr., Caetano. (2011). "Using Pivots to Speed-up k-medoids Clustering." Journal of Information and Data Management, 2(2), pp. 221-236
[7] Yang, Xianfeng., and Lian, Liming. (2014). "A New Data Mining Algorithm Based on MapReduce and Hadoop." International Journal of Signal Processing, Image Processing and Pattern Recognition, 7(2), pp. 131-142
[8] Zhu, Ying-ting., Wang, Fu-zhang., Shan, Xing-hua., and Lv, Xiao-yan. (2014). "k-medoids Clustering Based on MapReduce and Optimal Search of Medoids." 9th IEEE International Conference on Computer Science and Education (ICCSE) 2014, pp. 64-71
[9] Jakovits, Pelle., and Srirama, Satish Narayana. (2013). "Clustering on the Cloud: Reducing CLARA to MapReduce." 2nd Nordic Symposium on Cloud Computing and Internet Technologies 2013, Oslo, Norway, pp. 64-71
[10] Ene, Alina., Im, Sungjin., and Moseley, Benjamin. (2011). "Fast Clustering Using MapReduce." 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 681-689
[11] Mirzasoleiman, Baharan., Karbasi, Amin., Sarkar, Rik., and Krause, Andreas. (2013). "Distributed Submodular Maximization: Identifying Representative Elements in Massive Data." 27th Annual Conference on Neural Information Processing Systems, pp. 2049-2057
[12] Hillis, W. Daniel., and Steele, Guy L., Jr. (1986). "Data Parallel Algorithms." Communications of the ACM, 29(12), pp. 1170-1183