Explore parallelization of KMeans algorithm using MapReduce in Hadoop, comparing IPKMeans and PKMeans methods. Includes IoT applications, Big Data challenges, experimental results, Hadoop cluster architecture, and references.
Parallelizing KMeans using MapReduce: IPKMeans vs PKMeans
Patrick Killeen
School of Computer Science, University of Ottawa, Ottawa, Canada
pkill013@uottawa.ca
Table of Contents
1. Introduction
   1.1 Internet of Things – Big Data
   1.2 KMeans Algorithm
   1.3 Hadoop MapReduce
   1.4 PKMeans
   1.5 IPKMeans
2. My Experimental Results
   2.1 Hadoop Cluster Architecture
   2.2 Results
3. Questions
4. References
1.1 Internet of Things – Big Data
- IoT and its applications
  - Network connected devices
  - Sensors and actuators
  - Military asset tracking [7]
- Challenges: Big Data
  - Big Data's 5 Vs: velocity, veracity, variety, value, and volume
[8] [9]
1.2 KMeans Algorithm
Applications
- Opinion mining [13]
- Image pattern recognition [14]
- Stock exchange analysis [12]
Steps
1. Choose k initial random centroids (data points)
2. Label each point with its nearest centroid
3. Recompute the k centroids as the average of each cluster
4. Go back to step 2 until convergence (the centroids have not changed)
Figure 1. Example KMeans clustering result of 3 clusters
Figure 2. Example KMeans centroids convergence of 3 clusters
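The four steps above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not the implementation used in the experiments. The function names (`kmeans`, `nearest`, `mean`), the `seed` and `max_iter` parameters, and the choice to keep an empty cluster's previous centroid are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: choose k random data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: label every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Step 3: recompute each centroid as its cluster's average.
        # Assumption: an empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

Calling `kmeans(points, k)` with `points` as a list of equal-length tuples returns the final centroids and one cluster label per point. The `max_iter` cap is only a safeguard for the sketch; the slide's stopping rule is the centroid comparison in step 4.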
1.3 Hadoop MapReduce
- Open source [1], based on Google's work
- Hadoop cluster
  - Many machines on a rack
  - Huge files partitioned/split and stored on many machines
- MapReduce
  - Slaves perform data analytics jobs on local data with the following software components:
    - Mapper: pre-processing phase
    - Reducer: post-processing phase
[10]
Figure 3. Overview of Example Hadoop Cluster
1.4 PKMeans
- Mapper
  - Labels data points to nearest centroid
  - Any number of mappers
- Reducer
  - Recomputes centroid using cluster average
  - Number of reducers = k (number of clusters)
- PKMeans was proposed by [5]
Figure 4. PKMeans example job, 5 mappers, 3 centroids, 3 reducers
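The mapper and reducer roles described above can be illustrated with a single-process simulation of one PKMeans iteration. This sketch does not use the Hadoop API and is not the code from [5]. The names `pk_map`, `pk_reduce`, and `pkmeans_iteration` are hypothetical, and the in-memory dictionary merely stands in for Hadoop's shuffle step.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pk_map(point, centroids):
    """Mapper: emit (index of the nearest centroid, point)."""
    key = min(range(len(centroids)),
              key=lambda i: squared_distance(point, centroids[i]))
    return key, point


def pk_reduce(key, points):
    """Reducer: one call per cluster; returns that cluster's new centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def pkmeans_iteration(splits, centroids):
    """Simulate one MapReduce job over a list of data splits."""
    grouped = defaultdict(list)
    # Map phase: each split would be processed by its own mapper.
    for split in splits:
        for point in split:
            key, value = pk_map(point, centroids)
            grouped[key].append(value)  # stand-in for the shuffle
    # Reduce phase: one reducer per cluster key (k reducers in total).
    new_centroids = list(centroids)
    for key, points in grouped.items():
        new_centroids[key] = pk_reduce(key, points)
    return new_centroids
```

A driver would call `pkmeans_iteration` repeatedly, feeding each result back in as the next set of centroids, until the centroids stop changing. The number of splits can vary freely, matching the "any number of mappers" point, while the number of reduce keys is bounded by k.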
1.5 IPKMeans
- Phase 1: Data Partitioning
  - KDTree
  - Create subgroups
- Phase 2: Parallel KMeans
  - Run KMeans on each subgroup
- Phase 3: Centroid Merging
  - Pick best (most central) centroids found from phase 2
- IPKMeans was proposed by [4]
Figure 5. IPKMeans phase execution overview
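Phase 1 can be illustrated with a k-d-tree-style partitioning routine. The sketch below is an assumption-laden example rather than the procedure from [4]: it splits at the median along axes taken in cyclic order, and the function name `kd_partition` and its `depth` parameter are hypothetical. Phases 2 and 3 are left out because the slides do not give enough detail to reconstruct them.

```python
def kd_partition(points, depth, axis=0):
    """Split `points` into up to 2**depth subgroups by recursive median cuts.

    Assumption: the split axis cycles through the dimensions, as in a
    textbook k-d tree; the method in [4] may choose splits differently.
    """
    if depth == 0 or len(points) <= 1:
        return [points]
    dims = len(points[0])
    # Sort along the current axis and cut at the median position.
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    next_axis = (axis + 1) % dims
    return (kd_partition(ordered[:mid], depth - 1, next_axis)
            + kd_partition(ordered[mid:], depth - 1, next_axis))
```

Each returned subgroup holds points that are close together along the split axes, so phase 2 could run KMeans on every subgroup independently.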
2.1 Hadoop Cluster Architecture
- Used Bitnami Hadoop VM
- Client Node
  - Submit jobs
  - SSH tunnel
- Master Node
  - Manages jobs
- Service Node
  - Job history management
- Slave Nodes
  - Mapper
  - Reducer
  - Distributed Data Storage
- For more information on Hadoop configuration, see [2][3]
Figure 6. My OpenStack VM Hadoop Cluster Configuration
2.2 Results
Figure 7. Data set with 3000 points and 3 Gaussian-distributed clusters
Figure 8. Increasing dataset size with 10 nodes and 7 reducers
Figure 9. Varying initial centroids with 3000 points, 7 reducers, and 10 nodes
Figure 10. Varying initial centroids with 84000 points, 7 reducers, and 10 nodes
3. Questions
Question 1: What is a centroid in the KMeans algorithm?
Question 2: What is Hadoop?
Question 3: What is Big Data?
[11]
4. References
[1] Apache Hadoop project. https://hadoop.apache.org/. Accessed: 2018-11-26.
[2] Creating a Hadoop cluster. https://docs.bitnami.com/bch/apps/hadoop/getstarted/hadoop-cluster/. Accessed: 2018-11-26.
[3] Understanding Hadoop clusters and the network. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/. Accessed: 2018-11-26.
[4] Shikai Jin, Yuxuan Cui, and Chunli Yu. A new parallelization method for k-means. arXiv preprint arXiv:1608.06347, 2016.
[5] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In IEEE International Conference on Cloud Computing, pages 674–679. Springer, 2009.
[7] Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. ACM SIGMOD Record, 34(4):42–47, 2005.
[8] https://www.wired.com/2012/06/wireless-power/
[9] https://www.smartdatacollective.com/can-business-intelligence-answer-questions-asked-without-big-data/
[10] https://commons.wikimedia.org/wiki/File:Hadoop_logo.svg
[11] http://images.clipartpanda.com/light-bulbs-Light-Bulb.jpg
[12] Oussama Lachiheb, Mohamed Salah Gouider, and Lamjed Ben Said. An improved MapReduce design of KMeans with iteration reducing for clustering stock exchange very large datasets. In Semantics, Knowledge and Grids (SKG), 2015 11th International Conference on, pages 252–255. IEEE, 2015.
[13] V Priya and K Umamaheswari. Ensemble based parallel k means using map reduce for aspect based summarization. In Proceedings of the International Conference on Informatics and Analytics, page 26. ACM, 2016.
[14] Anil R Surve and Nilesh S Paddune. A survey on Hadoop assisted k-means clustering of hefty volume images. International Journal on Computer Science & Engineering, 6(3):113–117, 2014.