
Parallelizing KMeans using MapReduce IPKMeans vs PKMeans

Explore parallelization of KMeans algorithm using MapReduce in Hadoop, comparing IPKMeans and PKMeans methods. Includes IoT applications, Big Data challenges, experimental results, Hadoop cluster architecture, and references.


Presentation Transcript


  1. Parallelizing KMeans using MapReduce: IPKMeans vs PKMeans
     Patrick Killeen, School of Computer Science, University of Ottawa, Ottawa, Canada, pkill013@uottawa.ca

  2. Table of Contents
     1. Introduction
        1.1 Internet of Things – Big Data
        1.2 KMeans Algorithm
        1.3 Hadoop MapReduce
        1.4 PKMeans
        1.5 IPKMeans
     2. My Experimental Results
        2.1 Hadoop Cluster Architecture
        2.2 Results
     3. Questions
     4. References

  3. 1 Introduction

  4. 1.1 Internet of Things – Big Data [8] [9]
     IoT and its applications
       Network-connected devices
       Sensors and actuators
       Military asset tracking [7]
     Challenges: Big Data
       Big Data’s 5 Vs: velocity, veracity, variety, value, and volume

  5. 1.2 KMeans Algorithm
     Applications
       Opinion mining [13]
       Image pattern recognition [14]
       Stock exchange analysis [12]
     Steps
       1. Choose k initial random centroids (data points)
       2. Label each point with its nearest centroid
       3. Recompute the k centroids as the average of each cluster’s points
       4. Go back to step 2 until convergence (the centroids no longer change)
     Figure 1. Example KMeans clustering result of 3 clusters
     Figure 2. Example KMeans centroids convergence of 3 clusters
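
The four steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the presenter's code: the function names, the `seed` and `max_iter` parameters, the use of squared Euclidean distance, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of coordinate tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # step 1: k random data points
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                           # step 2: label with nearest centroid
            clusters[nearest(p, centroids)].append(p)
        # step 3: recompute centroids (assumption: an empty cluster keeps its old one)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:             # step 4: stop once nothing changes
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10).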

  6. 1.3 Hadoop MapReduce [10]
     Open source [1], based on Google’s work
     Hadoop cluster
       Many machines on a rack
       Huge files partitioned/split and stored on many machines
     MapReduce
       Slaves perform data analytics jobs on their local data using the following software components:
         Mapper: pre-processing phase
         Reducer: post-processing phase
     Figure 3. Overview of Example Hadoop Cluster
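
To make the Mapper/Reducer split concrete, here is a small in-memory simulation of one MapReduce job. It does not use the Hadoop API; `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names, and word counting is used only as a familiar stand-in job that the slides do not mention. Each inner list plays the role of a file split stored on one machine.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Simulate one job: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for split in splits:                      # each split stands in for one node's local data
        for record in split:
            for key, value in mapper(record):
                groups[key].append(value)     # shuffle: collect all values for a key
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Mapper: emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1

def count_reducer(word, counts):
    """Reducer: total the counts emitted for one word."""
    return sum(counts)

# Two splits, as if the input file were stored on two machines.
splits = [["big data big clusters"], ["data on many machines"]]
counts = run_mapreduce(splits, word_mapper, count_reducer)
```

In a real cluster the grouping step is done by Hadoop's shuffle between the map and reduce phases; here it is just a dictionary.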

  7. 1.4 PKMeans
     Mapper
       Labels data points with their nearest centroid
       Any number of mappers
     Reducer
       Recomputes a centroid using its cluster’s average
       Number of reducers = k (number of clusters)
     PKMeans was proposed by [5]
     Figure 4. PKMeans example job, 5 mappers, 3 centroids, 3 reducers
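
One PKMeans iteration, as described on this slide, can be sketched in memory as below. This is an illustration under stated assumptions rather than the implementation from [5]: the names `pk_map`, `pk_reduce`, and `pkmeans_iteration` are hypothetical, the shuffle is simulated with a dictionary, and a cluster that receives no points is assumed to keep its previous centroid.

```python
from collections import defaultdict

def pk_map(point, centroids):
    """Mapper: emit (index of the nearest centroid, point)."""
    idx = min(range(len(centroids)),
              key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return idx, point

def pk_reduce(points):
    """Reducer: new centroid = average of the points assigned to one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def pkmeans_iteration(splits, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    grouped = defaultdict(list)
    for split in splits:                      # one mapper per data split
        for point in split:
            idx, p = pk_map(point, centroids)
            grouped[idx].append(p)            # shuffle by cluster index
    # one reducer per cluster (k in total); empty clusters keep their old centroid
    return [pk_reduce(grouped[i]) if grouped[i] else centroids[i]
            for i in range(len(centroids))]

splits = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
new_centroids = pkmeans_iteration(splits, [(0, 0), (10, 10)])
```

A driver would call `pkmeans_iteration` repeatedly, feeding each result back in as the next set of centroids, until the centroids stop changing.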

  8. 1.5 IPKMeans
     Phase 1: Data Partitioning
       KDTree
       Create subgroups
     Phase 2: Parallel KMeans
       Run KMeans on each subgroup
     Phase 3: Centroid Merging
       Pick the best (most central) centroids found in phase 2
     IPKMeans was proposed by [4]
     Figure 5. IPKMeans phase execution overview
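
Phase 1 can be pictured with a generic k-d-tree-style partition. The sketch below cuts the data at the median along alternating axes and treats the resulting leaves as the subgroups; this is an assumption made for illustration, and the exact partitioning rule in [4] may differ. The name `kd_partition` and its `depth` parameter are hypothetical. Phases 2 and 3 are not shown because the slide does not specify the merging rule in enough detail.

```python
def kd_partition(points, depth, axis=0):
    """Split points into at most 2**depth subgroups by recursive median cuts."""
    if depth == 0 or len(points) <= 1:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])   # order along the current axis
    mid = len(ordered) // 2                            # median cut
    next_axis = (axis + 1) % len(points[0])            # alternate axes at each level
    return (kd_partition(ordered[:mid], depth - 1, next_axis)
            + kd_partition(ordered[mid:], depth - 1, next_axis))

grid = [(x, y) for x in range(4) for y in range(4)]
subgroups = kd_partition(grid, depth=2)
```

With the 4 x 4 grid above and `depth=2`, the data is cut once along x and once along y, giving four subgroups of four points each. Each subgroup could then be clustered independently in phase 2.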

  9. 2 My Experimental Results

  10. 2.1 Hadoop Cluster Architecture
      Used Bitnami Hadoop VM
      Client Node
        Submits jobs
        SSH tunnel
      Master Node
        Manages jobs
      Service Node
        Job history management
      Slave Nodes
        Mapper
        Reducer
        Distributed data storage
      For more information on Hadoop configuration, see [2] [3]
      Figure 6. My OpenStack VM Hadoop Cluster Configuration

  11. 2.2 Results
      Figure 7. Data set with 3000 points and 3 Gaussian-distributed clusters
      Figure 8. Increasing dataset size with 10 nodes and 7 reducers
      Figure 9. Varying initial centroids with 3000 points, 7 reducers, and 10 nodes
      Figure 10. Varying initial centroids with 84000 points, 7 reducers, and 10 nodes

  12. 3 Questions [11]
      Question 1: What is a centroid in the KMeans algorithm?
      Question 2: What is Hadoop?
      Question 3: What is Big Data?

  13. 4 References
      [1] Apache Hadoop project. https://hadoop.apache.org/. Accessed: 2018-11-26.
      [2] Creating a Hadoop cluster. https://docs.bitnami.com/bch/apps/hadoop/getstarted/hadoop-cluster/. Accessed: 2018-11-26.
      [3] Understanding Hadoop clusters and the network. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/. Accessed: 2018-11-26.
      [4] Shikai Jin, Yuxuan Cui, and Chunli Yu. A new parallelization method for k-means. arXiv preprint arXiv:1608.06347, 2016.
      [5] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In IEEE International Conference on Cloud Computing, pages 674–679. Springer, 2009.
      [7] Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. ACM SIGMOD Record, 34(4):42–47, 2005.
      [8] https://www.wired.com/2012/06/wireless-power/
      [9] https://www.smartdatacollective.com/can-business-intelligence-answer-questions-asked-without-big-data/
      [10] https://commons.wikimedia.org/wiki/File:Hadoop_logo.svg
      [11] http://images.clipartpanda.com/light-bulbs-Light-Bulb.jpg
      [12] Oussama Lachiheb, Mohamed Salah Gouider, and Lamjed Ben Said. An improved MapReduce design of KMeans with iteration reducing for clustering stock exchange very large datasets. In Semantics, Knowledge and Grids (SKG), 2015 11th International Conference on, pages 252–255. IEEE, 2015.
      [13] V Priya and K Umamaheswari. Ensemble based parallel k means using map reduce for aspect based summarization. In Proceedings of the International Conference on Informatics and Analytics, page 26. ACM, 2016.
      [14] Anil R Surve and Nilesh S Paddune. A survey on Hadoop assisted k-means clustering of hefty volume images. International Journal on Computer Science & Engineering, 6(3):113–117, 2014.
