
HJ -Hadoop An Optimized MapReduce Runtime for Multi-core Systems


Presentation Transcript


  1. HJ-Hadoop: An Optimized MapReduce Runtime for Multi-core Systems. Yunming Zhang. Advised by: Prof. Alan Cox and Prof. Vivek Sarkar, Rice University

  2. Social Informatics (slide borrowed from Prof. Geoffrey Fox’s presentation)

  3. Big Data Era: 20 PB/day (2008), 100 PB of media (2012), a 120 PB cluster (2011), Bing ~300 PB (2012). Slide borrowed from Florin Dinu’s presentation

  4. MapReduce Runtime. [Figure 1. MapReduce Programming Model: the job starts, map tasks run over the input, their output feeds reduce tasks, and the job ends.]
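The model in Figure 1 can be illustrated with a minimal sketch in plain Java (not Hadoop's API): a map phase emits a pair per word, and a reduce phase sums the pairs that share a key.

```java
import java.util.*;
import java.util.stream.*;

// Minimal illustration of the MapReduce model from Figure 1:
// map emits <word, 1> pairs; reduce sums the counts per word.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            .flatMap(l -> Arrays.stream(l.split("\\s+")))             // map: line -> words
            .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum)); // reduce: sum per key
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b")).get("a")); // prints 2
    }
}
```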

  5. Hadoop MapReduce • Open-source implementation of the MapReduce runtime system • Scalable • Reliable • Available • Popular platform for big data analytics

  6. Habanero Java (HJ) • Programming language and runtime developed at Rice University • Optimized for multi-core systems • Lightweight async tasks • Work-sharing runtime • Dynamic task parallelism • http://habanero.rice.edu
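The "lightweight async task" and "work-sharing runtime" bullets can be mimicked in plain Java; the sketch below is an analogy, not the real HJ API: the pool submission plays the role of HJ's `async`, and the latch plays the role of an HJ `finish` scope that joins all tasks spawned inside it.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java analogy (NOT the actual HJ API) of HJ-style task parallelism:
// tasks are spawned into a work-stealing pool and joined at a finish point.
public class FinishAsyncDemo {
    public static int runTasks(int tasks) {
        ExecutorService pool = Executors.newWorkStealingPool(); // stands in for HJ's runtime
        AtomicInteger done = new AtomicInteger();
        CountDownLatch finish = new CountDownLatch(tasks);      // stands in for a finish scope
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {                                 // stands in for async
                done.incrementAndGet();
                finish.countDown();
            });
        }
        try { finish.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        pool.shutdown();
        return done.get();                                      // all tasks have completed here
    }

    public static void main(String[] args) {
        System.out.println(runTasks(8)); // prints 8
    }
}
```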

  7. Kmeans

  8. Kmeans. Kmeans is an application that takes as input a large number of documents and tries to classify them into different topics (cluster centroids). [Diagram: the to-be-classified documents are split into slices (slice1 … slice n) distributed across machines; on each machine, every map task runs in its own JVM and looks up keys against a full in-memory lookup table (the cluster centroids), so each JVM holds a duplicated copy of the full table; the resulting <key, val> pairs flow to the reducers.]
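The per-document "lookup" in the diagram is the k-means assignment step: find the nearest centroid for each document vector. A hypothetical sketch (class and method names are mine, not from the talk):

```java
// Hypothetical sketch of the map-side work in Kmeans: assign each
// document vector to the nearest cluster centroid in the lookup table.
public class KmeansAssign {
    // Squared Euclidean distance between a document vector and a centroid.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Index of the nearest centroid: conceptually the map output key.
    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
            if (dist2(doc, centroids[c]) < dist2(doc, centroids[best])) best = c;
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(nearest(new double[]{1, 1}, centroids)); // prints 0
        System.out.println(nearest(new double[]{9, 9}, centroids)); // prints 1
    }
}
```

Because every map-task JVM needs the whole `centroids` table in memory, the table is duplicated once per JVM on each machine, which is the memory problem the following slides quantify.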

  9. Kmeans using Hadoop. [Diagram: Machine 1 runs several map tasks, each in its own JVM; the to-be-classified documents are split into slices (slice1 … slice8) across machines, and each JVM holds its own duplicated in-memory copy of the cluster centroids (Topics).]

  10. Kmeans using Hadoop. [Same diagram as slide 9; animation build showing a second map-task JVM with its own duplicated copy of the cluster centroids (Topics).]

  11. Kmeans using Hadoop. [Same diagram; build showing a third duplicated copy of the cluster centroids (Topics).]

  12. Kmeans using Hadoop. [Same diagram; build showing a fourth duplicated copy of the cluster centroids (Topics).]

  13. Kmeans using Hadoop. [Same diagram; final build: every map-task JVM on the machine holds a duplicated in-memory copy of the cluster centroids (Topics 1x each).]

  14. Memory Wall. [Chart] We used 8 mappers for 30–80 MB, 4 mappers for 100–150 MB, and 2 mappers for 180–380 MB for sequential Hadoop.

  15. Memory Wall • Hadoop’s approach to the problem: increase the memory available to each map-task JVM by reducing the number of map tasks assigned to each machine.
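For reference, these are the two knobs Hadoop exposes for that trade-off: the per-node map-slot count and the per-task JVM heap. A sketch of a Hadoop 1.x `mapred-site.xml` (the property names shown are the 1.x names; they differ in Hadoop 2.x/YARN, and the values here are illustrative only):

```xml
<!-- Sketch of a Hadoop 1.x mapred-site.xml; values are illustrative -->
<configuration>
  <property>
    <!-- Fewer map slots per machine ... -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <!-- ... so each map-task JVM can be given a larger heap -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
  </property>
</configuration>
```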

  16. Kmeans using Hadoop. [Diagram: with fewer map tasks per machine, each remaining JVM can hold a larger (2x) Topics table, but the cluster centroids are still duplicated per JVM and fewer slices are processed concurrently.]

  17. Memory Wall. Decreased throughput due to the reduced number of map tasks per machine. We used 8 mappers for 30–80 MB, 4 mappers for 100–150 MB, and 2 mappers for 180–380 MB for sequential Hadoop.

  18. HJ-Hadoop Approach 1. [Diagram: a single map-task JVM per machine with dynamic chunking of the input slices; no duplicated in-memory cluster centroids, so the one shared table can be 4x larger (Cluster Centroids 4x / Topics 4x).]
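Dynamic chunking can be sketched as follows (this is my illustration of the idea, not HJ-Hadoop's actual code): one map-task JVM holds a single shared centroids table, and worker threads pull input chunks from a shared queue whenever they become idle.

```java
import java.util.*;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of Approach 1's dynamic chunking: one JVM, one shared table,
// worker threads grab the next input chunk as soon as they finish one.
public class DynamicChunking {
    static final double[][] SHARED_CENTROIDS = {{0, 0}, {10, 10}}; // one copy for all threads

    public static int processAll(List<int[]> slices, int threads) {
        Queue<int[]> work = new ConcurrentLinkedQueue<>(slices); // the dynamic-chunking queue
        AtomicInteger processed = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                int[] slice;
                while ((slice = work.poll()) != null) {
                    // ... classify each document in `slice` against SHARED_CENTROIDS ...
                    processed.addAndGet(slice.length);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return processed.get(); // total documents processed across all chunks
    }

    public static void main(String[] args) {
        List<int[]> slices = Arrays.asList(new int[]{1, 2}, new int[]{3, 4}, new int[]{5});
        System.out.println(processAll(slices, 4)); // prints 5
    }
}
```

Because the centroids table is shared rather than copied per task, the JVM can dedicate the reclaimed memory to a larger table, which is the 4x in the diagram.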

  19. HJ-Hadoop Approach 1. [Same diagram as slide 18; build emphasizing that there are no duplicated in-memory cluster centroids.]

  20. Results. [Chart] We used 2 mappers for HJ-Hadoop.

  21. Results. HJ-Hadoop processes 5x the Topics efficiently. We used 2 mappers for HJ-Hadoop.

  22. Results. 4x throughput improvement. We used 2 mappers for HJ-Hadoop.

  23. HJ-Hadoop Approach 1. [Same diagram as slide 18: one map-task JVM with dynamic chunking and no duplicated in-memory cluster centroids.]

  24. HJ-Hadoop Approach 1. Only a single thread reading input. [Same diagram, highlighting that the one map-task JVM has a single input-reading thread.]

  25. Kmeans using Hadoop. Four threads reading input. [Diagram: four map-task JVMs per machine, each reading its own input slice and holding a duplicated copy of the cluster centroids (Topics).]

  26. HJ-Hadoop Approach 2. Four threads reading input. [Diagram: four map tasks per machine, each reading its own input slice.]

  27. HJ-Hadoop Approach 2. Four threads reading input. [Same diagram; build showing no duplicated in-memory cluster centroids: the four map tasks share a single copy.]

  28. Trade-offs between the two approaches • Approach 1 • Minimum memory overhead • Improved CPU utilization with small task granularity • Approach 2 • Improved IO performance • Overlap between IO and computation • Hybrid approach • Improved IO with small memory overhead • Improved CPU utilization

  29. Conclusions • Our goal is to tackle the memory inefficiency in the execution of MapReduce applications on multi-core systems by integrating a shared-memory parallel model into the Hadoop MapReduce runtime • HJ-Hadoop can solve larger problems efficiently than Hadoop, processing 5x more data at the full throughput of the system • HJ-Hadoop can deliver 4x throughput relative to Hadoop when processing very large in-memory data sets
