Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Jiong Xie Ph.D. Student April 2010
Presentation Outline • Background • Motivation • Related Work • Design and Implementation • Experimental Results • Conclusion/Future Work
Background The MapReduce programming model is growing in popularity. Hadoop is used by Yahoo!, Facebook, and Amazon.
Data-Intensive Applications • Bioinformatics • Weather forecasting • Medical science • Astronautics
Hadoop Overview (J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137–150, 2004)
Hadoop Distributed File System (http://lucene.apache.org/hadoop)
Motivational Example [Figure: task timeline over time (min). Node A (fast) processes 1 task/min; Node B (slow) is 2x slower; Node C (slowest) is 3x slower.]
The Native Strategy [Figure: with even data placement, Node A completes 6 tasks, Node B 3 tasks, and Node C 2 tasks over the same time (min); bars show the loading, transferring, and processing phases.]
Our Solution: Reducing Data Transfer Time [Figure: with data placed in proportion to computing power, Node A' runs 6 tasks, Node B' 3 tasks, and Node C' 2 tasks, eliminating the transferring phase.]
Preliminary Results Impact of data placement on the performance of Grep
Challenges • Does the computing ratio depend on the application? • Initial data distribution (data skew problem) • Dynamic changes: new data arrival, data deletion, data updating, newly joined nodes
Measure Computing Ratios • Computing ratio • Fast machines process large data sets [Figure: Node A processes 1 task/min; Node B is 2x slower; Node C is 3x slower.]
Steps to Measure Computing Ratios 1. Run the application on each node with the same-size data and collect each node's response time individually. 2. Set the ratio of the shortest response time to 1, and set the ratios of the other nodes accordingly. 3. Calculate the least common multiple of these ratios. 4. Count the portion of data for each node.
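The four steps above can be sketched in code. This is a minimal illustration, not part of the original deck; the function name and the assumption that each response time is an integer multiple of the fastest one are ours.

```python
from math import lcm  # Python 3.9+

def computing_portions(response_times):
    """Steps 1-4: from per-node response times on same-size data,
    derive how many data shares each node should hold.

    response_times: node -> time (assumed an integer multiple of the fastest).
    """
    fastest = min(response_times.values())
    # Step 2: the fastest node has ratio 1; slower nodes scale up
    ratios = {node: t // fastest for node, t in response_times.items()}
    # Step 3: least common multiple of all ratios
    m = lcm(*ratios.values())
    # Step 4: a node's portion is inversely proportional to its ratio
    return {node: m // r for node, r in ratios.items()}
```

With the example nodes from the deck (A: 1 task/min, B: 2x slower, C: 3x slower) the ratios are 1:2:3, the LCM is 6, and the portions come out 6:3:2, matching the task counts in the timeline figure.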
Initial Data Distribution • Input files split into 64MB blocks • Round-robin data distribution algorithm [Figure: the namenode assigns the blocks of File1 to datanodes A, B, and C in portion 3:2:1.]
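The round-robin placement weighted by portions can be sketched as follows. This is our illustrative reconstruction, not the deck's HDFS implementation; block IDs stand in for 64MB blocks.

```python
from itertools import cycle

def round_robin_place(blocks, portions):
    """Distribute blocks to nodes round-robin, proportional to portions.

    portions: node -> number of blocks that node receives per round,
    e.g. {"A": 3, "B": 2, "C": 1} for the 3:2:1 example in the deck.
    """
    placement = {node: [] for node in portions}
    # one slot per portion unit; cycling over slots yields the weighting
    slots = [node for node, p in portions.items() for _ in range(p)]
    for block, node in zip(blocks, cycle(slots)):
        placement[node].append(block)
    return placement
```

For six blocks and portions 3:2:1, node A receives three blocks, B two, and C one, mirroring the figure's distribution.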
Data Redistribution 1. Get the network topology, the computing ratios, and disk utilization. 2. Build and sort two lists: under-utilized node list L1 and over-utilized node list L2. 3. Select a source node from L2 and a destination node from L1. 4. Transfer data. 5. Repeat steps 3–4 until the lists are empty. [Figure: the namenode moves blocks among datanodes A, B, and C to restore portion 3:2:1.]
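The redistribution loop above can be sketched as a rebalancer over an in-memory placement map. This is a simplified model of ours, not the HDFS mechanism itself: it ignores network topology and moves one block at a time between the most over- and most under-utilized nodes.

```python
def rebalance(placement, portions):
    """Steps 2-5: move blocks from over- to under-utilized nodes until
    each node holds its target share (proportional to its portion)."""
    total = sum(len(b) for b in placement.values())
    unit = total // sum(portions.values())
    target = {n: portions[n] * unit for n in portions}
    # Step 2: build and sort the two lists by how far each node is off target
    over = sorted((n for n in placement if len(placement[n]) > target[n]),
                  key=lambda n: len(placement[n]) - target[n], reverse=True)
    under = sorted((n for n in placement if len(placement[n]) < target[n]),
                   key=lambda n: target[n] - len(placement[n]), reverse=True)
    # Steps 3-5: transfer one block per iteration until both lists drain
    while over and under:
        src, dst = over[0], under[0]
        placement[dst].append(placement[src].pop())
        if len(placement[src]) <= target[src]:
            over.pop(0)
        if len(placement[dst]) >= target[dst]:
            under.pop(0)
    return placement
```

Starting from an even 2:2:2 layout with portions 3:2:1, the loop moves one block from C to A, yielding the 3:2:1 target.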
Sharing Files among Multiple Applications The computing ratio depends on the data-intensive application being run. Two options: • Redistribution • Redundancy
Experimental Environment Five nodes in a heterogeneous Hadoop cluster
Grep and WordCount Grep is a tool that searches for a regular expression in a text file; WordCount is a program that counts the words in a text file
Computing Ratios for the Two Applications Computing ratios of the five nodes with respect to the Grep and WordCount applications
Response Time of Grep and WordCount on Each Node • Application dependent • Data-size independent
Conclusion • Identified the performance degradation caused by heterogeneity. • Designed and implemented a data placement mechanism in HDFS.
Future Work • Data redundancy issue • Dynamic data distribution mechanism • Prefetching
Questions?