Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Jiong Xie Ph.D. Student April 2010
Presentation Outline • Background • Motivation • Related Work • Design and Implementation • Experimental Results • Conclusion/Future Work
Background The MapReduce programming model is growing in popularity. Hadoop is used by Yahoo!, Facebook, and Amazon.
Data-Intensive Applications • Bioinformatics • Weather forecasting • Medical science • Astronautics
Hadoop Overview (J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137–150, 2004)
Hadoop Distributed File System (http://lucene.apache.org/hadoop)
Motivational Example [Figure: task timeline over time (min). Node A (fast) processes 1 task/min; Node B (slow) is 2x slower; Node C (slowest) is 3x slower.]
The Native Strategy [Figure: with even data placement, Node A completes 6 tasks, Node B 3 tasks, and Node C 2 tasks over the same time (min); bars show the loading, transferring, and processing phases.]
Our Solution: Reducing Data Transfer Time [Figure: with data placed in proportion to computing power, Node A' runs 6 tasks, Node B' 3 tasks, and Node C' 2 tasks, eliminating the transferring phase.]
Preliminary Results Impact of data placement on the performance of Grep
Challenges • Does the computing ratio depend on the application? • Initial data distribution (data skew problem) • Dynamic changes: new data arrival, data deletion, data updating, newly joined nodes
Measure Computing Ratios • Computing ratio • Fast machines process large data sets [Figure: Node A processes 1 task/min; Node B is 2x slower; Node C is 3x slower.]
Steps to Measure Computing Ratios 1. Run the application on each node with the same-size data and collect each node's response time individually. 2. Set the ratio of the shortest response time to 1, and set the ratios of the other nodes accordingly. 3. Calculate the least common multiple of these ratios. 4. Count the portion of data for each node.
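The four steps above can be sketched in code. This is a minimal illustration, not part of the original deck; the function name and the assumption that each response time is an integer multiple of the fastest one are ours.

```python
from math import lcm  # Python 3.9+

def computing_portions(response_times):
    """Steps 1-4: from per-node response times on same-size data,
    derive how many data shares each node should hold.

    response_times: node -> time (assumed an integer multiple of the fastest).
    """
    fastest = min(response_times.values())
    # Step 2: the fastest node has ratio 1; slower nodes scale up
    ratios = {node: t // fastest for node, t in response_times.items()}
    # Step 3: least common multiple of all ratios
    m = lcm(*ratios.values())
    # Step 4: a node's portion is inversely proportional to its ratio
    return {node: m // r for node, r in ratios.items()}
```

With the example nodes from the deck (A: 1 task/min, B: 2x slower, C: 3x slower) the ratios are 1:2:3, the LCM is 6, and the portions come out 6:3:2, matching the task counts in the timeline figure.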
Initial Data Distribution • Input files split into 64MB blocks • Round-robin data distribution algorithm [Figure: the namenode assigns the blocks of File1 to datanodes A, B, and C in portion 3:2:1.]
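The round-robin placement weighted by portions can be sketched as follows. This is our illustrative reconstruction, not the deck's HDFS implementation; block IDs stand in for 64MB blocks.

```python
from itertools import cycle

def round_robin_place(blocks, portions):
    """Distribute blocks to nodes round-robin, proportional to portions.

    portions: node -> number of blocks that node receives per round,
    e.g. {"A": 3, "B": 2, "C": 1} for the 3:2:1 example in the deck.
    """
    placement = {node: [] for node in portions}
    # one slot per portion unit; cycling over slots yields the weighting
    slots = [node for node, p in portions.items() for _ in range(p)]
    for block, node in zip(blocks, cycle(slots)):
        placement[node].append(block)
    return placement
```

For six blocks and portions 3:2:1, node A receives three blocks, B two, and C one, mirroring the figure's distribution.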
Data Redistribution 1. Get the network topology, the computing ratios, and disk utilization. 2. Build and sort two lists: under-utilized node list L1 and over-utilized node list L2. 3. Select a source node from L2 and a destination node from L1. 4. Transfer data. 5. Repeat steps 3–4 until the lists are empty. [Figure: the namenode moves blocks among datanodes A, B, and C to restore portion 3:2:1.]
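The redistribution loop above can be sketched as a rebalancer over an in-memory placement map. This is a simplified model of ours, not the HDFS mechanism itself: it ignores network topology and moves one block at a time between the most over- and most under-utilized nodes.

```python
def rebalance(placement, portions):
    """Steps 2-5: move blocks from over- to under-utilized nodes until
    each node holds its target share (proportional to its portion)."""
    total = sum(len(b) for b in placement.values())
    unit = total // sum(portions.values())
    target = {n: portions[n] * unit for n in portions}
    # Step 2: build and sort the two lists by how far each node is off target
    over = sorted((n for n in placement if len(placement[n]) > target[n]),
                  key=lambda n: len(placement[n]) - target[n], reverse=True)
    under = sorted((n for n in placement if len(placement[n]) < target[n]),
                   key=lambda n: target[n] - len(placement[n]), reverse=True)
    # Steps 3-5: transfer one block per iteration until both lists drain
    while over and under:
        src, dst = over[0], under[0]
        placement[dst].append(placement[src].pop())
        if len(placement[src]) <= target[src]:
            over.pop(0)
        if len(placement[dst]) >= target[dst]:
            under.pop(0)
    return placement
```

Starting from an even 2:2:2 layout with portions 3:2:1, the loop moves one block from C to A, yielding the 3:2:1 target.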
Sharing Files among Multiple Applications The computing ratio depends on the data-intensive application being run. Two options: • Redistribution • Redundancy
Experimental Environment Five nodes in a heterogeneous Hadoop cluster
Grep and WordCount Grep is a tool that searches for a regular expression in a text file; WordCount is a program that counts the words in a text file
Computing Ratios for the Two Applications Computing ratios of the five nodes with respect to the Grep and WordCount applications
Response Time of Grep and WordCount on Each Node • Application dependent • Data-size independent
Conclusion • Identified the performance degradation caused by heterogeneity. • Designed and implemented a data placement mechanism in HDFS.
Future Work • Data redundancy issue • Dynamic data distribution mechanism • Prefetching
Questions?