A Hadoop MapReduce Performance Prediction Method Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin# * University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France + Ecole Centrale de Paris, France # Beihang University, Beijing, China
Background • Hadoop MapReduce: the input data stored in HDFS is divided into splits; each split is processed by a Map task, which emits (key, value) pairs; the pairs are grouped into partitions (Partition 1, Partition 2, ...) and consumed by Reduce tasks. [Diagram: Input Data → Splits → Map tasks → (key, value) pairs → Partitions → Reduce tasks → HDFS]
Background • Hadoop • Many steps within the Map stage and the Reduce stage • Different steps may consume different types of resources [Diagram: Map task pipeline: READ → MAP → SORT → MERGE → OUTPUT]
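To make the "map" step in that pipeline concrete, here is a standard word-count style Mapper using the generic Hadoop API (illustrative background only, not code from this work); the READ, SORT, MERGE and OUTPUT steps listed above are performed by the framework around this user-supplied function:

```java
// Minimal word-count style Mapper: Hadoop reads the split (READ), calls map() once per
// record, then sorts, merges and writes the emitted (key, value) pairs (SORT/MERGE/OUTPUT).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (key, value); the framework handles the rest
        }
    }
}
```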
Motivation • Problems • Scheduling: no consideration of the execution time or of the different types of resources consumed (e.g. several CPU-intensive jobs placed on the same Hadoop nodes) • Parameter tuning: numerous parameters, and the default values are not optimal (jobs run with the default Hadoop configuration)
Motivation • Solution • Both problems, scheduling (no consideration of execution time and resource types consumed) and parameter tuning (numerous parameters with non-optimal defaults), call for the same capability: predict the performance of Hadoop jobs
Related Work • Existing prediction method 1: black-box based • Job features are fed to statistical / machine-learning models that output an execution time • Drawbacks: no analysis of Hadoop internals, and the job features are hard to choose
Related Work • Existing prediction method 2: cost-model based • The execution time is expressed as a function of job features, e.g. F(map) = f(read, map, sort, spill, merge, write) and F(reduce) = f(read, write, merge, reduce, write) • Drawbacks: lots of concurrent processes, the stages are hard to divide, and accuracy is difficult to ensure
Related Work • A brief summary of existing prediction methods • Prediction is simplistic • No analysis of the job itself (jar package + input data)
Goal • Design a Hadoop MapReduce performance prediction system to: - Predict the job's consumption of various types of resources (CPU, Disk IO, Network) - Predict the execution time of the Map phase and the Reduce phase • For a given job, the prediction system outputs: Map execution time, Reduce execution time, CPU occupation time, Disk occupation time, Network occupation time
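The five predicted quantities can be held in a small container like the one below; this is only an illustrative sketch of the system's output, and the field names are assumptions, not identifiers from the paper:

```java
// Illustrative container for the five predicted quantities listed above.
public class JobPrediction {
    public double mapTimeSec;        // predicted Map phase execution time
    public double reduceTimeSec;     // predicted Reduce phase execution time
    public double cpuOccupationSec;  // predicted CPU occupation time
    public double diskOccupationSec; // predicted disk occupation time
    public double netOccupationSec;  // predicted network occupation time
}
```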
Design - 1 • Cost Model • The job is fed to a cost model, which outputs: Map execution time, Reduce execution time, CPU occupation time, Disk occupation time, Network occupation time
Cost Model [1] • Analysis of the Map task - Model the consumption of each resource (CPU, Disk, Network) - Each stage involves only one type of resource [Diagram: Map task stages (Initiation, Read Data, Create Object, Map Function, Sort In Memory, Serialization, Merge Sort, Read/Write Disk, Write Disk, Network Transfer), each mapped to one of CPU, Disk, or Network] [1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.
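To make the decomposition concrete, a schematic sketch follows, with one plausible stage-to-resource assignment; this is an illustration of the idea, not the exact equations of [1]:

```latex
T_{map}^{CPU}  \approx T_{init} + T_{create\,obj} + T_{map\,func}(N) + T_{sort}(N \log N) + T_{serialize}(N)
T_{map}^{Disk} \approx T_{read\,input} + T_{merge\,rw} + T_{write\,output}
T_{map}^{Net}  \approx T_{transfer}
```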
Cost Model [1] • Cost function parameter analysis • Type One: constants • Hadoop system overhead, initialization overhead • Type Two: job-related parameters • Computational complexity of the map function, number of map input records • Type Three: parameters defined by the cost model • Sorting coefficient, complexity factor [1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.
Parameters Collection • Type One and Type Three • Type One: run empty map tasks and compute the system overhead from the logs • Type Three: extract the sort code from the Hadoop source and sort a fixed number of records • Type Two • Option A: run the new job and analyze its logs • High latency • Large overhead • Option B (Job Analyzer): sample the input data and only analyze the behavior of the map and reduce functions • Almost no latency • Very low extra overhead (see the sampling sketch below)
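A minimal sketch of the "sample a small percentage of the input" idea in plain Java; a uniform random sample is one simple choice, and the Job Analyzer's actual sampling strategy may differ:

```java
// Keep roughly `ratio` of the input lines (e.g. ratio = 0.05 for 5%).
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class InputSampler {
    public static List<String> sample(String path, double ratio, long seed) throws IOException {
        Random rnd = new Random(seed);
        List<String> sampled = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (rnd.nextDouble() < ratio) {
                    sampled.add(line);   // record survives the sample
                }
            }
        }
        return sampled;
    }
}
```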
Job Analyzer - Implementation • Hadoop virtual execution environment • Accepts the job's Jar file and input data • Sampling Module • Samples the input data at a fixed percentage (less than 5%) • MR Module • Instantiates the user job's classes using Java reflection • Analyze Module • Input data (amount and number of records) • Relative computational complexity • Data conversion rate (output/input) [Diagram: Jar file + input data → Hadoop virtual execution environment (Sampling Module, MR Module, Analyze Module) → job features] (a reflection sketch follows)
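A hedged sketch of the MR + Analyze modules: load the user's class by reflection, run it over the sampled records, and derive a relative per-record compute cost and the data conversion rate. The MapFunction interface is a hypothetical stand-in for the paper's virtual execution environment, not a real Hadoop type:

```java
import java.util.List;

public final class MapFunctionProfiler {

    /** Hypothetical single-record view of a user map function (assumption for this sketch). */
    public interface MapFunction {
        List<String> map(String record);
    }

    public static double[] profile(String className, List<String> sampledRecords) throws Exception {
        // Java reflection, as the Job Analyzer does with the user job's classes.
        MapFunction userMap = (MapFunction) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();

        long outputRecords = 0;
        long start = System.nanoTime();
        for (String record : sampledRecords) {
            outputRecords += userMap.map(record).size();
        }
        long elapsedNs = System.nanoTime() - start;

        double timePerRecord  = (double) elapsedNs / sampledRecords.size();     // relative complexity
        double conversionRate = (double) outputRecords / sampledRecords.size(); // output / input
        return new double[] { timePerRecord, conversionRate };
    }
}
```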
Job Analyzer - Feasibility • Data similarity: log records have a uniform format • Execution similarity: every record is processed by the same map and reduce functions repeatedly [Diagram: Input data → splits → Map tasks → Reduce tasks]
Design - 2 • Parameters Collection • Job Analyzer: collects parameters of Type 2 • Static Parameters Collection Module: collects parameters of Type 1 and Type 3 • Both feed the cost model, which outputs: Map execution time, Reduce execution time, CPU occupation time, Disk occupation time, Network occupation time
Prediction Model • Problem analysis - Many steps run concurrently, so the total time cannot be obtained by simply adding up the time of each part [Diagram: Map task stages (Initiation, Read Data, Create Object, Map Function, Sort In Memory, Serialization, Merge Sort, Read/Write Disk, Write Disk, Network Transfer) overlapping on CPU, Disk and Network]
Prediction Model • Main factors (according to the performance model) - Map stage: Tmap = α0 + α1 * MapInput + α2 * N + α3 * N * log(N) + α4 * (complexity of the map function) + α5 * (conversion rate of the map data) • Factors: the amount of input data, the number of input records (N), N log N, the complexity of the Map function, and the conversion rate of the Map data
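Once the coefficients α0..α5 have been fitted, evaluating the model is a simple dot product; a trivial sketch, with variable names chosen for this illustration:

```java
// T_map = a0 + a1*mapInput + a2*n + a3*n*log(n) + a4*complexity + a5*conversionRate.
// coeffs = {a0, ..., a5}, fitted elsewhere (see the instance-based regression sketch later).
public static double predictMapTime(double[] coeffs, double mapInputBytes, double n,
                                     double mapComplexity, double mapConversionRate) {
    return coeffs[0]
         + coeffs[1] * mapInputBytes
         + coeffs[2] * n
         + coeffs[3] * n * Math.log(n)
         + coeffs[4] * mapComplexity
         + coeffs[5] * mapConversionRate;
}
```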
Prediction Model • Experimental analysis • Test 4 kinds of jobs (0-10000 records) • Extract the features for linear regression • Calculate the coefficient of determination (R²), as in the sketch below
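The R² used to check the fit can be computed with the standard formula (this is generic code, not taken from the paper):

```java
// R^2 = 1 - SS_res / SS_tot for predicted vs. observed execution times.
public static double rSquared(double[] observed, double[] predicted) {
    double mean = 0.0;
    for (double y : observed) mean += y;
    mean /= observed.length;

    double ssRes = 0.0, ssTot = 0.0;
    for (int i = 0; i < observed.length; i++) {
        ssRes += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
        ssTot += (observed[i] - mean) * (observed[i] - mean);
    }
    return 1.0 - ssRes / ssTot;
}
```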
Prediction Model • [Plot: execution time of Map vs. number of records, for several kinds of jobs] • Very good linear relationship within the same kind of job • But no linear relationship among different kinds of jobs
Find the nearest jobs! • Instance-based linear regression • Find the samples nearest to the job to be predicted in the history logs • "Nearest" means similar jobs (top K nearest, with K = 10%-15%) • Do a linear regression over the samples found • Compute the predicted value • Nearest: the weighted distance over job features (weight w) • High contribution to job classification: map/reduce complexity, map/reduce data conversion rate • Low contribution to job classification: data amount, number of records • (A compact sketch of this procedure follows.)
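The sketch below makes two simplifying assumptions: the weighted distance uses only the two high-contribution features, and the local regression is fitted on a single regressor (the number of records), which is where the linear relationship was observed; the actual system fits all the main factors listed earlier, and the weights and K ratio here are illustrative:

```java
// Instance-based prediction sketch: pick the top-K most similar past jobs using a weighted
// feature distance, then fit a least-squares line (time vs. record count) on those
// neighbours and evaluate it for the new job.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public final class NearestJobsPredictor {

    public static final class JobSample {
        double mapComplexity;     // relative complexity of the map function
        double mapConversionRate; // map output records / input records
        double numRecords;        // number of input records
        double mapTimeSec;        // observed Map phase time
        public JobSample(double c, double r, double n, double t) {
            mapComplexity = c; mapConversionRate = r; numRecords = n; mapTimeSec = t;
        }
    }

    // High-contribution features carry the weight; data amount and record count are ignored here.
    static double distance(JobSample a, JobSample b, double wComplexity, double wConversion) {
        double dc = a.mapComplexity - b.mapComplexity;
        double dr = a.mapConversionRate - b.mapConversionRate;
        return Math.sqrt(wComplexity * dc * dc + wConversion * dr * dr);
    }

    public static double predict(List<JobSample> history, JobSample query, double kRatio) {
        int k = Math.max(2, (int) Math.round(kRatio * history.size())); // e.g. kRatio = 0.12
        List<JobSample> neighbours = history.stream()
                .sorted(Comparator.comparingDouble(s -> distance(s, query, 10.0, 10.0)))
                .limit(k)
                .collect(Collectors.toList());

        // Ordinary least squares of mapTimeSec against numRecords over the neighbours.
        double sx = 0, sy = 0, sxx = 0, sxy = 0, n = neighbours.size();
        for (JobSample s : neighbours) {
            sx += s.numRecords; sy += s.mapTimeSec;
            sxx += s.numRecords * s.numRecords; sxy += s.numRecords * s.mapTimeSec;
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return intercept + slope * query.numRecords;
    }
}
```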
Prediction Module • Procedure: the job features and the main factors from the cost model (Tmap = α0 + α1 * MapInput + α2 * N + α3 * N * log(N) + α4 * (complexity of the map function) + α5 * (conversion rate of the map data)) drive a search for the nearest samples; the prediction function is fitted on them and produces the prediction results [Diagram: numbered data flow among these components]
Prediction Module • Procedure [Diagram: Cost Model, Find-Neighbor Module, Prediction Function, Training Set, Prediction Results]
Design - 3 • Parameters Collection • Job Analyzer: collects parameters of Type 2 • Static Parameters Collection Module: collects parameters of Type 1 and Type 3 • The cost model and the Prediction Module together output: Map execution time, Reduce execution time, CPU occupation time, Disk occupation time, Network occupation time
Experiments • Task execution time (error rate) • K = 12%, with a different weight w for each feature • K = 12%, with the same weight w for each feature • K = 25%, with a different weight w for each feature • 4 kinds of jobs, input sizes from 64 MB to 8 GB [Plots: error rate per Job ID for each setting]
Conclusion • Job Analyzer: • Analyzes the job jar + input file • Collects the parameters • Prediction Module: • Identifies the main factors • Proposes a linear prediction equation • Classifies jobs (nearest-neighbor selection) • Predicts multiple metrics
Thank you! Questions?
Cost Model [1] • Analysis of the Reduce task - Model the consumption of each resource (CPU, Disk, Network) - Each stage involves only one type of resource [Diagram: Reduce task stages (Initiation, Network Transfer, Deserialization, Create Object, Read Data, Merge Sort, Reduce Function, Serialization, Read/Write Disk, Write Disk), each mapped to one of CPU, Disk, or Network]
Prediction Model • Main factors (according to the performance model) - Reduce stage: Treduce = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the Reduce function) + β5 * (conversion rate of the Map data) + β6 * (conversion rate of the Reduce data) • Factors: the amount of input data, the number of input records (N), N log N, the complexity of the Reduce function, the conversion rate of the Map data, and the conversion rate of the Reduce data