A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Background
• Hadoop MapReduce
[Figure: a job reads input data from HDFS as splits; Map tasks emit (Key, Value) pairs into partitions (Partition 1, Partition 2); Reduce tasks process the partitions.]
Background
• Hadoop
  - There are many steps within the Map stage and the Reduce stage.
  - Different steps may consume different types of resources.
[Figure: steps of a Map task: Read, Map, Sort, Merge, Output.]
Motivation
• Problems
  - Scheduling: no consideration of the execution time or of the different types of resources that jobs consume.
  - Parameter tuning: there are numerous parameters, and the default values are not optimal.
[Figure: several CPU-intensive jobs placed on Hadoop; a job running on Hadoop with the default configuration.]
Motivation
• Solution: predict the performance of Hadoop jobs
  - Scheduling: no consideration of the execution time or of the different types of resources that jobs consume.
  - Parameter tuning: there are numerous parameters, and the default values are not optimal.
Related Work
• Existing Prediction Method 1: Black Box Based
[Figure: Job Features -> Statistic/Learning Models -> Execution Time, annotated "Lack of analysis of Hadoop" and "Hard to choose".]
Related Work
• Existing Prediction Method 2: Cost Model Based
  - F(map) = f(read, map, sort, spill, merge, write)
  - F(reduce) = f(read, write, merge, reduce, write)
  - Lots of concurrent processes
  - Hard to divide stages
  - Difficult to ensure accuracy
[Figure: Job Feature -> cost functions over the Hadoop stages (Read, map, ..., Output; Read, ..., reduce, Output) -> Execution Time.]
Related Work
• A Brief Summary of Existing Prediction Methods
  - Simple prediction
  - Lack of analysis of the jobs themselves (JAR package + data)
Goal
• Design a Hadoop MapReduce performance prediction system to:
  - Predict the job's consumption of various types of resources (CPU, disk I/O, network)
  - Predict the execution time of the Map phase and the Reduce phase
[Figure: Job -> Prediction System -> Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.]
Design - 1
• Cost Model
[Figure: Job -> Cost Model -> Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.]
Cost Model [1]
• Analysis of Map
  - Model the consumption of resources (CPU, disk, network)
  - Each stage involves only one type of resource
[Figure: Map steps grouped by resource (CPU, Disk, Net): Initiation, Read Data, Sort In Memory, Merge Sort, Serialization, Network Transfer, Create Object, Read/Write Disk, Write Disk, Map Function.]
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231-239.
Cost Model [1]
• Cost Function Parameters Analysis
  - Type One: constants (Hadoop system overhead, initialization overhead)
  - Type Two: job-related parameters (computational complexity of the map function, number of map input records)
  - Type Three: parameters defined by the cost model (sorting coefficient, complexity factor)
Parameters Collection
• Type One and Type Three
  - Type One: run empty map tasks and calculate the system consumption from the logs.
  - Type Three: extract the sort part from the Hadoop source code and sort a certain number of records.
• Type Two
  - Option 1: run a new job and analyze its log. High latency, large overhead.
  - Option 2: sample the data and analyze only the behavior of the map function and the reduce function. Almost no latency, very low extra overhead. This is the Job Analyzer.
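The Type Three measurement above (sort a known number of records, then derive a sorting coefficient) might be sketched as follows. This is only an illustration: Python's built-in `sorted` stands in for the sort code extracted from Hadoop, the function names `time_sort` and `fit_sort_coefficient` are hypothetical, and the cost is assumed to scale as c * N * log(N) with a natural logarithm.

```python
import math
import random
import time


def time_sort(num_records, seed=0):
    """Return the wall-clock seconds needed to sort num_records random floats."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(num_records)]
    start = time.perf_counter()
    sorted(data)  # stand-in for the sort routine extracted from Hadoop
    return time.perf_counter() - start


def fit_sort_coefficient(sizes, seconds):
    """Least-squares fit of seconds = c * N * log(N) through the origin."""
    xs = [n * math.log(n) for n in sizes]
    denominator = sum(x * x for x in xs)
    if denominator == 0:
        raise ValueError("need at least one size greater than 1")
    return sum(x * t for x, t in zip(xs, seconds)) / denominator


# Hypothetical record counts; real runs would use sizes typical of a map task.
sizes = [10_000, 20_000, 40_000]
coefficient = fit_sort_coefficient(sizes, [time_sort(n) for n in sizes])
```

The fitted value depends on the machine, which is why the slides treat it as a parameter to be measured once per cluster rather than per job.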
Job Analyzer - Implementation
• Hadoop virtual execution environment
  - Accepts the job JAR file and the input data
• Sampling Module
  - Samples the input data by a certain percentage (less than 5%)
• MR Module
  - Instantiates the user job's classes using Java reflection
• Analyze Module
  - Input data (amount and number of records)
  - Relative computational complexity
  - Data conversion rates (output/input)
[Figure: JAR file + input data -> Hadoop virtual execution environment (Sampling Module, MR Module, Analyze Module) -> Job Feature.]
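The sampling and analysis steps above can be illustrated with a small sketch. It is not the authors' implementation (which loads the user's classes through Java reflection); it is a plain-Python stand-in where `analyze_sample` and `word_count_map` are hypothetical names, records are strings, and the map function returns a list of (key, value) pairs.

```python
import random


def analyze_sample(records, map_fn, sample_rate=0.05, seed=0):
    """Run map_fn on a random sample of records and summarize the result.

    Nothing here touches Hadoop: records is an in-memory list of strings
    and map_fn is any callable returning (key, value) pairs for one record.
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    # Keep at least one record so the rates below are defined.
    sample_size = max(1, int(len(records) * sample_rate))
    sample = rng.sample(records, sample_size)

    input_bytes = sum(len(r.encode("utf-8")) for r in sample)
    output_pairs = [pair for r in sample for pair in map_fn(r)]
    output_bytes = sum(
        len(str(k).encode("utf-8")) + len(str(v).encode("utf-8"))
        for k, v in output_pairs
    )

    return {
        "sampled_records": sample_size,
        "sampled_bytes": input_bytes,
        # Data conversion rates (output/input), by record count and by size.
        "record_conversion_rate": len(output_pairs) / sample_size,
        "byte_conversion_rate": output_bytes / input_bytes if input_bytes else 0.0,
    }


def word_count_map(line):
    """Toy map function: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


lines = ["a b c"] * 1000
features = analyze_sample(lines, word_count_map, sample_rate=0.05)
```

The relative computational complexity listed on the slide is left out here because the slides do not say how it is measured.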
Job Analyzer - Feasibility
• Data similarity: logs have a uniform format
• Execution similarity: each record is processed by the same map and reduce functions repeatedly
[Figure: input data is divided into splits, each handled by a Map task, followed by Reduce tasks.]
Design - 2
• Parameters Collection
  - Job Analyzer: collects parameters of Type 2
  - Static Parameters Collection Module: collects parameters of Type 1 and Type 3
[Figure: both collectors feed the Cost Model, which outputs Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.]
Prediction Model
• Problem Analysis
  - Many steps run concurrently, so the total time cannot be obtained by adding up the time of each part.
[Figure: Map steps grouped by resource (CPU, Disk, Net): Initiation, Read Data, Sort In Memory, Merge Sort, Serialization, Network Transfer, Create Object, Read/Write Disk, Write Disk, Map Function.]
Prediction Model
• Main Factors (according to the performance model): Map Stage
  Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*(complexity of the map function) + α5*(conversion rate of the map data)
  - MapInput: the amount of input data
  - N: the number of input records
  - N*log(N)
  - The complexity of the map function
  - The conversion rate of the map data
[Figure: Map steps: Initiation, Read Data, Sort In Memory, Merge Sort, Serialization, Network Transfer, Create Object, Read/Write Disk, Write Disk, Map Function.]
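To make the Map-stage equation concrete, the sketch below builds the feature vector (1, MapInput, N, N*log(N), complexity, conversion rate) and evaluates the linear form for a given set of coefficients. The function names, the natural logarithm, and the coefficient values are assumptions for illustration; in the method described here the coefficients come from regression over history logs.

```python
import math


def map_features(map_input, num_records, map_complexity, map_conversion_rate):
    """Return [1, MapInput, N, N*log(N), complexity, conversion rate]."""
    n = num_records
    # log(0) is undefined; treat an empty input as contributing nothing.
    n_log_n = n * math.log(n) if n > 0 else 0.0
    return [
        1.0,
        float(map_input),
        float(n),
        n_log_n,
        float(map_complexity),
        float(map_conversion_rate),
    ]


def predict_map_time(alpha, features):
    """Evaluate Tmap = sum(alpha_i * feature_i)."""
    if len(alpha) != len(features):
        raise ValueError("alpha and features must have the same length")
    return sum(a * x for a, x in zip(alpha, features))


# Hypothetical coefficients alpha0..alpha5, not measured values.
alpha = [2.0, 1e-7, 1e-4, 1e-5, 0.5, 0.3]
x = map_features(map_input=64 * 1024 * 1024, num_records=10_000,
                 map_complexity=1.5, map_conversion_rate=0.8)
t_map = predict_map_time(alpha, x)
```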
Prediction Model
• Experimental Analysis
  - Test 4 kinds of jobs (0-10000 records)
  - Extract the features for linear regression
  - Calculate the correlation coefficient (R^2)
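The regression check described above can be sketched for a single feature, for example map execution time against the number of records. This is a minimal pure-Python illustration with hypothetical function names and made-up timings; the slides do not state which tool the authors used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, intercept, slope):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("ys must not all be equal")
    return 1.0 - ss_res / ss_tot


# Made-up measurements for one kind of job: (number of records, seconds).
records = [1000, 2000, 4000, 8000, 10000]
seconds = [1.2, 2.1, 4.3, 8.2, 10.1]
intercept, slope = fit_line(records, seconds)
fit_quality = r_squared(records, seconds, intercept, slope)
```

An R^2 close to 1 for one kind of job, and a much lower value when several kinds are pooled, is the pattern that motivates fitting on similar jobs only.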
Prediction Model
[Figure: execution time of Map (y-axis) against number of records (x-axis).]
• Very good linear relationship within the same kind of job.
• But no linear relationship among different kinds of jobs.
Find the nearest jobs!
• Instance-Based Linear Regression
  - Find the samples nearest to the job to be predicted in the history logs
  - "Nearest" means similar jobs (top K nearest, with K = 10%-15%)
  - Run a linear regression on the samples found
  - Calculate the predicted value
• Nearest: the weighted distance between job features (weight w)
  - High contribution to job classification: map/reduce complexity, map/reduce data conversion rate
  - Low contribution to job classification: data amount, number of records
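The procedure above might look like the following sketch. Several details are assumptions rather than the authors' choices: the distance is a weighted Euclidean distance, each job has three features (number of records, complexity, conversion rate), the weight values are invented, and the regression uses only the number of records as regressor to keep the example short.

```python
def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5


def nearest_samples(history, query, weights, k_fraction=0.12):
    """Return the top K fraction of history samples closest to the query."""
    k = max(2, round(len(history) * k_fraction))  # a line needs two points
    ranked = sorted(
        history,
        key=lambda s: weighted_distance(s["features"], query, weights),
    )
    return ranked[:k]


def predict_time(history, query, weights, k_fraction=0.12, regressor=0):
    """Fit a line on the nearest samples only, then evaluate it at the query."""
    neighbors = nearest_samples(history, query, weights, k_fraction)
    xs = [s["features"][regressor] for s in neighbors]
    ys = [s["time"] for s in neighbors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y  # all neighbors share one x value: fall back to the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y + slope * (query[regressor] - mean_x)


# Synthetic history with two kinds of jobs; features are
# (number of records, complexity, conversion rate). All values are made up.
history = []
for n in range(500, 10001, 500):
    history.append({"features": (n, 1.0, 0.5), "time": 0.001 * n + 1.0})
    history.append({"features": (n, 5.0, 2.0), "time": 0.010 * n + 3.0})

# Low weight on the record count, high weight on complexity and conversion rate.
weights = (1e-6, 1.0, 1.0)
query = (4200, 1.0, 0.5)
predicted = predict_time(history, query, weights)
```

With these weights the five nearest samples all belong to the first kind of job, so the fitted line follows that kind alone and is not pulled toward the slower second kind.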
Prediction Module
• Procedure
  - Main factors from the cost model: Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N) + α4*(complexity of the map function) + α5*(conversion rate of the map data)
[Figure: flow with steps numbered 1-7 connecting Job Features, Search for the nearest samples, Cost Model, Main Factors, Prediction Function, and Prediction Results.]
Prediction Module
• Procedure
[Figure: components of the procedure: Cost Model, Find-Neighbor Module, Prediction Function, Training Set, Prediction Results.]
Design - 3
• Parameters Collection
  - Job Analyzer: collects parameters of Type 2
  - Static Parameters Collection Module: collects parameters of Type 1 and Type 3
• Prediction Module
[Figure: both collectors feed the Cost Model and the Prediction Module, which output Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.]
Experiments
• Task execution time (error rate), compared under three settings:
  - K = 12%, with a different weight w for each feature
  - K = 12%, with the same weight w for each feature
  - K = 25%, with a different weight w for each feature
• 4 kinds of jobs, 64 MB-8 GB
[Figure: error rate plots; x-axis: Job ID.]
Conclusion
• Job Analyzer:
  - Analyzes the job JAR and the input file
  - Collects parameters
• Prediction Module:
  - Finds the main factors
  - Proposes a linear equation
  - Job classification
  - Multiple prediction
Thank you! Questions?
Cost Model [1]
• Analysis of Reduce
  - Model the consumption of resources (CPU, disk, network)
  - Each stage involves only one type of resource
[Figure: Reduce steps grouped by resource (CPU, Disk, Net): Initiation, Create Object, Read Data, Merge Sort, Reduce Function, Read/Write Disk, Write Disk, Network Transfer, Serialization, Deserialization, Network.]
Prediction Model
• Main Factors (according to the performance model): Reduce Stage
  Treduce = β0 + β1*MapInput + β2*N + β3*N*log(N) + β4*(complexity of the reduce function) + β5*(conversion rate of the map data) + β6*(conversion rate of the reduce data)
  - MapInput: the amount of input data
  - N: the number of input records
  - N*log(N)
  - The complexity of the reduce function
  - The conversion rate of the map data
  - The conversion rate of the reduce data
[Figure: Reduce steps: Initiation, Create Object, Read Data, Merge Sort, Reduce Function, Read/Write Disk, Write Disk, Network Transfer, Serialization, Deserialization, Network.]