
A Hadoop MapReduce Performance Prediction Method



Presentation Transcript


  1. A Hadoop MapReduce Performance Prediction Method Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin# * University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France + Ecole Centrale de Paris, France # Beihang University, Beijing, China

  2. Background • Hadoop MapReduce: the input data on HDFS is split across parallel Map tasks, which emit (key, value) pairs; the pairs are partitioned among Reduce tasks, which produce the final output. [Figure: MapReduce job data flow from input splits through Map, partitions and Reduce, back to HDFS]
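As a concrete illustration of this flow (not part of the original slides), a minimal WordCount job in the standard Hadoop Java API shows a Map function emitting (key, value) pairs that are partitioned by key and aggregated by a Reduce function:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: called once per input record; emits (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // partitioned by key across reducers
        }
    }
}

// Reduce: receives all values for one key; emits the aggregate.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```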

  3. Background • Hadoop • Many steps within the Map stage and the Reduce stage • Different steps may consume different types of resources. [Figure: a Map task broken into Read, Map, Sort, Merge and Output steps]

  4. Motivation • Problems • Scheduling: no consideration of execution time or of the different types of resources consumed (e.g. two CPU-intensive jobs placed on the same node) • Parameter tuning: numerous parameters, and the default values are not optimal (jobs simply run with the default Hadoop configuration)

  5. Motivation • Solution: predict the performance of Hadoop jobs, addressing both scheduling (which ignores execution time and resource consumption) and parameter tuning (numerous parameters whose default values are not optimal)

  6. Related Work • Existing prediction method 1: black-box based • Job features are fed into statistical / machine-learning models to predict execution time • Drawbacks: lacks analysis of Hadoop internals; job features are hard to choose

  7. Related Work • Existing prediction method 2: cost-model based • Execution time is computed from job features through per-stage cost functions: F(map) = f(read, map, sort, spill, merge, write); F(reduce) = f(read, write, merge, reduce, write) • Drawbacks: many concurrent processes make the stages hard to divide, so accuracy is difficult to ensure

  8. Related Work • A brief summary of existing prediction methods • Prediction is simplistic • Lack of analysis of the job itself (jar package + input data)

  9. Goal • Design a Hadoop MapReduce performance prediction system that, given a job, predicts: - the job's consumption of various types of resources (CPU, disk I/O, network) - the execution time of the Map phase and the Reduce phase • Outputs: Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time

  10. Design - 1 • Cost Model: the job is fed into a cost model that outputs the Map execution time, Reduce execution time, CPU occupation time, disk occupation time and network occupation time

  11. Cost Model [1] • Analysis of the Map task - Model the consumption of each resource (CPU, disk, network) - Each stage involves only one type of resource • Stages: initiation, read data, map function, sort in memory, merge sort, serialization, object creation, disk read/write, network transfer. [Figure: Map-task stages mapped onto CPU, disk and network timelines] [1] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for Hadoop MapReduce,” in CLUSTER Workshops, 2012, pp. 231–239.

  12. Cost Model [1] • Analysis of the cost-function parameters • Type one: constants - Hadoop system consumption, initialization consumption • Type two: job-related parameters - computational complexity of the map function, number of map input records • Type three: parameters defined by the cost model - sorting coefficient, complexity factor [1] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for Hadoop MapReduce,” in CLUSTER Workshops, 2012, pp. 231–239.

  13. Parameters Collection • Type one and type three • Type one: run empty map tasks and compute the system consumption from the logs • Type three: extract the sort code from the Hadoop source and sort a fixed number of records • Type two • Naive approach: run the whole job and analyze its logs - high latency, large overhead • Our approach (the Job Analyzer): sample the input data and analyze only the behavior of the map and reduce functions - almost no latency, very low extra overhead

  14. Job Analyzer - Implementation • Hadoop virtual execution environment - accepts the job jar file & input data • Sampling Module - samples the input data at a fixed rate (less than 5%) • MR Module - instantiates the user job's classes using Java reflection • Analyze Module - measures the input data (amount & number of records), the relative computational complexity, and the data conversion rate (output/input) • Output: the job features
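A minimal sketch of how the MR Module might instantiate the user's map class by reflection; the class and method names below are hypothetical, since the transcript only states that Java reflection is used:

```java
import java.net.URL;
import java.net.URLClassLoader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.ReflectionUtils;

public class MRModule {
    // Loads the user's jar and instantiates its Mapper so that sampled
    // records can be run through the real map function.
    @SuppressWarnings("unchecked")
    static Mapper<?, ?, ?, ?> loadUserMapper(String jarPath, String mapperClassName,
                                             Configuration conf) throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[] { new URL("file:" + jarPath) },
                MRModule.class.getClassLoader());
        Class<? extends Mapper> clazz =
                (Class<? extends Mapper>) Class.forName(mapperClassName, true, loader);
        // ReflectionUtils also injects the job Configuration where applicable.
        return ReflectionUtils.newInstance(clazz, conf);
    }
}
```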

  15. Job Analyzer - Feasibility • Data similarity: log records have a uniform format • Execution similarity: every record is processed repeatedly by the same map and reduce functions, so a small sample is representative

  16. Design - 2 • Parameters Collection • Job Analyzer: collects type-two parameters • Static Parameters Collection Module: collects type-one & type-three parameters • Both feed the cost model, which outputs the Map execution time, Reduce execution time, CPU occupation time, disk occupation time and network occupation time

  17. Prediction Model • Problem analysis - many steps run concurrently, so the total time cannot be obtained by adding up the time of each stage. [Figure: overlapping CPU, disk and network stages of a Map task]

  18. Prediction Model • Main factors (according to the performance model) - Map stage: Tmap = α0 + α1 * MapInput + α2 * N + α3 * N * log(N) + α4 * (complexity of the map function) + α5 * (conversion rate of the map data), where MapInput is the amount of input data and N is the number of input records
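A direct sketch of this prediction function once the coefficients α0..α5 have been fitted; the class name and any coefficient values are placeholders, not from the paper:

```java
public class MapCostModel {
    // Fitted regression coefficients alpha0..alpha5 (placeholders).
    private final double[] a;

    public MapCostModel(double[] alpha) { this.a = alpha; }

    // Tmap = a0 + a1*mapInput + a2*n + a3*n*log(n) + a4*complexity + a5*convRate
    public double predictMapTime(double mapInputBytes, double n,
                                 double mapComplexity, double mapConvRate) {
        return a[0] + a[1] * mapInputBytes + a[2] * n
                + a[3] * n * Math.log(n)
                + a[4] * mapComplexity + a[5] * mapConvRate;
    }
}
```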

  19. Prediction Model • Experimental analysis • Run 4 kinds of jobs (0-10,000 records) • Extract the features above for a linear regression • Compute the correlation coefficient (R²)
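A sketch of that regression step; the use of Apache Commons Math is an assumption (the slides name no library), and the sample points are made up for illustration:

```java
import org.apache.commons.math3.stat.regression.SimpleRegression;

public class FitCheck {
    public static void main(String[] args) {
        // (number of records, measured Map time in seconds) -- made-up sample points
        double[][] samples = { {1000, 2.1}, {2000, 3.9}, {4000, 7.8}, {8000, 15.5} };

        SimpleRegression reg = new SimpleRegression(); // fits y = intercept + slope*x
        for (double[] s : samples) reg.addData(s[0], s[1]);

        System.out.printf("slope=%.4f intercept=%.4f R2=%.4f%n",
                reg.getSlope(), reg.getIntercept(), reg.getRSquare());
    }
}
```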

  20. Prediction Model • Very good linear relationship within the same kind of job • But no linear relationship among different kinds of jobs. [Figure: Map execution time vs. number of records for the 4 job kinds]

  21. Find the nearest jobs! • Instance-based linear regression • Find the samples nearest to the job to be predicted in the history logs • "Nearest" -> similar jobs (top K nearest, with K = 10%-15% of the samples) • Do a linear regression over the samples found, then compute the predicted value (a sketch follows below) • Nearness: a weighted distance over the job features (weights w) • High contribution to job classification: map/reduce complexity, map/reduce data conversion rate • Low contribution to job classification: data amount, number of records
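A minimal sketch of this instance-based step, assuming a history log of jobs described by feature vectors with measured Map times; the feature weights, the value of K, and the use of Apache Commons Math for the multivariate fit are all assumptions for illustration:

```java
import java.util.Arrays;
import java.util.Comparator;
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

public class NearestJobsPredictor {

    // Weighted Euclidean distance between two job-feature vectors.
    static double distance(double[] a, double[] b, double[] w) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += w[i] * (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(d);
    }

    // history[i] = feature vector of job i; times[i] = its measured Map time.
    static double predict(double[][] history, double[] times,
                          double[] query, double[] w, double kRatio) {
        // K nearest, e.g. 10%-15% of the samples; the OLS fit needs
        // more samples than features, hence the lower bound.
        int k = Math.max(query.length + 2, (int) Math.round(history.length * kRatio));
        Integer[] idx = new Integer[history.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(history[i], query, w)));

        double[][] x = new double[k][];
        double[] y = new double[k];
        for (int j = 0; j < k; j++) {
            x[j] = history[idx[j]];
            y[j] = times[idx[j]];
        }
        // Fit T = b0 + b1*f1 + ... over the K nearest jobs, then evaluate at the query.
        OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        ols.newSampleData(y, x);
        double[] beta = ols.estimateRegressionParameters(); // beta[0] is the intercept
        double t = beta[0];
        for (int i = 0; i < query.length; i++) t += beta[i + 1] * query[i];
        return t;
    }
}
```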

  22. Prediction Module • Procedure: the job features and the cost model's main factors are combined; the nearest samples are retrieved from the history; a prediction function of the form Tmap = α0 + α1 * MapInput + α2 * N + α3 * N * log(N) + α4 * (complexity of the map function) + α5 * (conversion rate of the map data) is fitted and evaluated to produce the prediction results. [Figure: numbered data flow through the prediction module]

  23. Prediction Module • Procedure: the cost model and the training set feed the find-neighbor module, which produces the prediction function and, from it, the prediction results

  24. Design - 3 • Parameters Collection • Job Analyzer: collects type-two parameters • Static Parameters Collection Module: collects type-one & type-three parameters • The cost model plus the prediction module output: Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time

  25. Experiments • Task execution time (error rate), for 4 kinds of jobs, input sizes 64 MB to 8 GB • K=12%, with a different weight w per feature • K=12%, with the same weight w for every feature • K=25%, with a different weight w per feature. [Figure: error rate per job ID for the three settings]

  26. Conclusion • Job Analyzer: • Analyzes the job jar + input file • Collects parameters • Prediction Module: • Finds the main factors • Proposes a linear prediction equation • Classifies jobs • Predicts multiple metrics

  27. Thank you! Questions?

  28. Cost Model [1] • Analysis of the Reduce task - Model the consumption of each resource (CPU, disk, network) - Each stage involves only one type of resource • Stages: object creation, read data, merge sort, initiation, reduce function, deserialization, serialization, disk read/write, network transfer. [Figure: Reduce-task stages mapped onto CPU, disk and network timelines]

  29. Prediction Model • Main factors (according to the performance model) - Reduce stage: Treduce = β0 + β1 * MapInput + β2 * N + β3 * N * log(N) + β4 * (complexity of the reduce function) + β5 * (conversion rate of the map data) + β6 * (conversion rate of the reduce data), where MapInput is the amount of input data and N is the number of input records
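The Reduce-stage function can be evaluated the same way as the Map-stage sketch earlier; again, the class name and coefficients are placeholders:

```java
public class ReduceCostModel {
    private final double[] b; // fitted beta0..beta6 (placeholders)

    public ReduceCostModel(double[] beta) { this.b = beta; }

    // Treduce = b0 + b1*mapInput + b2*n + b3*n*log(n)
    //         + b4*reduceComplexity + b5*mapConvRate + b6*reduceConvRate
    public double predictReduceTime(double mapInputBytes, double n,
                                    double reduceComplexity,
                                    double mapConvRate, double reduceConvRate) {
        return b[0] + b[1] * mapInputBytes + b[2] * n
                + b[3] * n * Math.log(n)
                + b[4] * reduceComplexity
                + b[5] * mapConvRate + b[6] * reduceConvRate;
    }
}
```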
