Spring 2015 (2015-1 semester), Advanced Topics in Operating Systems: Paper Review
Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
Herodotos Herodotou and Shivnath Babu
72150272 Hong Min-ha (홍민하)
INDEX
01. Introduction
02. Profiler
03. What-if Engine
04. Cost-based Optimizer (CBO)
05. Experimental Evaluation
01. Introduction
- MapReduce & Hadoop
- MapReduce job
- Configuration parameter settings
- Cost-based Optimization to Select Configuration Parameter Settings Automatically
Introduction: MapReduce & Hadoop
MapReduce is a relatively young framework for large-scale data processing. It is both a programming model and an associated run-time system.
Hadoop is a popular open-source implementation of MapReduce. It is used for applications such as Web indexing, data mining, machine learning, and financial analysis.
Introduction: MapReduce job
A MapReduce job j = <p, d, r, c> consists of:
- p: a MapReduce program
- d: input data
- r: cluster resources
- c: configuration parameter settings
Introduction: Configuration parameter settings
- The number of map tasks in job j.
- The number of reduce tasks in j.
- The amount of memory to allocate to each map and reduce task.
- The settings for multiphase external sorting.
- Whether the output data of map and reduce tasks should be compressed.
- Whether a Combiner function should be used.
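To make the list above concrete, here is a minimal sketch of a job j = <p, d, r, c> with its settings c held in a dictionary. The `Job` class and the `use_combiner` key are hypothetical and exist only for illustration. The dotted keys are Hadoop 0.20-era property names, which the slide itself does not list, and the values are arbitrary examples, not recommendations.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical container for a job j = <p, d, r, c>."""
    program: str                                 # p: the MapReduce program
    input_data: str                              # d: the input data
    resources: str                               # r: the cluster resources
    config: dict = field(default_factory=dict)   # c: parameter settings


# Example settings c (values are illustrative only).
config = {
    "mapred.reduce.tasks": 27,              # number of reduce tasks
    "mapred.child.java.opts": "-Xmx300m",   # memory for each task JVM
    "io.sort.mb": 120,                      # map-side sort buffer, in MB
    "io.sort.factor": 10,                   # streams merged at once when sorting
    "mapred.compress.map.output": True,     # compress map output
    "mapred.output.compress": False,        # compress job output
    "use_combiner": True,                   # hypothetical flag, not a Hadoop property
}

job = Job("WordCount", "/data/input", "16-node cluster", config)
print(job.config["io.sort.mb"])
```

Even this short list spans integers, booleans, and JVM option strings, which hints at why choosing c by hand is difficult.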
Introduction: Configuration parameter settings
- Configuration parameter settings have a large impact on the performance of MapReduce jobs.
- The burden of choosing good settings falls on the user.
- Automating this process would be a critical and timely contribution.
Introduction: Cost-based Optimization to Select Configuration Parameter Settings Automatically
Job j's performance: perf = F(p, d, r, c)
- Profiler: responsible for collecting job profiles. A profile records dataflow (information regarding the number of bytes and key-value pairs processed) and cost estimates (resource usage and execution time).
- What-if Engine: the heart of the paper's approach to cost-based optimization.
- Cost-based Optimizer: enumerates subspaces of the configuration space, then searches within each enumerated subspace, in order to find a good configuration setting c.
02. Profiler
- Job Profiles
- The fields in a profile belong to one of four categories
- Using Profiles to Analyze Job Behavior
- Generating Profiles via Measurement
Profiler: Job Profiles
A job profile is a vector in which each field captures some unique aspect of dataflow or cost during job execution, at the task level or at the phase level within tasks.
Profiler: The fields in a profile belong to one of four categories
Dataflow
- Number of map tasks in the job
- Map input bytes
- Number of spills
- ...
Cost estimates
- Setup phase time in a task
- Cleanup phase time in a task
- Read phase time in the map task
- ...
Dataflow Statistics
- Width of input key-value pairs
- Number of records per reducer's group
- Map selectivity in terms of size
- ...
Cost Statistics
- I/O cost for reading from HDFS per byte
- Cost for network transfers per byte
- CPU cost for executing the Mapper per record
- ...
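The four categories can be pictured as four groups of fields, where the statistics are derived from the measured dataflow and cost fields. The sketch below is a simplified illustration: the `JobProfile` class, the field names, and the two formulas (output bytes divided by input bytes for size selectivity, read time divided by bytes read for per-byte read cost) are assumptions made for this example, not the paper's exact definitions.

```python
from dataclasses import dataclass, field


@dataclass
class JobProfile:
    """Hypothetical profile with one dictionary per field category."""
    dataflow: dict = field(default_factory=dict)        # measured counts and bytes
    costs: dict = field(default_factory=dict)           # measured phase timings
    dataflow_stats: dict = field(default_factory=dict)  # derived from dataflow
    cost_stats: dict = field(default_factory=dict)      # derived from costs


def derive_statistics(profile):
    """Fill in two example statistics from the measured fields."""
    d, c = profile.dataflow, profile.costs
    # Map selectivity in terms of size: bytes out per byte in.
    profile.dataflow_stats["map_size_selectivity"] = (
        d["map_output_bytes"] / d["map_input_bytes"]
    )
    # Per-byte read cost: read phase time spread over the bytes read.
    profile.cost_stats["read_ms_per_byte"] = (
        c["read_phase_ms"] / d["map_input_bytes"]
    )
    return profile


profile = derive_statistics(JobProfile(
    dataflow={"map_input_bytes": 1000, "map_output_bytes": 250},
    costs={"read_phase_ms": 50},
))
print(profile.dataflow_stats, profile.cost_stats)
```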
Profiler: Using Profiles to Analyze Job Behavior
[Figure not recovered; surviving label: Memory Buffer]
Profiler: Generating Profiles via Measurement
Job profiles are generated in two distinct ways:
- By the Profiler, from scratch, by collecting monitoring data during full or partial job execution.
- By the What-if Engine, from existing profiles, using estimation techniques based on modeling and simulation of MapReduce job execution.
Generating a profile via measurement involves:
- Monitoring through dynamic instrumentation
- From raw monitoring data to profile fields
- Task-level sampling to generate approximate profiles
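Task-level sampling can be sketched as follows: instrument only a random fraction of the tasks and average their measurements into an approximate profile. This is a minimal illustration under stated assumptions. The function names, the field names, and the plain averaging are made up for the example and are not the Profiler's actual procedure.

```python
import random
from statistics import mean


def sample_tasks(task_ids, fraction, seed=0):
    """Pick a random subset of tasks to profile (at least one task)."""
    k = max(1, round(len(task_ids) * fraction))
    return sorted(random.Random(seed).sample(task_ids, k))


def approximate_profile(measurements, sampled_ids):
    """Average every per-task field over the sampled tasks only."""
    fields = measurements[sampled_ids[0]].keys()
    return {f: mean(measurements[t][f] for t in sampled_ids) for f in fields}


# Made-up raw monitoring data for 20 map tasks.
measurements = {
    t: {"map_input_bytes": 64 + t, "read_phase_ms": 10.0 * (t + 1)}
    for t in range(20)
}

sampled = sample_tasks(list(range(20)), fraction=0.1)
approx = approximate_profile(measurements, sampled)
print(sampled, approx)
```

Profiling 10% of the tasks lowers the monitoring overhead, at the price of a profile that is only approximate.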
03. What-if Engine
- What is the What-if Engine?
- Estimating the Virtual Profile
What-if Engine: What is the What-if Engine?
What-if question: given the profile of a job j = <p, d1, r1, c1>, how will the hypothetical job j' = <p, d2, r2, c2> perform?
The What-if Engine executes the following two steps to answer a what-if question:
- Estimate a virtual job profile for the hypothetical job j'.
- Use the virtual profile to simulate how j' will execute, that is, simulate the scheduling and execution of the map and reduce tasks in j'.
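The two steps can be illustrated with a deliberately small toy model. It covers map tasks only, assumes that per-byte costs carry over unchanged to the hypothetical job, and assumes that tasks run in equal-length waves over the available map slots. All function and field names are hypothetical, and the real engine models far more detail.

```python
import math


def estimate_virtual_profile(profile, d2_bytes, split_bytes):
    """Step 1 (toy version): derive a profile for the hypothetical job j'."""
    return {
        # Dataflow: one map task per input split of the new data.
        "num_map_tasks": math.ceil(d2_bytes / split_bytes),
        "bytes_per_map": min(split_bytes, d2_bytes),
        # Cost statistic: assumed to be unchanged from the measured profile.
        "map_ms_per_byte": profile["map_ms_per_byte"],
    }


def simulate(virtual, map_slots):
    """Step 2 (toy version): run the map tasks in waves over the map slots."""
    task_ms = virtual["bytes_per_map"] * virtual["map_ms_per_byte"]
    waves = math.ceil(virtual["num_map_tasks"] / map_slots)
    return waves * task_ms


measured = {"map_ms_per_byte": 2.0}  # made-up measured cost statistic
virtual = estimate_virtual_profile(measured, d2_bytes=1000, split_bytes=100)
estimated_ms = simulate(virtual, map_slots=4)
print(virtual["num_map_tasks"], estimated_ms)
```

Here 10 map tasks on 4 slots need 3 waves of 200 ms each, so the toy estimate is 600 ms. The answer comes without running j' at all, which is what makes what-if calls cheap enough for an optimizer to issue many of them.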
What-if Engine: Estimating the Virtual Profile
- Estimating Dataflow and Cost fields
- Estimating Dataflow Statistics fields
- Estimating Cost Statistics fields
04. Cost-based Optimizer (CBO)
- Estimating Dataflow Statistics fields
- Subspace Enumeration
- Search Strategy within a Subspace
Cost-based Optimizer (CBO): Estimating Dataflow Statistics fields
MapReduce program optimization can be defined as: given a MapReduce program p to be run on input data d and cluster resources r, find the configuration parameter settings c_opt = argmin over c in S of F(p, d, r, c), where S is the space of configuration parameter settings.
The CBO addresses this problem by making what-if calls with settings c of the configuration parameters selected through an enumeration and search over S.
The What-if Engine needs a job profile as input.
- Case 1: a profile is already available.
- Case 2: no profile is available. Either forgo the CBO for the current job execution, or use just-in-time mode to generate a job profile.
Cost-based Optimizer (CBO): Subspace Enumeration
- io.sort.mb: only affects the Spill phase in map tasks.
- mapred.job.shuffle.merge.percent: only affects the Shuffle phase in reduce tasks.
Parameters that affect different phases can be optimized independently, so the configuration space is divided into lower-dimensional subspaces.
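A minimal sketch of the idea: group the parameters by the task type they affect, minimise each group separately, and merge the winners. Several things here are assumptions for illustration: the candidate values, the made-up `toy_cost` function standing in for a what-if call, and the exhaustive grid used inside each subspace, which is a simple substitute for the search strategy the CBO actually uses.

```python
from itertools import product

# parameter -> (task type it affects, illustrative candidate values)
PARAMS = {
    "io.sort.mb": ("map", [50, 100, 200]),
    "io.sort.spill.percent": ("map", [0.6, 0.8]),
    "mapred.job.shuffle.merge.percent": ("reduce", [0.5, 0.66, 0.8]),
}


def enumerate_subspaces(params):
    """Group parameters by the task type they affect."""
    subspaces = {}
    for name, (side, values) in params.items():
        subspaces.setdefault(side, {})[name] = values
    return subspaces


def grid(subspace):
    """Yield every setting (as a dict) inside one subspace."""
    names = list(subspace)
    for combo in product(*(subspace[n] for n in names)):
        yield dict(zip(names, combo))


def optimize(params, cost_of):
    """Minimise each subspace separately and merge the winners."""
    best = {}
    for side, subspace in enumerate_subspaces(params).items():
        best.update(min(grid(subspace), key=lambda s: cost_of(side, s)))
    return best


def toy_cost(side, setting):
    """Made-up stand-in for a what-if call; lower is better."""
    if side == "map":
        return (abs(setting["io.sort.mb"] - 100)
                + 100 * abs(setting["io.sort.spill.percent"] - 0.8))
    return abs(setting["mapred.job.shuffle.merge.percent"] - 0.66)


best = optimize(PARAMS, toy_cost)
print(best)
```

With these candidates the two subspaces need 3 × 2 + 3 = 9 cost evaluations, whereas searching the joint space would need 3 × 2 × 3 = 18. The gap widens quickly as more parameters are added.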
Cost-based Optimizer (CBO): Search Strategy within a Subspace
The CBO searches within each enumerated subspace to find the optimal configuration in that subspace.
Recursive Random Search (RRS):
- Samples the subspace randomly to identify promising regions that contain the optimal setting with high probability.
- Samples recursively in these regions, which either move or shrink toward locally optimal settings.
- Restarts random sampling to find a more promising region and repeats the recursive search.
Properties of RRS:
- It provides probabilistic guarantees on how close the setting it finds is to the optimal setting.
- It is fairly robust to deviations of estimated costs from actual performance.
- It scales to a large number of dimensions.
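The three RRS steps above can be sketched in a few lines. This is a simplified variant written for illustration, not the algorithm as tuned in the paper: the sample count, the shrink factor, the stopping width, the step limit, and the quadratic `toy_cost` function (a stand-in for what-if calls) are all assumptions.

```python
import random


def recursive_random_search(cost, bounds, samples=20, shrink=0.5,
                            min_width=1e-3, restarts=5, max_steps=500, seed=0):
    """Simplified RRS over a box given as a list of (low, high) pairs."""
    rng = random.Random(seed)

    def best_of(box):
        # Draw `samples` random points in the box and return the cheapest.
        points = [[rng.uniform(lo, hi) for lo, hi in box] for _ in range(samples)]
        costs = [cost(p) for p in points]
        i = costs.index(min(costs))
        return points[i], costs[i]

    def box_around(center, widths):
        # Box centred on `center`, clipped to the global bounds.
        return [(max(lo, c - w / 2), min(hi, c + w / 2))
                for c, w, (lo, hi) in zip(center, widths, bounds)]

    best_x, best_cost = None, float("inf")
    for _ in range(restarts):
        # Exploration: random sampling over the whole subspace.
        x, fx = best_of(bounds)
        widths = [(hi - lo) * shrink for lo, hi in bounds]
        # Exploitation: recursive sampling in a region around the best point.
        for _ in range(max_steps):
            if max(widths) <= min_width:
                break
            y, fy = best_of(box_around(x, widths))
            if fy < fx:
                x, fx = y, fy                          # move the region
            else:
                widths = [w * shrink for w in widths]  # shrink the region
        if fx < best_cost:
            best_x, best_cost = x, fx
    return best_x, best_cost


def toy_cost(point):
    """Made-up cost surface with its minimum at (30, 0.7)."""
    x, y = point
    return ((x - 30) / 200) ** 2 + (y - 0.7) ** 2


best_x, best_cost = recursive_random_search(toy_cost, [(0, 200), (0, 1)])
print(best_x, best_cost)
```

Each call to `cost` corresponds to one what-if call, so the sample count and the number of restarts directly control how much time the optimizer spends.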
05. Experimental Evaluation
- Rule-based vs. Cost-based Optimization
- Accuracy of What-if Analysis
- Approximate Profiles through Sampling
- Efficiency and Effectiveness of CBO
Experimental Evaluation: Rule-based vs. Cost-based Optimization
[Figure not recovered; surviving annotations: X2, X2]
Experimental Evaluation: Accuracy of What-if Analysis
[Figure not recovered]
Experimental Evaluation: Efficiency and Effectiveness of CBO
[Figure not recovered]
Conclusion
The machine is better than the human at choosing configuration parameter settings.