Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs Herodotos Herodotou Shivnath Babu Duke University
Abstract • MapReduce has emerged as a viable competitor to database systems in big data analytics. • MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. • We introduce the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs.
Outline • Introduction • Profiler • What-if engine • Cost-based optimizer • Experimental evaluation • Conclusion
Introduction • MapReduce job J • J = &lt;p, d, r, c&gt; • p: MapReduce program, expressed as map(k1, v1) → list(k2, v2) and reduce(k2, list(v2)) → list(k3, v3) • d: Input data • r: Cluster resources • c: Configuration parameter settings
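The four-tuple abstraction can be captured in a small data structure. A minimal sketch in Python; the field contents are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapReduceJob:
    """A job j = <p, d, r, c> as used throughout the talk."""
    program: str     # p: identifier of the MapReduce program
    data: dict       # d: input data properties (e.g., size in bytes)
    resources: dict  # r: cluster resources (e.g., node count, slots)
    config: dict     # c: configuration parameter settings

j = MapReduceJob(
    program="wordcount",
    data={"input_bytes": 10 * 2**30},
    resources={"nodes": 16, "map_slots_per_node": 2},
    config={"mapred.reduce.tasks": 32},
)
```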
Introduction • Phases of Map Task Execution • Read, Map, Collect, Spill, Merge • Phases of Reduce Task Execution • Shuffle, Merge, Reduce, Write
Introduction job j = < program p, data d, resources r, configuration c > • Space of configuration choices: • Number of map tasks • Number of reduce tasks • Partitioning of map outputs to reduce tasks • Memory allocation to task-level buffers • Multiphase external sorting in the tasks • Whether output data from tasks should be compressed • Whether combine function should be used
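The configuration choices above correspond to concrete Hadoop parameters. A hedged sketch of the resulting space S, using a few Hadoop 1.x parameter names and illustrative (not exhaustive) value domains:

```python
from itertools import product

# Illustrative slice of the configuration space S (Hadoop 1.x names;
# the value domains here are examples, not the full ranges).
config_space = {
    "mapred.reduce.tasks":        range(1, 101),         # number of reduce tasks
    "io.sort.mb":                 range(50, 501, 50),    # map-side sort buffer (MB)
    "io.sort.spill.percent":      [0.6, 0.7, 0.8, 0.9],  # fill ratio that triggers a spill
    "io.sort.factor":             [10, 50, 100],         # streams merged at once
    "mapred.compress.map.output": [False, True],         # compress map output?
    "mapred.job.shuffle.input.buffer.percent": [0.5, 0.7],  # shuffle buffer
}

# S is the cross product of the per-parameter domains.
space_size = 1
for domain in config_space.values():
    space_size *= len(list(domain))
```

Even this small slice yields tens of thousands of settings, which is why exhaustive search is impractical and defaults or rules of thumb are commonly used instead.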
Introduction • In practice, these parameters are left at their defaults or set manually using rules of thumb
Introduction • Cost-based Optimization to Select Configuration Parameter Settings Automatically • perf = F(p, d, r, c) • perf is some performance metric of interest for jobs • Optimizing the performance of program p for given input data d and cluster resources r requires finding configuration parameter settings that give near-optimal values of perf.
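The formulation perf = F(p, d, r, c) turns tuning into an optimization problem. A toy sketch with a hypothetical cost function standing in for F (the real F is the What-if Engine, not this formula):

```python
def F(p, d, r, c):
    """Toy stand-in for the cost model: estimated running time falls as
    reducers are added, until per-task startup overhead dominates."""
    reducers = c["reducers"]
    work = d["input_bytes"] / (reducers * r["node_mb_per_sec"] * 2**20)
    overhead = 5.0 * reducers  # assumed per-task startup cost (seconds)
    return work + overhead

d = {"input_bytes": 100 * 2**30}   # 100 GB input
r = {"node_mb_per_sec": 50}        # assumed per-reducer throughput
space = [{"reducers": n} for n in range(1, 65)]

# Find the setting with near-optimal perf by minimizing F over the space.
c_opt = min(space, key=lambda c: F("wordcount", d, r, c))
```

Here brute force works because the space is one-dimensional and tiny; the point of the CBO (later slides) is to search the real, high-dimensional S without enumerating it.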
Applying Cost-based Optimization • Just-in-Time Optimizer: searches through the space S of parameter settings • What-if Engine: estimates perf using properties of p, d, r, and c
Job Profile [Figure: map-task pipeline — split → DFS Read → Map → Collect (Serialize, Partition into Memory Buffer) → Spill (Sort, [Combine], [Compress]) → Merge] • Concise representation of program execution as a job profile • Records information at the level of "task phases" • Generated by the Profiler through measurement or by the What-if Engine through estimation
Generating Profiles by Measurement • Dynamic instrumentation • Monitors task phases of MapReduce job execution • Event-condition-action rules are specified, leading to run-time instrumentation of Hadoop internals • We currently use BTrace (Hadoop internals are in Java)
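BTrace attaches probes to Hadoop's Java internals at run time. As a rough Python analogue (illustrative only, not the paper's tooling), an event-condition-action rule can be modeled as a decorator: the event is entry/exit of a task-phase function, the condition is a predicate on its arguments, and the action records timing and dataflow:

```python
import time
from functools import wraps

profile_log = []  # collected (phase, seconds, output records) tuples

def probe(phase, condition=lambda *a, **k: True):
    """Event-condition-action rule: on entry/exit of the wrapped phase
    function, if the condition holds, record timing and output size."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if not condition(*args, **kwargs):
                return fn(*args, **kwargs)
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            profile_log.append((phase, time.perf_counter() - start, len(out)))
            return out
        return wrapper
    return decorate

@probe("MAP")
def run_map(records):
    # Toy map phase: word-count style (word, 1) pairs.
    return [(w, 1) for rec in records for w in rec.split()]

run_map(["a b", "c"])
```

The instrumented program is unchanged except for the probe, which mirrors how dynamic instrumentation keeps profiling overhead out of the program's own code.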
Profiler • Using Profiles to Analyze Job Behavior
What-if Engine • A what-if question has the following form • Given the profile of a job j = &lt;p, d1, r1, c1&gt; that runs a MapReduce program p over input data d1 and cluster resources r1 using configuration c1, what will the performance of program p be if p is run over input data d2 and cluster resources r2 using configuration c2? That is, how will job j' = &lt;p, d2, r2, c2&gt; perform? • The What-if Engine executes the following two steps to answer a what-if question • Estimating a virtual job profile for the hypothetical job j' • Using the virtual profile to simulate how j' will execute • We will discuss these steps in turn
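The two steps above can be sketched as one function. This is a hedged toy version: step 1 here uses crude linear scaling where the real engine uses detailed white-box models, and step 2 uses a wave-at-a-time scheduler simulation:

```python
import math

def answer_what_if(profile, d2_bytes, c2, map_slots, reduce_slots):
    """Answer: how will j' = <p, d2, r2, c2> perform, given j's profile?"""
    # Step 1: estimate a virtual profile for the hypothetical job j'.
    scale = d2_bytes / profile["input_bytes"]
    map_tasks = max(1, round(profile["map_tasks"] * scale))
    reduce_tasks = c2["reduce_tasks"]
    secs_per_reduce = (profile["secs_per_reduce"] * scale
                       * profile["reduce_tasks"] / reduce_tasks)
    # Step 2: simulate j' by scheduling its tasks in waves over the slots.
    map_waves = math.ceil(map_tasks / map_slots)
    reduce_waves = math.ceil(reduce_tasks / reduce_slots)
    return map_waves * profile["secs_per_map"] + reduce_waves * secs_per_reduce

# Measured profile of j on 1 GB; ask about 2 GB with 8 reducers.
measured = {"input_bytes": 2**30, "map_tasks": 16, "secs_per_map": 30,
            "reduce_tasks": 4, "secs_per_reduce": 40}
est = answer_what_if(measured, d2_bytes=2 * 2**30, c2={"reduce_tasks": 8},
                     map_slots=16, reduce_slots=8)
```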
What-if Engine • Estimating Dataflow and Cost fields • Detailed set of analytical (white-box) models for estimating the Dataflow and Cost fields in the virtual job profile for j' • Estimating Dataflow Statistics fields • Dataflow proportionality assumption • Estimating Cost Statistics fields • Cluster node homogeneity assumption • Simulating the Job Execution • Task Scheduler Simulator
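The dataflow proportionality assumption says that ratios such as map output bytes per input byte stay fixed across inputs, so dataflow statistics scale with input size. A minimal sketch under that assumption (field names are illustrative):

```python
def scale_dataflow_stats(stats, d1_bytes, d2_bytes):
    """Dataflow proportionality: per-byte ratios observed on d1 are
    assumed to hold on d2, so counts scale linearly with input size."""
    ratio = d2_bytes / d1_bytes
    return {name: value * ratio for name, value in stats.items()}

measured = {"map_input_records": 1_000_000, "map_output_bytes": 80_000_000}
estimated = scale_dataflow_stats(measured,
                                 d1_bytes=10 * 2**20,   # profiled on 10 MB
                                 d2_bytes=40 * 2**20)   # asking about 40 MB
```

The cluster node homogeneity assumption plays the analogous role for Cost Statistics: per-byte costs measured on one node are assumed to hold on every node of the target cluster.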
Virtual Profile Estimation • Given profile for job j = <p, d1, r1, c1> • Estimate profile for job j' = <p, d2, r2, c2>
White-box Models • Detailed set of equations for Hadoop • Example: given input data properties, dataflow statistics, and configuration parameters, calculate the dataflow in each task phase of a map task
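A simplified sketch in the spirit of those equations: derive per-phase dataflow in one map task from input properties, a dataflow statistic (map selectivity), and configuration parameters. This ignores record overheads and compression, which the paper's full models account for:

```python
import math

def map_task_dataflow(input_bytes, map_selectivity, sort_mb, spill_percent,
                      sort_factor):
    """Simplified white-box model for one map task (illustrative)."""
    # Map/Collect: output size follows from input size and selectivity.
    map_output = input_bytes * map_selectivity
    # Spill: a spill is triggered each time the sort buffer fills to
    # spill_percent of sort_mb megabytes.
    spill_buffer = sort_mb * 2**20 * spill_percent
    num_spills = max(1, math.ceil(map_output / spill_buffer))
    # Merge: spill files are merged sort_factor at a time.
    merge_passes = 0 if num_spills == 1 else math.ceil(
        math.log(num_spills, sort_factor))
    return {"map_output_bytes": map_output,
            "num_spills": num_spills,
            "merge_passes": merge_passes}

flow = map_task_dataflow(input_bytes=128 * 2**20, map_selectivity=1.5,
                         sort_mb=100, spill_percent=0.8, sort_factor=10)
```

Changing a single configuration knob (say, io.sort.mb) changes num_spills and merge_passes, which is exactly the kind of dependency the What-if Engine must capture to cost a hypothetical configuration.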
Cost-based Optimizer (CBO) • MapReduce program optimization can be defined as: Given a MapReduce program p to be run on input data d and cluster resources r, find the setting of configuration parameters copt = argmin c∈S F(p, d, r, c) for the cost model F represented by the What-if Engine over the full space S of configuration parameter settings. • The CBO addresses this problem by making what-if calls with settings c of the configuration parameters selected through an enumeration and search over S. • Once a job profile to input to the What-if Engine is available, the CBO uses a two-step process, discussed next.
Cost-based Optimizer • Subspace Enumeration • More efficient search techniques can be developed if the individual parameters in c can be grouped into clusters. • Equation 2 states that the globally-optimal setting copt can be found using a divide-and-conquer approach by: breaking the higher-dimensional space S into the lower-dimensional subspaces S(i); considering an independent optimization problem in each smaller subspace; and composing the optimal parameter settings found per subspace to give the setting copt.
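The divide-and-conquer idea can be sketched directly: solve each lower-dimensional subspace S(i) independently and compose the per-subspace optima into copt. The toy cost below is deliberately separable, which is the condition under which this composition is valid:

```python
from itertools import product

def optimize_by_subspaces(subspaces, cost):
    """Optimize each subspace S(i) independently, then compose the
    per-subspace optima into c_opt (valid when parameters in different
    subspaces do not interact through the cost model)."""
    c_opt = {}
    for subspace in subspaces:
        names = list(subspace)
        best = min(product(*subspace.values()),
                   key=lambda vals: cost(dict(zip(names, vals))))
        c_opt.update(zip(names, best))
    return c_opt

def cost(c):
    """Toy separable cost: each subspace contributes an independent term."""
    t = 0.0
    if "reducers" in c:
        t += 2048 / c["reducers"] + 5 * c["reducers"]
    if "sort_mb" in c:
        t += abs(c["sort_mb"] - 200) / 10
    return t

subspaces = [{"reducers": range(1, 65)}, {"sort_mb": range(50, 501, 50)}]
c_opt = optimize_by_subspaces(subspaces, cost)
```

Each call to min here stands in for a what-if call per candidate setting; the payoff is that the number of candidates grows with the sum, not the product, of the subspace sizes.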
Cost-based Optimizer • Search Strategy within a Subspace • The CBO searches within each enumerated subspace to find the optimal configuration in that subspace, using Recursive Random Search (RRS) • RRS provides probabilistic guarantees on how close the setting it finds is to the optimal setting • RRS is fairly robust to deviations of estimated costs from actual performance • RRS scales to a large number of dimensions
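A hedged sketch of the RRS idea over one box-constrained subspace: sample uniformly, re-center a shrunken box on the best point found, and repeat. The parameter values (sample counts, shrink factor) are illustrative, not those of the RRS paper:

```python
import random

def recursive_random_search(cost, low, high, samples=50, rounds=6,
                            shrink=0.5, seed=0):
    """Sketch of RRS: global random sampling, then recursive local
    re-sampling in a box that shrinks around the incumbent best."""
    rng = random.Random(seed)
    lo, hi = list(low), list(high)
    best_x, best_f = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples):
            x = [rng.uniform(a, b) for a, b in zip(lo, hi)]
            f = cost(x)  # each evaluation stands in for one what-if call
            if f < best_f:
                best_x, best_f = x, f
        # Shrink each dimension of the box around the best point found,
        # clamping to the original bounds.
        half = [(b - a) * shrink / 2 for a, b in zip(lo, hi)]
        lo = [max(a, bx - h) for a, bx, h in zip(low, best_x, half)]
        hi = [min(b, bx + h) for b, bx, h in zip(high, best_x, half)]
    return best_x, best_f

# Toy subspace: minimize (x0 - 3)^2 + (x1 + 1)^2 over [-10, 10]^2.
best_x, best_f = recursive_random_search(
    lambda x: (x[0] - 3) ** 2 + (x[1] + 1) ** 2,
    low=[-10, -10], high=[10, 10])
```

Because every probe is just a call to the cost function, the same search runs unchanged whether the cost comes from this toy formula or from the What-if Engine, which is what makes RRS robust to estimation error: only the relative ordering of candidate settings matters.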