Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs Herodotos Herodotou Shivnath Babu Duke University
Abstract • MapReduce has emerged as a viable competitor to database systems in big data analytics. • MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. • We introduce the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs.
Outline • Introduction • Profiler • What-if engine • Cost-based optimizer • Experimental evaluation • Conclusion
Introduction • MapReduce job J • J = &lt;p, d, r, c&gt; • p: MapReduce program, expressed as map(k1, v1) → list(k2, v2) and reduce(k2, list(v2)) → list(k3, v3) • d: Input data • r: Cluster resources • c: Configuration parameter settings
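The four-tuple abstraction can be captured in a small data structure. A minimal sketch in Python; the field contents are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapReduceJob:
    """A job j = <p, d, r, c> as used throughout the talk."""
    program: str     # p: identifier of the MapReduce program
    data: dict       # d: input data properties (e.g., size in bytes)
    resources: dict  # r: cluster resources (e.g., node count, slots)
    config: dict     # c: configuration parameter settings

j = MapReduceJob(
    program="wordcount",
    data={"input_bytes": 10 * 2**30},
    resources={"nodes": 16, "map_slots_per_node": 2},
    config={"mapred.reduce.tasks": 32},
)
```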
Introduction • Phases of Map Task Execution • Read, Map, Collect, Spill, Merge • Phases of Reduce Task Execution • Shuffle, Merge, Reduce, Write
Introduction job j = < program p, data d, resources r, configuration c > • Space of configuration choices: • Number of map tasks • Number of reduce tasks • Partitioning of map outputs to reduce tasks • Memory allocation to task-level buffers • Multiphase external sorting in the tasks • Whether output data from tasks should be compressed • Whether combine function should be used
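The configuration choices above correspond to concrete Hadoop parameters. A hedged sketch of the resulting space S, using a few Hadoop 1.x parameter names and illustrative (not exhaustive) value domains:

```python
from itertools import product

# Illustrative slice of the configuration space S (Hadoop 1.x names;
# the value domains here are examples, not the full ranges).
config_space = {
    "mapred.reduce.tasks":        range(1, 101),         # number of reduce tasks
    "io.sort.mb":                 range(50, 501, 50),    # map-side sort buffer (MB)
    "io.sort.spill.percent":      [0.6, 0.7, 0.8, 0.9],  # fill ratio that triggers a spill
    "io.sort.factor":             [10, 50, 100],         # streams merged at once
    "mapred.compress.map.output": [False, True],         # compress map output?
    "mapred.job.shuffle.input.buffer.percent": [0.5, 0.7],  # shuffle buffer
}

# S is the cross product of the per-parameter domains.
space_size = 1
for domain in config_space.values():
    space_size *= len(list(domain))
```

Even this small slice yields tens of thousands of settings, which is why exhaustive search is impractical and defaults or rules of thumb are commonly used instead.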
Introduction • In practice, these parameters are left at their defaults or set manually using rules of thumb
Introduction • Cost-based Optimization to Select Configuration Parameter Settings Automatically • perf = F(p, d, r, c) • perf is some performance metric of interest for jobs • Optimizing the performance of program p for given input data d and cluster resources r requires finding configuration parameter settings that give near-optimal values of perf.
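The formulation perf = F(p, d, r, c) turns tuning into an optimization problem. A toy sketch with a hypothetical cost function standing in for F (the real F is the What-if Engine, not this formula):

```python
def F(p, d, r, c):
    """Toy stand-in for the cost model: estimated running time falls as
    reducers are added, until per-task startup overhead dominates."""
    reducers = c["reducers"]
    work = d["input_bytes"] / (reducers * r["node_mb_per_sec"] * 2**20)
    overhead = 5.0 * reducers  # assumed per-task startup cost (seconds)
    return work + overhead

d = {"input_bytes": 100 * 2**30}   # 100 GB input
r = {"node_mb_per_sec": 50}        # assumed per-reducer throughput
space = [{"reducers": n} for n in range(1, 65)]

# Find the setting with near-optimal perf by minimizing F over the space.
c_opt = min(space, key=lambda c: F("wordcount", d, r, c))
```

Here brute force works because the space is one-dimensional and tiny; the point of the CBO (later slides) is to search the real, high-dimensional S without enumerating it.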
Applying Cost-based Optimization • Just-in-Time Optimizer: searches through the space S of parameter settings • What-if Engine: estimates perf using properties of p, d, r, and c
Job Profile [Figure: map-task pipeline — split → DFS Read → Map → Collect (Serialize, Partition into Memory Buffer) → Spill (Sort, [Combine], [Compress]) → Merge] • Concise representation of program execution as a job profile • Records information at the level of "task phases" • Generated by the Profiler through measurement or by the What-if Engine through estimation
Generating Profiles by Measurement • Dynamic instrumentation • Monitors task phases of MapReduce job execution • Event-condition-action rules are specified, leading to run-time instrumentation of Hadoop internals • We currently use BTrace (Hadoop internals are in Java)
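BTrace attaches probes to Hadoop's Java internals at run time. As a rough Python analogue (illustrative only, not the paper's tooling), an event-condition-action rule can be modeled as a decorator: the event is entry/exit of a task-phase function, the condition is a predicate on its arguments, and the action records timing and dataflow:

```python
import time
from functools import wraps

profile_log = []  # collected (phase, seconds, output records) tuples

def probe(phase, condition=lambda *a, **k: True):
    """Event-condition-action rule: on entry/exit of the wrapped phase
    function, if the condition holds, record timing and output size."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if not condition(*args, **kwargs):
                return fn(*args, **kwargs)
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            profile_log.append((phase, time.perf_counter() - start, len(out)))
            return out
        return wrapper
    return decorate

@probe("MAP")
def run_map(records):
    # Toy map phase: word-count style (word, 1) pairs.
    return [(w, 1) for rec in records for w in rec.split()]

run_map(["a b", "c"])
```

The instrumented program is unchanged except for the probe, which mirrors how dynamic instrumentation keeps profiling overhead out of the program's own code.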
Profiler • Using Profiles to Analyze Job Behavior
What-if Engine • A what-if question has the following form • Given the profile of a job j = &lt;p, d1, r1, c1&gt; that runs a MapReduce program p over input data d1 and cluster resources r1 using configuration c1, what will the performance of program p be if p is run over input data d2 and cluster resources r2 using configuration c2? That is, how will job j' = &lt;p, d2, r2, c2&gt; perform? • The What-if Engine executes the following two steps to answer a what-if question • Estimating a virtual job profile for the hypothetical job j' • Using the virtual profile to simulate how j' will execute • We will discuss these steps in turn
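The two steps above can be sketched as one function. This is a hedged toy version: step 1 here uses crude linear scaling where the real engine uses detailed white-box models, and step 2 uses a wave-at-a-time scheduler simulation:

```python
import math

def answer_what_if(profile, d2_bytes, c2, map_slots, reduce_slots):
    """Answer: how will j' = <p, d2, r2, c2> perform, given j's profile?"""
    # Step 1: estimate a virtual profile for the hypothetical job j'.
    scale = d2_bytes / profile["input_bytes"]
    map_tasks = max(1, round(profile["map_tasks"] * scale))
    reduce_tasks = c2["reduce_tasks"]
    secs_per_reduce = (profile["secs_per_reduce"] * scale
                       * profile["reduce_tasks"] / reduce_tasks)
    # Step 2: simulate j' by scheduling its tasks in waves over the slots.
    map_waves = math.ceil(map_tasks / map_slots)
    reduce_waves = math.ceil(reduce_tasks / reduce_slots)
    return map_waves * profile["secs_per_map"] + reduce_waves * secs_per_reduce

# Measured profile of j on 1 GB; ask about 2 GB with 8 reducers.
measured = {"input_bytes": 2**30, "map_tasks": 16, "secs_per_map": 30,
            "reduce_tasks": 4, "secs_per_reduce": 40}
est = answer_what_if(measured, d2_bytes=2 * 2**30, c2={"reduce_tasks": 8},
                     map_slots=16, reduce_slots=8)
```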
What-if Engine • Estimating Dataflow and Cost fields • Detailed set of analytical (white-box) models for estimating the Dataflow and Cost fields in the virtual job profile for j' • Estimating Dataflow Statistics fields • Dataflow proportionality assumption • Estimating Cost Statistics fields • Cluster node homogeneity assumption • Simulating the Job Execution • Task Scheduler Simulator
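The dataflow proportionality assumption says that ratios such as map output bytes per input byte stay fixed across inputs, so dataflow statistics scale with input size. A minimal sketch under that assumption (field names are illustrative):

```python
def scale_dataflow_stats(stats, d1_bytes, d2_bytes):
    """Dataflow proportionality: per-byte ratios observed on d1 are
    assumed to hold on d2, so counts scale linearly with input size."""
    ratio = d2_bytes / d1_bytes
    return {name: value * ratio for name, value in stats.items()}

measured = {"map_input_records": 1_000_000, "map_output_bytes": 80_000_000}
estimated = scale_dataflow_stats(measured,
                                 d1_bytes=10 * 2**20,   # profiled on 10 MB
                                 d2_bytes=40 * 2**20)   # asking about 40 MB
```

The cluster node homogeneity assumption plays the analogous role for Cost Statistics: per-byte costs measured on one node are assumed to hold on every node of the target cluster.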
Virtual Profile Estimation • Given profile for job j = <p, d1, r1, c1> • Estimate profile for job j' = <p, d2, r2, c2>
White-box Models • Detailed set of equations for Hadoop • Example: given input data properties, dataflow statistics, and configuration parameters, calculate the dataflow in each task phase of a map task
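A simplified sketch in the spirit of those equations: derive per-phase dataflow in one map task from input properties, a dataflow statistic (map selectivity), and configuration parameters. This ignores record overheads and compression, which the paper's full models account for:

```python
import math

def map_task_dataflow(input_bytes, map_selectivity, sort_mb, spill_percent,
                      sort_factor):
    """Simplified white-box model for one map task (illustrative)."""
    # Map/Collect: output size follows from input size and selectivity.
    map_output = input_bytes * map_selectivity
    # Spill: a spill is triggered each time the sort buffer fills to
    # spill_percent of sort_mb megabytes.
    spill_buffer = sort_mb * 2**20 * spill_percent
    num_spills = max(1, math.ceil(map_output / spill_buffer))
    # Merge: spill files are merged sort_factor at a time.
    merge_passes = 0 if num_spills == 1 else math.ceil(
        math.log(num_spills, sort_factor))
    return {"map_output_bytes": map_output,
            "num_spills": num_spills,
            "merge_passes": merge_passes}

flow = map_task_dataflow(input_bytes=128 * 2**20, map_selectivity=1.5,
                         sort_mb=100, spill_percent=0.8, sort_factor=10)
```

Changing a single configuration knob (say, io.sort.mb) changes num_spills and merge_passes, which is exactly the kind of dependency the What-if Engine must capture to cost a hypothetical configuration.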
Cost-based Optimizer (CBO) • MapReduce program optimization can be defined as: Given a MapReduce program p to be run on input data d and cluster resources r, find the setting of configuration parameters copt = argmin c∈S F(p, d, r, c) for the cost model F represented by the What-if Engine over the full space S of configuration parameter settings. • The CBO addresses this problem by making what-if calls with settings c of the configuration parameters selected through an enumeration and search over S. • Once a job profile to input to the What-if Engine is available, the CBO uses a two-step process, discussed next.
Cost-based Optimizer • Subspace Enumeration • More efficient search techniques can be developed if the individual parameters in c can be grouped into clusters. • Equation 2 states that the globally-optimal setting copt can be found using a divide-and-conquer approach by: breaking the higher-dimensional space S into the lower-dimensional subspaces S(i); considering an independent optimization problem in each smaller subspace; and composing the optimal parameter settings found per subspace to give the setting copt.
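The divide-and-conquer idea can be sketched directly: solve each lower-dimensional subspace S(i) independently and compose the per-subspace optima into copt. The toy cost below is deliberately separable, which is the condition under which this composition is valid:

```python
from itertools import product

def optimize_by_subspaces(subspaces, cost):
    """Optimize each subspace S(i) independently, then compose the
    per-subspace optima into c_opt (valid when parameters in different
    subspaces do not interact through the cost model)."""
    c_opt = {}
    for subspace in subspaces:
        names = list(subspace)
        best = min(product(*subspace.values()),
                   key=lambda vals: cost(dict(zip(names, vals))))
        c_opt.update(zip(names, best))
    return c_opt

def cost(c):
    """Toy separable cost: each subspace contributes an independent term."""
    t = 0.0
    if "reducers" in c:
        t += 2048 / c["reducers"] + 5 * c["reducers"]
    if "sort_mb" in c:
        t += abs(c["sort_mb"] - 200) / 10
    return t

subspaces = [{"reducers": range(1, 65)}, {"sort_mb": range(50, 501, 50)}]
c_opt = optimize_by_subspaces(subspaces, cost)
```

Each call to min here stands in for a what-if call per candidate setting; the payoff is that the number of candidates grows with the sum, not the product, of the subspace sizes.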
Cost-based Optimizer • Search Strategy within a Subspace • The CBO searches within each enumerated subspace to find the optimal configuration in that subspace, using Recursive Random Search (RRS) • RRS provides probabilistic guarantees on how close the setting it finds is to the optimal setting • RRS is fairly robust to deviations of estimated costs from actual performance • RRS scales to a large number of dimensions
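A hedged sketch of the RRS idea over one box-constrained subspace: sample uniformly, re-center a shrunken box on the best point found, and repeat. The parameter values (sample counts, shrink factor) are illustrative, not those of the RRS paper:

```python
import random

def recursive_random_search(cost, low, high, samples=50, rounds=6,
                            shrink=0.5, seed=0):
    """Sketch of RRS: global random sampling, then recursive local
    re-sampling in a box that shrinks around the incumbent best."""
    rng = random.Random(seed)
    lo, hi = list(low), list(high)
    best_x, best_f = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples):
            x = [rng.uniform(a, b) for a, b in zip(lo, hi)]
            f = cost(x)  # each evaluation stands in for one what-if call
            if f < best_f:
                best_x, best_f = x, f
        # Shrink each dimension of the box around the best point found,
        # clamping to the original bounds.
        half = [(b - a) * shrink / 2 for a, b in zip(lo, hi)]
        lo = [max(a, bx - h) for a, bx, h in zip(low, best_x, half)]
        hi = [min(b, bx + h) for b, bx, h in zip(high, best_x, half)]
    return best_x, best_f

# Toy subspace: minimize (x0 - 3)^2 + (x1 + 1)^2 over [-10, 10]^2.
best_x, best_f = recursive_random_search(
    lambda x: (x[0] - 3) ** 2 + (x[1] + 1) ** 2,
    low=[-10, -10], high=[10, 10])
```

Because every probe is just a call to the cost function, the same search runs unchanged whether the cost comes from this toy formula or from the What-if Engine, which is what makes RRS robust to estimation error: only the relative ordering of candidate settings matters.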