1. Statistics-Driven Benchmark for MapReduce Yanpei Chen and Archana Ganapathi
2. [Context: cluster software-stack diagram showing Nexus across nodes NS1-NS3; SCADS; web apps (PIQL); X-Trace and Chukwa monitoring; Hadoop HDFS; Spark, SEJITS; KCCA-based M/R scheduling; log mining; MPI; batch/analytics]
5. We need a Hadoop workload generator Need realistic workload to evaluate Hadoop design choices
E.g. configuration, scheduling policies, resource provisioning
Current pseudo-benchmarks (gridmix/sort) unsatisfactory
Do not capture mix of jobs, data sizes, or inter-job arrival times
Cannot hog production cluster for decision-making
Hard to reproduce configurations in non-production environment
Anonymity concerns limit publication of traces for academia
6. Design Overview Agnostic to hardware, configuration, and MapReduce implementation
Accommodate different cluster sizes and workload durations
Mask raw computation done by jobs
Solution:
Collect sufficient/necessary statistics from real traces
Create synthetic workload that mimics real job stream
Replay with low overhead
7. 1. Get statistics from traces Compute CDF of job counts by job name
Collect 1st, 25th, 50th, 75th, and 99th percentiles for:
Inter-job arrival time
Per-job input data size
Per-job input:shuffle:output data ratios
Scale data sizes by the number of cluster nodes (a sketch of this extraction step follows this list)
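A minimal Python sketch of the extraction step, assuming the job history logs have already been parsed into (job_name, submit_time_s, input_bytes, shuffle_bytes, output_bytes) records; the record layout and field names here are illustrative, not the actual Hadoop job-history schema:

import numpy as np

PERCENTILES = [1, 25, 50, 75, 99]

def summarize_trace(jobs):
    # jobs: list of (job_name, submit_time_s, input_bytes, shuffle_bytes, output_bytes)
    jobs = sorted(jobs, key=lambda j: j[1])                 # order by submission time
    submit = np.array([j[1] for j in jobs], dtype=float)
    inp = np.array([j[2] for j in jobs], dtype=float)
    shuf = np.array([j[3] for j in jobs], dtype=float)
    out = np.array([j[4] for j in jobs], dtype=float)

    # CDF of job counts by job name, most frequent names first
    names, counts = np.unique([j[0] for j in jobs], return_counts=True)
    order = np.argsort(-counts)
    name_cdf = list(zip(names[order], np.cumsum(counts[order]) / len(jobs)))

    eps = 1.0                                               # avoid dividing by empty stages
    return {
        "name_cdf": name_cdf,
        "arrival_gap_s": np.percentile(np.diff(submit), PERCENTILES),
        "input_bytes": np.percentile(inp, PERCENTILES),
        "shuffle_input_ratio": np.percentile(shuf / np.maximum(inp, eps), PERCENTILES),
        "output_shuffle_ratio": np.percentile(out / np.maximum(shuf, eps), PERCENTILES),
    }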
8. 2. Sample statistics to get workload Linear interpolation between percentiles to approximate the full distribution
i.e. connect the dots between the percentile statistics
Technique suitable for most distributions
Probabilistic sampling from the approximated distribution generates tuples of
[jobName, arrivalTime, inputSize, shuffleInputRatio, outputShuffleRatio]
Repeat until we reach the required number of jobs or the target workload duration (see the sampling sketch below)
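A sketch of the sampling step under the same assumptions, using numpy's interp to connect the dots between the stored percentiles (an inverse-CDF draw); for brevity it labels jobs synthetically rather than drawing names from the job-name CDF:

import numpy as np

PCTL_POINTS = np.array([0.01, 0.25, 0.50, 0.75, 0.99])

def sample_from_percentiles(pctl_values, rng):
    # inverse-CDF sampling from the piecewise-linear CDF through the five stored percentiles
    u = rng.uniform(0.01, 0.99)
    return float(np.interp(u, PCTL_POINTS, pctl_values))

def generate_workload(stats, num_jobs, rng=None):
    rng = rng or np.random.default_rng()
    jobs, t = [], 0.0
    for i in range(num_jobs):
        t += sample_from_percentiles(stats["arrival_gap_s"], rng)
        jobs.append({
            "jobName": "synthetic_job_%d" % i,
            "arrivalTime": t,
            "inputSize": sample_from_percentiles(stats["input_bytes"], rng),
            "shuffleInputRatio": sample_from_percentiles(stats["shuffle_input_ratio"], rng),
            "outputShuffleRatio": sample_from_percentiles(stats["output_shuffle_ratio"], rng),
        })
    return jobs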
9. 3. Replay Shell Script
# pre-generate input data once in HDFS, at the maximum input size the workload needs
HDFS RandomWrite(max_input_size)
# for each synthetic job: wait out its inter-job gap, run it, then delete its output in the background
sleep interval[0]
RatioMapReduce inputFiles[0] output0 shuffleInputRatio[0] outputShuffleRatio[0]
HDFS -rmr output0 &
sleep interval[1]
RatioMapReduce inputFiles[1] output1 shuffleInputRatio[1] outputShuffleRatio[1]
HDFS -rmr output1 &
...
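A sketch of how the generated job list could be turned into a script like the one above; the RatioMapReduce invocation and HDFS paths mirror the slide's pseudocode and are placeholders, not the exact command line of a released tool:

def write_replay_script(jobs, input_files, path="replay.sh"):
    # jobs: output of generate_workload(); input_files[i]: pre-generated HDFS input for job i
    lines = ["#!/bin/bash",
             "# input data is pre-generated once, at the maximum input size the workload needs"]
    prev_arrival = 0.0
    for i, job in enumerate(jobs):
        gap = job["arrivalTime"] - prev_arrival
        prev_arrival = job["arrivalTime"]
        lines.append("sleep %d" % round(gap))
        lines.append("RatioMapReduce %s output%d %.3f %.3f" %
                     (input_files[i], i, job["shuffleInputRatio"], job["outputShuffleRatio"]))
        lines.append("hadoop fs -rmr output%d &" % i)       # delete output in the background
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")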
10. Pros and Cons + Same framework applicable across HW, config, SW, and cluster size
+ Summary distributions allow anonymized publication of traces
– Does not capture the compute part of production jobs
Inherent tradeoff: masking compute is what makes workloads with different computations comparable
Not a big loss – we will see later
11. We actually built this!!!
12. 6 months’ trace is representative in time
Extract statistics from 6 months of job history logs at Facebook
Compare their distributions against two single-week traces from the same cluster
Traces are representative in time if the statistical distributions are similar (one way to quantify this is sketched below)
Gridmix/sort hits only the extremes of statistical distributions
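One way to make "the distributions are similar" concrete is the largest vertical gap between two empirical CDFs (a Kolmogorov-Smirnov-style distance); this is an illustrative check, not necessarily the exact comparison used in the talk:

import numpy as np

def cdf_distance(sample_a, sample_b):
    # maximum absolute difference between the two empirical CDFs, evaluated at every observed point
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# e.g. cdf_distance(six_month_arrival_gaps, week1_arrival_gaps) close to 0 means the week is representative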
13-14. Inter-job arrival times [figure slides]
15-16. Data sizes (input, shuffle, output) [figure slides]
17. Workload useful for real problems
18-23. Findings are surprising [result figure slides]
24. Ignoring compute has low cost
25. Future Work Evaluate on more production datasets
Our traces are representative in time
Need more traces to evaluate representativeness in type of workloads
Performance prediction & scheduling based on generated workload
Extrapolation across cluster types and scale
Comparison between different types of workloads
Workload predictor/oracle?!?
26. Questions?