Statistics-Driven Benchmark for MapReduce

[Diagram: system stack with NEXUS, SCADS, web applications on PIQL, Xtrace and Chukwa (monitoring), Hadoop HDFS, SPARK, SEJITS, KCCA-based M/R scheduling, log mining, MPI, and batch/analytics.]

Presentation Transcript


    1. Statistics-Driven Benchmark for MapReduce
        Yanpei Chen and Archana Ganapathi

    5. We need a Hadoop workload generator
        - Need realistic workload to evaluate Hadoop design choices
            - E.g. configuration, scheduling policies, resource provisioning
        - Current pseudo-benchmarks (gridmix/sort) unsatisfactory
            - Do not capture mix of jobs, data sizes, or inter-job arrival times
        - Cannot hog production cluster for decision-making
        - Hard to reproduce configurations in non-production environment
        - Anonymity concerns limit publication of traces for academia

    6. Design Overview
        - Agnostic to hardware, configuration, and MapReduce implementation
        - Accommodate different cluster sizes and workload durations
        - Mask raw computation done by jobs
        - Solution:
            - Collect sufficient/necessary statistics from real traces
            - Create synthetic workload that mimics real job stream
            - Replay with low overhead

    7. Step 1: Get statistics from traces
        - Compute CDF of job counts by job name
        - Collect 1st, 25th, 50th, 75th, and 99th percentiles for:
            - Inter-job arrival time
            - Per-job input data size
            - Per-job input : shuffle : output data ratio
        - Scale data size by number of cluster nodes
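A minimal sketch of this statistics step, not the authors' implementation. The job record fields (`name`, `submit_time`, `input_bytes`, `shuffle_bytes`, `output_bytes`) are hypothetical, the percentile interpolation rule is an assumption, and dividing input size by node count is only one possible reading of "scale data size by number of cluster nodes".

```python
from collections import Counter

PERCENTILES = (1, 25, 50, 75, 99)  # the percentiles named on the slide


def percentile(sorted_values, p):
    """Linearly interpolated p-th percentile (0..100) of a non-empty sorted list."""
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def summarize(values):
    """Return {percentile: value}; an empty input yields an empty dict."""
    ordered = sorted(values)
    if not ordered:
        return {}
    return {p: percentile(ordered, p) for p in PERCENTILES}


def job_name_cdf(names):
    """Cumulative fraction of jobs per job name, as a list of (name, cumulative)."""
    counts = Counter(names)
    total = sum(counts.values())
    cdf, running = [], 0
    for name, count in sorted(counts.items()):
        running += count
        cdf.append((name, running / total))
    return cdf


def trace_statistics(jobs, num_nodes):
    """Summarize a trace given as a list of dicts (hypothetical field names)."""
    jobs = sorted(jobs, key=lambda j: j["submit_time"])
    gaps = [b["submit_time"] - a["submit_time"] for a, b in zip(jobs, jobs[1:])]
    return {
        "job_name_cdf": job_name_cdf(j["name"] for j in jobs),
        "inter_arrival": summarize(gaps),
        # Assumption: "scale by number of nodes" is read as bytes per node.
        "input_per_node": summarize(j["input_bytes"] / num_nodes for j in jobs),
        # Jobs with a zero denominator are skipped to avoid division by zero.
        "shuffle_input_ratio": summarize(
            j["shuffle_bytes"] / j["input_bytes"] for j in jobs if j["input_bytes"]
        ),
        "output_shuffle_ratio": summarize(
            j["output_bytes"] / j["shuffle_bytes"] for j in jobs if j["shuffle_bytes"]
        ),
    }
```

Only these summaries, not the raw job records, would need to leave the production cluster, which is what makes anonymized publication plausible.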

    8. Step 2: Sample statistics to get workload
        - Linear interpolation between percentiles to approximate full distribution
            - i.e. connect the dots between percentile stats
            - Technique suitable to most distributions
        - Probabilistic sampling from approximated distribution to generate [jobName, arrivalTime, inputSize, shuffleInputRatio, outputShuffleRatio]
        - Repeat until we have required number of jobs or duration of workload
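The sampling step can be sketched as inverse-CDF sampling over the piecewise-linear curve through the percentile points. This is an illustration under stated assumptions, not the authors' code: the layout of the `stats` dictionary is hypothetical, and values outside the lowest and highest recorded percentiles are never produced because the slide does not say how the tails are handled.

```python
import bisect
import random


def sample_from_percentiles(points, rng):
    """Draw one value from {percentile: value} by connecting the dots."""
    items = sorted(points.items())
    ps = [p for p, _ in items]
    vs = [v for _, v in items]
    # Assumption: sample only between the lowest and highest known percentile.
    u = rng.uniform(ps[0], ps[-1])
    i = bisect.bisect_right(ps, u)
    if i >= len(ps):
        return float(vs[-1])
    p0, p1 = ps[i - 1], ps[i]
    v0, v1 = vs[i - 1], vs[i]
    return v0 + (v1 - v0) * (u - p0) / (p1 - p0)


def sample_job_name(cdf, rng):
    """cdf is a list of (name, cumulative_fraction) ending at 1.0."""
    cums = [c for _, c in cdf]
    i = min(bisect.bisect_right(cums, rng.random()), len(cdf) - 1)
    return cdf[i][0]


def generate_workload(stats, num_jobs=None, duration=None, seed=0):
    """Sample jobs until num_jobs or duration is reached, whichever is set.

    Assumes inter-arrival gaps are positive when only duration is given,
    otherwise the simulated clock would never pass the duration.
    """
    if num_jobs is None and duration is None:
        raise ValueError("set num_jobs or duration")
    rng = random.Random(seed)
    jobs, clock = [], 0.0
    while num_jobs is None or len(jobs) < num_jobs:
        clock += sample_from_percentiles(stats["inter_arrival"], rng)
        if duration is not None and clock > duration:
            break
        jobs.append({
            "jobName": sample_job_name(stats["job_name_cdf"], rng),
            "arrivalTime": clock,
            "inputSize": sample_from_percentiles(stats["input_size"], rng),
            "shuffleInputRatio": sample_from_percentiles(stats["shuffle_input_ratio"], rng),
            "outputShuffleRatio": sample_from_percentiles(stats["output_shuffle_ratio"], rng),
        })
    return jobs
```

Fixing the seed makes a generated workload reproducible, so two cluster configurations can be compared against the same synthetic job stream.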

    9. Step 3: Replay
        Shell script:
            HDFS RandomWrite(max_input_size)
            sleep interval[0]
            RatioMapReduce inputFiles[0] output0 shuffleInputRatio[0] outputShuffleRatio[0]
            HDFS -rmr output0 &
            sleep interval[1]
            RatioMapReduce inputFiles[1] output1 shuffleInputRatio[1] outputShuffleRatio[1]
            HDFS -rmr output1 &
            ...
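As an illustration of how such a script might be produced from a sampled workload, the sketch below renders the slide's pseudo-commands as text. `HDFS RandomWrite`, `RatioMapReduce`, and `HDFS -rmr` are copied from the slide as placeholders and are not checked against any real command line; the job dictionary keys and the function name are hypothetical.

```python
def render_replay_script(jobs, max_input_size):
    """Render the slide's replay pseudo-script for a list of sampled jobs.

    Each job is a dict with hypothetical keys: interval (seconds to sleep
    before submission), inputFiles, shuffleInputRatio, outputShuffleRatio.
    """
    # Write the input data once, sized for the largest job, as on the slide.
    lines = ["HDFS RandomWrite({})".format(max_input_size)]
    for i, job in enumerate(jobs):
        lines.append("sleep {}".format(job["interval"]))
        lines.append(
            "RatioMapReduce {} output{} {} {}".format(
                job["inputFiles"], i,
                job["shuffleInputRatio"], job["outputShuffleRatio"],
            )
        )
        # Output is removed in the background, mirroring the slide's "&".
        lines.append("HDFS -rmr output{} &".format(i))
    return "\n".join(lines) + "\n"
```

Because the script only sleeps, launches a ratio-preserving job, and deletes its output, the replay itself adds little overhead beyond the jobs being replayed.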

    10. Pros and Cons
        + Same framework applicable across HW, config, SW, and cluster size
        + Summary distributions allow anonymized publication of traces
        – Does not capture the compute part of production jobs
            - Inherent tradeoff to compare workloads with different computations
            - Not a big loss – we will see later

    11. We actually built this!!!

    12. 6 months’ trace is representative in time
        - Extract statistics for 6 months of data from job history logs at Facebook
        - Compare distribution to two single-week traces for same cluster
        - Traces are representative in time if statistical distributions are similar
        - Gridmix/sort hits only the extremes of statistical distributions

    13. Inter-job arrival times

    14. Inter-job arrival times

    15. Data sizes (input, shuffle, output)

    16. Data sizes (input, shuffle, output)

    17. Workload useful for real problems

    18. Findings are surprising

    19. Findings are surprising

    20. Findings are surprising

    21. Findings are surprising

    22. Findings are surprising

    23. Findings are surprising

    24. Ignoring compute has low cost

    25. Future Work
        - Evaluate on more production datasets
            - Our traces are representative in time
            - Need more traces to evaluate representativeness in type of workloads
        - Performance prediction & scheduling based on generated workload
        - Extrapolation across cluster types and scale
        - Comparison between different types of workloads
        - Workload predictor/oracle?!?

    26. Questions?