1. Statistics-Driven Benchmark for MapReduce Yanpei Chen and Archana Ganapathi
2. [Context: cluster software-stack diagram showing Nexus across nodes NS1-NS3; SCADS; web apps (PIQL); X-Trace and Chukwa monitoring; Hadoop HDFS; Spark, SEJITS; KCCA-based M/R scheduling; log mining; MPI; batch/analytics]
5. We need a Hadoop workload generator Need realistic workload to evaluate Hadoop design choices
E.g. configuration, scheduling policies, resource provisioning
Current pseudo-benchmarks (gridmix/sort) unsatisfactory
Do not capture mix of jobs, data sizes, or inter-job arrival times
Cannot hog production cluster for decision-making
Hard to reproduce configurations in non-production environment
Anonymity concerns limit publication of traces for academia
6. Design Overview Agnostic to hardware, configuration, and MapReduce implementation
Accommodate different cluster sizes and workload durations
Mask raw computation done by jobs
Solution:
Collect sufficient/necessary statistics from real traces
Create synthetic workload that mimics real job stream
Replay with low overhead
7. 1. Get statistics from traces Compute CDF of job counts by job name
Collect 1st, 25th, 50th, 75th, and 99th percentiles for:
Inter-job arrival time
Per-job input data size
Per-job input:shuffle:output data ratios
Scale data sizes by the number of cluster nodes (a sketch of this extraction step follows this list)
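A minimal Python sketch of the extraction step, assuming the job history logs have already been parsed into (job_name, submit_time_s, input_bytes, shuffle_bytes, output_bytes) records; the record layout and field names here are illustrative, not the actual Hadoop job-history schema:

import numpy as np

PERCENTILES = [1, 25, 50, 75, 99]

def summarize_trace(jobs):
    # jobs: list of (job_name, submit_time_s, input_bytes, shuffle_bytes, output_bytes)
    jobs = sorted(jobs, key=lambda j: j[1])                 # order by submission time
    submit = np.array([j[1] for j in jobs], dtype=float)
    inp = np.array([j[2] for j in jobs], dtype=float)
    shuf = np.array([j[3] for j in jobs], dtype=float)
    out = np.array([j[4] for j in jobs], dtype=float)

    # CDF of job counts by job name, most frequent names first
    names, counts = np.unique([j[0] for j in jobs], return_counts=True)
    order = np.argsort(-counts)
    name_cdf = list(zip(names[order], np.cumsum(counts[order]) / len(jobs)))

    eps = 1.0                                               # avoid dividing by empty stages
    return {
        "name_cdf": name_cdf,
        "arrival_gap_s": np.percentile(np.diff(submit), PERCENTILES),
        "input_bytes": np.percentile(inp, PERCENTILES),
        "shuffle_input_ratio": np.percentile(shuf / np.maximum(inp, eps), PERCENTILES),
        "output_shuffle_ratio": np.percentile(out / np.maximum(shuf, eps), PERCENTILES),
    }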
8. 2. Sample statistics to get workload Linear interpolation between percentiles to approximate the full distribution
i.e. connect the dots between the percentile statistics
Technique suitable for most distributions
Probabilistic sampling from the approximated distribution generates tuples of
[jobName, arrivalTime, inputSize, shuffleInputRatio, outputShuffleRatio]
Repeat until we reach the required number of jobs or the target workload duration (see the sampling sketch below)
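A sketch of the sampling step under the same assumptions, using numpy's interp to connect the dots between the stored percentiles (an inverse-CDF draw); for brevity it labels jobs synthetically rather than drawing names from the job-name CDF:

import numpy as np

PCTL_POINTS = np.array([0.01, 0.25, 0.50, 0.75, 0.99])

def sample_from_percentiles(pctl_values, rng):
    # inverse-CDF sampling from the piecewise-linear CDF through the five stored percentiles
    u = rng.uniform(0.01, 0.99)
    return float(np.interp(u, PCTL_POINTS, pctl_values))

def generate_workload(stats, num_jobs, rng=None):
    rng = rng or np.random.default_rng()
    jobs, t = [], 0.0
    for i in range(num_jobs):
        t += sample_from_percentiles(stats["arrival_gap_s"], rng)
        jobs.append({
            "jobName": "synthetic_job_%d" % i,
            "arrivalTime": t,
            "inputSize": sample_from_percentiles(stats["input_bytes"], rng),
            "shuffleInputRatio": sample_from_percentiles(stats["shuffle_input_ratio"], rng),
            "outputShuffleRatio": sample_from_percentiles(stats["output_shuffle_ratio"], rng),
        })
    return jobs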
9. 3. Replay Shell Script
# pre-generate input data once in HDFS, at the maximum input size the workload needs
HDFS RandomWrite(max_input_size)
# for each synthetic job: wait out its inter-job gap, run it, then delete its output in the background
sleep interval[0]
RatioMapReduce inputFiles[0] output0 shuffleInputRatio[0] outputShuffleRatio[0]
HDFS -rmr output0 &
sleep interval[1]
RatioMapReduce inputFiles[1] output1 shuffleInputRatio[1] outputShuffleRatio[1]
HDFS -rmr output1 &
...
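A sketch of how the generated job list could be turned into a script like the one above; the RatioMapReduce invocation and HDFS paths mirror the slide's pseudocode and are placeholders, not the exact command line of a released tool:

def write_replay_script(jobs, input_files, path="replay.sh"):
    # jobs: output of generate_workload(); input_files[i]: pre-generated HDFS input for job i
    lines = ["#!/bin/bash",
             "# input data is pre-generated once, at the maximum input size the workload needs"]
    prev_arrival = 0.0
    for i, job in enumerate(jobs):
        gap = job["arrivalTime"] - prev_arrival
        prev_arrival = job["arrivalTime"]
        lines.append("sleep %d" % round(gap))
        lines.append("RatioMapReduce %s output%d %.3f %.3f" %
                     (input_files[i], i, job["shuffleInputRatio"], job["outputShuffleRatio"]))
        lines.append("hadoop fs -rmr output%d &" % i)       # delete output in the background
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")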
10. Pros and Cons + Same framework applicable across HW, config, SW, and cluster size
+ Summary distributions allow anonymized publication of traces
– Does not capture the compute part of production jobs
Inherent tradeoff: masking compute is what makes workloads with different computations comparable
Not a big loss – we will see later
11. We actually built this!!!
12. 6 months’ trace is representative in time
Extract statistics from 6 months of job history logs at Facebook
Compare their distributions against two single-week traces from the same cluster
Traces are representative in time if the statistical distributions are similar (one way to quantify this is sketched below)
Gridmix/sort hits only the extremes of statistical distributions
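One way to make "the distributions are similar" concrete is the largest vertical gap between two empirical CDFs (a Kolmogorov-Smirnov-style distance); this is an illustrative check, not necessarily the exact comparison used in the talk:

import numpy as np

def cdf_distance(sample_a, sample_b):
    # maximum absolute difference between the two empirical CDFs, evaluated at every observed point
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# e.g. cdf_distance(six_month_arrival_gaps, week1_arrival_gaps) close to 0 means the week is representative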
13-14. Inter-job arrival times [figure slides]
15-16. Data sizes (input, shuffle, output) [figure slides]
17. Workload useful for real problems
18-23. Findings are surprising [result figure slides]
24. Ignoring compute has low cost
25. Future Work Evaluate on more production datasets
Our traces are representative in time
Need more traces to evaluate representativeness in type of workloads
Performance prediction & scheduling based on generated workload
Extrapolation across cluster types and scale
Comparison between different types of workloads
Workload predictor/oracle?!?
26. Questions?