
Rigorous Benchmarking in Reasonable Time


Presentation Transcript


  1. Rigorous Benchmarking in Reasonable Time Tomas Kalibera, Richard Jones, University of Kent

  2. By comparing an old and a new system rigorously, find: Is there a performance change? How large is the change? What variation do we expect? How confident are we of the result? How many experiments must we carry out? What do we want to establish? The quantity of interest is the ratio new execution time / old execution time.

  3. Computer systems are complex. • Many factors influence performance: • Some known. • Some out of the experimenter’s control. • Some non-deterministic. • Execution times vary. • We need to design experiments and summarise results in a repeatable and reproducible fashion. Uncertainty

  4. Uncertainty should be reported! Papers published in 2011

  5. Uncertainty should be reported! 70% ignored uncertainty. Papers published in 2011

  6. Not always obvious if experiments were repeated. • Very few report that experiments repeat at more than one level, e.g. • Repeat executions (e.g. invocations of a JVM). • Repeat measurements (e.g. iterations of an application). • Number of repetitions: arbitrary or heuristic-based? How were the experiments performed?

  7. Good experimental methods take time One benchmark…

  8. Good experimental methods take time A suite…

  9. Good experimental methods take time Add invocations…

  10. Good experimental methods take time and iterations…

  11. Good experimental methods take time …and heap sizes

  12. Is statistically rigorous experimental methodology simply infeasible? A lost cause?

  13. With some initial one-off investment, • We can cater for variation • Without excessive repetition (in most cases). • Our contributions: • A sound experimental methodology that makes the best use of experiment time. • How to establish how much repetition is needed. • How to estimate error bounds. NO!

  14. Variation arises at several stages of a benchmark experiment: iteration, execution, compilation… • Controlled variables: e.g. platform, heap size or compiler options. • Random variables: characterised by their statistical properties. • Uncontrolled variables: try to convert these to controlled or randomised variables (e.g. by randomising link order). • The challenge: • How to design efficient experiments given the random variables present, and • How to summarise the results with a confidence interval. The Challenge of Reasonable Repetition

  15. An experiment with 3 “levels” (though our technique is general): Repeat compilation to create a binary, e.g. if code performance depends on layout. Repeat executions of the same binary. Repeat iterations of a benchmark. Our running example

  16. Researchers are typically interested in steady state performance. Initialised state: no significant initialisation overhead. Independent state: iteration times are (statistically) independent and identically distributed (IID). Don’t repeat measurements before independence. If measurements are not IID, the variance and confidence interval estimates will be biased. Independent state

  17. Does a benchmark reach an independent state? After how many iterations? • DaCapo/OpenJDK 7: ‘large’ and ‘small’ sizes; 3 executions, 300 iterations/execution. • Inspect run-sequence, lag and auto-correlation plots for patterns indicating dependence. Independent state

  18. Does a benchmark reach an independent state? After how many iterations? • DaCapo/OpenJDK 7: ‘large’ and ‘small’ sizes; 3 executions, 300 iterations/execution. • Inspect run-sequence, lag and auto-correlation plots for patterns indicating dependence. Independent state RECOMMENDATION: Use this manual procedure just once to find how many iterations each benchmark, VM and platform combination requires to reach an independent state.
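
To make the manual inspection concrete, here is a minimal sketch of the three diagnostic plots for one execution's iteration times. It is illustrative only: the synthetic data and the choice of numpy, pandas, matplotlib and statsmodels are mine, not something the authors prescribe.

```python
# Sketch: diagnose whether iteration times look independent (IID).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import plot_acf

# In practice `times` would hold the measured iteration times of one execution;
# synthetic data with a mild warm-up trend stands in here so the sketch runs.
rng = np.random.default_rng(42)
times = 1.0 + 0.3 * np.exp(-np.arange(300) / 30) + rng.normal(0, 0.02, 300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Run-sequence plot: look for drift, trends, or abrupt state changes.
axes[0].plot(times, marker=".", linestyle="none")
axes[0].set(title="Run sequence", xlabel="iteration", ylabel="time (s)")

# Lag plot: a structureless cloud suggests independence; bands or lines suggest dependence.
lag_plot(pd.Series(times), lag=1, ax=axes[1])
axes[1].set_title("Lag plot (lag 1)")

# Auto-correlation plot: spikes outside the confidence band indicate auto-dependence.
plot_acf(times, lags=40, ax=axes[2])

plt.tight_layout()
plt.show()
```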

  19. Reached an independent state? DaCapo ‘small’, Intel Xeon (2 processors x 4 cores x 2-way HT). [Per-benchmark run-sequence/auto-correlation plots for eclipse6, fop6, fop9, eclipse9, avrora9, bloat6, chart6, luindex6, jython6, h29, hsqldb6, lusearch9, luindex9, jython9, pmd6, tradebeans9, xalan6, sunflow9, tomcat9, pmd9, tradesoap9, xalan9.]

  20. Reached an independent state? DaCapo ‘small’, AMD Opteron (4 processors x 16 cores). [Per-benchmark plots, same benchmark set as slide 19.]

  21. Reached an independent state? DaCapo ‘large’, Intel Xeon (2 processors x 4 cores x 2-way HT). [Per-benchmark plots, same benchmark set as slide 19.]

  22. Reached an independent state? DaCapo ‘large’, AMD Opteron (4 processors x 16 cores). [Per-benchmark plots, same benchmark set as slide 19.]

  23. Reached an independent state? DaCapo ‘small’, AMD Opteron (4 processors x 16 cores). [Per-benchmark plots, same benchmark set as slide 19.]

  24. Many benchmarks do not reach an independent state in a reasonable time. • Most have strong auto-dependencies. • Gradual drift in times and trends (increases and decreases); abrupt state changes; systematic transitions. • The choice of iteration significantly influences a result. • Problematic for online algorithms, which try to distinguish small differences even though the noise is many times larger. • Fortunately, trends tend to be consistent across runs. Some benchmarks don’t reach an independent state

  25. Many benchmarks do not reach an independent state in a reasonable time. • Most have strong auto-dependencies. • Gradual drift in times and trends (increases and decreases); abrupt state changes; systematic transitions. • The choice of iteration significantly influences a result. • Problematic for online algorithms, which try to distinguish small differences even though the noise is many times larger. • Fortunately, trends tend to be consistent across runs. Some benchmarks don’t reach an independent state RECOMMENDATION: If a benchmark does not reach an independent state in a reasonable time, take the same iteration from each run.

  26. Heuristics don’t do well

  27. Heuristics don’t do well Wastes time!

  28. Heuristics don’t do well Unusable!

  29. Heuristics don’t do well Initialised in reasonable time

  30. Run a benchmark to independence and then repeat a number of iterations, collecting each result? or • Repeatedly run a benchmark until it is initialised and then collect a single result? • The first method saves experimentation time if • variation between iterations > variation between executions, • initialisation warmup + VM initialisation is large, and • independence warmup is small. What to repeat?

  31. Run a benchmark to independence and then repeat a number of iterations, collecting each result? or • Repeatedly run a benchmark until it is initialised and then collect a single result? • The first method saves experimentation time if • variation between iterations > variation between executions, • initialisation warmup + VM initialisation is large, and • independence warmup is small. What to repeat?

  32. Goal: We want to quantify a performance optimisation in the form of an effect size confidence interval, e.g. “we are 95% confident that system A is faster than system B by 5.5% ± 2.5%”. • We need to repeat executions and take multiple measurements from each. • For a given experimental budget, we want to obtain the tightest possible confidence interval. • Adding repetition at the highest level always increases precision, • but it is often cheaper to add repetitions at lower levels. A clear but rigorous account

  33. How many repetitions to do at which levels? • Run an initial, dimensioning experiment: • Gather the cost of a repetition at each level: • Iteration: time to complete an iteration. • Execution: more expensive, as it must reach an independent state. • Calculate optimal repetition counts for the real experiment. • Run the real experiment: • Use the optimal repetition counts from the initial experiment. • Calculate the effect size confidence interval. Multi-level repetition

  34. Choose arbitrary repetition counts r_1, …, r_n • 20 may be enough, 30 if possible, 10 if you must (e.g. if there are many levels). • Then measure the cost of each level, e.g. • c_1: time to get an iteration (iteration duration). • c_2: time to get an execution (time to reach an independent state). • c_3: time to get a binary (build time). • Also take the measurement times Y_{j_n, …, j_1} • Y_{2,1,3} = time of the 3rd non-warmup iteration from the 1st execution of the 2nd binary. Initial experiment
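
The index convention can be made concrete with a short sketch. The 3-level layout and the Y_{2,1,3} example follow the slide; the repetition counts, placeholder data and cost values are invented for illustration.

```python
import numpy as np

# Illustrative repetition counts for the 3-level running example:
# r3 binaries (highest level), r2 executions per binary, r1 iterations per execution.
r3, r2, r1 = 5, 10, 20

# Y[j3, j2, j1] = time of non-warmup iteration j1 in execution j2 of binary j3
# (0-based here; the slide's indices are 1-based). Placeholder data stands in
# for real measurements.
Y = np.random.default_rng(0).normal(loc=1.0, scale=0.05, size=(r3, r2, r1))

# Slide example: Y_{2,1,3} is the 3rd non-warmup iteration of the
# 1st execution of the 2nd binary.
y_2_1_3 = Y[1, 0, 2]

# Per-level costs gathered during the dimensioning run (values invented):
c1 = 2.0      # seconds per iteration
c2 = 120.0    # seconds to reach an independent state (one execution)
c3 = 300.0    # seconds to build one binary
```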

  35. First calculate the n biased estimators S_1^2, …, S_n^2. Then calculate the unbiased estimators T_i^2 iteratively. Variance estimators (initial experiment)

  36. First calculate the n biased estimators S_1^2, …, S_n^2. Then calculate the unbiased estimators T_i^2 iteratively. Variance estimators (initial experiment) • A mean calculated over all values of an index is denoted by replacing that index with a bullet (•).
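
The estimator formulas themselves were shown as images on the slide and are missing from the transcript; the sketch below is based on my reading of the accompanying paper, so treat the exact expressions as an assumption to check against the technical report. It computes the biased S_i^2 bottom-up and then the unbiased T_i^2 iteratively, for measurements laid out as in the previous sketch.

```python
import numpy as np

def variance_estimators(Y):
    """Per-level variance estimators for an n-level experiment.

    Y is indexed Y[j_n, ..., j_1] (highest level first), so Y.shape[::-1]
    gives (r_1, ..., r_n). Returns the biased estimators S_1^2..S_n^2 and
    the unbiased estimators T_1^2..T_n^2.
    """
    n = Y.ndim
    r = Y.shape[::-1]                    # r[0] = r_1 (lowest level), ..., r[n-1] = r_n
    S = []
    means = Y.astype(float)
    for i in range(1, n + 1):
        upper = means.mean(axis=-1, keepdims=True)   # level i averaged out (the "bullet" mean)
        dev = means - upper                          # spread of level-(i-1) means around level-i means
        denom = int(np.prod(Y.shape[:n - i])) * (r[i - 1] - 1)
        S.append(float((dev ** 2).sum()) / denom)    # biased estimator S_i^2
        means = upper[..., 0]                        # drop the averaged-out axis
    # Unbiased estimators, computed iteratively: T_1^2 = S_1^2 and
    # T_i^2 = S_i^2 - S_{i-1}^2 / r_{i-1}  (my reading of the paper).
    T = [S[0]]
    for i in range(1, n):
        T.append(S[i] - S[i - 1] / r[i - 1])         # can be negative; usually treated as ~0
    return S, T
```

With the Y array from the previous sketch, `S, T = variance_estimators(Y)` returns three values in each list, one per level.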

  37. The optimal repetition counts to be used in the real experiments are r_1, …, r_{n-1}. • We don’t calculate r_n, the repetition count for the highest level: • r_n can always be increased for more precision. • Calculate the variance estimators S_n^2 for the real experiment as before, but using the optimal repetition counts r_1, …, r_{n-1} and the measurements from the real experiment. Optimal repetition counts (real experiment)
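
The slide does not reproduce the formula for the optimal counts, so the sketch below uses the cost/variance trade-off expression as I understand it from the paper (each r_i grows with the cost of the level above relative to this level, and with the variance at this level relative to the level above); verify it against the paper before relying on it.

```python
import math

def optimal_repetition_counts(T2, costs):
    """Optimal repetition counts r_1, ..., r_{n-1} for the real experiment.

    T2    : unbiased variance estimators [T_1^2, ..., T_n^2] from the
            dimensioning experiment
    costs : per-level costs [c_1, ..., c_n], e.g. c_1 = one iteration,
            c_2 = one execution, c_3 = one build

    Assumed formula (my reading of the paper):
        r_i = sqrt((c_{i+1} / c_i) * (T_i^2 / T_{i+1}^2)), rounded up.
    r_n is not computed: repetitions at the top level can always be added
    for more precision.
    """
    r = []
    for i in range(len(T2) - 1):
        if T2[i] <= 0 or T2[i + 1] <= 0:
            # Degenerate variance estimate at this or the next level: fall back to 1.
            r.append(1)
            continue
        r.append(max(1, math.ceil(math.sqrt((costs[i + 1] / costs[i]) * (T2[i] / T2[i + 1])))))
    return r
```

For the 3-level running example, `optimal_repetition_counts(T, [c1, c2, c3])` would yield [r_1, r_2].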

  38. Asymptotic confidence interval with confidence (1 − α): the overall mean plus or minus the half-width h = t_{1−α/2, ν} · sqrt(S_n^2 / r_n), where t_{1−α/2, ν} is the (1 − α/2)-quantile of the t-distribution with ν = r_n − 1 degrees of freedom. See the ISMM’13 paper for details of constructing confidence intervals of execution time ratios. See our technical report for proofs and gory details. Confidence intervals (real experiment)
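
A sketch of this interval, reusing variance_estimators from the sketch after slide 36; the half-width expression sqrt(S_n^2 / r_n) is my reconstruction of the slide's missing formula, so check it against the paper.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(Y, alpha=0.05):
    """Asymptotic (1 - alpha) confidence interval for the overall mean of an
    n-level experiment, with Y indexed as Y[j_n, ..., j_1]."""
    r_n = Y.shape[0]                               # repetitions at the highest level
    S, _ = variance_estimators(Y)                  # from the sketch after slide 36
    Sn2 = S[-1]                                    # top-level variance estimator S_n^2
    t = stats.t.ppf(1 - alpha / 2, df=r_n - 1)     # (1 - alpha/2)-quantile, nu = r_n - 1
    h = t * np.sqrt(Sn2 / r_n)                     # half-width
    mean = float(Y.mean())
    return mean - h, mean + h
```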

  39. Confidence interval due to Fieller (1954). Ȳ and Ȳ′ are the average execution times from the old and new systems. Variance estimators S_n^2 and S′_n^2 and half-widths h and h′ are as before. Confidence interval for execution time ratios
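
For the ratio of execution times, the sketch below implements the standard independent-samples form of Fieller's interval in terms of the two means and half-widths mentioned on the slide; the exact parametrisation in the ISMM'13 paper may differ, so treat this as an approximation to check against the paper.

```python
import math

def fieller_ratio_ci(mean_old, h_old, mean_new, h_new):
    """Confidence interval for the ratio mean_new / mean_old (Fieller, 1954).

    mean_old, mean_new : average execution times of the old and new systems
    h_old, h_new       : their confidence-interval half-widths, computed as in
                         the previous sketch with the same alpha
    """
    # Standard Fieller form for an independent numerator and denominator.
    # Well-defined when the denominator is significantly non-zero
    # (mean_old^2 > h_old^2), which always holds for execution times in practice.
    denom = mean_old ** 2 - h_old ** 2
    disc = math.sqrt((mean_old * mean_new) ** 2 - denom * (mean_new ** 2 - h_new ** 2))
    return ((mean_old * mean_new - disc) / denom,
            (mean_old * mean_new + disc) / denom)
```

A result such as (0.91, 0.96) would read: the new system takes 91% to 96% of the old system's time, i.e. a 4% to 9% improvement at the chosen confidence level.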

  40. For each benchmark/VM/platform… • Conduct a dimensioning experiment to establish the optimal repetition counts for every level but the top level of the real experiment. • Redimension only if the benchmark/VM/platform changes. In practice

  41. The confidence interval half-widths obtained using the optimal repetition counts correspond closely to those obtained by running large numbers of executions (30) and iterations (40). • But the repetition counts are much lower. • E.g. lusearch: r_1 = 1, so time is better spent repeating executions. DaCapo (revisited)

  42. Researchers should provide measures of variation when reporting results. DaCapo and SPEC CPU benchmarks need very different repetition counts on different platforms before they reach an initialised or independent state. Iteration execution times are often strongly auto-dependent: for these, automatic detection of a steady state is not applicable; such heuristics can waste time or mislead. A one-off (per benchmark/VM/platform) dimensioning experiment can provide the optimal counts for repetition at each level of the real experiments. Conclusions

  43. RECOMMENDATION: Benchmark developers should include our dimensioning methodology as a one-off, per-system configuration requirement.

  44. Code layout experiments

  45. Mean execution times • Minimum threshold for the ratio of execution times • Only interested in ‘significant’ performance changes • Improvements in systems research are often small, e.g. 10%. • Many factors influence performance • E.g. memory placement, randomised compilation algorithms, JIT compiler, symbol names… • [Mytkowicz et al., ASPLOS 2009; Gu et al., Component and Middleware Performance Workshop 2004] • Randomisation to avoid measurement bias • E.g. the Stabiliser tool [Curtsinger & Berger, UMass TR, 2012] What’s of interest?

  46. Based on 2-level hierarchical experiments • Repeat measurements until the standard deviation of the last few measurements is small enough. • Quantify changes using a visual or statistical significance test • [Georges et al., OOPSLA 2007; PhD 2008] • Problems: • Two levels are not always appropriate. • Null hypothesis significance tests are deprecated in other sciences. • Visual tests are overly conservative. Current best practice

  47. Null hypothesis: “the 2 systems have the same performance”. A test checks whether the null hypothesis can be rejected: “it is unlikely that the systems have the same performance”. Examples: Student’s t-test, the visual test. Null hypothesis significance tests

  48. Visual test • Construct confidence intervals. • Do they overlap? • If not, it is unlikely that the systems have the same performance. • [If there is only slight overlap (the centre of one interval is not covered by the other), fall back to a statistical test.]
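
As a small illustration, the overlap rules on this slide can be written down directly; the function name and its return strings below are illustrative, not part of the original method.

```python
def visual_test(ci_a, ci_b):
    """Compare two (low, high) confidence intervals following the slide's rules."""
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    centre_a = (lo_a + hi_a) / 2
    centre_b = (lo_b + hi_b) / 2
    if hi_a < lo_b or hi_b < lo_a:
        # No overlap: it is unlikely the systems have the same performance.
        return "difference"
    if lo_a <= centre_b <= hi_a or lo_b <= centre_a <= hi_b:
        # Large overlap (a centre is covered by the other interval):
        # the visual test cannot claim a difference.
        return "no evidence of a difference"
    # Slight overlap, neither centre covered: fall back to a statistical test.
    return "inconclusive: use a statistical test"
```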

  49. It does not tell us what we want to know: • It only tells us whether there is a performance change, not how large it is. • We could also report the ratio of sample means, • but we still don’t know how much of this change is due to uncertainty. • The decision is affected by sample size: • with a large enough sample, even a small and meaningless change becomes statistically significant. • Its limitations have been known for 70 years. • Deprecated in many fields: statistics, psychology, medicine, biology, chemistry, sociology, education, ecology… What’s wrong with this?
