
Statistical Modeling of Feedback Data in an Automatic Tuning System


Presentation Transcript


  1. Statistical Modeling of Feedback Data in an Automatic Tuning System Richard Vuduc, James Demmel (U.C. Berkeley, EECS) {richie,demmel}@cs.berkeley.edu Jeff Bilmes (Univ. of Washington, EE) bilmes@ee.washington.edu December 10, 2000 Workshop on Feedback-Directed Dynamic Optimization

  2. Context: High Performance Libraries • Libraries can isolate performance issues • BLAS/LAPACK/ScaLAPACK (linear algebra) • VSIPL (signal and image processing) • MPI (distributed parallel communications) • Can we implement libraries … • automatically and portably • incorporating machine-dependent features • matching performance of hand-tuned implementations • leveraging compiler technology • using domain-specific knowledge

  3. Generate and Search:An Automatic Tuning Methodology • Given a library routine • Write parameterized code generators • parameters • machine (e.g., registers, cache, pipeline) • input (e.g., problem size) • problem-specific transformations • output high-level source (e.g., C code) • Search parameter spaces • generate an implementation • compile using native compiler • measure performance (“feedback”)
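A rough sketch of this generate-and-search loop appears below. The generator, compiler driver, and timer (generate_source, compile_native, measure_mflops) are hypothetical callbacks standing in for a tuning system's components, not PHiPAC's actual interfaces.

    import itertools
    import random

    def search(parameter_space, generate_source, compile_native, measure_mflops, trials=500):
        """Generate implementations in random order, compile each one, and keep the fastest.

        parameter_space: dict mapping parameter name -> list of candidate values
        generate_source / compile_native / measure_mflops: user-supplied callbacks
        (hypothetical stand-ins for the generator, native compiler, and timing harness).
        """
        best_perf, best_params = 0.0, None
        candidates = list(itertools.product(*parameter_space.values()))
        random.shuffle(candidates)                     # search the space in random order
        for values in candidates[:trials]:
            params = dict(zip(parameter_space.keys(), values))
            binary = compile_native(generate_source(params))   # emit C source, then build it
            perf = measure_mflops(binary)                      # "feedback": measured MFLOP/s
            if perf > best_perf:
                best_perf, best_params = perf, params
        return best_params, best_perf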

  4. Tuning System Examples • Linear algebra • PHiPAC (Bilmes, et al., 1997) • ATLAS (Whaley and Dongarra, 1998) • Sparsity (Im and Yelick, 1999) • Signal Processing • FFTW (Frigo and Johnson, 1998) • SPIRAL (Moura, et al., 2000) • Parallel Communications • Automatically tuned MPI collective operations (Vadhiyar, et al., 2000) • Related: Iterative compilation (Bodin, et al., 1998)

  5. Road Map • Context • The Search Problem • Problem 1: Stopping searches early • Problem 2: High-level run-time selection • Summary

  6. The Search Problem in PHiPAC • PHiPAC (Bilmes, et al., 1997) • produces dense matrix multiply (matmul) implementations • generator parameters include • size and depth of fully unrolled “core” matmul • rectangular, multi-level cache tile sizes • 6 flavors of software pipelining • scaling constants, transpose options, precisions, etc. • An experiment • fix software pipelining method • vary register tile sizes • 500 to 2500 “reasonable” implementations on 6 platforms
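To give a concrete feel for where the 500 to 2500 candidates come from, the sketch below enumerates register-tile shapes for the unrolled core matmul and keeps those whose operand footprint fits in the register file. The footprint rule and function name are illustrative assumptions; PHiPAC's actual pruning heuristics are more detailed.

    def reasonable_register_tiles(num_registers, max_dim=16):
        """Enumerate register-tile shapes (m0, k0, n0) for the fully unrolled core matmul,
        keeping only those whose operand footprint fits in the register file.
        The footprint rule below is an illustrative assumption, not PHiPAC's exact heuristic."""
        tiles = []
        for m0 in range(1, max_dim + 1):
            for n0 in range(1, max_dim + 1):
                # rough footprint: an m0 x n0 accumulator block of C, a column of A, a row of B
                if m0 * n0 + m0 + n0 > num_registers:
                    continue
                for k0 in range(1, max_dim + 1):       # k0 affects code size, not registers
                    tiles.append((m0, k0, n0))
        return tiles

    # e.g. reasonable_register_tiles(32) on a machine with 32 floating-point registers;
    # combined with the other generator options, pruning like this yields the hundreds to
    # thousands of "reasonable" implementations searched per platform.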

  7. A Needle in a Haystack, Part I

  8. Road Map • Context • The Search Problem • Problem 1: Stopping searches early • Problem 2: High-level run-time selection • Summary

  9. Problem 1: Stopping Searches Early • Assume • dedicated resources limited • near-optimal implementation okay • Recall the search procedure • generate implementations at random • measure performance • Can we stop the search early? • how early is “early?” • guarantees on quality?

  10. An Early Stopping Criterion • Performance scaled from 0 (worst) to 1 (best) • Goal: stop after t implementations when Prob[ M_t <= 1 - ε ] < α • M_t: maximum observed performance after t implementations • ε: proximity to best • α: error tolerance • example: “find within the top 5% with error 10%” • ε = 0.05, α = 0.1 • Can show this probability depends only on F(x) = Prob[ performance <= x ] • Idea: estimate F(x) from the observed samples
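In code, the criterion reduces to maintaining the empirical estimate of F at 1 - ε over the samples seen so far and stopping once F_hat(1 - ε)^t < α. The sketch below assumes independent, identically distributed samples and performance already scaled to [0, 1]; it is a plug-in illustration, not necessarily the talk's exact estimator, and measure is a hypothetical timing callback.

    def search_with_early_stopping(measure, epsilon=0.05, alpha=0.10, max_trials=2500):
        """Randomly sampled search that stops once the estimated probability that the
        best-so-far is NOT within epsilon of optimal falls below alpha.

        measure(t): performance of the t-th randomly generated implementation,
        scaled to [0, 1] (0 = worst, 1 = best).  Assuming i.i.d. samples,
        Prob[ M_t <= 1 - epsilon ] = F(1 - epsilon)^t, with F replaced by its
        empirical estimate (a plug-in sketch, not necessarily the talk's estimator).
        """
        samples, best = [], 0.0
        for t in range(1, max_trials + 1):
            perf = measure(t)
            samples.append(perf)
            best = max(best, perf)
            # empirical CDF evaluated at 1 - epsilon
            F_hat = sum(1 for x in samples if x <= 1.0 - epsilon) / len(samples)
            if F_hat ** t < alpha:       # estimated Prob[ M_t <= 1 - epsilon ] < alpha
                break
        return best, t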

  11. Stopping time (300 MHz Pentium-II)

  12. Stopping Time (Cray T3E Node)

  13. Road Map • Context • The Search Problem • Problem 1: Stopping searches early • Problem 2: High-level run-time selection • Summary

  14. Problem 2: Run-Time Selection • Assume • one implementation is not best for all inputs • a few good implementations are known • can benchmark • How do we choose the “best” implementation at run-time? • Example: matrix multiply C = C + A*B, with A of size M x K, B of size K x N, and C of size M x N, tuned for small (L1), medium (L2), and large workloads

  15. Truth Map (Sun Ultra-I/170)

  16. A Formal Framework • Given • m implementations • n sample inputs (training set) • execution time • Find • decision function f(s) • returns the “best” implementation on input s • f(s) cheap to evaluate
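Concretely, the framework labels each training input with its fastest implementation and scores a candidate decision function against those labels. The miss rate and average slowdown below are illustrative error metrics assumed for this sketch.

    def best_labels(times):
        """times[i][j] = execution time of implementation i on sample input j.
        Returns, for each input, the index of the fastest implementation."""
        n_inputs = len(times[0])
        return [min(range(len(times)), key=lambda i: times[i][j]) for j in range(n_inputs)]

    def evaluate(f, inputs, times):
        """Score a decision function f(s) -> implementation index on the training set.
        Returns (miss rate, average slowdown vs. the per-input optimum), two
        illustrative error metrics assumed for this sketch."""
        labels = best_labels(times)
        misses, slowdown = 0, 0.0
        for j, s in enumerate(inputs):
            chosen = f(s)
            misses += (chosen != labels[j])
            slowdown += times[chosen][j] / times[labels[j]][j]
        return misses / len(inputs), slowdown / len(inputs)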

  17. Solution Techniques (Overview) • Method 1: Cost Minimization • minimize overall execution time on samples (boundary modeling) • pro: intuitive, f(s) cheap • con: ad hoc, geometric assumptions • Method 2: Regression (Brewer, 1995) • model run-time of each implementation, e.g., T_a(N) = b3*N^3 + b2*N^2 + b1*N + b0 • pro: simple, standard • con: user must define model • Method 3: Support Vector Machines • statistical classification • pro: solid theory, many successful applications • con: heavy training and prediction machinery

  18. Results 1: Cost Minimization

  19. Results 2: Regression

  20. Results 3: Classification

  21. Quantitative Comparison • Note: cost of a regression or cost-minimization prediction is roughly that of a 3x3 matmul • cost of an SVM prediction is roughly that of a 32x32 matmul

  22. Road Map • Context • The Search Problem • Problem 1: Stopping searches early • Problem 2: High-level run-time selection • Summary

  23. Conclusions • Search beneficial • Early stopping • simple (random + a little bit) • informative criteria • High-level run-time selection • formal framework • error metrics • To do • other stopping models (cost-based) • large design space for run-time selection

  24. Extra Slides More detail (time and/or questions permitting)

  25. PHiPAC Performance (Pentium-II)

  26. PHiPAC Performance (Ultra-I/170)

  27. PHiPAC Performance (IBM RS/6000)

  28. PHiPAC Performance (MIPS R10K)

  29. Needle in a Haystack, Part II

  30. Performance Distribution (IBM RS/6000)

  31. Performance Distribution (Pentium II)

  32. Performance Distribution (Cray T3E Node)

  33. Performance Distribution (Sun Ultra-I)

  34. Cost Minimization • Decision function • Minimize overall execution time on samples • Softmax weight (boundary) functions
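A minimal sketch of the idea, assuming softmax boundary functions w_a(s) proportional to exp(theta_a · s) and plain finite-difference gradient descent on the softmax-weighted total execution time over the samples; the talk's exact parameterization and optimizer are not reproduced here.

    import numpy as np

    def softmax_weights(theta, S):
        """theta: (m, d) boundary parameters; S: (n, d) sample inputs (include a bias feature).
        Returns the (n, m) softmax weights w_a(s) = exp(theta_a . s) / sum_b exp(theta_b . s)."""
        z = S @ theta.T
        z -= z.max(axis=1, keepdims=True)                # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def fit_cost_min(S, T, steps=500, lr=0.1):
        """Choose theta to minimize sum_j sum_a w_a(s_j) * T[j, a], where T[j, a] is the
        time of implementation a on input j.  Finite-difference gradient descent keeps the
        sketch short; it is illustrative, not the talk's exact formulation."""
        theta = np.zeros((T.shape[1], S.shape[1]))
        cost = lambda th: float((softmax_weights(th, S) * T).sum())
        for _ in range(steps):
            grad, eps = np.zeros_like(theta), 1e-4
            base = cost(theta)
            for idx in np.ndindex(*theta.shape):         # numerical gradient, entry by entry
                th = theta.copy()
                th[idx] += eps
                grad[idx] = (cost(th) - base) / eps
            theta -= lr * grad
        return theta

    def select_cost_min(theta, s):
        """Decision function: the implementation whose boundary weight is largest at s."""
        return int(np.argmax(theta @ s))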

  35. Regression • Decision function • Model implementation running time (e.g., square matmul of dimension N) • For general matmul with operand sizes (M, K, N), we generalize the above to include all product terms • MKN, MK, KN, MN, M, K, N
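A least-squares sketch of the regression approach, using exactly the product terms listed above plus a constant intercept (the intercept is an assumption of this sketch); the decision function picks the implementation with the smallest predicted running time.

    import numpy as np

    def features(M, K, N):
        """Product-term features for the general matmul running-time model:
        MKN, MK, KN, MN, M, K, N, plus a constant intercept (assumed here)."""
        return np.array([M*K*N, M*K, K*N, M*N, M, K, N, 1.0], dtype=float)

    def fit_regression(sizes, times):
        """Fit one linear running-time model per implementation by least squares.
        sizes: list of (M, K, N) training inputs; times[a][j]: time of implementation a
        on input j.  Returns one coefficient vector per implementation."""
        X = np.array([features(*s) for s in sizes])
        return [np.linalg.lstsq(X, np.asarray(t, dtype=float), rcond=None)[0] for t in times]

    def select_regression(coeffs, M, K, N):
        """Decision function: the implementation with the smallest predicted running time."""
        x = features(M, K, N)
        return int(np.argmin([c @ x for c in coeffs]))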

  36. Support Vector Machines • Decision function • Binary classifier
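The slide's equations are not reproduced here; as a stand-in, the sketch below trains a multi-class SVM (built from binary classifiers, via scikit-learn's SVC) to map operand sizes (M, K, N) to the fastest implementation observed in training.

    import numpy as np
    from sklearn.svm import SVC

    def fit_svm_selector(sizes, times, C=10.0):
        """Train an SVM to predict the fastest implementation for a given input size.
        sizes: list of (M, K, N) training inputs; times[a][j]: time of implementation a
        on input j.  Uses scikit-learn's SVC (RBF kernel; binary SVMs combined one-vs-one)
        as a stand-in for the talk's classifier."""
        X = np.asarray(sizes, dtype=float)
        y = np.argmin(np.asarray(times), axis=0)     # label = index of fastest implementation
        return SVC(C=C, kernel="rbf", gamma="scale").fit(X, y)

    def select_svm(clf, M, K, N):
        """Decision function: predicted best implementation for operand sizes (M, K, N)."""
        return int(clf.predict([[M, K, N]])[0])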

  37. Proximity to Best (300 MHz Pentium-II)

  38. Proximity to Best (Cray T3E Node)
