
An Experimental Comparison of Empirical and Model-based Optimization


Presentation Transcript


  1. An Experimental Comparison of Empirical and Model-based Optimization Keshav Pingali, Cornell University Joint work with: Kamen Yotov², Xiaoming Li¹, Gang Ren¹, Michael Cibulskis¹, Gerald DeJong¹, Maria Garzaran¹, David Padua¹, Paul Stodghill², Peng Wu³ (¹UIUC, ²Cornell University, ³IBM T.J. Watson)

  2. Context: High-performance libraries • Traditional approach • Hand-optimized code: (e.g.) BLAS • Problem: tedious to develop • Alternatives: • Restructuring compilers • General purpose: generate code from high-level specifications • Use architectural models to determine optimization parameters • Performance of optimized code is not satisfactory • Library generators • Problem-specific: (e.g.) ATLAS for BLAS, FFTW for FFT • Use empirical optimization to determine optimization parameters • Believed to produce optimal code • Why are library generators beating compilers?

  3. How important is empirical search? • Model-based optimization and empirical optimization are not in conflict • Use models to prune search space • Search → intelligent search • Use empirical observations to refine models • Understand what essential aspects of reality are missing from model and refine model appropriately • Multiple models are fine • Learn from experience

  4. Previous work • Compared performance of • code generated by a sophisticated compiler like SGI MIPSpro • code generated by ATLAS • found ATLAS code is better • Hard to answer why • Perhaps ATLAS is effectively doing transformations that compilers do not know about • Phase-ordering problem: perhaps compilers are doing transformations in wrong order • Perhaps parameters to transformations are chosen sub-optimally by compiler using models

  5. Our Approach • Original ATLAS infrastructure: Detect Hardware Parameters (L1Size, NR, MulAdd, Latency) → ATLAS Search Engine (MMSearch) → ATLAS MM Code Generator (MMCase) → MiniMMM source, driven by a Compile/Execute/Measure-MFLOPS feedback loop that picks NB, MU, NU, KU, and xFetch • Model-Based ATLAS infrastructure: Detect Hardware Parameters (L1Size, L1I$Size, NR, MulAdd, Latency) → Model → ATLAS MM Code Generator (MMCase) → MiniMMM source; the model computes NB, MU, NU, KU, and xFetch directly, with no search loop [Diagram: the two pipelines shown side by side]

  6. Detecting Machine Parameters • Micro-benchmarks • L1Size: L1 data cache size • Similar to the Hennessy-Patterson book • NR: number of registers • MulAdd: fused multiply-add (FMA), i.e. "c += a*b" as one instruction, as opposed to "c += t; t = a*b" • Latency: latency of FP multiplication
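
The slide's micro-benchmarks can be illustrated with a short C program. This is a minimal sketch of the classic working-set-sweep measurement for L1Size, in the spirit of the Hennessy-Patterson approach cited above; the array-size limit, stride, and iteration count are illustrative assumptions, not ATLAS's actual probe code.

    #include <stdio.h>
    #include <time.h>

    #define MAX_BYTES (1 << 22)              /* probe working sets up to 4 MB */
    static volatile char buf[MAX_BYTES];

    /* Time strided accesses over a working set of `bytes` (a power of two).
       Stride 64 is assumed to be >= the line size, so each access touches a
       new line; time per access jumps once `bytes` exceeds the L1 size. */
    static double time_per_access(size_t bytes) {
        const long iters = 10 * 1000 * 1000;
        size_t mask = bytes - 1;
        clock_t t0 = clock();
        for (long i = 0; i < iters; i++)
            buf[((size_t)i * 64) & mask]++;
        return (double)(clock() - t0) / CLOCKS_PER_SEC / (double)iters;
    }

    int main(void) {
        for (size_t bytes = 1 << 10; bytes <= MAX_BYTES; bytes <<= 1)
            printf("%8zu bytes: %.2f ns/access\n",
                   bytes, 1e9 * time_per_access(bytes));
        /* Estimate L1Size as the largest size before the first jump. */
        return 0;
    }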

  7. Code Generation: Compiler View [Same diagram as slide 5: the Original and Model-Based ATLAS infrastructures, viewing the MM code generator as a compiler whose optimization parameters come either from search or from a model]

  8. BLAS • Let us focus on BLAS-3 • Code for MMM:
    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < K; k++)
          C[i][j] += A[i][k]*B[k][j];
• Properties • Very good reuse: O(N²) data, O(N³) computation • Many optimization opportunities • Few "real" dependencies • Will run poorly on modern machines • Poor use of cache and registers • Poor use of processor pipelines

  9. Optimizations • Cache-level blocking (tiling) • Atlas blocks only for L1 cache • Register-level blocking • Highest level of memory hierarchy • Important to hold array values in registers • Software pipelining • Unroll and schedule operations • Versioning • Dynamically decide which way to compute • Back-end compiler optimizations • Scalar Optimizations • Instruction Scheduling

  10. Cache-level blocking (tiling) • Tiling in ATLAS • Only square tiles (NBxNBxNB) • Working set of a tile fits in L1 • Tiles are usually copied to contiguous storage • Special "clean-up" code generated for boundaries • Mini-MMM:
    for (int j = 0; j < NB; j++)
      for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
          C[i][j] += A[i][k] * B[k][j];
• NB: optimization parameter
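
To make the tiling concrete, here is a minimal C sketch (not ATLAS's generated code) of an MMM blocked for L1, with the boundary "clean-up" handled by min-style tile bounds; the row-major layout and the fixed NB value are illustrative assumptions, and copying tiles to contiguous storage, as ATLAS does, is omitted for brevity.

    #define NB 40  /* tile size; an optimization parameter in ATLAS */

    /* C += A*B for N x N row-major matrices, blocked into NB x NB tiles. */
    void mmm_tiled(int N, const double *A, const double *B, double *C) {
        for (int jj = 0; jj < N; jj += NB)
            for (int ii = 0; ii < N; ii += NB)
                for (int kk = 0; kk < N; kk += NB) {
                    /* Tile bounds; the min handles the clean-up case. */
                    int jmax = jj + NB < N ? jj + NB : N;
                    int imax = ii + NB < N ? ii + NB : N;
                    int kmax = kk + NB < N ? kk + NB : N;
                    /* Mini-MMM on one tile (working set ~ 3*NB*NB doubles). */
                    for (int j = jj; j < jmax; j++)
                        for (int i = ii; i < imax; i++)
                            for (int k = kk; k < kmax; k++)
                                C[i*N + j] += A[i*N + k] * B[k*N + j];
                }
    }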

  11. Short excursion into tiling

  12. MMM miss ratio • L1 cache miss ratio for Intel Pentium III • MMM with N = 1…1300 • Cache: 16 KB, 32 B/block, 4-way associative; 8-byte elements

  13. IJK version (large cache)
    DO I = 1, N  // row-major storage
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
[Diagram: A traversed by rows along K, B by columns along K, C by elements]
• Large cache scenario: • Matrices are small enough to fit into cache • Only cold misses, no capacity misses • Miss ratio: • Data size = 3N² elements, so cold misses = 3N²/b (each miss brings in b floating-point numbers) • Memory accesses = 4N³ (reads of C, A, B and a write of C in each of the N³ iterations) • Miss ratio = (3N²/b) / 4N³ = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)

  14. IJK version (small cache)
    DO I = 1, N
      DO J = 1, N
        DO K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
[Diagram: same access pattern as slide 13]
• Small cache scenario: • Matrices are large compared to cache • Reuse distance is not O(1) ⇒ miss • Cold and capacity misses • Miss ratio: • C: N²/b misses (good temporal locality) • A: N³/b misses (good spatial locality) • B: N³ misses (poor temporal and spatial locality) • Miss ratio ≈ (N³/b + N³) / 4N³ = 0.25(b+1)/b = 0.3125 (for b = 4)

  15. MMM experiments [Plot: L1 cache miss ratio for Intel Pentium III, MMM with N = 1…1300 (16 KB, 32 B/block, 4-way; 8-byte elements), with the chosen tile size annotated]

  16. Register-level blocking • Micro-MMM • A: MUx1 • B: 1xNU • C: MUxNU • Uses MUxNU + MU + NU registers • Unroll loops by MU, NU, and KU • Mini-MMM with Micro-MMM inside (the body of the k loop is replicated KU times by unrolling):
    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU)
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k++)      // unrolled KU times
          load A[i..i+MU-1, k] into registers
          load B[k, j..j+NU-1] into registers
          multiply A's and B's and add to C's
        store C[i..i+MU-1, j..j+NU-1]
• MU, NU, KU: optimization parameters
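
For concreteness, here is a minimal C sketch of the mini-MMM with a MU = NU = 2 register tile held in scalar temporaries (so the compiler can keep them in registers); the function name and fixed tile shape are illustrative, and NB is assumed even.

    /* Mini-MMM over an NB x NB row-major tile, MU = NU = 2. */
    void mini_mmm_2x2(int NB, const double *A, const double *B, double *C) {
        for (int j = 0; j < NB; j += 2)
            for (int i = 0; i < NB; i += 2) {
                /* Load the MUxNU tile of C into "registers". */
                double c00 = C[i*NB + j],     c01 = C[i*NB + j + 1];
                double c10 = C[(i+1)*NB + j], c11 = C[(i+1)*NB + j + 1];
                for (int k = 0; k < NB; k++) {
                    /* MUx1 sliver of A, 1xNU sliver of B. */
                    double a0 = A[i*NB + k], a1 = A[(i+1)*NB + k];
                    double b0 = B[k*NB + j], b1 = B[k*NB + j + 1];
                    /* MU*NU = 4 multiply-adds per k iteration. */
                    c00 += a0*b0; c01 += a0*b1;
                    c10 += a1*b0; c11 += a1*b1;
                }
                /* Store the register tile back. */
                C[i*NB + j] = c00;     C[i*NB + j + 1] = c01;
                C[(i+1)*NB + j] = c10; C[(i+1)*NB + j + 1] = c11;
            }
    }

This version uses MU*NU + MU + NU = 8 scalar temporaries, matching the register-count formula on the slide.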

  17. Scheduling [Diagram: the schedule of the unrolled micro-MMM body: an initial block of IFetch loads, then the multiplies M1…M(MU*NU) interleaved with the adds A1…A(MU*NU), each add placed Latency slots after its multiply (Latency = 2 in the figure), with the remaining loads issued in blocks of NFetch between computation] • FMA present? • Schedule computation using Latency • Schedule memory operations using IFetch, NFetch, FFetch • Latency, xFetch: optimization parameters
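
On a machine without FMA, the Latency-based skew can be shown in plain C: each multiply writes a temporary, and the dependent add is issued Latency instructions later. The values MU = NU = 2 and Latency = 2 are assumptions for illustration; real ATLAS-generated code emits a full schedule including the IFetch/NFetch load placement, which is omitted here.

    /* One k iteration of a 2x2 micro-MMM, scheduled for Latency = 2:
       each add A(n) is issued two slots after its multiply M(n), so the
       add never stalls waiting for the multiply's result. */
    static inline void micro_step(double a0, double a1, double b0, double b1,
                                  double *c00, double *c01,
                                  double *c10, double *c11) {
        double t0, t1, t2, t3;
        t0 = a0 * b0;    /* M1 */
        t1 = a0 * b1;    /* M2 */
        *c00 += t0;      /* A1 */
        t2 = a1 * b0;    /* M3 */
        *c01 += t1;      /* A2 */
        t3 = a1 * b1;    /* M4 */
        *c10 += t2;      /* A3 */
        *c11 += t3;      /* A4 */
    }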

  18. Comments • Optimization parameters • NB: constrained by size of L1 cache • MU, NU: constrained by NR • KU: constrained by size of I-cache • xFetch: constrained by number of outstanding loads • MulAdd/Latency: related to hardware parameters • Similar parameters would be used by compilers [Sketch: two MFlops-vs-parameter curves, one for a sensitive parameter (sharp peak) and one for an insensitive parameter (broad plateau)]

  19. ATLAS Search [Same diagram as slide 5, with the ATLAS Search Engine (MMSearch) and its Compile/Execute/Measure-MFLOPS loop highlighted]

  20. High-level picture • Multi-dimensional optimization problem: • Independent parameters: NB,MU,NU,KU,… • Dependent variable: MFlops • Function from parameters to variables is given implicitly; can be evaluated repeatedly • One optimization strategy: orthogonal range search • Optimize along one dimension at a time, using reference values for parameters not yet optimized • Not guaranteed to find optimal point, but might come close
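
A minimal sketch of orthogonal range search over two parameters (NB and MU), assuming a benchmark() routine that generates, compiles, runs, and times a mini-MMM; all names, ranges, and reference values here are illustrative, not the actual MMSearch code.

    #include <stdio.h>

    /* Placeholder: a real version would generate, compile, and time a
       mini-MMM with these parameters and return the measured MFlops. */
    static double benchmark(int nb, int mu) {
        return 100.0 - 0.05*(nb - 44)*(nb - 44) - (double)((mu - 4)*(mu - 4));
    }

    void orthogonal_search(int *best_nb, int *best_mu) {
        int mu_ref = 4;                  /* reference value for MU */
        double best = 0.0;

        /* Dimension 1: NB, with MU held at its reference value. */
        for (int nb = 16; nb <= 80; nb += 4) {
            double mflops = benchmark(nb, mu_ref);
            if (mflops > best) { best = mflops; *best_nb = nb; }
        }

        /* Dimension 2: MU, reusing the best NB found above. */
        best = 0.0;
        for (int mu = 1; mu <= 8; mu++) {
            double mflops = benchmark(*best_nb, mu);
            if (mflops > best) { best = mflops; *best_mu = mu; }
        }
    }

    int main(void) {
        int nb = 0, mu = 0;
        orthogonal_search(&nb, &mu);
        printf("best NB = %d, best MU = %d\n", nb, mu);
        return 0;
    }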

  21. Specification of OR Search • Order in which dimensions are optimized • Reference values for un-optimized dimensions at any step • Interval in which range search is done for each dimension

  22. Search strategy • Find Best NB • Find Best MU & NU • Find Best KU • Find Best xFetch • Find Best Latency (lat) • Find non-copy version tile size (NCNB)

  23. Find Best NB • Search in following range: 16 ≤ NB ≤ 80, NB² ≤ L1Size • In this search, use simple estimates for other parameters • (e.g.) KU: test each candidate for • Full K unrolling (KU = NB) • No K unrolling (KU = 1)

  24. Finding other parameters • Find best MU, NU: try all MU & NU that satisfy 1 ≤ MU, NU ≤ NB and MU*NU + MU + NU ≤ NR • In this step, use best NB from previous step • Find best KU • Find best Latency in [1…6] • Find best xFetch: IFetch in [2, MU+NU], NFetch in [1, MU+NU−IFetch]

  25. Our Models [Same diagram as slide 5, with the Model box highlighted: measured hardware parameters (L1Size, L1I$Size, NR, MulAdd, Latency) feed a model that computes NB, MU, NU, KU, and xFetch directly]

  26. Modeling for Optimization Parameters • Optimization parameters • NB: hierarchy of models (later) • MU, NU: chosen so the register tile fits in the register file, MU*NU + MU + NU ≤ NR • KU: maximized subject to the unrolled loop body fitting in the L1 instruction cache • Latency, MulAdd: taken from the measured hardware parameters • xFetch: set to 2
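
As an illustration of how such a model replaces search, here is a small C routine that solves the register-tile constraint MU*NU + MU + NU ≤ NR for a near-square tile; the solver is a sketch of the idea, not the paper's exact procedure.

    #include <math.h>
    #include <stdio.h>

    /* Pick the largest near-square register tile with MU*NU + MU + NU <= NR.
       For a square tile, u*u + 2u <= NR, i.e. (u+1)^2 <= NR + 1. */
    void model_mu_nu(int NR, int *MU, int *NU) {
        int u = (int)floor(sqrt((double)(NR + 1))) - 1;
        if (u < 1) u = 1;
        *MU = u;
        int nu = u;
        /* Grow NU while the constraint still holds. */
        while (u*(nu + 1) + u + (nu + 1) <= NR) nu++;
        *NU = nu;
    }

    int main(void) {
        int mu, nu;
        model_mu_nu(32, &mu, &nu);   /* e.g. a 32-register FP file */
        printf("MU=%d NU=%d (uses %d registers)\n", mu, nu, mu*nu + mu + nu);
        return 0;
    }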

  27. Largest NB for no capacity/conflict misses • Tiles are copied into contiguous memory • Condition for cold misses only: 3*NB² ≤ L1Size [Diagram: the NBxNB tiles of A, B, and C with loop indices i, j, k]

  28. Largest NB for no capacity misses • MMM:
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];
• Cache model: • Fully associative • Line size 1 word • Optimal replacement • Bottom line: N² + N + 1 ≤ C • One full matrix • One row / column • One element

  29. Extending the Model • Line size B > 1 word • Spatial locality • Array layout in memory matters • Bottom line: counting cache lines instead of elements, depending on loop order either ⌈N²/B⌉ + ⌈N/B⌉ + 1 ≤ C/B (the reused row/column is contiguous in memory) or ⌈N²/B⌉ + N + 1 ≤ C/B (it is strided, one line per element)

  30. Extending the Model (cont.) • LRU (not optimal replacement) • MMM sample:
    for (int j = 0; j < N; j++)
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];
• Bottom line: under LRU, the lines touched between two uses of the same data must also stay resident, so the cache must hold extra rows/columns beyond the optimal-replacement bound and the largest safe N shrinks further

  31. Summary: Modeling for Tile Size (NB) • Models of increasing complexity • 3*NB² ≤ C • Whole working set fits in L1 • NB² + NB + 1 ≤ C • Fully associative • Optimal replacement • Line size: 1 word • ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B or ⌈NB²/B⌉ + NB + 1 ≤ C/B • Line size B > 1 word (depending on loop order and layout) • LRU replacement • Condition tightens further (extra lines stay live between reuses)
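
The first two models are easy to turn into code. This sketch computes the largest NB each condition admits for the Pentium III L1 from slide 12 (16 KB with 8-byte elements, so C = 2048 elements); the line-size and LRU refinements are omitted.

    #include <stdio.h>

    /* Model 1: whole working set fits, 3*NB^2 <= C. */
    int largest_nb_working_set(int C) {
        int nb = 1;
        while (3*(nb + 1)*(nb + 1) <= C) nb++;
        return nb;
    }

    /* Model 2: one matrix + one row/column + one element,
       NB^2 + NB + 1 <= C. */
    int largest_nb_optimal(int C) {
        int nb = 1;
        while ((nb + 1)*(nb + 1) + (nb + 1) + 1 <= C) nb++;
        return nb;
    }

    int main(void) {
        int C = 2048;  /* 16 KB L1 / 8-byte elements */
        printf("3*NB^2 <= C    -> NB = %d\n", largest_nb_working_set(C)); /* 26 */
        printf("NB^2+NB+1 <= C -> NB = %d\n", largest_nb_optimal(C));     /* 44 */
        return 0;
    }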

  32. Comments • Lot of work in compiler literature on automatic tile size selection • Not much is known about how well these algorithms do in practice • Few comparisons to BLAS • Not obvious how one generalizes our models to more complex codes • Insight needed: how sensitive is performance to tile size?

  33. Experiments • Architectures: • SGI R12000, 270MHz • Sun UltraSPARC III, 900MHz • Intel Pentium III, 550MHz • Measure • Mini-MMM performance • Complete MMM performance • Sensitivity of performance to parameter variations

  34. Installation Time of ATLAS vs. Model

  35. MiniMMM Performance • SGI • ATLAS: 457 MFLOPS • Model: 453 MFLOPS • Difference: 1% • Sun • ATLAS: 1287 MFLOPS • Model: 1052 MFLOPS • Difference: 20% • Intel • ATLAS: 394 MFLOPS • Model: 384 MFLOPS • Difference: 2%

  36. MMM Performance [Plots for SGI, Sun, and Intel: complete MMM performance of the BLAS, ATLAS, MODEL, and F77 versions]

  37. Optimization Parameter Values

              NB    MU/NU/KU   F/I/N-Fetch   Latency
   ATLAS
     SGI      64    4/4/64     0/5/1         3
     Sun      48    5/3/48     0/3/5         5
     Intel    40    2/1/40     0/3/1         4
   Model
     SGI      62    4/4/62     1/2/2         6
     Sun      88    4/4/78     1/2/2         4
     Intel    42    2/1/42     1/2/2         3

  38. Sensitivity to NB and Latency: Sun [Plots: performance vs. tile size (NB) and vs. Latency, with the BEST, ATLAS, and MODEL choices marked]

  39. Sensitivity to NB: SGI [Plot: performance vs. NB with the MODEL, ATLAS, and BEST choices marked, annotated with the model bounds 3*NB² ≤ C₂ and NB² + NB + 1 ≤ C₂ applied to the L2 capacity C₂]

  40. Sensitivity to NB: Intel

  41. Sensitivity to MU,NU: SGI

  42. Sensitivity to MU,NU: Sun

  43. Sensitivity to MU,NU: Intel

  44. Shape of register tile matters

  45. Sensitivity to KU

  46. Conclusions • Search is not as important as one might think • Compilers can achieve near-ATLAS performance if they • implement well known transformations • use models to choose parameter values • There is room for improvement in both models and empirical search • Both are 20-25% slower than BLAS • Higher levels of memory hierarchy cannot be neglected

  47. Future Directions • Study hand-written BLAS codes to understand performance gap • Repeat study with FFTW/SPIRAL • Uses search to choose between algorithms • Combine models with search • Use models to speed up empirical search • Use empirical studies to enhance models • Feed insights back into compilers • How do we make it easier for compiler writers to implement transformations? • Use insights to simplify memory system

  48. Information • URL: http://iss.cs.cornell.edu • Email: pingali@cs.cornell.edu

  49. Sensitivity to Latency: Intel
