Autotuning sparse matrix kernels
Richard Vuduc
Center for Applied Scientific Computing (CASC), Lawrence Livermore National Laboratory
February 28, 2007
Predictions (2003) • Need for “autotuning” will increase over time • Improve performance for given app & machine using automated experiments • Example: Sparse matrix-vector multiply (SpMV), 1987 to present • Untuned: 10% of peak or less, decreasing • Tuned: 2x speedup, increasing over time • Tuning is getting harder (qualitative) • More complex machines & workloads • Parallelism
Is tuning getting easier?

// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k = ptr[i] to ptr[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t

• Exploit 8x8 dense blocks (register blocking; see the sketch below)
• As the r x c block size varies, so does performance (Mflop/s)
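For concreteness, here is a minimal C sketch of register-blocked (BCSR) SpMV, fixed at 2x2 blocks; the array names (brow_ptr, bcol_ind, bval) are illustrative rather than any library's actual format, and the row count is assumed to be a multiple of 2. Each stored block contributes a small dense mat-vec whose accumulators stay in registers, which is where the speedups on the next slides come from.

/* 2x2 register-blocked (BCSR) SpMV: y <- y + A*x.
 * Illustrative sketch; assumes the matrix has 2*num_brows rows and
 * stores one dense 2x2 block (row-major) per bcol_ind entry. */
void bcsr_spmv_2x2(int num_brows, const int* brow_ptr,
                   const int* bcol_ind, const double* bval,
                   const double* x, double* y)
{
    for (int I = 0; I < num_brows; ++I) {        /* block row I */
        double y0 = y[2*I], y1 = y[2*I + 1];     /* register accumulators */
        for (int k = brow_ptr[I]; k < brow_ptr[I+1]; ++k) {
            const double* b  = bval + 4*k;       /* the 2x2 block */
            const double* xp = x + 2*bcol_ind[k];
            y0 += b[0]*xp[0] + b[1]*xp[1];
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}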
Speedups on Itanium 2: The need for search
[Figure: SpMV performance (Mflop/s) across r x c block sizes. Reference (1x1): 7.6% of peak; best (4x2): 31.1% of peak.]
Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz
* Reference improves *
* Best possible worsens slightly *
Better, worse, or about the same? Pentium M → Core 2 Duo (1 core)
* Reference & best improve; relative speedup improves (~1.4x → 1.6x) *
* Note: best fraction of peak decreased from 11% to 9.6% *
Better, worse, or about the same? Power4 → Power5
* Reference worsens! *
* Relative importance of tuning increases *
A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)
Outline • Motivation • OSKI: An autotuned sparse kernel library • Application-specific optimization “in the wild” • Toward end-to-end application autotuning • Summary and future work
OSKI: Optimized Sparse Kernel Interface • Autotuned kernels for the user's matrix & machine • BLAS-style interface: mat-vec (SpMV), triangular solve (TrSV), … • Hides complexity of run-time tuning • Includes fast locality-aware kernels: ATA*x, … • Faster than standard implementations • Standard SpMV < 10% of peak, vs. up to 31% with OSKI • Up to 4x faster SpMV, 1.8x TrSV, 4x ATA*x, … • For "advanced" users & solver library writers • PETSc extension available (OSKI-PETSc) • Kokkos (for Trilinos) by Heroux • Adopted by ClearShape, Inc. for shipping product (2x speedup)
Tunable matrix-specific optimization techniques • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 3x over CSR • Multiple vectors (SpMM): 7x over CSR (see the sketch below) • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • AAT*x, ATA*x: 4x over CSR, 1.8x over RB • Ak*x (matrix powers): 2x over CSR, 1.5x over RB
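The SpMM speedup comes from amortizing matrix traffic across right-hand sides. A minimal CSR-based C sketch of the idea follows; the function name and the row-major multivector layout are assumptions for illustration, not OSKI's tuned code.

/* CSR SpMM: Y <- Y + A*X for k right-hand sides.
 * A is m x n; X holds k vectors of length n and Y holds k vectors
 * of length m, both row-major: X[j*k + v] is entry j of vector v.
 * Each matrix entry val[p] is loaded once and applied to all k
 * vectors, amortizing the dominant cost: streaming the matrix. */
void csr_spmm(int m, int k, const int* ptr, const int* ind,
              const double* val, const double* X, double* Y)
{
    for (int i = 0; i < m; ++i) {
        double* yrow = Y + i*k;
        for (int p = ptr[i]; p < ptr[i+1]; ++p) {
            const double a = val[p];              /* read once */
            const double* xrow = X + ind[p]*k;
            for (int v = 0; v < k; ++v)           /* reused k times */
                yrow[v] += a * xrow[v];
        }
    }
}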
Tuning for workloads • Bi-conjugate gradients: equal mix of A*x and AT*y • 3x1: Ax = 1053 Mflop/s, ATy = 343 Mflop/s → 517 Mflop/s effective • 3x3: Ax = 806 Mflop/s, ATy = 826 Mflop/s → 816 Mflop/s effective • Higher-level (Ax, ATy) kernel • 3x1: 757 Mflop/s • 3x3: 1400 Mflop/s • Matrix powers (Ak*x) with data structure transformations • A2*x: up to 2x faster • New latency-tolerant solvers? (Hoemmen's thesis, on-going at UCB)
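The effective rates above are consistent with the harmonic mean of the two kernels' rates, the natural combined figure when equal flops go to each kernel (an inference from "equal mix", but the arithmetic checks out):

R_{\text{eff}} = \frac{2}{1/R_{Ax} + 1/R_{A^T y}}:
\qquad \frac{2}{1/1053 + 1/343} \approx 517,
\qquad \frac{2}{1/806 + 1/826} \approx 816 \ \text{Mflop/s}.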
How OSKI tunes (overview)
• Install-time (offline, in the library): 1. Build for the target architecture. 2. Benchmark the generated code variants → benchmark data.
• Run-time (in the application): 1. Evaluate heuristic models, combining the benchmark data with the user's matrix and a workload gathered from program monitoring and history. 2. Select a data structure & code; the user receives a matrix handle for kernel calls.
• Extensibility: advanced users may write & dynamically add "code variants" and "heuristic models" to the system.
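As a concrete picture of the resulting interface, here is a hedged sketch of an OSKI calling sequence (create a handle, declare the expected workload, tune, then call kernels), adapted from the OSKI user's guide as best recalled; exact signatures, constants, and header paths may differ by version.

/* Hedged sketch of OSKI usage; see the OSKI user's guide for the
 * authoritative API. Computes y <- y + A*x repeatedly. */
#include <oski/oski.h>

void spmv_loop(int m, int n, int* Aptr, int* Aind, double* Aval,
               double* x, double* y, int num_iters)
{
    oski_Init();

    /* Wrap the caller's CSR arrays in a tunable matrix handle. */
    oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, m, n,
                                        SHARE_INPUTMAT, 1,
                                        INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, m, STRIDE_UNIT);

    /* Declare the expected workload, then let OSKI's heuristic
     * models pick a data structure and code variant. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 1.0, yv, num_iters);
    oski_TuneMat(A);

    for (int it = 0; it < num_iters; ++it)
        oski_MatMult(A, OP_NORMAL, 1.0, xv, 1.0, yv);

    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_DestroyMat(A);
    oski_Close();
}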
Examples of OSKI’s early impact • Early adopter: ClearShape, Inc. • Core product: lithography simulator • 2x speedup on full simulation after using OSKI • Proof-of-concept: SLAC T3P accelerator cavity design simulator • SpMV dominates execution time • Symmetry, 2x2 block structure • 2x speedups
Strengths and limitations of the library approach • Strengths • Isolates optimization in the library for portable performance • Exploits domain-specific information aggressively • Handles run-time tuning naturally • Limitations • “Generation Me”: What about my application and its abstractions? • Run-time tuning: run-time overheads • Limited context for optimization (without delayed evaluation) • Limited extensibility (fixed interfaces)
Outline • Motivation • OSKI: An autotuned sparse kernel library • Application-specific optimization “in the wild” • Toward end-to-end application autotuning • Summary and future work
Tour of application-specific optimizations • Five case studies • Common characteristics • Complex code • Heavy use of abstraction • Use generated code (e.g., SWIG C++/Python bindings) • Benefit from extensive code and data restructuring • Multiple bottlenecks
[1] Loop transformations for SMG2000 • SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark) • Residual computation has an SpMV bottleneck • The loop below looks simple but is non-trivial to extract

for (si = 0; si < NS; ++si)
  for (k = 0; k < NZ; ++k)
    for (j = 0; j < NY; ++j)
      for (i = 0; i < NX; ++i)
        r[i + j*JR + k*KR] -=
            A[i + j*JA + k*KA + SA[si]] *
            x[i + j*JX + k*KX + Sx[si]];
[1] Before transformation

for (si = 0; si < NS; si++) { /* Loop1 */
  for (kk = 0; kk < NZ; kk++) { /* Loop2 */
    for (jj = 0; jj < NY; jj++) { /* Loop3 */
      for (ii = 0; ii < NX; ii++) { /* Loop4 */
        r[ii + jj*Jr + kk*Kr] -=
            A[ii + jj*JA + kk*KA + SA[si]] *
            x[ii + jj*Jx + kk*Kx + Sx[si]];
      } /* Loop4 */
    } /* Loop3 */
  } /* Loop2 */
} /* Loop1 */
[1] After transformation, including interchange, unrolling, and prefetching

for (kk = 0; kk < NZ; kk++) { /* Loop2 */
  for (jj = 0; jj < NY; jj++) { /* Loop3 */
    for (si = 0; si < NS; si++) { /* Loop1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) { /* core Loop4 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core Loop4 */
      for ( ; ii < NX; ii++) { /* fringe Loop4 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe Loop4 */
    } /* Loop1 */
  } /* Loop3 */
} /* Loop2 */
[1] Loop transformations for SMG2000 • 2x speedup on the kernel from specialization, loop interchange, unrolling, and prefetching • But only 1.25x overall: multiple bottlenecks • Lesson: Need complex sequences of transformations • Use profiling to guide • Inspect run-time data for specialization • Transformations are automatable • Research topic: Automated specialization of hypre?
[2] Slicing and dicing W3P • Accelerator design code from SLAC • calcBasis() very expensive • Scaling problems as |Eigensystem| grows • In principle, loop interchange or precomputation via slicing is possible • Challenges in practice: • "Loop nest" ~ 500+ LOC, with 150+ LOC to reach calcBasis() • calcBasis() sits in a 6-deep call chain, 4-deep loop nest, and 2 conditionals • File I/O • Changes must be unobtrusive

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    b = calcBasis (elem)
    f = calcField (b, mode)
writeDataToFiles (…);
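Since calcBasis() takes only the element, not the mode, the precomputation the slide suggests would hoist it out of the mode loop, trading O(|Mesh|) extra storage for |Eigensystem|-fold fewer calls. A hypothetical C-style sketch; basis[], mesh[], modes[], and the loop bounds are illustrative names, not W3P's actual code:

/* Hypothetical hoisting sketch, assuming calcBasis(elem) is
 * independent of the mode. */
for (e = 0; e < numElems; e++)
    basis[e] = calcBasis (mesh[e]);   /* computed once per element */

for (m = 0; m < numModes; m++)
    for (e = 0; e < numElems; e++)
        field[m][e] = calcField (basis[e], modes[m]);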
[2] W3P: Impact and lessons • 4-5x speedup for post-processing step; 1.5x overall • Changes “checked-in” • Lesson: Need clean source-level transformations • To automate, need robust program analysis and developer guidance • Research: Annotation framework for developers • [w/ Quinlan, Schordan, Yi: POHLL’06]
[3] Structure splitting • Convert an array of structs into a struct of arrays • Improve spatial locality through increased stride-1 accesses • Make code hardware-prefetch and vector/SIMD unit "friendly"

// Before: array of structs
struct Type {
  double p;
  double x, y, z;
  double E;
  int k;
} X[N], Y[N];

for (i = 0; i < N; i++)
  Y[i].E += sqrt (Y[X[i].k].p);

// After: struct of arrays
double Xp[N];
double Xx[N], Xy[N], Xz[N];
double XE[N];
int Xk[N];
// … same for Y …

for (i = 0; i < N; i++)
  YE[i] += sqrt (Yp[Xk[i]]);
[3] Structure splitting: Impact and challenges • 2x speedup on a KULL benchmark (suggested by Brian Miller) • Implementation challenges • Potentially affects entire code • Can apply only locally, at a cost • Extra storage • Overhead of copying • Tedious to do by hand • Lesson: Extensive data restructuring may be necessary • Research: When and how best to split?
[4] Finding a loop-fusion needle in a haystack • Interprocedural loop-fusion finder [w/ B. White, Cornell U.] • Known example had 2x speedup on a benchmark (Miller) • Built "abstraction-aware" analyzer using ROSE • First pass: associate "loop signatures" with each function • Second pass: propagate signatures through call chains

for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
  for (Corner::iterator c = (*z).corners().begin (); …)
    for (int s = 0; s < c->sides().size(); s++)
      …
[4] Finding a loop-fusion needle in a haystack • Found 6 examples of 3- and 4-deep nested loops • “Analysis-only” tool • Finds, though does not verify/transform • Lesson: “Classical” optimizations relevant to abstraction use • Research • Recognizing and optimizing abstractions [White’s thesis, on-going] • Extending traditional optimizations to abstraction use
[5] Aggregating messages (on-going) • Idea: merge sends (suggested by Miller)

// Before: two separate collective calls
DataType A;
// … operations on A …
A.allToAll();
// …
DataType B;
// … operations on B …
B.allToAll();

// After: one aggregated call
DataType A;
// … operations on A …
// …
DataType B;
// … operations on B …
bulkAllToAll(A, B);

• Implementing a fully automated translator to find and transform • Research: When and how best to aggregate?
Summary of application-specific optimizations • Like library-based approach, exploit knowledge for big gains • Guidance from developer • Use run-time information • Would benefit from automated transformation tools • Real code is hard to process • Changes may become part of software re-engineering • Need robust analysis and transformation infrastructure • Range of tools possible: analysis and/or transformation • No silver bullets or magic compilers
Outline • Motivation • OSKI: An autotuned sparse kernel library • “Real world” optimization • Toward end-to-end application autotuning • Summary and future work