Efficient Sparse Matrix-Vector Multiplication Optimization Techniques

Explore optimization techniques for improving the performance of sparse matrix-vector multiplication, including register blocking and cache blocking methods. Learn about the motivation, implementation, and impact of optimized libraries for various applications like iterative solvers and simulation areas.

Presentation Transcript


  1. Optimizing the Performance of Sparse Matrix-Vector Multiplication • Eun-Jin Im, U.C. Berkeley

  2. Overview • Motivation • Optimization techniques • Register Blocking • Cache Blocking • Multiple Vectors • Sparsity system • Related Work • Contribution • Conclusion

  3. Motivation : Usage • Sparse matrix-vector multiplication is used in: • Iterative solvers • Explicit methods • Eigenvalue and singular value problems • Applications in structure modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas

  4. Motivation : Performance (1) • Matrix-vector multiplication (BLAS2) is slower than matrix-matrix multiplication (BLAS3) • For example, on a 167 MHz UltraSPARC I: • Vendor-optimized matrix-vector multiplication: 57 Mflops • Vendor-optimized matrix-matrix multiplication: 185 Mflops • The reason: a lower ratio of floating-point operations to memory operations

  5. Motivation : Performance (2) • Sparse matrix operations are slower than dense matrix operations • For example, on a 167 MHz UltraSPARC I: • Dense matrix-vector multiplication: naïve implementation 38 Mflops, vendor-optimized implementation 57 Mflops • Sparse matrix-vector multiplication (naïve implementation): 5.7 - 25 Mflops • The reason: indirect data structures lead to inefficient memory accesses

  6. Motivation : Optimized libraries • Old approach: hand-optimized libraries • Vendor-supplied BLAS, LAPACK • New approach: automatic generation of libraries • PHiPAC (dense linear algebra) • ATLAS (dense linear algebra) • FFTW (fast Fourier transform) • Our approach: automatic generation of libraries for sparse matrices • Additional dimension: the nonzero structure of sparse matrices

  7. Sparse Matrix Formats • There are a large number of sparse matrix formats • Point-entry: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), … • Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), …

  8. Compressed Sparse Row Format • We use the CSR format internally because it is a relatively efficient format
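To make the CSR layout concrete, here is a minimal sketch of a CSR matrix-vector product in plain Python. The names `csr_matvec`, `row_ptr`, `col_idx`, and `values` and the small 3 x 3 matrix are illustrative assumptions, not taken from the slides. The indirect access `x[col_idx[k]]` is the kind of indirection that slide 5 blames for inefficient memory accesses.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Return y = A * x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the entries of row i inside
    col_idx (column of each nonzero) and values (the nonzero itself).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: the column index is loaded first,
            # then used to fetch the source vector element.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3 x 3 example matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

y = csr_matvec(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```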

  9. Optimization Techniques • Register Blocking • Cache Blocking • Multiple Vectors

  10. Register Blocking • Blocked Compressed Sparse Row format • Advantages of the format: • Better temporal locality in registers • The multiplication loop can be unrolled for better performance (figure: 2 x 2 blocked example with row pointers 0 2 4, column indices 0 4 2 4, and block values A00 A01 A10 A11 | A04 0 0 A15 | A22 0 A32 A33 | A25 0 A34 A35)
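As a rough illustration of the two advantages listed on this slide, the sketch below multiplies a matrix stored in 2 x 2 blocks. It assumes each block is stored row-major and that a block's column index is the index of its first column, which is consistent with the arrays 0 2 4 and 0 4 2 4 on the slide. The function name and the numeric block values are hypothetical stand-ins for the symbolic A00, A01, ... entries.

```python
def bcsr_2x2_matvec(brow_ptr, bcol_idx, bvalues, x):
    """Return y = A * x for A stored in 2 x 2 blocked CSR form.

    Assumed layout: block k occupies bvalues[4*k : 4*k + 4] in
    row-major order, and bcol_idx[k] is the block's first column.
    """
    n_brows = len(brow_ptr) - 1
    y = [0.0] * (2 * n_brows)
    for ib in range(n_brows):
        # Two accumulators stand in for registers holding y values.
        y0 = 0.0
        y1 = 0.0
        for k in range(brow_ptr[ib], brow_ptr[ib + 1]):
            j = bcol_idx[k]
            # One index lookup serves four matrix entries, and each
            # x element is reused for both rows of the block.
            x0 = x[j]
            x1 = x[j + 1]
            a = 4 * k
            # Fully unrolled 2 x 2 block multiply.
            y0 += bvalues[a] * x0 + bvalues[a + 1] * x1
            y1 += bvalues[a + 2] * x0 + bvalues[a + 3] * x1
        y[2 * ib] = y0
        y[2 * ib + 1] = y1
    return y


# Index arrays as on the slide; numeric values are placeholders.
brow_ptr = [0, 2, 4]
bcol_idx = [0, 4, 2, 4]
bvalues = [1.0, 2.0, 3.0, 4.0,      # block at rows 0-1, cols 0-1
           5.0, 0.0, 0.0, 6.0,      # block at rows 0-1, cols 4-5
           7.0, 0.0, 8.0, 9.0,      # block at rows 2-3, cols 2-3
           10.0, 0.0, 11.0, 12.0]   # block at rows 2-3, cols 4-5

y = bcsr_2x2_matvec(brow_ptr, bcol_idx, bvalues, [1.0] * 6)
```

The explicit zeros inside the blocks are multiplied like any other entry, which is the cost that the fill overhead on the next slide measures.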

  11. Register Blocking : Fill Overhead • We use a uniform block size, which adds fill overhead (explicitly stored zeros). In the example, fill overhead = 12/7 = 1.71 • This increases both the storage and the number of floating-point operations
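The ratio on this slide can be reproduced with a short sketch, assuming fill overhead means stored entries (including explicit zeros) divided by true nonzeros, and assuming blocks are aligned to multiples of the block size. The slide's own example matrix is not in the transcript, so the 7-nonzero pattern below is a hypothetical one chosen to need three 2 x 2 blocks.

```python
def fill_overhead(nonzeros, r, c):
    """Ratio of stored entries to true nonzeros for aligned r x c blocks.

    nonzeros is an iterable of (row, col) positions. Every block that
    contains at least one nonzero is stored in full (r * c entries).
    """
    nz = set(nonzeros)
    blocks = {(i // r, j // c) for (i, j) in nz}
    return len(blocks) * r * c / len(nz)


# Hypothetical pattern: 7 nonzeros spread over three 2 x 2 blocks.
pattern = [(0, 0), (0, 1), (1, 1),   # block (0, 0)
           (0, 2), (1, 3),           # block (0, 1)
           (2, 2), (3, 3)]           # block (1, 1)

overhead = fill_overhead(pattern, 2, 2)   # 12 stored / 7 nonzeros
```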

  12. Register Blocking • Dense Matrix profile on an UltraSPARC I (input to the performance model)

  13. Register Blocking : Selecting the block size • The hard part of the problem is picking a block size that: • Minimizes the fill overhead • Maximizes the raw performance • Two approaches: • Exhaustive search • Using a model

  14. Register Blocking : Performance model • Two components to the performance model: • Multiplication performance of a dense matrix represented in sparse format • Estimated fill overhead • Predicted performance for block size r x c = (dense r x c blocked performance) / (fill overhead)
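The model above can be sketched as a small search over candidate block sizes. This is an illustration under stated assumptions, not the Sparsity implementation: the Mflops figures in `dense_profile` are made up, the helper names are hypothetical, and the fill overhead is computed exactly here, whereas the slide speaks of an estimated fill overhead.

```python
def fill_overhead(nonzeros, r, c):
    """Stored entries (with explicit zeros) divided by true nonzeros."""
    nz = set(nonzeros)
    blocks = {(i // r, j // c) for (i, j) in nz}
    return len(blocks) * r * c / len(nz)


def choose_block_size(dense_profile, nonzeros):
    """Pick the block size with the highest predicted performance.

    dense_profile maps (r, c) to the measured Mflops of a dense matrix
    stored in r x c blocked sparse format on the target machine.
    Returns ((r, c), predicted_mflops).
    """
    best = None
    for (r, c), dense_mflops in dense_profile.items():
        predicted = dense_mflops / fill_overhead(nonzeros, r, c)
        if best is None or predicted > best[1]:
            best = ((r, c), predicted)
    return best


# Made-up machine profile: larger blocks run faster on dense data.
dense_profile = {(1, 1): 20.0, (2, 2): 30.0, (4, 4): 40.0}

# Hypothetical 4 x 4 matrix made of two full 2 x 2 diagonal blocks.
pattern = [(0, 0), (0, 1), (1, 0), (1, 1),
           (2, 2), (2, 3), (3, 2), (3, 3)]

best_size, best_mflops = choose_block_size(dense_profile, pattern)
```

In this example the 4 x 4 block has the best raw dense performance but a fill overhead of 2, so the model prefers 2 x 2, which fits the matrix structure with no fill.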

  15. Benchmark matrices • Matrix 1: Dense matrix (1000 x 1000) • Matrices 2-17 : Finite Element Method matrices • Matrices 18-39 : matrices from Structural Engineering, Device Simulation • Matrices 40-44 : Linear Programming matrices • Matrix 45 : document retrieval matrix used for Latent Semantic Indexing • Matrix 46 : random matrix (10000 x 10000, 0.15%)

  16. Register Blocking : Performance • The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).

  17. Register Blocking : Performance • Speedup is generally best on the MIPS R10000, which is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).

  18. Register Blocking : Validation of Performance Model • Comparison with the performance of exhaustive search (yellow bars, block sizes in lower row) on a subset of the benchmark matrices • Exhaustive search does not produce much better results.

  19. Register Blocking : Overhead • Pre-computation overhead: • Estimating the fill overhead (red bars) • Reorganizing the matrix (yellow bars) • The ratio is the number of repetitions needed for the optimization to pay off.

  20. Cache Blocking • Temporal locality of access to the source vector (figure: source vector x and destination vector y in memory)
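One way to picture cache blocking is to split the matrix into rectangular blocks and finish each block before moving on, so that only a limited range of x is touched at a time. The sketch below assumes each cache block is kept as its own small CSR matrix. The function names and the block sizes `rb` and `cb` are illustrative, and plain Python will not show the cache effect itself; the point is the order of accesses.

```python
def split_into_cache_blocks(row_ptr, col_idx, values, n_cols, rb, cb):
    """Split a CSR matrix into rb x cb cache blocks.

    Each non-empty block is returned as (first_row, ptr, cols, vals),
    a small CSR matrix that keeps the original column indices.
    """
    n_rows = len(row_ptr) - 1
    blocks = []
    for r0 in range(0, n_rows, rb):
        r1 = min(r0 + rb, n_rows)
        for c0 in range(0, n_cols, cb):
            c1 = c0 + cb
            b_ptr, b_col, b_val = [0], [], []
            for i in range(r0, r1):
                for k in range(row_ptr[i], row_ptr[i + 1]):
                    if c0 <= col_idx[k] < c1:
                        b_col.append(col_idx[k])
                        b_val.append(values[k])
                b_ptr.append(len(b_col))
            if b_col:
                blocks.append((r0, b_ptr, b_col, b_val))
    return blocks


def cache_blocked_matvec(blocks, n_rows, x):
    """Return y = A * x, processing one cache block at a time."""
    y = [0.0] * n_rows
    for r0, b_ptr, b_col, b_val in blocks:
        # Within a block, only cb consecutive entries of x are read.
        for i in range(len(b_ptr) - 1):
            acc = 0.0
            for k in range(b_ptr[i], b_ptr[i + 1]):
                acc += b_val[k] * x[b_col[k]]
            y[r0 + i] += acc
    return y


# Hypothetical 3 x 3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

blocks = split_into_cache_blocks(row_ptr, col_idx, values, 3, 2, 2)
y = cache_blocked_matvec(blocks, 3, [1.0, 2.0, 3.0])
```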

  21. Cache Blocking : Performance • Speedup is generally better on MIPS, which has a larger cache and a larger miss penalty (26/589 ns for MIPS, 36/268 ns for UltraSPARC) • Exceptions: the document retrieval matrix and the random matrix.

  22. Cache Blocking : Performance on document retrieval matrix • Document retrieval matrix: 10K x 256K, 37M nonzeros; SVD is applied for LSI (Latent Semantic Indexing) • The nonzero elements are spread across the matrix, with no dense clusters. • Performance peaks at a 16K x 16K cache block, with a speedup of 3.1

  23. Cache Blocking : When and how to use cache blocking • In our experiments, the matrices for which cache blocking is most effective are large and “random”. • We developed a measure of the “randomness” of a matrix. • We perform a coarse-grained search to choose the cache block size.

  24. Combination of Register and Cache blocking : UltraSPARC • The combination is rarely beneficial and is often slower than either of the two optimizations alone.

  25. Combination of Register and Cache blocking : MIPS

  26. Multiple Vector Multiplication • Better opportunity for optimization: BLAS2 vs. BLAS3 (figure: repetition of the single-vector case vs. the multiple-vector case)
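The difference between repeating the single-vector case and handling several vectors at once can be sketched as follows. This is a minimal illustration with hypothetical names: `X` holds k source vectors side by side (one row per matrix column), so each matrix entry is loaded once and reused for all k vectors instead of being reloaded k times.

```python
def csr_matmat(row_ptr, col_idx, values, X):
    """Return Y = A * X for a CSR matrix A and k vectors stored in X.

    X[j] is a list of the k values at position j of each source vector;
    Y has the same layout for the destination vectors.
    """
    n_rows = len(row_ptr) - 1
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        y_row = Y[i]
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = values[p]            # matrix entry loaded once...
            x_row = X[col_idx[p]]
            for v in range(k):       # ...and reused for all k vectors
                y_row[v] += a * x_row[v]
    return Y


# Hypothetical 3 x 3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

# Two source vectors: (1, 1, 1) and (1, 2, 3).
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = csr_matmat(row_ptr, col_idx, values, X)
```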

  27. Multiple Vector Multiplication : Performances • Register blocking performance • Cache blocking performance

  28. Multiple Vector Multiplication : Register Blocking Performance • The speedup is larger than with single-vector register blocking. • Performance improves even for the matrices that previously showed no speedup (middle group on UltraSPARC).

  29. Multiple Vector Multiplication : Cache Blocking Performance (panels: MIPS, UltraSPARC) • Noticeable speedup for the matrices that previously showed no speedup (UltraSPARC) • Block sizes are much smaller than those of single-vector cache blocking.

  30. Sparsity System : Purpose • Guides the choice of optimization • Automatically selects optimization parameters such as the block size and the number of vectors • http://comix.cs.berkeley.edu/~ejim/sparsity

  31. Sparsity System : Organization (diagram components: Example matrix; Sparsity Machine Profiler; Machine Performance Profile; Sparsity Optimizer; Optimized code, drivers; Maximum number of vectors)

  32. Summary : Speedup of Sparsity on UltraSPARC • On UltraSPARC, up to 3x for a single vector and 4.7x for multiple vectors (panels: Single Vector, Multiple Vector)

  33. Summary : Speedup of Sparsity on MIPS • On MIPS, up to 3x for a single vector and 6x for multiple vectors (panels: Single Vector, Multiple Vector)

  34. Summary : Overhead of Sparsity Optimization • Number of iterations = (overhead time) / (time saved per iteration) • The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.
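The break-even count can be worked through with a small helper. This is a sketch under the assumption that "time saved" means the time saved per multiplication; the function name and the timings are hypothetical and not measurements from the slides.

```python
import math


def break_even_iterations(overhead_time, time_unoptimized, time_optimized):
    """Number of multiplications needed to recover the one-time overhead.

    Returns None when the optimized kernel is not faster, because the
    overhead is then never recovered.
    """
    saved = time_unoptimized - time_optimized
    if saved <= 0:
        return None
    return math.ceil(overhead_time / saved)


# Hypothetical timings in seconds: 3.0 s of pre-computation, and each
# multiplication drops from 0.5 s to 0.25 s after optimization.
n_iter = break_even_iterations(3.0, 0.5, 0.25)
```

With these numbers the optimization pays off once the multiplication is repeated 12 times, which is why knowing the expected repetition count in advance, as in the BLAS Technical Forum parameter, is useful.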

  35. Related Work (1) • Dense Matrix Optimization • Loop transformation by compilers: M. Wolf and others • Hand-optimized libraries: BLAS, LAPACK • Automatic Generation of Libraries • PHiPAC, ATLAS and FFTW • Sparse Matrix Standardization and Libraries • BLAS Technical Forum • NIST Sparse BLAS, MV++, SparseLib++, TNT • Hand Optimization of Sparse Matrix-Vector Multiplication • S. Toledo, Oliker et al.

  36. Related Work (2) • Sparse Matrix Packages • SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98 • Compiling Sparse Matrix Code • Sparse compiler (Bik), Bernoulli compiler (Kotlyar) • On-demand Code Generation • NIST Sparse BLAS, Sparse compiler

  37. Contribution • Thorough investigation of memory hierarchy optimization for sparse matrix-vector multiplication • Performance study on benchmark matrices • Development of a performance model to choose optimization parameters • Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication

  38. Conclusion • Memory hierarchy optimization for sparse matrix-vector multiplication • Register Blocking : matrices with dense local structure benefit • Cache Blocking : large matrices with random structure benefit • Multiple vector multiplication improves performance further because matrix elements are reused • The choice of optimization depends on both the matrix structure and the machine architecture. • The automated system helps with this complicated and time-consuming process.
