Explore optimization techniques for improving the performance of sparse matrix-vector multiplication, including register blocking and cache blocking. Learn about the motivation, implementation, and impact of these optimizations for applications such as iterative solvers, document retrieval, and other simulation areas.
Optimizing the Performance of Sparse Matrix-Vector Multiplication
Eun-Jin Im, U.C. Berkeley
Overview
• Motivation
• Optimization techniques
  • Register Blocking
  • Cache Blocking
  • Multiple Vectors
• Sparsity system
• Related Work
• Contribution
• Conclusion
Motivation: Usage
• Sparse matrix-vector multiplication
• Usage of this operation:
  • Iterative solvers
  • Explicit methods
  • Eigenvalue and singular value problems
• Applications in structural modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas
Motivation: Performance (1)
• Matrix-vector multiplication (BLAS 2) is slower than matrix-matrix multiplication (BLAS 3).
• For example, on a 167 MHz UltraSPARC I:
  • Vendor-optimized matrix-vector multiplication: 57 Mflops
  • Vendor-optimized matrix-matrix multiplication: 185 Mflops
• The reason: a lower ratio of floating point operations to memory operations
Motivation: Performance (2)
• Sparse matrix operations are slower than dense matrix operations.
• For example, on a 167 MHz UltraSPARC I:
  • Dense matrix-vector multiplication: 38 Mflops for a naïve implementation, 57 Mflops for the vendor-optimized implementation
  • Sparse matrix-vector multiplication (naïve implementation): 5.7-25 Mflops
• The reason: an indirect data structure, and thus inefficient memory accesses
Motivation: Optimized Libraries
• Old approach: hand-optimized libraries
  • Vendor-supplied BLAS, LAPACK
• New approach: automatic generation of libraries
  • PHiPAC (dense linear algebra)
  • ATLAS (dense linear algebra)
  • FFTW (fast Fourier transform)
• Our approach: automatic generation of libraries for sparse matrices
  • Additional dimension: the nonzero structure of sparse matrices
Sparse Matrix Formats
• There are a large number of sparse matrix formats.
  • Point-entry: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), ...
  • Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), ...
Compressed Sparse Row Format
• We internally use the CSR format because it is a relatively efficient format.
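To make the baseline concrete, here is a minimal sketch (in C) of the naïve CSR matrix-vector product y = Ax. The array names row_ptr, col_idx, and val are conventional CSR names, not taken from the slides.

    /* Naive CSR sparse matrix-vector product: y = A * x.
     * row_ptr has n_rows + 1 entries; the nonzeros of row i occupy
     * val[row_ptr[i] .. row_ptr[i+1]-1], with column indices in col_idx. */
    void spmv_csr(int n_rows,
                  const int *row_ptr, const int *col_idx, const double *val,
                  const double *x, double *y)
    {
        for (int i = 0; i < n_rows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }

The indirect access x[col_idx[k]] is what makes the naïve version slow: every multiply requires an index load as well as a value load.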
Optimization Techniques
• Register Blocking
• Cache Blocking
• Multiple Vectors
Register Blocking
[Figure: example matrix partitioned into 2x2 register blocks and stored in Blocked Compressed Sparse Row format, with block row pointers and block column indices]
• Blocked Compressed Sparse Row (BCSR) format
• Advantages of the format:
  • Better temporal locality in registers
  • The multiplication loop can be unrolled for better performance
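As an illustration of why the unrolling helps, here is a minimal sketch of a BCSR multiply, assuming 2x2 blocks; the array names brow_ptr, bcol_idx, and bval are hypothetical.

    /* y = A * x for A stored in 2x2 Blocked CSR. Each stored block holds
     * 4 contiguous values in row-major order. The inner 2x2 multiply is
     * fully unrolled, so the two partial sums and the two source values
     * stay in registers across the block row. */
    void spmv_bcsr_2x2(int n_brows,
                       const int *brow_ptr, const int *bcol_idx,
                       const double *bval, const double *x, double *y)
    {
        for (int ib = 0; ib < n_brows; ib++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
                const double *b  = &bval[4 * k];
                const double *xp = &x[2 * bcol_idx[k]];
                y0 += b[0] * xp[0] + b[1] * xp[1];
                y1 += b[2] * xp[0] + b[3] * xp[1];
            }
            y[2 * ib]     = y0;
            y[2 * ib + 1] = y1;
        }
    }

Note that only one column index is stored and loaded per 2x2 block, a quarter of what CSR would need for the same four values.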
Register Blocking: Fill Overhead
• We use a uniform block size, which requires storing some explicit zeros and thus adds fill overhead: here, 12 stored entries for 7 original nonzeros give a fill overhead of 12/7 ≈ 1.71.
• This increases both the storage and the number of floating point operations.
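Written as a formula, using the slide's example numbers (my reading of 12/7 as stored entries over true nonzeros):

    \text{fill overhead}
      = \frac{\text{stored entries (including explicit zeros)}}{\text{true nonzeros}}
      = \frac{12}{7} \approx 1.71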
Register Blocking
• Dense matrix profile on an UltraSPARC I (input to the performance model)
Register Blocking: Selecting the Block Size
• The hard part of the problem is picking a block size that:
  • Minimizes the fill overhead
  • Maximizes the raw performance
• Two approaches:
  • Exhaustive search
  • Using a performance model
Register Blocking: Performance Model
• Two components of the performance model:
  • Multiplication performance of a dense matrix represented in sparse format
  • Estimated fill overhead
• Predicted performance for an r x c block size:
  predicted performance = (dense r x c blocked performance) / (fill overhead)
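A minimal sketch of how such a model-based selection might look, given the machine profile and the estimated fills as 4x4 tables for block sizes r, c in 1..4; the function and parameter names are assumptions, not the system's actual interface.

    /* Model-based block-size selection (sketch).
     * dense_mflops[r-1][c-1]: measured Mflop/s of an r x c register-blocked
     *   multiply of a dense matrix stored in sparse format (machine profile).
     * fill[r-1][c-1]: estimated fill overhead of the target matrix when
     *   blocked with r x c blocks (stored entries / true nonzeros).
     * Picks the (r, c) that maximizes dense_mflops / fill. */
    void choose_block_size(double dense_mflops[4][4], double fill[4][4],
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        *best_r = *best_c = 1;
        for (int r = 1; r <= 4; r++) {
            for (int c = 1; c <= 4; c++) {
                double predicted = dense_mflops[r - 1][c - 1] / fill[r - 1][c - 1];
                if (predicted > best) {
                    best = predicted;
                    *best_r = r;
                    *best_c = c;
                }
            }
        }
    }

The profile is measured once per machine, so only the fill estimate has to be computed per matrix.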
Benchmark Matrices
• Matrix 1: dense matrix (1000 x 1000)
• Matrices 2-17: finite element method (FEM) matrices
• Matrices 18-39: matrices from structural engineering and device simulation
• Matrices 40-44: linear programming matrices
• Matrix 45: document retrieval matrix used for Latent Semantic Indexing
• Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)
Register Blocking: Performance
• The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).
Register Blocking: Performance
• Speedup is generally best on the MIPS R10000, where it is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).
Register Blocking: Validation of the Performance Model
• Comparison to the performance of exhaustive search (yellow bars, block sizes in the lower row) on a subset of the benchmark matrices
• The exhaustive search does not produce much better results.
Register Blocking: Overhead
• Pre-computation overhead:
  • Estimating the fill overhead (red bars)
  • Reorganizing the matrix (yellow bars)
• The ratio gives the number of repetitions of the multiplication needed before the optimization becomes beneficial.
Cache Blocking
• Improves temporal locality of accesses to the source vector
[Figure: the matrix in memory, partitioned into rectangular cache blocks; each block reuses a contiguous piece of the source vector x while accumulating into the destination vector y]
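A minimal sketch of the idea, assuming the matrix has been pre-split into rectangular sub-blocks, each stored in its own local CSR arrays; the CacheBlock layout is hypothetical.

    /* Cache-blocked SpMV (sketch): A is pre-split into rectangular blocks,
     * each in CSR form with indices local to the block, so only a
     * cache-sized window of x and y is touched while a block is processed.
     * y must be zero-initialized before the call. */
    typedef struct {
        int row0, col0;        /* top-left corner of this block within A */
        int n_rows;            /* number of rows in this block */
        const int *row_ptr;    /* local CSR arrays for the block */
        const int *col_idx;
        const double *val;
    } CacheBlock;

    void spmv_cache_blocked(const CacheBlock *blocks, int n_blocks,
                            const double *x, double *y)
    {
        for (int b = 0; b < n_blocks; b++) {
            const CacheBlock *blk = &blocks[b];
            const double *xb = &x[blk->col0];  /* window of x reused by every row */
            double *yb = &y[blk->row0];
            for (int i = 0; i < blk->n_rows; i++)
                for (int k = blk->row_ptr[i]; k < blk->row_ptr[i + 1]; k++)
                    yb[i] += blk->val[k] * xb[blk->col_idx[k]];
        }
    }

Because each block's column range fits in cache, the scattered reads of x within a block hit in cache instead of going to memory.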
Cache Blocking: Performance
• MIPS speedup is generally better: larger cache, larger miss penalty (26/589 ns for MIPS vs. 36/268 ns for the UltraSPARC).
• The exceptions are the document retrieval and random matrices.
Cache Blocking: Performance on the Document Retrieval Matrix
• Document retrieval matrix: 10K x 256K with 37M nonzeros; the SVD is applied to it for LSI (Latent Semantic Indexing)
• The nonzero elements are spread across the matrix, with no dense clusters.
• Performance peaks at a 16K x 16K cache block, with a speedup of 3.1.
Cache Blocking: When and How to Use Cache Blocking
• From the experiments, the matrices for which cache blocking is most effective are large and "random".
• We developed a measure of the "randomness" of a matrix.
• We perform a coarse-grained search to decide the cache block size.
Combination of Register and Cache Blocking: UltraSPARC
• The combination is rarely beneficial, and is often slower than either optimization alone.
Combination of Register and Cache Blocking: MIPS
Multiple Vector Multiplication
• Better opportunity for optimization: BLAS 2 vs. BLAS 3
• Instead of repeating the single-vector multiply once per vector, the multiple-vector case reuses each matrix element across all of the vectors.
[Figure: repetition of the single-vector case vs. the multiple-vector case]
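A minimal sketch of the multiple-vector multiply on top of CSR; the interleaved vector layout and the names are assumptions for illustration.

    /* Multiple-vector SpMV (sketch): Y = A * X for n_vec vectors.
     * X and Y store the vectors interleaved: X[j*n_vec + v] is element j
     * of vector v. Each matrix entry val[k] is loaded once and reused
     * n_vec times, raising the ratio of flops to memory operations. */
    void spmv_csr_multivec(int n_rows, int n_vec,
                           const int *row_ptr, const int *col_idx,
                           const double *val,
                           const double *X, double *Y)
    {
        for (int i = 0; i < n_rows; i++) {
            for (int v = 0; v < n_vec; v++)
                Y[i * n_vec + v] = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                double a = val[k];                 /* loaded once ... */
                const double *xr = &X[col_idx[k] * n_vec];
                for (int v = 0; v < n_vec; v++)    /* ... reused n_vec times */
                    Y[i * n_vec + v] += a * xr[v];
            }
        }
    }

This is the same shift that distinguishes BLAS 2 from BLAS 3: the matrix traffic is amortized over n_vec result vectors.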
Multiple Vector Multiplication: Performance
• Register blocking performance
• Cache blocking performance
Multiple Vector Multiplication: Register Blocking Performance
• The speedup is larger than for single-vector register blocking.
• Even the matrices that did not previously speed up improved (the middle group on the UltraSPARC).
Multiple Vector Multiplication: Cache Blocking Performance
[Figure: cache blocking performance on MIPS and UltraSPARC]
• Noticeable speedup for the matrices that did not previously speed up (UltraSPARC)
• The best block sizes are much smaller than those for single-vector cache blocking.
Sparsity System: Purpose
• Guide the choice of optimization
• Automatically select optimization parameters such as the block size and the number of vectors
• http://comix.cs.berkeley.edu/~ejim/sparsity
Sparsity System: Organization
[Diagram: the Sparsity machine profiler produces a machine performance profile; the Sparsity optimizer combines that profile with an example matrix and the maximum number of vectors to produce optimized code and drivers]
Summary: Speedup of Sparsity on UltraSPARC
• On the UltraSPARC: up to 3x for a single vector, 4.7x for multiple vectors
[Figure: single-vector and multiple-vector speedups]
Summary: Speedup of Sparsity on MIPS
• On MIPS: up to 3x for a single vector, 6x for multiple vectors
[Figure: single-vector and multiple-vector speedups]
Summary: Overhead of Sparsity Optimization
• The number of iterations needed to amortize the optimization = (overhead time) / (time saved per multiplication)
• The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.
Related Work (1)
• Dense matrix optimization
  • Loop transformations by compilers: M. Wolf, etc.
  • Hand-optimized libraries: BLAS, LAPACK
• Automatic generation of libraries
  • PHiPAC, ATLAS, and FFTW
• Sparse matrix standardization and libraries
  • BLAS Technical Forum
  • NIST Sparse BLAS, MV++, SparseLib++, TNT
• Hand optimization of sparse matrix-vector multiplication
  • S. Toledo; Oliker et al.
Related Work (2)
• Sparse matrix packages
  • SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98
• Compiling sparse matrix code
  • Sparse compiler (Bik), Bernoulli compiler (Kotlyar)
• On-demand code generation
  • NIST Sparse BLAS, sparse compiler
Contribution
• A thorough investigation of memory hierarchy optimizations for sparse matrix-vector multiplication
• A performance study on the benchmark matrices
• Development of a performance model for choosing optimization parameters
• The Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication
Conclusion
• Memory hierarchy optimizations for sparse matrix-vector multiplication:
  • Register blocking: matrices with dense local structure benefit
  • Cache blocking: large matrices with random structure benefit
  • Multiple-vector multiplication improves performance further through reuse of matrix elements
• The choice of optimization depends on both the matrix structure and the machine architecture.
• An automated system helps with this complicated and time-consuming process.