This presentation explores Sparse Matrix-Vector Multiply (SpM×V) algorithms, optimizations, and applications using Compressed Sparse Row and Register-Level Blocking to enhance performance on modern architectures like Itanium 2. Techniques such as Explicit Software Pipelining and Prefetching are discussed to improve efficiency, with insights on Parallel SpMV and Segmented Scan approaches for multi-core systems. The study evaluates performance metrics like Mflop/s and Fill Ratio to guide algorithm design for optimal results on advanced architectures.
Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures Ankit Jain, Vasily Volkov CS252 Final Presentation 5/9/2007 ankit@eecs.berkeley.edu volkov@eecs.berkeley.edu
SpM×V and its Applications • Sparse Matrix Vector Multiply (SpM×V): y ← y + A∙x • x, y are dense vectors • x: source vector • y: destination vector • A is a sparse matrix (<1% of entries are nonzero) • Applications employing SpM×V in the inner loop • Least Squares Problems • Eigenvalue Problems
Storing a Matrix in Memory • Compressed Sparse Row (CSR) Data Structure and Algorithm • type val : real[k]   type ind : int[k]   type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]]
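For concreteness, a minimal C rendering of the CSR kernel above, assuming 0-based indices with row i's nonzeros stored in val[ptr[i]] .. val[ptr[i+1]-1]; this is a direct transcription of the pseudocode, not the tuned code from the study.

/* Naive CSR SpMV: y <- y + A*x.
 * m: number of rows; ptr, ind, val: CSR arrays as declared above. */
void spmv_csr(int m, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++)                    /* foreach row i */
        for (int l = ptr[i]; l < ptr[i + 1]; l++)  /* each nonzero in row i */
            y[i] += val[l] * x[ind[l]];            /* indirect access to x */
}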
What’s so hard about it? • Reasons for Poor Performance of the Naïve Implementation • Poor locality (indirect and irregular memory accesses) • Limited by the speed of main memory • Poor instruction mix (low ratio of flops to memory operations) • The algorithm depends on the non-zero structure of the matrix, unlike the dense case • Dense matrices vs. sparse matrices
Register-Level Blocking (SPARSITY): 3x3 Example BCSR with uniform, aligned grid
Register-Level Blocking (SPARSITY): 3x3 Example Fill-in zeros: trade-off extra ops for better efficiency
Blocked Compressed Sparse Row • Inner loop performs floating point multiply-add on each non-zero in block instead of just one non-zero • Reduces the number of times the source vector x has to be brought back into memory • Reduces the number of indices that have to be stored and loaded
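As an illustration, here is a minimal sketch of a 2x2 BCSR kernel. The names (b_ptr for block-row pointers, b_ind for block column indices, b_val for 2x2 blocks stored row-major) are assumptions made for this sketch, not OSKI's actual layout.

/* 2x2 BCSR SpMV: y <- y + A*x. Each stored block contributes four
 * multiply-adds but needs only one column index load, and the two
 * source-vector entries x0, x1 are reused across both rows of the block. */
void spmv_bcsr_2x2(int mb,                          /* number of 2-row block rows */
                   const int *b_ptr, const int *b_ind,
                   const double *b_val, const double *x, double *y)
{
    for (int i = 0; i < mb; i++) {
        double y0 = y[2 * i], y1 = y[2 * i + 1];
        for (int b = b_ptr[i]; b < b_ptr[i + 1]; b++) {
            const double *v = &b_val[4 * b];        /* one 2x2 block */
            double x0 = x[b_ind[b]], x1 = x[b_ind[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * i]     = y0;
        y[2 * i + 1] = y1;
    }
}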
The Payoff: Speedups on Itanium 2 • (plot: Mflop/s of the best blocked version, 4x2, vs. the reference implementation)
Explicit Software Pipelining • ORIGINAL CODE: • type val : real[k]   type ind : int[k]   type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • SOFTWARE PIPELINED CODE: • type val : real[k]   type ind : int[k]   type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val_1 ∙ x_1 • val_1 ← val[l + 1] • x_1 ← x[ind_2] • ind_2 ← ind[l + 2]
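The same idea in C, as a minimal sketch: a two-stage pipeline over a single row in which the loads for iteration l+1 are issued before the multiply-add for iteration l. The pipeline depth, function name, and register assignment here are illustrative, not the exact schedule used on Itanium 2.

/* Software-pipelined CSR row loop: load latency overlaps with the
 * arithmetic of the previous iteration. Empty rows are handled by the
 * early return. */
static double csr_row_pipelined(const double *val, const int *ind,
                                const double *x, int start, int end)
{
    if (start >= end) return 0.0;       /* empty row */
    double sum = 0.0;
    double v   = val[start];            /* prologue: first operands */
    double xv  = x[ind[start]];
    for (int l = start; l < end - 1; l++) {
        double v_next  = val[l + 1];    /* issue next iteration's loads early */
        double xv_next = x[ind[l + 1]];
        sum += v * xv;                  /* consume operands loaded last time */
        v  = v_next;
        xv = xv_next;
    }
    sum += v * xv;                      /* epilogue */
    return sum;
}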
Explicit Software Prefetching • ORIGINAL CODE: • type val : real[k]   type ind : int[k]   type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • SOFTWARE PREFETCHED CODE: • type val : real[k]   type ind : int[k]   type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • pref(NTA, pref_v_amt + &val[l]) • pref(NTA, pref_i_amt + &ind[l]) • pref(NONE, &x[ind[l + pref_x_amt]]) • *NTA: no temporal locality at any cache level • *NONE: temporal locality at the highest cache level
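A hedged C sketch of the prefetched loop, using the GCC/Clang __builtin_prefetch intrinsic as a stand-in for Itanium's lfetch hints. The prefetch distances PF_V, PF_I, PF_X are made-up tuning parameters, not values from the slides.

#define PF_V 64   /* prefetch distance (in elements) for val[]          */
#define PF_I 64   /* prefetch distance for ind[]                        */
#define PF_X 16   /* prefetch distance for the source vector x[]        */

void spmv_csr_prefetch(int m, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y)
{
    int nnz = ptr[m];
    for (int i = 0; i < m; i++) {
        for (int l = ptr[i]; l < ptr[i + 1]; l++) {
            y[i] += val[l] * x[ind[l]];
            /* val and ind are streamed once: NTA-style hint (locality 0).
               Prefetch instructions do not fault, so running a little past
               the end of the arrays is harmless on typical targets. */
            __builtin_prefetch(&val[l + PF_V], 0, 0);
            __builtin_prefetch(&ind[l + PF_I], 0, 0);
            /* x is reused: keep it in cache (high locality hint), but only
               dereference ind[] within its bounds. */
            if (l + PF_X < nnz)
                __builtin_prefetch(&x[ind[l + PF_X]], 0, 3);
        }
    }
}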
Characteristics of Modern Architectures • High Set Associativity in Caches • 4-way L1, 8-way L2, 12-way L3 on Itanium 2 • Multiple Load/Store Units • Multiple Execution Units • Six Integer Execution Units on Itanium 2 • Two Floating-Point Multiply-Add Execution Units on Itanium 2 • Question: What if we broke the matrix into multiple streams of execution?
Parallel SpMV • Run different rows in different threads • Can we do that on data-parallel architectures (SIMD/VLIW, e.g. Itanium or GPUs)? • What if rows have different lengths? • One row finishes while others are still running • Waiting threads keep processors idle • Can we avoid this idleness? • Standard solution: Segmented scan
Segmented Scan • Multiple segments (streams) of simultaneous execution • A single loop with branches inside to check whether each segment has reached the end of a row • Reduces loop overhead • Good if the average NZ/row is small • Changes the memory access pattern and can use caches more efficiently for some matrices • Future Work: Pass SpM×V through a cache simulator to observe cache behavior
Conclusions & Future Work • Optimizations studied are a good idea and should include this into OSKI • Develop Parallel / Multicore versions • Dual Core, Dual Socket Opterons, etc
Algorithm #2: Segmented Scan • 1x1x2 Segmented Scan Code • type val : real[k]   type ind : int[k]   type ptr : int[m+1]   type RowStart : int[VectorLength] • r0 ← RowStart[0]   r1 ← RowStart[1] • nnz0 ← ptr[r0]   nnz1 ← ptr[r1] • EoR0 ← ptr[r0+1]   EoR1 ← ptr[r1+1] • while nnz0 < SegmentLength do • y[r0] ← y[r0] + val[nnz0] ∙ x[ind[nnz0]] • y[r1] ← y[r1] + val[nnz1] ∙ x[ind[nnz1]] • if (nnz0 = EoR0) then r0++; EoR0 ← ptr[r0+1] • if (nnz1 = EoR1) then r1++; EoR1 ← ptr[r1+1] • nnz0 ← nnz0 + 1 • nnz1 ← nnz1 + 1
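For reference, a minimal sequential C sketch of the two-stream idea. It splits the rows into two contiguous segments at m/2 instead of using a RowStart table; that split and the function name are assumptions of this sketch, and a true SIMD/VLIW mapping would run the two streams in separate lanes rather than interleaved in one thread.

/* Two-stream "segmented scan" style CSR SpMV: one loop advances both
 * streams in lockstep, so a short row in one stream never leaves the
 * other stream idle. */
void spmv_segscan2(int m, const int *ptr, const int *ind,
                   const double *val, const double *x, double *y)
{
    int r0 = 0,         r1 = m / 2;         /* current row of each stream     */
    int nnz0 = ptr[r0], nnz1 = ptr[r1];     /* current nonzero of each stream */
    int lim0 = ptr[m / 2], lim1 = ptr[m];   /* end of each stream's nonzeros  */

    while (nnz0 < lim0 || nnz1 < lim1) {
        if (nnz0 < lim0) {
            while (nnz0 >= ptr[r0 + 1]) r0++;   /* end of row: advance stream 0 */
            y[r0] += val[nnz0] * x[ind[nnz0]];
            nnz0++;
        }
        if (nnz1 < lim1) {
            while (nnz1 >= ptr[r1 + 1]) r1++;   /* end of row: advance stream 1 */
            y[r1] += val[nnz1] * x[ind[nnz1]];
            nnz1++;
        }
    }
}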
Measuring Performance • Measure Dense Performance (r,c) • Performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format • Estimate Fill Ratio (r,c) for all r, c • Fill Ratio (r,c) = (number of stored values) / (number of true non-zeros) • Choose the r,c that maximizes • Estimated Performance (r,c) = Dense Performance (r,c) / Fill Ratio (r,c)
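A small sketch of how this selection heuristic could be coded. The array names, function name, and the 8x8 search space are assumptions for illustration; in practice the dense profile is benchmarked offline per machine and the fill ratio is estimated by sampling the matrix.

#define RMAX 8
#define CMAX 8

/* Pick the r x c block size maximizing
 *   Estimated Mflop/s(r,c) = Dense Mflop/s(r,c) / Fill Ratio(r,c). */
void choose_block_size(const double dense_mflops[RMAX + 1][CMAX + 1],
                       const double fill_ratio[RMAX + 1][CMAX + 1],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = 1;
    *best_c = 1;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= CMAX; c++) {
            double est = dense_mflops[r][c] / fill_ratio[r][c];
            if (est > best) { best = est; *best_r = r; *best_c = c; }
        }
}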
References • G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993. • E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000. • E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004. • R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance modeling and analysis of cache blocking in sparse matrix vector multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004. • Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990. • A. Schwaighofer. A Matlab interface to SVMlight, version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004. • R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003. • R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).