
Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures

This presentation explores Sparse Matrix-Vector Multiply (SpM×V) algorithms, optimizations, and applications using Compressed Sparse Row and Register-Level Blocking to enhance performance on modern architectures like Itanium 2. Techniques such as Explicit Software Pipelining and Prefetching are discussed to improve efficiency, with insights on Parallel SpMV and Segmented Scan approaches for multi-core systems. The study evaluates performance metrics like Mflop/s and Fill Ratio to guide algorithm design for optimal results on advanced architectures.



Presentation Transcript


  1. Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures • Ankit Jain, Vasily Volkov • CS252 Final Presentation, 5/9/2007 • ankit@eecs.berkeley.edu, volkov@eecs.berkeley.edu

  2. SpM×V and its Applications • Sparse Matrix Vector Multiply (SpM×V): y ← y + A∙x • x, y are dense vectors • x: source vector • y: destination vector • A is a sparse matrix (<1% of entries are nonzero) • Applications employing SpM×V in the inner loop • Least Squares Problems • Eigenvalue Problems • [figure: sparse matrix A times source vector x, accumulated into destination vector y]

  3. Storing a Matrix in Memory: Compressed Sparse Row Data Structure and Algorithm • type val : real[k], ind : int[k], ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]]
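A minimal C rendering of the CSR kernel above may help; the array names mirror the slide's pseudocode, while the concrete types (double values, int indices, m rows) are our assumptions.

void spmv_csr(int m, const double *val, const int *ind,
              const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {              /* foreach row i */
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            yi += val[l] * x[ind[l]];          /* indirect, irregular access to x */
        y[i] = yi;
    }
}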

  4. What’s so hard about it? • Reason for Poor Performance of Naïve Implementation • Poor locality (indirect and irregular memory accesses) • Limited by speed of main memory • Poor instruction mix (low flops to memory operations ratio) • Algorithm dependent on non-zero structure of matrix • Dense matrices vs Sparse matrices

  5. Register-Level Blocking (SPARSITY): 3x3 Example

  6. Register-Level Blocking (SPARSITY): 3x3 Example BCSR with uniform, aligned grid

  7. Register-Level Blocking (SPARSITY): 3x3 Example Fill-in zeros: trade-off extra ops for better efficiency

  8. Blocked Compressed Sparse Row • Inner loop performs floating point multiply-add on each non-zero in block instead of just one non-zero • Reduces the number of times the source vector x has to be brought back into memory • Reduces the number of indices that have to be stored and loaded
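To make the blocked loop structure concrete, here is a hedged sketch of a 2x2 BCSR kernel (illustrative, not the exact SPARSITY-generated code). Assumptions: m is a multiple of 2, each 2x2 block is stored contiguously in row-major order in val, bind holds each block's first column index, and bptr indexes blocks by block row.

void spmv_bcsr_2x2(int m, const double *val, const int *bind,
                   const int *bptr, const double *x, double *y)
{
    for (int bi = 0; bi < m / 2; bi++) {        /* block rows */
        double y0 = y[2 * bi], y1 = y[2 * bi + 1];
        for (int b = bptr[bi]; b < bptr[bi + 1]; b++) {
            const double *blk = &val[4 * b];    /* one 2x2 block, row-major */
            int j = bind[b];
            double x0 = x[j], x1 = x[j + 1];    /* x loaded once, reused by both rows */
            y0 += blk[0] * x0 + blk[1] * x1;
            y1 += blk[2] * x0 + blk[3] * x1;
        }
        y[2 * bi] = y0;
        y[2 * bi + 1] = y1;
    }
}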

  9. The Payoff: Speedups on Itanium 2 [chart: Mflop/s of the reference implementation vs. the best register-blocked version; best block size: 4x2]

  10. Explicit Software Pipelining • ORIGINAL CODE: • type val : real[k], ind : int[k], ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • SOFTWARE PIPELINED CODE (steady state; val_1, x_1, ind_2 are primed in a loop prologue, so each load is issued one or two iterations before its value is used): • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val_1 ∙ x_1 • val_1 ← val[l + 1] • x_1 ← x[ind_2] • ind_2 ← ind[l + 2]
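Below is a hedged C sketch of the same idea; the rotation variables and the prologue/epilogue structure are ours, and only a one-iteration lookahead is shown, versus the slide's deeper distances for val, ind, and x. On an in-order machine like Itanium 2 this scheduling had to be made explicit; a modern compiler may already do it.

void spmv_csr_pipelined(int m, const double *val, const int *ind,
                        const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        int lo = ptr[i], hi = ptr[i + 1];
        if (lo == hi) continue;              /* skip empty rows */
        double v  = val[lo];                 /* prologue: prime the pipeline */
        double xv = x[ind[lo]];
        double yi = y[i];
        for (int l = lo; l < hi - 1; l++) {
            double v_next = val[l + 1];      /* issue loads for the next iteration */
            double x_next = x[ind[l + 1]];
            yi += v * xv;                    /* compute with already-loaded data */
            v  = v_next;
            xv = x_next;
        }
        yi += v * xv;                        /* epilogue: last nonzero of the row */
        y[i] = yi;
    }
}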

  11. Explicit Software Prefetching • ORIGINAL CODE: • type val : real[k], ind : int[k], ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • SOFTWARE PREFETCHED CODE: • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • pref(NTA, &val[l] + pref_v_amt) • pref(NTA, &ind[l] + pref_i_amt) • pref(NONE, &x[ind[l + pref_x_amt]]) • *NTA: no temporal locality at any cache level • *NONE: temporal locality at the highest cache level
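A hedged sketch using GCC's __builtin_prefetch in place of the Itanium lfetch hints above; the distances PREF_V, PREF_I, PREF_X are illustrative tuning parameters, not values from the slides, and the third argument (0 = no temporal locality, 3 = high) loosely mirrors the NTA/NONE hints. Prefetches never fault, so running a little past the end of val or ind is harmless in practice; only the real load through ind needs a bounds check.

#define PREF_V 64   /* prefetch distance for val, in elements (assumed) */
#define PREF_I 64   /* prefetch distance for ind (assumed) */
#define PREF_X 16   /* lookahead for x through ind (assumed) */

void spmv_csr_prefetch(int m, const double *val, const int *ind,
                       const int *ptr, const double *x, double *y)
{
    int nnz = ptr[m];                        /* total number of nonzeros */
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++) {
            yi += val[l] * x[ind[l]];
            __builtin_prefetch(&val[l + PREF_V], 0, 0);  /* streaming: like NTA */
            __builtin_prefetch(&ind[l + PREF_I], 0, 0);
            if (l + PREF_X < nnz)            /* guard the actual ind[] load */
                __builtin_prefetch(&x[ind[l + PREF_X]], 0, 3);  /* like NONE */
        }
        y[i] = yi;
    }
}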

  12. Characteristics of Modern Architectures • High Set Associativity in Caches • 4-way L1, 8-way L2, 12-way L3 on Itanium 2 • Multiple Load/Store Units • Multiple Execution Units • Six Integer Execution Units on Itanium 2 • Two Floating Point Multiply-Add Execution Units on Itanium 2 • Question: What if we broke the matrix into multiple streams of execution?

  13. Parallel SpMV • Run different rows in different threads (a minimal sketch follows below) • Can we do that on data-parallel architectures (SIMD/VLIW, e.g., Itanium or GPUs)? • What if rows have different lengths? • One row finishes while others are still running • Waiting threads keep processors idle • Can we avoid the idleness? • Standard solution: Segmented scan
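The simplest realization of "different rows in different threads" on a multicore is a row-parallel loop; this is a minimal OpenMP sketch, and our choice of dynamic scheduling is just one way to mitigate the uneven row lengths the slide warns about (compile with -fopenmp).

#include <omp.h>

void spmv_csr_parallel(int m, const double *val, const int *ind,
                       const int *ptr, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            yi += val[l] * x[ind[l]];
        y[i] = yi;      /* rows are disjoint, so no synchronization on y */
    }
}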

  14. Segmented Scan • Multiple Segments (streams) of Simultaneous Execution • Single loop with branches inside to check, for each segment, whether we have reached the end of a row • Reduces Loop Overhead • Good if the average NZ/Row is small • Changes the memory access pattern and can use caches more efficiently for some matrices • Future Work: Pass SpM×V through a cache simulator to observe cache behavior

  15. Itanium 2 Results (1.3 GHz, Millennium Cluster)

  16. Conclusions & Future Work • The optimizations studied pay off and should be incorporated into OSKI • Develop Parallel / Multicore versions • Dual-Core, Dual-Socket Opterons, etc.

  17. Questions?

  18. Extra Slides

  19. Algorithm #2: Segmented Scan • 1x1x2 Segmented Scan Code • type val : real[k], ind : int[k], ptr : int[m+1], RowStart : int[VectorLength] • r0 ← RowStart[0]; r1 ← RowStart[1] • nnz0 ← ptr[r0]; nnz1 ← ptr[r1] • EoR0 ← ptr[r0+1]; EoR1 ← ptr[r1+1] • while nnz0 < SegmentLength do • y[r0] ← y[r0] + val[nnz0] ∙ x[ind[nnz0]] • y[r1] ← y[r1] + val[nnz1] ∙ x[ind[nnz1]] • nnz0 ← nnz0 + 1; nnz1 ← nnz1 + 1 • if nnz0 = EoR0 then r0++; EoR0 ← ptr[r0+1] • if nnz1 = EoR1 then r1++; EoR1 ← ptr[r1+1] • (the end-of-row test follows the increment, so it fires once a row's last nonzero has been consumed)
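A hedged C rendering of this two-stream kernel, under our own assumptions: no empty rows (the slide's pseudocode assumes the same), and a row_start[3] array whose entries give the first row of each segment, with row_start[2] = m, so segment s covers nonzeros ptr[row_start[s]] through ptr[row_start[s+1]] - 1. Unlike the slide's fixed SegmentLength loop, the shorter stream here simply finishes first and the leftover is drained.

void spmv_csr_segscan2(const double *val, const int *ind, const int *ptr,
                       const int row_start[3], const double *x, double *y)
{
    int r0 = row_start[0],  r1 = row_start[1];        /* current row per stream */
    int n0 = ptr[r0],       n1 = ptr[r1];             /* current nonzero per stream */
    int end0 = ptr[row_start[1]], end1 = ptr[row_start[2]];
    int eor0 = ptr[r0 + 1], eor1 = ptr[r1 + 1];       /* end-of-row markers */

    while (n0 < end0 && n1 < end1) {                  /* interleave both streams */
        y[r0] += val[n0] * x[ind[n0]];
        y[r1] += val[n1] * x[ind[n1]];
        if (++n0 == eor0 && n0 < end0) eor0 = ptr[++r0 + 1];  /* next row */
        if (++n1 == eor1 && n1 < end1) eor1 = ptr[++r1 + 1];
    }
    for (; n0 < end0; n0++) {                         /* drain stream 0 */
        if (n0 == eor0) eor0 = ptr[++r0 + 1];
        y[r0] += val[n0] * x[ind[n0]];
    }
    for (; n1 < end1; n1++) {                         /* drain stream 1 */
        if (n1 == eor1) eor1 = ptr[++r1 + 1];
        y[r1] += val[n1] * x[ind[n1]];
    }
}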

  20. Measuring Performance • Measure Dense Performance (r,c) • Performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format • Estimate Fill Ratio (r,c) for all r,c • Fill Ratio (r,c) = (number of stored values) / (number of true non-zeros) • Choose the r,c that maximizes • Estimated Performance (r,c) = Dense Performance (r,c) / Fill Ratio (r,c)
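In code, this heuristic amounts to a small argmax; the toy version below assumes the two quantities have already been measured or estimated into dense 2-D arrays for block sizes 1..RMAX by 1..CMAX (the names, bounds, and layout are ours).

#define RMAX 8
#define CMAX 8

void choose_block_size(const double dense_mflops[RMAX][CMAX],
                       const double fill_ratio[RMAX][CMAX],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    for (int r = 0; r < RMAX; r++)
        for (int c = 0; c < CMAX; c++) {
            /* estimated performance = dense Mflop/s divided by fill ratio */
            double est = dense_mflops[r][c] / fill_ratio[r][c];
            if (est > best) {
                best = est;
                *best_r = r + 1;   /* arrays indexed from 0, block sizes from 1 */
                *best_c = c + 1;
            }
        }
}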

  21. References • G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993. • E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000. • E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004. • R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance modeling and analysis of cache blocking in sparse matrix vector multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, CA, USA, June 2004. • Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990. • A. Schwaighofer. A Matlab interface to SVM-light, version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004. • R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003. • R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing (to appear).
