This presentation explores Sparse Matrix-Vector Multiply (SpM×V) algorithms, optimizations, and applications, using Compressed Sparse Row storage and register-level blocking to enhance performance on modern architectures such as Itanium 2. Techniques such as explicit software pipelining and prefetching are discussed to improve efficiency, along with parallel SpMV and segmented-scan approaches for multi-core systems. The study evaluates performance metrics such as Mflop/s and fill ratio to guide algorithm design on these architectures.
Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures. Ankit Jain and Vasily Volkov, CS252 Final Presentation, 5/9/2007. ankit@eecs.berkeley.edu, volkov@eecs.berkeley.edu
SpM×V and its Applications • Sparse Matrix Vector Multiply (SpM×V): y ← y + A∙x • x, y are dense vectors • x: source vector • y: destination vector • A is a sparse matrix (<1% of entries are nonzero) • Applications employing SpM×V in the inner loop • Least Squares Problems • Eigenvalue Problems
Storing a Matrix in Memory: the Compressed Sparse Row (CSR) Data Structure and Algorithm • type val : real[k] type ind : int[k] type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]]
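A minimal C rendering of the CSR kernel above, for reference (array names match the pseudocode; m is the number of rows):

void spmv_csr(int m, const double *val, const int *ind,
              const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            yi += val[l] * x[ind[l]];      /* indirect, irregular access into x */
        y[i] = yi;
    }
}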
What’s so hard about it? • Reasons for poor performance of the naïve implementation • Poor locality (indirect and irregular memory accesses) • Limited by the speed of main memory • Poor instruction mix (low ratio of flops to memory operations) • Algorithm depends on the non-zero structure of the matrix • Dense matrices vs. sparse matrices
Register-Level Blocking (SPARSITY): 3x3 Example BCSR with uniform, aligned grid
Register-Level Blocking (SPARSITY): 3x3 Example Fill-in zeros: trade-off extra ops for better efficiency
Blocked Compressed Sparse Row • Inner loop performs floating point multiply-add on each non-zero in block instead of just one non-zero • Reduces the number of times the source vector x has to be brought back into memory • Reduces the number of indices that have to be stored and loaded
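The following is a minimal sketch of a 2x2 BCSR kernel in C to illustrate these points; the data layout (row-major 2x2 blocks in val, block-row pointers b_ptr, block-column indices b_ind) is an assumption for illustration, not the exact structure used in SPARSITY or OSKI:

void spmv_bcsr_2x2(int mb, const double *val, const int *b_ind,
                   const int *b_ptr, const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {                 /* block row I covers rows 2I and 2I+1 */
        double y0 = y[2 * I], y1 = y[2 * I + 1];
        for (int l = b_ptr[I]; l < b_ptr[I + 1]; l++) {
            const double *b = &val[4 * l];         /* one stored 2x2 block */
            int j = 2 * b_ind[l];                  /* first column of the block */
            double x0 = x[j], x1 = x[j + 1];       /* x loaded once per block, reused twice */
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}

One stored index (b_ind[l]) now covers four values, and each loaded x element feeds two multiply-adds, which is where the reductions in index storage and memory traffic come from.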
The Payoff: Speedups on Itanium 2 (plot: Mflop/s of the best blocked kernel, 4x2, versus the reference implementation)
Explicit Software Pipelining • ORIGINAL CODE: • type val : real[k] type ind : int[k] type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • SOFTWARE PIPELINED CODE (loop body only; a prologue before the loop initializes val_1, x_1, and ind_2 for the first iteration) • type val : real[k] type ind : int[k] type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val_1 ∙ x_1 • val_1 ← val[l + 1] • x_1 ← x[ind_2] • ind_2 ← ind[l + 2]
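A rough C sketch of the same idea (simplified to a one-stage pipeline for both the value and the source-vector load, so the distances differ from the slide's val/ind/x staging): the loads for the next iteration are issued while the multiply-add for the current one completes.

void spmv_csr_pipelined(int m, const double *val, const int *ind,
                        const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        int start = ptr[i], end = ptr[i + 1];
        if (start == end) continue;                /* skip empty rows */
        double v  = val[start];                    /* prologue: pre-load first operands */
        double xv = x[ind[start]];
        double yi = y[i];
        for (int l = start; l < end - 1; l++) {
            yi += v * xv;                          /* use operands loaded last iteration */
            v  = val[l + 1];                       /* load value for the next iteration */
            xv = x[ind[l + 1]];                    /* load source element for next iteration */
        }
        yi += v * xv;                              /* epilogue: last pre-loaded operands */
        y[i] = yi;
    }
}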
Explicit Software Prefetching • ORIGINAL CODE: • type val : real[k] type ind : int[k] type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • SOFTWARE PREFETCHED CODE • type val : real[k] type ind : int[k] type ptr : int[m+1] • foreach row i do • for l = ptr[i] to ptr[i + 1] – 1 do • y[i] ← y[i] + val[l] ∙ x[ind[l]] • pref(NTA, pref_v_amt + &val[l]) • pref(NTA, pref_i_amt + &ind[l]) • pref(NONE, &x[ind[l + pref_x_amt]]) • *NTA: no temporal locality at any cache level • *NONE: temporal locality at the highest cache level
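A hedged C sketch of the prefetched loop, using the GCC/Clang __builtin_prefetch intrinsic as a stand-in for the Itanium pref instruction; the distances PREF_V, PREF_I, and PREF_X are illustrative tuning parameters rather than values from the study, and ind is assumed to be padded by PREF_X entries so the indexed prefetch of x stays in bounds:

#define PREF_V 64   /* assumed prefetch distance (elements) for val[] */
#define PREF_I 64   /* assumed prefetch distance (elements) for ind[] */
#define PREF_X 16   /* assumed prefetch distance (elements) for x[ind[]] */

void spmv_csr_prefetch(int m, const double *val, const int *ind,
                       const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++) {
            yi += val[l] * x[ind[l]];
            __builtin_prefetch(&val[l + PREF_V], 0, 0);      /* NTA-like: no temporal locality */
            __builtin_prefetch(&ind[l + PREF_I], 0, 0);      /* NTA-like: no temporal locality */
            __builtin_prefetch(&x[ind[l + PREF_X]], 0, 3);   /* keep in cache: x may be reused */
        }
        y[i] = yi;
    }
}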
Characteristics of Modern Architectures • High set associativity in caches • 4-way L1, 8-way L2, 12-way L3 on Itanium 2 • Multiple load/store units • Multiple execution units • Six integer execution units on Itanium 2 • Two floating-point multiply-add execution units on Itanium 2 • Question: What if we broke the matrix into multiple streams of execution?
Parallel SpMV • Run different rows in different threads (a row-parallel baseline is sketched below) • Can we do that on data-parallel architectures (SIMD/VLIW, e.g., Itanium or GPUs)? • What if rows have different lengths? • One row finishes while others are still running • Waiting threads keep processors idle • Can we avoid this idleness? • Standard solution: segmented scan
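For comparison, a minimal row-parallel CSR baseline in OpenMP (not from the slides): each thread owns a disjoint set of rows, so no synchronization on y is needed, but threads assigned short rows finish early and sit idle, which is exactly the imbalance noted above.

#include <omp.h>

void spmv_csr_parallel(int m, const double *val, const int *ind,
                       const int *ptr, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; i++) {                  /* rows are distributed across threads */
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            yi += val[l] * x[ind[l]];
        y[i] = yi;
    }
}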
Segmented Scan • Multiple segments (streams) of simultaneous execution • A single loop with branches inside to check, for each segment, whether the end of a row has been reached • Reduces loop overhead • Good if the average number of non-zeros per row is small • Changes the memory access pattern and can use caches more efficiently for some matrices • Future work: pass SpM×V through a cache simulator to observe cache behavior
Conclusions & Future Work • The optimizations studied pay off and should be incorporated into OSKI • Develop parallel / multicore versions • Dual-core, dual-socket Opterons, etc.
Algorithm #2: Segmented Scan • 1x1x2 Segmented Scan Code • type val : real[k] type ind : int[k] type ptr : int[m+1] type RowStart : int[VectorLength] • r0 ← RowStart[0], r1 ← RowStart[1] • nnz0 ← ptr[r0], nnz1 ← ptr[r1] • EoR0 ← ptr[r0 + 1], EoR1 ← ptr[r1 + 1] • while nnz0 < SegmentLength do • y[r0] ← y[r0] + val[nnz0] ∙ x[ind[nnz0]] • y[r1] ← y[r1] + val[nnz1] ∙ x[ind[nnz1]] • nnz0 ← nnz0 + 1, nnz1 ← nnz1 + 1 • if nnz0 = EoR0 then r0++, EoR0 ← ptr[r0 + 1] • if nnz1 = EoR1 then r1++, EoR1 ← ptr[r1 + 1]
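A C sketch of the two-stream version above, under two simplifying assumptions: each segment holds exactly SegmentLength non-zeros, and RowStart[s] is the row owning the first non-zero of segment s. One non-zero from each segment is processed per iteration, giving the processor two independent streams of work:

void spmv_csr_segscan2(const double *val, const int *ind, const int *ptr,
                       const int RowStart[2], int SegmentLength,
                       const double *x, double *y)
{
    int r0 = RowStart[0],   r1 = RowStart[1];
    int nnz0 = ptr[r0],     nnz1 = ptr[r1];
    int EoR0 = ptr[r0 + 1], EoR1 = ptr[r1 + 1];

    for (int t = 0; t < SegmentLength; t++) {
        /* advance each stream past any finished (or empty) rows */
        while (nnz0 == EoR0) { r0++; EoR0 = ptr[r0 + 1]; }
        while (nnz1 == EoR1) { r1++; EoR1 = ptr[r1 + 1]; }
        y[r0] += val[nnz0] * x[ind[nnz0]];
        y[r1] += val[nnz1] * x[ind[nnz1]];
        nnz0++;
        nnz1++;
    }
}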
Measuring Performance • Measure Dense Performance (r,c) • Performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format • Estimate Fill Ratio (r,c) for all r,c • Fill Ratio (r,c) = (number of stored values) / (number of true non-zeros) • Choose the r,c that maximizes • Estimated Performance (r,c) = Dense Performance (r,c) / Fill Ratio (r,c)
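A short sketch of that selection heuristic in C; dense_mflops would come from a one-time dense benchmark on the machine and fill_ratio from sampling the matrix, and the 8x8 search space is just an assumed bound:

#define RMAX 8
#define CMAX 8

void choose_block_size(const double dense_mflops[RMAX][CMAX],
                       const double fill_ratio[RMAX][CMAX],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = 1;
    *best_c = 1;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= CMAX; c++) {
            /* Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c) */
            double est = dense_mflops[r - 1][c - 1] / fill_ratio[r - 1][c - 1];
            if (est > best) { best = est; *best_r = r; *best_c = c; }
        }
}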
References • G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993. • E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000. • E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004. • R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance modeling and analysis of cache blocking in sparse matrix-vector multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004. • Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990. • A. Schwaighofer. A Matlab interface to SVM light, version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004. • R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003. • R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing (to appear).