Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder
Caltech ASCI Center
Why Use the GPU?
• Semiconductor trends
  • cost
  • wires vs. compute
  • Stanford streaming supercomputer
• Parallelism
  • many functional units
  • graphics is the prime example
• Harvesting this power
  • which applications are suitable?
  • which abstractions are useful?
• History
  • massively parallel SIMD machines
  • media processing
[Chart of actual vs. possible compute growth courtesy Bill Dally. Images: Imagine stream processor, Bill Dally, Stanford; Connection Machine CM2, Thinking Machines.]
Contributions and Related Work
• Contributions
  • numerical algorithms on the GPU
  • unstructured grids: conjugate gradients
  • regular grids: multigrid
  • which abstractions are needed?
• Related work on numerical algorithms
  • Goodnight et al. 2003 (multigrid)
  • Hall et al. 2003 (cache)
  • Harris et al. 2002 (FD simulation)
  • Hillesland et al. 2003 (optimization)
  • Krüger & Westermann 2003 (numerical linear algebra)
  • Strzodka (PDEs)
Streaming Model
• Abstract model (Purcell et al. 2002)
  • data structures: streams
  • algorithms: kernels
• Concrete model
  • render a rectangle
  • data structures: textures
  • algorithms: fragment programs
[Diagram: input record stream → kernel → kernel → output record stream, with globals feeding each kernel. Concretely: the rasterizer sets up texture indices and all associated data; textures act as read-only memory; a fragment program runs for all pixels in parallel; output goes to a texture, and that buffer is bound back as a texture for the next pass.]
(A CUDA sketch of the same pattern follows.)
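This pattern maps directly onto a modern compute API. A minimal CUDA sketch of the streaming model (our illustration; the paper itself used fragment programs, and all names here are ours): the input record stream is an input array, the kernel is a __global__ function run over all elements in parallel, globals are kernel arguments, and the output record stream is an output array.

```cuda
#include <cuda_runtime.h>

// One "kernel" applied to every record of an input stream in parallel,
// writing an output stream -- the CUDA analogue of a fragment program
// run over all pixels of a rendered rectangle.
__global__ void scale_stream(const float* in, float* out, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this record's index
    if (i < n)
        out[i] = gain * in[i];  // read-only input, write-only output
}

// Launch: one thread per record, like one fragment per pixel.
// scale_stream<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
```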
Sparse Matrices: Geometric Flow
• Ubiquitous in numerical computing
  • discretization of PDEs: animation
  • finite elements, finite differences, finite volumes
  • optimization, editing, etc.
• Example here: processing of surfaces
• Canonical non-linear problem: mean curvature flow
  • velocity opposite the mean curvature normal
  • implicit time discretization
  • solve a sequence of SPD systems (see the worked step below)
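Spelled out in the implicit-fairing style this example follows (a sketch; the notation Δt, L, Xⁿ is ours, and signs and scalings of the curvature term vary by convention): one backward-Euler step of the flow gives

```latex
\frac{\partial X}{\partial t} = -\kappa\,\mathbf{n} \;\approx\; \Delta X
\quad\Longrightarrow\quad
(I - \Delta t\, L)\, X^{n+1} = X^{n}
```

where L is the discrete Laplace-Beltrami operator assembled on the current mesh; each time step therefore amounts to solving one sparse SPD system.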
Conjugate Gradients
• High-level code, inner loop:
  • matrix-vector multiply
  • sum-reduction
  • scalar-vector MAD
• Inner product
  • fragment-wise multiply, followed by sum-reduction
  • odd dimensions can be handled
(A reduction sketch follows this list.)
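A hedged CUDA sketch of the inner product, standing in for the paper's fragment-wise multiply followed by a multi-pass sum-reduction (the block size, the names, and the atomic final combine are our assumptions):

```cuda
#include <cuda_runtime.h>

// Fragment-wise multiply fused with a block-level tree reduction; per-block
// partial sums are combined with one atomic add, standing in for the paper's
// repeated render-to-smaller-quad reduction passes. Launch with 256 threads
// per block and *result zeroed beforehand.
__global__ void dot_kernel(const float* x, const float* y, float* result, int n) {
    __shared__ float partial[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? x[i] * y[i] : 0.0f;  // guard handles odd sizes
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(result, partial[0]);  // combine block sums
}
```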
Row-Vector Product (y = Ax)
• Ai – diagonal matrix elements
• Aj – off-diagonal matrix elements
• J – pointers to the x_j entries
• R – pointers to segments
• X – vector elements
• a fragment program gathers these per row (sketched below)
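A hedged sketch of that gather in CUDA, one thread per row (array names follow the slide; the exact layout, with R holding segment boundaries so R[i]..R[i+1] spans row i's off-diagonals, is our assumption):

```cuda
// y = A x for the layout above: Ai holds diagonals, Aj off-diagonal values
// packed in segments, J the column index of each off-diagonal, and R the
// segment boundaries (R has nrows + 1 entries).
__global__ void spmv(const float* Ai, const float* Aj, const int* J,
                     const int* R, const float* x, float* y, int nrows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    float sum = Ai[row] * x[row];              // diagonal term
    for (int k = R[row]; k < R[row + 1]; ++k)  // this row's segment
        sum += Aj[k] * x[J[k]];                // indirect gather of x_j
    y[row] = sum;
}
```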
Apply to All Pixels
• Two extremes
  • one row at a time: setup overhead
  • all rows at once: limited by the worst row
• Middle ground: organize “batches” of work
• How to arrange batches?
  • order rows by number of non-zero entries
  • optimal packing is NP-hard
• We choose fixed-size rectangles
  • the fragment pipe is quantized
  • simple experiments reveal the best size: 26 × 18 is 91% efficient
  • wasted fragments on the diagonal
[Plot: time vs. area in pixels.]
Packing (Greedy)
• Sort rows by non-zero entries per row, e.g.:
  15 13 13 12 12 11 10 9 9 9 9 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 6 5 5 4
• Greedily cut the sorted list into fixed-size rectangular batches
• Each batch is bound to an appropriate fragment program
• All this setup is done only once, at the beginning of time
• Depends only on mesh connectivity
(A host-side sketch of the batching follows.)
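A host-side sketch of the greedy batching (our illustration; only the 26 × 18 rectangle size comes from the slides, and real batches would also be laid out as rectangles rather than flat index lists):

```cuda
#include <algorithm>
#include <vector>

// Sort rows by non-zero count, then cut the sorted list into fixed-size
// batches; each batch later binds a fragment program sized for the longest
// row it contains. Runs once, at setup time; depends only on connectivity.
std::vector<std::vector<int>> pack_rows(const std::vector<int>& nnz_per_row) {
    const size_t batch_size = 26 * 18;  // best rectangle found experimentally
    std::vector<int> order(nnz_per_row.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return nnz_per_row[a] > nnz_per_row[b];  // longest rows first
    });
    std::vector<std::vector<int>> batches;
    for (size_t i = 0; i < order.size(); i += batch_size)
        batches.emplace_back(order.begin() + i,
            order.begin() + std::min(i + batch_size, order.size()));
    return batches;
}
```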
Recomputing the Matrix
• Matrix entries depend on the surface
• must “render” into the matrix
• two additional indirection textures: previous and next
Results (NV30 @ 500 MHz)
• 37k elements
• Matrix multiply: 33 instructions, 120 per second
  • only 13 flops; latency limited
• Reduction: 7 instructions/fragment/pass, 3400 per second
• CG solve: 20 per second
Regular Grids
• Poisson solver as the example
• multigrid approach
• this time variables live on the “pixel grid”
• e.g. Navier-Stokes: after discretization, solve a Poisson equation at each time step (one standard form below)
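One textbook instance (not from the talk): in a projection scheme for incompressible Navier-Stokes, each time step solves

```latex
\nabla^2 p \;=\; \frac{\rho}{\Delta t}\,\nabla\!\cdot\mathbf{u}^{*}
```

for the pressure p, where u* is the intermediate velocity; subtracting ∇p then projects the velocity back onto the divergence-free space.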
Poisson Equation
• Appears all over the place
• easy to discretize on a regular grid
• matrix multiply is stencil application
• FD Laplace stencil, centered at (i,j):
    0  1  0
    1 -4  1
    0  1  0
• Use an iterative matrix solver
  • just need application of the stencil
  • easy: just like filtering
• incorporate geometry (Jacobian)
• variable coefficients
(A stencil kernel is sketched after this list.)
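A minimal CUDA sketch of that stencil application (our stand-in for the fragment-program version; constant coefficients, with Dirichlet boundary rows left untouched):

```cuda
// y = A x where A is the 5-point FD Laplacian on an n-by-n grid --
// matrix multiply as stencil application, exactly like an image filter.
__global__ void apply_laplacian(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= n - 1 || j >= n - 1) return;
    y[j * n + i] = -4.0f * x[j * n + i]
                 + x[j * n + i - 1] + x[j * n + i + 1]
                 + x[(j - 1) * n + i] + x[(j + 1) * n + i];
}
```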
Multigrid
• Fine-to-coarse-to-fine cycle: Relax → Project → Relax → … → Interpolate → Relax
• high-frequency error is removed quickly
• lower-frequency error takes longer
[Diagram: V-cycle of Relax, Projection down, Relax, Interpolation up, Relax.]
(A control-flow sketch follows.)
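A hedged reconstruction of the control flow in CUDA (our code throughout: the paper's operators are general 3 × 3 stencils with geometry-dependent coefficients, while this sketch uses the constant-coefficient Poisson problem with damped Jacobi relaxation, full-weighting projection, and bilinear interpolation on grids of side n = 2^k + 1 with zero boundaries):

```cuda
#include <cuda_runtime.h>

// Damped-Jacobi sweep for -Laplace(u) = f (5-point stencil, h2 = h*h).
__global__ void relax(const float* u, const float* f, float* u_new, int n, float h2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= n - 1 || j >= n - 1) return;
    const float w = 0.8f;  // damping, for good smoothing behavior
    float jac = 0.25f * (u[j*n+i-1] + u[j*n+i+1]
                       + u[(j-1)*n+i] + u[(j+1)*n+i] + h2 * f[j*n+i]);
    u_new[j*n+i] = u[j*n+i] + w * (jac - u[j*n+i]);
}

// Residual r = f - A u with A = -Laplace.
__global__ void resid(const float* u, const float* f, float* r, int n, float h2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= n - 1 || j >= n - 1) return;
    float Au = (4.0f * u[j*n+i] - u[j*n+i-1] - u[j*n+i+1]
              - u[(j-1)*n+i] - u[(j+1)*n+i]) / h2;
    r[j*n+i] = f[j*n+i] - Au;
}

// Full-weighting projection of the fine residual to the coarse grid, m = (n+1)/2.
__global__ void project(const float* r, float* rc, int n, int m) {
    int I = blockIdx.x * blockDim.x + threadIdx.x;
    int J = blockIdx.y * blockDim.y + threadIdx.y;
    if (I >= m || J >= m) return;
    if (I == 0 || J == 0 || I == m - 1 || J == m - 1) { rc[J*m+I] = 0.0f; return; }
    int i = 2 * I, j = 2 * J;
    rc[J*m+I] = (4.0f * r[j*n+i]
               + 2.0f * (r[j*n+i-1] + r[j*n+i+1] + r[(j-1)*n+i] + r[(j+1)*n+i])
               + r[(j-1)*n+i-1] + r[(j-1)*n+i+1]
               + r[(j+1)*n+i-1] + r[(j+1)*n+i+1]) / 16.0f;
}

// Bilinear interpolation of the coarse correction, added to the fine solution.
__global__ void interp_add(float* u, const float* ec, int n, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n) return;
    int I = i / 2, J = j / 2;  // the floor op in the indexing
    int ox = i & 1, oy = j & 1;
    float v;
    if (!ox && !oy)     v = ec[J*m+I];
    else if (ox && !oy) v = 0.5f * (ec[J*m+I] + ec[J*m+I+1]);
    else if (!ox && oy) v = 0.5f * (ec[J*m+I] + ec[(J+1)*m+I]);
    else                v = 0.25f * (ec[J*m+I] + ec[J*m+I+1]
                                   + ec[(J+1)*m+I] + ec[(J+1)*m+I+1]);
    u[j*n+i] += v;
}

// One V-cycle: pre-smooth, project the residual, recurse, correct, post-smooth.
void vcycle(float* u, const float* f, int n, float h) {
    size_t bytes = (size_t)n * n * sizeof(float);
    dim3 blk(16, 16), grd((n + 15) / 16, (n + 15) / 16);
    float h2 = h * h;
    float* tmp; cudaMalloc(&tmp, bytes);
    for (int s = 0; s < 2; ++s) {  // pre-smoothing sweeps
        cudaMemcpy(tmp, u, bytes, cudaMemcpyDeviceToDevice);
        relax<<<grd, blk>>>(u, f, tmp, n, h2);
        cudaMemcpy(u, tmp, bytes, cudaMemcpyDeviceToDevice);
    }
    if (n > 3) {
        int m = (n + 1) / 2;
        size_t cbytes = (size_t)m * m * sizeof(float);
        float *r, *rc, *ec;
        cudaMalloc(&r, bytes); cudaMalloc(&rc, cbytes); cudaMalloc(&ec, cbytes);
        resid<<<grd, blk>>>(u, f, r, n, h2);
        dim3 cgrd((m + 15) / 16, (m + 15) / 16);
        project<<<cgrd, blk>>>(r, rc, n, m);
        cudaMemset(ec, 0, cbytes);              // zero initial coarse error
        vcycle(ec, rc, m, 2.0f * h);            // solve the coarse error equation
        interp_add<<<grd, blk>>>(u, ec, n, m);  // coarse-grid correction
        cudaFree(r); cudaFree(rc); cudaFree(ec);
    }
    for (int s = 0; s < 2; ++s) {  // post-smoothing sweeps
        cudaMemcpy(tmp, u, bytes, cudaMemcpyDeviceToDevice);
        relax<<<grd, blk>>>(u, f, tmp, n, h2);
        cudaMemcpy(u, tmp, bytes, cudaMemcpyDeviceToDevice);
    }
    cudaFree(tmp);
}
```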
Computations and Storage Layout
• Lots of stencil applications
  • matrix multiply: 3×3 stencil
  • projection: 3×3 stencil, (1/16) ×
      1 2 1
      2 4 2
      1 2 1
  • interpolation: 2×2(!)
  • floor op in indexing
• Storage for matrices and DOFs
  • variables in one texture
  • matrices in 9 (= 3×3) textures
  • all textures packed; exploit the 4 channels (x, y, z, w)
  • domain decomp.
  • padded boundary
Coarser Matrices
• Operator at the coarser level: A_c = P A_f S
• needed for relaxation at all levels
• Triple matrix product… (spelled out below)
  • work out terms and map to stencils
  • exploit local support of the stencils
  • straightforward but t-e-d-i-o-u-s
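In symbols (our reading of the slide's equation, taking P as the projection and S as the interpolation operator):

```latex
A_c \;=\; P\, A_f\, S,
\qquad
(A_c)_{IJ} \;=\; \sum_{i,j} P_{Ii}\,(A_f)_{ij}\,S_{jJ}
```

Because all three factors have local (3×3 or smaller) support, each coarse stencil entry reduces to a short sum of products of fine-level stencil entries, worked out once per level.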
Results (NV30 @ 500 MHz)
• 257×257 grid
• matrix multiply: 27 instructions, 1370 per second
• interpolation: 10 instructions
• projection: 19 instructions
• Overall performance: 257×257 at 80 fps!
Conclusions
• Enhancements
  • global registers for reductions
  • texture fetch with offset
  • rectangular texture border
  • scalar versus vector problems
• Where are we now?
  • a good streaming processor
  • twice as fast as a CPU implementation
  • lots of room for improvement
• Scientific computing compiler
  • better languages! Brook? C*?
  • manage layout in a buffer