Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid


Presentation Transcript


  1. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz, Ian Farmer, Eitan Grinspun, Peter Schröder Caltech ASCI Center

  2. Why Use the GPU?
  • Semiconductor trends: cost; wires vs. compute; Stanford streaming supercomputer
  • Parallelism: many functional units; graphics is the prime example
  • Harvesting this power: what applications are suitable? what abstractions are useful?
  • History: massively parallel SIMD machines; media processing
  [Figures: "Actual vs. Possible" chart courtesy Bill Dally; Imagine stream processor (Bill Dally, Stanford); Connection Machine CM2 (Thinking Machines)]

  3. Contributions and Related Work
  • Contributions: numerical algorithms on the GPU; unstructured grids (conjugate gradients); regular grids (multigrid); what abstractions are needed?
  • Numerical algorithms: Goodnight et al. 2003 (MG); Hall et al. 2003 (cache); Harris et al. 2002 (FD sim.); Hillesland et al. 2003 (optimization); Krüger & Westermann 2003 (NLA); Strzodka (PDEs)

  4. Streaming Model
  • Abstract model (Purcell et al. 2002): data structures are streams; algorithms are kernels; an input record stream plus globals flows through a kernel to an output record stream
  • Concrete model: render a rectangle; data structures are textures; algorithms are fragment programs
  [Diagram: the rasterizer sets up texture indices and all associated data; the fragment program runs for all pixels in parallel, with textures as read-only memory; output goes to a texture, which is bound as input for the next pass]
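To make the abstraction concrete, here is a toy analogy in Python (illustrative only; names like `run_kernel` are mine): the kernel is applied independently to every input record, which is exactly the guarantee a fragment program gives per pixel.

```python
import numpy as np

def run_kernel(kernel, input_stream, globals_):
    """Streaming model: apply `kernel` independently to every record.

    On the GPU this corresponds to rendering one rectangle: the
    rasterizer emits a fragment per output pixel, the fragment
    program plays the role of `kernel`, textures hold the input
    stream, and the output is written to a texture that can be
    bound as input for the next pass.
    """
    return np.array([kernel(rec, globals_) for rec in input_stream])

# Usage: scale-and-shift every record of the stream.
out = run_kernel(lambda r, g: g["a"] * r + g["b"],
                 np.arange(8, dtype=float), {"a": 2.0, "b": 1.0})
```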

  5. Sparse Matrices: Geometric Flow
  • Ubiquitous in numerical computing: discretization of PDEs (animation); finite elements, differences, volumes; optimization, editing, etc.
  • Example here: processing of surfaces
  • Canonical non-linear problem: mean curvature flow; implicit time discretization; solve a sequence of SPD systems
  [Figure: velocity opposite the mean curvature normal]
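To make the "sequence of SPD systems" concrete, a semi-implicit backward-Euler step looks roughly as follows (a sketch in my notation; sign and mass-matrix conventions vary across formulations):

```latex
% One semi-implicit step of mean curvature flow: freeze the
% Laplace-Beltrami operator L at the current surface x^n, then
% solve a sparse SPD system for the new vertex positions x^{n+1}.
\left( I - \Delta t \, L(x^{n}) \right) x^{n+1} = x^{n}
```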

  6. Conjugate Gradients
  • High-level code: the inner loop is a matrix-vector multiply, a sum-reduction, and scalar-vector MADs
  • Inner product: fragment-wise multiply followed by sum-reduction; odd dimensions can be handled
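For reference, a minimal CPU-side version of this loop in Python/NumPy (a sketch; on the GPU each line below is one or more rendering passes, and `matvec` stands in for the sparse matrix-vector multiply of the next slides):

```python
import numpy as np

def conjugate_gradients(matvec, b, x, tol=1e-8, max_iter=1000):
    """Solve A x = b for SPD A, given only y = matvec(x).

    Each line of the loop body maps to one GPU operation from the
    talk: a matrix-vector multiply, sum-reductions (dot products),
    and scalar-vector multiply-adds (MADs).
    """
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rr = np.dot(r, r)          # sum-reduction
    for _ in range(max_iter):
        Ap = matvec(p)         # sparse matrix-vector multiply
        alpha = rr / np.dot(p, Ap)
        x += alpha * p         # scalar-vector MAD
        r -= alpha * Ap        # scalar-vector MAD
        rr_new = np.dot(r, r)  # sum-reduction
        if rr_new < tol * tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```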

  7. y = Ax
  • Aj – off-diagonal matrix elements
  • R – pointers to segments

  8. Row-Vector Product
  • Fragment program inputs: X – vector elements; J – pointers to xj; R – pointers to segments; Aj – off-diagonal matrix elements; Ai – diagonal matrix elements
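Sequentially, the same indirection scheme looks roughly like this (a Python sketch; in the fragment program, X, J, R, Aj, and Ai are all texture fetches):

```python
import numpy as np

def row_vector_product(Ai, Aj, J, R, X):
    """y = A x for a sparse matrix stored as on the slide:
    Ai -- diagonal entries, one per row
    Aj -- off-diagonal entries, packed row after row
    J  -- for each entry of Aj, the index of the x_j it multiplies
    R  -- segment pointers: row i's off-diagonals are Aj[R[i]:R[i+1]]
    """
    n = len(Ai)
    y = np.empty(n)
    for i in range(n):
        acc = Ai[i] * X[i]                 # diagonal term
        for k in range(R[i], R[i + 1]):    # off-diagonal segment
            acc += Aj[k] * X[J[k]]         # indirect fetch of x_j
        y[i] = acc
    return y
```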

  9. Apply to All Pixels
  • Two extremes: one row at a time (setup overhead); all rows at once (limited by the worst row)
  • Middle ground: organize "batches" of work
  • How to arrange batches? order rows by non-zero entries; optimal packing is NP-hard
  • We choose fixed-size rectangles: the fragment pipe is quantized; simple experiments reveal the best size; 26 x 18 is 91% efficient; wasted fragments fall on the diagonal
  [Plot: time vs. area (pixels)]

  10. Packing (Greedy)
  • Sort rows by non-zero entries per row and pack them greedily into batches; each batch is bound to an appropriate fragment program (see the sketch below)
  • All this setup is done once only, at the beginning of time; it depends only on mesh connectivity
  [Figure: rows with 15, 13, 13, 12, 12, 11, 10, 9, ... non-zero entries packed into rectangular batches]
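A rough sketch of such a greedy packing (my own simplification in Python; the real batch shape, e.g. the 26 x 18 rectangles of the previous slide, comes from the fragment-pipe experiments):

```python
def pack_rows(nnz_per_row, batch_size):
    """Greedily pack row indices into fixed-size batches.

    Rows are sorted by non-zero count so that each batch holds rows
    of similar cost; every batch can then be bound to a fragment
    program unrolled for (roughly) that many non-zeros.
    """
    order = sorted(range(len(nnz_per_row)),
                   key=lambda i: nnz_per_row[i], reverse=True)
    return [order[k:k + batch_size]
            for k in range(0, len(order), batch_size)]

# Example: non-zero counts like those pictured on the slide.
batches = pack_rows([15, 13, 13, 12, 12, 11, 10, 9, 9, 9, 9, 8, 8], 9)
```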

  11. Recomputing the Matrix
  • Matrix entries depend on the surface, so we must "render" into the matrix
  • Two additional indirection textures: previous and next

  12. Results (NV30 @ 500 MHz)
  • 37k elements
  • Matrix multiply: 33 instructions, 120 multiplies per second; only 13 flops, so it is latency limited
  • Reduction: 7 instructions/fragment/pass, 3400 reductions per second
  • CG solve: 20 solves per second

  13. Regular Grids
  • Poisson solver as the example; multigrid approach
  • This time the variables live on the "pixel grid"
  • E.g. Navier-Stokes: after discretization, solve a Poisson equation at each time step

  14. Poisson Equation
  • Appears all over the place: easy to discretize on a regular grid; matrix multiply is stencil application
  • FD Laplace stencil at (i,j):
      0  1  0
      1 -4  1
      0  1  0
  • Use an iterative matrix solver: just need application of the stencil; easy, just like filtering
  • Incorporate geometry (Jacobian); variable coefficients
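Stencil application really is just like filtering. A sketch in Python/NumPy over the interior points of a grid (boundary handling omitted):

```python
import numpy as np

def apply_laplace_stencil(u):
    """y = A u for the 5-point FD Laplace stencil
    (0 1 0 / 1 -4 1 / 0 1 0), interior points only."""
    y = np.zeros_like(u)
    y[1:-1, 1:-1] = (u[1:-1, 2:] + u[1:-1, :-2] +
                     u[2:, 1:-1] + u[:-2, 1:-1] -
                     4.0 * u[1:-1, 1:-1])
    return y
```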

  15. Multigrid
  • Fine-to-coarse-to-fine cycle: Relax, Project, Interpolate
  • High-frequency error is removed quickly; lower-frequency error takes longer
  [Diagram: V-cycle of Relax / Projection steps down to the coarsest grid, then Interpolation / Relax steps back up]
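A schematic V-cycle in Python; `relax`, `project`, `interpolate`, `apply_A`, and `coarse_solve` are assumed to exist per level (the names are mine), mirroring the Relax / Project / Interpolate pattern of the slide:

```python
def v_cycle(level, u, f, relax, project, interpolate, apply_A,
            coarse_solve, nu1=2, nu2=2):
    """One multigrid V-cycle on `level` (0 = coarsest grid)."""
    if level == 0:
        return coarse_solve(f)            # direct solve on the tiny grid
    for _ in range(nu1):
        u = relax(level, u, f)            # pre-smoothing removes high-freq. error
    r = f - apply_A(level, u)             # fine-grid residual
    rc = project(r)                       # restrict residual to the coarse grid
    ec = v_cycle(level - 1, 0 * rc, rc,   # coarse-grid error correction
                 relax, project, interpolate, apply_A,
                 coarse_solve, nu1, nu2)
    u = u + interpolate(ec)               # bring the correction back up
    for _ in range(nu2):
        u = relax(level, u, f)            # post-smoothing
    return u
```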

  16. Computations and Storage Layout
  • Lots of stencil applications: matrix multiply is a 3x3 stencil; projection is a 3x3 stencil, 1/16 of (1 2 1 / 2 4 2 / 1 2 1); interpolation is 2x2(!), with a floor op in indexing
  • Storage for matrices and DOFs: variables in one texture; matrices in 9 (= 3x3) textures; all textures packed to exploit the 4 channels (x, y, z, w); domain decomposition; padded boundary
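On the CPU, the "matrix in 9 textures" layout can be sketched as nine coefficient arrays, one per 3x3 stencil slot, applied at every interior point (illustrative Python; channel packing and boundary padding omitted):

```python
import numpy as np

def apply_variable_stencil(coeff, u):
    """y = A u where A is a 3x3 stencil with per-point coefficients.

    coeff -- dict mapping offsets (di, dj) in {-1, 0, 1}^2 to arrays
             of the same shape as u; one array per stencil slot,
             playing the role of the nine coefficient textures.
    """
    y = np.zeros_like(u)
    for (di, dj), c in coeff.items():
        # shifted read of u, weighted by this slot's coefficients
        y[1:-1, 1:-1] += (c[1:-1, 1:-1] *
                          u[1 + di:u.shape[0] - 1 + di,
                            1 + dj:u.shape[1] - 1 + dj])
    return y
```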

  17. Coarser Matrices
  • The operator at the coarser level is needed for relaxation at all levels
  • Triple matrix product Ac = S Af P (projection S, fine operator Af, interpolation P): work out the terms and map them to stencils; exploit the local support of the stencils; straightforward but t-e-d-i-o-u-s
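As a point of reference, the same Galerkin-style triple product written out with SciPy sparse matrices; this is a 1-D analogue (my simplification), not the paper's 2-D stencil construction:

```python
import scipy.sparse as sp

n = 9                                   # fine grid points (odd)
nc = (n + 1) // 2                       # coarse grid points

# 1-D fine-level Laplace operator A_f (tridiagonal -1, 2, -1).
Af = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

# Linear interpolation P (coarse -> fine) and full-weighting
# projection S = 0.5 * P^T (fine -> coarse).
P = sp.lil_matrix((n, nc))
for j in range(nc):
    P[2 * j, j] = 1.0
    if 2 * j - 1 >= 0:
        P[2 * j - 1, j] = 0.5
    if 2 * j + 1 < n:
        P[2 * j + 1, j] = 0.5
P = P.tocsr()
S = 0.5 * P.T

Ac = S @ Af @ P                          # coarse operator, still a local stencil
```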

  18. Results (NV30 @ 500 MHz)
  • 257x257 grid
  • Matrix multiply: 27 instructions, 1370 multiplies per second
  • Interpolation: 10 instructions; projection: 19 instructions
  • Overall performance: 257x257 at 80 fps!

  19. Conclusions
  • Enhancements we would like: global registers for reductions; texture fetch with offset; rectangular texture borders; scalar versus vector problems
  • Where are we now? A good streaming processor, twice as fast as the CPU implementation, with lots of room for improvement
  • Scientific computing compiler: better languages! Brook? C*? Manage layout in a buffer
