A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware • Nolan Goodnight, Cliff Woolley, Gregory Lewin, David Luebke, Greg Humphreys • University of Virginia
General-Purpose GPU Programming • Why do we port algorithms to the GPU? • How much faster can we expect it to be, really? • What is the challenge in porting?
Case Study • Problem: Implement a Boundary Value Problem (BVP) solver using the GPU • Could benefit an entire class of scientific and engineering applications, e.g.: • Heat transfer • Fluid flow
Related Work • Krüger and Westermann: Linear Algebra Operators for GPU Implementation of Numerical Algorithms • Bolz et al.: Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid • Very similar to our system • Developed concurrently • Complementary approach
Driving problem: Fluid mechanics sim • Problem domain is a warped disc [Figure: the warped disc shown alongside a regular grid]
BVPs: Background • Boundary value problems are sometimes governed by PDEs of the form: $L\phi = f$ • $L$ is some operator • $\phi$ is the unknown function over the problem domain • $f$ is a forcing function (source term) • Given $L$ and $f$, solve for $\phi$.
BVPs: Example Heat Transfer • Find a steady-state temperature distribution $T$ in a solid of thermal conductivity $k$ with thermal source $S$ • This requires solving a Poisson equation of the form: $k\nabla^2 T = -S$ • This is a BVP where $L$ is the Laplacian operator $\nabla^2$ • All our applications require a Poisson solver.
BVPs: Solving • Most such problems cannot be solved analytically • Instead, discretize onto a grid to form a set of linear equations, then solve: • Direct elimination • Gauss-Seidel iteration • Conjugate-gradient • Strongly implicit procedures • Multigrid method
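As a hedged illustration of the discretization step, take the heat-transfer Poisson equation on a uniform grid. The grid spacing $h$ and the indices $i, j$ are assumed notation, not taken from the slides. The standard second-order central difference gives one linear equation per interior grid point:

```latex
k\,\frac{T_{i-1,j} + T_{i+1,j} + T_{i,j-1} + T_{i,j+1} - 4\,T_{i,j}}{h^{2}} = -S_{i,j}
\quad\Longrightarrow\quad
T_{i,j} = \frac{1}{4}\left( T_{i-1,j} + T_{i+1,j} + T_{i,j-1} + T_{i,j+1} + \frac{h^{2}}{k}\,S_{i,j} \right)
```

The second form is the point update that an iterative method such as Gauss-Seidel applies repeatedly.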
Multigrid method • Iteratively corrects an approximation to the solution • Operates at multiple grid resolutions • Low-resolution grids are used to correct higher-resolution grids recursively • Very fast, especially for large grids: O(n) in the number of grid points n
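Multigrid method • Use coarser grid levels to recursively correct an approximation to the solution • Algorithm: • smooth • residual = $L\phi_i - f$, where $\phi_i$ is the current approximation • restrict • recurse • interpolate • Stencils shown on the slide: 5-point Laplacian (center -4, four edge neighbors 1) • restriction [1/16 1/8 1/16; 1/8 1/4 1/8; 1/16 1/8 1/16] • interpolation [1/4 1/2 1/4; 1/2 1 1/2; 1/4 1/2 1/4]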
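The five steps above can be sketched on the CPU as a recursive V-cycle. This is a minimal pure-Python sketch under stated assumptions, not the authors' GPU code: it solves the Poisson problem on a square grid of size 2^k + 1 with fixed boundary values, uses Gauss-Seidel as the smoother (the slides do not say which smoother is used), and all function names (`smooth`, `residual`, `restrict`, `interpolate`, `vcycle`) are hypothetical. The unknown is called `u` here.

```python
def zeros(n):
    return [[0.0] * n for _ in range(n)]

def smooth(u, f, h, sweeps=2):
    """Gauss-Seidel sweeps for the 5-point Laplacian; boundary values stay fixed."""
    n = len(u)
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1]
                                  + u[i][j + 1] - h * h * f[i][j])

def residual(u, f, h):
    """r = L u - f at interior points (same sign convention as the slide)."""
    n = len(u)
    r = zeros(n)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   - 4.0 * u[i][j]) / (h * h)
            r[i][j] = lap - f[i][j]
    return r

def restrict(r):
    """Full weighting onto the next coarser grid (1/4, 1/8, 1/16 stencil)."""
    nc = (len(r) - 1) // 2 + 1
    rc = zeros(nc)
    for I in range(1, nc - 1):
        for J in range(1, nc - 1):
            i, j = 2 * I, 2 * J
            rc[I][J] = (4.0 * r[i][j]
                        + 2.0 * (r[i - 1][j] + r[i + 1][j] + r[i][j - 1] + r[i][j + 1])
                        + r[i - 1][j - 1] + r[i - 1][j + 1]
                        + r[i + 1][j - 1] + r[i + 1][j + 1]) / 16.0
    return rc

def interpolate(c):
    """Bilinear interpolation onto the next finer grid (1, 1/2, 1/4 stencil)."""
    nf = 2 * (len(c) - 1) + 1
    e = zeros(nf)
    for i in range(nf):
        for j in range(nf):
            I, J = i // 2, j // 2
            di, dj = i % 2, j % 2  # 1 when the fine point lies between coarse points
            e[i][j] = 0.25 * (c[I][J] + c[I + di][J] + c[I][J + dj] + c[I + di][J + dj])
    return e

def vcycle(u, f, h):
    """One V-cycle for L u = f: smooth, residual, restrict, recurse, interpolate."""
    n = len(u)
    if n <= 3:
        # Coarsest grid: a single interior unknown, solved exactly.
        u[1][1] = 0.25 * (u[0][1] + u[2][1] + u[1][0] + u[1][2] - h * h * f[1][1])
        return u
    smooth(u, f, h)
    r = residual(u, f, h)
    c = vcycle(zeros((n - 1) // 2 + 1), restrict(r), 2.0 * h)  # solves L c = r
    corr = interpolate(c)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] -= corr[i][j]  # since L(u - u_exact) = r, subtract the correction
    smooth(u, f, h)
    return u

def max_abs(grid):
    return max(abs(v) for row in grid for v in row)
```

A typical use is to call `vcycle(u, f, h)` repeatedly and stop when `max_abs(residual(u, f, h))` falls below a tolerance. The coarse solve starts from zero because it computes a correction, not the solution itself.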
Implementation • For each step of the algorithm: • Bind as texture maps the buffers that contain the necessary data • Set the target buffer for rendering • Activate a fragment program that performs the necessary kernel computation • Render a grid-sized quad with multitexturing [Figure: source buffer textures feed a fragment program that writes to the render target buffer]
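The render-pass pattern above can be mimicked on the CPU to make the data flow explicit. The sketch below is only an analogy, assuming a Jacobi-style smoothing kernel as the example computation; `render_pass` and `make_jacobi_kernel` are hypothetical names and no graphics API is involved. The "fragment program" is a function called once per texel, the sources are read-only, and the output goes to a separate target buffer.

```python
def render_pass(sources, kernel, width, height):
    """Emulate one pass: run `kernel` once per texel of a grid-sized quad.

    `sources` plays the role of the bound textures (read-only);
    the returned grid plays the role of the render target.
    """
    target = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            target[y][x] = kernel(sources, x, y)
    return target

def make_jacobi_kernel(spacing):
    """Build a hypothetical 'fragment program': one Jacobi update per texel."""
    def kernel(sources, x, y):
        u, f = sources
        h, w = len(u), len(u[0])
        if x in (0, w - 1) or y in (0, h - 1):
            return u[y][x]  # boundary texels keep their value
        return 0.25 * (u[y][x - 1] + u[y][x + 1] + u[y - 1][x] + u[y + 1][x]
                       - spacing * spacing * f[y][x])
    return kernel

# Usage: ping-pong between buffers, since a pass never reads its own target.
u = [[0.0] * 3 for _ in range(3)]
f = [[0.0] * 3 for _ in range(3)]
f[1][1] = -4.0
kernel = make_jacobi_kernel(1.0)
for _ in range(2):
    u = render_pass((u, f), kernel, 3, 3)
```

Because each pass reads only from its sources, the update is naturally of Jacobi type; swapping source and target between passes is what the ping-pong loop stands in for.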
Optimizing the Solver • Detect steady-state natively on GPU • Minimize shader length • Special-case whenever possible • Avoid context-switching
Optimizing the Solver: Steady-state • How to detect convergence? • L1 norm - average error • L2 norm - RMS error (common in visual sim) • L∞ norm - max error (common in sci/eng apps) • Can use occlusion query! [Figure: seconds to steady state vs. grid size]
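The three norms, and one plausible reading of the occlusion-query idea, can be sketched as follows. The assumption, which the slide does not spell out, is that a fragment program discards every texel whose error is within tolerance, so the occlusion query's count of surviving fragments is zero exactly when the max (L∞) error is within tolerance. `error_norms`, `fragments_passed`, and `converged` are hypothetical names.

```python
import math

def error_norms(err):
    """Return (average, RMS, max) of the absolute error over all grid points."""
    vals = [abs(v) for row in err for v in row]
    l1 = sum(vals) / len(vals)
    l2 = math.sqrt(sum(v * v for v in vals) / len(vals))
    linf = max(vals)
    return l1, l2, linf

def fragments_passed(err, tol):
    """Occlusion-query analogue: count texels whose |error| exceeds tol."""
    return sum(1 for row in err for v in row if abs(v) > tol)

def converged(err, tol):
    """Converged in the max norm when no fragment survives the test."""
    return fragments_passed(err, tol) == 0
```

The appeal of the counting form is that only a single integer has to be read back, rather than the whole error buffer.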
Optimizing the Solver: Shader length • Minimize number of registers used • Vectorize as much as possible • Use the rasterizer to perform computations of linearly-varying values • Pre-compute invariants on CPU
Optimizing the Solver: Special-case • Fast-path vs. slow-path • write several variants of each fragment program to handle boundary cases • eliminates conditionals in the fragment program • equivalent to avoiding CPU inner-loop branching [Figure: fast path, no boundaries; slow path, with boundaries]
Optimizing the Solver: Special-case [Figure: seconds per V-cycle vs. grid size]
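The fast-path/slow-path split can be illustrated with a CPU analogue. In this sketch the interior is processed by a branch-free kernel and only the boundary ring goes through a kernel that handles edges. The clamp-to-edge boundary treatment and all names are assumptions made for the example, not details from the slides.

```python
def slow_texel(u, x, y):
    """Slow path: clamps every neighbor fetch, so it is valid on the boundary."""
    n = len(u)
    def at(i, j):
        return u[min(max(j, 0), n - 1)][min(max(i, 0), n - 1)]
    return at(x - 1, y) + at(x + 1, y) + at(x, y - 1) + at(x, y + 1) - 4 * u[y][x]

def fast_texel(u, x, y):
    """Fast path: no bounds handling at all; only valid for interior texels."""
    return u[y][x - 1] + u[y][x + 1] + u[y - 1][x] + u[y + 1][x] - 4 * u[y][x]

def laplacian_two_paths(u):
    """Interior texels use the fast path; the boundary ring uses the slow path."""
    n = len(u)
    out = [[0.0] * n for _ in range(n)]
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            out[y][x] = fast_texel(u, x, y)
    for k in range(n):
        for x, y in ((k, 0), (k, n - 1), (0, k), (n - 1, k)):
            out[y][x] = slow_texel(u, x, y)
    return out

def laplacian_single_path(u):
    """Reference: the general (slow) kernel applied everywhere."""
    n = len(u)
    return [[slow_texel(u, x, y) for x in range(n)] for y in range(n)]
```

Both functions produce the same grid; the two-path version simply avoids paying for boundary handling on the interior texels, which are the large majority on big grids.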
Optimizing the Solver: Context-switching • Find the best packing of data from multiple grid levels into the pbuffer surfaces
Optimizing the Solver: Context-switching • Remove context switching • Can introduce operations with undefined results: reading/writing same surface • Why do we need to do this? • Can we get away with it? • What about superbuffers?
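To make the idea of packing several grid levels into one surface concrete, here is one illustrative layout. It is purely hypothetical and not necessarily the packing the authors chose: the finest level sits at the origin and the coarser levels are stacked in a column to its right, so every level can be addressed by an offset within a single buffer. Level sizes are assumed to be powers of two for simplicity.

```python
def pack_levels(n):
    """Return ([(x, y, size), ...], (surface_width, surface_height)).

    Hypothetical layout: level 0 (n x n) at (0, 0); levels n/2, n/4, ..., 1
    stacked top to bottom in a column starting at x = n.
    """
    rects = [(0, 0, n)]
    y, size = 0, n // 2
    while size >= 1:
        rects.append((n, y, size))
        y += size
        size //= 2
    return rects, (n + n // 2, n)

# Usage: offsets for an 8 x 8 finest grid.
rects, surface = pack_levels(8)
```

With every level in one surface, a restriction or interpolation pass reads from one rectangle and writes to another within the same buffer, which is the kind of same-surface read/write the slide flags as having undefined results.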
Data Layout • Performance [Figure: seconds to steady state vs. grid size]
Data Layout • Possible additional vectorization: Stacked domain • Compute 4 values at a time • Requires source, residual, solution values to be in different buffers • Complicates boundary calculations • Adds setup and teardown overhead
Results: CPU vs. GPU • Performance [Figure: seconds to steady state vs. grid size]
Conclusions What we need going forward: • Superbuffers • or: Universal support for multiple-surface pbuffers • or: Cheap context switching • Developer tools • Debugging tools • Documentation • Global accumulator • Ever increasing amounts of precision, memory • Textures bigger than 2048 on a side
Acknowledgements • Hardware: David Kirk, Matt Papakipos • Driver Support: Nick Triantos, Pat Brown, Stephen Ehmann • Fragment Programming: James Percy, Matt Pharr • General-purpose GPU: Mark Harris, Aaron Lefohn, Ian Buck • Funding: NSF Award #0092793