410 likes | 674 Views
Modified from: A Survey of General-Purpose Computation on Graphics Hardware. John Owens University of California, Davis. David Luebke University of Virginia. with Naga Govindaraju, Mark Harris, Jens Kr ü ger, Aaron Lefohn, Tim Purcell. Motivation: The Potential of GPGPU. In short:
E N D
Modified from:A Survey of General-Purpose Computation on Graphics Hardware John Owens University of California, Davis David Luebke University of Virginia with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell
Motivation: The Potential of GPGPU • In short: • The power and flexibility of GPUs makes them an attractive platform for general-purpose computation • Example applications range from in-game physics simulation to conventional computational science • Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
Problems: Difficult To Use • GPUs designed for & driven by video games • Programming model unusual • Programming idioms tied to computer graphics • Programming environment tightly constrained • Underlying architectures are: • Inherently parallel • Rapidly evolving (even in basic feature set!) • Largely secret • Can’t simply “port” CPU code!
STAR Goals • Detailed & useful survey of general-purpose computing on graphics hardware • Hardware and software developments behind GPGPU • Building blocks: techniques for mapping general-purpose computation to the GPU • Applications: important applications of GPGPU • A comprehensive GPGPU bibliography
NVIDIA GeForce 6800 3D Pipeline Vertex Triangle Setup Z-Cull Shader Instruction Dispatch Fragment L2 Tex Fragment Crossbar Composite Memory Partition Memory Partition Memory Partition Memory Partition Courtesy Nick Triantos, NVIDIA
Application specifies geometry rasterized • Each fragment is shaded w/ SIMD program • Shading can use values from texture memory • Image can be used as texture on future passes Programming a GPU for Graphics
Draw a screen-sized quad stream • Run a SIMD kernel over each fragment • “Gather” is permitted from texture memory • Resulting buffer can be treated as texture on next pass Programming a GPU for GP Programs
Feedback • Each algorithm step depend on the results of previous steps • Each time step depends on the results of the previous time step
CPU-GPU Analogies . . . Grid[i][j]= x; . . . Array Write = Render to Texture CPU GPU
CPU-GPU Analogies CPU GPU Stream / Data Array = Texture Memory Read = Texture Sample
Kernels Kernel / loop body / algorithm step = Fragment Program CPU GPU
Scatter vs. Gather • Grid communication • Grid cells share information • Gather • Indirect read from memory ( x = a[i]) • Naturally maps to a texture fetch • Used to access data structures and data streams • Scatter • Indirect write to memory (a[i] = x) • Difficult to emulate: • Usually done on CPU
Computational Resources Inventory • Programmable parallel processors • Vertex & Fragment pipelines • Rasterizer • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants • Texture unit • Read-only memory interface • Render to texture • Write-only memory interface
Vertex Processor • Fully programmable (SIMD / MIMD) • Processes 4-vectors (RGBA / XYZW) • Capable of scatter but not gather • Can change the location of current vertex • Cannot read info from other vertices • Can only read a small constant memory • Latest GPUs: Vertex Texture Fetch • Random access memory for vertices • Gather (But not from the vertex stream itself)
Fragment Processor • Fully programmable (SIMD) • Processes 4-component vectors (RGBA / XYZW) • Random access memory read (textures) • Capable of gather but not scatter • RAM read (texture fetch), but no RAM write • Output address fixed to a specific pixel • Typically more useful than vertex processor • More fragment pipelines than vertex pipelines • Direct output (fragment processor is at end of pipeline)
GPGPU Building Blocks • fundamental techniques & computational building blocks: • Flow control (a very fundamental building block) • Stream operations • Data structures • Differential equations & linear algebra • Data queries
Flow control • Surprising number of issues on GPUs • Main themes: • Avoid branching when possible • Move branching earlier in the pipeline when possible • Largely SIMD coherent branching most efficient • Mechanisms: • Rasterized geometry • Z-cull • Occlusion query
Domain Decomposition • Avoid branches where outcome is fixed • One region is always true, another false • Separate FPs for each region, no branches • Example: boundaries
Flat 3D Textures • Advantages • One texture update per operation • Better use of GPU parallelism • Non-power-of-two Textures • Quick simulation preview • Disadvantage • Must compute texture offsets
Staggered Simulation • Non-interactive application: • Simulate as fast as possible • Frame rate suffers 20ms
Staggered Simulation • Interactive frame rate! • Simulation still proceeds pretty fast 10 20ms
Z-Cull • In early pass, modify depth buffer • Write depth=0 for pixels that should not be modified by later passes • Write depth=1 for rest • Subsequent passes • Enable depth test (GL_LESS) • Draw full-screen quad at z=0.5 • Only pixels with previous depth=1 will be processed • Can also use early stencil test • Note: Depth replace disables ZCull
Pre-computation • Pre-compute anything that will not change every iteration! • Example: arbitrary boundaries • When user draws boundaries, compute texture containing boundary info for cells • Reuse that texture until boundaries modified • Combine with Z-cull for higher performance!
Stream Operations • Several stream operations in GPGPU toolkit: • Map: apply a function to every element in a stream • Reduce: use a function to reduce a stream to a smaller stream (often 1 element) • Scatter/gather: indirect read and write • Filter: select a subset of elements in a stream • Sort: order elements in a stream • Search: find a given element, nearest neighbors, etc
Simple Fire Effect • Blur and scroll upward • Trails of blur emerge from bright source ‘embers’ at the bottom VA VC VB VD
Cellular Automata • Great for generating noise and other animated patterns to use in blending • Game of Life in a Pixel Shader • Cell ‘state’ relative to the rules is computed at each texel • Dependent texture read • State accesses ‘rules’ table, which is a texture • Highly complex rules are easy! • The Rules • For a space that is 'populated': • Each cell with one or no neighbors dies, • as if by loneliness. • Each cell with four or more neighbors dies, • as if by overpopulation. • Each cell with two or three neighbors survives. • For a space that is 'empty' or 'unpopulated' • Each cell with three neighbors becomes populated
Lattice Computations • How far can we take them? • Anything we can describe with discrete PDE equations! • Discrete in space and time • Also other approximations
Approximate Methods • Several different approximations • Cellular Automata (CA) • Coupled Map Lattice (CML) • Lattice-Boltzmann Methods (LBM)
Coupled Map Lattice • Mapping: • Continuous state lattice nodes • Coupling: • Nodes interact with each other to produce new state according to specified rules
Coupled Map Lattice • CML introduced by Kaneko (1980s) • Used CML to study spatio-temporal chaos • Others adapted CML to physical simulation: • Boiling [Yanagita 1992] • Convection [Yanagita 1993] • Clouds [Yanagita 1997; Miyazaki 2001] • Chemical reaction-diffusion [Kapral ‘93] • Saltation (sand ripples / dunes) [ Nishimori ‘93] • And more
CML vs. CA • CML extends cellular automata (CA)
CML vs. CA • Continuous state is more useful • Discrete: physical quantities difficult • Must filter over many nodes to get “real” values • Continuous: physical quantities easy • Real physical values at each node • Temperature, velocity, concentration, etc.
Rules? • CML updated via simple, local rules • Simple: same rule applied at every cell (SIMD) • Local: cells updated according to some function of their neighbors’ state
Example: Buoyancy • Used in temperature-based boiling simulation • At each cell: • If neighbors to left and right of cell are warmer, raise the cell’s temperature • If neighbors are cooler, lower its temperature
CML Operations • Implement operations as building blocks for use in multiple simulations • Diffusion • Buoyancy (2 types) • Latent Heat • Advection • Viscosity / Pressure • Gray-Scott Chemical Reaction • Boundary Conditions • User interaction (drawing) • Transfer function (color gradient)
Anatomy of a CML operation • Neighbor Sampling • Select and read values, v, of nearby cells • Computation on Neighbors • Compute f(v) for each sample (f can be arbitrary computation) • Combine new values (arithmetic) • Store new values back in lattice
Graphics Hardware • Why use it? • Speed: up to 25x speedup in our sims • GPU perf. grows faster than CPU perf. • Cheap: GeForce 4 Ti 4200 < $130 • Load balancing in complex applications • Why not use it? • Low precision computation (not anymore!) • Difficult to program (not anymore!)
Simulating the world • Simulate a wide variety of phenomena on GPUs • Anything we can describe with discrete PDEs or approximations of PDEs