On Using Graphics Hardware for Scientific Computing
Stan Tomov
June 23, 2006
Outline • Motivation • Literature review • The graphics pipeline • Programmable GPUs • Some application examples • Performance results • Conclusion
Motivation Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.
Motivation • High flops count (currently ~200 GFlops, single precision) • Competitive price/performance (less than 1 cent per MFlop) • Performance doubling every 6 months • Continuously increasing functionality and programmability • Realistic games require more complicated physics (picture: from the GPU Gems 2 book)
Literature review Using graphics hardware for non-graphics applications (just a few examples): • Cellular automata • Reaction-diffusion simulation (Mark Harris, University of North Carolina) • Matrix multiply (E. Larsen and D. McAllister, University of North Carolina) • Lattice Boltzmann (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook) • CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia) • Convolution (University of Stuttgart) • BLAS 1, 2; FFT; certain eigensolvers; etc. • See also the GPGPU homepage: http://www.gpgpu.org/
Literature review Typical performance results reported (by the middle of 2003): • Significant speedups of GPU over CPU are reported when the GPU performs low-precision computations (30 to 60 times, depending on the configuration) - integers (8- or 12-bit arithmetic), 16-bit floating point • Vendor advertisements about very high performance assume low-precision arithmetic • NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second • The GPU's 32-bit flops performance is comparable to the CPU's (maybe 2-4 times faster, depending on application and configuration)
The graphics pipeline • GeForce 256 (August 1999) - allowed a certain degree of programmability - before: fixed-function pipeline • GeForce 3 (February 2001) - considered the first fully programmable GPU • GeForce 4 - partial 16-bit floating point arithmetic • NV30 - 32-bit floating point • Cg - high-level programming language
The graphics pipeline • GPUs: on their way to becoming programmable stream processors • Stream formulation of the graphics pipeline: all data viewed as streams and computation as kernels • Streaming • Efficient computation (enables efficient parallelism; deep pipeline) • Efficient communication (efficient off-chip communication; intermediate results kept on chip; deep pipelining allows a high degree of latency tolerance) (picture: from the GPU Gems 2 book)
Programmable GPUs (in particular NV30) • GPU programming model: streaming • Naturally addresses parallelism and communication • Easy when the problem maps well • Supports floating point operations • Vertex program • Replaces the fixed-function pipeline for vertices • Manipulates single-vertex data • Executes for every vertex • Fragment program • Similar to the vertex program, but for pixels • Programming in Cg: • High-level language; looks like C; portable; Cg programs compile to assembly code
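The streaming model above can be sketched on the CPU: a fragment program is a pure, side-effect-free kernel applied independently to every element of an input stream, which is what makes the model trivially data-parallel. A minimal Python sketch (an illustration, not Cg code):

```python
# CPU sketch of the GPU streaming model: the "fragment program" is a pure
# function applied independently to every element of the input stream.

def fragment_program(x):
    # per-element kernel: any side-effect-free computation
    return 2.0 * x + 1.0

def run_kernel(kernel, stream):
    # the hardware runs this loop in parallel across many pixel processors
    return [kernel(x) for x in stream]

output = run_kernel(fragment_program, [0.0, 1.0, 2.0])
# output == [1.0, 3.0, 5.0]
```

Because the kernel sees only its own element, the hardware is free to schedule elements in any order and keep intermediate results on chip.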
Block Diagram of GeForce FX • AGP 8x graphics bus bandwidth: 2.1 GB/s • Local memory bandwidth: 16 GB/s • Chip officially clocked at 500 MHz • Vertex processor: - executes vertex shaders or emulates fixed transformations and lighting (T&L) • Pixel processor: - executes pixel shaders or emulates fixed shaders - 2 int & 1 float ops, or 2 texture accesses, per clock cycle • Texture & color interpolators - interpolate texture coordinates and color values • Performance (on processing 4D vectors): • Vertex ops/sec - 1.5 Gops • Pixel ops/sec - 8 Gops (int), or 4 Gops (float) Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.
Block Diagram of GeForce FX (continued) • 3 vertex and 8 pixel processors • Latest NVIDIA card: dual-GPU GeForce 7950 GX2, with 32 vertex and 96 pixel processors
Summary of CPU vs GPU • General vs specialized hardware • CPUs have more complex control hardware • GPUs can have hardware acceleration for specific tasks • Sequential vs parallel programming models • In general CPUs don't have the GPU's level of data parallelism (though some is available: Intel's SSE and PowerPC's AltiVec instruction sets) • Memory latency vs bandwidth optimization
Some application examples • Monte Carlo simulations • Used in a variety of simulations in physics, finance, chemistry, etc. • Based on probability and statistics; use random numbers • A classical example: compute the area of a circle • Computation of expected values • N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, N = k^(1024 x 1024) • Random number generation. We used a linear congruential type generator: x_{n+1} = (a x_n + c) mod m
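The classical circle-area example above can be sketched in a few lines: sample random points in the enclosing square and count the fraction that falls inside the circle. This is an illustrative CPU sketch, not the slides' GPU code; the linear congruential generator constants below (from Numerical Recipes) are an assumed, common choice:

```python
# Monte Carlo estimate of a circle's area, driven by a linear congruential
# generator x_{n+1} = (a*x_n + c) mod m.

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m              # uniform float in [0, 1)

def circle_area(n_samples, radius=1.0, seed=12345):
    rng = lcg(seed)
    hits = 0
    for _ in range(n_samples):
        # sample a point in the square [-r, r] x [-r, r]
        x = (2 * next(rng) - 1) * radius
        y = (2 * next(rng) - 1) * radius
        if x * x + y * y <= radius * radius:
            hits += 1
    # area = (fraction of hits) * (area of the enclosing square)
    return hits / n_samples * (2 * radius) ** 2

estimate = circle_area(100000)   # close to pi for radius = 1
```

On the GPU, each sample (or a tile of samples) becomes one fragment, and the final count is a reduction over the output texture.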
Some application examples • Monte Carlo simulations • Ising model • Simplified model for magnets • Evolve the system into “higher probability” states and compute expected values as average over only those states • Percolation • In studies of disease spreading, flow in porous media, forest fire propagation, clustering, etc. • Lattice Boltzmann method • Simulate fluid flow; particles are allowed to move and collide on a lattice
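The Ising bullet above ("evolve the system into higher-probability states, average over those states") is usually realized with the Metropolis algorithm. A rough CPU sketch, assumed for illustration and not the slides' implementation (coupling J = 1, periodic boundaries):

```python
# One Metropolis sweep of the 2D Ising model on an n x n lattice of +/-1 spins.
import math
import random

def metropolis_sweep(spins, beta, rng):
    n = len(spins)
    for i in range(n):
        for j in range(n):
            # energy change from flipping spin (i, j): dE = 2 * s * sum(neighbors)
            nb = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j] +
                  spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
            dE = 2 * spins[i][j] * nb
            # accept the flip with probability min(1, exp(-beta * dE))
            if dE <= 0 or rng.random() < math.exp(-beta * dE):
                spins[i][j] = -spins[i][j]
    return spins

rng = random.Random(0)
n = 16
spins = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(n)]
for _ in range(50):
    metropolis_sweep(spins, beta=1.0, rng=rng)
m = abs(sum(map(sum, spins))) / n**2   # magnetization per spin, in [0, 1]
```

A GPU version updates the lattice in a checkerboard pattern so that all spins of one color can be flipped in parallel as fragments.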
Some performance results • saxpy on 512 x 512 (x 4) vectors: 1 GFlop • speed limited by GPU memory bandwidth (16 GB/s) • sin, cos, exp, log: 20 times faster than on a Pentium 4, 2.8 GHz • hardware accelerated, but of low accuracy • Ising model: 7 GFlops • 44% of the theoretical maximum • fragment program compiled to 109 assembly instructions
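The saxpy benchmark above (y := a*x + y) is bandwidth-bound: it does 2 flops per element but reads two values and writes one, so throughput is limited by memory traffic rather than arithmetic. A plain Python sketch of the operation itself:

```python
# saxpy: y := a*x + y, one output element per input pair.  On the GPU this
# runs as a fragment program over a 512 x 512 (x 4 components) texture.

def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

y = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# y == [12.0, 24.0, 36.0]
```

At 2 flops per 12 bytes moved (three 4-byte floats), 16 GB/s of bandwidth caps the kernel near the ~1 GFlop the slide reports.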
Conclusions • What to expect from future GPGPUs? Can GPGPUs influence future computer systems (HPC, and consequently our models of software development: is IBM's Cell processor already an example)? • Current trends: • CPU: multi-core • GPU: more powerful streaming model (gather, scatter, conditional streams, reduction, etc.); more CPU functionality