Enhancing GPU for Scientific Computing: Some thoughts
Outline • Motivation • Related work • BLAS Library • Execution Model • Benchmarks • Recommendations
Motivation • GPU Computing • Vector and Fragment Processor streaming (super)-computers • enormous performance! • ATI 9700, NV30 • They have become programmable • Emerging application areas • Numerical simulation [Schroder’03], sorting, genomics, etc. • Goal: Scientific Computing
Motivation • Most software is built from small, efficient parts • Scientific apps are built on top of software library routines • Harnessing GPU resources • Arithmetic intensive • Data parallel • BLAS Library
Related work • Using non-programmable GPUs • [Erik’01] prog. vertex engine for lighting/morphing • [Oskin’02] vector processing using VP • [Ian’03] stream processing using FP • Problems : • Monolithic Big Programs • One of VP or FP • CPU – Passive Mode • No Cascading Loop-backs (Parallelism, Setup Times)
BLAS Library • BLAS (Basic Linear Algebra Subprograms) • Building blocks for vector and matrix operations • development of highly efficient linear algebra software • LINPACK and LAPACK • Operations • Scalar – Vector • Vector – Vector • Vector – Matrix • Matrix – Matrix
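The four operation classes listed above can be sketched in plain Python. This is only an illustration of what each class computes: the function names borrow the conventional BLAS routine names (scal, axpy, gemv, gemm), but the signatures are simplified assumptions and not the library's actual interfaces (for instance, the real gemv and gemm also take scaling factors and an accumulator operand).

```python
# Pure-Python sketch of the four BLAS operation classes.
# Names follow BLAS conventions; signatures are simplified for illustration.

def scal(alpha, x):
    """Scalar - Vector: return alpha * x."""
    return [alpha * xi for xi in x]


def axpy(alpha, x, y):
    """Vector - Vector: return alpha * x + y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Vector - Matrix: return A * x, with A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Matrix - Matrix: return A * B, both given as lists of rows."""
    columns = list(zip(*b))  # transpose B once so columns are easy to walk
    return [[sum(p * q for p, q in zip(row, col)) for col in columns]
            for row in a]
```

Each routine is data parallel (every output element can be computed independently), which is the property that makes these building blocks candidates for a GPU mapping.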
Mapping • Operation to processor • CPU: all operations • VP: non-matrix operations (no memory access) • FP: all operations • Restricted data-flows • CPU FP • VP CPU
Execution graph: Vector Scalar Add Operation • In this example, the CPU function vsAdd segments a vector of length n into m vectors of length 4. • The vertex program vsAdd.cg is loaded onto the vertex processor, and the scalar value is passed to it as a parameter. • The CPU function vsAdd then streams the set of m vectors to the GPU as OpenGL point primitives (GL_POINTS). The vertex program vsAdd.cg adds the scalar value to every field of the m vertices. • These vertices then pass through the fragment processor, which runs no program, and are written to framebuffer memory. • The CPU function vsAdd then reads the color values back from the pixel that represents each vertex. These color values contain the result of the vector scalar add. • Lastly, the CPU function concatenates the sequence of color values into a result vector of length n.
[Diagram: vAdd on CPU (Vectors) → [Vertex]m as GL_POINTS → Vertex Processor [vAdd.cg] → [Vertex]m → Fragment Processor [None] → PBuffer / Texture Mem (TextureData m) → [Texture Color values]m → vAdd on CPU (Vectors)]
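The vector scalar add flow described above can be mimicked on the CPU as a minimal sketch. Everything here is a stand-in: `segment`, `vertex_program`, and `vs_add` are hypothetical names, the list comprehension plays the role of the vertex processor, and zero-padding the last 4-component vertex is an assumption, since the slide does not say how lengths that are not a multiple of 4 are handled.

```python
# CPU-only simulation of the vector scalar add pipeline (no real GPU calls).

def segment(vec, width=4, pad=0.0):
    """CPU step: split vec into vertices of `width` fields.

    The last vertex is zero-padded; this padding rule is an assumption.
    """
    padded = list(vec) + [pad] * (-len(vec) % width)
    return [tuple(padded[i:i + width]) for i in range(0, len(padded), width)]


def vertex_program(vertex, scalar):
    """Stand-in for vsAdd.cg: add the scalar parameter to every field."""
    return tuple(field + scalar for field in vertex)


def vs_add(vec, scalar):
    vertices = segment(vec)                       # n values -> m vertices
    # "GPU": one vertex-program invocation per point primitive; the
    # fragment stage runs no program, so values land in the framebuffer as is.
    framebuffer = [vertex_program(v, scalar) for v in vertices]
    # CPU: read the colour values back and concatenate them.
    flat = [channel for pixel in framebuffer for channel in pixel]
    return flat[:len(vec)]                        # drop the padding again
```

For example, `vs_add([1.0, 2.0, 3.0, 4.0, 5.0], 10.0)` packs the input into two vertices and returns a list of five results.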
Execution graph: Vector Vector Add Operation • In this example, the CPU function vAdd transforms 2 vectors of length s into texture data. • The fragment program vAdd.cg and the texture data are loaded onto the fragment processor and into GPU memory, respectively. • The CPU function vAdd then draws a quadrilateral primitive (GL_QUAD) covering s pixels. • The vertex processor runs no program and passes the vertices on to the rasterizer. • The rasterizer creates the s pixels for fragment processing. • For each pixel, the fragment processor looks up the values from both textures and computes the color value of the pixel. These pixels are written to PBuffer memory. • The CPU function vAdd then reads the color values back from the pixels. These color values contain the result of the vector vector add. • The output in the PBuffer is then converted into a texture entry. • Lastly, the CPU function reads the texture entry and concatenates the sequence of color values into a result vector of length s.
[Diagram: vAdd on CPU → GL_QUAD, [Vertex4]m → Vertex Processor [None] → [Vertex4]m → Fragment Processor [vAdd.cg], with TextureData1 m and TextureData2 m in Texture Mem → PBuffer → TextureData3 m → [Texture Color values]m → vAdd on CPU (Vectors)]
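As a rough CPU-side sketch of the vector vector add steps above: the two input vectors become "textures", a quad covering s pixels is "rasterized", and a per-pixel function stands in for the fragment program. The helper names are hypothetical, and storing one vector element per texel is an assumption taken from the statement that the quad has s pixels.

```python
# CPU-only simulation of the vector vector add pipeline (no real GPU calls).

def to_texture(vec):
    """CPU step: turn a vector into texture data, one element per texel
    (assumed layout)."""
    return list(vec)


def rasterize_quad(num_pixels):
    """Rasterizer stand-in: produce the pixel coordinates covered by the quad."""
    return range(num_pixels)


def fragment_program(pixel, tex_a, tex_b):
    """Stand-in for vAdd.cg: look up both textures at this pixel and add."""
    return tex_a[pixel] + tex_b[pixel]


def v_add(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    tex_a, tex_b = to_texture(a), to_texture(b)   # load texture data
    # One fragment-program invocation per pixel; results go to the "PBuffer".
    pbuffer = [fragment_program(p, tex_a, tex_b)
               for p in rasterize_quad(len(a))]
    return list(pbuffer)                          # CPU reads the colours back
```

The vertex stage does not appear in the sketch because it runs no program in this mapping; all arithmetic happens per pixel.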
Execution graph: 2 Vector Vector Add Operations • In this example, we perform 2 separate vector vector add operations. • The 1st operation proceeds as described earlier for the vector vector add operation. • The output of the 1st operation is used as input for the 2nd operation. • Since it is the same operation, we do not load a new vertex or fragment program; we only load new texture data. • The 2nd operation then proceeds as normal. • Lastly, the CPU function concatenates the sequence of color values into a result vector of length s.
[Diagram: vAdd on CPU → GL_QUAD, [Vertex4]m → Vertex Processor [None] → [Vertex4]m → Fragment Processor [vAdd.cg], with TextureData3 m and TextureData4 m in Texture Mem → PBuffer → [Texture Color values]m → vAdd on CPU (Vectors)]
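The saving in the cascaded case can be illustrated with a small bookkeeping sketch. `FakeGPU` is a hypothetical stand-in that only counts program loads and texture uploads; it makes no real graphics calls. The assumption it encodes is the one stated above: the second add reuses the already loaded fragment program and the first result stays on the GPU as a texture, so only one new texture is uploaded.

```python
# Bookkeeping simulation of two chained vector vector adds: (a + b) + c.

class FakeGPU:
    """Hypothetical stand-in that counts setup work instead of doing it."""

    def __init__(self):
        self.program = None
        self.program_loads = 0
        self.texture_uploads = 0

    def load_program(self, program):
        # Skip the load when the same program is already resident.
        if self.program is not program:
            self.program = program
            self.program_loads += 1

    def upload(self, vec):
        self.texture_uploads += 1
        return list(vec)

    def render(self, tex_a, tex_b):
        # One fragment per pixel; the result stays on the GPU as a texture.
        return [self.program(x, y) for x, y in zip(tex_a, tex_b)]


def add_fragment(x, y):
    """Stand-in for the fragment program vAdd.cg."""
    return x + y


def v_add_twice(a, b, c):
    gpu = FakeGPU()
    gpu.load_program(add_fragment)           # 1st operation: load the program
    tex1, tex2 = gpu.upload(a), gpu.upload(b)
    tex3 = gpu.render(tex1, tex2)            # output kept as texture data
    gpu.load_program(add_fragment)           # 2nd operation: no reload needed
    tex4 = gpu.upload(c)                     # only the new operand is uploaded
    result = gpu.render(tex3, tex4)
    return list(result), gpu                 # single read-back at the end
```

Calling `v_add_twice(a, b, c)` and inspecting the returned `gpu` object shows one program load and three texture uploads for the two operations, instead of two loads and four uploads if each add were run in isolation.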
Performance Issues • Representation inefficiency • Memory • Data stored both in CPU and GPU • Communication costs • Loading data onto GPU • Reading data from GPU • Execution inefficiencies • Computation setup overhead • Remodeling CPU data for GPU • Problem execution time • Rendering • Texture lookups
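One way to read the communication and execution costs listed above is as terms of a simple additive model. This is an illustrative decomposition only, assuming the stages do not overlap; the symbols are introduced here and do not come from measurements. Here the setup term covers computation setup and remodeling CPU data for the GPU, the upload term covers loading data onto the GPU, the render term covers rendering and texture lookups, and the readback term covers reading data from the GPU.

```latex
T_{\mathrm{GPU}} = T_{\mathrm{setup}} + T_{\mathrm{upload}} + T_{\mathrm{render}} + T_{\mathrm{readback}},
\qquad \text{offloading pays off only if } T_{\mathrm{GPU}} < T_{\mathrm{CPU}}
```

Under this reading, an operation with little arithmetic per element has a small render term, so the fixed setup and transfer terms dominate the total.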
Observations • Fixed-point operations are much faster than FP16/FP32 (floating-point) operations • FP16 and FP32 operations have similar performance • The VP (vertex processor) is slower than the FP (fragment processor) • Operation mappings involving both VP and FP result in an inefficient pipeline
Observations • Simple operations perform better on CPU • Best to design whole algorithm as single VP/FP program • Memory cost for storing intermediate results • Execution cost ? • More textures result in decreased performance
Bug Reports Filed! • Incorrect dump of floating point values after render to texture [NVIDIA confirmed] • cgSetcolor parameter does not update alpha values [Awaiting reply]
Recommendations (3D Graphics Hackers) • Load important data into Video memory • Maximum use of Fixed-point Pipeline • Code optimization important (Instr., Memory) • Upgrade your video card drivers (must!) • Hacking graphics hardware is a *real* pain!
Recommendations (Cg) • Pointers are meaningful for numerical computing • Texture fetch instructions (add. Offsets) • Accumulation registers (sum) • Preserving state across multiple calls • Introduce stack mechanisms • Introduce bitwise operators
Recommendations (Hardware) • Allow GPU to read/write from CPU memory • VP and FP as 1st class processors on GPU • Similar cores and instruction sets • Allow full parallelism • Allow CPU to read/write all registers in GPU processors • Introduce a stack • Introduce bitwise operators
Deliverables! • A draft subset of the BLAS library • Architecture Insights (issues/constraints) • NV30 Improvements (Bug reports) • Technical Write-up