
Linear Algebra – I



  1. Linear Algebra – I Sashi Kumar Penta GPGP class presentation Spring 2007

  2. LA – 1: Outline • Motivation • Dense matrices and vectors • Representation • Vector-vector operations • Vector reduction-type operations • Matrix-matrix multiplication • Larsen et al. • Fatahalian et al. • Naga Govindaraju et al. • Introduction to CUDA • Matrix multiplication

  3. LA – 2: Outline • Memory requirements for balanced computer architectures • Sparse matrices • Banded matrices • Random sparse matrices • Sparse matrix-vector multiplication • Solving Navier–Stokes on the GPU • …

  4. Why LA on GPUs? • Applications of LA • Scientific computations • Search (LSI) • Graphics Simulations • The GPU is a fast streaming processor • LA operations are easily “streamable” • The result of the computation is already on the GPU and ready for display • For simulations

  5. Arrays = textures • 2D arrays are the native GPU data layout • Maximum array (texture) size on the GPU: 8K x 8K • Texture coordinates are used to access values stored in the textures (a hedged upload sketch follows)
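  A minimal, hedged OpenGL sketch (not from the slides) of uploading an N x N float array as a texture, assuming the ARB_texture_rectangle and ARB_texture_float extensions are available; tex, N and data are illustrative names.

  // Hedged sketch: upload an N x N float array as a texture rectangle.
  GLuint tex;
  glGenTextures(1, &tex);
  glBindTexture(GL_TEXTURE_RECTANGLE_ARB, tex);
  glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
  glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
  // one float per texel (luminance); data is an array of N*N floats
  glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_LUMINANCE32F_ARB,
               N, N, 0, GL_LUMINANCE, GL_FLOAT, data);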

  6. Representation • A vector is laid out as a 2D texture; a matrix is treated as a set of N vectors, i.e. N 2D textures. [Figure courtesy Jens Krüger] (An index-mapping sketch follows.)
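  A hedged helper (not from the slides) showing the address arithmetic implied by this layout: mapping a 1D vector index to the 2D texel that holds it when a length-N vector is stored in a texture of width W.

  // Hedged sketch: 1D vector index -> 2D texel-center coordinates.
  void indexToTexel(int i, int W, float *s, float *t)
  {
      *s = (float)(i % W) + 0.5f;   // texel-center x coordinate
      *t = (float)(i / W) + 0.5f;   // texel-center y coordinate
  }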

  7. Matrix multiplication: Why is it important? • Inherently parallel • Memory intensive • Greatly benefits from GPUs • Used for benchmarking hardware

  8. Early pre-floating-point work • Larsen and McAllister described an early, pre-floating-point implementation of matrix multiplies • Used 2D textures and blending operations to perform the matrix product • NVIDIA GeForce3 GPU: 8-bit fixed-point output • ~4.4 GB-ops/s

  9. L-M GPU-based nested-loop matrix multiplication
  for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
          Z[i,j] = 0;
          // Each iteration of the following k loop corresponds to one
          // quadrilateral rasterization of size N x N (one render pass per k)
          for (k = 0; k < N; k++)
              Z[i,j] = Z[i,j] + X[i,k] * Y[k,j];
      }
  }

  10. Continued… The ability to modulate with two textures lets us multiply values from two separate textures directly in the graphics hardware.

  11. Continued…
  Load texture texA with matrix A.
  Load texture texB with matrix B.
  Set the multitexturing mode to modulate.
  Set the framebuffer write mode to accumulate.
  For i = 0 .. n-1:
      draw an n x n pixel rectangle with
      texA coords (0,i), (0,i), (n-1,i), (n-1,i) and
      texB coords (i,0), (i,n-1), (i,n-1), (i,0).
  End
  The screen now contains the result of A x B.
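  A hedged OpenGL sketch of one way to realize this pass loop (not the authors' exact code); it assumes matrices A and B are already bound to texture units 0 and 1, and that "accumulate" is implemented with additive framebuffer blending.

  // Hedged sketch: modulate the two texture units and accumulate the n passes
  // with additive blending. GL_MODULATE must be set for each active unit.
  glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);
  glEnable(GL_BLEND);
  glBlendFunc(GL_ONE, GL_ONE);   // framebuffer "accumulate" mode

  for (int i = 0; i < n; ++i) {
      glBegin(GL_QUADS);
      // texA varies along one screen axis (column i of A),
      // texB varies along the other (row i of B)
      glMultiTexCoord2f(GL_TEXTURE0, 0,   i); glMultiTexCoord2f(GL_TEXTURE1, i, 0);   glVertex2f(0,   0);
      glMultiTexCoord2f(GL_TEXTURE0, 0,   i); glMultiTexCoord2f(GL_TEXTURE1, i, n-1); glVertex2f(0,   n-1);
      glMultiTexCoord2f(GL_TEXTURE0, n-1, i); glMultiTexCoord2f(GL_TEXTURE1, i, n-1); glVertex2f(n-1, n-1);
      glMultiTexCoord2f(GL_TEXTURE0, n-1, i); glMultiTexCoord2f(GL_TEXTURE1, i, 0);   glVertex2f(n-1, 0);
      glEnd();
  }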

  12. Floating point textures • Texture targets • GL_TEXTURE_2D • Coordinates have to be normalized [0, 1] • GL_TEXTURE_RECTANGLE_ARB • Coordinates are not normalized

  13. Operations • Vector-vector operations • Reduced to 2D texture operations • Coded in pixel shaders • Example: Vector1 + Vector2 → Vector3: a static quad goes through a pass-through vertex shader; the pixel shader reads Vector1 and Vector2 from texture units 0 and 1 and returns tex0 + tex1 into the render texture holding Vector3. [Figure courtesy Jens Krüger]

  14. Vector-vector operation: saxpy • saxpy: scalar alpha times x plus y, i.e. alpha * x + y • y_new = y_old + alpha * x • On the CPU: for (int i = 0; i < N; i++) Y[i] = Y[i] + alpha * X[i]; • The calculations are independent • Computational kernel: Y_new[i] = Y_old[i] + alpha * X[i]

  15. Kernels = shaders
  float saxpy(float2 coords : TEXCOORD0,
              uniform sampler2D textureY,
              uniform sampler2D textureX,
              uniform float alpha) : COLOR
  {
      float result;
      float y = tex2D(textureY, coords);
      float x = tex2D(textureX, coords);
      result = y + alpha * x;
      return result;
  }

  16. Four-channel RGBA texture
  float4 saxpy(float2 coords : TEXCOORD0,
               uniform sampler2D textureY,
               uniform sampler2D textureX,
               uniform float alpha) : COLOR
  {
      float4 result;
      float4 y = tex2D(textureY, coords);
      float4 x = tex2D(textureX, coords);
      result = y + alpha * x;
      return result;
  }
  Four-channel textures improve efficiency of both fetching and computation (a hedged packing sketch follows).
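  One hedged observation about this packing (not from the slides): since RGBA texels are stored as four consecutive floats, a vector of length N = 4 * texSize * texSize can be uploaded without any reshuffling, each texel carrying four consecutive elements; v and tex are illustrative names.

  // Hedged sketch: upload a packed vector v straight into an RGBA float texture.
  glBindTexture(GL_TEXTURE_RECTANGLE_ARB, tex);
  glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB,
               texSize, texSize, 0, GL_RGBA, GL_FLOAT, v);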

  17. Computing = drawing (texture rectangles)
  glPolygonMode(GL_FRONT, GL_FILL);
  glBegin(GL_QUADS);
      glTexCoord2f(0.0, 0.0);         glVertex2f(0.0, 0.0);
      glTexCoord2f(texSize, 0.0);     glVertex2f(texSize, 0.0);
      glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
      glTexCoord2f(0.0, texSize);     glVertex2f(0.0, texSize);
  glEnd();

  18. Computing = drawing (Texture2D)
  glPolygonMode(GL_FRONT, GL_FILL);
  glBegin(GL_QUADS);
      glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
      glTexCoord2f(1.0, 0.0); glVertex2f(texSize, 0.0);
      glTexCoord2f(1.0, 1.0); glVertex2f(texSize, texSize);
      glTexCoord2f(0.0, 1.0); glVertex2f(0.0, texSize);
  glEnd();

  19. Reduction-type operations • Maximum, vector norm, dot products, etc. • One method: render a 1x1 output • Only one parallel processing element (PE) is used • Might exceed the maximum allowed shader length • Better: compute the local maximum of each 2x2 group of elements in one kernel • Input: M by M, output: M/2 by M/2 • Recursively repeated until the final 1 by 1 result • Logarithmic number of iterations (see the driver-loop sketch below)
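  A hedged sketch of the host-side driver loop, assuming ping-pong render targets; attachRenderTarget, bindInputTexture and bindMaximumShader are hypothetical helpers, while drawQuad and the maximum kernel are the ones shown on the next two slides.

  // Hedged sketch: halve the texture size each pass until one texel remains.
  int size = M;                 // current input is size x size
  int src = 0, dst = 1;         // indices of two ping-pong textures / FBOs
  while (size > 1) {
      size /= 2;                // output of this pass is size x size
      attachRenderTarget(dst);  // hypothetical: render into texture 'dst'
      bindInputTexture(src);    // hypothetical: read from texture 'src'
      bindMaximumShader();      // the 2x2 'maximum' kernel (slide 21)
      drawQuad(size, size);     // the quad from slide 20: size x size fragments
      int tmp = src; src = dst; dst = tmp;   // swap roles for the next pass
  }
  // texture 'src' is now 1 x 1 and holds the global maximum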

  20. Rendering a quad
  void drawQuad(int w, int h)
  {
      glBegin(GL_QUADS);
      glMultiTexCoord2f(GL_TEXTURE0, -0.5, -0.5);
      glMultiTexCoord2f(GL_TEXTURE1,  0.5, -0.5);
      glMultiTexCoord2f(GL_TEXTURE2, -0.5,  0.5);
      glMultiTexCoord2f(GL_TEXTURE3,  0.5,  0.5);
      glVertex2f(0.0, 0.0);

      glMultiTexCoord2f(GL_TEXTURE0, 2*w-0.5, -0.5);
      glMultiTexCoord2f(GL_TEXTURE1, 2*w+0.5, -0.5);
      glMultiTexCoord2f(GL_TEXTURE2, 2*w-0.5,  0.5);
      glMultiTexCoord2f(GL_TEXTURE3, 2*w+0.5,  0.5);
      glVertex2f(w, 0.0);

      glMultiTexCoord2f(GL_TEXTURE0, 2*w-0.5, 2*h-0.5);
      glMultiTexCoord2f(GL_TEXTURE1, 2*w+0.5, 2*h-0.5);
      glMultiTexCoord2f(GL_TEXTURE2, 2*w-0.5, 2*h+0.5);
      glMultiTexCoord2f(GL_TEXTURE3, 2*w+0.5, 2*h+0.5);
      glVertex2f(w, h);

      glMultiTexCoord2f(GL_TEXTURE0, -0.5, 2*h-0.5);
      glMultiTexCoord2f(GL_TEXTURE1,  0.5, 2*h-0.5);
      glMultiTexCoord2f(GL_TEXTURE2, -0.5, 2*h+0.5);
      glMultiTexCoord2f(GL_TEXTURE3,  0.5, 2*h+0.5);
      glVertex2f(0.0, h);
      glEnd();
  }

  21. maximum
  float maximum(float2 left   : TEXCOORD0,
                float2 right  : TEXCOORD1,
                float2 top    : TEXCOORD2,
                float2 bottom : TEXCOORD3,
                uniform samplerRECT texture) : COLOR
  {
      float val1 = texRECT(texture, left);
      float val2 = texRECT(texture, right);
      float val3 = texRECT(texture, top);
      float val4 = texRECT(texture, bottom);
      return max(val1, max(val2, max(val3, val4)));
  }

  22. Matrix-matrix multiplication
  for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
          C[i,j] = 0;
          for (k = 0; k < N; k++)
              C[i,j] += A[i,k] * B[k,j];
      }
  • On CPUs • Suffers from poor locality: elements of B are accessed column-wise, which is not sequential order in memory • Bandwidth limited • The working set of the inner loops is too large for bandwidth amplification by the closest and fastest levels of the memory hierarchy • The standard fix "blocks" the inputs into small sub-matrices that fit in the processor cache (see the blocked sketch below)
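  A hedged C sketch (not from the slides) of the blocked variant just described, assuming row-major storage and that N is a multiple of the tile size T, with T chosen so three T x T tiles fit in cache.

  // Hedged sketch: cache-blocked matrix multiplication C = A * B.
  void matmul_blocked(const float *A, const float *B, float *C, int N, int T)
  {
      for (int i = 0; i < N * N; i++) C[i] = 0.0f;
      for (int ib = 0; ib < N; ib += T)
          for (int jb = 0; jb < N; jb += T)
              for (int kb = 0; kb < N; kb += T)
                  // multiply one T x T tile of A by one T x T tile of B
                  for (int i = ib; i < ib + T; i++)
                      for (int j = jb; j < jb + T; j++) {
                          float sum = C[i * N + j];
                          for (int k = kb; k < kb + T; k++)
                              sum += A[i * N + k] * B[k * N + j];
                          C[i * N + j] = sum;
                      }
  }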

  23. Fatahalian et al.: matrix-matrix multiplication on the GPU • 2x2 blocks of matrix elements are packed into 4-component texels • Each pass reads 2 rows from matrix A and 2 columns from matrix B to compute 4 elements of the output matrix • Each shader program performs 4 inner-loop iterations of the CPU program • GPU texture caches are organized to capture 2D locality, so sequential access along either axis is cache coherent
  for (k = 0; k < N/2; k++)
      C[i,j].xyzw += A[i,k].xxzz * B[k,j].xyxy + A[i,k].yyww * B[k,j].zwzw;

  24. Multipass algorithm • Single pass: the instruction count exceeds the limits of GPU architectures for sufficiently large matrices • 4 consecutive elements from a matrix column are packed into a texel • Each shader performs 4x4 matrix by 4x1 vector products • 4x reuse of the data in texel B • 6 fetches for 4 MAD operations: 1.5x more fetches
  C[i,j].xyzw = accum[i,j].xyzw
              + A[i,k  ].xyzw * B[k/4,j].xxxx
              + A[i,k+1].xyzw * B[k/4,j].yyyy
              + A[i,k+2].xyzw * B[k/4,j].zzzz
              + A[i,k+3].xyzw * B[k/4,j].wwww;

  25. Measurements – multiplication of 1024 x 1024 matrices Available floating point bandwidth from the closest cache on these GPUs is up to several times lower than that of current CPUs to their L1 caches

  26. Govindaraju et al.: matrix-matrix multiplication on the GPU • A memory model for scientific algorithms on graphics processors • Explicit blocking to avoid hardware dependencies • Computation on tiles of size T x T is invoked by drawing quadrilaterals of size T x T on the screen • A single fragment program evaluates the dot product of vectors of size D from the inputs X and Y

  27. GPU-based nested-loop matrix multiplication
  for (ib = 0; ib < N; ib = ib + T)
      for (jb = 0; jb < N; jb = jb + T)
          for (kb = 0; kb < N; kb = kb + D)
              // the following two loops are invoked by rasterizing
              // a quadrilateral of size T x T
              for (i = ib; i < ib + T; i = i + 1)
                  for (j = jb; j < jb + T; j = j + 1)
                      // the following loop is performed inside a fragment program
                      for (k = kb; k < kb + D; k = k + 1)
                          Z[i,j] = Z[i,j] + X[i,k] * Y[k,j];

  28. Measurements • NVIDIA 7900 GTX GPU: 17.6 GFLOPS, 40 GB/s • NVIDIA 6800 Ultra: 8 GFLOPS

  29. Matrix-matrix multiplication in CUDA/CTM • ~100 GFLOPS • How? • Use of extended memory formats (fetch4) and deeper loop unrolling (R580) • Shared memory (G80) • On-chip shared memory allows you to keep blocks of data close to the ALUs rather than having to ping-pong off-chip through an FBO • Shared memory is only exposed through CUDA

  30. Introduction to CUDA • Thread batching • Thread block: a batch of threads that can cooperate by sharing data through fast shared memory and by synchronizing their execution • Grid of thread blocks: blocks that execute the same kernel can be batched together, with reduced cooperation between threads in different blocks • Each thread has a threadID within its block; each block has a blockID within the grid (see the indexing sketch below)
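  As a minimal illustration (not from the slides) of how these IDs are used, a hedged CUDA sketch of a per-element kernel and its launch configuration:

  // Hedged sketch: threadIdx and blockIdx combine into a global element index.
  __global__ void scale(float *data, float alpha, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
      if (i < n)
          data[i] *= alpha;
  }

  // launch: a 1D grid of blocks, 256 threads per block
  // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);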

  31. Memory model

  32. Matrix multiplication using CUDA • Each thread block is responsible for computing one square sub-matrix Csub of C • The block size of Csub is 16 (a 16 x 16 tile, one thread per element)

  33. Host code
  void Mul(const float *A, const float *B, int hA, int wA, int wB, float *C)
  {
      // Load A into device global memory
      float *Ad;
      int size = hA * wA * sizeof(float);
      cudaMalloc((void **)&Ad, size);
      cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

      // Load B into device global memory
      float *Bd;
      size = wA * wB * sizeof(float);
      cudaMalloc((void **)&Bd, size);
      cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

      // Allocate C in device global memory
      float *Cd;
      size = hA * wB * sizeof(float);
      cudaMalloc((void **)&Cd, size);

      // Execution configuration
      dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
      dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

      // Launch the device computation
      Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

      // Read C back from the device
      cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

      // Free device memory
      cudaFree(Ad);
      cudaFree(Bd);
      cudaFree(Cd);
  }

  34. Device code (part 1)
  __global__ void Muld(const float *A, const float *B, int wA, int wB, float *C)
  {
      // Block and thread indices
      int bx = blockIdx.x;
      int by = blockIdx.y;
      int tx = threadIdx.x;
      int ty = threadIdx.y;

      // aBegin: index of the first sub-matrix of A processed by the block
      int aBegin = wA * BLOCK_SIZE * by;
      // aEnd: index just past the last sub-matrix of A in this block row
      int aEnd = aBegin + wA;
      int aStep = BLOCK_SIZE;

      int bBegin = BLOCK_SIZE * bx;
      int bStep = BLOCK_SIZE * wB;

      float Csub = 0;
      ……

  35. Device code (part 2)
      ……
      // Iterate over all sub-matrices of A and B
      for (int a = aBegin, b = bBegin; a < aEnd; a += aStep, b += bStep) {
          // Load one sub-matrix of A and one of B from global memory
          // into shared memory
          __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
          __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
          As[ty][tx] = A[a + wA * ty + tx];
          Bs[ty][tx] = B[b + wB * ty + tx];

          // Synchronize to make sure both sub-matrices are fully loaded
          // by all threads
          __syncthreads();

          // Compute the product of the two sub-matrices
          for (int k = 0; k < BLOCK_SIZE; ++k)
              Csub += As[ty][k] * Bs[k][tx];

          // Synchronize to make sure the product of the two sub-matrices
          // is done before the shared memory is overwritten
          __syncthreads();
      }

      // Write the block sub-matrix to global memory
      int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
      C[c + wB * ty + tx] = Csub;
  }

  36. Conclusions • Different ways of representing matrices and vectors on the GPU • Various algorithms for matrix-matrix multiplication and vector-vector operations on the GPU • Computation throughput of the GPU • Memory performance of the GPU

  37. Questions?

  38. Thank you

  39. References • E. S. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. Supercomputing 2001. • Dominik Göddeke. GPGPU Tutorials. http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/index.html • Adam Moravanszky. Dense matrix algebra on the GPU. 2003. http://www.shaderx2.com/shaderx.PDF • Nico Galoppo, Naga K. Govindaraju, Michael Henson and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. Supercomputing 2005. • CUDA Programming Guide and CUDA BLAS. http://developer.nvidia.com/object/cuda.html

  40. References (ctd.) • K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Graphics Hardware 2004. • Jens Krüger. Linear Algebra on GPUs. SIGGRAPH 2004 course notes. • Naga K. Govindaraju, Scott Larsen, Jim Gray, Dinesh Manocha. A memory model for scientific algorithms on graphics processors. Supercomputing 2006.
