
General Purpose GPU (GPGPU)

General Purpose GPU (GPGPU). Aaron Smith University of Texas at Austin Spring 2003. Motivation. Graphics processors are becoming more programmable DirectX/OpenGL - Vertex and Pixel Shaders Explore the current state of the art How would a typical application run on a GPU?


Presentation Transcript


  1. General Purpose GPU (GPGPU) Aaron Smith University of Texas at Austin Spring 2003

  2. Motivation • Graphics processors are becoming more programmable • DirectX/OpenGL - Vertex and Pixel Shaders • Explore the current state of the art • How would a typical application run on a GPU? • What are the difficulties? Requirements?

  3. MPEG Overview • Format for storing compressed audio and video • Uses prediction between frames to achieve compression (exploits temporal redundancy) • “I” or intra frames • simply a frame encoded as a still image (no history) • “P” or predicted frames • predicted from the most recently reconstructed I or P frame • can also be coded like I frames when there is no good match • “B” or bi-directional frames • predicted from the closest two I or P frames, one in the past and one in the future • if there is no good match, intra-coded like an I frame • A typical sequence looks like: • IBBPBBPBBPBBIBBPBBPB... • Remember what a B frame is??? • decode the I frame, then the first P frame, then the first and second B frames • 0xx312645
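The reordering described on this slide can be sketched in C. This is a minimal illustration, not code from the presentation: the function name `decode_order` and its signature are hypothetical, and it assumes every B frame depends only on the nearest I or P frame on either side.

```c
#include <stddef.h>

/* Hypothetical helper: given frame types in display order (e.g. "IBBPBBP"),
   write the display indices in the order a decoder has to process them.
   Returns the number of indices written. Trailing B frames that have no
   future reference frame yet are not emitted. */
size_t decode_order(const char *types, size_t n, size_t *out)
{
    size_t written = 0;
    size_t pending_start = 0;   /* first B frame still waiting for its future reference */

    for (size_t i = 0; i < n; i++) {
        if (types[i] == 'B')
            continue;                       /* B frames wait for the next I or P frame */
        out[written++] = i;                 /* the reference frame is decoded first */
        for (size_t b = pending_start; b < i; b++)
            out[written++] = b;             /* then the B frames that depend on it */
        pending_start = i + 1;
    }
    return written;
}
```

For the display order IBBPBBP this yields the indices 0 3 1 2 6 4 5: each P frame is decoded before the two B frames that are displayed ahead of it.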

  4. GPU Programming Model • Stream programming • Pixel Shaders • store data in texture memory • use multiple passes to render and re-render to texture memory • Vertex Shaders??? • more powerful than pixel shaders from an instruction standpoint • but... not very useful because of restrictions on accessing texture memory • What are the limitations? • branching?
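The multi-pass idea can be imitated on the CPU with two buffers that swap roles after every pass. The sketch below is only an analogy under stated assumptions: `kernel` stands in for a pixel shader, the two float arrays stand in for textures, and none of the names come from the presentation.

```c
#include <stddef.h>

/* Stand-in for a pixel shader: one output element from one input element. */
static float kernel(float in)
{
    return in * 0.5f;
}

/* Runs `passes` passes over n elements, alternating between two buffers the
   way a multi-pass renderer alternates between textures.
   Returns the buffer that holds the final result. */
float *run_passes(float *a, float *b, size_t n, int passes)
{
    float *src = a;
    float *dst = b;

    for (int p = 0; p < passes; p++) {
        for (size_t i = 0; i < n; i++)
            dst[i] = kernel(src[i]);   /* each output depends only on the inputs */
        float *tmp = src;              /* this pass's output is the next pass's input */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

No element is read and written in the same pass, which mirrors the restriction that a shader reads from one texture while rendering into another.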

  5. MPEG and the GPU • decoding is sequential • data structures are regular • typical video stream is 352x240 • basic result is pixel color data

  6. NVIDIA Cg • High Level Shading Language • Vertex and Pixel Shaders • OpenGL and DirectX Support • Can be compiled at runtime!

  7. Cg Profiles 1) Which profile do we choose? Will the model fit? 2) What about portability? Can we move between architectures?

  8. DirectX 9 – PS_2_0

  9. PS_2_0 Cont.

  10. MPEG -> Cg Challenges • Data Types • float/int are the basic types on the GPU • unsigned char is the dominant type in the MPEG decoder • Loops • Most profiles do not support loops unless they can be completely unrolled • e.g. loop.cg(49) : warning C7012: not unrolling loop that executes 352 times since maximum loop unroll count is 256 • No recursion • Normally not a problem, since we can change to iteration • But on the GPU we have a problem with “Loops” • Arrays • Severe restrictions on index variables • Some profiles assign each array element to a register • i.e. float array[10] uses ten registers • Pointers • Not supported
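The data-type mismatch can be made concrete with a small C sketch. It assumes the 8-bit samples are uploaded as an ordinary 8-bit texture, which a fetch returns normalized to the range [0, 1]; the helper names are hypothetical.

```c
/* Map an 8-bit sample to the [0, 1] range, as an 8-bit texture fetch does. */
float byte_to_unit(unsigned char v)
{
    return (float)v / 255.0f;
}

/* Map a float back to 8 bits, clamping first and rounding to nearest. */
unsigned char unit_to_byte(float f)
{
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return (unsigned char)(f * 255.0f + 0.5f);
}
```

Under this assumption the chroma bias of 128 used in the C decoder becomes roughly 0.5 on the GPU side, since 128/255 is about 0.502.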

  11. Implementation • Only support 352x240 resolution • Allocate fixed data structures to hold a frame • 352x240 frame = 84480 + 21120 + 21120 samples (Y, U, V) • Hold data in texture memory • Use Cg pixel shaders • vertex shaders cannot access texture memory • Work backwards
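The per-plane sample counts follow from the resolution and from 4:2:0 subsampling, in which each chroma plane has half the width and half the height of the luma plane. A minimal sketch, with a hypothetical struct and function name:

```c
#include <stddef.h>

/* Hypothetical container for the sample counts of one 4:2:0 frame. */
struct plane_sizes {
    size_t y;   /* luminance samples */
    size_t u;   /* chrominance samples: half width, half height */
    size_t v;
};

struct plane_sizes plane_sizes_420(size_t width, size_t height)
{
    struct plane_sizes s;
    s.y = width * height;
    s.u = (width / 2) * (height / 2);
    s.v = s.u;
    return s;
}
```

For 352x240 this gives 84480 luma samples and 21120 samples in each chroma plane, 126720 in total.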

  12. An Example C -> Cg • Convert the MPEG decoder store() routine into a Cg shader • Simplify…simplify…simplify • Factor

  13. store_ppm_tga() - Original

      static void store_ppm_tga(outname,src,offset,incr,height,tgaflag)
      char *outname;
      unsigned char *src[];
      int offset, incr, height;
      int tgaflag;
      {
        int i, j;
        int y, u, v, r, g, b;
        int crv, cbu, cgu, cgv;
        unsigned char *py, *pu, *pv;
        static unsigned char tga24[14] = {0,0,2,0,0,0,0, 0,0,0,0,0,24,32};
        char header[FILENAME_LENGTH];
        static unsigned char *u422, *v422, *u444, *v444;

        if (chroma_format==CHROMA444)
        {
          u444 = src[1];
          v444 = src[2];
        }
        else
        {
          if (!u444)
          {
            if (chroma_format==CHROMA420)
            {
              if (!(u422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                                   *Coded_Picture_Height)))
                Error("malloc failed");
              if (!(v422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                                   *Coded_Picture_Height)))
                Error("malloc failed");
            }

            if (!(u444 = (unsigned char *)malloc(Coded_Picture_Width
                                                 *Coded_Picture_Height)))
              Error("malloc failed");
            if (!(v444 = (unsigned char *)malloc(Coded_Picture_Width
                                                 *Coded_Picture_Height)))
              Error("malloc failed");
          }

          if (chroma_format==CHROMA420)
          {
            conv420to422(src[1],u422);
            conv420to422(src[2],v422);
            conv422to444(u422,u444);
            conv422to444(v422,v444);
          }
          else
          {
            conv422to444(src[1],u444);
            conv422to444(src[2],v444);
          }
        }

        strcat(outname,tgaflag ? ".tga" : ".ppm");

        if ((outfile = open(outname,O_CREAT|O_TRUNC|O_WRONLY|O_BINARY,0666))==-1)
        {
          sprintf(Error_Text,"Couldn't create %s\n",outname);
          Error(Error_Text);
        }

        optr = obfr;

        if (tgaflag)
        {
          /* TGA header */
          for (i=0; i<12; i++)
            putbyte(tga24[i]);

          putword(horizontal_size); putword(height);
          putbyte(tga24[12]); putbyte(tga24[13]);
        }

        /* matrix coefficients */
        crv = Inverse_Table_6_9[matrix_coefficients][0];
        cbu = Inverse_Table_6_9[matrix_coefficients][1];
        cgu = Inverse_Table_6_9[matrix_coefficients][2];
        cgv = Inverse_Table_6_9[matrix_coefficients][3];

        for (i=0; i<height; i++)
        {
          py = src[0] + offset + incr*i;
          pu = u444 + offset + incr*i;
          pv = v444 + offset + incr*i;

          for (j=0; j<horizontal_size; j++)
          {
            u = *pu++ - 128;
            v = *pv++ - 128;
            y = 76309 * (*py++ - 16); /* (255/219)*65536 */
            r = Clip[(y + crv*v + 32768)>>16];
            g = Clip[(y - cgu*u - cgv*v + 32768)>>16];
            b = Clip[(y + cbu*u + 32768)>>16];

            if (tgaflag)
            {
              putbyte(b); putbyte(g); putbyte(r);
            }
            else
            {
              putbyte(r); putbyte(g); putbyte(b);
            }
          }
        }

        if (optr!=obfr)
          write(outfile,obfr,optr-obfr);

        close(outfile);
      }

  14. Quick Analysis • Pointers • Remove • Conditionals (if/else) • Remove • Dynamic Memory • Remove • File I/O • Remove • Table lookups • Remove • Constant array indexes • OK! • Constant loop invariants • OK!

  15. store_tga() - Simplified

      /* clamp to the 0..255 range of an 8-bit channel */
      #define CLIP(x) ( ((x)<0) ? 0 : (((x)>255) ? 255 : (x)) )

      static void store_tga(unsigned char *src[])
      {
        int i, j;
        int y, u, v, r, g, b;
        int crv, cbu, cgu, cgv;
        int incr = 352;
        int height = 240;
        int data_idx = 0; /* index into BitMap.data[] */
        static unsigned char u422[176*240];
        static unsigned char v422[176*240];
        static unsigned char u444[352*240];
        static unsigned char v444[352*240];

        /* 352 x 240 x 3 frame */
        BitMap.channels = 3;
        BitMap.size_x = 352;
        BitMap.size_y = 240;

        conv420to422(src[1],u422); /* u422 = src[1] */
        conv420to422(src[2],v422); /* v422 = src[2] */
        conv422to444(u422,u444);   /* u444 = u422 */
        conv422to444(v422,v444);   /* v444 = v422 */

        /* matrix coefficients */
        crv = 104597;
        cbu = 132201;
        cgu = 25675;
        cgv = 53279;

        /* convert YUV to RGB */
        for (i=0; i<height; i++)
        {
          for (j=0; j<horizontal_size; j++)
          {
            u = u444[incr*i+j] - 128;
            v = v444[incr*i+j] - 128;
            y = 76309 * (src[0][incr*i+j] - 16); /* (255/219)*65536 */

            r = CLIP((y + crv*v + 32768)>>16);
            g = CLIP((y - cgu*u - cgv*v + 32768)>>16);
            b = CLIP((y + cbu*u + 32768)>>16);

            BitMap.data[data_idx++] = r;
            BitMap.data[data_idx++] = g;
            BitMap.data[data_idx++] = b;
          }
        }

      #ifdef _WIN32
        // output the frame
        DrawGLScene((tImageTGA *)&BitMap);
      #endif
      }
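The inner loop reduces to a pure per-pixel function, which is the shape a pixel shader needs. The following sketch isolates that arithmetic. The function names and the out-parameters are illustrative, while the coefficients are the ones hard-coded on the slide; the rounding constant is taken to be 32768, which is 0.5 in 16.16 fixed point.

```c
/* Shift a 16.16 fixed-point value down to an integer and clamp it to 0..255.
   Negative values are clamped before the shift, so the result does not depend
   on how the compiler shifts negative numbers. */
static int shift_and_clip(int x)
{
    if (x < 0)
        return 0;
    x >>= 16;
    return x > 255 ? 255 : x;
}

/* Convert one YUV sample (each 0..255) to RGB with the slide's coefficients. */
void yuv_to_rgb_fixed(int y, int u, int v, int *r, int *g, int *b)
{
    const int crv = 104597, cbu = 132201, cgu = 25675, cgv = 53279;

    u -= 128;                 /* remove the chroma bias */
    v -= 128;
    y = 76309 * (y - 16);     /* (255/219) * 65536: expand 16..235 to 0..255 */

    /* adding 32768 rounds to nearest before the shift */
    *r = shift_and_clip(y + crv * v + 32768);
    *g = shift_and_clip(y - cgu * u - cgv * v + 32768);
    *b = shift_and_clip(y + cbu * u + 32768);
}
```

Video black (16, 128, 128) maps to (0, 0, 0) and video white (235, 128, 128) maps to (255, 255, 255). Because the function has no state and no cross-pixel dependencies, it can be evaluated for every pixel independently.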

  16. Quick Analysis • Removed • If/else • Pointers • File I/O • Table lookups • What’s Left? • Function calls (for chrominance conversion) • conv420to422() and conv422to444() • YUV to RGB loop

  17. YUV -> RGB (cg)

      float3 main( in float3 texcoords0 : TEXCOORD0,      /* texture coord */
                   uniform sampler2D yImage : TEXUNIT0,   /* handle to texture with Y data */
                   in float3 texcoords1 : TEXCOORD1,      /* texture coord */
                   uniform sampler2D uImage : TEXUNIT1,   /* handle to texture with U data */
                   in float3 texcoords2 : TEXCOORD2,      /* texture coord */
                   uniform sampler2D vImage : TEXUNIT2    /* handle to texture with V data */
                 ) : COLOR
      {
        float3 yuvcolor; // f(xyz) -> yvu
        float3 rgbcolor;

        yuvcolor.x = tex2D(yImage, texcoords0).x;
        yuvcolor.z = tex2D(uImage, texcoords1).y-0.5;
        yuvcolor.y = tex2D(vImage, texcoords2).z-0.5;

        rgbcolor.r = 2*(yuvcolor.x/2 + 1.402/2 * yuvcolor.z);
        rgbcolor.g = 2*(yuvcolor.x/2 - 0.344136 * yuvcolor.y/2 - 0.714136 * yuvcolor.z/2);
        rgbcolor.b = 2*(yuvcolor.x/2 + 1.773/2 * yuvcolor.y);

        return rgbcolor;
      }

      dcl_2d s0
      dcl_2d s1
      dcl_2d s2
      def c0, 0.000000, 0.000000, 0.000000, 1.000000
      def c1, 2.000000, 0.500000, 0.886500, 0.000000
      def c2, 1.000000, -0.344000, -0.714000, 0.000000
      def c3, 0.500000, 0.000000, 0.701000, 0.000000
      dcl t0.xyz
      dcl t1.xyz
      dcl t2.xyz
      texld r0, t1, s1
      texld r1, t0, s0
      add r0.x, r0.y, -c1.y
      mov r1.z, r0.x
      texld r0, t2, s2
      add r0.x, r0.z, -c1.y
      mov r1.y, r0.x
      dp3 r0.x, r1, c3
      mul r0.x, c1.x, r0.x
      dp3 r0.w, r1, c2
      mov r0.y, r0.w
      dp3 r0.w, r1, c1.x
      mul r0.w, c1.x, r0.w
      mov r0.z, r0.w
      mov r1.w, c0.w
      mov r1.xyz, r0
      mov oC0, r1
      // 17 instructions, 2 R-regs.
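For checking shader output on the CPU, the same kind of conversion can be written in plain C on normalized values. This is a sketch under stated assumptions rather than a transcription of the shader: it uses the conventional assignment in which V drives red and U drives blue, it reuses the constants that appear on the slide, and it clamps the result to [0, 1] as a colour output would be. The struct and function names are hypothetical.

```c
/* Hypothetical RGB triple with channels in [0, 1]. */
struct rgb {
    float r, g, b;
};

/* Clamp a channel to [0, 1]. */
static float saturate(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* y, u, v are texture-style samples in [0, 1]; chroma is centred on 0.5. */
struct rgb yuv_to_rgb_float(float y, float u, float v)
{
    struct rgb out;

    u -= 0.5f;   /* remove the chroma bias */
    v -= 0.5f;

    out.r = saturate(y + 1.402f * v);
    out.g = saturate(y - 0.344136f * u - 0.714136f * v);
    out.b = saturate(y + 1.773f * u);
    return out;
}
```

With neutral chroma (u = v = 0.5) all three channels equal the luma value, which gives a quick sanity check for any implementation of the conversion.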

  18. Quick Analysis • YUV -> RGB • 17 instructions and 2 registers • 352x240 = 84480 px * 17 = ~1.4M instr/frame

  19. Just for Fun • What if we needed 1024 instructions?? • 352x240 = 84480 px * 1024 = 86,507,520 instr/frame
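The instruction counts on the last two slides are a product of resolution and shader length, which a short C sketch makes explicit. The function names are hypothetical, and the frame rate is a caller-supplied input rather than a figure from the presentation.

```c
/* Shader instructions executed per frame when every pixel runs the same program. */
unsigned long long instr_per_frame(unsigned width, unsigned height,
                                   unsigned instr_per_pixel)
{
    return (unsigned long long)width * height * instr_per_pixel;
}

/* The same figure scaled by a frame rate chosen by the caller. */
unsigned long long instr_per_second(unsigned width, unsigned height,
                                    unsigned instr_per_pixel, unsigned fps)
{
    return instr_per_frame(width, height, instr_per_pixel) * fps;
}
```

For 352x240 this gives 1,436,160 instructions per frame at 17 instructions per pixel and 86,507,520 at 1024 instructions per pixel.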
