Parallelizing MPEG-2 with CUDA • CS315A, March 9, 2009 • Michael Spradlin, Rhyland Klein, Jong Su Lee, Pradeep Joginipally
Outline • Motivation • Why use CUDA? • Why use MPEG-2? • MPEG-2 overview • CUDA overview • CUDA + video encoder implementation • Preliminary results (promising so far)
Motivation • Video encoding is becoming a very resource-intensive application for today’s desktop user • Encoding a video is still a time-consuming prospect • Advent of GPUs and CUDA technology • A video encoder is a highly parallel application • Make use of highly parallel GPU resources • Found very little prior work porting video encoders to the CUDA platform • See how the CUDA platform performs for encoders
Why MPEG-2? • We chose MPEG-2 over MPEG-4/H.264 • Initially planned to use ffmpeg for H.264 encoding • Ran into difficulties incorporating the CUDA library into the ffmpeg build • None of us had experience with CUDA or encoders • Just wanted to get started, not get bogged down setting up a build environment • MPEG-2 is still relevant (DVD encoding) • Chose the mpeg2encode reference implementation
MPEG-2 Overview: Frame Types • GOP (Group of Pictures) • I frame: intra frame • Independently decodable • P frame: predictive frame • Refers to the previous I or P frame • B frame: bi-predictive inter frame • Refers to both previous and later frames (a small sketch follows)
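For illustration (our sketch, not from the slides): with the 4-frame IBPB GOP used later in the results, frame types in display order fall out of the frame number like this; encoding order then moves each B frame after its future anchor (the "frame reordering" step in the pipeline).

    /* Hypothetical helper: frame type within a 4-frame IBPB GOP,
     * indexed in display order (I B P B | I B P B | ...). */
    char frame_type(int n)
    {
        if (n % 4 == 0) return 'I';   /* each GOP opens with an intra frame */
        if (n % 2 == 0) return 'P';   /* predicted from the previous I/P    */
        return 'B';                   /* bi-predicted from both anchors     */
    }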
Sample Input • Elephant’s Dream, an open-source animated short • Used by the PARSEC x264 benchmark • 15,691 .png images, converted to .yuv format
How to Compress and Encode • The human eye has a limited ability to perceive fine detail • The encoder reduces or completely discards information at certain frequencies and in certain areas • Color space: Y’CbCr • Y’: luma • Cb: chroma blue • Cr: chroma red • The human eye is much less sensitive to small changes in color than in brightness • 4:2:0 subsampling: 4 Y’ samples for each 1 Cb and 1 Cr (sketch below)
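A minimal sketch of the color-space side (our code, standard full-range BT.601 coefficients, not taken from the encoder). For 4:2:0 the encoder would additionally keep every Y’ sample but average each 2x2 block of Cb and Cr, giving the 4 Y’ : 1 Cb : 1 Cr ratio above.

    #include <stdint.h>

    /* Sketch: convert one RGB pixel to Y'CbCr (full-range BT.601). */
    static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                             uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        *y  = (uint8_t)(0.299 * r + 0.587 * g + 0.114 * b);             /* luma        */
        *cb = (uint8_t)(128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b); /* chroma blue */
        *cr = (uint8_t)(128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b); /* chroma red  */
    }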
Detailed Encoding Structure (flowchart from the slides, flattened into a list; the numbers (1)-(7) mark the stages profiled on the next slide)
• Setup: input video → read parm_file → read quant_matrix → init rate control → init_seq → put seq hdr
• Loop through all frames in encoding order:
• frame reordering → rate control (init_GOP) → put GOP header → read frame
• frame/field motion estimation (1) → predict (6) → dct_type estimation (7)
• transform (2): sub_pred, fdct
• put picture (3): rate control (init_pict) → put pict hdr → prev_mquant
• loop over macroblocks: put slice hdr → determine mquant → MB quantization → macroblock mode → put motion vectors → block decode → rate control (update_pic)
• inverse quantization (5) → itransform (4): idct, add pred
• calc_snr → store frame
• After all frames: put seq end → Result
Main Functions (profile of the sequential encoder) • Motion estimation (75.3%) • Calculate prediction (0.9%) • DCT type estimation (0.006%) • Subtract prediction from picture and perform DCT (14.7%) • Quantize DCT coefficients and generate VLC data (5.4%) • Inverse quantize DCT coefficients (0.9%) • Perform IDCT and add prediction (1.5%) • DCT: Discrete Cosine Transform • VLC: Variable-Length Coding
Unit Hierarchy • Sequence layer • GOP layer • Picture layer (40x24 macroblocks) • Slices (within a frame) • Macroblock (16x16 pixels) • Block (8x8 pixels) • (sketched as C types below)
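Roughly, the hierarchy maps onto C types like these (our names and layout, not the reference encoder’s):

    typedef short Block[64];                           /* 8x8 pels                    */
    typedef struct { Block blk[6]; } Macroblock;       /* 4 luma + 2 chroma (4:2:0)   */
    typedef struct { Macroblock mb[24][40]; } Picture; /* 40x24 = 960 MBs; one slice
                                                          per macroblock row here     */
    typedef struct { Picture pic[4]; } GOP;            /* e.g. IBPB                   */
    /* A sequence is then just a stream of GOPs. */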
CUDA • NVIDIA architecture • Massively parallel GPU • Flexible
CUDA cont… • Exposes an API to the GPU • Many stream processors • More compute, less cache • SIMT (Single Instruction, Multiple Threads) • Divergent conditionals are expensive!
CUDA cont… • Data decomposition • Threads • Blocks • Grids (sketch below)
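A minimal sketch of the three levels (illustrative names, not project code): the grid is an array of thread blocks, each block an array of threads, and every thread locates its own element from blockIdx/threadIdx.

    /* Each thread scales one element; the grid/block shape decides which. */
    __global__ void scale(float *d_data, float s)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        d_data[idx] *= s;
    }

    /* Host side: a 1-D grid of 960 blocks, 384 threads per block. */
    scale<<<960, 384>>>(d_data, 0.5f);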
CUDA cont… • Complex memory hierarchy • Device memory • Shared memory • Local memory • Host↔device transfers are expensive (sketch below)
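A sketch of where the costs sit (our example): device memory is the GPU’s DRAM, shared memory is a small fast per-block scratchpad, and it is the host↔device copies across the PCIe bus that hurt.

    __global__ void sum_block(float *d_in, float *d_out)
    {
        __shared__ float tile[384];           /* fast per-block scratchpad   */
        tile[threadIdx.x] = d_in[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();
        if (threadIdx.x == 0) {               /* naive reduction, for brevity */
            float s = 0.0f;
            for (int t = 0; t < blockDim.x; t++) s += tile[t];
            d_out[blockIdx.x] = s;
        }
    }

    /* Host side: allocate on the device, then pay for two PCIe copies. */
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, (n / 384) * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   /* slow */
    sum_block<<<n / 384, 384>>>(d_in, d_out);                            /* fast to launch */
    cudaMemcpy(h_out, d_out, (n / 384) * sizeof(float), cudaMemcpyDeviceToHost); /* slow */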
Implementation • Encoders can be parallelized at several granularities of concurrency • Per-GOP data • Per-frame data • Per logical computational task • CUDA is strongly oriented toward task concurrency • Its SIMT nature led to an implementation built from individual kernels, one per computational task
Example: Inverse Quantization
• P frames need to be un-quantized to serve as residual references for future frames in motion estimation.
• Done for 960 macroblocks in a frame, 6 blocks per macroblock, 64 pels (pixels) per block.
• 368,640 iterations per frame
• Simplified inverse quantization kernel code:

    dst[0] = src[0] << (3 - dc_prec);
    for (i = 1; i < 64; i++) {
        val = src[i] * quant_mat[i] * mquant / 16;
        /* mismatch control */
        if ((val & 1) == 0 && val != 0)
            val += (val > 0) ? -1 : 1;
        /* saturation */
        dst[i] = (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
    }
Step 1: CUDA Data Distribution • 960 macroblocks → 960 CUDA blocks in a 1-D grid • The macroblock index serves as the z coordinate (the grid dimension) • 6 blocks x 64 pels (pixels) → 384 threads per CUDA block, arranged 2-D • Block index is the y coordinate • Pel index is the x coordinate
Calling Into the GPU Device

    dim3 dimGrid(mb_per_frame);
    dim3 dimBlock(pel_per_block, block_per_mb);  /* x = pel, y = block */
    iquant1_device<<<dimGrid,dimBlock>>>(...);
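On the device side, the kernel presumably begins by recovering the (pel, block, macroblock) coordinates from those indices; a sketch, with the parameter list guessed from the snippets on the next slides:

    struct mbinfo;  /* per-macroblock metadata; assumed to come from the encoder */

    __global__ void iquant1_device(short **blocks_d, unsigned short *quant_mat_d,
                                   struct mbinfo *mbinfo_d, int block_count)
    {
        int i = threadIdx.x;   /* pel within the 8x8 block    (0..63)  */
        int j = threadIdx.y;   /* block within the macroblock (0..5)   */
        int k = blockIdx.x;    /* macroblock within the frame (0..959) */
        /* ... per-pel work, shown on the following slides ... */
    }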
Step 2: CUDA-ize the Kernel • The Good • val = src[i] * quant_mat[i] * mquant / 16 • The Bad • if ((val&1)==0 && val!=0) • dst[0] = src[0] << (3-dc_prec) • The Ugly • val += (val>0) ? -1 : 1 • dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
CUDA-ize: The Good
• Copy all blocks, the quantization matrices, and the metadata for all macroblocks to GPU device memory
• i = threadIdx.x (pel) • j = threadIdx.y (block) • k = blockIdx.x (macroblock)

    Sequential: val = block[i] * quant_mat[i] * mquant / 16
    CUDA:       val = (blocks_d[k*block_count+j][i] * quant_mat_d[i] * mbinfo_d[k].mquant) >> 4
CUDA-ize: The Bad
• Divergent branches waste time by forcing redundant computation on the same data:

    if ((val & 1) == 0 && val != 0)

• Special handling of certain elements leads to conditionals:

    dst[0] = src[0] << (3 - dc_prec);

    if (i == 0) {
        blocks_d[k*block_count+j][0] = ...
CUDA-ize: The Ugly
• Some conditional branches can be eliminated using max and min, which CUDA provides as built-ins that execute in 4 cycles:

    val += (val > 0) ? -1 : 1
    dst[i] = (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);

becomes:

    incr = max(val, -1);
    incr = min(incr, 1);
    val -= incr;
    val = min(val, 2047);
    blocks_d[k*block_count+j][i] = max(val, -2048);
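Putting the good, the bad, and the ugly together, a sketch of the whole CUDA-ized kernel (our reconstruction from the snippets above; the mbinfo layout, the parameter types, and folding the even-value guard into a branch-free multiply are our assumptions, not the project’s exact code):

    struct mbinfo { int mquant; /* ... */ };

    __global__ void iquant1_device(short **blocks_d, unsigned short *quant_mat_d,
                                   struct mbinfo *mbinfo_d,
                                   int block_count, int dc_prec)
    {
        int i = threadIdx.x;                        /* pel        */
        int j = threadIdx.y;                        /* block      */
        int k = blockIdx.x;                         /* macroblock */
        short *blk = blocks_d[k * block_count + j];

        if (i == 0) {                               /* "the bad": DC special case */
            blk[0] = blk[0] << (3 - dc_prec);
            return;
        }

        int val = (blk[i] * quant_mat_d[i] * mbinfo_d[k].mquant) >> 4;

        /* mismatch control, branch-free; the (val & 1) == 0 guard from the
         * original is folded in as a multiply by !(val & 1) */
        int incr = max(val, -1);
        incr = min(incr, 1);
        val -= incr * !(val & 1);

        /* "the ugly": saturation via built-in min/max */
        val = min(val, 2047);
        blk[i] = max(val, -2048);
    }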
Inverse Quantization Kernel Results • Input: • First 1000 frames of Elephant’s Dream • 640x360, 30 frames/sec, 144 kbps, NTSC • 4:2:0 chroma format, 4 frames per GOP, IBPB • 57x to 61x improvement!!! • Over a sequential version on a 2.33 GHz Xeon • For the inverse quantization kernel alone
But That’s a Lie!
• The measurement above didn’t take the time to copy data into and out of GPU device memory into account
• With the copies included, the quantization kernel is 2-3x slower than the sequential version
• The total encoder program is about 0.5% slower
• Luckily, CUDA’s painful memory access cost can be amortized across all the computational kernels

    cudaMemcpy(mbinfo_d, mbinfo, ..., cudaMemcpyHostToDevice);
    cudaMemcpy(blocks_d, blocks, ..., cudaMemcpyHostToDevice);
    gettimeofday(&start_time, NULL);
    iquant1_device<<<dimGrid,dimBlock>>>(...);
    gettimeofday(&end_time, NULL);
    quant_time += ...
    cudaMemcpy(blocks, blocks_d, ..., cudaMemcpyDeviceToHost);
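For comparison, a sketch (ours) of what an honest measurement would look like: the timer brackets the copies too, and the launch is synchronized before stopping the clock, since kernel launches return before the kernel actually finishes.

    gettimeofday(&start_time, NULL);
    cudaMemcpy(mbinfo_d, mbinfo, ..., cudaMemcpyHostToDevice);
    cudaMemcpy(blocks_d, blocks, ..., cudaMemcpyHostToDevice);
    iquant1_device<<<dimGrid,dimBlock>>>(...);
    cudaThreadSynchronize();      /* wait for the kernel to finish */
    cudaMemcpy(blocks, blocks_d, ..., cudaMemcpyDeviceToHost);
    gettimeofday(&end_time, NULL);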
Final Thoughts • CUDA: the not-so-general GPGPU • No function pointers, no recursion • Very limited register and cache space • The mystery of OpenMP with the attention to detail of Pthreads • Optimization requires an intimate understanding of grid and block structures • We haven’t even gotten into memory-usage tweaking • Still, remarkable results are attainable with a correctly structured program and enough effort