Parallelizing MPEG-2 with CUDA • CS315A, March 9, 2009 • Michael Spradlin, Rhyland Klein, Jong Su Lee, Pradeep Joginipally
Outline • Motivation • Why use CUDA? • Why use MPEG-2? • MPEG-2 overview • CUDA overview • CUDA + video encoder implementation • Preliminary results (promising so far)
Motivation • Video encoding is becoming a very resource-intensive application for today’s desktop user • Encoding a video is still a time-consuming prospect • Advent of GPUs and CUDA technology • A video encoder is a highly parallel application • Make use of highly parallel GPU resources • Found very little prior work porting video encoders to the CUDA platform • See how the CUDA platform performs for encoders
Why MPEG-2? • We chose MPEG-2 over MPEG-4/H.264 • Initially planned to use ffmpeg for H.264 encoding • Ran into difficulties incorporating the CUDA library into the ffmpeg build • None of us had experience with CUDA or encoders • Just wanted to get started, not get bogged down setting up a build environment • MPEG-2 is still relevant (DVD encoding) • Chose the mpeg2encode reference implementation
MPEG-2 Overview: Frame Types • GOP (Group of Pictures) • I frame: intra frame • Independently decodable • P frame: predictive frame • Refers to the previous I or P frame • B frame: bi-predictive inter frame • Refers to both previous and later frames (a small sketch follows)
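For illustration (our sketch, not from the slides): with the 4-frame IBPB GOP used later in the results, frame types in display order fall out of the frame number like this; encoding order then moves each B frame after its future anchor (the "frame reordering" step in the pipeline).

    /* Hypothetical helper: frame type within a 4-frame IBPB GOP,
     * indexed in display order (I B P B | I B P B | ...). */
    char frame_type(int n)
    {
        if (n % 4 == 0) return 'I';   /* each GOP opens with an intra frame */
        if (n % 2 == 0) return 'P';   /* predicted from the previous I/P    */
        return 'B';                   /* bi-predicted from both anchors     */
    }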
Sample Input • Elephant’s Dream, an open-source animated short • Used by the PARSEC x264 benchmark • 15,691 .png images, converted to .yuv format
How to Compress and Encode • The human eye has a limited ability to perceive fine detail • The encoder reduces or completely discards information at certain frequencies and in certain areas • Color space: Y’CbCr • Y’: luma • Cb: chroma blue • Cr: chroma red • The human eye is much less sensitive to small changes in color than in brightness • 4:2:0 subsampling: 4 Y’ samples for each 1 Cb and 1 Cr (sketch below)
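A minimal sketch of the color-space side (our code, standard full-range BT.601 coefficients, not taken from the encoder). For 4:2:0 the encoder would additionally keep every Y’ sample but average each 2x2 block of Cb and Cr, giving the 4 Y’ : 1 Cb : 1 Cr ratio above.

    #include <stdint.h>

    /* Sketch: convert one RGB pixel to Y'CbCr (full-range BT.601). */
    static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                             uint8_t *y, uint8_t *cb, uint8_t *cr)
    {
        *y  = (uint8_t)(0.299 * r + 0.587 * g + 0.114 * b);             /* luma        */
        *cb = (uint8_t)(128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b); /* chroma blue */
        *cr = (uint8_t)(128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b); /* chroma red  */
    }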
Detailed Encoding Structure (flowchart from the slides, flattened into a list; the numbers (1)-(7) mark the stages profiled on the next slide)
• Setup: input video → read parm_file → read quant_matrix → init rate control → init_seq → put seq hdr
• Loop through all frames in encoding order:
• frame reordering → rate control (init_GOP) → put GOP header → read frame
• frame/field motion estimation (1) → predict (6) → dct_type estimation (7)
• transform (2): sub_pred, fdct
• put picture (3): rate control (init_pict) → put pict hdr → prev_mquant
• loop over macroblocks: put slice hdr → determine mquant → MB quantization → macroblock mode → put motion vectors → block decode → rate control (update_pic)
• inverse quantization (5) → itransform (4): idct, add pred
• calc_snr → store frame
• After all frames: put seq end → Result
Main Functions (profile of the sequential encoder) • Motion estimation (75.3%) • Calculate prediction (0.9%) • DCT type estimation (0.006%) • Subtract prediction from picture and perform DCT (14.7%) • Quantize DCT coefficients and generate VLC data (5.4%) • Inverse quantize DCT coefficients (0.9%) • Perform IDCT and add prediction (1.5%) • DCT: Discrete Cosine Transform • VLC: Variable-Length Coding
Unit Hierarchy • Sequence layer • GOP layer • Picture layer (40x24 macroblocks) • Slices (within a frame) • Macroblock (16x16 pixels) • Block (8x8 pixels) • (sketched as C types below)
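Roughly, the hierarchy maps onto C types like these (our names and layout, not the reference encoder’s):

    typedef short Block[64];                           /* 8x8 pels                    */
    typedef struct { Block blk[6]; } Macroblock;       /* 4 luma + 2 chroma (4:2:0)   */
    typedef struct { Macroblock mb[24][40]; } Picture; /* 40x24 = 960 MBs; one slice
                                                          per macroblock row here     */
    typedef struct { Picture pic[4]; } GOP;            /* e.g. IBPB                   */
    /* A sequence is then just a stream of GOPs. */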
CUDA • NVIDIA architecture • Massively parallel GPU • Flexible
CUDA cont… • Exposes an API to the GPU • Many stream processors • More compute, less cache • SIMT (Single Instruction, Multiple Threads) • Divergent conditionals are expensive!
CUDA cont… • Data decomposition • Threads • Blocks • Grids (sketch below)
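A minimal sketch of the three levels (illustrative names, not project code): the grid is an array of thread blocks, each block an array of threads, and every thread locates its own element from blockIdx/threadIdx.

    /* Each thread scales one element; the grid/block shape decides which. */
    __global__ void scale(float *d_data, float s)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        d_data[idx] *= s;
    }

    /* Host side: a 1-D grid of 960 blocks, 384 threads per block. */
    scale<<<960, 384>>>(d_data, 0.5f);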
CUDA cont… • Complex memory hierarchy • Device memory • Shared memory • Local memory • Host↔device transfers are expensive (sketch below)
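A sketch of where the costs sit (our example): device memory is the GPU’s DRAM, shared memory is a small fast per-block scratchpad, and it is the host↔device copies across the PCIe bus that hurt.

    __global__ void sum_block(float *d_in, float *d_out)
    {
        __shared__ float tile[384];           /* fast per-block scratchpad   */
        tile[threadIdx.x] = d_in[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();
        if (threadIdx.x == 0) {               /* naive reduction, for brevity */
            float s = 0.0f;
            for (int t = 0; t < blockDim.x; t++) s += tile[t];
            d_out[blockIdx.x] = s;
        }
    }

    /* Host side: allocate on the device, then pay for two PCIe copies. */
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, (n / 384) * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   /* slow */
    sum_block<<<n / 384, 384>>>(d_in, d_out);                            /* fast to launch */
    cudaMemcpy(h_out, d_out, (n / 384) * sizeof(float), cudaMemcpyDeviceToHost); /* slow */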
Implementation • Encoders can be parallelized at several granularities of concurrency • Per-GOP data • Per-frame data • Per logical computational task • CUDA is strongly oriented toward task concurrency • Its SIMT nature led to an implementation built from individual kernels, one per computational task
Example: Inverse Quantization
• P frames need to be un-quantized to serve as residual references for future frames in motion estimation.
• Done for 960 macroblocks in a frame, 6 blocks per macroblock, 64 pels (pixels) per block.
• 368,640 iterations per frame
• Simplified inverse quantization kernel code:

    dst[0] = src[0] << (3 - dc_prec);
    for (i = 1; i < 64; i++) {
        val = src[i] * quant_mat[i] * mquant / 16;
        /* mismatch control */
        if ((val & 1) == 0 && val != 0)
            val += (val > 0) ? -1 : 1;
        /* saturation */
        dst[i] = (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
    }
Step 1: CUDA Data Distribution • 960 macroblocks → 960 CUDA blocks in a 1-D grid • The macroblock index serves as the z coordinate (the grid dimension) • 6 blocks x 64 pels (pixels) → 384 threads per CUDA block, arranged 2-D • Block index is the y coordinate • Pel index is the x coordinate
Calling Into the GPU Device

    dim3 dimGrid(mb_per_frame);
    dim3 dimBlock(pel_per_block, block_per_mb);  /* x = pel, y = block */
    iquant1_device<<<dimGrid,dimBlock>>>(...);
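On the device side, the kernel presumably begins by recovering the (pel, block, macroblock) coordinates from those indices; a sketch, with the parameter list guessed from the snippets on the next slides:

    struct mbinfo;  /* per-macroblock metadata; assumed to come from the encoder */

    __global__ void iquant1_device(short **blocks_d, unsigned short *quant_mat_d,
                                   struct mbinfo *mbinfo_d, int block_count)
    {
        int i = threadIdx.x;   /* pel within the 8x8 block    (0..63)  */
        int j = threadIdx.y;   /* block within the macroblock (0..5)   */
        int k = blockIdx.x;    /* macroblock within the frame (0..959) */
        /* ... per-pel work, shown on the following slides ... */
    }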
Step 2: CUDA-ize the Kernel • The Good • val = src[i] * quant_mat[i] * mquant / 16 • The Bad • if ((val&1)==0 && val!=0) • dst[0] = src[0] << (3-dc_prec) • The Ugly • val += (val>0) ? -1 : 1 • dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
CUDA-ize: The Good
• Copy all blocks, the quantization matrices, and the metadata for all macroblocks to GPU device memory
• i = threadIdx.x (pel) • j = threadIdx.y (block) • k = blockIdx.x (macroblock)

    Sequential: val = block[i] * quant_mat[i] * mquant / 16
    CUDA:       val = (blocks_d[k*block_count+j][i] * quant_mat_d[i] * mbinfo_d[k].mquant) >> 4
CUDA-ize: The Bad
• Divergent branches waste time by forcing redundant computation on the same data:

    if ((val & 1) == 0 && val != 0)

• Special handling of certain elements leads to conditionals:

    dst[0] = src[0] << (3 - dc_prec);

    if (i == 0) {
        blocks_d[k*block_count+j][0] = ...
CUDA-ize: The Ugly
• Some conditional branches can be eliminated using max and min, which CUDA provides as built-ins that execute in 4 cycles:

    val += (val > 0) ? -1 : 1
    dst[i] = (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);

becomes:

    incr = max(val, -1);
    incr = min(incr, 1);
    val -= incr;
    val = min(val, 2047);
    blocks_d[k*block_count+j][i] = max(val, -2048);
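Putting the good, the bad, and the ugly together, a sketch of the whole CUDA-ized kernel (our reconstruction from the snippets above; the mbinfo layout, the parameter types, and folding the even-value guard into a branch-free multiply are our assumptions, not the project’s exact code):

    struct mbinfo { int mquant; /* ... */ };

    __global__ void iquant1_device(short **blocks_d, unsigned short *quant_mat_d,
                                   struct mbinfo *mbinfo_d,
                                   int block_count, int dc_prec)
    {
        int i = threadIdx.x;                        /* pel        */
        int j = threadIdx.y;                        /* block      */
        int k = blockIdx.x;                         /* macroblock */
        short *blk = blocks_d[k * block_count + j];

        if (i == 0) {                               /* "the bad": DC special case */
            blk[0] = blk[0] << (3 - dc_prec);
            return;
        }

        int val = (blk[i] * quant_mat_d[i] * mbinfo_d[k].mquant) >> 4;

        /* mismatch control, branch-free; the (val & 1) == 0 guard from the
         * original is folded in as a multiply by !(val & 1) */
        int incr = max(val, -1);
        incr = min(incr, 1);
        val -= incr * !(val & 1);

        /* "the ugly": saturation via built-in min/max */
        val = min(val, 2047);
        blk[i] = max(val, -2048);
    }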
Inverse Quantization Kernel Results • Input: • First 1000 frames of Elephant’s Dream • 640x360, 30 frames/sec, 144 kbps, NTSC • 4:2:0 chroma format, 4 frames per GOP, IBPB • 57x to 61x improvement!!! • Over a sequential version on a 2.33 GHz Xeon • For the inverse quantization kernel alone
But That’s a Lie!
• The measurement above didn’t take the time to copy data into and out of GPU device memory into account
• With the copies included, the quantization kernel is 2-3x slower than the sequential version
• The total encoder program is about 0.5% slower
• Luckily, CUDA’s painful memory access cost can be amortized across all the computational kernels

    cudaMemcpy(mbinfo_d, mbinfo, ..., cudaMemcpyHostToDevice);
    cudaMemcpy(blocks_d, blocks, ..., cudaMemcpyHostToDevice);
    gettimeofday(&start_time, NULL);
    iquant1_device<<<dimGrid,dimBlock>>>(...);
    gettimeofday(&end_time, NULL);
    quant_time += ...
    cudaMemcpy(blocks, blocks_d, ..., cudaMemcpyDeviceToHost);
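For comparison, a sketch (ours) of what an honest measurement would look like: the timer brackets the copies too, and the launch is synchronized before stopping the clock, since kernel launches return before the kernel actually finishes.

    gettimeofday(&start_time, NULL);
    cudaMemcpy(mbinfo_d, mbinfo, ..., cudaMemcpyHostToDevice);
    cudaMemcpy(blocks_d, blocks, ..., cudaMemcpyHostToDevice);
    iquant1_device<<<dimGrid,dimBlock>>>(...);
    cudaThreadSynchronize();      /* wait for the kernel to finish */
    cudaMemcpy(blocks, blocks_d, ..., cudaMemcpyDeviceToHost);
    gettimeofday(&end_time, NULL);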
Final Thoughts • CUDA: the not-so-general GPGPU • No function pointers, no recursion • Very limited register and cache space • The mystery of OpenMP with the attention to detail of Pthreads • Optimization requires an intimate understanding of grid and block structures • We haven’t even gotten into memory-usage tweaking • Still, remarkable results are attainable with a correctly structured program and enough effort