COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING

Presentation Transcript


  1. COMPLEXITY REDUCTION OF H.264 USING PARALLEL PROGRAMMING By Sudeep Gangavati, Department of Electrical Engineering, University of Texas at Arlington. Supervisor: Dr. K. R. Rao

  2. Outline • Introduction to video compression • Why H.264 • Overview of H.264 • Motivation • Possible approaches • Related work • Theoretical estimation • Proposed approach • Parallel computing • NVIDIA GPUs and CUDA Programming Model • Complexity reduction using CUDA • Results • Conclusion and future work

  3. Introduction to video compression • Video codec: Software or a hardware device that compresses and decompresses video. • Need for compression: Limited bandwidth and limited storage space. • Several codecs: H.264, VP8, AVS China, Dirac, etc. Figure 1: Forecast of mobile data usage

  4. Why H.264? • H.264/MPEG-4 Part 10, or AVC (Advanced Video Coding), standardized by ITU-T VCEG and ISO/IEC MPEG in 2003. • Approximately 50% bit-rate reduction over MPEG-2. • Most widely used standard. • Built on the concepts of earlier standards like MPEG-2. • Substantial compression efficiency. • Network-friendly data representation. • Improved error resiliency tools. • Supports various applications.

  5. Overview of H.264 • There are two parts: • Encoder: Carries out prediction, transform, quantization and entropy encoding to produce an H.264 bit-stream. • Decoder: Carries out entropy decoding, inverse quantization and inverse transform to reconstruct the encoded video.

  6. H.264 encoder

  7. H.264 decoder

  8. Intra prediction • Exploits spatial redundancy • 9 prediction modes (8 directional plus DC) for 4 x 4 luma blocks • 4 modes for 16 x 16 luma blocks • 4 modes for 8 x 8 chroma blocks

  9. Intra prediction • 9 modes for 4 x 4 luma block • 4 modes for 16 x 16 luma blocks
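
As an illustration of what a 4 x 4 intra mode computes, here is a minimal C sketch of three of the nine modes (vertical, horizontal and DC). The slides only give the mode counts; the function names are hypothetical, and the sketch assumes that both the row above and the column to the left of the block are available.

```c
#include <assert.h>
#include <stdint.h>

/* Vertical mode: each row repeats the four reconstructed pixels above the block. */
void intra4x4_vertical(const uint8_t above[4], uint8_t pred[4][4])
{
    assert(above && pred);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            pred[r][c] = above[c];
}

/* Horizontal mode: each column repeats the four reconstructed pixels to the left. */
void intra4x4_horizontal(const uint8_t left[4], uint8_t pred[4][4])
{
    assert(left && pred);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            pred[r][c] = left[r];
}

/* DC mode: every pixel is the rounded mean of the eight neighbours.
 * Assumption: both the above and the left neighbours are available. */
void intra4x4_dc(const uint8_t above[4], const uint8_t left[4], uint8_t pred[4][4])
{
    int sum = 0;
    assert(above && left && pred);
    for (int i = 0; i < 4; ++i)
        sum += above[i] + left[i];
    uint8_t dc = (uint8_t)((sum + 4) >> 3);   /* divide by 8 with rounding */
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            pred[r][c] = dc;
}
```

An encoder would form the residual against each candidate prediction and keep the mode with the lowest cost.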

  10. Inter prediction • Exploits temporal redundancy • Involves prediction from one or more previous frames called reference frames

  11. Motion estimation and compensation • Motion estimation is the process of finding the best matching block in a reference frame; motion compensation uses that match to form the prediction. • A motion search is performed. • The resulting motion vectors give the displacement of the block.

  12. Transform, Quantization and Encoding • The prediction residuals are then transformed. • H.264 employs an integer transform, which is an integer approximation of the DCT. • After the transform, the coefficients are quantized for compression. • Entropy encoding: CAVLC / CABAC
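
The integer transform mentioned above can be sketched in C as follows. This is a minimal sketch of the 4 x 4 forward core transform Y = Cf · X · Cf^T using the core matrix from the H.264 standard; the function name is hypothetical, and the scaling that the standard folds into quantization is deliberately left out.

```c
#include <assert.h>

/* Core matrix Cf of the H.264 4x4 forward integer transform. */
static const int CF[4][4] = {
    {1,  1,  1,  1},
    {2,  1, -1, -2},
    {1, -1, -1,  1},
    {1, -2,  2, -1}
};

/* Computes y = Cf * x * Cf^T with integer arithmetic only.
 * The post-scaling step (merged with quantization in H.264) is not modelled. */
void forward_transform4x4(int x[4][4], int y[4][4])
{
    int tmp[4][4];
    assert(x && y);

    /* tmp = Cf * x */
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            tmp[i][j] = 0;
            for (int k = 0; k < 4; ++k)
                tmp[i][j] += CF[i][k] * x[k][j];
        }

    /* y = tmp * Cf^T */
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            y[i][j] = 0;
            for (int k = 0; k < 4; ++k)
                y[i][j] += tmp[i][k] * CF[j][k];
        }
}
```

For a flat residual block all the energy lands in the top-left (DC) coefficient, which is what makes the subsequent quantization effective.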

  13. Motivation • Time profiling of the H.264 encoder showed: • Motion estimation takes more time than any other module in H.264. • This time needs to be reduced through an efficient implementation, without sacrificing video quality or bitrate. • Reducing the motion estimation time reduces the total encoding time.

  14. Possible approaches • Encoder optimization levels: • Algorithmic level: Develop new algorithms similar to the three-step search algorithm, fast mode decision algorithms, etc. • Compiler level: Efficient programming • Implementation level: Parallel programming with CUDA or OpenMP to utilize the underlying hardware, etc.

  15. Related work

  16. Issues with previous work • Focuses only on the achievable speedup. • Does not consider methods to decrease the bitrate. • Does not consider techniques to maintain video quality. • Thread creation overhead and limitations in some approaches.

  17. Theoretical estimation by Amdahl's Law [43] • We use this law to find the maximum achievable speedup. • It is widely used in parallel computing to predict the theoretical maximum speedup using multiple processors. • Amdahl's law states that if P is the proportion of a program that can be made parallel and (1 - P) is the proportion that cannot be parallelized, then the maximum speedup that can be achieved by using N processors is $S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$

  18. Using Amdahl's Law • Approximates the speedup achieved by parallelizing a portion of the code. • P: parallelized portion • N: number of processor cores • In the encoder, motion estimation accounts for approximately two-thirds of the code. • Applying the law, the maximum speedup that can be achieved in our case is 2.2 times, or a 55% time reduction.
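
The estimate can be reproduced with a few lines of C. This is a minimal sketch of the standard form of Amdahl's law, S(N) = 1 / ((1 - P) + P/N); the function names are hypothetical, and N is left as a parameter because the slides do not state the value used to arrive at 2.2.

```c
#include <assert.h>

/* Amdahl's law: speedup for a parallel fraction p running on n processors. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound on the speedup as n grows without limit: 1 / (1 - p). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}

/* Fraction of the original run time that is saved at a given speedup. */
double time_reduction(double speedup)
{
    assert(speedup > 0.0);
    return 1.0 - 1.0 / speedup;
}
```

With P = 2/3 the speedup can never exceed 3, however many cores are used, and `time_reduction(2.2)` evaluates to roughly 0.545, consistent with the 55% figure quoted on the slide.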

  19. Proposed work • We propose the following to address the problem: • Use CUDA for a parallel implementation that computes the SAD (sum of absolute differences) faster, with one thread per block instead of one thread per pixel to address the thread creation overhead and limitations. • Use a better search algorithm to maintain the video quality. • Combine SAD cost values to save bitrate. • The above methods address all the issues mentioned earlier. • In addition, we use the shared and texture memory of the GPU, which reduces global memory references and provides faster memory access.

  20. Parallel Computing • Multi-core and many-core processors improve the efficiency by parallel processing • Parallel processing provides significant improvement • Techniques to program software on multiple core processors: • Data Parallelism • Task parallelism

  21. Parallel Computing • Data Parallelism • Split the large data set into smaller parts and execute them in parallel. After the execution, the data are grouped

  22. Parallel Computing • Task Parallelism • Distribute threads to different processors • Data could be common • May execute same or different code

  23. NVIDIA GPU And CUDA Programming Model • NVIDIA pioneered the Graphics Processing Units (GPU) Technology. First GPU: GeForce256 in 1999, had 128 MB of graphics memory. • GPUs, consisting of many core processors, are used in applications requiring high amounts of computation. • CPU-GPU Heterogeneous Model

  24. Host-Device Connection

  25. Compute Unified Device Architecture (CUDA) • NVIDIA introduced CUDA in 2006. • A programming model that makes programs run on the GPU. • The serial portions of our program are written as C/C++ functions. • Parallel portions are written as GPU kernels. • C/C++ functions execute on the CPU; kernels are sent to the GPU for processing.

  26. Problem decomposition • Serial C functions run on CPU • CUDA Kernels run on GPU

  27. Hardware Architecture • Main element: Streaming multiprocessor (SM) • GT550M series has 2 SMs • Each SM has 48 cores • Each SM is capable of executing 1536 threads • Total of 3072 threads running in parallel

  28. Threading • Threads are grouped into blocks • Blocks are grouped into grids • All threads within a block execute on the same SM

  29. Complexity reduction using CUDA • Motion estimation: Process of finding the best matching block.

  30. Complexity reduction using CUDA • To find the best matching block, a search is done in the search window (or region). • The search identifies the best matching block by computing the difference between blocks, i.e. the sum of absolute differences (SAD).

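  31. Complexity reduction using CUDA • SAD(dx, dy) = $\sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \left| C(x, y) - R(x + dx, y + dy) \right|$, where C is the current N x N block, R is the reference frame and (dx, dy) is the candidate displacement • Search through a search range of 8, 16 or a maximum of 32 • Select the block with the least SAD. • The larger the block size, the more computations are needed. Figure: a 352 x 288 frame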
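
The SAD of one candidate can be sketched in C as below. This is a minimal sketch for an 8 x 8 block: the sum, over all 64 pixels, of the absolute difference between the current block and the reference block displaced by (dx, dy). The function name, the row-major frame layout and the `stride` parameter are assumptions for illustration, and the caller is assumed to keep the displaced block inside the reference frame.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* SAD between the 8x8 block at (bx, by) in the current frame and the block
 * displaced by (dx, dy) in the reference frame.
 * Assumptions: frames are row-major with `stride` bytes per row, and the
 * displaced block lies entirely inside the reference frame. */
unsigned sad8x8(const uint8_t *cur, const uint8_t *ref, int stride,
                int bx, int by, int dx, int dy)
{
    unsigned sad = 0;
    assert(cur && ref && stride > 0);
    for (int y = 0; y < 8; ++y) {
        const uint8_t *c = cur + (by + y) * stride + bx;
        const uint8_t *r = ref + (by + dy + y) * stride + (bx + dx);
        for (int x = 0; x < 8; ++x)
            sad += (unsigned)abs((int)c[x] - (int)r[x]);
    }
    return sad;
}
```

In a one-thread-per-block scheme, the body of this function is roughly the work a single thread would perform for one candidate displacement.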

  32. Standard algorithm • Divide the frame into 16 x 16 macroblocks • Further divide each macroblock into sub-blocks of 8 x 8 • Search through the search area • Compute the SAD • Obtain the MVs

  33. Our approach • The main idea is to: • Minimize memory references and memory transfers • Make use of shared memory and texture memory • Use a single thread to compute the SAD for a single block • Make thread block creation dependent on the frame size for scalability • A large number of threads are invoked that run in parallel; each thread block consists of 396 threads that compute the SADs of 396 blocks of 8 x 8

  34. SAD mapping to threads • 352 x 288 frame: (352 / 8) * (288 / 8) = 1584 blocks of 8 x 8 for which the SAD is to be computed. • Total thread blocks = 4, each with 396 threads. • This makes the approach scalable: for a video with higher resolution, like 704 x 480 (4SIF) or 704 x 576 (4CIF), we can create 16 thread blocks, each with 396 threads. • So the number of threads created depends on the video resolution.
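
The mapping arithmetic can be written out in plain C. This is a sketch under stated assumptions: the 8 x 8 sub-block size and the 396 threads per thread block come from the slide, while the function names, the ceiling division and the raster-order assignment of threads to sub-blocks are illustrative choices that the slides do not confirm.

```c
#include <assert.h>

#define SUBBLOCK 8              /* 8x8 sub-blocks, as on the slide */
#define THREADS_PER_BLOCK 396   /* thread-block size quoted on the slide */

/* Number of 8x8 sub-blocks in a frame, i.e. the number of threads needed. */
int num_subblocks(int width, int height)
{
    assert(width % SUBBLOCK == 0 && height % SUBBLOCK == 0);
    return (width / SUBBLOCK) * (height / SUBBLOCK);
}

/* Thread blocks needed to cover the frame, rounding up. */
int num_thread_blocks(int width, int height)
{
    int n = num_subblocks(width, height);
    return (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

/* Maps a thread to the top-left pixel of its sub-block.
 * In a CUDA kernel the global id would be blockIdx.x * blockDim.x + threadIdx.x.
 * Assumption: sub-blocks are assigned to threads in raster order. */
void thread_to_subblock(int block_idx, int thread_idx, int width, int *px, int *py)
{
    int gid = block_idx * THREADS_PER_BLOCK + thread_idx;
    int per_row = width / SUBBLOCK;
    assert(px && py && per_row > 0);
    *px = (gid % per_row) * SUBBLOCK;
    *py = (gid / per_row) * SUBBLOCK;
}
```

For 352 x 288 this gives 1584 threads in 4 thread blocks, and for 704 x 576 it gives 6336 threads in 16 thread blocks. With ceiling division, 704 x 480 needs 14 thread blocks, so launching 16 there would leave some threads without work.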

  35. Performance enhancements • We consider rate-distortion (RD) criteria and employ the following techniques: • To minimize bitrate: • Calculate the cost for the smaller 8 x 8 sub-blocks and combine 4 of these to form a single cost for the 16 x 16 block. • To enhance video quality: • Incorporate the exhaustive full search algorithm, which calculates the matching block over the entire frame without skipping any blocks, as opposed to other algorithms. Previous studies [30] show that this algorithm provides the best performance. Though it is computationally expensive, it is used with video quality in mind.
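
Both techniques can be combined in a short sequential C sketch: an exhaustive search over every displacement in the search range, where the cost of a 16 x 16 macroblock is the sum of the SADs of its four 8 x 8 sub-blocks. The names (`MotionVector`, `full_search16x16`) are hypothetical, the cost is SAD only (a full RD cost would also include a rate term, which is not modelled here), and candidates that fall outside the frame are simply skipped.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int dx, dy; unsigned cost; } MotionVector;

/* SAD of the 8x8 block at (cx, cy) in cur against the block at (rx, ry) in ref. */
static unsigned sad8x8(const uint8_t *cur, const uint8_t *ref, int stride,
                       int cx, int cy, int rx, int ry)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += (unsigned)abs((int)cur[(cy + y) * stride + cx + x] -
                                 (int)ref[(ry + y) * stride + rx + x]);
    return sad;
}

/* Exhaustive (full) search for the 16x16 macroblock at (mx, my).
 * Each candidate's cost is the sum of its four 8x8 sub-block SADs. */
MotionVector full_search16x16(const uint8_t *cur, const uint8_t *ref,
                              int width, int height, int mx, int my, int range)
{
    MotionVector best = {0, 0, UINT_MAX};
    assert(cur && ref && range >= 0);
    for (int dy = -range; dy <= range; ++dy) {
        for (int dx = -range; dx <= range; ++dx) {
            int rx = mx + dx, ry = my + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height)
                continue;                      /* candidate leaves the frame */
            unsigned cost = 0;
            for (int sy = 0; sy < 16; sy += 8)
                for (int sx = 0; sx < 16; sx += 8)
                    cost += sad8x8(cur, ref, width,
                                   mx + sx, my + sy, rx + sx, ry + sy);
            if (cost < best.cost) {
                best.dx = dx;
                best.dy = dy;
                best.cost = cost;
            }
        }
    }
    return best;
}
```

Because the four 8 x 8 SADs are computed anyway, forming the 16 x 16 cost needs only three extra additions per candidate.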

  36. Memory access • Memory access from texture memory to shared memory • Memcpy API to move data into the array we allocated:
        cudaMemcpyToArray(a_before_dilated,                // array pointer
                          0,                               // array offset width
                          0,                               // array offset height
                          h_before_dilated,                // source
                          width * height * sizeof(uchar1), // size in bytes
                          cudaMemcpyHostToDevice);         // type of memcpy
      Figure labels: Texture memory, Shared memory

  37. Performance Metrics

  38. Test Sequences

  39. Results • The CPU-GPU encoder performs better than the CPU-only encoder, but falls short of the NVIDIA encoder. This is because the NVIDIA encoder is heavily optimized at all levels of H.264, not just motion estimation. • NVIDIA has also not disclosed the search algorithm it uses; the choice of motion search algorithm significantly affects quality, bitrate and speed. • The theoretical speedup was about 2.2 to 2.5 times; the results show a speedup of approximately 2 times. The gap can be attributed to factors not considered in the estimate, such as the time for load and store operations, transfer of control to the GPU, memory transfers and references, and the other H.264 calculations.

  40. Results for QCIF video sequences • Figures: PSNR vs. bitrate for the Akiyo, Carphone, Container and Foreman sequences

  41. Results • SSIM measures the structural similarity between the input and output videos. It ranges from 0.0 to 1.0, where 0 is the lowest quality and 1 is the highest quality.
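
To make the metric concrete, here is a simplified C sketch of the SSIM index computed over a single window of 8-bit samples, following the formula of Wang et al. [10] with the usual constants K1 = 0.01 and K2 = 0.03. It is an assumption-laden illustration, not the measurement tool used for these results: the function name is hypothetical, and a reference implementation uses local, Gaussian-weighted windows and averages the per-window values over the frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single-window SSIM between two sets of n 8-bit samples.
 * Simplification: one global window with population statistics, whereas
 * reference implementations average SSIM over local weighted windows. */
double ssim_global(const uint8_t *a, const uint8_t *b, size_t n)
{
    const double c1 = (0.01 * 255.0) * (0.01 * 255.0);   /* (K1 * L)^2 */
    const double c2 = (0.03 * 255.0) * (0.03 * 255.0);   /* (K2 * L)^2 */
    double mean_a = 0.0, mean_b = 0.0;
    double var_a = 0.0, var_b = 0.0, cov = 0.0;
    assert(a && b && n > 0);

    for (size_t i = 0; i < n; ++i) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= (double)n;
    mean_b /= (double)n;

    for (size_t i = 0; i < n; ++i) {
        double da = a[i] - mean_a;
        double db = b[i] - mean_b;
        var_a += da * da;
        var_b += db * db;
        cov += da * db;
    }
    var_a /= (double)n;
    var_b /= (double)n;
    cov /= (double)n;

    return ((2.0 * mean_a * mean_b + c1) * (2.0 * cov + c2)) /
           ((mean_a * mean_a + mean_b * mean_b + c1) * (var_a + var_b + c2));
}
```

Comparing a signal with itself yields 1, and the value drops as the structure of the two signals diverges.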

  42. Results • Similar behavior is observed in case of CIF video sequences.

  43. Results

  44. Results • SSIM values for our optimized software and NVIDIA encoder are very close.

  45. Conclusions • Nearly 50% reduction in encoding time on various sequences, close to the theoretical estimate. • Little degradation in video quality is observed. • A lower bitrate is obtained by combining the SAD costs of sub-blocks into the cost of the larger macroblock. • SSIM, bitrate and PSNR are close to the values obtained without optimizations. • Achieved data parallelism. • With little modification to the code, the approach scales to better hardware and higher video resolutions.

  46. Limitations • In a sequential implementation, when the partial SAD accumulated up to the k-th row (k < 8) exceeds the current best SAD, there is no need to compute further. Because the threads work concurrently, no best SAD is available until each thread has finished its calculation, so this early termination cannot be used. • The search range cannot be modified while encoding is in progress. • Since the implementation is tied to the GPU hardware, the performance largely depends on the type of hardware used.
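
The early termination that the parallel version gives up can be sketched for the sequential case as follows. This is a minimal C illustration with a hypothetical function name: the SAD is accumulated row by row and abandoned as soon as the partial sum reaches the best SAD found so far. Both blocks are assumed to start at the given pointers and to share the same row stride.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Row-by-row 8x8 SAD with early termination (sequential technique).
 * Stops once the partial sum reaches best_so_far, since this candidate can
 * no longer win. Returns the partial or full sum; *rows_done (optional)
 * receives the number of rows that were actually processed. */
unsigned sad8x8_early_exit(const uint8_t *cur, const uint8_t *ref, int stride,
                           unsigned best_so_far, int *rows_done)
{
    unsigned sad = 0;
    int y;
    assert(cur && ref && stride >= 8);
    for (y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x)
            sad += (unsigned)abs((int)cur[y * stride + x] -
                                 (int)ref[y * stride + x]);
        if (sad >= best_so_far) {
            ++y;          /* count the row just finished */
            break;
        }
    }
    if (rows_done)
        *rows_done = y;
    return sad;
}
```

When all candidates are evaluated concurrently there is no running `best_so_far` to compare against, which is why each thread has to compute all eight rows.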

  47. Future work • Other operations in H.264, like filtering and entropy encoding, can be parallelized. • Block dependencies are not considered in this approach. Handling them could be challenging but would result in higher compression efficiency. • Different profiles, like the High and Main profiles, can be used for implementation. • Different motion estimation algorithms can be implemented in parallel and later incorporated into H.264. • CUDA can be applied to HEVC, the next-generation video coding standard and successor to H.264. HEVC is known to be more complex than H.264.

  48. Thank You

  49. References [1] I.E. Richardson, “The H.264 advanced video compression standard”, 2nd Edition, Wiley, 2010. [2] S. Kwon, A. Tamhankar, and K.R. Rao, “Overview of H.264/MPEG-4 part 10”, Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 186-216, April 2006. [3] Draft ITU-T Recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC), Mar. 2003. [4] G. Sullivan, “Overview of international video coding standards (preceding H.264/AVC)”, ITU-T VICA Workshop, July 2005. [5] T. Wiegand, et al., “Overview of the H.264/AVC video coding standard”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003. [6] M. Jafari and S. Kasaei, “Fast intra- and inter-prediction mode decision in H.264 advanced video coding”, International Journal of Computer Science and Network Security, vol. 8, no. 5, pp. 1-6, May 2008. [7] W. Chen and H. Hang, “H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)”, 2008 IEEE International Conference on Multimedia and Expo, pp. 697-700, 26 April 2008.

  50. References [8] Y. He, I. Ahmad and M. Liou, “A software-based MPEG-4 video encoder using parallel processing”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 909-920, November 1998. [9] D. Marpe, T. Wiegand and G. J. Sullivan, “The H.264/MPEG-4 AVC standard and its applications”, IEEE Communications Magazine, vol. 44, pp. 134-143, Aug. 2006. [10] Z. Wang, et al., “Image quality assessment: From error visibility to structural similarity”, IEEE Transactions on Image Processing, vol. 13, pp. 600-612, April 2004. [11] G. Sullivan, P. Topiwala, and A. Luthra, “The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions”, SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, 2004. [12] A. Puri, X. Chen and A. Luthra, “Video coding using the H.264/MPEG-4 AVC compression standard”, Signal Processing: Image Communication, vol. 19, pp. 793-849, 2004. [13] K.R. Rao and P. Yip, Discrete cosine transform, Academic Press, 1990. [14] H. Yadav, “Optimization of the deblocking filter in H.264 codec for real time implementation”, M.S. Thesis, E.E. Dept, UT Arlington, 2006. [15] https://computing.llnl.gov/tutorials/parallel_comp/, Introduction to parallel computing.
