EE 5359 PROJECT PRESENTATION
FAST INTER AND INTRA MODE DECISION IN H.264 VIDEO CODEC BASED ON THREAD-LEVEL PARALLELISM
Project Guide: Dr. K. R. Rao
Tejas Sathe (1000731145)
Email ID: tejas.sathe@mavs.uta.edu
Introduction to H.264 [1] codec
• H.264/MPEG-4 Part 10, or AVC (Advanced Video Coding): a standard developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
• Widely used for video compression; currently one of the most commonly used formats for the recording, compression, and distribution of high-definition video.
• A video compression scheme that has become the worldwide digital video standard for consumer electronics and personal computers.
• Significantly improves rate-distortion efficiency, typically providing a factor of two in bit-rate savings compared with earlier standards.
H.264 profiles [1]
• Baseline Profile: Real-time conversational services, e.g. video conferencing and videophone
• Main Profile: Digital storage media and television broadcasting
• Extended Profile: Multimedia services over the Internet
• Four High Profiles: Content contribution, content distribution, and studio editing and post-processing
Fig. 1. H.264 Profiles [1]
How does the H.264 codec work?
• A codec is a device or computer program that encodes and/or decodes a signal or digital data stream.
• H.264 is a block-oriented, motion-compensation-based codec standard.
• An H.264 video encoder carries out prediction, transform and encoding processes to produce a compressed H.264 bit stream. The block diagram of the H.264 video encoder is shown in Fig. 2.
• A decoder carries out the complementary process of decoding, inverse transform and reconstruction to output a decoded video sequence. The block diagram of the H.264 video decoder is shown in Fig. 3.
H.264 Encoder Block Diagram [1]
Fig. 2. H.264 Encoder [1]
• Works in two paths:
• Forward path: Includes subtractor, transform, quantization and entropy coding.
• Reconstruction path: Includes adder, inverse transform, inverse quantization, intra prediction, deblocking filter, picture buffer, motion estimation, motion compensation and intra/inter mode decision.
H.264 Decoder Block Diagram [1]
Fig. 3. H.264 Decoder [1]
• The received bit stream is entropy decoded and rearranged to produce a set of quantized coefficients. These are rescaled and inverse transformed to give a difference macroblock.
• Using the header information decoded from the bit stream, a prediction macroblock P is created and added to the difference macroblock. The result is filtered to create a decoded macroblock.
Some highlighted features [2] of H.264 video codec
• Variable block-size motion compensation with small block sizes
• Quarter-sample-accurate motion compensation
• Multiple reference picture motion compensation
• Weighted prediction
• Improved “skipped” and “direct” motion inference
• Directional spatial prediction for intra coding
• In-the-loop deblocking filtering
• Context-adaptive entropy coding
• Flexible slice size
• Flexible macroblock ordering (FMO)
Intra-prediction [1], [3]
• A technique that extrapolates the edges of previously decoded parts of the current picture; it is applied in regions of pictures that are coded as intra.
• H.264 predicts intra-coded macroblocks to reduce the large number of bits that coding the original input signal directly would require.
• A prediction block is formed from previously reconstructed blocks (before deblocking filtering).
• The residual signal between the current block and the prediction is then encoded.
• One mode is selected from a total of 9 for each 4x4 or 8x8 luma block, from 4 for a 16x16 luma block, and from 4 for each chroma block.
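The selection step described above can be sketched in C. This is a minimal illustration, not the JM implementation: it covers only three of the nine 4x4 luma modes (vertical, horizontal and DC), assumes both the above and left neighbours are available, and uses the sum of absolute differences (SAD) as a stand-in for the full rate-distortion cost. The function and type names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Mode numbers follow the H.264 4x4 luma convention for these three modes. */
enum { PRED_VERTICAL = 0, PRED_HORIZONTAL = 1, PRED_DC = 2, NUM_SKETCH_MODES = 3 };

/* Build a 4x4 prediction from reconstructed neighbours:
 * above[] is the row above the block, left[] the column to its left. */
static void predict_4x4(int mode, const uint8_t above[4], const uint8_t left[4],
                        uint8_t pred[4][4])
{
    int dc = 4;                       /* rounding offset for the mean of 8 samples */
    for (int i = 0; i < 4; i++)
        dc += above[i] + left[i];
    dc >>= 3;

    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++) {
            if (mode == PRED_VERTICAL)
                pred[y][x] = above[x];        /* copy the row above downwards */
            else if (mode == PRED_HORIZONTAL)
                pred[y][x] = left[y];         /* copy the left column rightwards */
            else
                pred[y][x] = (uint8_t)dc;     /* flat block at the neighbour mean */
        }
    }
}

/* Sum of absolute differences between two 4x4 blocks. */
static int sad_4x4(uint8_t a[4][4], uint8_t b[4][4])
{
    int sad = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            sad += abs((int)a[y][x] - (int)b[y][x]);
    return sad;
}

/* Try each sketched mode and return the one with the lowest SAD.
 * Ties go to the lowest mode number. */
int best_intra_4x4_mode(uint8_t cur[4][4], const uint8_t above[4],
                        const uint8_t left[4], int *best_cost)
{
    int best_mode = PRED_VERTICAL;
    int best = INT_MAX;

    assert(best_cost != NULL);
    for (int mode = 0; mode < NUM_SKETCH_MODES; mode++) {
        uint8_t pred[4][4];
        predict_4x4(mode, above, left, pred);
        int cost = sad_4x4(cur, pred);
        if (cost < best) {
            best = cost;
            best_mode = mode;
        }
    }
    *best_cost = best;
    return best_mode;
}
```

A block whose rows all repeat the row above it is matched exactly by the vertical mode, so the residual to encode is zero.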
Intra-prediction Modes [4]
Fig. 4. 4x4 intra prediction modes [4]
Fig. 5. 16x16 intra prediction modes [4]
Inter-prediction [1], [14]
• It includes motion estimation (ME) and motion compensation (MC).
• ME/MC performs prediction: a predicted version of a rectangular array of pixels is generated by choosing another, similarly sized rectangular array of pixels from a previously decoded reference picture.
• The reference array is translated to the position of the current rectangular array to compensate for the motion in the video stream.
• Luma array sizes: 4x4, 4x8, 8x4, 8x8, 16x8, 8x16 and 16x16 pixels.
Fig. 6. Macroblock partitions: 16x16, 8x16, 16x8, 8x8 [14]
Fig. 7. Sub-macroblock partitions: 8x8, 4x8, 8x4, 4x4 [14]
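The block-matching idea behind motion estimation can be sketched as an exhaustive integer-sample search. This is only an illustration under simplifying assumptions: it searches whole-pixel offsets (H.264 refines to quarter-sample accuracy), uses SAD as the matching criterion, and skips candidates that fall outside the picture. The names `MotionVector` and `full_search` are hypothetical and do not come from the JM software.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int dx, dy;   /* displacement of the best match in the reference picture */
    int sad;      /* matching cost of that displacement */
} MotionVector;

/* SAD between the block at (cx, cy) in cur and the block at (rx, ry) in ref.
 * Both pictures are stored row by row with the same stride. */
static int block_sad(const uint8_t *cur, const uint8_t *ref, int stride,
                     int cx, int cy, int rx, int ry, int bw, int bh)
{
    int sad = 0;
    for (int y = 0; y < bh; y++)
        for (int x = 0; x < bw; x++)
            sad += abs((int)cur[(cy + y) * stride + cx + x] -
                       (int)ref[(ry + y) * stride + rx + x]);
    return sad;
}

/* Exhaustive search over all offsets in [-range, range] in both directions. */
MotionVector full_search(const uint8_t *cur, const uint8_t *ref,
                         int width, int height,
                         int bx, int by, int bw, int bh, int range)
{
    MotionVector best = { 0, 0, INT_MAX };

    assert(bx >= 0 && by >= 0 && bx + bw <= width && by + bh <= height);
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + bw > width || ry + bh > height)
                continue;   /* candidate lies outside the reference picture */
            int sad = block_sad(cur, ref, width, bx, by, rx, ry, bw, bh);
            if (sad < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

The same routine applies to any of the partition sizes listed above by changing `bw` and `bh`; smaller partitions mean more searches per macroblock, which is one reason motion estimation dominates encoding time.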
JM reference software [12]
• The JM reference software is used for implementation of the H.264 codec.
• The software package includes the configuration files encoder.cfg and decoder.cfg, through which input parameters such as the input sequence, frame rate, video resolution, bit rate, quantization parameter and profile can be set.
• The command used at the command prompt to run the H.264 encoder is: lencod.exe -f encoder.cfg
• encoder.cfg is parsed to obtain all the input parameters set by the user.
• JM software version used for testing: JM 17.2
• Latest version available: JM 18.0
Test Sequences
• akiyo_cif.yuv
• akiyo_qcif.yuv
• carphone_cif.yuv
• carphone_qcif.yuv
• container_cif.yuv
• container_qcif.yuv
Results obtained using original JM 17.2 reference software akiyo_qcif, 30 FPS, 30 Frames encoded
Results obtained using original JM 17.2 reference software carphone_qcif, 30 FPS, 30 Frames encoded
Results obtained using original JM 17.2 reference software container_qcif, 30 FPS, 30 Frames encoded
Results obtained using original JM 17.2 reference software akiyo_cif, 30 FPS, 30 Frames encoded
Results obtained using original JM 17.2 reference software carphone_cif, 30 FPS, 30 Frames encoded
Results obtained using original JM 17.2 reference software container_cif, 30 FPS, 30 Frames encoded
Need for fast mode decision
• Motion estimation in H.264 takes about 60 to 70 percent of the total encoding time.
• Mode selection for intra and inter prediction requires a considerable amount of computation and memory access.
• In rate-distortion optimization (RDO), every candidate mode is evaluated and the one with the lowest rate-distortion cost is selected. This improves coding efficiency at the price of increased computational complexity.
• For intra prediction alone, the H.264/AVC encoder performs 592 RDO calculations to select the best mode for one macroblock.
• The increased computational complexity limits implementation, especially on handheld devices with limited battery life.
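The figure of 592 is consistent with the commonly cited count of intra mode combinations per macroblock, under the assumption that it covers the 4x4 and 16x16 luma modes together with the chroma modes (the 8x8 luma modes of the High profiles are not included). The symbols below are notation introduced here for illustration: D is the distortion, R the bit rate, λ the Lagrange multiplier, and M4, M16 and M8 are the numbers of modes for 4x4 luma, 16x16 luma and 8x8 chroma blocks.

```latex
% Lagrangian cost that RDO minimizes over the candidate modes
\[ J = D + \lambda \cdot R \]

% Intra mode combinations for one macroblock:
% 16 luma 4x4 blocks with M_4 = 9 modes each, plus M_16 = 4 modes for the
% 16x16 luma block, each combined with M_8 = 4 chroma modes
\[ N_{\mathrm{RDO}} = M_8 \times (16 \times M_4 + M_{16})
                    = 4 \times (16 \times 9 + 4) = 592 \]
```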
How to make fast mode decision?
• The complexity of intra and inter mode selection can be reduced using a thread-level parallelism approach.
• The RDO mode decision algorithm can be implemented with thread-level parallelism in the H.264 encoder.
• This approach can efficiently resolve the dependences and exploit thread-level parallelism for fast mode decision.
• Challenge: Reduce the total encoding time without PSNR loss or bit-rate increase.
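One way such parallel mode evaluation might look with OpenMP is sketched below. It assumes the candidate modes can be costed independently of one another, which is a simplification: in a real encoder some modes depend on reconstructed neighbouring blocks, and resolving those dependences is the hard part. The names `select_mode_parallel`, `rd_cost_fn` and `toy_cost` are hypothetical; `toy_cost` is a placeholder, not an actual rate-distortion computation. Without OpenMP enabled at compile time the pragma is ignored and the loop runs sequentially with the same result.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MODES 16

/* Hypothetical callback: returns the RD cost of coding macroblock mb in a mode. */
typedef double (*rd_cost_fn)(int mode, const void *mb);

/* Placeholder cost with its minimum at mode 2; stands in for a real RD cost. */
static double toy_cost(int mode, const void *mb)
{
    double d = (double)(mode - 2);
    (void)mb;
    return d * d + 1.0;
}

/* Evaluate all candidate modes in parallel, then pick the cheapest one. */
int select_mode_parallel(const int *modes, int n, rd_cost_fn cost,
                         const void *mb, double *best_cost)
{
    double costs[MAX_MODES];
    int best = 0;

    assert(n > 0 && n <= MAX_MODES);

    /* Each iteration writes only its own slot of costs[], so the threads
     * never touch the same element and no synchronization is needed here. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
        costs[i] = cost(modes[i], mb);

    /* The comparison is cheap, so it is done sequentially after the
     * implicit barrier at the end of the parallel loop. */
    for (int i = 1; i < n; i++)
        if (costs[i] < costs[best])
            best = i;

    if (best_cost != NULL)
        *best_cost = costs[best];
    return modes[best];
}
```

Because the selected mode is the same whether the loop runs on one thread or several, this arrangement changes only the encoding time and leaves PSNR and bit rate untouched, which matches the challenge stated above.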
Multicore [6]
• An architecture design that places multiple processors on a single die (computer chip). Each processor is called a core. These designs, known as chip multiprocessors, allow single-chip multiprocessing.
• Computers with multiple processors or multicore CPUs are now common, making parallel processing available to the masses.
Fig. 8.
Parallel Processing [6], [15]
• Having multiple processing units in the hardware does not by itself make existing programs take advantage of them. To improve performance, programmers must build parallel processing into their programs so that the available hardware is fully utilized.
• The focus of software design and development should shift from sequential programming techniques to parallel and multithreaded programming techniques.
• To take advantage of multicore processors: understand the details of the software threading model as well as the capabilities of the platform hardware.
• Types of parallel processing: 1. Task based 2. Data based
Thread-level Parallelism [7]
• Software thread: A discrete sequence of related instructions that is executed independently of other instruction sequences.
• Hardware thread: An execution path that remains independent of other hardware execution paths.
• The tasks of an application are coded in a parallel programming environment and assigned to threads, which are then mapped to physical computation units for execution.
• Uses of thread-level parallelism:
1. To utilize available hardware resources efficiently
2. To achieve significant speed-up of the process
Methods for parallel processing [8], [15]
• MPI (Message Passing Interface): Best suited to systems with multiple processors, each with its own memory.
• OpenMP: Suited to shared-memory systems such as desktop computers, in which multiple processors share a single memory subsystem. Using OpenMP is like writing your own threads, except that the compiler generates the threading code from directives.
• SIMD intrinsics: Primitive functions that parallelize data processing at the CPU register level, using Single Instruction Multiple Data instruction sets, e.g. Intel's MMX, IBM's AltiVec and AMD's 3DNow!.
OpenMP [15], [16]
• An application programming interface: a popular, portable standard for shared-memory parallel programming, designed to introduce parallelism into existing sequential programs.
• Provides a collection of compiler directives, library routines and environment variables.
• At compile time, multithreaded program code is generated based on the compiler directives.
• In particular, loop-based data parallelism can be exploited easily.
• Use of shared and private data is supported.
• Based on cooperating threads running on multiple processors or cores.
OpenMP
• Thread creation and destruction: Fork-join pattern.
• Parallel region: Program code inside the parallel construct; executed in parallel by all threads of a team.
• At the end of a parallel region: Implicit barrier synchronization; only the master thread continues execution after the region.
• Methodology of parallelism using OpenMP [7]:
• Study the problem, sequential program, or code segment.
• Look for opportunities for parallelism.
• Try to keep all processors busy doing useful work.
Relating fork/join to code [18] Fig. 9. Fork/join model [18] in OpenMP
Compiler Directives [15], [16]
• Parallelism is controlled by compiler directives in OpenMP.
• In C and C++, directives are specified with the #pragma mechanism.
• General form of an OpenMP directive:
#pragma omp directive [clause [clause] ...]
• The most important directive is the parallel construct:
#pragma omp parallel [clause [clause] ...]
{ // structured block ... }
• Parallel construct: Specifies a program part that should be executed in parallel.
• A team of threads is created to execute the parallel region.
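A minimal sketch of the parallel construct and the fork-join pattern follows. The function name `count_team_threads` is hypothetical. Every thread of the team executes the structured block once, so the counter ends up equal to the team size; if the compiler is not invoked with OpenMP support, the pragmas are ignored and the function returns 1.

```c
#include <assert.h>

/* Fork a team of threads, let each one register itself, then join. */
int count_team_threads(void)
{
    int count = 0;              /* declared outside the region: shared */

    #pragma omp parallel        /* fork: a team of threads is created */
    {
        /* The increment is a read-modify-write on shared data, so it is
         * made atomic to keep concurrent updates from being lost. */
        #pragma omp atomic
        count++;
    }                           /* join: implicit barrier, master continues */

    assert(count >= 1);         /* at least the master thread ran the block */
    return count;
}
```

With GCC or Clang, OpenMP support is typically enabled with the `-fopenmp` flag, and the team size can be set through the `OMP_NUM_THREADS` environment variable.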
Precautions [17]
• Avoid data dependencies: No loop iteration should depend on data produced by a different iteration.
• Avoid race conditions: These arise when loop iterations executing on different threads read or write the same shared memory. Declare the affected variables private where possible.
• Ensure proper load balancing: Work must be divided equally among threads so that the processors are busy most of the time.
• Manage shared and private data.
• Examine all memory references, including references made by called functions.
• Declare sections of the code critical wherever necessary, e.g. #pragma omp critical
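Two common ways of handling shared data safely can be sketched as follows, assuming simple integer arrays as input. The function names are hypothetical. The first function uses a `reduction` clause, which gives each thread a private partial sum and combines them at the end; the second protects a shared minimum with `#pragma omp critical`. The critical version serializes every comparison, so it is shown for illustration of the directive rather than as an efficient design.

```c
#include <assert.h>

/* Sum an array. Without the reduction clause, concurrent updates of
 * the shared variable sum would be a race condition. */
long sum_with_reduction(const int *v, int n)
{
    long sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Find the minimum. The compare-and-update of the shared variable best
 * is placed in a critical section so only one thread runs it at a time. */
int min_with_critical(const int *v, int n)
{
    int best;

    assert(n > 0);
    best = v[0];

    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp critical
        {
            if (v[i] < best)
                best = v[i];
        }
    }
    return best;
}
```

In both loops the iterations are independent of one another apart from the protected update, so the result does not depend on how the iterations are divided among threads.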
Future Work
• Survey more literature on multicore processor architecture, hardware threads and multicore programming.
• Identify more independent tasks and parallel regions in the reference software.
• Implement various compiler directives, library routines and environment variables provided by OpenMP.
• Reduce encoding time using task-level parallelism.
• Compare results for at least 8 video sequences (CIF and QCIF) using the original JM 17.2 reference software and the modified one.
References
[1] Soon-kak Kwon, A. Tamhankar and K. R. Rao, “Overview of H.264/MPEG-4 part 10”, Video/Image Processing and Multimedia Communications, 2003.
[2] T. Wiegand, et al., “Overview of the H.264/AVC video coding standard”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.
[3] D. Marpe, T. Wiegand and G. J. Sullivan, “The H.264/MPEG-4 AVC standard and its applications”, IEEE Communications Magazine, vol. 44, pp. 134-143, Aug. 2006.
[4] J. Kim, et al., “Complexity reduction algorithm for intra mode selection in H.264/AVC video coding”, in J. Blanc-Talon et al. (Eds.): ACIVS 2006, LNCS 4179, pp. 454-465, Springer-Verlag Berlin Heidelberg, 2006.
[5] Ju-Ho Hyun, “Fast mode decision algorithm based on thread-level parallelization and thread slipstreaming in H.264 video coding”, Multimedia and Expo (ICME), 2010 IEEE International Conference.
[6] C. Hughes and T. Hughes, “Professional Multicore Programming: Design and Implementation for C++ Developers”, Wiley, 2010.
[7] S. Akhter and J. Roberts, “Multi-Core Programming: Increasing Performance through Software Multi-threading”, Intel Press, 2006.
[8] Eric Q. Li and Yen-Kuang Chen, “Implementation of H.264 Encoder on General-Purpose Processors with Hyper-Threading Technology”, Visual Communications and Image Processing 2004, edited by S. Panchanathan and B. Vasudev, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 5308.
[9] B. Jung, et al., “Adaptive Slice-Level Parallelism for Real-Time H.264/AVC Encoder with Fast Inter Mode Selection”, Multimedia Systems and Applications X, edited by S. Rahardja, J. W. Kim and J. Luo, Proc. of SPIE Vol. 6777, 67770J, 2007.
[10] S. Ge, X. Tian and Yen-Kuang Chen, “Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures”, ICICS-PCM 2003.
[11] I. Richardson, “The H.264 advanced video compression standard”, 2nd edition, Wiley, 2010.
[12] JM software: http://iphome.hhi.de/suehring/tml/
[13] J. Ren, et al., “Computationally efficient mode selection in H.264/AVC video coding”, IEEE Trans. Consumer Electronics, vol. 54, pp. 877-886, May 2008.
[14] H.264/MPEG-4 Part 10 White Paper: www.vcodex.com
[15] T. Rauber and G. Runger, “Parallel Programming for Multicore and Cluster Systems”, 2nd edition, pp. 339-341.
[16] OpenMP: http://openmp.org/wp/
[17] Intel Software Network: http://software.intel.com/en-us/articles/getting-started-with-openmp/
[18] System Overview of Threading: http://ranger.uta.edu/~walker/CSE%205343_4342_SPR11/Web/Lectures/Lecture-4-Threading%20Overview-Ch2.pdf