Optimization of H.264 High Profile Decoder for Pentium 4 Processor

Tarun Bhatia, University of Texas at Arlington, tarun@fastvdo.com

H.264 Decoder [block diagram]
Blocks: Bitstream Input, Entropy Decoding, Inverse Transform and Dequantization, Intra/Inter Mode Selection, Intra Prediction, Motion Compensation, Deblocking, Picture Buffering, Video Output

Optimization: Need
• H.264/AVC video coding introduces substantially more coding tools and coding options than earlier standards, so achieving the highest possible coding gain requires considerably more computation.
• Aggressive optimization is typically required for H.264 implementations to meet cost and power targets and to deliver real-time performance.

Sequences Used: Girl.264, Karate.264, Golf.264, Shore.264, Plane.264

H.264 Profiles [diagram of coding tools by profile]
• Baseline Profile: I slice, P slice, CAVLC, Arbitrary Slice Order (ASO), Flexible Macroblock Ordering (FMO), Redundant Slices
• Main Profile: I slice, P slice, CAVLC, B slices, Weighted Prediction, CABAC
• Extended Profile: Data Partition, SP Slice, SI Slice
• High Profile: Adaptive Block Size Transform, Perceptual Quantization Matrices

H.264 High Profile - features
• Main Profile + additional features
• 8x8 Integer DCT
• HVS matrices
• 8x8 Intra Prediction modes

Optimization: Levels
• Algorithm Level, e.g. DCT implementation
• Compiler Level (Microsoft Visual Studio .NET 2003 / Intel C++ compiler v8.0)
• Implementation Level, e.g.
  Elimination of loops and conditions
  Using SIMD for implementation
  Multithreading

Target Platform: Pentium 4 Processor
Intel SIMD Architecture [diagram]
• 8 XMM registers (128 bits each)
• MXCSR (32 bits)
• 8 MMX registers (64 bits each)
• 8 GPRs (32 bits each)
• x87 FP register file
• EFLAGS (32 bits)
• Diagram labels: FP, MMX, SSE/SSE2/SSE3, FP MOVE
• L1 data cache (8 KB, 4-way)

Intel HT (Hyper-Threading) Technology
Purpose: simultaneous execution of threads
[diagram; label: System Bus]

Optimization: Steps
• Optimization during code development
• Optimization after code development
1) Search for "hotspots" in the code
2) Analyze each hotspot, e.g. high call count, cache misses, slow implementation
3) Optimize the hotspots

Performance Profiling
• Intel VTune™ Performance Analyzer

Intel VTune Performance Analysis - Results (FastVDO H.264 HD High Profile Decoder)

Distribution of Decoder Time Consumption

SIMD
• Single Instruction, Multiple Data instructions
• Intel Pentium 4
  MMX (Multimedia Extension), from Pentium MMX onwards
  SSE (Streaming SIMD Extensions), from Pentium III onwards
  SSE2 (Streaming SIMD Extensions 2), from Pentium 4 onwards
• AMD Athlon 64
  3DNow!

SIMD Data Types [figure: packed data type layouts, up to 128 bits]
Labels: "Available in XMM registers in SSE Technology"; "Available in MMX and XMM registers"

SIMD Instructions: Types
• Packed Arithmetic (e.g. padd, pmul)
• Packed Logical (e.g. pand, por)
• Data Movement and Memory Access (mov)
• General Support (pack, unpack)
• Packed Shift (>>, <<)
• Packed Comparison (<=, ==)

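Case Study
void interpolation4x4(const pixel_data *forward_block, const pixel_data *backward_block, pixel_data *result)
{
    /* Rounded average of the forward and backward 4x4 blocks (16 pixels). */
    /* result must point to caller-provided storage for 16 pixels. */
    for (int i = 0; i < 16; i++) {
        result[i] = (forward_block[i] + backward_block[i] + 1) / 2;
    }
}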

MMX Code
void interpolation(pixel_data *forward_block, pixel_data *backward_block, pixel_data *result)
{
    __asm {
        pxor      mm7, mm7             // set mm7 to 0
        mov       EDX, 0x01010101      // EDX = 01 01 01 01
        mov       EAX, forward_block   // EAX = forward block starting address
        movd      mm3, EDX             // mm3: 00 00 00 00 01 01 01 01
        mov       EBX, backward_block  // EBX = backward block starting address
        punpcklbw mm3, mm7             // mm3: 00 01 00 01 00 01 00 01 (four words of 1)
        mov       ECX, result          // ECX = address of result
        movd      mm0, [EAX]           // mm0: fb[1:4]
        movd      mm1, [EBX]           // mm1: bb[1:4]
        movd      mm4, [EAX+4]         // mm4: fb[5:8]
        movd      mm5, [EBX+4]         // mm5: bb[5:8]
        punpcklbw mm0, mm7             // zero-extend fb[1:4] bytes to words
        punpcklbw mm1, mm7             // zero-extend bb[1:4] bytes to words
        punpcklbw mm4, mm7             // zero-extend fb[5:8] bytes to words
        punpcklbw mm5, mm7             // zero-extend bb[5:8] bytes to words
        paddw     mm0, mm1             // mm0: fb[1:4]+bb[1:4]
        paddw     mm4, mm5             // mm4: fb[5:8]+bb[5:8]
        paddw     mm0, mm3             // mm0: fb[1:4]+bb[1:4]+1
        paddw     mm4, mm3             // mm4: fb[5:8]+bb[5:8]+1
        psrlw     mm0, 1               // mm0: (fb[1:4]+bb[1:4]+1) >> 1
        psrlw     mm4, 1               // mm4: (fb[5:8]+bb[5:8]+1) >> 1
        packuswb  mm0, mm0             // mm0: r4 r3 r2 r1 r4 r3 r2 r1
        packuswb  mm4, mm4             // mm4: r8 r7 r6 r5 r8 r7 r6 r5
        movd      [ECX], mm0           // result[1:4] = low dword of mm0
        movd      [ECX+4], mm4         // result[5:8] = low dword of mm4
        // Repeat the same process for fb[9:16] and bb[9:16]
        emms                           // empty MMX state
    }
}
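For comparison, the same rounded average can be sketched with SSE2 intrinsics rather than inline MMX assembly. This is only an illustration, not the code used in the decoder: it assumes pixel_data is an 8-bit unsigned type and that the 16 pixels of each 4x4 block are contiguous in memory. The function name interpolation4x4_sse2 is hypothetical. The intrinsic _mm_avg_epu8 computes (a + b + 1) >> 1 on sixteen unsigned bytes at once.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical SSE2 variant: averages 16 contiguous 8-bit pixels with rounding. */
void interpolation4x4_sse2(const uint8_t *forward_block,
                           const uint8_t *backward_block,
                           uint8_t *result)
{
    assert(forward_block != NULL && backward_block != NULL && result != NULL);

    /* Unaligned loads, so the blocks need no particular alignment. */
    __m128i fb = _mm_loadu_si128((const __m128i *)forward_block);
    __m128i bb = _mm_loadu_si128((const __m128i *)backward_block);

    /* Per-byte (fb + bb + 1) >> 1, computed without 8-bit overflow. */
    __m128i avg = _mm_avg_epu8(fb, bb);

    _mm_storeu_si128((__m128i *)result, avg);
}
```

One 128-bit XMM operation covers the whole 4x4 block, so the unpack, add, shift and pack sequence of the MMX version is not needed here.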

SIMD Application Results
• Amdahl's Law: The Overall Speedup (O.S.) obtained by optimizing a portion p of the program by a factor s is
$$\text{O.S.} = \left( \frac{1}{1 - p + (p/s)} - 1 \right) \times 100\,\%$$
p: fraction of the code being optimized
s: speedup factor for that fraction of code
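A small worked example of the formula, as a sketch. The function name overall_speedup_percent and the sample values are hypothetical and are not measurements from the decoder. With p = 0.2 and s = 4, the overall speedup is (1 / (1 - 0.2 + 0.05) - 1) x 100 %, which is about 17.6 %.

```c
#include <assert.h>

/* Overall speedup in percent when a fraction p of the run time
   is accelerated by a factor s (Amdahl's law). */
double overall_speedup_percent(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return (1.0 / (1.0 - p + p / s) - 1.0) * 100.0;
}
```

The formula makes the limit visible: even as s grows without bound, the overall speedup cannot exceed (1 / (1 - p) - 1) x 100 %, so a module that takes a small share of decoding time yields a small overall gain however well it is optimized.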

Application to IDCT 4x4

IDCT 4x4 Comparison of % Time Consumed Of the Total Decoding Time

% Overall Speed up in Decoding Time with SIMD IDCT4x4

Application to Motion Compensation
The implementation of Motion Compensation can be divided into:
• Data Manipulation (SIMD not used)
• Interpolation (SIMD used)
  Half Pel Interpolation
  Quarter Pel Interpolation
  Linear Interpolation for B frames
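To show the kind of arithmetic that the interpolation step involves, here is a scalar sketch of H.264 luma half-pel interpolation in one dimension, using the standard 6-tap filter (1, -5, 20, 20, -5, 1) with rounding and clipping to 8 bits. It is a plain C reference, not the SIMD code from the decoder; the function names and the assumption of 8-bit samples are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-pel sample between p[2] and p[3], from six neighbouring
   integer-position samples p[0..5]. */
uint8_t half_pel_6tap(const uint8_t *p)
{
    assert(p != NULL);
    int sum = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5] + 16;
    if (sum < 0) {
        return 0;               /* clip low before shifting */
    }
    int v = sum >> 5;           /* divide by 32; rounding term already added */
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Computes n_out half-pel samples from a row holding n_out + 5 samples. */
void half_pel_row(const uint8_t *src, size_t n_out, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n_out; i++) {
        dst[i] = half_pel_6tap(src + i);
    }
}
```

Every output sample applies the same multiply-add pattern to neighbouring pixels, which is why this stage maps well onto packed SIMD arithmetic.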

Motion Compensation-% Time consumption (without MMX)

SIMD Application to Motion Compensation - Results

Motion Compensation - Results: Comparison of % Time Consumed

% Overall Speed up in Decoding Time with SIMD MC

Multithreading
• Definition: Multithreading is the ability of a program to multitask within itself. The program can split itself into separate "threads" of execution that appear to run concurrently.
• Waits are used to block a thread until a particular event hands over control
• Release is used to unblock the thread
• Semaphores: locking mechanism / counters that control access to shared resources used by multiple processes

Producer-Consumer Problem (Diagram)
[diagram: Producer Thread and Consumer Thread synchronized through Semaphores; labels: Wait, Release, Serial Execution of a Thread]

Producer-Consumer Problem (Algorithm)
• The Producer thread starts and initializes its data
• It waits for the Consumer thread
• If the Consumer thread is ready, it releases control to the Consumer thread
• In the meantime, the Producer thread completes one execution cycle and waits for the Consumer thread again
• When control is passed back to the Producer thread, the process repeats until the end condition is met
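The wait/release hand-off above can be sketched with two counting semaphores and a single shared slot. This sketch assumes POSIX threads and POSIX unnamed semaphores, which are an assumption of this example; the deck does not say which threading API the decoder uses. All names (shared_t, run_producer_consumer) are hypothetical, and the "work" is reduced to passing integers.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

typedef struct {
    sem_t item_ready;    /* released by producer, waited on by consumer */
    sem_t slot_free;     /* released by consumer, waited on by producer */
    int   slot;          /* single shared buffer slot */
    int   n_items;       /* end condition: number of items to pass */
    long  consumed_sum;  /* result accumulated by the consumer */
} shared_t;

static void *consumer_thread(void *arg)
{
    shared_t *sh = (shared_t *)arg;
    for (int i = 0; i < sh->n_items; i++) {
        sem_wait(&sh->item_ready);      /* block until the producer releases */
        sh->consumed_sum += sh->slot;   /* consume the item */
        sem_post(&sh->slot_free);       /* hand control back to the producer */
    }
    return NULL;
}

/* Runs the producer in the calling thread; returns the sum the consumer saw. */
long run_producer_consumer(int n_items)
{
    shared_t sh = { .n_items = n_items, .consumed_sum = 0 };
    pthread_t consumer;

    assert(n_items >= 0);
    sem_init(&sh.item_ready, 0, 0);     /* nothing produced yet */
    sem_init(&sh.slot_free, 0, 1);      /* the slot starts empty */
    pthread_create(&consumer, NULL, consumer_thread, &sh);

    for (int i = 0; i < n_items; i++) {
        sem_wait(&sh.slot_free);        /* wait for the consumer to be ready */
        sh.slot = i;                    /* produce one item */
        sem_post(&sh.item_ready);       /* release control to the consumer */
    }

    pthread_join(consumer, NULL);
    sem_destroy(&sh.item_ready);
    sem_destroy(&sh.slot_free);
    return sh.consumed_sum;
}
```

With a single slot the two threads strictly alternate on the shared data, while any work done outside the wait/release pair can still overlap.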

Multithreading in Video Coding
The codec can be multithreaded in two ways:
• Block Level
  Independent blocks can be executed as separate threads, e.g. slices in H.264, motion estimation, deblocking of non-reference frames
• GOP Level
  Closed GOP: group of frames that uses no reference frames from outside its own GOP
  Open GOP: group of frames that can use reference frames from outside its own GOP

Proposed Multithreading Architecture - features
• GOP Level (Closed GOP)
• 30 frames per GOP
• IPPPPPPP…P
• Each GOP begins with an I frame and contains P frames only (i.e. 1 I frame and 29 P frames in each GOP)
• B frames are not used in the design, to maintain the closed GOP structure

Proposed Multithreading Architecture
[diagram; blocks: Main Thread, Get IDR Position, Decoder 0, Decoder 1, …, Decoder N]

Multithreaded Decoder - Threads
• Main Thread
  Creates all threads and semaphores
  Gets SPS and PPS NALUs from the bitstream
  Initializes multiple decoders with SPS and PPS NALUs
• Get IDR Frame Position Thread
  Searches for IDR NALU positions in the bitstream
  Manages Waits and Releases of semaphores
• Decoder Threads
  Decode H.264 GOPs
SPS: Sequence Parameter Set
PPS: Picture Parameter Set
NALU: Network Abstraction Layer Unit
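The IDR search can be sketched as a scan for start codes in an H.264 byte stream. This assumes the Annex B byte-stream format, where each NALU follows a 00 00 01 start code prefix and the low five bits of the first NALU byte hold nal_unit_type (5 for an IDR slice, 7 for SPS, 8 for PPS). The function name find_idr_positions is hypothetical, and the deck does not show how the decoder performs this search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAL_TYPE_IDR 5  /* nal_unit_type of a coded slice of an IDR picture */

/* Scans buf for start codes and records the offset of the NALU header byte
   of every IDR slice. Returns the number of offsets written (at most max_pos). */
size_t find_idr_positions(const uint8_t *buf, size_t len,
                          size_t *pos, size_t max_pos)
{
    size_t count = 0;
    assert(buf != NULL && pos != NULL);

    for (size_t i = 0; i + 3 < len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
            uint8_t nal_type = buf[i + 3] & 0x1F;
            if (nal_type == NAL_TYPE_IDR && count < max_pos) {
                pos[count++] = i + 3;
            }
            i += 2;  /* skip the rest of the start code prefix */
        }
    }
    return count;
}
```

With closed GOPs, each recorded offset marks a point where an independent decoder thread could begin, since no frame after an IDR picture refers to frames before it.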

Multithreading - Results: % Speed up in Decoding Time vs. Number of Threads [chart]

Multithreading - Results: Threading Overhead (Time in seconds) vs. No. of Threads [chart]

Further Research
• Optimization of the High Profile HD (720p) encoder to minimize hardware requirements
• Testing of the H.264 encoder and decoder on multicore CPUs
• Implementation of time-consuming modules of the H.264 encoder and decoder on a GPU (Graphics Processing Unit)

References
• H.264: International Telecommunication Union, "Recommendation ITU-T H.264: Advanced Video Coding for Generic Audiovisual Services," ITU-T, 2005.
• MPEG-2: ISO/IEC JTC1/SC29/WG11 and ITU-T, "ISO/IEC 13818-2: Information Technology-Generic Coding of Moving Pictures and Associated Audio Information: Video," ISO/IEC and ITU-T, 1994.
• Soon-kak Kwon, A. Tamhankar and K.R. Rao, "Overview of MPEG-4 Part 10".
• G. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions," SPIE Conference on Applications of Digital Image Processing XXVII, vol. 5558, pp. 53-74, Aug. 2004.
• The Software Optimization Cookbook, Intel Press, 2002.
• IA-32 Intel Architecture Optimization, Reference Manual, www.intel.com
• Optimization Applications with the Intel C++ and FORTRAN compilers, White paper, http://developer.intel.com/design/pentium4/manuals/
• J. Lee, S. Moon and W. Sun, "H.264 Decoder Optimization Exploiting SIMD Instructions," Seoul National University, http://sips03.snu.ac.kr/pub/conf/c67.pdf. Accepted at IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS), December 2004.
• G.M. Amdahl, "Validity of the single-processor approach to achieving large scale computing capabilities," in AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, Reston, Va., 1967, pp. 483-485.
• Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, "H.264/AVC Baseline Profile Decoder Complexity Analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 704-716, July 2003.

References: Continued
• http://www.blu-ray.com/
• http://www.hddvd.org/hddvd/
• http://www.fastvdo.com
• http://www.intel.com
• http://www.intel.com/software/products/vtune/
• http://msdn.microsoft.com

Thanks!!
