Hierarchical Parallelization of an H.264/AVC Video Encoder A. Rodriguez, A. Gonzalez, and M.P. Malumbres IEEE PARELEC 2006
Outline • Introduction • Performance Analysis • Hierarchical H.264 Parallel Encoder • Experimental Results • Conclusions
Introduction: Background Knowledge (1/5) • Video Communication
Introduction: Background Knowledge (2/5) • H.264/AVC • Removes redundant and perceptually insignificant information • Reaching the limits of compression efficiency requires intensive computation • Applications: video on demand, video conferencing, live broadcasting, etc.
Introduction: Background Knowledge (3/5) • H.264/AVC encoder • High CPU demand • Low latency and real-time response are required • Platforms with supercomputing capabilities: • Clusters • Multiprocessors • Special-purpose devices
Introduction: Background Knowledge (4/5) • Cluster • A group of linked computers • Improves performance and/or availability over that of a single computer • Categories: • High-availability clusters • Load-balancing clusters • High-performance clusters
Introduction: Background Knowledge (5/5) • Message-passing parallelism • Message-passing runtimes and libraries: MPI • Multithread parallelism • OpenMP • Optimized libraries • SIMD extensions and specialized processing units: Intel IPP, AMD ACML, etc.
Introduction: Main Purpose (1/6) • Apply parallel processing to H.264 encoders in order to reduce encoding time • For a given: • Video quality and bit rate • Image resolution • Frame rate • Latency
Introduction: Main Purpose (2/6) • Hierarchical parallelization of the H.264 encoder • Two-level MPI message-passing parallelization: • GOP level • Slice level
Introduction: Main Purpose (3/6) • GOP-level parallelism • Good speed-up • High latency • [Figure: video sequence divided into consecutive GOPs]
Introduction: Main Purpose (4/6) • Example of latency: • 1 GOP = 10 frames • Frame rate = 30 frames/sec • Time to encode 1 GOP = 3 seconds • We have to encode 9 GOPs in parallel in order to achieve real-time response • Latency = 3 seconds
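A quick arithmetic check of this example, anticipating Little's law (N = X * R), which is introduced formally in the Performance Analysis section; a minimal Python sketch using the slide's values:

    # N = X * R: GOPs that must be encoded in parallel for real-time operation
    frames_per_gop = 10
    frame_rate = 30                       # frames per second
    r = 3.0                               # time to encode one GOP (seconds)
    x = frame_rate / frames_per_gop       # required throughput: 3 GOPs/sec
    n = x * r                             # GOPs in flight
    print(n)                              # 9.0 -> 9 GOPs in parallel; latency stays 3 s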
Introduction: Main Purpose (5/6) • Slice-level parallelism • Low latency • Lower coding efficiency
Introduction: Main Purpose (6/6) • Combination of both approaches • Goal: good speed-up and efficiency
Performance Analysis: Overview (1/2) • "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations" [2] • "A parallel implementation of H.26L video encoder" [1] • Combination of both: scalability and low latency
Performance Analysis: Overview (2/2) • [Figure: processing flow: the video sequence is split into GOPs; encoding GOPs in parallel increases throughput, encoding slices in parallel reduces latency]
Performance Analysis: Equation Definition • Little's law: N = X * R • N: number of GOPs processed in parallel • X: number of GOPs encoded per second • R: elapsed time between a GOP entering the system and the same GOP being completely encoded
Performance Analysis: Analysis (1/2) • If we have np nodes in the cluster and every GOP is decomposed into ns slices: • N = np / ns • R = RSEQ / (ns * Es) • RSEQ: sequential encoding time of a GOP • Es: parallel efficiency at the slice level
Performance Analysis: Analysis (2/2) • GOP throughput of the combined parallel encoder: X = N / R = (np / ns) * (ns * Es / RSEQ) = np * Es / RSEQ • If Es is significantly less than 1, throughput is affected negatively
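A minimal Python sketch of this relation; the symbols follow the definitions above, and the sample values are taken from the HDTV example on the following slides:

    # X = N / R = (np/ns) * (ns*Es / R_SEQ) = np * Es / R_SEQ
    def gop_throughput(np_nodes, es, r_seq):
        """GOPs encoded per second with np_nodes nodes and slice efficiency es."""
        return np_nodes * es / r_seq

    print(gop_throughput(20, 1.0, 5.0))   # 4.0 GOPs/sec with ideal slice efficiency
    print(gop_throughput(20, 0.8, 5.0))   # 3.2 GOPs/sec: Es < 1 hurts throughput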
Performance Analysis: Example (1/4) • Video sequence in HDTV format at 1280×720 • Frame rate = 60 frames/sec • Suppose the sequential H.264 encoder encodes one GOP (15 frames) in 5 seconds • Only one slice per frame is defined
Performance Analysis: Example (2/4) • To get real-time response, X has to equal 60 frames/sec, i.e. 4 GOPs/sec • By Little's law, N = X * R = 4 * 5 = 20 GOPs in parallel, i.e. np = 20 nodes (one node per GOP)
Performance Analysis: Example (3/4) • Combined with slice-level parallelism • Maximum allowed latency = 1 sec • Slice parallelism efficiency = 0.8
Performance Analysis: Example (4/4) • We set ns = 7 and N = 4, so the number of required nodes is adjusted to np = N * ns = 28 • The resulting latency R = 5 / (7 * 0.8) ≈ 0.89 s meets the 1-second limit while the throughput requirement of 4 GOPs/sec is preserved
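The numbers on this slide follow directly from the previous formulas; a short illustrative sketch (all input values are the ones stated in the example):

    import math

    r_seq = 5.0   # sequential encoding time of one GOP (seconds)
    x_req = 4.0   # required throughput: 60 fps / 15 frames per GOP = 4 GOPs/sec
    r_max = 1.0   # maximum allowed latency (seconds)
    es    = 0.8   # slice-level parallel efficiency

    ns = math.ceil(r_seq / (es * r_max))   # smallest ns with R <= 1 s   -> 7
    r  = r_seq / (ns * es)                 # resulting GOP latency       -> ~0.89 s
    n  = math.ceil(x_req * r)              # GOPs in flight (N = X * R)  -> 4
    np_nodes = n * ns                      # required number of nodes    -> 28
    print(ns, round(r, 2), n, np_nodes)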
Performance Analysis: Efficiency Estimation (1/5) • Why do we have to estimate Es? • It determines both throughput and latency • How do we estimate Es? • With a PAMELA (PerformAnce ModEling LAnguage) model
Performance Analysis: Efficiency Estimation (2/5) • The DPB (Decoded Picture Buffer) is updated in every node • Using MPI_Allgather • In this PAMELA model, MPI_Allgather is modeled as a binary-tree implementation
Performance Analysis: Efficiency Estimation (3/5) • The PAMELA model for parallel encoding of one frame is:
L = par (p = 1…ns) delay(ts) ; delay(tw) ;
    seq (i = 0…log2(ns)-1) par (j = 1…ns) delay(tL + tc * 2^i)
• ns: number of slices processed in parallel • ts: mean slice encoding time • tw: mean wait time due to variations in ts and global synchronization • tL: start-up time • tc: transmission time of one encoded slice
Performance Analysis: Efficiency Estimation (4/5) • The parallel time obtained by solving this model is T(L) = ts + tw + tAG, where tAG = log2(ns) * tL + (ns - 1) * tc • Efficiency Es can then be computed as the speed-up obtained with ns slices divided by ns (a sketch follows below)
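A small Python sketch of this cost model. The expression used for Es (speed-up divided by ns, assuming a sequential frame time of roughly ns * ts) is an illustrative assumption, not necessarily the exact formula used by the authors:

    import math

    # T(L) = ts + tw + tAG, with tAG = log2(ns)*tL + (ns - 1)*tc (binary-tree all-gather)
    def frame_time(ns, ts, tw, tl, tc):
        t_ag = math.log2(ns) * tl + (ns - 1) * tc
        return ts + tw + t_ag

    # Assumed efficiency: sequential frame time ~ ns*ts, so Es = (ns*ts / T(L)) / ns = ts / T(L)
    def slice_efficiency(ns, ts, tw, tl, tc):
        return ts / frame_time(ns, ts, tw, tl, tc)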
Performance Analysis: Efficiency Estimation (5/5) • Experimental estimations of the parameter values • Estimated efficiency for a slice-based parallel encoder
Performance Analysis: Slice Parallelism Scalability (1/4) • The feasible number of slices depends on the video resolution • [Figure: bit-rate increment (%) vs. number of MBs per slice]
Performance Analysis: Slice Parallelism Scalability (2/4) • Bit-rate overhead vs. number of slices per frame
Performance Analysis: Slice Parallelism Scalability (3/4) • PSNR loss vs. number of slices per frame
Performance Analysis: Slice Parallelism Scalability (4/4) • Encoding time vs. number of slices per frame
Hierarchical Parallel Encoder: Overview • In order to achieve both scalability and low latency, GOP- and slice-level parallelism are combined • At the first level: • The sequence is divided into GOPs (15 frames each) • Every GOP is assigned to a processor group inside the cluster • Each group encodes its GOP independently
Hierarchical Parallel Encoder: GOP Assignment Method • Local manager: requests work from the global manager • Global manager: informs of the GOP assignment by sending a message with the GOP number to the requesting local manager • Simple and load-balanced (a minimal sketch follows the framework slide below)
Hierarchical Parallel Encoder: Framework • [Figure: hierarchical H.264 parallel encoder; the global manager coordinates several processor groups, each with a local manager P0 and worker processes P1, P2]
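A minimal mpi4py sketch of the request-driven GOP assignment described above. The process layout (rank 0 as global manager, every other rank acting as the local manager of one group), the message tags and the sentinel value are illustrative assumptions, not the authors' implementation:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    NUM_GOPS = 16            # e.g. the Ayersroc test sequence
    DONE = -1                # sentinel telling a local manager to stop

    if rank == 0:                        # global manager
        next_gop = 0
        active = comm.Get_size() - 1     # local managers still running
        while active > 0:
            status = MPI.Status()
            comm.recv(source=MPI.ANY_SOURCE, tag=0, status=status)   # work request
            if next_gop < NUM_GOPS:
                comm.send(next_gop, dest=status.Get_source(), tag=1)
                next_gop += 1
            else:
                comm.send(DONE, dest=status.Get_source(), tag=1)
                active -= 1
    else:                                # local manager of one processor group
        while True:
            comm.send(None, dest=0, tag=0)        # ask for the next GOP
            gop = comm.recv(source=0, tag=1)
            if gop == DONE:
                break
            # ... encode the slices of GOP 'gop' with the group's workers ...

This on-demand scheme is what makes the assignment simple and naturally load-balanced: a faster group simply requests, and receives, more GOPs.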
Experimental Results: Environments (1/2) • Mozart: 4 biprocessor nodes with AMD Opteron 246 processors at 2 GHz, interconnected by switched Gigabit Ethernet • Aldebaran: SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network
Experimental Results: Environments (2/2) • 720×480 standard sequence Ayersroc, composed of 16 GOPs
Experimental Results: System Speedup (1/2) • Speed-up on Mozart
Experimental Results: System Speedup (2/2) • Speed-up on Aldebaran
Experimental Results: Encoding Latency • Mean GOP encoding time
Conclusions • A hierarchical parallel video encoder based on H.264/AVC was proposed. • Experimental results confirm the previous analysis, showing that a scalable and low-latency H.264 encoder can be obtained. • Some issues remain open, as mentioned in the previous sections.
References
[1] J.C. Fernández and M.P. Malumbres, "A parallel implementation of H.26L video encoder", in Proc. Euro-Par 2002 (LNCS 2400), pp. 830-833, Paderborn, 2002.
[2] A. Rodriguez, A. González and M.P. Malumbres, "Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations", IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004.
[3] A.J.C. van Gemund, "Symbolic performance modeling of parallel systems", IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003.
[4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers.