
Hierarchical Parallelization of an H.264/AVC Video Encoder

Hierarchical Parallelization of an H.264/AVC Video Encoder. A. Rodriguez, A. Gonzalez, and M.P. Malumbres. IEEE PARELEC 2006. Outline. Introduction Performance Analysis Hierarchical H.264 Parallel Encoder Experimental Results Conclusions. Introduction Background Knowledge (1/5).


Presentation Transcript


  1. Hierarchical Parallelization of an H.264/AVC Video Encoder • A. Rodriguez, A. Gonzalez, and M.P. Malumbres • IEEE PARELEC 2006

  2. Outline • Introduction • Performance Analysis • Hierarchical H.264 Parallel Encoder • Experimental Results • Conclusions

  3. Introduction: Background Knowledge (1/5) • Video Communication

  4. Introduction: Background Knowledge (2/5) • H.264/AVC • Remove sensitive redundant information • Reaching the limits of compression efficiency → intensive computation • Video on demand, video conference, live broadcasting, etc.

  5. Introduction: Background Knowledge (3/5) • H.264/AVC encoder • High CPU demand • Low latency • Real-time response • Platforms with supercomputing capabilities • Clusters • Multiprocessors • Special-purpose devices

  6. Introduction: Background Knowledge (4/5) • Cluster • A group of linked computers • Improves performance and/or availability over that provided by a single computer • Categorizations • High-availability clusters • Load-balancing clusters • High-performance clusters

  7. Introduction: Background Knowledge (5/5) • Message Passing Parallelism • Message passing runtimes and libraries → MPI • Multithread Parallelism • OpenMP • Optimized libraries • SIMD extensions and graphics processing units → Intel IPP, AMD ACML, etc.

  8. Introduction: Main Purpose (1/6) • Apply parallel processing to H.264 encoders in order to reduce computation intensity. • Given video quality and bit rate • Image resolution • Frame rate • Latency

  9. Introduction: Main Purpose (2/6) • Hierarchical parallelization of H.264 encoder • Two-level MPI message passing parallelization • GOP level • Slice level

  10. Introduction: Main Purpose (3/6) • GOP level parallelism • Good speed-up • High latency • [Figure: video sequence divided into consecutive GOPs]

  11. Introduction: Main Purpose (4/6) • Example of latency • 1 GOP = 10 frames • Frame rate = 30 frames/sec • Time for encoding 1 GOP = 3 seconds • Real-time response requires 30 / 10 = 3 GOPs/sec, so 3 * 3 = 9 GOPs have to be encoded in parallel • Latency = 3 seconds
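The arithmetic of this latency example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `gops_in_flight` and its parameter names are made up for this example, and it assumes pure GOP-level parallelism in which one node encodes a whole GOP sequentially.

```python
def gops_in_flight(frames_per_gop, frame_rate, gop_encode_seconds):
    """Apply Little's law, N = X * R, to GOP-level parallelism only."""
    # X: GOPs that must be completed per second to keep up with the source.
    required_throughput = frame_rate / frames_per_gop
    # R: with one node per GOP, the residence time of a GOP is simply
    # its sequential encoding time.
    latency = gop_encode_seconds
    # N: number of GOPs that must be in progress at the same time.
    return required_throughput * latency, latency


# Values taken from the example: 10-frame GOPs, 30 frames/sec, 3 s per GOP.
n, latency = gops_in_flight(frames_per_gop=10, frame_rate=30, gop_encode_seconds=3)
print(n, latency)
```

Adding more nodes in this scheme raises throughput but leaves the 3-second latency untouched, which is the drawback attributed to GOP-level parallelism.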

  12. Introduction: Main Purpose (5/6) • Slice level parallelism • Low latency • Lower coding efficiency

  13. Introduction: Main Purpose (6/6) • Combination of both approaches • Speed-up • Efficiency

  14. Performance Analysis: Overview (1/2) • “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations” [2] • “A Parallel implementation of H.26L video encoder” [1] • Combination • Scalability and low latency

  15. Performance Analysis: Overview (2/2) • Processing flow • [Figure: video sequence processed as a series of GOPs; labels: "Increase throughput", "Reduce latency"]

  16. Performance Analysis: Equation definition • Little’s law • N = X * R • N : Number of GOPs processed in parallel. • X : Number of GOPs encoded per second. • R : Elapsed time between the moment a GOP enters the system and the moment the same GOP is completely encoded.

  17. Performance Analysis: Analysis (1/2) • If we have np nodes in the cluster and every GOP is decomposed into ns slices → N = np / ns • R = RSEQ / (ns * Es) • RSEQ : Sequential encoding time of a GOP • Es : Parallel efficiency of slice level

  18. Performance Analysis: Analysis (2/2) • GOP throughput of the combined parallel encoder, from Little’s law with N = np / ns and R = RSEQ / (ns * Es): X = N / R = (np * Es) / RSEQ • If Es is significantly less than 1, throughput is negatively affected

  19. Performance Analysis: Example (1/4) • Video sequence in HDTV format at 1280×720 • Frame rate = 60 frames/sec • We suppose that the sequential H.264 encoder encodes one GOP (15 frames) in 5 seconds • Only one slice per frame is defined

  20. Performance Analysis: Example (2/4) • To get real-time response, X has to be equal to 60 frames/sec, i.e. 4 GOPs/sec → np = N = X * R = 4 * 5 = 20 nodes

  21. Performance Analysis: Example (3/4) • Combined with slice level parallelism • Maximum allowed latency = 1 sec • Slice parallelism efficiency = 0.8

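  22. Performance Analysis: Example (4/4) • We set ns to 7 and N to 4, so the number of required nodes is adjusted to np = N * ns = 28 • Throughput: X = (np * Es) / RSEQ = (28 * 0.8) / 5 = 4.48 GOPs/sec • Latency: R = RSEQ / (ns * Es) = 5 / (7 * 0.8) ≈ 0.89 sec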
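The choice of ns = 7 and N = 4 follows from the latency bound and Little's law. A minimal sketch of that sizing procedure, assuming the formulas N = np / ns, R = RSEQ / (ns * Es) and N = X * R from the analysis slides; the function `size_cluster` and its argument names are hypothetical, and rounding both ns and N up to whole numbers is an assumption about how the slide's values were obtained.

```python
import math


def size_cluster(r_seq, gops_per_second, max_latency, slice_efficiency):
    """Pick slices per GOP (ns), GOP groups (N) and total nodes (np)."""
    # Smallest ns such that R = RSEQ / (ns * Es) stays within the latency bound.
    ns = math.ceil(r_seq / (max_latency * slice_efficiency))
    latency = r_seq / (ns * slice_efficiency)
    # Little's law N = X * R, rounded up to a whole number of GOP groups.
    n_groups = math.ceil(gops_per_second * latency)
    np_nodes = n_groups * ns
    # Throughput actually delivered by the rounded-up configuration.
    throughput = np_nodes * slice_efficiency / r_seq
    return ns, n_groups, np_nodes, throughput, latency


# Example values: RSEQ = 5 s, X = 4 GOPs/sec, latency <= 1 s, Es = 0.8.
ns, n_groups, np_nodes, throughput, latency = size_cluster(5, 4, 1, 0.8)
print(ns, n_groups, np_nodes, round(throughput, 2), round(latency, 2))
```

Because both quantities are rounded up, the resulting configuration slightly exceeds the required 4 GOPs/sec while staying under the 1-second latency limit.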

  23. Performance Analysis: Efficiency Estimation (1/5) • Why do we have to estimate Es? • Throughput • Latency • How do we estimate Es? • PAMELA (PerformAnce ModEling LAnguage) model

  24. Performance Analysis: Efficiency Estimation (2/5) • Update the DPB (Decoded Picture Buffer) in every node • Using MPI_Allgather • In this PAMELA model, MPI_Allgather is implemented using a binary tree

  25. Performance Analysis: Efficiency Estimation (3/5) • The PAMELA model for encoding one frame in parallel is: L = par (p = 1…ns) delay(ts); delay(tw); seq (i = 0…log2(ns)-1) par (j = 1…ns) delay(tL + tc * 2^i) • ns : The number of slices processed in parallel • ts : The mean slice encoding time • tw : The mean wait time due to variations in ts and global synchronization • tL : Start-up time • tc : Transmission time of one encoded slice
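The communication part of the model is a sequence of log2(ns) steps, where step i costs tL + tc * 2^i. As a worked step, and assuming ns is a power of two so that log2(ns) is an integer, summing these costs gives the all-gather time (written tAG where the model is solved):

```latex
\[
t_{AG}
  = \sum_{i=0}^{\log_2 n_s - 1} \left( t_L + t_c \, 2^{i} \right)
  = t_L \log_2 n_s + t_c \left( 2^{\log_2 n_s} - 1 \right)
  = t_L \log_2 n_s + (n_s - 1) \, t_c
\]
```

The start-up cost therefore grows logarithmically with the number of slices, while the transmission cost grows linearly, since every node ends up receiving the other ns - 1 encoded slices.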

  26. Performance Analysis: Efficiency Estimation (4/5) • The parallel time obtained by solving this model is T(L) = ts + tw + tAG, where tAG = log2(ns) * tL + (ns - 1) * tc • Efficiency can be computed from this parallel time
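A short sketch of how an efficiency figure could be derived from T(L). The efficiency formula used here is an assumption, not taken from the slides: it treats the sequential time of one frame as ns * ts, so that Es = (ns * ts) / (ns * T(L)) = ts / T(L). The function names and the parameter values in the usage lines are hypothetical placeholders, not the measured values of the presentation.

```python
import math


def allgather_time(ns, t_l, t_c):
    """tAG = log2(ns) * tL + (ns - 1) * tc (binary-tree all-gather)."""
    return math.log2(ns) * t_l + (ns - 1) * t_c


def frame_parallel_time(ns, t_s, t_w, t_l, t_c):
    """T(L) = ts + tw + tAG, the parallel time to encode one frame."""
    return t_s + t_w + allgather_time(ns, t_l, t_c)


def slice_efficiency(ns, t_s, t_w, t_l, t_c):
    """Assumed definition: Es = ts / T(L), taking ns * ts as sequential time."""
    return t_s / frame_parallel_time(ns, t_s, t_w, t_l, t_c)


# Placeholder timings in seconds, chosen only to exercise the functions.
for ns in (2, 4, 8, 16):
    print(ns, round(slice_efficiency(ns, t_s=0.1, t_w=0.01, t_l=0.001, t_c=0.002), 3))
```

With fixed ts and tw, the estimate decreases as ns grows because tAG increases; in a real encoder ts would also shrink as each slice gets smaller, which this sketch deliberately does not model.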

  27. Performance Analysis: Efficiency Estimation (5/5) • The experimental estimations of parameter values • Estimated efficiency for a slice-based parallel encoder

  28. Performance Analysis: Slice Parallelism Scalability (1/4) • The feasible number of slices will depend on the video resolution • [Figure: bit rate increment (%) vs. number of MBs per slice]

  29. Performance Analysis: Slice Parallelism Scalability (2/4) • Bit rate overhead vs. number of slices per frame

  30. Performance Analysis: Slice Parallelism Scalability (3/4) • PSNR loss vs. number of slices per frame

  31. Performance Analysis: Slice Parallelism Scalability (4/4) • Encoding time vs. number of slices per frame

  32. Hierarchical Parallel Encoder: Overview • In order to achieve scalability and low latency • Combine GOP and slice level parallelism • In the first level • Divide the sequence into GOPs (15 frames) • Every GOP is assigned to a processor group inside the cluster • Each group encodes independently

  33. Hierarchical Parallel Encoder: GOP assignment method • Local manager • Communicates with the global manager • Global manager • Informs the requesting local manager of its GOP assignment by sending a message with the GOP number • Simple and load-balanced
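The request/reply pattern between local managers and the global manager can be illustrated with a small simulation. This is a sketch under loud assumptions: the encoder described here uses MPI messages between processes, whereas this example substitutes Python threads and queues so that it runs on its own; `run_assignment`, the `None` end-of-work marker, and the request format are all invented for illustration.

```python
import queue
import threading


def run_assignment(num_gops, num_groups):
    """Simulate on-demand GOP assignment; returns {group: [GOP numbers]}."""
    requests = queue.Queue()  # local managers -> global manager
    replies = [queue.Queue() for _ in range(num_groups)]  # one per group
    assigned = {g: [] for g in range(num_groups)}

    def global_manager():
        next_gop, finished = 0, 0
        while finished < num_groups:
            group = requests.get()  # wait for any local manager to ask
            if next_gop < num_gops:
                replies[group].put(next_gop)  # reply with a GOP number
                next_gop += 1
            else:
                replies[group].put(None)  # hypothetical "no work left" marker
                finished += 1

    def local_manager(group):
        while True:
            requests.put(group)  # ask for the next GOP
            gop = replies[group].get()
            if gop is None:
                break
            assigned[group].append(gop)  # stand-in for encoding the GOP

    threads = [threading.Thread(target=global_manager)]
    threads += [threading.Thread(target=local_manager, args=(g,))
                for g in range(num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assigned


result = run_assignment(num_gops=16, num_groups=3)
print(sorted(gop for gops in result.values() for gop in gops))
```

Because a group only asks for a new GOP when it has finished the previous one, faster groups naturally receive more GOPs, which is where the load balancing comes from.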

  34. Hierarchical Parallel Encoder: Framework • Hierarchical H.264 parallel encoder • [Figure: a Global Manager above three processor groups, each containing processes P0, P1 and P2]

  35. Experimental Results: Environments (1/2) • Mozart • 4 dual-processor nodes with AMD Opteron 246 at 2 GHz, interconnected by switched Gigabit Ethernet • Aldebaran • SGI Altix 3700 with 44 Itanium II nodes, interconnected by a high-performance proprietary network

  36. Experimental Results: Environments (2/2) • Standard 720×480 sequence Ayersroc, which is composed of 16 GOPs

  37. Experimental Results: System Speedup (1/2) • Speed-up in Mozart

  38. Experimental Results: System Speedup (2/2) • Speed-up in Aldebaran

  39. Experimental Results: Encoding Latency • Mean GOP encoding time

  40. Conclusions • A hierarchical parallel video encoder based on H.264/AVC was proposed. • Experimental results confirm the results of the previous analysis, showing that a scalable, low-latency H.264 encoder can be obtained. • Some issues remain open, as mentioned in the previous section.

  41. References • [1] J.C. Fernández and M.P. Malumbres, “A Parallel implementation of H.26L video encoder”, in Proc. of Euro-Par 2002 (LNCS 2400), pp. 830-833, Paderborn, 2002. • [2] A. Rodriguez, A. González and M.P. Malumbres, “Performance evaluation of parallel MPEG-4 video coding algorithms on clusters of workstations”, IEEE Int. Conference on Parallel Computing in Electrical Engineering, pp. 354-357, Dresden, 2004. • [3] A.J.C. van Gemund, “Symbolic Performance Modeling of Parallel Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 2, Feb. 2003. • [4] P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc.
