
Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding

Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding. Florian H. Seitner, Michael Bleyer, Ralf M. Schreier, Margrit Gelautz. International Conference on Advances in Mobile & Multimedia (MoMM 2008). Outline: Introduction, Parallel H.264 Decoding, Evaluated Methods


Presentation Transcript


  1. Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding • Florian H. Seitner, Michael Bleyer, Ralf M. Schreier, Margrit Gelautz • International Conference on Advances in Mobile & Multimedia (MoMM 2008)

  2. Outline • Introduction • Parallel H.264 Decoding • Evaluated Methods • Experimental Results • Conclusions

  3. Introduction • The H.264 video standard is currently used in a wide range of video-related areas • Video content distribution • Television broadcasting • High coding efficiency • Qpel motion estimation • Variable block size • Multiple reference frames • These features significantly increase CPU and memory loads

  4. Introduction • Using multi-core systems to increase system performance • How can the H.264 decoding algorithm be distributed among multiple processing units? • The decoding load should be distributed equally • Data dependency issues • Inter-communication • Synchronization

  5. Introduction • The aim of this work is to evaluate the behavior of different decoding approaches • Run-time complexity • Efficient core usage • Data transfers

  6. Parallel H.264 Decoding: Functional and Data-parallel Splitting • Functionally partitioned decoding system • Decoding tasks are assigned to individual processing cores • Each processing unit can be optimized for a certain task • Unequal workload distribution • High transfer rate for inter-communication

  7. Parallel H.264 Decoding: Functional and Data-parallel Splitting • Data-parallel decoding system • Macroblocks (MBs) are distributed among multiple processing units • Data dependencies between different cores must be minimized • The MB distribution onto the processing cores must achieve a balanced workload

  8. Parallel H.264 Decoding: The H.264 Decoder • The H.264 decoding process • [Figure: decoder block diagram. Labels: Encoded Bitstream; Parser (Stream Parsing, Entropy Decoder); Reconstructor, Data-Parallel Processing (Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Deblocking, Reference Frames)]

  9. Parallel H.264 Decoding: Macroblock Dependencies • Data-parallel splitting of the decoder's reconstruction module is challenging due to spatial and temporal dependencies • [Figure: dependency diagrams for intra prediction, deblocking, and inter prediction]

  10. Evaluated Methods: Overview • Comparing the performance of five different approaches for data-parallel splitting of the decoder's reconstructor module • Single row approach • Multi-column approach • Blocking slice-parallel method • Nonblocking slice-parallel method • Diagonal approach

  11. Evaluated Methods: Single Row Approach • The assignment of MBs to processors • [Figure: MB assignment for 2, 4, and 8 cores] • N is the number of processors • Processor i (i = 0, 1, …, N - 1) is responsible for decoding the y-th row of MBs if (y mod N) = i
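The row-to-processor rule on this slide can be sketched in a few lines. This is only an illustration; the function name `single_row_owner` is hypothetical and does not come from the presentation.

```python
def single_row_owner(y: int, n_cores: int) -> int:
    """Return the index of the core that decodes MB row y (rule: y mod N = i)."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return y % n_cores


# Example: with 2 cores, rows alternate between core 0 and core 1.
rows = [single_row_owner(y, 2) for y in range(6)]
print(rows)  # [0, 1, 0, 1, 0, 1]
```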

  12. Evaluated Methods: Single Row Approach • An example of the SR approach (2 cores) • Processing a macroblock takes a constant 1 unit of time • [Figure: decoding progress snapshots at T = 2, T = 10, T = 34, T = 3, T = 8]
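Timing examples of this kind can be explored with a small simulation. The sketch below is not the authors' simulator: it assumes that each MB takes 1 time unit, that each core decodes its MBs in raster order, and that an MB depends on its left, upper-left, upper, and upper-right neighbours (a simplified intra/deblocking dependency model). The names `simulate` and `owner` are hypothetical.

```python
from typing import Callable


def simulate(width: int, height: int, n_cores: int,
             owner: Callable[[int, int], int]) -> int:
    """Return the time at which the last MB of a width x height frame finishes."""
    # Each core gets its MBs in raster order.
    queues = [[] for _ in range(n_cores)]
    for y in range(height):
        for x in range(width):
            queues[owner(x, y)].append((x, y))

    finish = {}                  # (x, y) -> finish time of that MB
    core_free = [0] * n_cores    # time at which each core becomes idle
    head = [0] * n_cores         # index of the next MB in each queue
    remaining = width * height

    while remaining:
        progressed = False
        for c in range(n_cores):
            if head[c] == len(queues[c]):
                continue
            x, y = queues[c][head[c]]
            # Assumed dependencies: left, upper-left, upper, upper-right.
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            deps = [(dx, dy) for dx, dy in deps
                    if 0 <= dx < width and dy >= 0]
            if all(d in finish for d in deps):
                # The core stalls until all dependencies are available.
                start = max([core_free[c]] + [finish[d] for d in deps])
                finish[(x, y)] = start + 1
                core_free[c] = start + 1
                head[c] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise RuntimeError("assignment leads to a deadlock")
    return max(finish.values())


# Single row approach on a tiny 4 x 2 frame with 2 cores.
print(simulate(4, 2, 2, lambda x, y: y % 2))
```

Under these assumptions the second core cannot start a row until the first core is two MBs ahead, which is the kind of start delay and cross-border dependency the single row slides describe.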

  13. Evaluated Methods: Single Row Approach • Advantages • Simplicity • Only a small start delay • Disadvantage • Many dependencies across processor assignment borders

  14. Evaluated Methods: Multi-column Approach • The assignment of MBs to processors • [Figure: MB assignment for 2, 4, and 8 cores] • w is the width of a multi-column • Processor i (i = 0, 1, …, N - 1) is responsible for decoding an MB of the x-th column if iw ≤ x < (i + 1)w
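The multi-column rule amounts to an integer division of the column index by the multi-column width. A minimal sketch, assuming the frame width in MBs equals N times w; the function name `multi_column_owner` is hypothetical.

```python
def multi_column_owner(x: int, w: int, n_cores: int) -> int:
    """Return the core i for MB column x, i.e. the i with i*w <= x < (i+1)*w."""
    i = x // w
    if not 0 <= i < n_cores:
        raise ValueError("column lies outside the N * w columns covered")
    return i


# Example: 8 MB columns, 2 cores, multi-column width w = 4.
print([multi_column_owner(x, 4, 2) for x in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```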

  15. Evaluated Methods: Multi-column Approach • An example of the MC approach (2 cores) • Advantages • Fewer dependencies across processors • A processor has to wait for results only at the column boundaries • [Figure: decoding progress snapshots at T = 4, T = 36, T = 5, T = 8]

  16. Evaluated Methods: Slice-parallel Approach • The assignment of MBs to processors • [Figure: MB assignment for 2, 4, and 8 cores] • h is the height of a slice • Processor i (i = 0, 1, …, N - 1) is responsible for decoding an MB of the y-th row if ih ≤ y < (i + 1)h
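The slice-parallel rule mirrors the multi-column rule, but partitions along the row index. A minimal sketch, assuming the frame height in MBs equals N times h; the function name `slice_owner` is hypothetical.

```python
def slice_owner(y: int, h: int, n_cores: int) -> int:
    """Return the core i for MB row y, i.e. the i with i*h <= y < (i+1)*h."""
    i = y // h
    if not 0 <= i < n_cores:
        raise ValueError("row lies outside the N * h rows covered")
    return i


# Example: 6 MB rows, 2 cores, slice height h = 3.
print([slice_owner(y, 3, 2) for y in range(6)])  # [0, 0, 0, 1, 1, 1]
```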

  17. Evaluated Methods: Slice-parallel Approach • An example of the SP approach in the blocking version (2 cores) • Disadvantages • Long start delay • Cores sit idle, resulting in low core usage • [Figure: decoding progress snapshots at T = 26, T = 32, T = 58]

  18. Evaluated Methods: Slice-parallel Approach • An example of the SP approach in the non-blocking version (2 cores) • No dependencies are considered across slice boundaries (slices are completely independent) • NBSP requires full control over the encoder • [Figure: decoding progress snapshots at T = 1, T = 32]

  19. Evaluated Methods: Diagonal Approach • The assignment of MBs to processors • The first line of MBs is divided into equally-sized columns • The assignments for the subsequent lines are derived by left-shifting the MB assignment of the line above • [Figure: MB assignment for 2, 4, and 8 cores]
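The diagonal assignment can be sketched as a multi-column assignment whose pattern shifts left on every line. The slide does not state the shift amount or what happens at the frame border, so the sketch assumes a shift of one MB per line and a cyclic wrap-around of the pattern; both are assumptions, as is the function name `diagonal_owner`.

```python
def diagonal_owner(x: int, y: int, w: int, n_cores: int) -> int:
    """Return the core for MB (x, y) under the assumed diagonal assignment.

    Line 0 is split into n_cores columns of width w. Each following line
    uses the assignment of the line above shifted left by one MB, with the
    pattern wrapping around cyclically (assumption).
    """
    return ((x + y) // w) % n_cores


# Example: frame of 4 x 3 MBs, 2 cores, column width w = 2.
for y in range(3):
    print([diagonal_owner(x, y, 2, 2) for x in range(4)])
# [0, 0, 1, 1]
# [0, 1, 1, 0]
# [1, 1, 0, 0]
```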

  20. Evaluated Methods: Diagonal Approach • An example of the DG approach • [Figure: decoding progress snapshots at T = 4, T = 10, T = 12, T = 16, T = 13, T = 18, T = 20, T = 23, T = 43, T = 24]

  21. Evaluated Methods: Diagonal Approach • Comparing the inter-processor dependencies introduced by the DG and MC approaches • Diagonal approach: dependencies for CPU 2 originate solely from MBs assigned to CPU 1 • Multi-column approach: MBs assigned to CPU 2 are also dependent on CPU 3

  22. Experimental Results: Overview • Test sequences • Parameters • GOP size = 14 • Search range = +/- 16 pixels • 5 reference frames

  23. Experimental Results: Run-time Complexity • Two major indicators for the efficiency of a multi-core decoding system • Decoder's run-time • A low run-time indicates high decoding performance • Number of data-dependency stalls occurring during the decoding process • The number of stalls provides an estimate of how efficiently the system's computational resources are used

  24. Experimental Results: Run-time Complexity • Speed-up in run-time • The speed increase for each parallelization approach in multiples of the single-core performance
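One way to write the speed-up described on this slide is as a ratio of run-times. The symbols below are assumed notation rather than taken from the presentation: T_1 is the run-time of the single-core decoder and T_N the run-time of a given parallelization approach on N cores.

```latex
S_N = \frac{T_1}{T_N}
```

A value of S_N = N would correspond to ideal scaling; data-dependency stalls push S_N below N.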

  25. Experimental Results: Run-time Complexity • Stall cycles caused by data dependencies between the cores

  26. Experimental Results: Inter-communication • Memory transfers to and from the external DRAM and between the cores' local memories are expensive in terms of power consumption and transfer time • Core inter-communication • Loading reference data and deblocking pixels

  27. Experimental Results: Inter-communication • Data transfer volume for reference data and deblocking information

  28. Conclusions • In this study, we have evaluated five data-parallel approaches for the H.264 decoder • The run-time of each parallelization approach is influenced by the sizes and shapes of the frame partitions • Large, dependency-minimizing partitions cause less inter-communication between cores
