Evaluation of Data-Parallel Splitting Approaches for H.264 Decoding. Florian H. Seitner, Michael Bleyer, Ralf M. Schreier, Margrit Gelautz. International Conference on Advances in Mobile & Multimedia (MoMM 2008)
Outline • Introduction • Parallel H.264 Decoding • Evaluated Methods • Experimental Results • Conclusions
Introduction • The H.264 video standard is currently used in a wide range of video-related areas • Video content distribution • Television broadcasting • High coding efficiency • Quarter-pixel (Qpel) motion estimation • Variable block size • Multiple reference frames • Drawback: significantly increased CPU and memory loads
Introduction • Multi-core systems can be used to increase system performance • How should the H.264 decoding algorithm be distributed among multiple processing units? • The decoding load should be distributed equally • Data dependency issues • Inter-communication • Synchronization
Introduction • The aim of this work is to evaluate the behavior of different decoding approaches • Run-time complexity • Efficient core usage • Data transfers
Parallel H.264 Decoding: Functional and Data-Parallel Splitting • Functionally partitioned decoding system • Decoding tasks are assigned to individual processing cores • Each processing unit can be optimized for a certain task • Unequal workload distribution • High transfer rates for inter-communication
Parallel H.264 Decoding: Functional and Data-Parallel Splitting • Data-parallel decoding system • Macroblocks (MBs) are distributed among multiple processing units • Data dependencies between different cores must be minimized • The MB distribution onto the processing cores must achieve an equal workload balance
Parallel H.264 Decoding: The H.264 Decoder • The H.264 decoding process • [Figure: block diagram of the decoder. Labels: Encoded Bitstream; Parser (Stream Parsing, Entropy Decoder); Reconstructor, marked as data-parallel processing (Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Deblocking, Reference Frames)]
Parallel H.264 Decoding: Macroblock Dependencies • Data-parallel splitting of the decoder’s reconstruction module is challenging due to spatial and temporal dependencies • [Figure: MB dependencies for intra prediction, deblocking, and inter prediction]
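The spatial dependencies named on this slide can be sketched as a neighbor lookup. This is only an illustration: it assumes the neighborhood typically involved in H.264 intra prediction and deblocking (left, top-left, top, top-right), the function name `spatial_deps` is hypothetical, and temporal dependencies from inter prediction are not modeled.

```python
def spatial_deps(x, y, width):
    """Return the in-frame neighbors that MB (x, y) is assumed to depend on.

    Assumption (not from the slides): left, top-left, top and top-right
    neighbors, given in MB coordinates with (0, 0) at the top-left corner.
    """
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    # Drop neighbors that fall outside the frame.
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < width and cy >= 0]


if __name__ == "__main__":
    print(spatial_deps(1, 1, width=4))
```

Any partitioning scheme has to respect these edges: an MB can only be reconstructed once every neighbor returned here has been finished, regardless of which core owns it.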
Evaluated Methods: Overview • We compare the performance of five approaches to data-parallel splitting of the decoder’s reconstructor module • Single row (SR) approach • Multi-column (MC) approach • Blocking slice-parallel (SP) method • Non-blocking slice-parallel (NBSP) method • Diagonal (DG) approach
Evaluated Methods: Single Row Approach • The assignment of MBs to processors • N is the number of processors • Processor i (i = 0, 1, …, N - 1) is responsible for decoding the y-th row of MBs if (y mod N) = i • [Figure: MB assignment for 2, 4, and 8 cores]
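The row-to-processor rule can be written directly as a modulo operation. A minimal sketch, with the hypothetical function name `single_row_owner` and rows counted from 0:

```python
def single_row_owner(y, n_cores):
    """Return the index i of the processor that decodes MB row y (y mod N = i)."""
    return y % n_cores


if __name__ == "__main__":
    # Rows 0..5 on 2 cores alternate between processor 0 and processor 1.
    print([single_row_owner(y, 2) for y in range(6)])
```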
Evaluated Methods: Single Row Approach • An example of the SR approach (2 cores) • Processing a macroblock is assumed to take a constant 1 unit of time • [Figure: decoding progress at T = 2, 3, 8, 10, and 34]
Evaluated Methods: Single Row Approach • Advantages • Simplicity • Only a small start delay • Disadvantage • Many dependencies across processor assignment borders
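The unit-time model of this example can be imitated with a small scheduling simulation. The sketch below is not the authors' simulator: it assumes each core decodes its own MBs in raster-scan order, that an MB may start only after its left, top-left, top and top-right neighbors have finished, and the names `simulate` and `owner` are hypothetical. It is not meant to reproduce the T values shown on the slide, whose frame size is not given.

```python
def simulate(width, height, owner, n_cores):
    """Return the time at which the last MB finishes, at 1 time unit per MB."""
    # Each core gets a queue of its own MBs in raster-scan order.
    queues = [[] for _ in range(n_cores)]
    for y in range(height):
        for x in range(width):
            queues[owner(x, y)].append((x, y))

    done = set()            # MBs finished in earlier time steps
    pos = [0] * n_cores     # index of the next MB in each core's queue
    t = 0
    while len(done) < width * height:
        t += 1
        finished_now = []
        for c in range(n_cores):
            if pos[c] == len(queues[c]):
                continue    # this core has no work left
            x, y = queues[c][pos[c]]
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            in_frame = [d for d in deps if 0 <= d[0] < width and d[1] >= 0]
            if all(d in done for d in in_frame):
                finished_now.append((c, (x, y)))
            # otherwise the core stalls for this time step
        for c, mb in finished_now:
            done.add(mb)
            pos[c] += 1
    return t


if __name__ == "__main__":
    print(simulate(4, 4, lambda x, y: 0, 1))      # one core
    print(simulate(4, 4, lambda x, y: y % 2, 2))  # single row approach, 2 cores
```

Passing a different `owner` function is enough to try another partitioning scheme under the same assumptions.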
Evaluated Methods: Multi-column Approach • The assignment of MBs to processors • w is the width of a multi-column • Processor i (i = 0, 1, …, N - 1) is responsible for decoding an MB of the x-th column if iw ≤ x < (i + 1)w • [Figure: MB assignment for 2, 4, and 8 cores]
Evaluated Methods: Multi-column Approach • An example of the MC approach (2 cores) • Advantages • Fewer dependencies across processors • A processor has to wait for results only at the column boundaries • [Figure: decoding progress at T = 4, 5, 8, and 36]
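The column rule reduces to an integer division. A minimal sketch, assuming columns are counted from 0 and the ranges are half-open so that every column has exactly one owner; `multi_column_owner` is a hypothetical name:

```python
def multi_column_owner(x, w):
    """Return the processor index i with i*w <= x < (i + 1)*w."""
    return x // w


if __name__ == "__main__":
    # A frame 8 MBs wide, split into multi-columns of width 4 (2 cores).
    print([multi_column_owner(x, 4) for x in range(8)])
```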
Evaluated Methods: Slice-parallel Approach • The assignment of MBs to processors • h is the height of a slice • Processor i (i = 0, 1, …, N - 1) is responsible for decoding an MB of the y-th row if ih ≤ y < (i + 1)h • [Figure: MB assignment for 2, 4, and 8 cores]
Evaluated Methods: Slice-parallel Approach • An example of the SP approach in the blocking version (2 cores) • Disadvantages • Long start delay • Idle CPUs, lower core utilization • [Figure: decoding progress at T = 26, 32, and 58]
Evaluated Methods: Slice-parallel Approach • An example of the SP approach in the non-blocking version (2 cores) • No dependencies are considered across slice boundaries (slices are completely independent) • NBSP requires full control over the encoder • [Figure: decoding progress at T = 1 and 32]
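For slices, the same kind of rule applies to the row index instead of the column index. A minimal sketch, assuming rows are counted from 0 and half-open row ranges; `slice_owner` and `slice_rows` are hypothetical names:

```python
def slice_owner(y, h):
    """Return the processor index i with i*h <= y < (i + 1)*h."""
    return y // h


def slice_rows(i, h):
    """Return the MB rows that make up the slice of processor i."""
    return range(i * h, (i + 1) * h)


if __name__ == "__main__":
    # A frame 8 MB rows high, split into slices of height 4 (2 cores).
    print([slice_owner(y, 4) for y in range(8)])
    print(list(slice_rows(1, 4)))
```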
Evaluated Methods: Diagonal Approach • The assignment of MBs to processors • The first line of MBs is divided into equally sized columns • The assignment for each subsequent line is derived by left-shifting the MB assignment of the line above by one MB • [Figure: MB assignment for 2, 4, and 8 cores]
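The left-shift rule can be sketched as follows. Shifting by one MB per line means that MB (x, y) inherits the assignment of column x + y in the first line. The slide does not say which processor receives the MBs that enter at the right frame edge, so the sketch offers two guesses, both assumptions: clamping to the last processor, or wrapping around to processor 0. The name `diagonal_owner` and the `wrap` parameter are hypothetical.

```python
def diagonal_owner(x, y, w, n_cores, wrap=False):
    """Return the processor for MB (x, y) under a left-shifted column split.

    w is the column width used in the first line (y = 0).
    """
    index = (x + y) // w          # assignment of column x + y in line 0
    if wrap:
        return index % n_cores    # assumption: new MBs wrap to processor 0
    return min(index, n_cores - 1)  # assumption: new MBs go to the last core


if __name__ == "__main__":
    # 4 MBs per line, 2 cores, first-line column width 2.
    for y in range(3):
        print([diagonal_owner(x, y, 2, 2) for x in range(4)])
```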
Evaluated Methods: Diagonal Approach • An example of the DG approach • [Figure: decoding progress at T = 4, 10, 12, 13, 16, 18, 20, 23, 24, and 43]
Evaluated Methods: Diagonal Approach • Comparing the inter-processor dependencies introduced by the DG and MC approaches • Diagonal approach: dependencies for CPU 2 originate solely from MBs assigned to CPU 1 • Multi-column approach: MBs assigned to CPU 2 are also dependent on CPU 3
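This comparison can be checked on a toy frame by listing, for one core, the other cores whose MBs it depends on. The sketch rests on assumptions that are not in the slides: a left, top-left, top, top-right dependency neighborhood, a diagonal split in which MB (x, y) takes the assignment of column x + y in the first line (clamped to the last core), and 0-based core indices, so core 1 here corresponds to "CPU 2" on the slide. All function names are hypothetical.

```python
def foreign_cores(owner, core, width, height):
    """Return the set of other cores that MBs of `core` depend on."""
    result = set()
    for y in range(height):
        for x in range(width):
            if owner(x, y) != core:
                continue
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            for dx, dy in deps:
                if 0 <= dx < width and dy >= 0 and owner(dx, dy) != core:
                    result.add(owner(dx, dy))
    return result


W, N = 2, 3  # column width in MBs and number of cores (toy values)


def mc_owner(x, y):
    """Multi-column split: fixed vertical stripes of width W."""
    return x // W


def dg_owner(x, y):
    """Diagonal split: stripes shifted left by one MB per line (assumed)."""
    return min((x + y) // W, N - 1)


if __name__ == "__main__":
    print("MC:", foreign_cores(mc_owner, 1, width=6, height=4))
    print("DG:", foreign_cores(dg_owner, 1, width=6, height=4))
```

Under these assumptions the middle core of the multi-column split waits on both neighbors (the top-right MB at its right border belongs to the next core), while in the diagonal split the top-right neighbor always lies in the same stripe.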
Experimental Results: Overview • Test sequences • Parameters • GOP size = 14 • Search range = ±16 pixels • 5 reference frames
Experimental Results: Run-time Complexity • Two major indicators for the efficiency of a multi-core decoding system • Decoder’s run-time • A low run-time indicates a high system decoding performance • Number of data-dependency stalls occurring during the decoding process • The number of stalls provides an estimate of how efficiently the system’s computational resources are used
Experimental Results: Run-time Complexity • Speed-up in run-time • The speed increase for each parallelization approach in multiples of the single-core performance
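Speed-up in multiples of the single-core performance is the ratio of single-core run-time to multi-core run-time. The sketch below also derives a per-core efficiency, which is a standard companion metric and not something the slides report. The function names are hypothetical and the numbers in the usage lines are illustrative only, not measurements from the paper.

```python
def speedup(t_single, t_parallel):
    """Speed-up of the parallel decoder relative to the single-core run-time."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, n_cores):
    """Fraction of the ideal N-fold speed-up that is actually achieved."""
    return speedup(t_single, t_parallel) / n_cores


if __name__ == "__main__":
    # Illustrative values: 16 time units on one core, 10 on two cores.
    print(speedup(16, 10))
    print(efficiency(16, 10, 2))
```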
Experimental Results: Run-time Complexity • Stall cycles caused by data dependencies between the cores
Experimental Results: Inter-communication • Memory transfers to and from the external DRAM and between the cores’ local memories are expensive in terms of power consumption and transfer time • Core inter-communication • Loading reference data and deblocking pixels
Experimental Results: Inter-communication • Data transfer volume for reference data and deblocking information
Conclusions • In this study, we have evaluated five data-parallel approaches for the H.264 decoder • The run-time of each parallelization approach is influenced by the sizes and shapes of the frame partitions • Large, dependency-minimizing partitions cause less inter-communication between cores