
Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors

Ngai-Man Cheung, Oscar C. Au, Senior Member, IEEE, Man-Cheung Kung, Peter H.W. Wong, Senior Member, IEEE, and Chun Hung Liu. CSVT, November 2009.


Presentation Transcript


  1. Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors Ngai-Man Cheung, Oscar C. Au, Senior Member, IEEE, Man-Cheung Kung, Peter H.W. Wong, Senior Member, IEEE, and Chun Hung Liu CSVT NOVEMBER 2009

  2. Outline • Introduction • Intra-Prediction • Parallel RD Optimized Intra-Mode Decision • Experiments • Conclusion

  3. Introduction • Multicore graphics processors • Graphics processing units (GPUs) serve as coprocessors for CPUs, accelerating numerical and signal processing applications thanks to their high-performance multicore and pipelined architectures • This work investigates the use of GPUs to perform RD optimized intra-mode selection in AVS and H.264

  4. Difficulties • Intra-mode decision • Dependency between the current block and its adjacent blocks • Determining the encoding bit-rate for each candidate mode may require conditional branching

  5. Contributions • Analyze the dependency constraints in intra-mode decision • Propose a strategy to determine the mode decisions of video blocks in parallel • Encode the blocks in novel orders • Extend a bit-rate approximation method to estimate the rate in RD cost computation

  6. Intra-prediction in H.264 (4 × 4) • Figure: (a) 4 × 4 current block and its neighboring reconstructed pixels. (b) Prediction directions and their corresponding modes (0, 1, 3-8); mode 2 is the DC mode.
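As a rough illustration of what an intra-mode decision evaluates, the sketch below builds the vertical, horizontal, and DC predictions for one 4 × 4 block from its top and left reconstructed neighbors and picks the cheapest one. It is a simplification: only three of the nine H.264 4 × 4 modes are covered, and the sum of absolute differences (SAD) stands in for the full RD cost. The function names are illustrative and do not come from the presentation.

```python
def predict_4x4(mode, top, left):
    """Return a 4x4 prediction from 4 top and 4 left reconstructed pixels."""
    if mode == 0:  # vertical: copy the top row downward
        return [list(top) for _ in range(4)]
    if mode == 1:  # horizontal: copy the left column across
        return [[left[r]] * 4 for r in range(4)]
    if mode == 2:  # DC: rounded mean of the eight neighboring pixels
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("mode not covered by this sketch")


def sad(block, pred):
    """Sum of absolute differences; a stand-in for the RD cost (assumption)."""
    return sum(abs(b - p) for rb, rp in zip(block, pred) for b, p in zip(rb, rp))


def best_mode(block, top, left, modes=(0, 1, 2)):
    """Evaluate every candidate mode and return (cheapest mode, all costs)."""
    costs = {m: sad(block, predict_4x4(m, top, left)) for m in modes}
    return min(costs, key=costs.get), costs


top = [10, 20, 30, 40]
left = [10, 10, 10, 10]
block = [[10, 20, 30, 40] for _ in range(4)]  # every row equals the top row
mode, costs = best_mode(block, top, left)
print(mode, costs)
```

Every candidate mode needs the reconstructed `top` and `left` pixels, which is the source of the block dependencies analyzed in the following slides.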

  7. Intra-prediction in AVS 1.0 (8 × 8) • Vertical mode • Horizontal mode • Down-left mode: bidirectional prediction • DC mode • Down-right mode

  8. Dependency Analysis • Dependency constraints on block encoding order • Prediction direction • The RD costs of the current block are hard to determine before all the candidate reference blocks have been encoded and reconstructed • Pixel filtering (AVS) • Filtering may be applied to the reconstructed pixels of the adjacent blocks before they are used in prediction; this filtering may involve pixels from several blocks, leading to additional block dependency

  9. Dependency Analysis • Dependency between the four 8 × 8 blocks (K1-K4) in the current macroblock and their spatially adjacent neighbor blocks (T1-T4, L1, L2) in AVS intra-prediction

  10. Dependency Analysis • Dependency between the four 4 × 4 blocks (K1-K4) in the current 8 × 8 block and their spatially adjacent neighbor blocks (T1-T4, L1, L2) in H.264 intra-prediction

  11. Dependency Analysis • The dependency relationships form directed acyclic graphs • Parallelize the RD cost computation of the four constituent blocks of the same 16 × 16 macroblock (MB) • Compute in parallel the RD costs of blocks from different 16 × 16 MBs

  12. Greedy-Based Block Encoding Order • In each iteration, encode every block whose reference reconstructed pixels are all available.
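The greedy rule can be sketched as a level-by-level traversal of the dependency graph. The example below is a minimal sketch, not the authors' implementation. It assumes a simplified dependency model in which every block references its left, top-left, top, and top-right neighbors inside the frame; the real AVS and H.264 dependencies differ in detail (for example, some top-right neighbors are unavailable, and AVS filtering adds constraints). `greedy_order` and `grid_deps` are hypothetical names.

```python
def greedy_order(deps):
    """deps maps each block to the set of blocks it references.

    Returns a list of iterations; each iteration lists the blocks whose
    references are all available, so they can be processed in parallel.
    """
    done = set()
    remaining = set(deps)
    schedule = []
    while remaining:
        ready = sorted(b for b in remaining if deps[b] <= done)
        if not ready:
            raise ValueError("dependency cycle: the graph is not a DAG")
        schedule.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return schedule


def grid_deps(rows, cols):
    """Simplified (assumed) model: left, top-left, top, top-right neighbors."""
    deps = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1)]
            deps[(r, c)] = {
                (rr, cc) for rr, cc in candidates
                if 0 <= rr < rows and 0 <= cc < cols
            }
    return deps


schedule = greedy_order(grid_deps(3, 4))
for i, blocks in enumerate(schedule, start=1):
    print(i, blocks)
```

Under this assumed model, the 12 blocks of a 3 × 4 grid are covered in 8 iterations, with up to two blocks processed in parallel per iteration. Larger frames expose proportionally more parallel blocks per iteration.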

  13. Greedy-Based Block Encoding Order • AVS Example

  14. Greedy-Based Block Encoding Order • AVS example • AVS example (modified version) • Postpone the encoding of several blocks along the left frame boundary • All four constituent blocks of any MB can then be encoded consecutively • Does not incur any execution time penalty

  15. Greedy-Based Block Encoding Order • AVS Example

  16. Optimality • Lemma 1: The proposed greedy-based encoding order can process all bottleneck path(s) P∗ in exactly n∗ iterations, where n∗ is the length of P∗ • Proof of Lemma 1: • Suppose the greedy-based order requires more than n∗ iterations to process P∗ • Then there is at least one processing gap of length w, between Ki and Ki+1, during which P∗ is not being processed • There would exist an immediate parent block Bm of Ki+1 that delays Ki+1 • Continuing the backtracking from Bm, one would eventually reach some block Kj in P∗ • The path P1 found by backtracking has no processing gap, so it is longer than P0, the segment of P∗ it bypasses • Replacing P0 with P1 in P∗ would give a path longer than P∗, which contradicts P∗ being a bottleneck path

  17. Optimality • Theorem 1: The proposed greedy-based order can process all the video blocks in a frame in n∗ iterations • Proof: Let P∗ = {K1, K2, ..., Kn∗}. By Lemma 1, the last block Kn∗ is processed in the n∗-th iteration. Since all the paths also end in Kn∗, all the blocks can be processed within n∗ iterations
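The claim can be checked numerically on a small example: the number of greedy iterations should equal the number of blocks on the longest dependency path. The sketch below uses a hand-made six-block graph whose labels and edges are hypothetical and are not taken from the slides.

```python
from functools import lru_cache


def longest_path_length(deps):
    """Number of blocks on the longest dependency path (n* in the slides)."""
    @lru_cache(maxsize=None)
    def depth(block):
        # A block with no parents has depth 1; otherwise 1 + deepest parent.
        return 1 + max((depth(p) for p in deps[block]), default=0)

    return max(depth(b) for b in deps)


def greedy_iterations(deps):
    """Count iterations when every ready block is processed immediately."""
    done, count = set(), 0
    while len(done) < len(deps):
        ready = {b for b in deps if b not in done and deps[b] <= done}
        if not ready:
            raise ValueError("dependency cycle: the graph is not a DAG")
        done |= ready
        count += 1
    return count


# Hypothetical dependency graph: block -> set of blocks it references.
deps = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
    "E": {"D"},
    "F": {"A"},
}

n_star = longest_path_length(deps)
iterations = greedy_iterations(deps)
print(n_star, iterations)
```

Here the longest path is A, B, D, E (four blocks), and the greedy order also finishes in four iterations: {A}, {B, C, F}, {D}, {E}.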

  18. Performance Estimation • One of the longest paths in H.264 4 × 4 intra-prediction • Its length can be found to be • n∗ = ((V/4)/2) × 2 + H/4 − 2 = V/4 + H/4 − 2 • n∗ = (V/4) × 2 + (H/4)/2 − 2 = (V/4) × 2 + H/8 − 2
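To get a feel for the magnitudes, the two expressions on this slide can be evaluated for a concrete frame. The sketch assumes that V and H are the frame height and width in pixels (the slide does not define them) and uses a CIF frame (352 × 288) purely as an example. It does not decide which expression applies to a given frame; `n_star_candidates` is a hypothetical helper name.

```python
def n_star_candidates(V, H):
    """Evaluate both path-length expressions from the slide.

    V, H: assumed to be frame height and width in pixels, with V/4 even
    and H/4 even so that the integer divisions are exact.
    """
    rows, cols = V // 4, H // 4           # number of 4x4 block rows / columns
    first = (rows // 2) * 2 + cols - 2    # ((V/4)/2) x 2 + H/4 - 2
    second = rows * 2 + cols // 2 - 2     # (V/4) x 2 + (H/4)/2 - 2
    return first, second


V, H = 288, 352                           # CIF frame, example only
first, second = n_star_candidates(V, H)
total_blocks = (V // 4) * (H // 4)
print(first, second, total_blocks)        # 158 186 6336
```

Either way, the iteration count is in the low hundreds, compared with 6336 blocks that a purely sequential order would visit one at a time.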

  19. Bit-Rate Estimation • Lagrangian cost function: J = D + λ · R (distortion D, rate R, Lagrange multiplier λ) • Entropy coding may involve many branching instructions, which are hard to implement efficiently on a pipelined architecture

  20. Fast Bit Rate Estimation for Mode Decision • Tc: number of nonzero coefficients • Tz: number of zeros before the last nonzero coefficient • |Lk|: the absolute value of the kth nonzero coefficient • Fk: the frequency of the kth nonzero coefficient • Reference: [33] M. G. Sarwer and L.-M. Po, "Fast bit rate estimation for mode decision of H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 10, pp. 1402–1407, Oct. 2007.
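The quantities defined on this slide can be extracted from a block's quantized coefficients without running an entropy coder. The sketch below computes them for one block in scan order and combines three of them linearly. The linear form and the unit weights are placeholders chosen for illustration, not the formula from [33], and reading Fk as the scan position of the kth nonzero coefficient is an assumption.

```python
def rate_features(coeffs):
    """coeffs: quantized coefficients of one block, in scan order."""
    nonzero = [(pos, c) for pos, c in enumerate(coeffs) if c != 0]
    tc = len(nonzero)                       # Tc: number of nonzero coefficients
    if tc == 0:
        return {"Tc": 0, "Tz": 0, "sum_abs_L": 0, "F": []}
    last_pos = nonzero[-1][0]
    tz = (last_pos + 1) - tc                # Tz: zeros before the last nonzero
    sum_abs = sum(abs(c) for _, c in nonzero)   # sum over k of |Lk|
    freqs = [pos for pos, _ in nonzero]     # Fk: assumed to be scan position
    return {"Tc": tc, "Tz": tz, "sum_abs_L": sum_abs, "F": freqs}


def estimate_rate(features, w_c=1.0, w_z=1.0, w_l=1.0):
    """Placeholder linear estimate; the weights are NOT taken from [33]."""
    return (w_c * features["Tc"] + w_z * features["Tz"]
            + w_l * features["sum_abs_L"])


def rd_cost(distortion, rate, lam):
    """Lagrangian cost J = D + lambda * R."""
    return distortion + lam * rate


coeffs = [5, 0, -2, 1, 0, 0, 1] + [0] * 9   # 16 coefficients of a 4x4 block
features = rate_features(coeffs)
rate = estimate_rate(features)
print(features, rate, rd_cost(distortion=100, rate=rate, lam=2.0))
```

The only branching left is the test for nonzero coefficients, which is the kind of structure that suits a pipelined processor better than a table-driven entropy coder.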

  21. Experiments • PC equipped with one GeForce 8800 GTS PCIe graphics card with 96 stream processors • Intel Pentium 4 3.2 GHz processor with 1 GB DDR2 memory • Reference software: H.264 JM 14.0 and AVS RM 6.2

  22. Encoding Bit-Rate Estimation

  23. Parallel RD Optimized Intra-Mode Decision • H.264 • QP has no significant effect • More than 80 times reduction in execution time

  24. Parallel RD Optimized Intra-Mode Decision • H.264 • Parallelism within an MB

  25. Parallel RD Optimized Intra-Mode Decision • H.264 • Similar speedups when RDO is disabled

  26. Parallel RD Optimized Intra-Mode Decision • H.264

  27. Parallel RD Optimized Intra-Mode Decision • H.264

  28. Parallel RD Optimized Intra-Mode Decision • AVS

  29. Parallel RD Optimized Intra-Mode Decision • AVS

  30. Parallel RD Optimized Intra-Mode Decision • 96 (processors) × 2 (threads) / 5 (modes) = 38.4 ≈ 39

  31. Conclusion • Based on the dependency analysis of intra-mode decision, the video blocks are encoded following the greedy orders, leading to highly parallel RD cost computations. • More than 80 times speedup for GPU-based intra-prediction; the GPU can be used to offload intra-prediction from the CPU. • To facilitate implementation on the GPU, a bit-rate approximation method is used to estimate the rate in the RD cost computation. • The approximation errors have only a small impact on coding performance: no more than 0.12 dB loss in PSNR and a 0.98% bit-rate increase.
