Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection. Bongsoo Jung, Byeungwoo Jeon. Journal of Visual Communication and Image Representation, 2008.
Outline • Introduction • Complexity Analysis • Method • Pre Macroblock Mode Selection • Adaptive Slice-level Parallelism • Experimental Results • Conclusions
Introduction • H.264/AVC achieves high coding efficiency through variable block sizes, multiple reference frames, quarter-pel motion vector accuracy, etc. • The price is high computational complexity • Two ways to address it: complexity reduction algorithms and parallel processing
Introduction • Parallelism can be applied at several levels • GOP level: simple, but high latency • Frame level: keeps coding efficiency, but the dependence among frames limits thread scalability • Slice level: slices are encoded independently, but coding efficiency drops • Macroblock level: high dependency among MBs
Introduction • MBs in a slice may not have similar computational complexity • This causes unnecessary extra waiting time in some threads [Figure: encoding time of slices 0 to 7 on processing units PU0 to PU7]
Main Purpose • Objective: use a parallel algorithm to speed up the H.264/AVC encoder, and maximize parallel efficiency by distributing the workload equally • Method: pre-processing with fast MB mode selection, followed by adaptive slice-level parallelism
Complexity Analysis • Inter prediction modes of MBs in H.264: SKIP, 16*16, 16*8, 8*16, and 8*8 (with 8*4, 4*8, and 4*4 sub-partitions) • Intra prediction modes: 4*4, 16*16
Complexity Analysis • Run-time complexity of the H.264/AVC encoder • Platform: Pentium IV 2.4 GHz • Sequence: Foreman_CIF with IPPP structure
Pre Macroblock Mode Selection: Overview • Why? ME with variable block sizes has high computational complexity • Remove unnecessary ME block sizes and RD calculations of intra prediction modes • This removal leads to complexity reduction and to workload balancing among slices
Pre Macroblock Mode Selection: Inter MB mode selection • MC block sizes in a video sequence: foreground regions use 8*8 or smaller, non-moving regions use 16*16 • Temporal correlation is high, so check the consistency history of the 16*16 block size and the zero MV • Two measurements: zero motion consistency (ZMC) and large block consistency (LBC)
Pre Macroblock Mode Selection: Inter MB mode selection • Zero Motion Consistency (ZMC): indicates for how many consecutive frames a specified block has had a zero MV • When a block is encoded in intra mode, ZMC is reset to 0 • t: frame index; ZMC_0 = 0; (n,m;i,j) denotes the 4*4 block at (n,m) within MB (i,j) • A high ZMC value means a high probability of belonging to a background region
Pre Macroblock Mode Selection: Inter MB mode selection • Zero Motion Consistency Score (ZMCS): indicates how likely an MB is to belong to a stationary region • T_MOTION: a threshold value
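The ZMC counter and its score can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the rule that the score is High only when all sixteen 4*4 blocks of an MB reach the threshold is an assumption, because the slide's equations are not reproduced in this transcript.

```python
T_MOTION = 4  # threshold value quoted on a later slide

def update_zmc(prev_zmc, mv, is_intra):
    """Update the ZMC counter of one 4*4 block for the current frame.

    prev_zmc: counter value from the previous frame (ZMC_0 = 0)
    mv:       (dx, dy) motion vector of the block in the current frame
    is_intra: True if the block was encoded in intra mode
    """
    if is_intra or mv != (0, 0):
        return 0             # the run of zero MVs is broken
    return prev_zmc + 1      # one more consecutive zero-MV frame

def zmcs(zmc_blocks, threshold=T_MOTION):
    """Assumed scoring rule: 'High' only if every 4*4 block reaches the threshold."""
    return "High" if min(zmc_blocks) >= threshold else "Low"

# Usage: an MB whose sixteen 4*4 blocks stay still for four frames.
blocks = [0] * 16
for _ in range(4):
    blocks = [update_zmc(z, (0, 0), False) for z in blocks]
print(zmcs(blocks))  # High
```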
Pre Macroblock Mode Selection: Inter MB mode selection • Large Block Consistency (LBC): the number of consecutive frames in which the (i,j)-th MB has a 16*16 MC block size • When a block is encoded in intra mode, LBC is reset to 0 • bestMode_t(i,j): the best MB mode of the (i,j)-th MB in the t-th frame; LBC_0 = 0
Pre Macroblock Mode Selection: Inter MB mode selection • Large Block Consistency Score (LBCS): indicates how likely an MB is to be coded with a 16*16 partition • T_MODE1, T_MODE2: threshold values used to assess the LBC
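A matching sketch for LBC and its three-level score follows. The mode labels, the choice to count SKIP as a 16*16 mode, and the mapping of the two thresholds to High / Medium / Low are assumptions made for illustration; only the threshold values come from the slides.

```python
T_MODE1, T_MODE2 = 1, 4  # threshold values quoted on a later slide

def update_lbc(prev_lbc, best_mode):
    """LBC of one MB: consecutive frames whose best mode used a 16*16 MC block."""
    if best_mode in ("SKIP", "P16x16"):   # assumed set of 16*16 modes
        return prev_lbc + 1
    return 0                              # intra or smaller partitions reset the run

def lbcs(lbc, t1=T_MODE1, t2=T_MODE2):
    """Assumed three-level score built from the two thresholds."""
    if lbc >= t2:
        return "High"
    if lbc >= t1:
        return "Medium"
    return "Low"

# Usage: best modes of one MB over five frames.
lbc = 0
for mode in ["P16x16", "P8x8", "P16x16", "SKIP", "P16x16"]:
    lbc = update_lbc(lbc, mode)
print(lbc, lbcs(lbc))  # 3 Medium
```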
Pre Macroblock Mode Selection: Inter MB mode selection • An illustration of LBCS
Pre Macroblock Mode Selection: Inter MB mode selection • Conditional probability of MB modes given ZMCS = High (T_MOTION = 4) • The other block sizes are very unlikely to appear (probability less than about 0.04) • This allows early detection of the SKIP and P16*16 modes
Pre Macroblock Mode Selection: Inter MB mode selection • Joint conditional probability of MB modes given LBCS, with ZMCS = Low • T_MODE1 = 1, T_MODE2 = 4 • A: LBCS = High, B: LBCS = Medium, C: LBCS = Low
Pre Macroblock Mode Selection: Selective intra mode selection • Computing the RD costs of intra modes is a high computational load • Compare the temporal correlation with the spatial correlation of the current MB prior to frame coding
Pre Macroblock Mode Selection: Selective intra mode selection • Mean Absolute Temporal Difference (MATD) • Mean Absolute Spatial Difference (MASD) • c_{x,y}: pixel value at location (x,y) of the MB in the current frame • r_{x,y}: pixel value at location (x,y) of the MB in the previous frame • X, Y: horizontal and vertical dimensions of an MB • MASD_H: the MASD between horizontally neighboring pixels • MASD_V: the MASD between vertically neighboring pixels
Pre Macroblock Mode Selection: Selective intra mode selection • Compare MATD and MASD to determine whether the current MB should calculate the RD costs of intra modes • The intra mode search is skipped when the MB is more temporally correlated than spatially correlated • A larger w makes skipping the intra mode search easier • A smaller QP will incur more intra modes than a larger QP • w: weighting factor, currently set to 0.6
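The comparison can be sketched as follows. The decision rule `MATD < w * MASD` and the averaging of the horizontal and vertical terms into a single MASD are assumptions chosen to agree with the statement that a larger w makes skipping easier; the QP dependence mentioned on the slide is not modeled, and all function names are hypothetical.

```python
def matd(cur, ref):
    """Mean absolute temporal difference between co-located MBs."""
    rows, cols = len(cur), len(cur[0])
    total = sum(abs(cur[y][x] - ref[y][x]) for y in range(rows) for x in range(cols))
    return total / (rows * cols)

def masd(cur):
    """Mean absolute spatial difference.

    Averaging the horizontal and vertical terms is an assumption of this sketch.
    """
    rows, cols = len(cur), len(cur[0])
    horiz = sum(abs(cur[y][x] - cur[y][x + 1])
                for y in range(rows) for x in range(cols - 1))
    vert = sum(abs(cur[y][x] - cur[y + 1][x])
               for y in range(rows - 1) for x in range(cols))
    masd_h = horiz / (rows * (cols - 1))
    masd_v = vert / ((rows - 1) * cols)
    return (masd_h + masd_v) / 2

def skip_intra_search(cur, ref, weight=0.6):
    """Assumed rule: skip the intra RD checks when MATD < weight * MASD."""
    return matd(cur, ref) < weight * masd(cur)

# Usage: a textured but static MB skips the intra search,
# a flat MB whose brightness changed over time does not.
textured = [[255 * ((x + y) % 2) for x in range(16)] for y in range(16)]
flat_now = [[100] * 16 for _ in range(16)]
flat_before = [[50] * 16 for _ in range(16)]
print(skip_intra_search(textured, textured))      # True
print(skip_intra_search(flat_now, flat_before))   # False
```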
Pre Macroblock Mode Selection: MB mode classification • Decision table of candidate MB modes • A block diagram of MB mode selection
Adaptive Slice-level Parallelism: Overview • Characteristics of slice-level parallelism: easy to implement; low overhead of communication among processor units; good scalability; increased bitrate • A slice boundary is defined on the basis of a fixed number of MBs or a fixed number of bits • It is hard to decide a slice boundary prior to encoding
Adaptive Slice-level Parallelism: Fixed MB assignment • The number of consecutive MBs in each slice • L: the number of processor units in a multi-core system • M: the total number of MBs in a frame • i: slice index • Example: with L = 8 processing units and a CIF sequence (352*288), M = 22*18 = 396, so about 49 MBs (396/8 = 49.5) are assigned to each slice
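The CIF example can be checked with a short sketch. Giving the first `M % L` slices one extra MB is an assumption; the slide only states that each slice receives about M / L consecutive MBs. The function name is hypothetical.

```python
def fixed_mb_assignment(M, L):
    """Split M macroblocks into L slices of consecutive MBs, as evenly as possible.

    Assumption: the first M % L slices receive one extra MB.
    """
    base, extra = divmod(M, L)
    return [base + 1 if i < extra else base for i in range(L)]

# CIF frame: 22 * 18 = 396 MBs on 8 processing units.
print(fixed_mb_assignment(22 * 18, 8))  # [50, 50, 50, 50, 49, 49, 49, 49]
```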
Adaptive Slice-level Parallelism: Fixed MB assignment • The scheduling of slice-level parallelism on eight processor units [Figure: two timelines of slices 0 to 7 on PU0 to PU7 against encoding time. Ideal case: no bottleneck is marked. Practical case: one slice is marked as the bottleneck]
Adaptive Slice-level Parallelism: Fixed MB assignment • The imbalance of computational load distribution [Figure panels: Fast ME / Fast Mode Search; Exhaustive Search Method]
Adaptive Slice-level Parallelism: Fixed MB assignment • Computational load for encoding one frame with slice-level parallelism • Computational load of the t-th frame on a single-processor system • C^t_slice(i): the computational load of the i-th slice in the t-th frame • L: number of slices in a frame
Adaptive Slice-level Parallelism: Fixed MB assignment • The speedup of a multiprocessor system over a single-processor system • To achieve the maximum speedup, the computational loads of the slices should be as similar as possible • This motivates the adaptive slice partition method
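Why balanced loads maximize the speedup can be seen in a small sketch. It assumes one slice per processing unit and ignores synchronization overhead, so a single processor handles the sum of the slice loads while the parallel encoder finishes when its slowest slice does. The function name and the load values are illustrative only.

```python
def speedup(slice_loads):
    """Speedup of slice-level parallel encoding over a single processor.

    slice_loads[i] is the computational load of slice i in one frame.
    """
    return sum(slice_loads) / max(slice_loads)

print(speedup([10, 10, 10, 10]))  # 4.0  (balanced: speedup equals the slice count)
print(speedup([10, 10, 10, 30]))  # 2.0  (one heavy slice becomes the bottleneck)
```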
Adaptive Slice-level Parallelism: Complexity estimation model • A simple estimation method that uses the result of the fast MB mode selection • Define a group value g corresponding to the candidate MB modes
Adaptive Slice-level Parallelism: Complexity estimation model • Complexity model • C_{k,CHKintra}(g): complexity cost of the k-th MB • g: group index • e_inter: estimated complexity cost of inter mode in g = 1 • e_intra: complexity cost of the intra mode check in g = 1 • α1, α2, α3, β1, β2, β3: weighting values of the complexity cost
Adaptive Slice-level Parallelism: Complexity estimation model • Relative computational load • CHKintra = 0: assume e_inter = 1, e_intra = 0; α1 = 2.42, α2 = 3.12, α3 = 5.28 • CHKintra = 1: assume e_inter = 1, e_intra = 3.97; β1 = 0.82, β2 = 0.83, β3 = 0.84
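One possible reading of these numbers is sketched below. The model form is an assumption, since the slide's equation is not reproduced here: groups are taken to run from g = 1 to g = 4, g = 1 is the baseline with weight 1, α scales the inter cost and β scales the intra cost for g = 2 to 4, and the intra term is added only when the intra check is enabled. All names are hypothetical.

```python
# Assumed mapping: alpha_1..alpha_3 and beta_1..beta_3 belong to groups g = 2..4.
ALPHA = {1: 1.0, 2: 2.42, 3: 3.12, 4: 5.28}   # inter weights, g = 1 is the baseline
BETA = {1: 1.0, 2: 0.82, 3: 0.83, 4: 0.84}    # intra weights, g = 1 is the baseline

def mb_complexity(g, chk_intra, e_inter=1.0, e_intra=3.97):
    """Estimated relative cost of one MB in group g (assumed model form)."""
    cost = ALPHA[g] * e_inter
    if chk_intra:
        cost += BETA[g] * e_intra   # intra RD check adds its own weighted cost
    return cost

print(mb_complexity(1, False))  # 1.0
print(mb_complexity(4, False))  # 5.28
```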
Adaptive Slice-level Parallelism: Adaptive MB assignment • The total computational load of the t-th frame • The ideal computational load of each slice for a uniform workload distribution
Adaptive Slice-level Parallelism: Adaptive MB assignment • Assignment of MBs to slices • The resulting load balance is much better than with a fixed number of MBs in each slice
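The adaptive assignment can be sketched as a single pass over the estimated MB costs. Closing a slice as soon as the running cost reaches the cumulative ideal load (total / L per slice) is an assumption about how the boundary is chosen, and the function name and cost values are hypothetical. The sketch expects at least L macroblocks.

```python
def adaptive_mb_assignment(mb_costs, L):
    """Split consecutive MBs into L slices with roughly equal estimated cost.

    mb_costs[k] is the estimated complexity of the k-th MB in raster order.
    Returns the number of MBs assigned to each slice.
    """
    total = sum(mb_costs)
    counts, acc, start = [], 0.0, 0
    for k, cost in enumerate(mb_costs):
        acc += cost
        slices_left = L - len(counts) - 1        # slices still to open after this one
        mbs_left = len(mb_costs) - (k + 1)
        target = total * (len(counts) + 1) / L   # cumulative ideal load so far
        # Close the slice once the running cost reaches the target, or when the
        # remaining MBs are only just enough to give each remaining slice one MB.
        if slices_left > 0 and (acc >= target or mbs_left == slices_left):
            counts.append(k + 1 - start)
            start = k + 1
    counts.append(len(mb_costs) - start)         # the last slice takes the rest
    return counts

# Cheap MBs first, expensive MBs last: a fixed split would give 4 + 4 MBs
# with loads 4 and 12; the adaptive split gives loads 8 and 8.
print(adaptive_mb_assignment([1, 1, 1, 1, 2, 2, 4, 4], 2))  # [6, 2]
```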
Adaptive Slice-level Parallelism: Adaptive MB assignment • Block diagram of the entire scheme
Experimental Results: Overview • Performance comparison between the proposed MB mode decision and the conventional method • Comparison of adaptive slice-level parallelism with fixed slice-level parallelism
Experimental Results: MB mode selection • Average encoding time saving, AST [%] • BDPSNR and BDBR are used to measure the performance against FULL_1Slice • FULL_1Slice: exhaustive method • FMD_1Slice: fast MB mode search method
Experimental Results • R-D performance compared to one slice per frame (FMD_1Slice)
Experimental Results: Slice-level parallelism • Comparison of adaptive and fixed slice-level parallelism • Speedup = (encoding time of one slice per frame on a single-processor system) / (longest encoding time of a slice), where the longest slice time is measured in fixed mode and in adaptive mode respectively
Conclusions • The authors propose a fast MB mode selection that uses the consistency history of the block size and the zero MV • They propose an intra mode selection that compares temporal and spatial correlation • Using these two schemes, they propose a new adaptive slice-level parallelism to speed up the H.264/AVC encoder