Thread-Parallel MPEG-2, MPEG4 and H.264 Video Encoders for SoC Multi-Processor Architecture. Tom R. Jacobs, Vassilios A. Chouliaras, and David J. Mulvaney. IEEE Transactions on Consumer Electronics.
Outline • Introduction • Background knowledge • Main purpose • Previous work • Methodology • Experimental results • Conclusions
Introduction: Background Knowledge (1/5) • A number of lossy video compression standards have been developed: MPEG-1, MPEG-2, MPEG-4 Part 2 and H.264. • Maintaining image quality while reducing bit-rates requires additional computation and power consumption.
Introduction: Background Knowledge (2/5) • Such processing-intensive consumer application algorithms are generally implemented in System-on-Chip (SoC) devices. • Parallelism • DLP: Data-Level Parallelism • TLP: Thread-Level Parallelism
Introduction: Background Knowledge (3/5) • Data-Level Parallelism (DLP) • Distributing the data across different parallel processing nodes.
program
  …
  if CPU = "a" then
    low_limit = 1; upper_limit = 5
  else if CPU = "b" then
    low_limit = 6; upper_limit = 10
  end if
  do i = low_limit, upper_limit
    Task on d(i)
  end do
  ...
end program
Introduction: Background Knowledge (4/5) [Figure: a data array D of size 10, with its elements divided between two processing nodes.]
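A minimal Python sketch of the data-level partitioning shown on these slides: an array of 10 elements is split into contiguous ranges, one per processing node, and each node works only on its own range. The names `split_range`, `run_node`, `data_parallel` and the squaring `task` are hypothetical, chosen for illustration; a thread pool stands in for the processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n_items, n_nodes):
    """Return inclusive, 1-based (low_limit, upper_limit) pairs, one per node."""
    base, extra = divmod(n_items, n_nodes)
    limits, low = [], 1
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        limits.append((low, low + size - 1))
        low += size
    return limits


def task(x):
    # Hypothetical per-element work; stands in for "Task on d(i)".
    return x * x


def run_node(d, low_limit, upper_limit):
    # Each node reads only its own part of d (1-based, inclusive limits).
    return [task(d[i - 1]) for i in range(low_limit, upper_limit + 1)]


def data_parallel(d, n_nodes=2):
    limits = split_range(len(d), n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        # pool.map preserves the order of the input ranges.
        parts = pool.map(lambda lim: run_node(d, *lim), limits)
    return [y for part in parts for y in part]
```

With 10 items and 2 nodes, `split_range` reproduces the limits used in the pseudocode (1 to 5 and 6 to 10). In CPython the threads mainly illustrate the structure; they do not imply a real speed-up for this kind of workload.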
Introduction: Background Knowledge (5/5) • Thread-Level Parallelism (TLP) • TLP is the parallelism inherent in an application that runs multiple threads at once. • Benefit: the workload of a single high-performance processor can be distributed among a number of slower and simpler processor cores.
Introduction: Main Purpose (1/2) • Use thread-level parallel (TLP) techniques to improve the performance of video coding. • Reduce the dynamic instruction count (DIC). • How to improve? • Distribute the workload among a number of parallel-executing processors.
Introduction: Main Purpose (2/2) • The results presented demonstrate that reductions in dynamic instruction count can be achieved.
Previous Work • The majority of previous research focuses on coarse-granularity TLP exploitation, with the workload most commonly distributed at the GOP (group of pictures) level. • Little inter-node communication is required. [Figure: multi-threading at the GOP level, with a sequence of GOPs shared among threads.]
Previous Work • In 1995, K. Shen, L. A. Rowe and E. J. Delp implemented parallel MPEG-1 at the GOP level. • In 1996, S. Bozoki, S. J. P. Westen, R. L. Lagendijk and J. Biemond compared GOP-level and slice-level parallelism on MPEG-1.
Previous Work • In 1997, A. Bilas, J. Fritts and J. P. Singh evaluated the performance of MPEG-2 decoders on a shared-memory system. • Akramullah, Ahmad and Liou implemented a threaded MPEG-2 encoder at the macroblock (MB) level using local memory.
Methodology: Overview • The threaded MPEG-2, MPEG-4 and H.264 encoders were compiled on a multi-context instruction-set simulator (MT-ISS) based on the SimpleScalar infrastructure. • The most important issue • Data dependencies between processors. • Avoiding race hazards.
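Methodology: Race hazards [Figure: two threads each execute i+1 on a shared integer i. Expected condition: the updates are applied one after the other, so i goes from 0 to 1 to 2. Error condition (race hazard): both threads read the same value of i before either writes it back, so one update is lost and i ends at 1.]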
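The race hazard on this slide can be sketched in Python as follows. This is an illustration only, not the encoders' code: `SharedCounter` and `run` are hypothetical names. The unprotected method shows the read-modify-write sequence in which an update can be lost; the protected method serialises the update with a lock so that the expected result is always obtained.

```python
import threading


class SharedCounter:
    def __init__(self):
        self.i = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read-modify-write without protection: two threads can both read
        # the same old value, so one of the two updates may be lost.
        old = self.i
        self.i = old + 1

    def safe_increment(self):
        # The lock makes the read-modify-write sequence exclusive.
        with self._lock:
            self.i += 1


def run(n_threads, n_increments):
    """Increment one shared counter from several threads using the lock."""
    counter = SharedCounter()

    def worker():
        for _ in range(n_increments):
            counter.safe_increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.i
```

Calling `run(2, 1)` corresponds to the expected condition in the figure: two threads, one increment each, and a final value of 2.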
Methodology: Thread-parallel MPEG-2 (1/5) • Test Model 5 (TM5) of the MPEG-2 encoder is used. • Computation analysis (QCIF) • DIST1: 52% to 73% of total DIC for search windows of 6 to 62 pels, respectively. • FullSearch: 3.5% to 23.2% of total DIC. • Can be reduced by a less complex ME algorithm (such as three-step, four-step or diamond search). • FDCT and IDCT: 2.1% to 21% of total DIC.
Methodology: Thread-parallel MPEG-2 (3/5) • Motion Estimation • The kernel implementation can take advantage of data-parallel techniques. • The results are stored in the mbinfo structure for motion compensation. • Exclusivity of all variables must be maintained during the parallel sections.
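As a rough illustration of why the distortion and full-search functions dominate the instruction count, and why they parallelise well, here is a minimal Python sketch of exhaustive block matching. It assumes the matching cost is a sum of absolute differences (SAD), and it is not the TM5 source: `sad`, `full_search` and `estimate_frame` are hypothetical names, and frames are plain lists of rows. Each block's search is independent and returns its own result, so no variable is shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of cur at
    (cx, cy) and the n x n block of ref at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, cx, cy, n, search):
    """Test every displacement within +/- search pels; keep the cheapest."""
    h, w = len(ref), len(ref[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:
                cost = sad(cur, ref, cx, cy, rx, ry, n)
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost


def estimate_frame(cur, ref, n=4, search=2, workers=4):
    """Run one independent search per block and collect results by origin."""
    h, w = len(cur), len(cur[0])
    origins = [(x, y) for y in range(0, h - n + 1, n)
               for x in range(0, w - n + 1, n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda o: full_search(cur, ref, o[0], o[1], n, search), origins)
        return dict(zip(origins, results))
```

The number of SAD evaluations grows with the square of the search range, which is consistent with the distortion function taking a larger share of the work as the search window widens.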
Methodology: Thread-parallel MPEG-2 (4/5) • Forward transform • The FDCT stage scans the MBs row by row and processes the MBs in each row individually. • It determines the prediction error and applies the DCT to each block. • The thread-parallel transform function can be performed at block level.
Methodology: Thread-parallel MPEG-2 (5/5) • Inverse transform • The IDCT stage scans the MBs row by row and then block by block. • Because there are no data dependencies between blocks, the blocks can be processed in parallel.
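Block-level parallelism for the forward and inverse transforms can be sketched as below. This is a simplified Python illustration, not the encoder's implementation: it uses a direct (non-fast) orthonormal 8x8 DCT-II, and the names `fdct`, `idct` and `transform_blocks` are hypothetical. Because each block is transformed from its own input into its own output, the blocks can be handed to separate threads without synchronisation.

```python
import math
from concurrent.futures import ThreadPoolExecutor

N = 8  # block size


def _scale(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


# BASIS[k][n] = scale(k) * cos((2n + 1) * k * pi / (2N)), orthonormal DCT-II.
BASIS = [[_scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
          for n in range(N)] for k in range(N)]


def dct_1d(v):
    return [sum(BASIS[k][n] * v[n] for n in range(N)) for k in range(N)]


def idct_1d(c):
    return [sum(BASIS[k][n] * c[k] for k in range(N)) for n in range(N)]


def _separable(block, f):
    # Apply f to every row, then to every column of the row results.
    rows = [f(list(r)) for r in block]
    cols = [f(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def fdct(block):
    return _separable(block, dct_1d)


def idct(block):
    return _separable(block, idct_1d)


def transform_blocks(blocks, func, workers=4):
    """Apply func to each block independently; output order matches input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, blocks))
```

A constant block of value 10 should transform to a single DC coefficient of 80 with all other coefficients approximately zero, and applying `idct` to the result of `fdct` should recover the original block up to floating-point error.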
Methodology: Thread-parallel MPEG-4 (1/8) • The implementation is based on the XviD project, using the Advanced Simple Profile (ASP). • Bidirectional frames • Quarter-pel motion compensation • Global motion compensation • Trellis quantization • Custom quantization matrices
Methodology: Thread-parallel MPEG-4 (2/8) • Computation analysis (QCIF)
Methodology: Thread-parallel MPEG-4 (3/8) • The nature of the XviD encoder • Intra-frame encoding • Inter-frame encoding
Methodology: Thread-parallel MPEG-4 (4/8) • Intra-frame encoding • FrameCodeI processes the MBs row by row. • The loop that encodes the MBs in a row of the image is parallelized. • MB data structure pMB. • Shared memory array. • The function with the highest DIC in FrameCodeI is MBTransQuantIntra.
Methodology: Thread-parallel MPEG-4 (5/8) • MBTransQuantIntra • Forward transformation, quantization and inverse transformation. • Shared data structure pEnc • Includes a count of quantization values. • Serial code section. • Transforms the pixel data of each MB into the frequency domain independently. • MBPrediction and MBCoding • Responsible for VLC and for writing to the bitstream.
Methodology: Thread-parallel MPEG-4 (6/8) • Inter-frame encoding • FrameCodeP • Part 1: Motion estimation • Part 2: Transformation, quantization and MC
Methodology: Thread-parallel MPEG-4 (7/8) • Motion Estimation • Determines an MV for every MB and applies certain criteria to indicate when intra coding should be used. • Scans in raster-line order. • Two kinds of process • Motion prediction from the current frame. • ME relative to reference frames.
Methodology: Thread-parallel MPEG-4 (8/8) • Motion Prediction • Examines the MVs of neighbouring MBs to determine an initial estimate for ME. [Figure: neighbouring-MB patterns for motion prediction: ideal pattern, typical pattern, TLP pattern.]
Methodology: H.264 (1/6) • x264 is used for the implementation. • Frame slicing • Main problems of MB-level parallelism • Wide variation in processor workload. • The prediction algorithm needs to be modified.
Methodology: H.264 (2/6) • Slice group in H.264 • A group of MBs in a frame. • Can be encoded or decoded separately from the remainder of the frame. • Motion prediction is not allowed across slice boundaries. • Drawback • The required bit-rate increases.
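Frame slicing can be sketched in Python as a partition of the frame's macroblock rows, with a check that a prediction source lies in the same slice as the MB being coded. The sketch assumes each slice is a contiguous band of MB rows, which is a simplification rather than something stated on the slide; `slice_rows`, `slice_of` and `same_slice` are hypothetical names and not part of x264.

```python
def slice_rows(mb_rows, n_slices):
    """Split mb_rows macroblock rows into n_slices contiguous,
    half-open (start, end) ranges of nearly equal size."""
    base, extra = divmod(mb_rows, n_slices)
    slices, start = [], 0
    for s in range(n_slices):
        size = base + (1 if s < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices


def slice_of(mb_row, slices):
    """Return the index of the slice that contains mb_row."""
    for idx, (start, end) in enumerate(slices):
        if start <= mb_row < end:
            return idx
    raise ValueError("macroblock row outside the frame")


def same_slice(row_a, row_b, slices):
    # Prediction from a neighbouring MB is only allowed within one slice.
    return slice_of(row_a, slices) == slice_of(row_b, slices)
```

For a QCIF frame (176 x 144 pixels, so 9 rows of 16 x 16 MBs) and 4 slices, this gives one slice of 3 rows and three slices of 2 rows. An MB in the first row of a slice cannot use the row above it as a predictor, which is one reason the required bit-rate rises with the slice count.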
Methodology: H.264 (3/6) • Comparison of different slice numbers
Methodology: H.264 (4/6) • Comparison of different slice numbers
Methodology: H.264 (5/6) • Different resolutions with 4 slices
Methodology: H.264 (6/6) • Computation analysis
Experimental Results: MPEG-2 Search Range
Experimental Results: MPEG-4 Quality Setting
Experimental Results: H.264 Quantization Parameter
Conclusions • The DIC of the MPEG-2, MPEG-4 and H.264 encoders can be greatly reduced by TLP. • For HD sequences, the improvement is around 84%, 92% and 96%, respectively. • TLP has become more significant with each new generation of video encoders.