Thread-Parallel MPEG-2, MPEG4 and H.264 Video Encoders for SoC Multi-Processor Architecture

Tom R. Jacobs, Vassilios A. Chouliaras, and David J. Mulvaney
IEEE Transactions on Consumer Electronics

Outline
• Introduction
  • Background knowledge
  • Main purpose
• Previous work
• Methodology
• Experimental results
• Conclusions

Introduction: Background Knowledge (1/5)
• A number of lossy video compression standards have been developed.
  • MPEG-1, MPEG-2, MPEG-4 Part 2, H.264
• Maintaining image quality while reducing bit-rates → additional computation and power consumption.

Introduction: Background Knowledge (2/5)
• Such processing-intensive consumer application algorithms are generally implemented in System-on-Chip (SoC) devices.
• Parallelism
  • DLP: Data-Level Parallelism
  • TLP: Thread-Level Parallelism

Introduction: Background Knowledge (3/5)
• Data-Level Parallelism (DLP)
  • Distributing the data across different parallel processing nodes.

    program:
      ...
      if CPU = "a" then
        low_limit = 1; upper_limit = 5
      else if CPU = "b" then
        low_limit = 6; upper_limit = 10
      end if
      do i = low_limit, upper_limit
        Task on d(i)
      end do
      ...
    end program

Introduction: Background Knowledge (4/5)
[Figure: a data array D of size 10 (elements 1 to 10) divided between two processing nodes]
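The DLP pseudocode above can be sketched in Python as follows. This is an illustrative sketch only: `task`, `process_range` and `data_parallel` are hypothetical names, squaring stands in for the real per-element work, and the array is split into contiguous chunks in the same way as the two-node example (elements 1 to 5 and 6 to 10).

```python
from concurrent.futures import ThreadPoolExecutor


def task(x):
    """Hypothetical per-element task; squaring stands in for real work."""
    return x * x


def process_range(data, low, high):
    """Apply task to data[low:high] (half-open range) and return the results."""
    return [task(data[i]) for i in range(low, high)]


def data_parallel(data, num_nodes=2):
    """Split data into contiguous chunks, one per node, and process them concurrently."""
    chunk = (len(data) + num_nodes - 1) // num_nodes  # ceiling division
    bounds = [(n * chunk, min((n + 1) * chunk, len(data))) for n in range(num_nodes)]
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        parts = pool.map(lambda b: process_range(data, *b), bounds)
    # Results come back in chunk order, so concatenation restores the array order.
    return [y for part in parts for y in part]


d = list(range(1, 11))      # data array D of size 10
result = data_parallel(d)   # node "a" handles d[0:5], node "b" handles d[5:10]
```

In standard CPython the global interpreter lock prevents threads from running Python bytecode simultaneously, so this shows the structure of the decomposition rather than a real speed-up.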

Introduction: Background Knowledge (5/5)
• Thread-Level Parallelism (TLP)
  • TLP is the parallelism inherent in an application that runs multiple threads at once.
• Benefit
  • The workload of a single high-performance processor can be distributed among a number of slower and simpler processor cores.

Introduction: Main Purpose (1/2)
• Use thread-level parallelism (TLP) techniques to improve video coding performance.
  • Reduce the dynamic instruction count (DIC).
• How to improve?
  • Distribute the workload among a number of processors executing in parallel.

Introduction: Main Purpose (2/2)
• The results presented demonstrate that reductions in dynamic instruction count can be achieved.

Previous Work
• The majority of this research focuses on coarse-granularity TLP exploitation, with the workload most commonly distributed at GOP level.
  • Little inter-node communication.
[Figure: multi-threading at GOP level, with GOPs distributed across threads]

Previous Work
• In 1995, K. Shen, L. A. Rowe, and E. J. Delp implemented a parallel MPEG-1 encoder at GOP level.
• In 1996, S. Bozoki, S. J. P. Westen, R. L. Lagendijk and J. Biemond compared GOP-level and slice-level parallelism for MPEG-1.

Previous Work
• In 1997, A. Bilas, J. Fritts and J. P. Singh evaluated the performance of MPEG-2 decoders on a shared-memory system.
• Akramullah, Ahmad and Liou implemented a threaded MPEG-2 encoder at the MB level using local memory.

Methodology: Overview
• The threaded MPEG-2, MPEG-4 and H.264 encoders were compiled for a multi-context instruction set simulator (MT-ISS) based on the SimpleScalar infrastructure.
• The most important issue
  • Data dependencies between processors.
  • Avoiding race hazards.

Methodology: Race hazards
[Figure: two threads each perform i+1 on a shared integer i. Expected condition: the updates are applied one after the other and i goes from 0 to 1 to 2. Error condition: both threads read the same value of i, so one update is lost.]
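The race hazard on this slide can be illustrated with a short Python sketch. It is not taken from the paper: `lost_update` replays the error interleaving by hand so that it is deterministic, and `Counter` and `run_locked` are hypothetical names showing one common remedy, a lock that makes the read-modify-write of the shared integer exclusive.

```python
import threading


def lost_update():
    """Replay the error interleaving by hand: both threads read before either writes."""
    i = 0
    t1_read = i          # thread 1 reads 0
    t2_read = i          # thread 2 reads 0 (stale once thread 1 writes)
    i = t1_read + 1      # thread 1 writes 1
    i = t2_read + 1      # thread 2 also writes 1, so one increment is lost
    return i


class Counter:
    """Shared integer whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # only one thread at a time may read, add and write
            self.value += 1


def run_locked(num_threads=2, per_thread=1000):
    """Increment a shared Counter from several threads and return the final value."""
    counter = Counter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the final value always equals `num_threads * per_thread`, whereas the hand-replayed interleaving ends at 1 instead of 2.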

Methodology: Thread-parallel MPEG-2 (1/5)
• Test Model 5 (TM5) of the MPEG-2 encoder is used.
• Computation analysis (QCIF)
  • DIST1 → 52% to 73% of total DIC for search windows of 6 to 62 pels, respectively.
  • FullSearch → 3.5% to 23.2% of total DIC.
    • Can be improved by a less complex ME algorithm (such as 3-step, 4-step or diamond search).
  • FDCT and IDCT → 2.1% to 21% of total DIC.
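To show why the distance kernel dominates the instruction count, here is a minimal Python sketch of full-search block matching using the sum of absolute differences (SAD) as the distance. It is not the TM5 code: the function names, frame layout (lists of rows) and test pattern are assumptions made for illustration.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of cur at
    (cx, cy) and the block of ref at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, cx, cy, size, search_range):
    """Test every displacement within +/- search_range pels and return
    (best_sad, (mvx, mvy)); candidates outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            rx, ry = cx + mvx, cy + mvy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, (mvx, mvy))
    return best


# Synthetic 8x8 reference frame; the current frame is the reference shifted
# by 2 pels horizontally and 1 pel vertically (clamped at the frame edge).
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]
best = full_search(cur, ref, cx=2, cy=2, size=4, search_range=2)
```

A search range of w pels gives up to (2w + 1)² candidate positions, each needing one distance computation, so the cost of the distance kernel grows quadratically with the range.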

Methodology: Thread-parallel MPEG-2 (2/5)

Methodology: Thread-parallel MPEG-2 (3/5)
• Motion Estimation
  • The kernel implementation can take advantage of data-parallel techniques.
  • The results are stored in the mbinfo structure for motion compensation.
  • All variables must remain exclusive to each thread during the parallel sections.

Methodology: Thread-parallel MPEG-2 (4/5)
• Forward transform
  • The FDCT stage scans the MBs row by row and processes the MBs in each row individually.
  • For each block, it determines the prediction error and applies the DCT.
  • The thread-parallel transform function can operate at block level.

Methodology: Thread-parallel MPEG-2 (5/5)
• Inverse transform
  • The IDCT stage scans the MBs first row by row and then block by block.
  • There are no data dependencies between blocks → the blocks can be processed in parallel.
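Because each 8x8 block is transformed independently, the forward and inverse transforms map naturally onto a pool of workers. The sketch below illustrates that idea with a plain floating-point orthonormal DCT-II. It is not the TM5 implementation, and `fdct_block`, `idct_block` and `transform_blocks` are hypothetical names.

```python
import math
from concurrent.futures import ThreadPoolExecutor

N = 8  # block size


def _scale(k):
    """Normalisation factor that makes the DCT basis orthonormal."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


# BASIS[k][n] = scale(k) * cos((2n + 1) * k * pi / (2N))
BASIS = [[_scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N))
          for n in range(N)] for k in range(N)]


def fdct_block(block):
    """Forward 2-D DCT of one 8x8 block: Y = B * X * B^T."""
    tmp = [[sum(BASIS[u][y] * block[y][x] for y in range(N)) for x in range(N)]
           for u in range(N)]
    return [[sum(tmp[u][x] * BASIS[v][x] for x in range(N)) for v in range(N)]
            for u in range(N)]


def idct_block(coeffs):
    """Inverse 2-D DCT of one 8x8 block: X = B^T * Y * B."""
    tmp = [[sum(BASIS[u][y] * coeffs[u][v] for u in range(N)) for v in range(N)]
           for y in range(N)]
    return [[sum(tmp[y][v] * BASIS[v][x] for v in range(N)) for x in range(N)]
            for y in range(N)]


def transform_blocks(blocks, transform, workers=4):
    """Apply transform to every block; blocks share no data, so order is free."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, blocks))


blocks = [[[10] * N for _ in range(N)],
          [[(x + y) % 5 for x in range(N)] for y in range(N)]]
coeffs = transform_blocks(blocks, fdct_block)
restored = transform_blocks(coeffs, idct_block)
```

Each worker reads only its own input block and returns a new output block, so no locking is needed in the parallel section.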

Methodology: Thread-parallel MPEG-4 (1/8)
• The implementation is based on the XviD project with the Advanced Simple Profile (ASP).
  • Bidirectional frames
  • Quarter-pel motion compensation
  • Global motion compensation
  • Trellis quantization
  • Custom quantization matrices

Methodology: Thread-parallel MPEG-4 (2/8)
• Computation analysis (QCIF)

Methodology: Thread-parallel MPEG-4 (3/8)
• The nature of the XviD encoder
  • Intra-frame encoding
  • Inter-frame encoding

Methodology: Thread-parallel MPEG-4 (4/8)
• Intra-frame encoding
  • FrameCodeI (processes the MBs row by row)
  • The loop that encodes the MBs in a row of the image is parallelized.
  • MB data structure → pMB.
    • Shared memory array.
  • The function with the highest DIC in FrameCodeI is MBTransQuantIntra.

Methodology: Thread-parallel MPEG-4 (5/8)
• MBTransQuantIntra
  • Forward transformation, quantization and inverse transformation.
  • Shared data structure → pEnc
    • Includes a count of quantization values.
    • Serial code section.
  • Transforms the pixel data of each MB into the frequency domain independently.
• MBPrediction and MBCoding
  • Responsible for VLC and writing to the bitstream.

Methodology: Thread-parallel MPEG-4 (6/8)
• Inter-frame encoding
  • FrameCodeP
    • Part 1 → Motion Estimation
    • Part 2 → Transformation → Quantization → MC

Methodology: Thread-parallel MPEG-4 (7/8)
• Motion Estimation
  • Determines an MV for every MB and applies certain criteria to decide when intra coding should be used.
  • MBs are scanned in raster line order.
  • Two kinds of process
    • Motion prediction from the current frame.
    • ME relative to reference frames.

Methodology: Thread-parallel MPEG-4 (8/8)
• Motion Prediction
  • Examines the MVs of neighbouring MBs and determines an initial estimate for ME.
[Figure: ideal pattern, typical pattern and TLP pattern of neighbouring MBs]
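As an illustration of deriving an initial estimate from neighbouring MVs, the sketch below takes the component-wise median of three neighbours, a predictor commonly used in MPEG-4 style encoders. The choice of neighbours (left, top, top-right) and the function names are assumptions; the patterns actually compared on the slide are not recoverable from the transcript.

```python
def median3(a, b, c):
    """Median of three values."""
    return sorted((a, b, c))[1]


def predict_mv(left, top, top_right):
    """Component-wise median of three neighbouring motion vectors.

    Each vector is an (x, y) tuple; the result is the starting point for
    the motion search of the current MB.
    """
    return (median3(left[0], top[0], top_right[0]),
            median3(left[1], top[1], top_right[1]))


# Hypothetical neighbouring MVs for one MB.
start = predict_mv(left=(1, 4), top=(3, -2), top_right=(2, 0))
```

A predictor that uses the left neighbour makes each MB depend on the one before it in the same row, which is the kind of dependency a thread-parallel pattern has to work around.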

Methodology: H.264 (1/6)
• The implementation uses x264.
• Frame slicing
• Main problems with MB-level parallelism
  • Wide variation in processor workload.
  • The prediction algorithm would need to be modified.

Methodology: H.264 (2/6)
• Slice group in H.264
  • A group of MBs in a frame.
  • Can be encoded or decoded separately from the remainder of the frame.
  • Motion prediction is not allowed across slice boundaries.
• Drawback
  • The required bit-rate increases.
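Frame slicing can be sketched as a partition of macroblock rows into contiguous groups, one per thread. The function below is a hypothetical helper, not x264 code, and it assumes slices are made of whole MB rows of near-equal size.

```python
def partition_rows(mb_rows, num_slices):
    """Split mb_rows macroblock rows into num_slices contiguous slices of
    near-equal size; returns (first_row, end_row) pairs with end_row exclusive."""
    base, extra = divmod(mb_rows, num_slices)
    slices = []
    start = 0
    for s in range(num_slices):
        count = base + (1 if s < extra else 0)  # spread the remainder over the first slices
        slices.append((start, start + count))
        start += count
    return slices


# A QCIF frame (176x144 pels) has 144 / 16 = 9 macroblock rows.
qcif_slices = partition_rows(mb_rows=144 // 16, num_slices=4)
```

When the row count is not a multiple of the slice count, as with QCIF and 4 slices, some threads receive one row more than others, so the workload is only approximately balanced.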

Methodology: H.264 (3/6)
• Comparison of different numbers of slices

Methodology: H.264 (4/6)
• Comparison of different numbers of slices

Methodology: H.264 (5/6)
• Different resolutions with 4 slices

Methodology: H.264 (6/6)
• Computation analysis

Experimental Results: MPEG-2 Search Range

Experimental Results: MPEG-4 Quality Setting

Experimental Results: H.264 Quantization Parameter

Experimental Results: Comparative results

Conclusions
• The DIC of the MPEG-2, MPEG-4 and H.264 encoders can be greatly reduced by TLP.
• For HD sequences, the improvement is around 84%, 92% and 96%, respectively.
• TLP has become more significant with each new generation of video encoders.
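To put the percentages in perspective, the sketch below converts a percentage reduction into the ratio between the original and the reduced instruction count. It assumes the quoted improvements are percentage reductions in DIC relative to the serial encoder, which the slides suggest but do not state outright; `dic_ratio` is a hypothetical helper.

```python
def dic_ratio(reduction_percent):
    """Ratio of original DIC to reduced DIC for a given percentage reduction."""
    return 100.0 / (100.0 - reduction_percent)


# Figures quoted for HD sequences (assumed to be DIC reductions).
ratios = {name: dic_ratio(pct)
          for name, pct in (("MPEG-2", 84), ("MPEG-4", 92), ("H.264", 96))}
```

Under that assumption, the three figures correspond to instruction counts roughly 6, 12 and 25 times smaller than the original.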
