Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo1, Cor Meenderinck1, Ben Juurlink1, Andrei Terechko2, Jan Hoogerbrugge2, Mauricio Alvarez3, Alex Ramirez3,4
1 - Delft University of Technology, Netherlands 2 - NXP, Netherlands 3 - Barcelona Supercomputing Center, Spain 4 - Universitat Politecnica de Catalunya, Spain
HiPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009

Outline • Introduction • 3D-Wave • 3D-Wave Implementation • Experimental Results • Conclusions

Introduction • Industry shift to multicores • Increasing demand for higher media quality/resolution • Efficient and scalable exploitation of multicore architectures for video coding • H.264 is widely used and computationally demanding • Decoding is part of encoding and more challenging

The H.264 Decoder [Figure: The H.264 decoding process. Block labels: Encoded Bitstream, Parser (Stream Parsing), Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Reconstructor, Deblocking, Data-Parallel Processing.] http://www.powercam.cc/slide/1580

H.264 Parallelization • Frame-level • Motion compensation introduces inter-frame dependencies • Frame-level parallelism is very limited • Slice-level • Slice-level parallelism is uncertain and increases bitrate [Figure: frames labeled P3, P6, P9, I0, B4, B1, B2, B5, and a frame divided into Slice 1, Slice 2, Slice 3]


H.264 Parallelization: Macroblock-level • 2D-Wave: exploits MB-level parallelism • Full HD: up to 60 MBs in parallel [Figure: the Current MB and the neighboring MBs it depends on, labeled Intra and DF]
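The figure of 60 parallel MBs for Full HD can be reproduced with a small model. The sketch below is an illustration rather than the authors' code. It assumes a 1920x1088 frame (120 x 68 macroblocks of 16x16 pixels), a constant decode time per MB, and that MB (x, y) becomes ready in time slot x + 2y, which is the earliest slot that respects the left and up-right neighbor dependencies. The function name `wave_2d_profile` is hypothetical.

```python
def wave_2d_profile(mb_cols, mb_rows):
    """Number of MBs that can be decoded in each time slot of a 2D wavefront.

    Assumption: every MB takes one time slot, and MB (x, y) is ready in
    slot x + 2*y (one slot after its left and up-right neighbors).
    """
    slots = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            t = x + 2 * y
            slots[t] = slots.get(t, 0) + 1
    # Slots are contiguous from 0 to (mb_cols - 1) + 2 * (mb_rows - 1).
    return [slots[t] for t in range(len(slots))]


# Full HD: 1920x1088 pixels -> 120 x 68 macroblocks (assumed frame size).
profile = wave_2d_profile(120, 68)
print(max(profile), len(profile))  # prints "60 254"
```

Under these assumptions the peak of 60 MBs is only reached in the middle of the frame; the wavefront ramps up at the start and down at the end, which is one reason 2D-Wave alone does not keep a many-core busy.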

H.264 Parallelization: overview of current strategies • Frame-level: • very limited parallelism • Slice-level: • uncertain parallelism • increases bitrate • MB-level: • reasonable parallelism • None of these is sufficient to leverage a many-core!

3D-Wave [Figure: motion compensation across frame 0 (I), frame 1 (P), and frame 2 (P)]

3D-Wave: maximum parallelism • For Full HD, the maximum available parallelism ranges from 5000 to 9000 MBs! • Note: this requires more than 200 frames in flight.

3D-Wave Implementation • 3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors • The TM3270 was designed for SD video processing • VLIW-based media processor with SIMD support • In-house simulator capable of simulating up to 64 cores • 2D-Wave was already implemented • Tail submit (proposed by Hoogerbrugge and Terechko) [13] • Checks the right and down-left MBs • Executes one of them if ready and sends the other to the task queue (TQ) [13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
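The tail-submit idea can be sketched as follows. This is a single-worker illustration under stated assumptions, not the NXP implementation: each MB waits only on its left and up-right neighbors (the up and up-left dependencies are then satisfied transitively for frames at least two MBs wide), readiness is tracked with a per-MB counter, and the name `decode_frame_tail_submit` is hypothetical.

```python
from collections import deque


def decode_frame_tail_submit(width, height):
    """Return the order in which one worker decodes the MBs of a frame.

    Assumes width >= 2. Dependencies modeled: left and up-right neighbor.
    """
    # Number of unmet dependencies for every MB.
    pending = {}
    for y in range(height):
        for x in range(width):
            has_left = x > 0
            has_up_right = y > 0 and x + 1 < width
            pending[(x, y)] = int(has_left) + int(has_up_right)

    task_queue = deque([(0, 0)])  # only the first MB is ready initially
    order = []
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            order.append(mb)  # stands in for the actual MB decoding
            x, y = mb
            ready = []
            # Tail submit: check the right and the down-left MBs.
            for succ in ((x + 1, y), (x - 1, y + 1)):
                if succ in pending:
                    pending[succ] -= 1
                    if pending[succ] == 0:
                        ready.append(succ)
            # Execute one ready MB directly; send the other to the queue.
            mb = ready[0] if ready else None
            task_queue.extend(ready[1:])
    return order


order = decode_frame_tail_submit(4, 3)
print(len(order))  # prints "12"
```

Continuing directly with a ready successor avoids a round trip through the task queue for most MBs, so the queue is only touched when two successors become ready at once or when a worker runs out of local work.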

3D-Wave Implementation: Reference Frame Buffer Structure [Figure: Reference Frame Buffer Structure. Labels: Decoder, Reference Frame Buffer holding Frame 0 to Frame 5, Sync info]

3D-Wave Implementation: Reference Frame Buffer Structure [Figure: Parallel Reference Frame Buffer Structure. Labels: one Decoder, Frame 0 to Frame 4, one Sync info per frame]

3D-Wave Implementation: Reference Frame Buffer Structure [Figure: Parallel Reference Frame Buffer Structure. Labels: three Decoders, Frame 0 to Frame 4, one Sync info per frame]

3D-Wave Implementation: Inter-frame dependencies • mb_decode checks inter-frame dependencies • On failure, it inserts the MB in the Kick-Off List of the Ref MB [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3), NULL]

3D-Wave Implementation: Inter-frame dependencies • Decoding process continues normally [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3), NULL]

3D-Wave Implementation: Inter-frame dependencies • mb_decode checks the Kick-Off List and submits subscribed tasks [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3), NULL]

3D-Wave Implementation: Inter-frame dependencies • And the decoding process carries on [Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: NULL]
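The four steps above can be sketched in a few lines. This is a single-threaded illustration, not the decoder's actual code: only the name mb_decode and the Kick-Off List concept come from the slides, while the `Decoder` class, its fields, and the (frame, x, y) task tuples are assumptions made for the example. A real multicore implementation would also need synchronization around the list.

```python
from collections import defaultdict, deque


class Decoder:
    """Toy model of inter-frame dependency handling with Kick-Off Lists."""

    def __init__(self):
        self.decoded = set()               # finished MBs as (frame, x, y)
        self.kick_off = defaultdict(list)  # ref MB -> tasks subscribed to it
        self.task_queue = deque()          # entries are (mb, ref_mb or None)
        self.log = []

    def mb_decode(self, mb, ref_mb=None):
        # Step 1: check the inter-frame dependency; on failure, subscribe
        # this MB to the Kick-Off List of the reference MB and give up.
        if ref_mb is not None and ref_mb not in self.decoded:
            self.kick_off[ref_mb].append((mb, ref_mb))
            self.log.append(("deferred", mb))
            return False
        # Step 2: decode (placeholder for the real work).
        self.decoded.add(mb)
        self.log.append(("decoded", mb))
        # Step 3: submit every task that was waiting on this MB.
        for waiting in self.kick_off.pop(mb, []):
            self.task_queue.append(waiting)
        return True

    def run(self):
        while self.task_queue:
            mb, ref_mb = self.task_queue.popleft()
            self.mb_decode(mb, ref_mb)


d = Decoder()
d.task_queue.append(((1, 1, 3), (0, 2, 3)))  # F1;MB(1,3) needs an MB of frame 0
d.task_queue.append(((0, 2, 3), None))       # the reference MB itself
d.run()
print(d.log)
```

In this run MB(1,3) of frame 1 is first deferred, then resubmitted and decoded once its reference MB in frame 0 has finished, so no worker ever blocks waiting on a reference.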

3D-Wave Implementation: Frame Scheduling • 3D-Wave can have many frames in flight • A practical implementation must keep the number of frames in flight small • A policy was developed to limit the number of frames in flight • Implementation • uses the Kick-Off List • subscribes the first MB of the next frame to a specific MB in the current frame • the position of that MB determines the number of frames in flight

3D-Wave Implementation: Frame Priority • Frame latency is an important factor in video decoding • 3D-Wave interleaves the processing of all frames in flight • Frame Priority is necessary to limit frame latency in 3D-Wave • Implementation • splits the Task Queue (TQ) into high and low priority task queues • sends the tasks of the frame next-in-line to the high priority task queue • executes tasks from the high priority TQ if it has any, and from the low priority TQ otherwise
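A minimal sketch of the two-queue policy follows. The class and method names are hypothetical, and the sketch leaves out details the slides do not specify, such as what happens to already queued low priority tasks when the next-in-line frame advances.

```python
from collections import deque


class PriorityTaskQueues:
    """Task queue split into a high and a low priority queue."""

    def __init__(self, next_frame=0):
        self.high = deque()
        self.low = deque()
        self.next_frame = next_frame  # frame next-in-line to be completed

    def submit(self, frame, mb):
        # Tasks of the next-in-line frame go to the high priority queue.
        queue = self.high if frame == self.next_frame else self.low
        queue.append((frame, mb))

    def fetch(self):
        # Serve the high priority queue first, the low priority one otherwise.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


tq = PriorityTaskQueues(next_frame=0)
tq.submit(1, (0, 0))
tq.submit(0, (5, 2))
print(tq.fetch())  # prints "(0, (5, 2))"
```

Although the frame 1 task was submitted first, the frame 0 task is fetched first, which is how the policy keeps the oldest frame from being delayed by work on later frames.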

Experimental Results • Uses the highly optimized NXP H.264 decoder • Machine-dependent optimizations (e.g. SIMD operations) • Machine-independent optimizations (e.g. code restructuring) • The experiments use all 4 videos from HD-VideoBench [10]. [10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

Experimental Results: Methodology • Entropy Decoding results of the entire sequence are buffered • Sequence contains only I and P frames with one slice • All frames are scheduled to execute at once • Reference Frame Buffer keeps all the frames of the sequence • Presented results are for 25 frames (1 second) of Rush_Hour in Full High Definition (FHD) • On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively.

Experimental Results: Scalability • Efficiency of more than 80% for 64 cores • Start-up and ramp-down times of the short sequence limit efficiency • With 64 cores, decoding is 16x faster than real time for FHD
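The 16x figure is consistent with the other numbers on the slides, as the rough check below shows. It assumes that efficiency means speedup divided by core count, that the baseline is the single-core rate of 8 FHD frames per second quoted in the methodology, and that real time is 25 frames per second (25 frames = 1 second). The function name is hypothetical, and 80% is only the lower bound stated above.

```python
def realtime_factor(single_core_fps, cores, efficiency, realtime_fps):
    """How many times faster than real time the decoder runs.

    Assumption: efficiency = speedup / cores, relative to single_core_fps.
    """
    speedup = cores * efficiency
    return speedup * single_core_fps / realtime_fps


# 64 cores at 80% efficiency give a speedup of about 51, i.e. roughly
# 410 FHD frames per second, or about 16 times the 25 fps real-time rate.
print(round(realtime_factor(8, 64, 0.80, 25), 1))
```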

Experimental Results: Frame Scheduling • FHD Rush_Hour decoding on 16 cores • Different colors represent different frames • Frame Scheduling limits the number of frames in flight • Performance loss is < 5% for at most 6 frames in flight

Experimental Results: Frame Scheduling and Priority • FHD Rush_Hour decoding on 16 cores • Frame Priority reduces frame latency to the same level as 2D-Wave (10 ms) • Latency of the 1st frame: 58.5 ms, reduced to 15.1 ms with Frame Scheduling and to 9.2 ms with Frame Scheduling and Priority • Does not reduce performance significantly (< 1%)

Experimental Results: Bandwidth Requirements • Bandwidth required for 64 cores is approximately 21 GB/s • 3D-Wave is 20% more bandwidth efficient than 2D-Wave • Scheduling and Priority reduce locality and increase bandwidth

Conclusions • 3D-Wave scales with high efficiency to a large number of cores • 3D-Wave allows efficient use of many-core architectures for video processing • Frame priority reduces latency to its minimum

References • [3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008. • [10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007. • [13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008. • M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009. • A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.
