Implementation and Parallelization of H.264 Based System on Multi-DSPs Board • 陳奕安 • 2008.06.11
Outline • System Architecture • Multithreading of this system • Reference framework 5 • Parallelism of H.264 • Memory issue
System Architecture • Sender (MEX Board 1, PC 1): capture frame → H.264 encode → send to network • Receiver (MEX Board 2, PC 2): receive from network → H.264 decode → display
System Architecture • Encoder side: camera → input task → H.264 encode processing task → TX networking task • Decoder side: RX networking task → H.264 decode processing task → output task → computer
Host/MEX Communication • MEX (DSP) side: DSP started, fill memory → initialize transfer (set DSP FIFO direction, set FIFO full flag value, DSP FIFO is reset, start EDMA, unreset DSP1 FIFO, clear PCI interrupt) → DSP to PCI transfer request → start transfer → transfer finished • PC (PCI) side: PCI started, wait for interrupt → initialize transfer (set transfer size, set PCI FIFO direction, select DSP data sources, set transfer destination address, start PCI FIFO, clear DSP interrupt) → PCI to DSP start transfer request → wait for transfer finished → transfer finished • Figure: data transfer from the 4 DSPs (SDRAM) to PCI [7]
Host/MEX Communication • Figure labels: Data, Image
Networking of H.264 Video • H.264 high-level architecture • Application: reconstructed picture, supplemental enhancement information • Video Coding Layer (VCL): VCL data, parameter sets • Network Abstraction Layer (NAL): NAL units, with bitstream adaptation and packet adaptation • AVC / H.264 transport: H.320 system, MPEG-2 system, AVC storage, RTP payload • Figure: H.264 VCL and NAL [6]
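As a small illustration of the NAL layer named on this slide: every H.264 NAL unit starts with a one-byte header holding `forbidden_zero_bit` (1 bit), `nal_ref_idc` (2 bits), and `nal_unit_type` (5 bits). The sketch below decodes that byte. The function name `parse_nal_header` is hypothetical and is not taken from the presented system.

```python
def parse_nal_header(first_byte: int) -> dict:
    """Split the first byte of an H.264 NAL unit into its three fields."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("NAL header must be a single byte")
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,  # must be 0 in a valid stream
        "nal_ref_idc": (first_byte >> 5) & 0x3,         # 0 means not used for reference
        "nal_unit_type": first_byte & 0x1F,             # e.g. 5 = IDR slice, 7 = SPS
    }


if __name__ == "__main__":
    print(parse_nal_header(0x65))
```

A receiver could use `nal_unit_type` to tell parameter sets apart from slice data before handing the unit to the decoder.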
Networking of H.264 Video • Video packetization • Application layer: video packet (NAL unit of H.264) • Session layer: RTP header + video packet • Transport layer: UDP header + RTP header + video packet • Network layer: IP header + UDP header + RTP header + video packet • Data link layer: MAC header + IP header + UDP header + RTP header + video packet • Physical layer • Network stack: TMS320C6000 Network Developer's Kit
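The session-layer step of this packetization can be sketched as follows: a 12-byte fixed RTP header (version 2, no padding, no extension, no CSRC entries) is prepended to one NAL unit. This is a minimal sketch, not the code of the presented system. The function name, the dynamic payload type 96, and the one-NAL-unit-per-packet layout are assumptions.

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(nal_unit: bytes, seq: int, timestamp: int, ssrc: int,
                     marker: bool = False, payload_type: int = 96) -> bytes:
    """Prepend a 12-byte fixed RTP header to a single NAL unit."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,            # 16-bit sequence number
                         timestamp & 0xFFFFFFFF,  # 32-bit media timestamp
                         ssrc & 0xFFFFFFFF)       # 32-bit source identifier
    return header + nal_unit


if __name__ == "__main__":
    packet = build_rtp_packet(b"\x65\x88\x84", seq=1, timestamp=3000,
                              ssrc=0x1234, marker=True)
    print(len(packet), packet[:2].hex())
```

The UDP, IP, and MAC headers shown on the slide would then be added by the network stack below this layer.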
I/O buffer management • Input buffers • Output buffers • Figure: buffer queues with Head and Tail pointers, marking the buffers currently inputting and outputting
I/O buffer management • Input / output buffers • Figure: a shared buffer queue with Head and Tail pointers, showing successive states as buffers switch between inputting and outputting
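The head/tail bookkeeping drawn on these slides resembles a ring buffer. The sketch below is one common way to write it, assuming that new frames are written at the tail and consumed from the head; the slides do not state which pointer plays which role. The class name `FrameRing` is hypothetical.

```python
class FrameRing:
    """Fixed-capacity ring buffer: write at the tail, read from the head."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._head = 0   # index of the oldest stored item
        self._tail = 0   # index of the next free slot
        self._count = 0

    def put(self, item) -> bool:
        """Store an item; return False if every slot is occupied."""
        if self._count == len(self._slots):
            return False
        self._slots[self._tail] = item
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def get(self):
        """Remove and return the oldest item, or None if the ring is empty."""
        if self._count == 0:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item


if __name__ == "__main__":
    ring = FrameRing(2)
    ring.put("frame 0")
    ring.put("frame 1")
    print(ring.put("frame 2"), ring.get())
```

Returning `False` on a full ring lets the input task decide whether to drop the frame or wait, which matters when capture and encoding run at different rates.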
System Architecture • Multithreading of this system • Tasks: input task, H.264 encode processing task, TX networking task, RX networking task, H.264 decode processing task, output task
Reference framework for DSP • Reference Framework 5 (RF5) • DSP/BIOS • TMS320 DSP Algorithm Standard (XDAIS) • Processing flow of RF5 • Figure: between a split stage and a join stage, a task holds channels, each channel holds cells, and each cell wraps an XDAIS algorithm (F0 V0, F1 V1, F2 V2, in general Fi, Vi)
Reference framework for DSP • Data communication of RF5 • SIO: between task and device (figure: the task and the device driver exchange data buffers by data pointer through an SIO object) • SCOM: between task and task (figure: the writer task passes a SCOM message holding a data pointer through a SCOM queue to the reader task)
Reference framework for DSP • Data communication of RF5 • ICC: between cell and cell • Figure: cells 1, 2, and 3, each with an in list and an out list; each list element is a pointer to an ICC object, and each ICC object describes a data buffer
Reference framework for DSP • Application control of RF5 • A task receives both SCOM messages and control messages • Figure: SCOM queue for data messages (SCOM message), MBX mailbox for control messages
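The pattern of one task serving two inboxes can be mimicked in a few lines. RF5 itself is a C framework on DSP/BIOS; the Python below is only an analogy that uses `queue.Queue` in place of the SCOM queue and the MBX mailbox, and it does not reproduce any RF5 or DSP/BIOS call. The command name `set_gain` and the `None` stop marker are assumptions made for the example.

```python
import queue


def processing_task(data_q: queue.Queue, control_q: queue.Queue, results: list) -> None:
    """Drain pending control messages, then block for the next data message."""
    gain = 1
    while True:
        # Control mailbox: polled without blocking so data is never held up.
        try:
            while True:
                command, value = control_q.get_nowait()
                if command == "set_gain":
                    gain = value
        except queue.Empty:
            pass

        # Data queue: block until the writer task hands over a buffer.
        buffer = data_q.get()
        if buffer is None:          # stop marker used only in this sketch
            break
        results.append([sample * gain for sample in buffer])


if __name__ == "__main__":
    data_q, control_q, out = queue.Queue(), queue.Queue(), []
    control_q.put(("set_gain", 3))
    data_q.put([1, 2])
    data_q.put(None)
    processing_task(data_q, control_q, out)
    print(out)
```

Polling the control inbox with zero wait and blocking only on data keeps the processing loop paced by the data stream while still reacting to parameter changes between buffers.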
System Architecture • The present system • Figure: input task → H.264 encode processing task (Frame i and Frame i+1, each handled as slices, producing NAL units) → TX networking task; Rx; control task
System Architecture • Multithreading of this system • Figure: input task → H.264 encode processing task (Frame i and Frame i+1, each handled as macroblocks (MB), producing NAL units) → TX networking task; Rx; control task
Parallelizing H.264 • Task-level decomposition: divide the algorithm into balanced tasks; accelerate each task • Data-level decomposition: GOP-level parallelism; frame-level parallelism; slice-level parallelism; macroblock-level parallelism
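H.264 Encoder Block Diagram • Forward path: Fn (current) minus prediction P gives residual Dn → T → Q → X → reorder → entropy encode → NAL • Prediction P: inter (ME and MC from reference F'n-1) or intra (choose intra prediction, intra prediction) • Reconstruction path: X → Q⁻¹ → T⁻¹ → D'n; D'n plus P gives uF'n → filter → F'n (reconstructed)
H.264 Decoder Block Diagram • NAL → entropy decode → reorder → Q⁻¹ → T⁻¹ → D'n • Prediction P: inter (MC from reference F'n-1) or intra (intra prediction) • D'n plus P gives uF'n → filter → F'n (reconstructed)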
Task-level Decomposition • Task profile for H.264 [2]
Parallelizing H.264 • H.264 data structure • Video sequence: GOP0, GOP1, GOP2, …, GOPn • Group of pictures: F0, F1, F2, …, Fn • Frame: Slice 0, Slice 1, Slice 2, Slice 3 • Slice: MB0, MB1, MB2, …, MBn • Macroblock: Y, Cb, Cr
Data-level Decomposition • GOP-level parallelism: high latency, large memory • Frame-level parallelism: I, P, B frame imbalance • Slice-level parallelism: bitrate increases • Macroblock-level parallelism
Macroblock-level Parallelism • Spatial parallelism • Temporal parallelism • Spatial & temporal parallelism • Possible data dependencies for a macroblock • Figure: the current MB in frame i + 1 depends on neighboring MBs of the same frame (intra prediction, MV prediction, deblocking filter) and on the search window in frame i
Macroblock-level Parallelism • Spatial parallelism • Figure legend: MBs processed, MBs being processed, MBs to be processed
Macroblock-level Parallelism • Temporal parallelism • Figure: frame i and frame i + 1 • Figure legend: MBs processed, MBs being processed, MBs to be processed
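Spatial macroblock parallelism is often described as a wavefront. The sketch below assumes each macroblock depends on its left, upper-left, upper, and upper-right neighbors, a common reading of the intra-prediction and MV-prediction dependencies; the slides do not list the exact neighbor set. Under that assumption, macroblock (x, y) can start at step x + 2y, and all macroblocks sharing a step can run in parallel. The function names are hypothetical.

```python
def wavefront_schedule(width_mbs: int, height_mbs: int) -> list:
    """Group macroblock coordinates (x, y) by the earliest step they can run."""
    waves = {}
    for y in range(height_mbs):
        for x in range(width_mbs):
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[step] for step in sorted(waves)]


def max_parallel_mbs(width_mbs: int, height_mbs: int) -> int:
    """Largest number of macroblocks that share one step of the wavefront."""
    return max(len(wave) for wave in wavefront_schedule(width_mbs, height_mbs))


if __name__ == "__main__":
    for step, wave in enumerate(wavefront_schedule(4, 3)):
        print(step, wave)
```

The size of the widest wave bounds how many DSPs can be kept busy inside one frame, which is one motivation for also overlapping consecutive frames.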
Macroblock-level Parallelism • Spatial & temporal parallelism frame i + 1 frame i
Memory Issue • Limited memory of DM642 • Use memory buffer to reduce memory access • Figure: two-level cache architecture of DM642: DSP core; L1P cache (direct mapped, 16 Kbytes total); L1D cache (2-way set associative, 16 Kbytes total); L2 cache/memory (256 Kbytes total); EDMA controller; peripherals
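A quick calculation shows why the 256 Kbytes of L2 is tight. The sketch below compares YUV 4:2:0 frame sizes against that figure. The resolutions (QCIF, CIF, D1) are assumptions chosen for illustration, since the slides do not state the frame size used, and the calculation ignores the share of L2 taken by code, stack, and cache configuration.

```python
L2_BYTES = 256 * 1024  # L2 cache/memory size quoted on the slide


def yuv420_frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit YUV 4:2:0 frame: full-size Y plus quarter-size Cb and Cr."""
    return width * height * 3 // 2


def frames_fit_in_l2(width: int, height: int, frame_count: int) -> bool:
    """True if the given number of whole frames fits in L2 with nothing else in it."""
    return frame_count * yuv420_frame_bytes(width, height) <= L2_BYTES


if __name__ == "__main__":
    for name, (w, h) in {"QCIF": (176, 144), "CIF": (352, 288), "D1": (720, 480)}.items():
        print(name, yuv420_frame_bytes(w, h), frames_fit_in_l2(w, h, 2))
```

Under these assumptions a current frame plus one reference frame at CIF already exceeds L2, so only parts of a frame can be held on chip at a time.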
Memory Issue • Memory hierarchy for inter prediction Memory hierarchy [4]
Memory Issue • Slice memory buffer for intra prediction and deblocking filter • Figure: slice memory [5]
References • [1] Texas Instruments, Incorporated, "Reference Frameworks for eXpressDSP Software: RF5, An Extensive, High-Density System" (spru795a). • [2] T. C. Chen, H. C. Fang, C. J. Lian, C. H. Tsai, "Algorithm analysis and architecture design for HDTV applications - a look at the H.264/AVC video compressor system," IEEE Circuits & Devices Magazine, May/June 2006. • [3] Cor Meenderinck, Arnaldo Azevedo, and Ben Juurlink, "Parallel Scalability of Video Decoders," April 29, 2008. • [4] Denolf, K. De Vleeschouwer, et al., "Memory centric design of an MPEG-4 video encoder," IEEE Trans. CSVT, Vol. 15, No. 5, pp. 609-619, May 2005. • [5] Tsu-Ming Liu et al., "A 125μW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications," ISSCC Digest of Technical Papers, pp. 402-403, Feb. 2006. • [6] T. Wiegand et al., "Overview of H.264/AVC Video Coding Standard," IEEE Trans. on Circ. and Sys. for Video Technology, Vol. 13, No. 7, pp. 560-576, July 2003. • [7] VITEC MULTIMEDIA, "MEX User manual Revision 1.7".