Entropy Coding on a Programmable Processor Array for Multimedia SoC

Roberto R. Osorio and Javier D. Bruguera, Dept. Electronic and Computer Engineering, University of Santiago de Compostela, Spain. E-mail: (roberto,bruguera)@dec.usc.es

Presentation Transcript


  1. Entropy Coding on a Programmable Processor Array for Multimedia SoC Roberto R. Osorio and Javier D. Bruguera University of Santiago de Compostela. SPAIN Dept. Electronic and Computer Engineering e-mail: (roberto,bruguera)@dec.usc.es USC 2007

  2. Outline • Entropy coding • Relevance • Complexity • Options for implementation • Application-specific accelerators • Reconfigurable instruction-set extensions • Programmable processors • ASIPs • Our proposal: a processor array • Implementation view • Implementation details • Results and conclusions

  3. Entropy coding • Lossless data compression • More probable symbols (events) → short codewords • Less probable symbols → long codewords • It is a critical task in implementing multimedia standards • It is more than just Huffman or arithmetic coding • Zig-zag, run-length, binarization, context selection, ... • Focusing only on pure entropy coding yields poor acceleration • In JPEG 2000 it accounts for more than 50% of the computations • In other standards it is only 5-10%, however... • 10% can be a lot in video encoding • It does not benefit from SIMD or MIMD due to: • Data dependencies • Bit-level operations
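
The rule "more probable symbols get short codewords" can be illustrated with a small Huffman code-length computation. This is only a minimal sketch: the function name and the four-symbol probability table are hypothetical and do not come from the presentation.

```python
import heapq
from itertools import count

def huffman_code_lengths(probabilities):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside them
        # moves one level deeper, so its codeword grows by one bit.
        p1, _, lengths1 = heapq.heappop(heap)
        p2, _, lengths2 = heapq.heappop(heap)
        merged = {s: n + 1 for s, n in {**lengths1, **lengths2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical source: probabilities chosen only for illustration.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probabilities)
average_bits = sum(probabilities[s] * lengths[s] for s in probabilities)
```

With this table the most probable symbol gets a 1-bit codeword and the two least probable ones get 3 bits each, for an average of 1.75 bits per symbol instead of the 2 bits of a fixed-length code.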

  4. Options for implementation • Application-specific hardware • Highest performance • High throughput, low latency and low power consumption • Optimized integration reduces latency and cost • Painful design process • Skilled engineers needed • Complex implementation. Errors may show up after tape-out • No flexibility: one design → one or two applications • Reconfigurable instruction sets or accelerators • High flexibility: one application → one design • Errors can be corrected at (almost) any time • Still, many times slower, bigger and more power hungry than an ASIC • Painful design process • Skilled engineers needed • Benefits of accelerating small kernels are limited by Amdahl's law
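
The Amdahl's law limit mentioned above can be made concrete with a short calculation. The sketch below assumes a kernel that takes 10% of the run time (the upper end of the 5-10% range quoted for entropy coding in most standards); the function name and the 10x kernel speedup are illustrative assumptions.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when `fraction` of the run time is made `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

# Assumed numbers, for illustration only.
speedup_10x = amdahl_speedup(0.10, 10.0)            # kernel accelerated 10 times
speedup_limit = amdahl_speedup(0.10, float("inf"))  # kernel made infinitely fast
```

Even an infinitely fast accelerator for a 10% kernel caps the overall speedup at 1/0.9, roughly 1.11, which is why accelerating small kernels in isolation pays off so little.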

  5. Options for implementation (2) • Programmable processors • Limited performance, high power consumption • Several choices • Scalar processors → poor performance • You get what you pay for • Superscalar → high power consumption • Diminishing returns • VLIW → something in between • Preferred choice for implementing multimedia systems • Performance suffers due to data dependencies • Best flexibility • One design → any application • Changes can be applied in the field

  6. Entropy coding on programmable processors • Example application • Context-adaptive Binary Arithmetic Coder (CABAC) in H.264 • Data binarization • Context selection and updating • Binary arithmetic coding • Bit-stream formation • The number of operations in high-quality encoding scenarios is overwhelming! [Figure: a bit stream coded on a RISC or VLIW at ~50 ops/symbol; scale marks 50 and 200 M symbols/s, 2.5 and 10 Gops]
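
The figures on this slide appear to be related by a simple product: symbol rate times operations per symbol. The sketch below works through that arithmetic; pairing 50 and 200 M symbols/s with 2.5 and 10 Gops is an interpretation of the slide's numbers, and the function name is hypothetical.

```python
OPS_PER_SYMBOL = 50  # approximate cost per symbol quoted on the slide

def required_gops(msymbols_per_second, ops_per_symbol=OPS_PER_SYMBOL):
    """Operation rate in Gops needed to sustain a symbol rate given in M symbols/s."""
    return msymbols_per_second * 1e6 * ops_per_symbol / 1e9

low_rate_gops = required_gops(50)    # 50 M symbols/s
high_rate_gops = required_gops(200)  # 200 M symbols/s
```

Under this reading, 50 M symbols/s already needs about 2.5 Gops and 200 M symbols/s about 10 Gops, which matches the values shown on the slide.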

  7. Hardware-software co-design • Need for efficient implementations • Processing speed • Power consumption [Figure: HW/SW exploration for an MPEG-4 encoder, VGA resolution @30fps, 4.1 GIPS, on a scale from 0 RISC (HW, low cost) to 21 RISC (SW, greater flexibility). Design point 1: SW on 5 RISC, 4 threads (88% utilization), coprocessor with Clip, Div, Abs, Sgn; HW (80% performance): DCT, SAD, BDIFF, BADD, BQ, BIQ. Design point 2: SW on 15 RISC, 16 threads (75% utilization), coprocessor with Clip, Div, Abs, Sgn; HW (65% performance): DCT, SAD. Source: Pierre Paulin, ST Microelectronics, Euromicro DSD 2004]

  8. Motivation for a new platform • Applications: image visualization, video playing, music, sound recording, still digital cameras, video cameras, digital TV, time shifting, multiple tuners, continuous recording, … • Formats: JPEG, GIF, PNG, TIFF, JPEG 2000, MPEG-1, MPEG-2, MPEG-4 SP, H.264, WMV, QuickTime, PDF, … • Algorithms: Huffman, Q-Coder, QM-Coder, MQ-Coder, CABAC, Rice, Golomb, Exp-Golomb, Lempel-Ziv, run-length, … • Devices

  9. Motivation for a new platform • Increasing complexity • Support for multiple standards, services and applications • Complexity grows quadratically with the size of the problem • Implementation for heterogeneous platforms [Figure: software size (thousands of lines) and design effort (engineers x month) for the 1990's, 2002, 3G and 2010; data labels 5, 13, 50, 146, 350, 500, 1022 and 1637. Source: TI 2002]

  10. ASIP • Application-Specific Instruction-set Processor • Tailored to a given range of applications • Best performance and lower cost for a programmable processor • Still retains high flexibility • Design process • From scratch • From a base processor • Profiling • Adding new instructions / removing unused ones • Adding / removing functional units • Tailoring instruction format and signal widths • Other alternatives • Tensilica

  11. Our ASIP implementation • Array of low-cost processors • 8-bit processors • 2-stage pipeline: fetch/decode and execute • 2 instructions per cycle in a VLIW fashion • Each processor has its own data and code memories • Communication through queues • A linear structure has been found to be sufficient so far • Global memory accessed through a shared bus [Figure: a chain of processors (P), each with its own local memory (mem)]

  12. Architecture [Figure: processor block diagram with program local memory, fetch & decoding, flow control, pipeline registers, registers bank, data local memory and 8-bit data paths]

  13. Instruction set • 8-bit instructions • add and sub with and without carry • and, or, exor • left and right shift and rotation (only 1 bit at a time) • conditional (zero, carry) and unconditional branch • memory load and store • data and code prefetch • queue input and output • 16-bit instructions: carry bit passes to the next ALU • We do not implement • call and return • put an address in the queue for next processor • jump to an address in the queue • stack management • interrupts
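
The way a 16-bit operation can be built from two 8-bit ALUs by passing the carry bit along can be sketched as follows. This models only the arithmetic, not the actual hardware or instruction encoding; `add8` and `add16` are hypothetical names.

```python
def add8(a, b, carry_in=0):
    """8-bit add with carry: returns (8-bit result, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8

def add16(a, b):
    """16-bit add built from two 8-bit adds; the carry passes from the low ALU to the high ALU."""
    low, carry = add8(a & 0xFF, b & 0xFF)
    high, carry = add8(a >> 8, b >> 8, carry)
    return (high << 8) | low, carry

result, carry_out = add16(0x12FF, 0x0001)     # low byte overflows into the high byte
wrapped, wrapped_carry = add16(0xFFFF, 0x0001)  # 16-bit overflow sets the final carry
```

In the first call the low byte wraps to 0x00 and its carry increments the high byte, giving 0x1300; in the second the carry leaves the high ALU as well.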

  14. Programming model • Start up • First processor reads starting address from the queue • Initialization subroutine puts an address in the queue for the next processor • After a few cycles, all processors are up • Processing • Each processor executes a part of the code and communicates with other processors using the queues • Processors read the queues at specific points in their code • Empty/full queues make processors stall • The same applies to data or code not present in the local memory • Switching to another subroutine • When the work is done, processors read a new address from the queue • Some processors always execute the same piece of code
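
The processing phase described above (a linear chain of processors that stall on empty or full queues) can be sketched with a small cycle-by-cycle simulation. This is a simplified model under stated assumptions: each processor does one unit of work per cycle, the queue capacity of 2 is arbitrary, and the start-up and subroutine-switching steps are not modelled. All names are hypothetical.

```python
from collections import deque

class Processor:
    """One array element: applies `work` to each item read from its input queue."""
    def __init__(self, work):
        self.work = work
        self.stalls = 0

    def step(self, inbox, outbox, capacity):
        # Stall when there is nothing to read or no room to write.
        if not inbox or len(outbox) >= capacity:
            self.stalls += 1
            return
        outbox.append(self.work(inbox.popleft()))

def run_linear_array(stages, items, capacity=2):
    """Push `items` through a linear chain of processors joined by bounded queues."""
    processors = [Processor(work) for work in stages]
    queues = [deque(items)] + [deque() for _ in stages]
    total = len(queues[0])
    while len(queues[-1]) < total:
        # Step the last processor first so an item advances one stage per cycle.
        for i in reversed(range(len(processors))):
            limit = total if i == len(processors) - 1 else capacity
            processors[i].step(queues[i], queues[i + 1], limit)
    return list(queues[-1]), [p.stalls for p in processors]

# Three hypothetical stages standing in for parts of an encoder.
outputs, stalls = run_linear_array(
    [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3], range(5)
)
```

In this run the later processors stall during the first cycles while the pipeline fills, and the earlier ones stall at the end while it drains, which is the behaviour the slide attributes to empty and full queues.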

  15. Distributing the code • Stages: data binarization, context modelling, encoding iteration, output • Ideal structure: each stage is its own for(…){ … } loop [Figure: four stage loops placed side by side, with call/return links between stages]

  16. Case study • CABAC encoding in H.264 • Follows a pipelined structure • Irregular algorithms • Not well suited for software pipelining • Zig-zag coefficient ordering: LUT-based indirections • Binarization: data dependencies • Context management: table accessing and updating • Binary arithmetic coding: bit-level operations and data dependencies • JPEG encoding • Zig-zag coefficient ordering: LUT-based indirections • Token formation: data dependencies • Huffman encoding: bit manipulation
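
The "LUT-based indirections" of zig-zag coefficient ordering can be sketched as follows: a table of indices is built once and every block is then reordered by indexed reads. The sketch generates the table for the classic zig-zag scan of an n x n block (8 x 8 in JPEG, 4 x 4 for H.264 frame blocks); the function names are hypothetical and this is not the code used in the presentation.

```python
def zigzag_lut(n):
    """Return the zig-zag scan as a list of raster indices for an n x n block."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Walk the anti-diagonals in order; odd diagonals run top-right to
    # bottom-left (row increasing), even ones run the other way (column increasing).
    cells.sort(key=lambda rc: (rc[0] + rc[1],
                               rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return [r * n + c for r, c in cells]

def reorder(block, lut):
    """Reorder a block stored in raster order: one indirect load per coefficient."""
    return [block[i] for i in lut]

lut4 = zigzag_lut(4)
lut8 = zigzag_lut(8)
scanned = reorder(list(range(100, 116)), lut4)  # dummy 4 x 4 block of coefficients
```

Each output coefficient costs one table read plus one dependent data read, which is the kind of irregular memory access that the slide lists as hard to pipeline in software.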

  17. Results • Comparison with a TI TMS320C6711 VLIW DSP • 5 of our processors were used in both cases • CABAC • 10 macroblocks from the 3rd frame of Foreman QCIF encoded as a P-frame with quantizer 28 • JPEG • 10 macroblocks from the Lena image at quality level 75

  18. Other algorithms • We expect other encoding algorithms to perform similarly to the ones studied: • CAVLC in H.264 • Huffman in MPEG-2 and MPEG-4 • EBCOT in JPEG 2000, ... • Decoding presents serious data dependencies • We have studied CABAC decoding • We have been working on reducing the impact of data dependencies • At this moment we do not have: • A complete implementation • An efficient implementation on another platform to compare with

  19. Other algorithms [Figure: processing stages mapped onto the array for each algorithm] • H.264 CABAC encoding: zig-zag and quantization, significance map and significant coefficients, contexts modeling, encoding iteration, bit-stream formation • H.264 CABAC decoding: bit-stream parsing, arithmetic decoding, contexts modeling, coefficients reconstruction, zig-zag and de-quantization • JPEG encoder: zig-zag and quantization, run-length coefficients processing, Huffman encoding, bit-stream formation • JPEG decoder: bit-stream parsing, Huffman decoding, coefficients reconstruction, zig-zag and de-quantization • JPEG 2000 encoder: EBCOT 1.1, context modeling, encoding iteration, bit-stream formation, EBCOT 1.2 • JPEG 2000 decoder: EBCOT 1.1, bit-stream parsing, arithmetic decoding, context modeling, EBCOT 1.2

  20. Data dependencies in the decoder • Decoder stages: data reconstruction, context modelling, decoding iteration [Figure: stage diagrams labelled data binarization, context modeling, encoding iteration and output; the boxes contain only placeholder text]

  21. Workaround
      • Starting point:
        data_reconstruction(…){ … do{ … context_modeling(…) … use_value … } … }
        context_modeling(…){ … … decoding_iteration(…) … use_value … }
        decoding_iteration(…){ … … … … }
      • INLINING:
        data_reconstruction(…){ … do{ … // context_modeling … … decoding_iteration(…) … … use_value … } … }
        decoding_iteration(…){ … … … … }
      • CODE REDISTRIBUTION:
        data_reconstruction(…){ … do{ decoding_iteration(…) … … … … … … use_value } … }
        decoding_iteration(…){ … … … … }
      • Call structure without the use_value dependencies:
        data_reconstruction(…){ … do{ … context_modeling(…) … } … }
        context_modeling(…){ … … decoding_iteration(…) … … }
        decoding_iteration(…){ … … … … }

  22. [Figure: an assembly listing for the proposed processor (bzr 100; input $2; output $4; xor $0 $0; add $0 1; output $2; fetch $4; and $4 7; add $1 4; sl0 $4; add $4 $5) shown next to a VHDL fragment (a clocked SYNC process that, on reset, clears codigoOutReg, numSeqOutReg, calcSreg, calcCreg and shiftOutReg), with the labels "ASIC + FPGA", "coarse grain", "Applications" and "media processor"]

  23. Implementation issues • Yield • Utilization • Power • Reduce voltage • Reduce clock frequency [Figure: arrays of processors (P), each with its local memory (mem)]

  24. An ASIP-based media-processor [Figure: block diagram of a media processor combining ASIP arrays with processors (P), memories (Mem) and I/O blocks; the remaining block labels are not recoverable]

  25. Implementation results • Area and speed figures for the proposed processor using AMS 0.35 µm libraries

  26. Comparison • Approximate comparison of the hardware cost of a 5-element processor array and a TI C6711 VLIW DSP

  27. Conclusions • Entropy coding is a complex task in multimedia applications that often needs hardware acceleration • The implementation cost and lack of flexibility of hardware demand programmable solutions with comparable performance • ASIPs are an intermediate solution between hardware accelerators and general-purpose processors • In this work an ASIP is proposed for entropy encoding • This ASIP is not based on optimized new instructions but on achieving high parallelism in computations and data flow • Results demonstrate that this is a valid approach for the applications we have studied • We intend to extend the results to other applications
