Entropy Coding on a Programmable Processor Array for Multimedia SoC • Roberto R. Osorio and Javier D. Bruguera • University of Santiago de Compostela, SPAIN • Dept. of Electronic and Computer Engineering • e-mail: (roberto,bruguera)@dec.usc.es • USC 2007
Outline • Entropy coding • Relevance • Complexity • Options for implementation • Application-specific accelerators • Reconfigurable instruction-set extensions • Programmable processors • ASIPs • Our proposal: a processor array • Implementation overview • Implementation details • Results and conclusions
Entropy coding • Lossless data compression • More probable symbols (events) → short codewords • Less probable symbols → long codewords • It is a critical task in implementing multimedia standards • It is more than just Huffman or arithmetic coding • Zig-zag, run-length, binarization, context selection, ... • Focusing just on pure entropy coding yields poor acceleration • In JPEG-2000 it represents more than 50% of the computation • In other standards it is just 5-10%; however... • 10% can be a lot in video encoding • It does not benefit from SIMD or MIMD due to: • Data dependencies • Bit-level operations
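Binarization is one of the support tasks listed above. As an illustrative sketch (not the paper's code), the 0th-order Exp-Golomb mapping used for binarization in H.264 assigns shorter codewords to smaller values:

```python
def exp_golomb_encode(n):
    """Binarize a non-negative integer with 0th-order Exp-Golomb (ue(v))."""
    bits = bin(n + 1)[2:]                 # binary representation of n+1
    return "0" * (len(bits) - 1) + bits   # leading-zero prefix, then n+1

def exp_golomb_decode(code):
    """Inverse mapping: count leading zeros, then read the value."""
    zeros = len(code) - len(code.lstrip("0"))
    return int(code[zeros:], 2) - 1
```

For example, 0 → "1", 1 → "010", 3 → "00100": the codeword length grows only logarithmically with the value, which is what makes the code efficient when small values dominate.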
Options for implementation • Application-specific hardware • Highest performance • High throughput, low latency and low power consumption • Optimized integration reduces latency and cost • Painful design process • Skilled engineers needed • Complex implementation: errors may show up after tape-out • No flexibility: one design → one or two applications • Reconfigurable instruction sets or accelerators • High flexibility: one application → one design • Errors can be corrected at (almost) any time • Still, many times slower, bigger and more power hungry than an ASIC • Painful design process • Skilled engineers needed • Benefits of accelerating small kernels limited by Amdahl's law
Options for implementation (2) • Programmable processors • Limited performance, high power consumption • Several choices • Scalar processors → poor performance • You get what you pay for • Superscalar → high power consumption • Diminishing returns • VLIW → something in between • Preferred choice for implementing multimedia systems • Performance suffers due to data dependencies • Best flexibility • One design → any application • Changes can be applied in the field
Entropy coding on programmable processors • [Figure: on a RISC or VLIW, entropy coding takes ~50 operations per symbol; at 50-200 Msymbols/s that amounts to 2.5-10 Gops!] • Example application • Context-adaptive Binary Arithmetic Coder (CABAC) in H.264 • Data binarization • Context selection and updating • Binary arithmetic coding • Bit-stream formation • The number of operations in high-quality encoding scenarios is overwhelming!
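The binary arithmetic coding step narrows an interval once per symbol, so every symbol depends on the result of the previous one; this serial dependency is what defeats SIMD. A minimal, idealized infinite-precision sketch of that interval subdivision (not the actual fixed-point H.264 CABAC algorithm, which adds context adaptation and renormalization):

```python
from fractions import Fraction

def ac_encode(bits, p1):
    """Encode a bit sequence; p1 is a fixed probability of a 1 bit."""
    low, width = Fraction(0), Fraction(1)
    for b in bits:
        split = width * (1 - p1)        # sub-interval assigned to a 0 bit
        if b == 0:
            width = split
        else:
            low, width = low + split, width - split
    return low + width / 2              # any value inside the final interval

def ac_decode(value, p1, n):
    """Recover n bits by retracing the same interval subdivisions."""
    low, width, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        split = width * (1 - p1)
        if value < low + split:
            out.append(0)
            width = split
        else:
            out.append(1)
            low, width = low + split, width - split
    return out
```

Note that `low` and `width` after symbol k are inputs to symbol k+1: the loop cannot be vectorized or unrolled across symbols, matching the slide's point about data dependencies.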
Hardware-software co-design • Need for efficient implementations • Processing speed • Power consumption • [Figure: design-space exploration for an MPEG-4 encoder at VGA resolution @ 30 fps (4.1 GIPS), from Pierre Paulin, ST Microelectronics, Euromicro DSD 2004; solutions range from 0 to 21 RISC cores, trading low cost (more HW) against greater flexibility (more SW). One point: SW on 5 RISC cores, 4 threads (88% utilization), coprocessors for Clip, Div, Abs, Sgn, and HW (80% of performance) for DCT, SAD, BDIFF, BADD, BQ, BIQ. Another: SW on 15 RISC cores, 16 threads (75% utilization), the same coprocessors, and HW (65% of performance) for DCT and SAD]
Motivation for a new platform • Devices and applications: image visualization, video playing, music, sound recording, still digital cameras, video cameras, digital TV, time shifting, multiple tuners, continuous recording, ... • Formats: JPEG, GIF, PNG, TIFF, JPEG 2000, MPEG-1, MPEG-2, MPEG-4 SP, H.264, WMV, QuickTime, PDF, ... • Algorithms: Huffman, Q-Coder, QM-Coder, MQ-Coder, CABAC, Rice, Golomb, Exp-Golomb, Lempel-Ziv, run-length, ...
Motivation for a new platform • Increasing complexity • Support for multiple standards, services and applications • Complexity grows quadratically with the size of the problem • Implementation for heterogeneous platforms • [Chart: code size (thousands of lines) and design effort (engineer-months) grow by orders of magnitude from the 1990's through 2002 and 3G to 2010. Source: TI 2002]
ASIP • Application-Specific Instruction-set Processor • Tailored to a given range of applications • Best performance and lowest cost among programmable processors • Still retains high flexibility • Design process • From scratch • From a base processor • Profiling • Adding new instructions / removing unused ones • Adding / removing functional units • Tailoring instruction formats and signal widths • Other alternatives • Tensilica
Our ASIP implementation • [Diagram: linear array of processors, each with its own local memory, linked by queues] • Array of low-cost processors • 8-bit processors • 2-stage pipeline: fetch/decode and execute • 2 instructions per cycle in a VLIW fashion • Each processor has its own data and code memories • Communication through queues • A linear structure has been found to be sufficient so far • Global memory accessed through a shared bus
Architecture • [Block diagram: program local memory feeding fetch & decoding and flow control; pipeline registers; register bank; 8-bit datapaths; data local memory]
Instruction set • 8-bit instructions • add and sub, with and without carry • and, or, xor • left and right shift and rotate (by 1 bit at a time) • conditional (zero, carry) and unconditional branch • memory load and store • data and code prefetch • queue input and output • 16-bit instructions: the carry bit passes to the next ALU • We do not implement • call and return • put an address in the queue for the next processor • jump to an address in the queue • stack management • interrupts
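The 16-bit instructions described above chain two 8-bit ALUs through the carry bit. A minimal sketch of that datapath behavior (function names are illustrative, not the actual ISA):

```python
def add8(a, b, cin=0):
    """One 8-bit ALU slice: returns (8-bit sum, carry out)."""
    s = a + b + cin
    return s & 0xFF, s >> 8

def add16(a, b):
    """A 16-bit add built from two 8-bit ALUs chained by the carry."""
    lo, carry = add8(a & 0xFF, b & 0xFF)       # low byte, produces carry
    hi, _ = add8(a >> 8, b >> 8, carry)        # high byte, consumes carry
    return (hi << 8) | lo
```

For example, `add16(0x00FF, 0x0001)` propagates a carry from the low ALU into the high ALU and yields `0x0100`.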
Programming model • Start-up • The first processor reads its starting address from the queue • The initialization subroutine puts an address for the next processor • After a few cycles, all processors are up • Processing • Each processor executes a part of the code and communicates with the other processors using the queues • Processors read the queues at specific points in their code • Empty/full queues make processors stall • The same applies when data or code is not present in the local memory • Switching to another subroutine • When the work is done, processors read a new address from the queue • Some processors always execute the same piece of code
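The stall-on-empty / stall-on-full behavior of the queues can be sketched with threads and bounded blocking queues (the stage functions below are placeholders, not the actual codec stages):

```python
import queue
import threading

def stage(fn, q_in, q_out):
    """One processor: apply fn to items from q_in until a None sentinel.
    get() blocks on an empty queue and put() on a full one, which models
    the processor stalling."""
    while True:
        item = q_in.get()
        if item is None:
            q_out.put(None)        # forward the sentinel downstream
            return
        q_out.put(fn(item))

# Illustrative two-stage pipeline: binarize a value, then count its bits
q0, q1, q2 = (queue.Queue(maxsize=4) for _ in range(3))
threads = [
    threading.Thread(target=stage, args=(lambda v: bin(v)[2:], q0, q1)),
    threading.Thread(target=stage, args=(lambda s: len(s), q1, q2)),
]
for t in threads:
    t.start()
for v in (5, 255, 1):
    q0.put(v)
q0.put(None)

results = []
while (r := q2.get()) is not None:
    results.append(r)
for t in threads:
    t.join()
print(results)   # lengths of the binary strings: [3, 8, 1]
```

The small `maxsize` mirrors the hardware queues: a fast producer fills its output queue and stalls until the downstream consumer catches up, with no explicit synchronization in the stage code itself.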
Distributing the code • Ideal structure: a chain of loops, one per stage, linked by calls and returns through the queues • Data binarization → Context modelling → Encoding iteration → Output • [Diagram: four for(…){…} loops, one per processor, connected by call/return pairs]
Case study • CABAC encoding in H.264 • Follows a pipelined structure • Irregular algorithms • Not well suited for software pipelining • Zig-zag coefficient ordering: LUT-based indirections • Binarization: data dependencies • Context managing: table access and updating • Binary arithmetic coding: bit-level operations and data dependencies • JPEG encoding • Zig-zag coefficient ordering: LUT-based indirections • Token formation: data dependencies • Huffman encoding: bit manipulation
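The zig-zag coefficient ordering mentioned for both case studies is a pure LUT-based indirection. A sketch of generating and applying that LUT for an 8x8 block (names are illustrative):

```python
def zigzag_order(n=8):
    """Flat indices of an n x n block in zig-zag scan order: coefficients
    are visited diagonal by diagonal, alternating direction."""
    coords = sorted(((i, j) for i in range(n) for j in range(n)),
                    key=lambda p: (p[0] + p[1],
                                   p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [i * n + j for i, j in coords]

ZIGZAG = zigzag_order()

def zigzag_scan(block):
    """Reorder 64 row-major coefficients via the LUT indirection."""
    return [block[k] for k in ZIGZAG]
```

Each output element requires a dependent load through the table (`block[ZIGZAG[k]]`), which is the kind of indirection that irregular entropy-coding kernels are full of.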
Results • Comparison with a TI TMS320C6711 VLIW DSP • 5 of our processors were used in both cases • CABAC • 10 macroblocks from the 3rd frame of Foreman QCIF, encoded as a P-frame with quantizer 28 • JPEG • 10 macroblocks from the Lena image with quality level 75
Other algorithms • We expect other encoding algorithms to perform similarly to the ones proposed: • CAVLC in H.264 • Huffman in MPEG-2 and MPEG-4 • EBCOT in JPEG-2000, ... • Decoding presents serious data dependencies • We have studied CABAC decoding • We have been working on reducing the impact of data dependencies • At this moment we do not have: • A complete implementation • An efficient implementation on another platform to compare with
Other algorithms • H.264 CABAC encoder: Zig-zag quantization → Significance map / significant coefficients → Contexts modeling → Encoding iteration → Bit-stream formation • H.264 CABAC decoder: Bit-stream parsing → Arithmetic decoding → Contexts modeling → Zig-zag de-quantization → Coefficients reconstruction • JPEG encoder: Coefficients processing → Run-length / zig-zag quantization → Huffman encoding → Bit-stream formation • JPEG decoder: Bit-stream parsing → Huffman decoding → Zig-zag de-quantization → Coefficients reconstruction • JPEG 2000 encoder: EBCOT 1.1 (Context modeling → Encoding iteration) → Bit-stream formation → EBCOT 1.2 • JPEG 2000 decoder: Bit-stream parsing → EBCOT 1.1 (Arithmetic decoding → Context modeling) → EBCOT 1.2
Data dependencies in the decoder • Stages: data reconstruction → context modelling → decoding iteration • Each stage calls the next one and must wait for the returned value before it can continue, which breaks the pipelined structure used in the encoder
Work-around • Original structure:
data_reconstruction(…){ … do{ … context_modeling(…) … use_value … } … }
context_modeling(…){ … decoding_iteration(…) … use_value … }
decoding_iteration(…){ … }
• INLINING: context_modeling is merged into data_reconstruction, which then calls decoding_iteration directly:
data_reconstruction(…){ … do{ … // context_modeling … decoding_iteration(…) … use_value … } … }
decoding_iteration(…){ … }
• CODE REDISTRIBUTION: the call to decoding_iteration is moved to the start of the loop body, so its result is consumed (use_value) only at the end of the iteration:
data_reconstruction(…){ … do{ decoding_iteration(…) … use_value } … }
decoding_iteration(…){ … }
Applications • [Slide contrasts implementation styles for a media processor: assembly for the proposed processor array (bzr, input, output, xor, add, fetch, and, sl0, …) versus VHDL register-transfer code for an ASIC or a coarse-grain FPGA]
Implementation issues • Yield • Utilization • Power • Reduce voltage • Reduce clock frequency • [Diagram: arrays of processor (P) and memory (mem) tiles]
An ASIP-based media-processor • [Diagram: SoC combining arrays of the proposed ASIPs with memories, I/O blocks and other accelerators]
Implementation results • Area and speed figures for the proposed processor using AMS 0.35 µm libraries
Comparison • Approximate comparison of the hardware cost of a 5-element processor array and a TI C6711 VLIW DSP
Conclusions • Entropy coding is a complex task in multimedia applications that often needs hardware acceleration • The implementation cost and lack of flexibility of accelerators demand programmable solutions with comparable performance • ASIPs are an intermediate solution between hardware accelerators and general-purpose processors • In this work an ASIP is proposed for entropy encoding • This ASIP is not based on new optimized instructions but on achieving high parallelism in computations and data flow • Results demonstrate that this is a valid approach for the applications we have studied • We intend to extend the results to other applications