Progress on media processor design

Presented by
Chunyue Liu (liucy@vlsi.zju.edu.cn)
Xiaolang Yan (yan@vlsi.zju.edu.cn)
Xing Qin (qinx@vlsi.zju.edu.cn)
Jian Yang (yangj@vlsi.zju.edu.cn)
Xiaohua Luo (luoxh@vlsi.zju.edu.cn)
Peiyong Zhang (zhangpy@vlsi.zju.edu.cn)
Dake Liu (dake@isy.liu.se)
Embedded DSP Research & Develop Group

Outline
• Overview of media processor
• Progress on Spock
• Progress on Schubert
  - Overview
  - Key features
  - Performance
• Conclusions & Problems

Background and Challenges
• Media applications have very high computational complexity
  - H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
• Media processors are in demand
  - Some state-of-the-art media processors (e.g. Nomadik, DaVinci)
• Multiple standards coexist
  - Flexible & programmable
• Our current IC design level constraint (200 MHz @ 0.18 um)
• ASIP is the best choice
• Our proposal at IC-DFN’05

Overview of media processor
• Programmable and heterogeneous processors on a SoC platform
  - General MCU (CK510, a 32-bit RISC core): user interface (GUI), OS (Linux)
  - Enhanced DSP (Spock): audio processing, bitstream parsing, data transfer
  - Vector processor (Schubert): video processing

Progress on Spock
• Developed tool chain
  - Assembler, simulator and debugger
• FPGA prototype: real-time decoding
  - 128 kb/s OGG @ 40 MHz
• To test Spock, a dual-core SoC platform was developed
  - Integrated with CK510
  - Inter-processor communication uses mailbox and shared memory
  - 0.18 um, less than 500 mW, 166 MHz
  - CK510 core area: 2 x 2 mm²
  - Spock core area: 1.5 x 1.5 mm²

Overview of Spock
• Optimization for control
  - Branch optimization: conditional execution, 2-level hardware loop, repeat
• Optimization for signal processing
  - Multiple addressing modes: post-increment/decrement (++/--), reverse/modulo addressing
  - MAC with parallel load
  - VLX instruction set extension: putbits, showbits, getbits, etc.
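To illustrate what the VLX extension accelerates, here is a minimal software sketch of showbits and getbits as they are commonly written in reference decoders. The slides do not define the exact semantics of the Spock instructions, so the MSB-first bit order, the `BitReader` structure and its field names are assumptions made for illustration only; putbits is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software bit reader; not taken from the Spock ISA. */
typedef struct {
    const uint8_t *buf;  /* bitstream bytes */
    size_t size;         /* buffer length in bytes */
    size_t bitpos;       /* next bit to read, counted from the MSB of buf[0] */
} BitReader;

/* Peek at the next n bits (1..32), MSB first, without consuming them.
   Bits beyond the end of the buffer read as zero (an assumption). */
uint32_t showbits(const BitReader *br, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t pos = br->bitpos + i;
        size_t byte = pos >> 3;
        uint32_t bit = 0;
        if (byte < br->size)
            bit = (uint32_t)(br->buf[byte] >> (7 - (pos & 7))) & 1u;
        v = (v << 1) | bit;
    }
    return v;
}

/* Read the next n bits and advance the read position. */
uint32_t getbits(BitReader *br, unsigned n)
{
    uint32_t v = showbits(br, n);
    br->bitpos += n;
    return v;
}
```

In software each call costs a loop or several shifts and masks, which is presumably why bitstream parsing is a candidate for dedicated instructions.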

Progress on Schubert
• Design methodology (flow diagram)
  - Application coverage to function coverage
  - SW-HW partition: 10%-90% locality
  - Assembly instruction set specification
  - Design of assembler and simulator
  - Build golden model
  - Benchmark instruction set
  - Behavior function verification
  - Good performance?
  - Micro-architecture design
  - RTL coding
  - Design for test
  - Backend design
  - RTL code verification
  - Test chip fabrication & test board prototype
• Released 316 novel instructions
  - SIMD and RISC
• Developed tool chain
  - Assembler
  - Cycle-accurate simulator
• Mapped kernels
  - H.264/AVC: IT/IIT, intra/inter-prediction, de-blocking, motion estimation
  - MPEG2: DCT, motion compensation
• Micro-architecture is designed
  - Estimated area: 3.5 x 3.5 mm² @ 0.18 um with a 70 KB SRAM

Key features of Schubert
• Dual clusters and dual coupling pipelines
  - SIMD combined with VLIW architecture
• Explicit Data Organization SIMD (EDO-SIMD)
• 2-dimensional and byte-aligned addressing storage
• Cycle-accurate instruction set simulator

Dual clusters and dual coupling pipelines
• Two clusters:
  - Cluster0: computation (+/-, *, &, >/<, etc.)
  - Cluster1: data conversion & LD/ST
  - Based on Decoupled Access & Execution (DAE)
• Two pipelines:
  - Each cluster has its own execution-stage pipeline
  - Both share the IF & ID pipeline stages
• Advantages
  - Computation operations run in parallel with non-computation operations
  - Performs well on cycle count

Explicit Data Organization SIMD ISA
• Bottleneck of conventional SIMD ISA
  - SIMD is inefficient when sub-word data are not aligned with each other
  - SIMD is less flexible than VLIW
• Related works
  - Complex streamed instructions, TU Delft
  - Stream buffer, stream processor, Stanford University
  - Indirect register addressing, Elite project, IBM
[Figure: cycle percentage of conventional SIMD ISA. Annotations: "This overhead is reduced by Dual-Cluster"; "How to reduce this overhead?"]

Explicit Data Organization SIMD ISA
• Proposed EDO-SIMD ISA
  - Explicit data organization information (e.g. 3x8|3:4:7:0:1:2:6:5) indicates operand relations (align, merge, extract, broadcast, cross)
  - Append a permutation network onto the RF pipeline of Cluster0
  - Add a permutation pipeline in Cluster1 in parallel with AD0
• Advantages
  - Merges data organization with computation to reduce overhead
  - As flexible as VLIW
  - Simplified implementation
Example: vOADD vR2<3x8|3:4:7:0:1:2:6:5>, vR1, vR0
[Figure labels: interpolate, DCT, intra predict, IIT]
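The idea of merging data organization with computation can be sketched in scalar C. The slide does not spell out the meaning of the `3x8|3:4:7:0:1:2:6:5` annotation, so the sketch below assumes one possible reading: eight lanes, where index i names the source lane that feeds result lane i. It also assumes that the permutation applies to the first source operand and that lanes are 16 bits wide. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Conventional SIMD style: a separate permute step, out[i] = src[idx[i]]. */
void permute8(const int16_t src[LANES], const uint8_t idx[LANES],
              int16_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(idx[i] < LANES);
        out[i] = src[idx[i]];
    }
}

/* EDO-SIMD style (assumed semantics): the lane selection is part of the
   add itself, so no separate organization instruction is issued. */
void permuted_add8(const int16_t a[LANES], const int16_t b[LANES],
                   const uint8_t idx[LANES], int16_t dst[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(idx[i] < LANES);
        dst[i] = (int16_t)(a[idx[i]] + b[i]);
    }
}
```

With a conventional ISA the caller would run `permute8` and then a plain lane-wise add; the fused form produces the same result in one step, which is the overhead reduction the slide claims.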

2-D stream storage and addressing
• Multimedia temporal data behavior
  - 2-D, block by block
  - Row and column access
  - Byte alignment
  - Flexible block jumping
• Conventional 1-D addressing imposes burdens on computation elements for address generation and address alignment tasks
• Related works
  - Linear addressing with circular buffer, Blackfin
  - Special transpose unit, TriMedia

2-D stream storage and addressing
• Proposed storage and addressing mode
  - 2-D stream storage (base, 2-D stride, 2-D offset)
  - Row and interleaved data arrangement (row access & column access)
  - Base update for block jump (UPDATE B0, OX0, OY0, B0)
  - C-like programming model is friendly to programmers
    asm: vLDOBR B0, 4, 2, vR0;
    C:   for (i = 0; i < 8; i++) r[i] = b[2][4+i];
• Advantages
  - Reduces addressing and aligning overhead (avoids transpose)
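The C equivalent on the slide can be expanded into a small software model of base-plus-2-D-offset addressing. This is a sketch only: the `Stream2D` structure, its field names, and the three functions are hypothetical, the stride is assumed equal to the frame width, and the operand order (x offset 4, y offset 2) is inferred from the slide's `b[2][4+i]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 2-D stream: a frame plus a movable block origin. */
typedef struct {
    const uint8_t *frame;  /* top-left byte of the frame */
    size_t width, height;  /* frame size; row stride is assumed == width */
    size_t bx, by;         /* current block origin (the "base") */
} Stream2D;

/* Row access: r[i] = b[oy][ox + i], relative to the current base. */
void load_row8(const Stream2D *s, size_t ox, size_t oy, uint8_t r[8])
{
    size_t x = s->bx + ox, y = s->by + oy;
    assert(x + 8 <= s->width && y < s->height);
    for (size_t i = 0; i < 8; i++)
        r[i] = s->frame[y * s->width + x + i];
}

/* Column access: r[i] = b[oy + i][ox], with no transpose step. */
void load_col8(const Stream2D *s, size_t ox, size_t oy, uint8_t r[8])
{
    size_t x = s->bx + ox, y = s->by + oy;
    assert(x < s->width && y + 8 <= s->height);
    for (size_t i = 0; i < 8; i++)
        r[i] = s->frame[(y + i) * s->width + x];
}

/* Block jump: move the base by (dx, dy), loosely mirroring UPDATE. */
void update_base(Stream2D *s, size_t dx, size_t dy)
{
    s->bx += dx;
    s->by += dy;
}
```

Calling `load_row8(&s, 4, 2, r)` corresponds to the slide's `vLDOBR B0, 4, 2, vR0` example under these assumptions. In hardware the multiply-and-add address arithmetic would be done by the addressing unit rather than by the computation elements.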

Cycle-accurate instruction set simulator
• Useful for benchmarking and ISA design space exploration during the early stage
  - Input is an assembly text program, not binary code
  - Focuses on function, not micro-architecture
• Consists of
  - Resource modeling
  - ISA function modeling at each pipeline
  - Behavior and timing modeling
  - Debug and profiling support
• Effort: 3 people for 2 months, about 60,000 lines of C++ code

Benchmarking and performance
• Mapped benchmarks:
  - Full H.264 baseline decoder kernels like integer transform, intra predict, interpolation and de-blocking
  - H.264 fast motion estimation
  - MPEG2 motion compensation and DCT/IDCT
• The cycle-accurate and functionally correct programs help:
  - Make the assembler and simulator more robust
  - Demonstrate the performance of the ISA
  - Explore and refine the ISA (more than 900 instructions are refined to 316 in the end)
• Performance
  - 4CIF (704x576) H.264 baseline real-time decoder @ 200 MHz
  - 16 kB code size for H.264 baseline decoder
[Figure: cycles for 8x8 IDCT with IEEE compliant precision; y-axis 0 to 600 cycles; bars: RISC-Media[10], MMX, TMS320C6x, NEC V830, VIRAM, Proposed]
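To put the real-time 4CIF claim in perspective, the average cycle budget per macroblock can be estimated with simple arithmetic. The slide does not state the decoder frame rate, so the frame rate is a free parameter here and the function names are hypothetical; the sketch also assumes frame dimensions that are multiples of 16.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 16x16 macroblocks in a frame. */
uint32_t macroblocks_per_frame(uint32_t width, uint32_t height)
{
    assert(width % 16 == 0 && height % 16 == 0);
    return (width / 16) * (height / 16);
}

/* Average cycles available per macroblock, rounded down.
   fps is an assumed input; the slides only say "real-time". */
uint64_t cycles_per_macroblock(uint64_t clock_hz, uint32_t width,
                               uint32_t height, uint32_t fps)
{
    uint64_t mb_per_sec = (uint64_t)macroblocks_per_frame(width, height) * fps;
    assert(mb_per_sec > 0);
    return clock_hz / mb_per_sec;
}
```

A 704x576 frame holds 1584 macroblocks, so a 200 MHz clock leaves about 5050 cycles per macroblock at an assumed 25 frames/s, or about 4208 at an assumed 30 frames/s, for all decoding stages combined.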

Conclusions
• Integration of a general MCU with heterogeneous ASIPs in a SoC platform is a good choice for media processing in China
  - A good trade-off between performance and flexibility
  - Overcomes our IC design level constraint (200 MHz @ 0.18 um)
• Progress on our media processor
  - CK510 and Spock are finished
  - A dual-core SoC of CK510 and Spock is taped out
  - Novel features of Schubert are verified and the RTL implementation is ongoing

Problems
• The behavior synthesis stage in our ASIP design depends on human experience, not tools, which takes too much effort.
• It is very valuable to research and develop CAD tools for design space exploration of ASIP ISA and ASIP SoC communication during the early stage.
[Figure: the design methodology flow from "Progress on Schubert" (application coverage to function coverage, through test chip fabrication & test board prototype), annotated "Behavior Synthesis tool"]

Thank you!!!
