
Progress on media processor design

Progress on media processor design. Presented by Chunyue Liu (liucy@vlsi.zju.edu.cn). Xiaolang Yan (yan@vlsi.zju.edu.cn), Xing Qin (qinx@vlsi.zju.edu.cn)


Presentation Transcript


  1. Progress on media processor design. Presented by Chunyue Liu (liucy@vlsi.zju.edu.cn). Xiaolang Yan (yan@vlsi.zju.edu.cn), Xing Qin (qinx@vlsi.zju.edu.cn), Jian Yang (yangj@vlsi.zju.edu.cn), Xiaohua Luo (luoxh@vlsi.zju.edu.cn), Peiyong Zhang (zhangpy@vlsi.zju.edu.cn), Dake Liu (dake@isy.liu.se). Embedded DSP Research & Develop Group

  2. Outline • Overview of media processor • Progress on Spock • Progress on Schubert - Overview - Key features - Performance • Conclusions & Problems

  3. Background and Challenges • Media applications have very high computational complexity - H.264 encoding of 720 x 576 pixels @ 30 frames/s requires up to 30 GOPS • Media processors are in demand - Some state-of-the-art media processors (e.g. Nomadik, DaVinci) • Multiple standards coexist - A solution must be flexible & programmable • Our current IC design level constraint (200 MHz @ 0.18 um) • ASIP is the best choice • Our proposal at IC-DFN’05
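
The figures on this slide can be turned into a rough per-cycle requirement. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above (720 x 576 @ 30 frames/s, 30 GOPS, 200 MHz); the variable names are illustrative and not taken from the presentation.

```python
# Numbers quoted on the slide.
WIDTH, HEIGHT, FPS = 720, 576, 30
TARGET_OPS_PER_SECOND = 30e9   # "up to 30 GOPS" for H.264 encoding
CLOCK_HZ = 200e6               # 200 MHz design constraint at 0.18 um

# Pixels that must be processed every second.
pixels_per_second = WIDTH * HEIGHT * FPS

# Average number of operations spent on each pixel.
ops_per_pixel = TARGET_OPS_PER_SECOND / pixels_per_second

# Operations that must complete in every clock cycle at 200 MHz.
ops_per_cycle = TARGET_OPS_PER_SECOND / CLOCK_HZ

print(f"{pixels_per_second} pixels/s")
print(f"{ops_per_pixel:.0f} operations per pixel")
print(f"{ops_per_cycle:.0f} operations per cycle")
```

A single-issue scalar core retires roughly one operation per cycle, so a requirement of this size at a fixed 200 MHz clock has to be met through parallelism and application-specific instructions rather than clock speed.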

  4. Overview of media processor • Programmable and heterogeneous processors on a SoC platform - General MCU (CK510, a 32-bit RISC core): interface (GUI), OS (Linux) - Enhanced DSP (Spock): audio processing, bitstream parsing, data transfer - Vector processor (Schubert): video processing

  5. Outline • Overview of media processor • Progress on Spock • Progress on Schubert - Overview - Key features - Performance • Conclusions & Problems

  6. Progress on Spock • Developed tool chain - Assembler, simulator and debugger • FPGA prototype: real-time decoding - 128 kb/s OGG @ 40 MHz • To test Spock, a dual-core SoC platform was developed - Integrated with CK510 - Inter-processor communication uses a mailbox and shared memory - 0.18 um, less than 500 mW, 166 MHz - CK510 core area: 2 x 2 mm² - Spock core area: 1.5 x 1.5 mm²

  7. Overview of Spock • Optimization for control - Branch optimization: conditional execution - 2-level hardware loop, repeat • Optimization for signal processing - Multiple addressing modes: post-increment/decrement (++/--), reverse/modulo addressing - MAC with parallel load - VLX instruction set extension: putbits, showbits, getbits, etc.
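
The slide names putbits, showbits and getbits but does not define them. The sketch below assumes the conventional bitstream-parsing meaning: showbits peeks at the next n bits without consuming them, getbits consumes them, and putbits appends n bits, all MSB-first. The classes `BitReader` and `BitWriter` are hypothetical software models, not Spock's actual instruction semantics.

```python
class BitReader:
    """Software model of MSB-first bit extraction (assumed semantics)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, counted in bits

    def showbits(self, n: int) -> int:
        """Return the next n bits without advancing; reads past the end give 0."""
        value = 0
        for i in range(n):
            bit_index = self.pos + i
            byte_index = bit_index >> 3
            if byte_index < len(self.data):
                bit = (self.data[byte_index] >> (7 - (bit_index & 7))) & 1
            else:
                bit = 0
            value = (value << 1) | bit
        return value

    def getbits(self, n: int) -> int:
        """Return the next n bits and advance the read position."""
        value = self.showbits(n)
        self.pos += n
        return value


class BitWriter:
    """Software model of MSB-first bit insertion (assumed semantics)."""

    def __init__(self):
        self.bits = []

    def putbits(self, value: int, n: int) -> None:
        """Append the low n bits of value, most significant bit first."""
        for shift in range(n - 1, -1, -1):
            self.bits.append((value >> shift) & 1)

    def to_bytes(self) -> bytes:
        """Pack the bits into bytes, zero-padding the last byte."""
        padded = self.bits + [0] * (-len(self.bits) % 8)
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        return bytes(out)


writer = BitWriter()
writer.putbits(0b101, 3)
writer.putbits(0x1F, 5)
writer.putbits(0xA, 4)
stream = writer.to_bytes()
reader = BitReader(stream)
```

In software each of these calls costs a loop or several shift-and-mask steps per field, which is the overhead a dedicated bitstream instruction is meant to remove.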

  8. Outline • Overview of media processor • Progress on Spock • Progress on Schubert - Overview - Key features - Performance • Conclusions & Problems

  9. Progress on Schubert • Design methodology • Released 316 novel instructions - SIMD and RISC • Developed tool chain - Assembler - Cycle-accurate simulator • Mapped kernels - H.264/AVC: IT/IIT, intra/inter prediction, de-blocking, motion estimation - MPEG2: DCT, motion compensation • Micro-architecture is designed - Estimated area: 3.5 x 3.5 mm² @ 0.18 um with a 70 KB SRAM [Figure: design methodology flow. Steps as labeled: Application coverage to function coverage; SW-HW partition: 10%-90% locality; Assembly instruction set specification; Design of assembler and simulator; Build golden model; Benchmark instruction set; Behavior function verification; Good performance?; Micro-architecture design; RTL coding; Design for test; Backend design; RTL code verification; Test chip fabrication & test board prototype]

  10. Key features of Schubert • Dual clusters and dual coupling pipelines - SIMD combined with VLIW architecture • Explicit Data Organization SIMD (EDO-SIMD) • 2-dimensional, byte-aligned storage and addressing • Cycle-accurate instruction set simulator

  11. Dual clusters and dual coupling pipelines • Two clusters: - Cluster0: computation (+/-, *, &, >/<, etc.) - Cluster1: data conversion & LD/ST - Based on Decoupled Access & Execution (DAE) • Two pipelines: - Each cluster has its own execution-stage pipeline - The IF & ID pipeline stages are shared • Advantages - Computation operations run in parallel with non-computation operations - Lower cycle count
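
The cycle-count advantage claimed above can be illustrated with a toy issue model. This is a sketch under strong assumptions that are not from the presentation: operations issue in order, one Cluster0 operation and one Cluster1 operation may issue in the same cycle when they are adjacent, and data dependencies, latencies and decoupling queues are ignored.

```python
COMPUTE = "cluster0"  # arithmetic and logic operations
ACCESS = "cluster1"   # data conversion and load/store operations


def single_issue_cycles(program):
    """One operation per cycle, regardless of which cluster executes it."""
    return len(program)


def dual_issue_cycles(program):
    """Issue two adjacent operations together when they target different clusters."""
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and program[i] != program[i + 1]:
            i += 2  # one operation goes to each cluster in this cycle
        else:
            i += 1  # the neighbor needs the same cluster, so issue alone
        cycles += 1
    return cycles


# Hypothetical kernel: load, add, load, add, store.
kernel = [ACCESS, COMPUTE, ACCESS, COMPUTE, ACCESS]
print(single_issue_cycles(kernel), dual_issue_cycles(kernel))
```

The model only shows the upper bound on the saving: a stream made entirely of computation operations gains nothing, while a balanced mix approaches half the cycle count.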

  12. Dual clusters and dual coupling pipelines

  13. Explicit Data Organization SIMD ISA • Bottleneck of conventional SIMD ISA - SIMD is inefficient if sub-word data are not aligned with each other - SIMD is less flexible than VLIW • Related works - Complex streamed instruction, TU Delft - Stream buffer, Stream processor, Stanford University - Indirect register addressing, Elite project, IBM [Figure: cycle percentage of conventional SIMD ISA, annotated "This overhead is reduced by Dual-Cluster" and "How to reduce this overhead?"]

  14. Explicit Data Organization SIMD ISA • Proposed EDO-SIMD ISA - Explicit data organization information (e.g. 3x8|3:4:7:0:1:2:6:5) - Indicates operand relations (align, merge, extract, broadcast, cross) - Append a permutation network onto the RF pipeline of Cluster0 - Add a permutation pipeline in Cluster1 in parallel with AD0 • Advantages - Merge organization with computation to reduce overhead - As flexible as VLIW - Simplified implementation • Example: vOADD vR2<3x8|3:4:7:0:1:2:6:5>, vR1, vR0 [Figure labels: interpolate, DCT, intra predict, IIT]
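
The slide does not spell out how the organization field is encoded. The sketch below assumes that "3x8" means eight lane indices of 3 bits each and that every index selects the source lane feeding that position; it also applies the permutation to one source operand of an add, which is a guess about vOADD rather than its documented behavior. The function names are hypothetical.

```python
def parse_edo_pattern(text):
    """Parse a pattern such as '3x8|3:4:7:0:1:2:6:5' into a list of lane indices.

    Assumed reading: '<index bits>x<lane count>|<index>:<index>:...'.
    """
    shape, lanes = text.split("|")
    index_bits, lane_count = (int(part) for part in shape.split("x"))
    indices = [int(token) for token in lanes.split(":")]
    if len(indices) != lane_count:
        raise ValueError("lane count does not match the number of indices")
    if any(not 0 <= index < (1 << index_bits) for index in indices):
        raise ValueError("lane index does not fit in the declared bit width")
    return indices


def permute(vector, indices):
    """Output lane k takes its value from input lane indices[k]."""
    return [vector[index] for index in indices]


def permuted_add(a, b, indices):
    """Reorganize operand a, then add lane by lane, as one fused step."""
    return [x + y for x, y in zip(permute(a, indices), b)]


pattern = parse_edo_pattern("3x8|3:4:7:0:1:2:6:5")
result = permuted_add([10, 11, 12, 13, 14, 15, 16, 17], [1] * 8, pattern)
```

On a conventional SIMD ISA the `permute` step would be separate shuffle instructions issued before the add; folding it into the arithmetic instruction is the overhead reduction the slide describes.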

  15. 2-D stream storage and addressing • Temporal behavior of multimedia data - 2-D block by block - Row and column access - Byte alignment - Flexible block jumping • Conventional 1-D addressing burdens the computation elements with address generation and address alignment tasks • Related works - Linear addressing with circular buffer, Blackfin - Special transpose unit, TriMedia

  16. 2-D stream storage and addressing • Proposed storage and addressing mode - 2-D stream storage (base, 2-D stride, 2-D offset) - Row and interleaved data arrangement (row access & column access) - Base update for block jump (UPDATE B0, OX0, OY0, B0) - Programmer-friendly C-like programming model - asm: vLDOBR B0, 4, 2, vR0; C: for (i = 0; i < 8; i++) r[i] = b[2][4+i]; • Advantages - Reduces addressing and aligning overhead (avoids transpose)
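
The C equivalence on this slide (`r[i] = b[2][4+i]`) suggests a simple address formula for the row load. The sketch below models it over a flat memory; the column load and the base update are assumptions by analogy, since the slide gives no mnemonic or formula for them, and the function names and the 16-byte stride are illustrative.

```python
def load_row(memory, base, stride, ox, oy, lanes=8):
    """Model of the row load: r[i] = b[oy][ox + i], with b starting at base."""
    start = base + oy * stride + ox
    return [memory[start + i] for i in range(lanes)]


def load_column(memory, base, stride, ox, oy, lanes=8):
    """Assumed column counterpart: r[i] = b[oy + i][ox]."""
    start = base + oy * stride + ox
    return [memory[start + i * stride] for i in range(lanes)]


def update_base(base, stride, ox, oy):
    """Assumed block jump: move the base by a 2-D offset."""
    return base + oy * stride + ox


STRIDE = 16                            # hypothetical row pitch in bytes
block = list(range(STRIDE * STRIDE))   # b[y][x] holds the value y * 16 + x

row = load_row(block, base=0, stride=STRIDE, ox=4, oy=2)
column = load_column(block, base=0, stride=STRIDE, ox=4, oy=2)
next_base = update_base(0, STRIDE, ox=8, oy=8)
```

With the address arithmetic handled by the load itself, a column of pixels arrives in a vector register in the same way a row does, which is how the scheme avoids an explicit transpose.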

  17. Cycle-accurate instruction set simulator • Useful for benchmarking and ISA design space exploration at an early stage - Input is assembly text, not binary code - Focuses on function, not micro-architecture • Consists of - Resource modeling - ISA function modeling at each pipeline - Behavior and timing modeling - Debug and profiling support • Effort: 3 people for 2 months, about 60,000 lines of C++ code

  18. Benchmarking and performance • Mapped benchmarks: - Full H.264 baseline decoder kernels such as integer transform, intra prediction, interpolation and de-blocking - H.264 fast motion estimation - MPEG2 motion compensation and DCT/IDCT • The cycle-accurate, functionally correct programs help to: - Make the assembler and simulator more robust - Demonstrate the performance of the ISA - Explore and refine the ISA (more than 900 instructions were refined down to 316) • Performance - 4CIF (704x576) H.264 baseline real-time decoder @ 200 MHz - 16 kB code size for the H.264 baseline decoder [Figure: cycles for 8x8 IDCT with IEEE-compliant precision, y-axis 0 to 600; bar labels as extracted: RISC-, MMX, TMS320C6x, NEC V830, VIRAM, Proposed, Media[10]]
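
The real-time claim implies a fixed cycle budget per macroblock. The slide does not state the frame rate, so the sketch below treats it as a parameter and shows 25 and 30 frames/s as assumed examples; the 16 x 16 macroblock size is the H.264 standard value, and the function names are illustrative.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of 16x16 macroblocks in one frame."""
    return (width // mb_size) * (height // mb_size)


def cycles_per_macroblock(clock_hz, width, height, fps):
    """Whole clock cycles available to decode one macroblock in real time."""
    macroblocks_per_second = macroblocks_per_frame(width, height) * fps
    return clock_hz // macroblocks_per_second


CLOCK_HZ = 200_000_000  # 200 MHz, from the slide
for fps in (25, 30):    # assumed frame rates, not given in the source
    budget = cycles_per_macroblock(CLOCK_HZ, 704, 576, fps)
    print(f"{fps} frames/s: {budget} cycles per macroblock")
```

Everything a decoder does for one macroblock (entropy decoding, inverse transform, prediction, interpolation and de-blocking) has to fit inside this budget on average.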

  19. Outline • Overview of media processor • Progress on Spock • Progress on Schubert - Overview - Key features - Performance • Conclusions & Problems

  20. Conclusions • Integrating a general MCU with heterogeneous ASIPs on a SoC platform is a good choice for media processing in China - A good trade-off between performance and flexibility - Overcomes our IC design level constraint (200 MHz @ 0.18 um) • Progress on our media processor - CK510 and Spock are finished - A dual-core SoC of CK510 and Spock has been taped out - Novel features of Schubert are verified and the RTL implementation is ongoing

  21. Problems • The behavior synthesis stage in our ASIP design depends on human experience rather than tools, which takes too much effort. • It is very valuable to research and develop CAD tools for design space exploration of the ASIP ISA and ASIP SoC communication during the early stage. [Figure: design flow with the stages needing a behavior synthesis tool marked. Steps as labeled: Application coverage to function coverage; SW-HW partition: 10%-90% locality; Assembly instruction set specification; Design of assembler and simulator; Build golden model; Benchmark instruction set; Behavior function verification; Good performance?; Micro-architecture design; RTL coding; Design for test; Backend design; RTL code verification; Test chip fabrication & test board prototype]

  22. Thank you!!!
