
Multi-core SOC for Future Media Processing

Multi-core SOC for Future Media Processing. Qin Xing, Yan Xiaolang The Institute of VLSI Design, Zhejiang University. Outline. Opportunities & challenges from media processing Multimedia algorithm characteristics & mapping Multi-core SOC architecture & technology


Presentation Transcript


  1. Multi-core SOC for Future Media Processing Qin Xing, Yan Xiaolang The Institute of VLSI Design, Zhejiang University

  2. Outline • Opportunities & challenges from media processing • Multimedia algorithm characteristics & mapping • Multi-core SOC architecture & technology • Benchmarking results • Project status • Future work The Institute of VLSI Design, Zhejiang Univ.

  3. Opportunities • Video conference • IP-phone • Smart terminal • PDA • Video camera • HDTV • Set-top box • …

  4. Challenges: multiple standards [Figure: bit rate (Mbit/s, axis 0 to 6) versus year (1994 to 2005) across five encoder generations, from the 1st MPEG-2 encoder to the 5th generation encoder. Standards shown: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 Part 10, WMV, VP3, AVS.]

  5. Challenges: demanding hardware requirements • Very high computational complexity • H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS • Multiple standards coexist • Demand for flexibility & programmability • Low power • Low cost • Best choice: application-specific instruction-set processor (ASIP)
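To put the 30 GOPS figure in per-pixel terms, the following is a minimal sketch of the arithmetic. The function name `ops_per_pixel` is hypothetical, and the calculation assumes the budget is spread evenly over every pixel of every frame, which the slide does not state.

```c
#include <assert.h>

/* Operations available per pixel for a given throughput budget.
   Assumption: the budget is divided evenly over all pixels of all frames. */
static double ops_per_pixel(double ops_per_second, int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    double pixels_per_second = (double)width * (double)height * (double)fps;
    return ops_per_second / pixels_per_second;
}
```

Calling `ops_per_pixel(30e9, 720, 576, 30)` gives roughly 2,400 operations per pixel, since 720 x 576 x 30 is about 12.4 million pixels per second.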

  6. Multimedia algorithm characteristics • Outer loop and inner loop • Outer loop: interface (GUI), OS (Linux), bit-stream parsing (pack/unpack, VLC, CABAC), data transfer • Inner loop: regular algorithms (prediction, FIR, DCT, motion estimation)

  7. Multimedia algorithm mapping • Programmable, heterogeneous processors are the preferred choice for implementation • General MCU (RISC core): outer loop • Enhanced DSP (EDSP, with added bit-wise operations): outer loop • Vector processor (VP, VLIW + SIMD): inner loop

  8. Multi-core SOC architecture • Top level [Figure: top-level block diagram containing the media processing kernel]

  9. Inside the media processing kernel [Figure: block diagram. Labeled blocks: GAG1 to GAG4, GDM, GTM, V-DM1 to V-DM4, EDSP control path, vector control path, DMA and off-chip memories, 2D crossbar connection network, E-DP, V-DP1 to V-DP4]

  10. Technologies: specialized instruction set • Integer IDCT in H.264 • Our instruction set: 6 cycles • Intel MMX: 13 cycles

      Reference C code:

      for (j = 0; j < BLOCK_SIZE; j++) {
          for (i = 0; i < BLOCK_SIZE; i++) {
              m5[i] = img->cof[i0][j0][i][j];
          }
          m6[0] = (m5[0] + m5[2]);
          m6[1] = (m5[0] - m5[2]);
          m6[2] = (m5[1] >> 1) - m5[3];
          m6[3] = m5[1] + (m5[3] >> 1);
      }

      Intel SIMD version (excerpts; the loads into xmm2 and xmm3 are not shown on the slide):

      __asm {
          mov      edx, mptr
          movdqu   xmm1, [edx]
          packssdw xmm1, xmm1      // read m5[0] from memory to xmm1
      }
      __asm {
          movdqu   xmm4, [edx + 48]
          packssdw xmm4, xmm4      // read m5[3] from memory
      }
      __asm {
          movq     xmm5, xmm1
          psubw    xmm1, xmm3      // m6[1] = (m5[0] - m5[2]);
          paddw    xmm3, xmm5      // m6[0] = (m5[0] + m5[2]);
          movq     xmm5, xmm2
          psraw    xmm2, 1
          psubw    xmm2, xmm4      // m6[2] = (m5[1] >> 1) - m5[3]
          psraw    xmm4, 1
          paddw    xmm4, xmm5      // m6[3] = m5[1] + (m5[3] >> 1)
      }
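The butterfly on this slide can be exercised as a standalone function. The following is a minimal sketch, not the authors' implementation: the function name `idct4_1d` is hypothetical, the first stage copies the `m5`/`m6` arithmetic from the slide, and the second (recombination) stage is not on the slide and is assumed from the standard H.264 4x4 inverse transform.

```c
#include <assert.h>

#define BLOCK_SIZE 4

/* One 1-D pass of the H.264 4x4 inverse integer transform.
   Note: >> on negative ints is implementation-defined in C; most
   compilers perform an arithmetic shift, which the transform expects. */
static void idct4_1d(const int m5[BLOCK_SIZE], int out[BLOCK_SIZE])
{
    int m6[BLOCK_SIZE];
    assert(m5 != 0 && out != 0);

    /* Stage 1: exactly the arithmetic shown on the slide. */
    m6[0] = m5[0] + m5[2];
    m6[1] = m5[0] - m5[2];
    m6[2] = (m5[1] >> 1) - m5[3];
    m6[3] = m5[1] + (m5[3] >> 1);

    /* Stage 2: recombination, assumed from the H.264 standard
       (not shown on the slide). */
    out[0] = m6[0] + m6[3];
    out[1] = m6[1] + m6[2];
    out[2] = m6[1] - m6[2];
    out[3] = m6[0] - m6[3];
}
```

For the input `{10, 4, 2, 6}`, stage 1 yields `{12, 8, -4, 7}` and the function writes `{19, 4, 12, 5}`. A full 4x4 block would apply this pass to each row and then each column, followed by rounding.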

  11. Technologies: instruction merging • 6-tap sub-pixel interpolation • Instruction mix before merging: load/store 30%, permutation 25%, computation 35%, control 10% • After merging: merged load/store and permutation, computation, control • Execution time is reduced by half

      result = 0;
      pres_y = dy == 1 ? y_pos : y_pos + 1;
      pres_y = max(0, min(maxold_y, pres_y));            // load
      for (x = -2; x < 4; x++) {                         // control
          pres_x = max(0, min(maxold_x, x_pos + x));     // load
          result += imY[pres_y][pres_x] * COEF[x + 2];   // computation, permutation and load
      }
      result1 = max(0, min(255, (result + 16) / 32));    // computation
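The interpolation loop above can be made self-contained as follows. This is a sketch under stated assumptions: `COEF` is taken to be the H.264 half-pel filter (1, -5, 20, 20, -5, 1), which the slide names but does not list; the image is a row-major byte array; and `interp6_row` and `clamp_int` are hypothetical names. The `dy`-dependent row selection from the slide is left to the caller.

```c
#include <assert.h>

/* H.264 6-tap half-pel filter taps (assumed; the slide only names COEF). */
static const int COEF[6] = { 1, -5, 20, 20, -5, 1 };

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal half-pel sample between columns x_pos and x_pos + 1 on row y.
   img is a row-major array of width * height bytes (hypothetical layout). */
static int interp6_row(const unsigned char *img, int width, int height,
                       int x_pos, int y)
{
    assert(img != 0 && width > 0 && height > 0);
    int pres_y = clamp_int(y, 0, height - 1);            /* clamp row to the picture */
    int result = 0;
    for (int x = -2; x < 4; x++) {
        int pres_x = clamp_int(x_pos + x, 0, width - 1); /* clamp column */
        result += img[pres_y * width + pres_x] * COEF[x + 2];
    }
    return clamp_int((result + 16) / 32, 0, 255);        /* round, normalize, saturate */
}
```

On a one-row ramp `{10, 20, 30, 40, 50, 60}` with `x_pos = 2`, the weighted sum is 1120 and the function returns 35, the midpoint between 30 and 40. Each output sample needs six clamped loads and six multiply-accumulates, which is the load/permutation overhead the slide targets with merged instructions.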

  12. Benchmarking results for CPU core • CK520

  13. Simulation results for DSP performance • Enhanced DSP • CAVLC (context-adaptive variable-length coding) • OGG (new audio standard)

  14. Simulation results for DSP performance • Vector processor • H.264 baseline decoder

  15. Project status • Finished 2 versions of the CPU core • Released the DSP instruction set • Writing and verifying RTL of the enhanced DSP • Benchmarking the vector processor • Developing software tools

  16. Future work • Scheduling for task-level parallelism (TLP) between heterogeneous processors • Simulation/debugging tools for heterogeneous processors • Methodologies for design space exploration

  17. Thank you!
