Multi-core SOC for Future Media Processing
Qin Xing, Yan Xiaolang
The Institute of VLSI Design, Zhejiang University
Outline
• Opportunities & challenges from media processing
• Multimedia algorithm characteristics & mapping
• Multi-core SOC architecture & technology
• Benchmarking results
• Project status
• Future work
Opportunities
• Video conference
• IP-phone
• Smart terminal
• PDA
• Video camera
• HDTV
• Set-top box
• …
Challenges: multiple standards
[Figure: bit rate (Mbit/s, axis 0 to 6) versus year (1994 to 2005) for five encoder generations, from the 1st MPEG-2 encoder to the 5th generation encoder. Standards shown: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 Part 10, WMV, VP3, AVS]
Challenges: demanding hardware requirements
• Very high computational complexity
• H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
• Multiple standards co-exist
• Demand for flexibility & programmability
• Low power
• Low cost
Best choice: Application-Specific Instruction-set Processor (ASIP)
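To put the 30 GOPS figure in perspective, the per-pixel operation budget can be worked out directly from the numbers on the slide. The sketch below is only illustrative arithmetic; the function names `pixels_per_second` and `ops_per_pixel` are hypothetical and not part of the project.

```c
#include <stdint.h>

/* Pixels that must be processed each second for a given frame size and rate. */
uint64_t pixels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * fps;
}

/* Operations available per pixel (rounded down) at a given total
 * throughput in operations per second. */
uint64_t ops_per_pixel(uint64_t ops_per_second,
                       uint32_t width, uint32_t height, uint32_t fps)
{
    return ops_per_second / pixels_per_second(width, height, fps);
}
```

For 720 x 576 @ 30 frames/s this gives 12,441,600 pixels/s, so a 30 GOPS budget corresponds to roughly 2,400 operations per pixel.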
Multimedia algorithm characteristics
• Outer loop and inner loop
• Outer loop:
  Interface (GUI)
  OS (Linux)
  Bit-stream parsing (pack/unpack, VLC, CABAC)
  Data transfer
• Inner loop:
  Regular algorithms (prediction, FIR, DCT, motion estimation)
Multimedia algorithm mapping
• Programmable, heterogeneous processors are the preferred choice for implementation
• General MCU (RISC core): outer loop
• Enhanced DSP (EDSP, adds bit-wise operations): outer loop
• Vector processor (VP, VLIW + SIMD): inner loop
Multi-core SOC architecture
• Top level
[Figure: top-level block diagram; labeled block: media processing kernel]
Inside the media processing kernel
[Figure: block diagram of the media processing kernel. Labeled blocks: GAG1 to GAG4, GDM, GTM, V-DM1 to V-DM4, EDSP control path, vector control path, DMA and off-chip memories, 2D crossbar connection network, E-DP, V-DP1 to V-DP4]
Technologies: specialized instruction set
Integer IDCT in H.264

C reference:
    for (j=0;j<BLOCK_SIZE;j++){
        for (i=0;i<BLOCK_SIZE;i++){
            m5[i]=img->cof[i0][j0][i][j];
        }
        m6[0]=(m5[0]+m5[2]);
        m6[1]=(m5[0]-m5[2]);
        m6[2]=(m5[1]>>1)-m5[3];
        m6[3]=m5[1]+(m5[3]>>1);
    }

Intel MMX: 13 cycles
    __asm{
        mov edx, mptr
        movdqu xmm1, [edx]
        packssdw xmm1,xmm1   // read m5[0] from memory to xmm1
    }
    __asm{
        movdqu xmm4, [edx +48]
        packssdw xmm4,xmm4   // read m5[3] from memory
    }
    __asm{
        movq xmm5,xmm1
        psubw xmm1,xmm3      // m6[1]=(m5[0]-m5[2]);
        paddw xmm3,xmm5      // m6[0]=(m5[0]+m5[2]);
        movq xmm5, xmm2
        psraw xmm2,1
        psubw xmm2,xmm4      // m6[2]=(m5[1]>>1)-m5[3]
        psraw xmm4,1
        paddw xmm4,xmm5      // m6[3]=m5[1]+(m5[3]>>1)
    }

Our IS: 6 cycles
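The butterfly on this slide can be tried out in isolation. The sketch below is a plain scalar C version, not the project's instruction set: `idct4_butterfly` reproduces the four m6 equations from the slide, and `idct4_1d` adds the recombination step of the H.264 4x4 inverse transform, which the slide does not show. Both function names are hypothetical, and the code assumes `>>` on negative `int` values is an arithmetic shift (implementation-defined in C, but true on common compilers).

```c
/* Butterfly stage exactly as written on the slide.
 * m5: four input coefficients of one row/column; m6: intermediate values. */
void idct4_butterfly(const int m5[4], int m6[4])
{
    m6[0] = m5[0] + m5[2];
    m6[1] = m5[0] - m5[2];
    m6[2] = (m5[1] >> 1) - m5[3];   /* assumes arithmetic right shift */
    m6[3] = m5[1] + (m5[3] >> 1);
}

/* One full 1-D pass: the butterfly followed by the recombination step
 * of the H.264 inverse transform (not shown on the slide). */
void idct4_1d(const int m5[4], int out[4])
{
    int m6[4];
    idct4_butterfly(m5, m6);
    out[0] = m6[0] + m6[3];
    out[1] = m6[1] + m6[2];
    out[2] = m6[1] - m6[2];
    out[3] = m6[0] - m6[3];
}
```

Applying `idct4_1d` first along one dimension of a 4x4 coefficient block and then along the other gives the 2-D transform before final rounding and scaling. Every operation here is an add, subtract, or shift, which is why it maps well onto SIMD-style instructions.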
Technologies: instruction merging
6-tap sub-pixel interpolation

    result = 0;
    pres_y = dy == 1 ? y_pos : y_pos+1;
    pres_y = max(0,min(maxold_y,pres_y));                // load
    for(x=-2;x<4;x++)                                    // control
    {
        pres_x = max(0,min(maxold_x,x_pos+x));           // load
        result += imY[pres_y][pres_x]*COEF[x+2];         // computation, permutation and load
    }
    result1 = max(0, min(255, (result+16)/32));          // computation

[Figure: instruction mix before merging: Load/Store 30%, Permutation 25%, Computation 35%, Control 10%. After merging: Ld/St and permutation merged, computation, control]
• Execution time is reduced by half
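The interpolation loop on this slide can be written as a small self-contained routine for experimentation. The sketch below is a scalar C rendering of one row only, under stated assumptions: the `COEF` values are the standard H.264 6-tap filter taps (1, -5, 20, 20, -5, 1), which the slide does not list, the 2-D `imY[pres_y][pres_x]` access is simplified to a single row pointer, and the names `clamp_int` and `interp_halfpel_row` are hypothetical.

```c
/* Assumed filter taps: the standard H.264 6-tap half-pel filter. */
static const int COEF[6] = {1, -5, 20, 20, -5, 1};

/* Clamp v to the range [lo, hi]; equivalent to max(lo, min(hi, v)). */
int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal half-pel sample between row[x_pos] and row[x_pos + 1].
 * Out-of-range positions are clamped to the row edges, as on the slide. */
int interp_halfpel_row(const unsigned char *row, int width, int x_pos)
{
    int result = 0;
    for (int x = -2; x < 4; x++) {
        int pres_x = clamp_int(x_pos + x, 0, width - 1);  /* address clamp */
        result += row[pres_x] * COEF[x + 2];              /* load + multiply-accumulate */
    }
    return clamp_int((result + 16) / 32, 0, 255);         /* round and saturate */
}
```

Each loop iteration mixes address clamping, a load, and a multiply-accumulate, which is the kind of instruction mix the breakdown on the slide attributes to load/store, permutation, computation, and control.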
Benchmarking results for CPU core
• CK520
Simulation results for DSP performance
• Enhanced DSP
• CAVLC (context-adaptive variable-length coding)
• OGG (new audio standard)
Simulation results for DSP performance
• Vector processor
• H.264 baseline decoder
Project status
• Finished 2 versions of CPU core
• Released DSP instruction set
• Writing and verifying RTL of the enhanced DSP
• Benchmarking vector processor
• Developing software tools
Future work
• Scheduling for task-level parallelism (TLP) between heterogeneous processors
• Simulation/debugging tools for heterogeneous processors
• Methodologies for design space exploration
Thank you!