Introducing the ConnX D2 DSP Engine
Introduced: August 24, 2009
Fastest Growing Processor / DSP IP Company
• Customizable Dataplane Processor/DSP IP licensing
• Leading provider of customizable Dataplane Processor Units (DPUs)
• Unique combination of processor & DSP IP cores plus software design tools
• Customization enables improved power, cost and performance
• Standard DPU solutions for audio, video/imaging & baseband comms
• Dominant patent portfolio for configurable processor technology
Broad-Based Success
• 150+ licensees, including 5 of the top 10 semiconductor companies
• Shipping in high volume today (>200M units/yr rate)
• Fastest-growing semiconductor processor IP company (per Gartner, Jan 2009)
• 21% revenue growth in 2007, 25% in 2008
Focus: Dataplane Processing Units (DPUs)
DPUs: customizable CPU+DSP cores delivering 10x to 100x higher performance than a CPU or DSP, with better flexibility and verification than RTL.
(Diagram: a spectrum from embedded controller CPUs for mainstream applications to the Tensilica focus, dataplane processors.)
Communications DSP Trends / Challenges
Code size increases
• Communications standards growing in number & complexity
• DSP algorithm code heavily integrated with more (and more complex) control code
• Maintenance and flexibility push DSP algorithms towards C code
Development teams shrink
• SoC development schedules tightening
• Tightening resource constraints (do more with less)
Markets changing faster
• Market requirements in flux as the economy wobbles
• Emerging standards evolve faster in the Internet age
Trends Within Licensable DSP Architectures
1st-generation licensable DSP cores
• Modest/medium performance (single/dual MAC)
• Simple architecture (single issue, compound instructions)
• Limited or no compiler support (mostly hand-coded)
2nd-generation licensable DSP cores
• Added RISC-like architecture features (register arrays)
• Improved compiler targets, but still heavily assembly-coded
• Some offer wide VLIW for performance (large area; code bloat)
• Some offer wide SIMD for performance (good area/performance tradeoff, but no performance gain when vectorization fails)
Vectorization Benefits (SIMD)
(Diagram: before vectorization, Data0 through Data7 pass one at a time through a single execution unit; after vectorization, pairs Data0/Data1 ... Data6/Data7 pass through a 2-way SIMD execution unit.)
• Loop counts can be reduced
• Data computation can be done in parallel
• Cheapest method (in hardware cost) to get higher performance
Example: 2-way SIMD performance benefit
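The idea above can be seen in a minimal C sketch (illustrative function name, not Tensilica code): each loop iteration is independent, so a vectorizing compiler can process two elements per step with 2-way SIMD, halving the trip count.

```c
#include <assert.h>

/* Illustrative sketch, not Tensilica code: iterations are independent,
 * so a vectorizing compiler can execute pairs (i, i+1) together on a
 * 2-way SIMD unit, as in the diagram above. */
void vec_add16(const short *a, const short *b, short *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (short)(a[i] + b[i]);
}
```

Scalar code like this runs unchanged; only the compiler's mapping of it to hardware differs.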
VLIW Technology
(Diagram: four instructions execute one at a time on a single execution ALU; with VLIW, instruction pairs execute in parallel on ALU1 and ALU2.)
• Parallel execution of instructions
• Effective use of multiple ALUs/MACs
• Compiler allocates instructions to VLIW slots
• Orthogonal allocation yields more flexibility
Ideal 3rd-Generation Licensable DSP
Ideal characteristics
• VLIW capability for good performance on general code (parallelization of independent operations)
• SIMD capability for good performance on loop code (data-parallel execution)
• Good C compiler target: reduces or eliminates the need for assembly programming; productivity benefit
• Small, compact size: keeps costs down in brutally competitive markets
Tensilica: the Stealth DSP Company
(Diagram: the Xtensa DSP building blocks across markets. Comms: ConnX BBE (16 MAC), ConnX 545CK DSP (8 MAC), ConnX Vectra LX (quad MAC), ConnX D2 (dual MAC). Audio: HiFi 2. Video: 388VDO. Further building blocks: single- and double-precision floating-point acceleration hardware, MAC16 (single MAC), MUL32, DIV32, and Xtensa TIE for custom DSPs and other markets.)
ConnX D2 DSP Engine: Overview
Dual 16-bit MAC architecture with hybrid SIMD/VLIW
• Optimum performance on a wide range of algorithms
• SIMD offers a high data-computation rate for DSP algorithms
• 2-way VLIW allows parallel instruction execution on SIMD and scalar code
"Out of the box" industry-standard software compatibility
• TI C6x fixed-point C intrinsics supported, fully bit-for-bit equivalent with TI C6x
• ITU reference-code fixed-point C intrinsics directly supported
Goals: ease of use, low area/cost
• Click-and-go "out of the box" performance from standard C code
• Standard C and fixed-point data types: 16-bit, 32-bit and 40-bit
• Advanced optimizing, vectorizing compiler
• Less than 70K gates (under 0.2 mm² in 65nm)
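As a concrete illustration of the fixed-point intrinsics mentioned above, here is a hedged reference model of the TI C6x `_sadd` intrinsic (saturating 32-bit add), one of the operations the slide says maps bit-for-bit. The function name `sadd_ref` and the portable C model are illustrative only, not Tensilica or TI code.

```c
#include <stdint.h>
#include <assert.h>

/* Hedged reference model (assumption: standard saturating-add
 * semantics) of the TI C6x _sadd fixed-point intrinsic. */
int32_t sadd_ref(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;          /* widen to avoid overflow */
    if (s > INT32_MAX) return INT32_MAX; /* saturate high */
    if (s < INT32_MIN) return INT32_MIN; /* saturate low  */
    return (int32_t)s;
}
```

On a DSP with native saturating adds, a call like this compiles to a single instruction rather than the branchy portable code above.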
Target Applications: ConnX D2
A general-purpose 16-bit DSP for a wide range of applications:
• Embedded control
• VoIP gateways, voice-over-network equipment (including VoIP codecs)
• Femtocell and picocell base stations
• Next-generation disk drives, data storage
• Mobile terminals and handsets
• Home entertainment devices
• Computer peripherals, printers
ConnX D2 DSP: An Ingredient of an Xtensa DPU
Hardware use model
• Click-button configuration option within the Xtensa LX core
• Part of the Tensilica configurable-core deliverable package
Two reference configurations
• Typical DSP solution for high performance
• Small size for cost- and power-sensitive applications
Full tool support from Tensilica
• High-level simulators (SystemC), ISS and RTL
• Debugger and trace
• Compiler, IDE and operating systems
ConnX D2 Engine Architecture
(Diagram: the load/store unit connects local memory and/or cache over 32-bit paths to the 32-bit AR register bank, four 32-bit alignment registers, and an 8-entry 40-bit XDD register file. The XDD file holds 40-, 32- and 16-bit integer and fixed-point data with overflow and carry state, 16-bit vector lanes (real/imaginary) and hi/lo 16-bit selects.)
Addressing modes
• Immediate / immediate updating
• Indexed / indexed updating
• Aligning updating
• Circular (instruction)
• Bit-reversed (instruction)
DSP-specific instructions
• Add-Bit-Reverse-Base and Add-Subtract: useful for FFT implementation
• Add-Compare-Exchange: useful for Viterbi implementation
• Add-Modulo: circular-buffer implementation; useful for FIR implementation
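To make the Add-Modulo entry concrete, here is a minimal C sketch of the circular-buffer index update it accelerates, e.g. for an FIR delay line. The helper name `add_modulo` is illustrative, not a Tensilica API.

```c
#include <assert.h>

/* Hedged sketch of the index arithmetic behind a circular (modulo)
 * buffer: advance by 'step' and wrap at 'len'. Assumes step < len,
 * the common case for an FIR delay-line pointer; a dedicated
 * Add-Modulo instruction performs this add-and-wrap in one operation. */
static inline int add_modulo(int idx, int step, int len)
{
    idx += step;
    if (idx >= len)
        idx -= len;   /* wrap around the buffer */
    return idx;
}
```

Without such an instruction, every delay-line update costs an add, a compare and a conditional subtract in the inner loop.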
ConnX D2: Instruction Allocation Options
• 16-bit instructions: base ISA
• 24-bit instructions: base ISA or ConnX D2
• 64-bit VLIW instructions: slot 0 and slot 1 each take ConnX D2 or base-ISA instructions (register moves & C operations on register data)
Benefits
• Flexible allocation of instructions available to the compiler
• Optimum use of VLIW slots (ConnX D2 or base-ISA instructions)
• Improved performance and no code bloat (reduced NOPs)
• Reduced code size when the algorithm is less performance-intensive
• Modeless switching between instruction formats
ConnX D2: SIMD with VLIW for Extra Performance
Combining SIMD and VLIW can give 6x performance.
Example: energy calculation, A = sum of Xn*Xn for n = 0..127 (a 128-iteration C loop).
Base Xtensa configuration, 416 cycles (scalar control-style execution):

loopgtz a3,.LBB52_energy
l16si a3,a2,2
l16si a5,a2,4
l16si a6,a2,6
l16si a7,a2,8
mul16s a3,a3,a3
mul16s a5,a5,a5
mul16s a6,a6,a6
mul16s a7,a7,a7
addi.n a2,a2,8
add.n a3,a4,a3
add.n a3,a3,a5
add.n a3,a3,a6
add.n a4,a3,a7

ConnX D2, 64 cycles, one 64-bit VLIW instruction per loop iteration:

loop { # format XD2_FLIX_FORMAT
  xd2_la.d16x2s.iu xdd0,xdu0,a4,4; xd2_mulaa40.d16s.ll.hh xdd1,xdd0,xdd0
}

• Vectorization and SIMD double the data-computation rate
• VLIW gives two pipeline executions (one of them SIMD) with auto-increment loads
• The ConnX D2 architecture combines both to deliver this performance
When Vectorization Is Not Possible: Performance for Scalar Code Bases

int energy(short *a, int col, int cols, int rows)
{
    int i;
    int sum = 0;
    for (i = 0; i < rows; i++) {
        sum += a[cols*i + col] * a[cols*i + col];
    }
    return sum;
}

• Energy computation of column 'col' in a 2-D array
• The loop above cannot be vectorized: non-contiguous memory accesses thwart vectorizers
• Regular compilers cannot map this code onto traditional SIMD DSPs
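For contrast, a hedged sketch of the row-wise variant (function name assumed): summing along a row touches contiguous memory, so the same compiler can vectorize it with SIMD. It is specifically the strided column access that defeats vectorization.

```c
#include <assert.h>

/* Illustrative counterpart to the column-energy loop: row elements
 * are contiguous in memory (stride 1), so a vectorizing compiler
 * can map this loop onto SIMD loads and MACs. */
int energy_row(const short *a, int row, int cols)
{
    int sum = 0;
    for (int j = 0; j < cols; j++)
        sum += a[row * cols + j] * a[row * cols + j];
    return sum;
}
```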
When Vectorization Is Not Possible (continued)

int energy(short *a, int col, int cols, int rows)
{
    int i;
    int sum = 0;
    for (i = 0; i < rows; i++) {
        sum += a[cols*i + col] * a[cols*i + col];
    }
    return sum;
}

• Confirmed that the ConnX D2 and TI C6x compilers cannot vectorize this code
• The ConnX D2 compiler can, however, use VLIW to increase performance

Generated assembly (ConnX D2: one cycle within the loop):

entry a1,32
blti a5,1,.Lt_0_2306
addx2 a2,a3,a2
slli a3,a4,1
addi.n a4,a5,-1
sub a2,a2,a3
{ # format XD2_FLIX_FORMAT
  xd2_l.d16s.xu xdd0,a2,a3; xd2_movi.d40 xdd1,0
}
loopgtz a4,.LBB43_energy
{ # format XD2_FLIX_FORMAT
  xd2_l.d16s.xu xdd0,a2,a3; xd2_mula32.d16s.ll_s1 xdd1,xdd0,xdd0
}

• xd2_l.d16s.xu: scalar 16-bit load; xdd0 is loaded from the address in a2, and a2 is updated by the value in a3
• xd2_mula32.d16s.ll_s1: MAC on the lower 16 bits; multiplies xdd0 by xdd0 and accumulates the result into xdd1
Optimization with ITU / TI Intrinsics: Performance for Generic Code Bases
Energy calculation loop: 1000 iterations, using the L_mac ITU intrinsic.

#define ASIZE 1000
extern int a[ASIZE];
extern int red;
void energy()
{
    int i;
    int red_0 = red;
    for (i = 0; i < ASIZE; i++) {
        red_0 = L_mac(red_0, a[i], a[i]);
    }
    red = red_0;
}

• L_mac maps to one ConnX D2 instruction
• The compiler further optimizes by using SIMD to accelerate the loop
• VLIW further accelerates it with parallel loads
• The 1000-iteration C loop is optimized to a 500-cycle loop, sustaining 3 operations/cycle

Generated assembly:

entry a1,32
l32r a2,.LC1_40_18
l32r a5,.LC0_40_17
xd2_l.d16x2s.iu xdd0,a2,4
l32i.n a3,a5,0
{ # format XD2_ARUSEDEF_FORMAT
  xd2_mov.d32.a32s xdd1,a3; movi a3,499
}
loopgtz a3,
{ # format XD2_FLIX_FORMAT
  xd2_l.d16x2s.iu xdd0,a2,4; xd2_mulaa.fs32.d16s.ll.hh xdd1,xdd0,xdd0
}
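For readers unfamiliar with the intrinsic, here is a hedged reference model of L_mac following the ITU-T G.191 basic-operator definitions: L_mac(acc, a, b) = L_add(acc, L_mult(a, b)), where L_mult is a saturating Q15 multiply (product doubled) and L_add saturates to 32 bits. The names with `_ref` suffixes are illustrative, not the ConnX D2 mapping.

```c
#include <stdint.h>
#include <assert.h>

/* Saturate a 64-bit intermediate to 32 bits (ITU L_add behaviour). */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* ITU L_mult: (a*b) << 1, saturating on the single overflow case. */
static int32_t L_mult_ref(int16_t a, int16_t b)
{
    if (a == -32768 && b == -32768)
        return INT32_MAX;           /* the one case that overflows */
    return ((int32_t)a * b) << 1;
}

/* ITU L_mac: saturating multiply-accumulate, the operation that maps
 * to a single ConnX D2 instruction per the slide. */
int32_t L_mac_ref(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc + L_mult_ref(a, b));
}
```

Because the portable model needs branches and widening, a 1-to-1 hardware mapping of L_mac is where most of the intrinsic speedup comes from.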
"Out of the Box" Performance: Results
Comparison to TI C55x (an industry-benchmark dual-MAC, 2-way VLIW DSP) #
• 20% more performance (256-point complex FFT)
• Why better? FFT-specific instructions, dual write to register files, advanced compiler, SIMD and VLIW performance
Comparison to other DSP IP vendors *
• Almost twice the performance
• Why better? 1-to-1 mapping of ITU intrinsics, SIMD and VLIW performance, flexibility in VLIW allocation, VLIW performance for scalar code
* 2008, from a CEVA published whitepaper
# Dec 2008, www.ti.com
Small, Low Power & High Performance
Optimized for low-area / low-cost applications
• Less than 70,000 gates; 0.18 mm² in 65nm GP *
Low power
• 52 µW/MHz power consumption (65nm GP, measured running the AMR-NB algorithm)
Very high performance
• 600 MHz in 65nm GP **
* After full place and route, optimized for area/power. Size is for the full Xtensa core including the D2 DSP option.
** After full place and route, optimized for speed.
Flexible and Customizable
Configure memory subsystems to exact requirements
• Up to 4 local memories (instruction memory, data memory; RAM and ROM options)
• DMA path into these memories
• Instruction- and data-cache configurations
• MMU and memory-region protection
• Memory port interface
• Option of a dual load/store architecture
Full customization
• Instruction-set extensions
• Custom I/O interfaces: TIE ports, queues and lookup-memory interfaces
ConnX D2 DSP Engine: Summary
• Small size, low power
• Excellent performance on a wide range of code
• Easy to use: C-programming-centric, with "out of the box" performance
• Reduced development time and reduced cost
• ITU and TI C intrinsic support taps a large existing code base
• Bit-equivalent to TI C6x: take current TI code, port it, and get the same functionality on ConnX D2
• Flexible & customizable