The Role Of ASIP In Programmable Platforms

Outline
• Using ASIP – a new design paradigm
• EEMBC – a case study
• Designing ASIPs using Xtensa and TIE
• Addressing the needs of platforms
• ASIP computing capabilities
• ASIP communication capabilities
• Challenges

A short story of a design paradigm shift

Once upon a time
"How do I solve the encryption problem?"

Data Encryption Standard (DES)
Initial step
    (R, L) = Initial_permutation(Din64)
Iterate 16 times
    Key generation
        (C, D) = PC1(k)
        n = rotate_amount (function of iteration count)
        C = rotate_right(C, n)
        D = rotate_right(D, n)
        K = PC2(D, C)
    Encryption
        R[i+1] = L[i] ⊕ Permutation(S_Box(K ⊕ Expansion(R[i])))
        L[i+1] = R[i]
Final step
    Dout64 = Final_permutation(L, R)

The SW engineer very proudly presented
    static unsigned permute(unsigned char *table, int n, unsigned hi, unsigned lo)
    {
        int ib, ob;
        unsigned out = 0;
        for (ob = 0; ob < n; ob++) {
            ib = table[ob] - 1;   /* table entries are 1-based bit positions */
            if (ib >= 32) {
                if (hi & (1u << (ib - 32))) out |= 1u << ob;
            } else {
                if (lo & (1u << ib)) out |= 1u << ob;
            }
        }
        return out;
    }
"This code is fast"

The HW engineer laughed
"200 cycles? I can do it in 1!!!"
[Slide repeats the DES pseudocode from the "Data Encryption Standard (DES)" slide.]

The HW engineer presented
"I’ll show you how fast it can be"
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, ⊕, S Boxes, State Machine, P Permutation, ⊕, Final Permutation]

The SW engineer laughed
"I can change this in 1 minute, can you?"
[Same block diagram as on the previous slide, marked with a "?"]

Realizing that they each had something the other wanted
"If only I didn’t have to design the controller"
"If only I had just the instruction I need"

They decided to work together
New instructions:
    SETDATA ars, art
    SETKEY  ars, art
    DES     immediate
    GETDATA ars, hilo
[Block diagram: Initial Permutation, Expansion Permutation, Key Generation, ⊕, S Boxes, State Machine, P Permutation, ⊕, Final Permutation]

and improved the SW solution by 70x
Encryption:
    SETKEY(K_hi, K_lo);
    for (;;) {
        … /* read data */
        SETDATA(D_hi, D_lo);
        DES(ENCRYPT1); DES(ENCRYPT1);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT1);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
        DES(ENCRYPT1);
        E_hi = GETDATA(hi);
        E_lo = GETDATA(lo);
        … /* write encrypted data */
    }
Decryption:
    SETKEY(K_hi, K_lo);
    for (;;) {
        … /* read encrypted data */
        SETDATA(D_hi, D_lo);
        DES(DECRYPT1);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT1);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
        DES(DECRYPT1); DES(DECRYPT1);
        E_hi = GETDATA(hi);
        E_lo = GETDATA(lo);
        … /* write data */
    }

When the boss asked how, the SW engineer said:
[Diagram: SW Solution = Registers, Memory (Program), Control, Datapath. Correct: ✓, Efficient: ✗]

and the HW engineer said:
[Diagram: HW Solution = Storage, FSM. Correct: ✗, Efficient: ✓]

Together, they had the best of both worlds
[Diagram: ASIP combines the SW and HW solutions: Registers, Memory (Program), Control, Storage, FSM, Datapath. Correct: ✓, Efficient: ✓]

The boss was very happy
• Use Software for Control
• Use Application-specific datapath for computation
[Chart: Optimality/integration (e.g. mW, $) versus Flexibility/modularity (e.g. time-to-market). Points: special hardware, ASIP, traditional processors + SW. Two gaps are labeled Δ ~10x.]

And they worked together happily ever after

What Is “EEMBC”?
• EDN Embedded Microprocessor Benchmark Consortium
• Pronounced “Embassy”
• Non-profit consortium, funded by over 40 members
    • Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more
• Objective: Provide independently certified benchmark scores relevant to deeply embedded processor applications
• Independent laboratory recreates and certifies all benchmark results - no tricks

EEMBC Benchmark Suites
• Five different benchmark suites
    • Consumer
    • Networking
    • Telecom
    • Automotive
    • Office Automation
• Each suite comprises a range (five to sixteen) of benchmarks representative of that product category
• Example: Consumer
    • Image compression, image filtering, color conversion

Two Metrics: Out-of-box vs. Optimized
• Out-of-Box
    • Benchmark C code, no manual code optimization, no assembly coding
• Optimized, or “Full-Fury”
    • Conventional Processors
        • Laboriously hand-tuned assembly code
        • Rewriting C code to fit the architecture for VLIW or SIMD machines
        • Changing Code to Fit the Processor
    • Xtensa
        • Optimized processor using Xtensa processor generator and TIE Compiler
        • Changing Processor to Fit the Application!!

Xtensa Optimization Process
• Step #1: Configure processor via generator GUI
    • Compile C-code, evaluate results
    • Modify configuration as needed
    • “Out of Box” results measurement taken here
• Step #2: Profile Code, Add TIE
• Step #3: Optimize Code to Utilize TIE instructions
    • “Optimized” results measured on final hardware configuration
Same Path Used by Tensilica Customers!

Optimized Xtensa Configurations for EEMBC
OUT-OF-BOX: Configured Xtensa (using GUI click-box options), unmodified C-Code
OPTIMIZED: Configured Xtensa plus TIE gates & instructions, C-Code optimizations

Configuration   Out-of-box                                        Optimized
Consumer        25000 base gates + 37600 config. gates, 200MHz    127K total gates (62.6K + 64.1K TIE), 200MHz
Network         25000 base gates + 25000 config. gates, 200MHz    59K total gates (50K + 9.2K TIE), 200MHz
Telecom         25000 base gates + 37000 config. gates, 200MHz    180K total gates (VECTRA + 18K TIE), 200MHz

Illustrations conceptual, see EEMBC report for full details

EEMBC Consumer Benchmark
[Bar chart: Consumermark per processor; highlighted bars: Optimized Xtensa, Out-of-box Xtensa]

EEMBC Consumer Benchmark
[Bar chart: Consumermark / MHz per processor; highlighted bars: Optimized Xtensa, Out-of-box Xtensa]

EEMBC Networking Benchmark
[Bar chart: Netmark per processor; highlighted bars: AMD K6, Optimized Xtensa, Out-of-box Xtensa]

EEMBC Networking Benchmark
[Bar chart: Netmark / MHz per processor; highlighted bars: Optimized Xtensa, Out-of-box Xtensa, AMD K6]

EEMBC Telecom Benchmark
[Bar chart: Telemark per processor; highlighted bars: BOPS 2x2, Optimized Xtensa, Out-of-box Xtensa]

EEMBC Telecom Benchmark
[Bar chart: Telemark / MHz per processor; highlighted bars: BOPS 2x2, Optimized Xtensa, Out-of-box Xtensa; one value label reads 1.67]

ASIP Generation Flow
[Flow diagram: "Select processor options" and "Describe new instructions" feed the Xtensa Processor Generator, which produces "In Minutes!" a tailored, synthesizable HDL uP core (ALU, I/O, Timer, Pipe, Cache, Register File, MMU) together with:]
• Optimizing C/C++ Compiler
• Cycle-accurate Simulator
• Assembler
• Linker
• C/C++/asm/inst Debugger
• RTOS

Tensilica Instruction Extension (TIE) Language
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }

Sample platforms
• Vitesse PRISM IQ2000
• Intel IXP1200
• Motorola C-Port CDP C-5
• PMC-Sierra VoIP Gateway

Observations
• Heterogeneous processing elements
    • General purpose processors
    • Micro-controllers
    • Dedicated blocks
• Heterogeneous communication links
    • Bandwidth
    • Latency
    • Hardware overhead
    • Communication overhead

Two Legs Of Platform Design
[Diagram: the Platform Designer stands on two legs: Processing Element Design and Communication Design]

ASIP requirements
• Match the performance of hard-wired logic
• Offer variety of performance/cost points
• Easy to design
• Easy to use

Fixed Processors Cannot Replace ASIC
• Spatial bottleneck:
    • Not enough bandwidth
• Temporal bottleneck:
    • Limited functionality
[Diagram: Decoder, RF0, Source, FU0, Control, Result]

Adding Customized Function Units to Break Temporal Bottleneck
[Diagram: Decoder, RF0, Source routing, FU0, FU1, FU2, FU3, Control, Result routing]

Example of Customized Functional Unit
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }

Effectiveness of Customized Functional Unit
Requirements:
• Performance – similar
• Cost – similar
• Ease of design – similar
    TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0];
• Ease of use – much easier
    C: PMAC(x, y);

Adding Processor States to Break Spatial Bottleneck
[Diagram: Decoder, RF0, states S0 and S1, Source routing, FU0, FU1, FU2, FU3, Control, Result routing]

Example of Processor States
    opcode PMAC op2=0 CUST0
    state ACC1 40
    state ACC2 40
    iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
    semantic pmac_sem {PMAC} {
        assign ACC1 = ACC1 + ars[15:0] * art[15:0];
        assign ACC2 = ACC2 + ars[31:16] * art[31:16];
    }
    schedule pmac_schd {PMAC} {
        use ars 1; use art 1;
        use ACC1 2; use ACC2 2;
        def ACC1 2; def ACC2 2;
    }

Effectiveness of Processor States
Requirements:
• Performance – better
    Especially when used with pipelined functional units
• Cost – higher due to pipelined implementation
• Ease of design – very simple
    state ACC1 40
• Ease of use – very easy
    PMAC(x, y); /* implicitly using the states */
    x = R_ACC1_Lo();
    W_ACC1_Hi(y);

Sharing States Using Register Files
[Diagram: Decoder, register files RF0, RF1, RF2, states S0 and S1, Source routing, FU0, FU1, FU2, FU3, Control, Result routing]

Example of a Register File
    regfile RF24 24 16 r
    operand vs s {RF24[s]}
    operand vt t {RF24[t]}
    operand vr r {RF24[r]}
    iclass rrr {average} {out vr, in vs, in vt}
    reference average {
        wire [8:0] t2 = vs[23:16] + vt[23:16];
        wire [8:0] t1 = vs[15:8] + vt[15:8];
        wire [8:0] t0 = vs[7:0] + vt[7:0];
        assign vr = {t2[8:1], t1[8:1], t0[8:1]};
    }
    ctype rgb 24 32 RF24

Crossing the HW/SW Boundary
• Working with typed data:
    rgb x, y, z; /* C code */
• Letting C-Compiler allocate the registers
    z = average(x, y); /* assembly: average v1, v4, v6 */
• Letting C-Compiler spill the registers
• Letting C-Compiler convert to/from other types
    yuv a, b;
    b = average(a, y);
• Auto saved/restored on context switching

Effectiveness of Register File
Requirements:
• Performance – better
    Especially when used with pipelined functional units
• Cost – higher due to pipelined implementation
• Ease of design – very simple
    regfile RF24 24 16 r
• Ease of use – very easy
    rgb x, y, z;
    z = average(x, y);

Multi-cycle Instructions
[Diagram: Decoder, states S0 and S1, register files RF0, RF1, RF2, Source routing, FU0, FU1, FU2, FU3, Control, Result routing]
