The Role Of ASIP In Programmable Platforms
Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIP using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges
A short story of a design paradigm shift
Once upon a time How do I solve the encryption problem?
Data Encryption Standard (DES)
Initial step
    (R, L) = Initial_permutation(Din64)
Iterate 16 times
    Key generation
        (C, D) = PC1(k)
        n = rotate_amount (function of iteration count)
        C = rotate_right(C, n)
        D = rotate_right(D, n)
        K = PC2(D, C)
    Encryption
        R_{i+1} = L_i ⊕ Permutation(S_Box(K ⊕ Expansion(R_i)))
        L_{i+1} = R_i
Final step
    Dout64 = Final_permutation(L, R)
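The round structure on this slide (new R from old L combined with a function of old R and the round key, new L from old R) is a Feistel network. The following is a minimal sketch of that structure only, assuming a made-up round function `toy_round` in place of the real expansion, S-box, and permutation steps; it is not DES and the names `toy_round` and `feistel_crypt` are illustrative.

```c
#include <stdint.h>

#define ROUNDS 16

/* Toy round function: NOT the DES f-function. It only stands in for
   "some function of R and the round key K". */
static uint32_t toy_round(uint32_t r, uint32_t k)
{
    uint32_t x = r ^ k;
    return ((x << 3) | (x >> 29)) + 0x9E3779B9u;
}

/* Run 16 Feistel rounds over a block held as two 32-bit halves, then
   swap the halves. With decrypt != 0 the round keys are used in
   reverse order, which undoes the encryption. */
static void feistel_crypt(uint32_t *l, uint32_t *r,
                          const uint32_t keys[ROUNDS], int decrypt)
{
    for (int i = 0; i < ROUNDS; i++) {
        uint32_t k = keys[decrypt ? ROUNDS - 1 - i : i];
        uint32_t next_r = *l ^ toy_round(*r, k);  /* R(i+1) = L(i) xor f(R(i), K) */
        *l = *r;                                  /* L(i+1) = R(i) */
        *r = next_r;
    }
    /* Final swap of the halves, so the same routine decrypts. */
    uint32_t t = *l;
    *l = *r;
    *r = t;
}
```

Calling `feistel_crypt` with `decrypt = 0` and then with `decrypt = 1` on the same key array returns the original halves, whatever the round function is. That property is what lets one datapath serve both directions.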
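The SW engineer very proudly presented: "This code is fast"

static unsigned permute(unsigned char *table, int n, unsigned hi, unsigned lo)
{
    int ib, ob;
    unsigned out = 0;

    /* Output bit ob is copied from input bit table[ob] - 1 (table entries are 1-based). */
    for (ob = 0; ob < n; ob++) {
        ib = table[ob] - 1;
        if (ib >= 32) {
            /* Input bits 32..63 are held in hi. */
            if (hi & (1u << (ib - 32)))
                out |= 1u << ob;
        } else {
            /* Input bits 0..31 are held in lo. */
            if (lo & (1u << ib))
                out |= 1u << ob;
        }
    }
    return out;
}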
The HW engineer laughed: "200 cycles? I can do it in 1!!!" [Slide repeats the DES algorithm outline: initial step, 16 iterations of key generation and encryption, final step]
The HW engineer presented: "I'll show you how fast it can be" [Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, State Machine, P Permutation, Final Permutation]
The SW engineer laughed: "I can change this in 1 minute, can you?" [Same block diagram, marked with a question mark: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, State Machine, P Permutation, Final Permutation]
Realizing that they each had something the other wanted • "If only I didn't have to design the controller" • "If only I had just the instruction I need"
They decided to work together • New instructions: SETDATA ars, art • SETKEY ars, art • DES immediate • GETDATA ars, hilo [Block diagram: Initial Permutation, Expansion Permutation, Key Generation, S Boxes, State Machine, P Permutation, Final Permutation]
and improved the SW solution by 70x

Encryption:

SETKEY(K_hi, K_lo);
for (;;) {
    … /* read data */
    SETDATA(D_hi, D_lo);
    DES(ENCRYPT1); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2);
    DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write encrypted data */
}

Decryption:

SETKEY(K_hi, K_lo);
for (;;) {
    … /* read encrypted data */
    SETDATA(D_hi, D_lo);
    DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2);
    DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT1);
    E_hi = GETDATA(hi);
    E_lo = GETDATA(lo);
    … /* write data */
}
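The order of the 1s and 2s in the sixteen DES(...) calls matches the standard DES key-rotation schedule, and the decryption order is the encryption order reversed. The sketch below assumes that ENCRYPT1 and ENCRYPT2 mean "rotate the key halves by one" and "rotate by two"; the slide does not state this, and `rotate_right28`, `rotate_amount`, and `rotate_through_schedule` are illustrative names, not part of the TIE design.

```c
#include <stdint.h>

#define HALF_MASK 0x0FFFFFFFu  /* C and D are 28-bit halves of the 56-bit key */

/* Rotate a 28-bit value right by n positions (0 < n < 28). */
static uint32_t rotate_right28(uint32_t v, unsigned n)
{
    v &= HALF_MASK;
    return ((v >> n) | (v << (28 - n))) & HALF_MASK;
}

/* Per-round rotate amounts, read off the ENCRYPT1/ENCRYPT2 sequence
   (assumed meaning: 1 = rotate by one, 2 = rotate by two). */
static const unsigned char rotate_amount[16] = {
    1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1
};

/* Apply all 16 rotations to one key half and return the result. */
static uint32_t rotate_through_schedule(uint32_t half)
{
    for (int i = 0; i < 16; i++)
        half = rotate_right28(half, rotate_amount[i]);
    return half;
}
```

The amounts sum to 28, so a 28-bit half returns to its starting value after the sixteenth round. Under the stated assumption, this would explain why SETKEY can sit outside the loop: the key state is back in place for the next block.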
When the boss asked how, the SW engineer said: [Diagram labels: SW Solution; Registers, Memory (Program), Control, Datapath; Correct, Efficient, SW]
and the HW engineer said: [Diagram labels: HW Solution; Storage, FSM; Correct, Efficient, HW]
Together, they had the best of both worlds [Diagram labels: ASIP; SW Solutions, HW Solutions; Registers, Memory (Program), Control, Storage, FSM, Datapath; Correct, Efficient, SW, HW]
The boss was very happy • Use software for control • Use application-specific datapath for computation [Chart: optimality/integration (e.g. mW, $) versus flexibility/modularity (e.g. time-to-market); points for special hardware, ASIP, and traditional processors + SW; two gaps labeled Δ ~10x]
Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIPs using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges
What Is “EEMBC”? • EDN Embedded Microprocessor Benchmark Consortium • Pronounced “Embassy” • Non-profit consortium, funded by over 40 members • Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more • Objective: Provide independently certified benchmark scores relevant to deeply embedded processor applications • Independent laboratory recreates and certifies all benchmark results - no tricks
EEMBC Benchmark Suites • Five different benchmark suites • Consumer • Networking • Telecom • Automotive • Office Automation • Each suite comprises a range (five to sixteen) of benchmarks representative of that product category • Example: Consumer • Image compression, image filtering, color conversion
Two Metrics: Out-of-box vs. Optimized • Out-of-Box • Benchmark C code, no manual code optimization, no assembly coding • Optimized, or “Full-Fury” • Conventional Processors • Laboriously hand-tuned assembly code • Rewriting C code to fit the architecture for VLIW or SIMD machines • Changing Code to Fit the Processor • Xtensa • Optimized processor using Xtensa processor generator and TIE Compiler • Changing Processor to Fit the Application!!
Xtensa Optimization Process • Step #1: Configure processor via generator GUI • Compile C-code, evaluate results • Modify configuration as needed • “Out of Box” results measurement taken here • Step #2: Profile Code, Add TIE • Step #3: Optimize Code to Utilize TIE instructions • “Optimized” results measured on final hardware configuration Same Path Used by Tensilica Customers!
Optimized Xtensa Configurations for EEMBC • Out-of-box: configured Xtensa (using GUI click-box options), unmodified C code • Optimized: configured Xtensa plus TIE gates and instructions, C-code optimizations • Consumer configuration: out-of-box 25000 base gates + 37600 config. gates (62.6K), 200MHz; optimized 127K total gates (62.6K + 64.1K TIE), 200MHz • Network configuration: out-of-box 25000 base gates + 25000 config. gates (50K), 200MHz; optimized 59K total gates (50K + 9.2K TIE), 200MHz • Telecom configuration: out-of-box 25000 base gates + 37000 config. gates, 200MHz; optimized 180K total gates (VECTRA + 18K TIE), 200MHz • Illustrations conceptual, see EEMBC report for full details
EEMBC Consumer Benchmark [Bar chart: Consumermark by processor; Optimized Xtensa and Out-of-box Xtensa highlighted]
EEMBC Consumer Benchmark [Bar chart: Consumermark / MHz by processor; Optimized Xtensa and Out-of-box Xtensa highlighted]
EEMBC Networking Benchmark [Bar chart: Netmark by processor; AMD K6, Optimized Xtensa, and Out-of-box Xtensa highlighted]
EEMBC Networking Benchmark [Bar chart: Netmark / MHz by processor; Optimized Xtensa, Out-of-box Xtensa, and AMD K6 highlighted]
EEMBC Telecom Benchmark [Bar chart: Telemark by processor; BOPS 2x2, Optimized Xtensa, and Out-of-box Xtensa highlighted]
EEMBC Telecom Benchmark [Bar chart: Telemark / MHz by processor; BOPS 2x2, Optimized Xtensa, and Out-of-box Xtensa highlighted; one bar labeled 1.67]
Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIPs using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges
ASIP Generation Flow • Inputs: select processor options; describe new instructions • Xtensa Processor Generator: in minutes! • Outputs: tailored, synthesizable HDL uP core [diagram blocks: ALU, I/O, Timer, Pipe, Cache, Register File, MMU] • Optimizing C/C++ Compiler • Cycle-accurate Simulator • Assembler • Linker • C/C++/asm/inst Debugger • RTOS
Tensilica Instruction Extension (TIE) Lang.

opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
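To read the semantic section as ordinary software, here is a hedged C model of what PMAC computes: the low 16-bit halves of the two operands multiply into ACC1 and the high halves into ACC2, with each accumulator kept to 40 bits. The sketch assumes unsigned operands and wrap-around at 40 bits; `pmac_state`, `pmac_model`, and `packed_dot` are illustrative names, not generated by the TIE compiler.

```c
#include <stdint.h>

#define ACC_MASK 0xFFFFFFFFFFull  /* 40 bits, matching "state ACC1 40" */

/* Software stand-in for the two 40-bit accumulator states. */
typedef struct {
    uint64_t acc1;
    uint64_t acc2;
} pmac_state;

/* Model of one PMAC: low halves accumulate into ACC1, high halves
   into ACC2. Operands are treated as unsigned (an assumption). */
static void pmac_model(pmac_state *s, uint32_t ars, uint32_t art)
{
    uint64_t lo = (uint64_t)(ars & 0xFFFFu) * (art & 0xFFFFu);
    uint64_t hi = (uint64_t)(ars >> 16) * (art >> 16);
    s->acc1 = (s->acc1 + lo) & ACC_MASK;
    s->acc2 = (s->acc2 + hi) & ACC_MASK;
}

/* Two dot products at once over arrays of packed 16-bit pairs:
   one PMAC per 32-bit word. */
static void packed_dot(pmac_state *s, const uint32_t *x,
                       const uint32_t *y, int n)
{
    s->acc1 = 0;
    s->acc2 = 0;
    for (int i = 0; i < n; i++)
        pmac_model(s, x[i], y[i]);
}
```

In software each call costs two multiplies, two adds, and several shifts and masks. The TIE version issues the same work as a single instruction.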
Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIP using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges
Sample platforms • Vitesse PRISM IQ2000 • Intel IXP1200 • Motorola C-Port CDP C-5 • PMC-Sierra VoIP Gateway
Observations • Heterogeneous processing elements • General purpose processors • Micro-controllers • Dedicated blocks • Heterogeneous communication links • Bandwidth • Latency • Hardware overhead • Communication overhead
Two Legs Of Platform Design • Processing Element Design • Communication Design [Diagram: the Platform Designer standing on these two legs]
Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIP using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges
ASIP requirements • Match the performance of hard-wired logic • Offer variety of performance/cost points • Easy to design • Easy to use
Fixed Processors Cannot Replace ASIC • Spatial bottleneck: • not enough bandwidth • Temporal bottleneck: • Limited functionality [Diagram labels: Decoder, Control, RF0, Source, FU0, Result]
Adding Customized Function Units to Break Temporal Bottleneck [Diagram labels: Decoder, Control, RF0, Source routing, FU0, FU1, FU2, FU3, Result routing]
Example of Customized Functional Unit

opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
Effectiveness of Customized Functional Unit • Requirements (compared with hard-wired logic): • Performance – similar • Cost – similar • Ease of design – similar; TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0]; • Ease of use – much easier; C: PMAC(x, y);
Adding Processor States to Break Spatial Bottleneck [Diagram labels: Decoder, Control, RF0, S0, S1, Source routing, FU0, FU1, FU2, FU3, Result routing]
Example of Processor States

opcode PMAC op2=0 CUST0
state ACC1 40
state ACC2 40
iclass rr {PMAC} {in ars, in art} {inout ACC1, inout ACC2}
semantic pmac_sem {PMAC} {
    assign ACC1 = ACC1 + ars[15:0] * art[15:0];
    assign ACC2 = ACC2 + ars[31:16] * art[31:16];
}
schedule pmac_schd {PMAC} {
    use ars 1; use art 1;
    use ACC1 2; use ACC2 2;
    def ACC1 2; def ACC2 2;
}
Effectiveness of Processor States • Requirements: • Performance – better, especially when used with pipelined functional units • Cost – higher due to pipelined implementation • Ease of design – very simple: state ACC1 40 • Ease of use – very easy: PMAC(x, y); /* implicitly using the states */ x = R_ACC1_Lo(); W_ACC1_Hi(y);
Sharing States Using Register Files [Diagram labels: Decoder, Control, RF0, RF1, RF2, S0, S1, Source routing, FU0, FU1, FU2, FU3, Result routing]
Example of a Register File

regfile RF24 24 16 r
operand vs s {RF24[s]}
operand vt t {RF24[t]}
operand vr r {RF24[r]}
iclass rrr {average} {out vr, in vs, in vt}
reference average {
    wire [8:0] t2 = vs[23:16] + vt[23:16];
    wire [8:0] t1 = vs[15:8] + vt[15:8];
    wire [8:0] t0 = vs[7:0] + vt[7:0];
    assign vr = {t2[8:1], t1[8:1], t0[8:1]};
}
ctype rgb 24 32 RF24
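As a cross-check on the reference section, the same per-channel average can be written in plain C. This is a sketch under the assumption that the 24-bit value packs three unsigned 8-bit channels; `average_model` is an illustrative name. Each 9-bit sum keeps its carry, and dropping bit 0 (the `[8:1]` slice) halves it.

```c
#include <stdint.h>

/* Model of the TIE "average" reference on a packed 24-bit value:
   three independent 8-bit channel averages, rounded down. */
static uint32_t average_model(uint32_t vs, uint32_t vt)
{
    /* 9-bit sums, one per channel (wire [8:0] t2, t1, t0). */
    uint32_t t2 = ((vs >> 16) & 0xFFu) + ((vt >> 16) & 0xFFu);
    uint32_t t1 = ((vs >> 8) & 0xFFu) + ((vt >> 8) & 0xFFu);
    uint32_t t0 = (vs & 0xFFu) + (vt & 0xFFu);

    /* Bits [8:1] of each sum, concatenated back into 24 bits. */
    return ((t2 >> 1) << 16) | ((t1 >> 1) << 8) | (t0 >> 1);
}
```

For example, `average_model(0xFF8000, 0x01807F)` yields `0x80803F`: the top channel's carry out of 8 bits is preserved rather than lost, which a naive `(vs + vt) >> 1` on the packed word would not guarantee per channel.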
Crossing the HW/SW Boundary • Working with typed data: rgb x, y, z; /* C code */ • Letting C-Compiler allocate the registers z = average(x, y); /* assembly: average v1, v4, v6 */ • Letting C-Compiler spill the registers • Letting C-Compiler convert to/from other types yuv a, b; b = average (a, y); • Auto saved/restored on context switching
Effectiveness of Register File • Requirements: • Performance – better, especially when used with pipelined functional units • Cost – higher due to pipelined implementation • Ease of design – very simple: regfile RF24 24 16 r • Ease of use – very easy: rgb x, y, z; z = average(x, y);
Multi-cycle Instructions [Diagram labels: Decoder, Control, S0, S1, RF0, RF1, RF2, Source routing, FU0, FU1, FU2, FU3, Result routing]