1 / 65

The Role Of ASIP In Programmable Platforms

The Role Of ASIP In Programmable Platforms. Outline. Using ASIP – a new design paradigm EEMBC – a case study Designing ASIP using Xtensa and TIE Addressing the needs of platforms ASIP computing capabilities ASIP communication capabilities Challenges. A short story of

joella
Download Presentation

The Role Of ASIP In Programmable Platforms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Role Of ASIP In Programmable Platforms

  2. Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIP using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges

  3. A short story of a design paradigm shift

  4. Once upon a time How do I solve the encryption problem?

  5. Data Encryption Standard (DES) Initial step (R, L) = Initial_permutation(Din64) Iterate 16 times Key generation (C, D) = PC1(k) n = rotate_amount (function of iteration count) C = rotate_right(C, n) D = rotate_right (D, n) K = PC2(D, C) Encryption R i+1 = Li Permutation ( S_Box ( K  Expansion ( R ) ) ) L i+1 = Ri Final step Dout64 = Final_permutation(L, R)

  6. The SW engineer very proudly presented static unsigned permute( unsigned char *table, in t n, unsigned hi, unsigned lo) { int ib, ob; unsigned out = 0; for (ob = 0; ob < n; ob++) { ib = table[ob] - 1; if (ib >= 32) { if (hi & (1 << (ib-32))) out |= 1 << ob; } else { if (lo & (1 << ib)) out |= 1 << ob; } } return out; } This code is fast

  7. The HW engineer laughed Initial step (R, L) = Initial_permutation(Din64) Iterate 16 times Key generation (C, D) = PC1(k) n = rotate_amount (function of iteration count) C = rotate_right(C, n) D = rotate_right (D, n) K = PC2(D, C) Encryption R i+1 = Li Permutation ( S_Box ( K  Expansion ( R ) ) ) L i+1 = Ri Final step Dout64 = Final_permutation(L, R) 200 cycles? I can do it in 1!!! ?

  8. The HW engineer presented Initial Permutation I’ll show you how fast it can be Expansion Permutation Key Generation  S Boxes State Machine P Permutation  Final Permutation

  9. The SW engineer laughed I can change this in 1 minute, can you? Initial Permutation Expansion Permutation Key Generation  ? S Boxes State Machine P Permutation  Final Permutation

  10. Realizing that they each had something the other wanted If only I don’t have to design the controller If only I have just the instruction I need

  11. They decided to work together SETDATA ars, art SETKEY ars, art DES immediate GETDATA ars, hilo Initial Permutation Expansion Permutation Key Generation  S Boxes State Machine P Permutation  Final Permutation

  12. and improved the SW solution by 70x Encryption Decryption SETKEY(K_hi, K_lo); for (;;) { … /* read data */ SETDATA(D_hi, D_lo); DES(ENCRYPT1); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT2); DES(ENCRYPT1); E_hi = GETDATA(hi); E_lo = GETDATA(lo); … /* write encrypted data */ } SETKEY(K_hi, K_lo); for (;;) { … /* read encrypted data */ SETDATA(D_hi, D_lo); DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT2); DES(DECRYPT1); DES(DECRYPT1); E_hi = GETDATA(hi); E_lo = GETDATA(lo); … /* write data */ }

  13. When the boss asked how,the SW engineer said:  X SW Solution Registers Memory (Program) Control Datapath  X Correct Efficient SW

  14. and the HW engineer said: Storage FSM  X HW Solution X  Correct Efficient HW

  15. Together, they had the best of both world   ASIP SW Solutions HW Solutions Registers Memory (Program) Control Storage FSM Datapath   Correct Efficient SW HW

  16. The boss was very happy Use Software for Control Optimality/ integration (e.g. mW, $) special hardware ASIP Use Application- specific datapath for computation D ~10x traditional processors + SW Flexibility/modularity (e.g. time-to-market) D ~10x

  17. And they worked together happily ever after

  18. Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIPs using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges

  19. What Is “EEMBC”? • EDN Embedded Microprocessor Benchmark Consortium • Pronounced “Embassy” • Non-profit consortium, funded by over 40 members • Including: ARM, AMD, IBM, Intel, LSI Logic, MIPS, Motorola, National Semi, NEC, TI, Toshiba, Tensilica, and more • Objective: Provide independently certified benchmark scores relevant to deeply embedded processor applications • Independent laboratory recreates and certifies all benchmark results - no tricks

  20. EEMBC Benchmark Suites • Five different benchmark suites • Consumer • Networking • Telecom • Automotive • Office Automation • Each suite comprised of a range (five to sixteen) ofbenchmarks representative of that product category • Example: Consumer • Image compression, image filtering, color conversion

  21. Two Metrics: Out-of-box vs. Optimized • Out-of-Box • Benchmark C code, no manual code optimization,no assembly coding • Optimized, or “Full-Fury” • Conventional Processors • Laboriously hand-tuned assembly code • Rewriting C code to fit the architecture for VLIW or SIMD machines • Changing Code to Fit the Processor • Xtensa • Optimized processor using Xtensa processor generator and TIE Compiler • Changing Processor to Fit the Application!!

  22. Xtensa Optimization Process • Step #1: Configure processor via generator GUI • Compile C-code, evaluate results • Modify configuration as needed • “Out of Box” results measurement taken here • Step #2: Profile Code, Add TIE • Step #3: Optimize Code to Utilize TIE instructions • “Optimized” results measured on final hardware configuration Same Path Used by Tensilica Customers!

  23. Optimized Xtensa Configurations for EEMBC OUT-OF-BOX Configured Xtensa (Using GUI Click box options) Unmodified C-Code OPTIMIZED Configured Xtensa Plus TIE Gates & Instructions C-Code optimizations Consumer Configuration 25000 base gates + 37600 config. gates 200MHz 127K total gates 200MHz 62.6K 64.1K TIE Network Configuration 25000 base gates + 25000 config. gates 200MHz 59K total gates 200MHz 50K 9.2K TIE Telecom Configuration 25000 base gates + 37000 config Gates 200MHz 180K total gates 200MHz VECTRA 18K TIE Illustrations conceptual, see EEBMC report for full details

  24. EEMBC Consumer Benchmark Consumermark Optimized Xtensa Out-of-box Xtensa Processors

  25. EEMBC Consumer Benchmark Consumermark / MHz Optimized Xtensa Out-of-box Xtensa Processors

  26. EEMBC Networking Benchmark Netmark AMD K6 Optimized Xtensa Out-of-box Xtensa Processors

  27. EEMBC Networking Benchmark Netmark / MHz Optimized Xtensa Out-of-box Xtensa AMD K6 Processors

  28. EEMBC Telecom Benchmark BOPS 2x2 Telemark Optimized Xtensa Out-of-box Xtensa Processors

  29. EEMBC Telecom Benchmark BOPS 2x2 Telemark / MHz 1.67 Optimized Xtensa Out-of-box Xtensa Processors

  30. Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIPs using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges

  31. ASIP Generation Flow ALU I/O Timer Pipe Cache Register File MMU Tailored, synthesizable HDL uP core Select processor options ******* **** ******** *** Xtensa Processor Generator • Optimizing C/C++ Compiler • Cycle-accurate Simulator • Assembler • Linker • C/C++/asm/inst Debugger • RTOS Describe new instructions In Minutes!

  32. Tensilica Instruction Extension (TIE) Lang. opcode PMAC op2=0 CUST0 state ACC1 40 state ACC2 40 iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2} semantic pmac_sem {PMAC} { assign ACC1 = ACC1 + ars[15:0] * art[15:0]; assign ACC2 = ACC2 + ars[31:16] * art[31:16]; } schedule pmac_schd {PMAC} { use ars 1; use art 1; use ACC1 2; use ACC2 2; def ACC1 2; def ACC2 2; }

  33. Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIP using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges

  34. Sample platforms Vitesse PRISM IQ2000 Intel IXP1200 Motorola C-Port CDP C-5 PMC-Sierra VoIP Gateway

  35. Observations • Heterogeneous processing elements • General purpose processors • Micro-controllers • Dedicated blocks • Heterogeneous communication links • Bandwidth • Latency • Hardware overhead • Communication overhead

  36. Two Legs Of Platform Design Platform Designer Processing Element Design Communication Design

  37. Outline • Using ASIP – a new design paradigm • EEMBC – a case study • Designing ASIP using Xtensa and TIE • Addressing the needs of platforms • ASIP computing capabilities • ASIP communication capabilities • Challenges

  38. ASIP requirements • Match the performance of hard-wired logic • Offer variety of performance/cost points • Easy to design • Easy to use

  39. Fixed Processors Cannot Replace ASIC • Spatial bottleneck: • not enough bandwidth • Temporal bottleneck: • Limited functionality Decoder RF0 Source FU0 Control Result

  40. Adding Customized Function Units to Break Temporal Bottleneck Decoder RF0 Source routing FU0 FU1 FU2 FU3 Control Result routing

  41. Example of Customized Functional Unit opcode PMAC op2=0 CUST0 state ACC1 40 state ACC2 40 iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2} semantic pmac_sem {PMAC} { assign ACC1 = ACC1 + ars[15:0] * art[15:0]; assign ACC2 = ACC2 + ars[31:16] * art[31:16]; } schedule pmac_schd {PMAC} { use ars 1; use art 1; use ACC1 2; use ACC2 2; def ACC1 2; def ACC2 2; }

  42. Effectiveness of Customized Functional Unit Requirements: • Performance - similar • Cost - similar • Ease of design – similar TIE: assign ACC1 = ACC1 + ars[15:0] * art[15:0]; • Ease of use – much easier C: PMAC(x, y);

  43. Adding Processor States to Break Spatial Bottleneck RF0 FU0 FU1 FU2 FU3 Decoder S0 S1 Source routing Control Result routing

  44. Example of Processor States opcode PMAC op2=0 CUST0 state ACC1 40 state ACC2 40 iclass rr {PMAC}{in ars, in art}{inout ACC1, inout ACC2} semantic pmac_sem {PMAC} { assign ACC1 = ACC1 + ars[15:0] * art[15:0]; assign ACC2 = ACC2 + ars[31:16] * art[31:16]; } schedule pmac_schd {PMAC} { use ars 1; use art 1; use ACC1 2; use ACC2 2; def ACC1 2; def ACC2 2; }

  45. Effectiveness of Processor States Requirements: • Performance – better Especially when used with pipelined functional units • Cost – higher due to pipelined implementation • Ease of design – very simple state ACC1 40 • Ease of use – very easy PMAC(x, y); /* implicitly using the states */ x = R_ACC1_Lo(); W_ACC1_Hi(y);

  46. Sharing States Using Register Files RF0 RF1 RF2 FU0 FU1 FU2 FU3 Decoder S0 S1 Source routing Control Result routing

  47. Example of a Register File regfile RF24 24 16 r operand vs s {RF24[s]} operand vt t {RF24[t]} operand vr r {RF24[r]} iclass rrr {average} {out vr, in vs, in vt} reference average { wire [8:0] t2 = vs[23:16] + vt[23:16]; wire [8:0] t1 = vs[15:8] + vt[15:8]; wire [8:0] t0 = vs[7:0] + vt[7:0]; assign vr = {t2[8:1], t1[8:1], t0[8:1]}; } ctype rgb 24 32 RF24 Control

  48. Crossing the HW/SW Boundary • Working with typed data: rgb x, y, z; /* C code */ • Letting C-Compiler allocate the registers z = average(x, y); /* assembly: average v1, v4, v6 */ • Letting C-Compiler spill the registers • Letting C-Compiler convert to/from other types yuv a, b; b = average (a, y); • Auto saved/restored on context switching

  49. Effectiveness of Register File Requirements: • Performance – better Especially when used with pipelined functional units • Cost – higher due to pipelined implementation • Ease of design – very simple regfile RF24 24 16 r • Ease of use – very easy rgb x, y, z; z = average(x, y);

  50. Multi-cycle Instructions S0 S1 RF0 RF1 RF2 FU0 FU1 FU2 FU3 Decoder Source routing Control Result routing

More Related