
Platform-based Design

Learn about platform-based design, flexibility, and efficiency in DSP systems, including application-specific instruction set processors, hardware acceleration, and power consumption considerations. Explore examples, conclusions, and Amdahl's law in processor design.



Presentation Transcript


  1. Platform-based Design: Digital Signal Processors. TU/e 5kk70. Henk Corporaal, Bart Mesman. Processor Architectures and Program Mapping, H. Corporaal and B. Mesman

  2. DSP design spectrum, from most flexible to most efficient: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor.

  3. Application examples (1)
     [Figure: 5-tap filter. Samples x0 … x4 sit in a delay line built from four Z^-1 elements; each sample is multiplied by its coefficient c0 … c4 and the five products are summed into the output y.]

  4. Application examples (1): 19 instructions per tap!!
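The tapped delay line of slides 3 and 4 can be sketched in a few lines. This is only an illustration of the computation itself, not the code the slide counted instructions for; the function name `fir_step` and the coefficient values are made up for the example.

```python
def fir_step(delay_line, coeffs, sample):
    """Shift one new sample into the delay line and return the filter output y."""
    # The newest sample enters at the front; the oldest one drops off the end.
    delay_line.insert(0, sample)
    delay_line.pop()
    # One multiply-accumulate per tap.
    acc = 0
    for c, x in zip(coeffs, delay_line):
        acc += c * x
    return acc


coeffs = [1, 2, 3, 4, 5]   # c0 .. c4, hypothetical values
delay_line = [0] * 5       # x0 .. x4
# Feeding a unit impulse reproduces the coefficients at the output.
outputs = [fir_step(delay_line, coeffs, s) for s in [1, 0, 0, 0, 0, 0]]
print(outputs)
```

Each tap costs one multiply, one add, one coefficient fetch, one sample fetch and a share of the delay-line update, which is the work a general-purpose instruction set has to spell out instruction by instruction.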

  5. Application examples (2). Bit level operations: finite field arithmetic. 10 instructions!! Very simple in hardware.

  6. Application examples (2). Bit level operations: DES example.
     Bits 27, 26, 25, 23, 22, 20 of the source register ($2) are moved to bits 7, 6, 5, 4, 3, 2 of the destination register ($24):
       srl  $13, $2, 20
       andi $25, $13, 1
       srl  $14, $2, 21
       andi $24, $14, 6
       or   $15, $25, $24
       srl  $13, $2, 22
       andi $14, $13, 56
       or   $25, $15, $14
       sll  $24, $25, 2
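To check what the instruction sequence on slide 6 computes, it can be mirrored step by step and compared against a direct bit-by-bit description. This is a sketch under the assumption that `$2` holds an unsigned 32-bit value; the function names are invented for the illustration.

```python
def des_permute(src):
    """Mirror the nine-instruction sequence from the slide, one line per instruction."""
    r13 = src >> 20     # srl  $13, $2, 20
    r25 = r13 & 1       # andi $25, $13, 1    keeps source bit 20
    r14 = src >> 21     # srl  $14, $2, 21
    r24 = r14 & 6       # andi $24, $14, 6    keeps source bits 22 and 23
    r15 = r25 | r24     # or   $15, $25, $24
    r13 = src >> 22     # srl  $13, $2, 22
    r14 = r13 & 56      # andi $14, $13, 56   keeps source bits 25, 26 and 27
    r25 = r15 | r14     # or   $25, $15, $14
    return r25 << 2     # sll  $24, $25, 2


def des_permute_reference(src):
    """Same mapping written as explicit (source bit, destination bit) pairs."""
    out = 0
    for s, d in zip([20, 22, 23, 25, 26, 27], [2, 3, 4, 5, 6, 7]):
        out |= ((src >> s) & 1) << d
    return out


print(hex(des_permute(0xFFFFFFFF)))
```

In hardware the same permutation is just six wires, which is the point the slide makes about bit level operations.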

  7. Application examples (2). Bit level operations: A5 example (GSM encryption).
     Bits 18, 17, 16 and 13 of register $5 are XORed into bit 0 of $13:
       srl  $24, $5, 18
       srl  $25, $5, 17
       xor  $8, $24, $25
       srl  $9, $5, 16
       xor  $10, $8, $9
       srl  $11, $5, 13
       xor  $12, $10, $11
       andi $13, $12, 1
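The eight instructions on slide 7 compute a single feedback bit. A minimal sketch that mirrors them, assuming `$5` holds a non-negative register value (the function name `a5_feedback` is made up here):

```python
def a5_feedback(reg):
    """XOR of bits 18, 17, 16 and 13 of reg, following the slide's instruction order."""
    r24 = reg >> 18     # srl  $24, $5, 18
    r25 = reg >> 17     # srl  $25, $5, 17
    r8 = r24 ^ r25      # xor  $8, $24, $25
    r9 = reg >> 16      # srl  $9, $5, 16
    r10 = r8 ^ r9       # xor  $10, $8, $9
    r11 = reg >> 13     # srl  $11, $5, 13
    r12 = r10 ^ r11     # xor  $12, $10, $11
    return r12 & 1      # andi $13, $12, 1


print(a5_feedback(1 << 18), a5_feedback((1 << 18) | (1 << 17)))
```

The upper bits of the intermediate values are garbage; only the final `andi` with 1 isolates the parity of the four tapped bits.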

  8. Application examples: conclusions
     • CPUs offer flexibility, but…
     • not efficient in performance
     • not efficient in code size
     • not efficient in power consumption

  9. Power consumption in microprocessors
     Power consumption is (becoming) the limiting factor in processor design.
     Solution in the direction of:
     • hardware acceleration
     • instruction level parallelism instead of clock speed
     • code size efficiency
     Source: ISSCC 2001, Patrick Gelsinger, Intel

  10. Amdahl's law
      • The impact of an improvement on the execution time of a program depends on 2 parameters:
        f = fraction of the original computation time that is affected by the improvement
        s = speedup factor (local)
      • exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
      • speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
      • if s >> 1 then speedup_overall approaches 1 / (1 - f)
      • Example: 40 % of the program can be executed 10 x faster: speedup_overall = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 ≈ 1.56
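The formula on slide 10 is easy to evaluate directly. The sketch below uses the slide's symbols `f` and `s`; the function name `overall_speedup` is an assumption made for the example.

```python
def overall_speedup(f, s):
    """Amdahl's law: overall speedup when a fraction f of the time is sped up by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


example = overall_speedup(0.4, 10)   # the slide's example
limit = 1.0 / (1.0 - 0.4)            # bound reached as s grows without limit
print(round(example, 2), round(limit, 2))
```

Even an infinitely fast accelerator for 40 % of the program cannot push the overall speedup past 1 / 0.6, about 1.67, so the unaccelerated 60 % dominates.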

  11. Conclusions
      • Programmable CPU cores are important for the control parts of the application.
      • They are well supported with tools for the development of end-user software (vs. deeply embedded software).
      • Keep it Simple heuristic (RISC vs. CISC)
      • Make frequent cases fast and rare cases correct.
      • Regular (orthogonal) instruction set
      • No special features that match a high level language construct.
      • At least 16 registers to ease register allocation.
      • Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).

  12. Programmable Digital Signal Processors
      real-time worst-case processing = need for more compute power
      sec/prog = (instr/prog) × (cycles/instr) × (sec/cycle), with CPI = 1
      • instruction level parallelism (ILP)
      • hardware support for loop control
      • attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
      • difficult to compare architectures, e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
      • benchmarking: Berkeley Design Technology Inc. (BDTI) (compare to SPECint benchmarks for CPUs)

  13. Outline
      • architectures for programmable DSPs
      • multiplier-accumulator
      • modified Harvard architecture
      • extension with an ALU (decision making)
      • controller architectures
      • examples: TI, Motorola, Philips
      • code generation
      • developments: VLIW (Very Long Instruction Word)
      • examples: C6 and TM

  14. DSP data types
      • not every signal requires 32 bits
      • 2 types of DSP: floating point and integer
      • advantage FP: most specs are in FP (conversion to int is time consuming since the behavior may change)
      • disadvantage FP: cost (area, speed, power)
      • integer multiplication doubles the number of bits: n * n => 2n
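The last bullet of slide 14, n * n => 2n, can be checked numerically. A small sketch, assuming unsigned n-bit operands (the helper name `product_bits` is made up):

```python
def product_bits(n):
    """Number of bits needed for the largest product of two unsigned n-bit integers."""
    largest = (1 << n) - 1
    return (largest * largest).bit_length()


widths = {n: product_bits(n) for n in (8, 16, 24)}
print(widths)
```

This is why a multiplier-accumulator needs a product register twice as wide as its operands, and an accumulator that is wider still if several products are to be summed without overflow.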

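  15. [Figure: multiplier-accumulator datapath. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace, ...); the product goes to the clocked product register (P_reg, PR), then to the ADDER and the clocked accumulator register (ACR), followed by SHIFT, ROUND and TRUNCATE; a control block drives the datapath.]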

  16. Memory architectures. Goal for Σ c(i) * x(i) = 1 cycle per iteration.
      • Von Neumann (sequential): one shared prog/data memory feeding the EXU
      • Harvard: separate prog mem. and data mem. feeding the EXU
      • Modified Harvard: prog mem., data mem. 1 and data mem. 2 feeding the EXU

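  17. [Figure: DSP block diagram. Blocks shown: interrupt address, reset, stack, +1 incrementer, PC, program memory, IR, control; address computation units ACU_A and ACU_B with address registers AR_A and AR_B; data memories RAM_A and RAM_B with data registers DR_A and DR_B; bus; register file (Rfile); MAC.]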

  18. Filter loop Σ c(i) * x(i) over i: 1 cycle/tap? How to update the delay line?
      [Figure: 5-tap filter with delay line x1 … x5 (four Z^-1 elements), coefficients c1 … c5, five multipliers and one adder producing y, drawn against a time axis.]

  19. Solution 2: indirect addressing
      • use of a pointer to mark the beginning of the delay line
      • problem: trashing of the whole memory
      • solution: modulo addressing
      • need for a register to store the pointer

  20. ACU architecture and instruction set (registers A and S; output goes to RAM)
      Instruction   Output   reg A   reg S
      Read_A        A        A       S
      Read_S        S        A       S
      incA          A+1      A+1     S
      decA          A-1      A-1     S
      Step          A+S      A+S     S
      Inc_step      S+1      A       S+1
      Modulo: can be implemented as a mask operation if the size is 2^k. Example: addresses 16 (10 000) to 23 (10 111); the masked upper bits (10) are held.
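The modulo addressing of slides 19 and 20 can be sketched as a mask operation. This is an illustration only, assuming a buffer whose size is a power of two and whose base address is aligned to that size; the function name `modulo_step` is invented for the example.

```python
def modulo_step(addr, step, size):
    """Advance addr by step inside an aligned circular buffer of size 2**k."""
    mask = size - 1
    # Upper bits are held; only the lower k bits wrap around.
    return (addr & ~mask) | ((addr + step) & mask)


# The slide's example: an 8-word buffer occupying addresses 16 (10 000) to 23 (10 111).
addresses = [16]
for _ in range(9):
    addresses.append(modulo_step(addresses[-1], 1, 8))
print(addresses)
```

Because the pointer wraps inside the buffer, a new sample can simply overwrite the oldest one and the delay line never has to be copied, which is what keeps the filter loop at one cycle per tap.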

  21. Addressing modes
      Mode         Example            Meaning
      register     ADD R4, R3         R[R4] = R[R4] + R[R3]
      immediate    ADD R4, #3         R[R4] = R[R4] + #3
      direct       ADD R4, (100)      R[R4] = R[R4] + Mem[100]
      indirect     ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
      w. inc/dec   ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
      indexed      ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]
      Remarks
      • direct = for static data
      • indirect = for arrays
      • inc/dec = for stepping through arrays, e.g. Σ x(n)
      • index = for stepping through arrays, e.g. Σ x(2n)

  22. Addressing modes: extra for DSP
      • 8 ARs (address or auxiliary registers) available
      • extra indirect modes:
        circular      *ARn ± %        post inc/dec by 1, circular
        circular      *ARn ± AR0 %    post inc/dec by AR0, circular
        bit reverse   *ARn ± AR0 B    post inc/dec by AR0, bit reversed
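Bit-reversed addressing from slide 22 produces the access order an FFT needs for its shuffling step. The sketch below only computes that order in software; it does not model how a particular DSP updates its address register, and the name `bit_reverse` is an assumption of this example.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Access order for an 8-point FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

On a CPU this costs a loop per address; a DSP address unit with a bit-reverse mode delivers the next address as a side effect of the memory access.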

  23. [Figure: the DSP block diagram of slide 17 extended with an ALU next to the MAC. Blocks shown: interrupt address, reset, stack, +1 incrementer, PC, program memory, IR, control; ACU_A and ACU_B with AR_A and AR_B; RAM_A and RAM_B with DR_A and DR_B; bus; register file (Rfile); MAC; ALU.]

  24. First solution for Σ c(i) * x(i): 6 clock cycles/sample
      • not shown: coefficient RAM + ACU resources
      • limit pipelines in the controller
      [Figure: resource schedule against time in clock cycles (cc).]

  25. Loop folding (software pipelining)
      Original loop:
        for i = 0 to n
          b(i) = f(a(i))
          c(i) = g(b(i))
          d(i) = h(c(i))
      Folded loop:
        for i = 2 to n
          b(i) = f(a(i))
          c(i-1) = g(b(i-1))
          d(i-2) = h(c(i-2))
      [Figure: chains a → f → b → g → c → h → d for iterations 0, 1, 2, …, drawn side by side and shifted so that f of iteration i, g of iteration i-1 and h of iteration i-2 line up in the same time step.]

  26. Loop folding (software pipelining) for Σ c(i) * x(i): 4 clock cycles/sample, plus pre- and postamble
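The folded loop of slide 25, together with the pre- and postamble mentioned on slide 26, can be written out and compared against the straightforward loop. This is a sketch with placeholder stage functions `f`, `g` and `h` chosen arbitrarily; it models the reordering only, not the parallel issue a DSP would get from it.

```python
def f(x):
    return x + 1


def g(x):
    return x * 2


def h(x):
    return x - 3


def straight(a):
    """Original loop: all three stages of iteration i run back to back."""
    return [h(g(f(x))) for x in a]


def folded(a):
    """Folded loop: each kernel pass mixes stages of three different iterations."""
    n = len(a)
    if n < 2:
        raise ValueError("this sketch needs at least two iterations")
    b, c, d = [None] * n, [None] * n, [None] * n
    # Preamble: fill the pipeline.
    b[0] = f(a[0])
    b[1] = f(a[1])
    c[0] = g(b[0])
    # Kernel: the three statements are independent of each other.
    for i in range(2, n):
        b[i] = f(a[i])
        c[i - 1] = g(b[i - 1])
        d[i - 2] = h(c[i - 2])
    # Postamble: drain the pipeline.
    c[n - 1] = g(b[n - 1])
    d[n - 2] = h(c[n - 2])
    d[n - 1] = h(c[n - 1])
    return d


print(folded([1, 2, 3, 4, 5]))
```

Since the three kernel statements no longer depend on one another, hardware with enough parallel resources can issue them in the same cycle, which is where the lower cycle count per sample comes from.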

  27. Hardware support for loop control for Σ c(i) * x(i): 1 clock cycle/sample, using the repeat instruction and repeat block

  28. TMS320C5000
      [Figure: datapath with T register, sign control blocks, 17×17 multiplier, input multiplexers, 40-bit ALU, 40-bit adder, 40-bit accumulators A and B, barrel shifter, fractional mode, COMP, MSW/LSW select, TRN, TC, and ZERO / SAT / ROUND stages.]

  29. Motorola 56K family
      [Figure: block diagram. 16-bit address buses (P, X and Y address) with external address switch; address ALU; 2,048-by-24-bit program memory ROM; X memory and Y memory, each with 256-by-24-bit RAM and 256-by-24-bit ROM; 24-bit data buses (X data, Y data, P data, global data) with internal and external data-bus switches; data ALU with a 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller with clock and interrupt inputs; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control); I/O ports (2 bits, 3 bits, 7 bits).]

  30. R.E.A.L.
      [Figure: block diagram. Two address computation units (X, Y); X data memory and Y data memory, each on a 16-bit bus; program control unit, program memory (Z data) with 96-bit instructions, instruction decoder; two 16-by-16-bit multipliers with inputs X, Y0, Y1 and products P0, P1 with scaling; two 40-bit arithmetic-logic units with shift; four 40-bit accumulators; saturation and saturation/scale stages; 16-bit buses for X data, Y data and Z data.]

  31. Compiler flow
      source → front end (lexical analysis, syntax analysis, semantic analysis) → intermediate machine independent representation → code generation (code selection, register allocation, scheduling) → code
      Scheduling decides that 1 instr = parallel (//) ops, and the order of instructions.

  32. Intermediate machine independent representation
      Control flow between basic blocks BBi, BBj, BBk; inside a basic block:
        t1 := a * b
        t2 := c + d
        t3 := t1 + c
        out := t2 * t3
      [Figure: the same block as a data-flow graph with inputs a, b, c, d, a multiply producing t1, adds producing t2 and t3, and a final multiply.]
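The basic block on slide 32 is plain three-address code, so it can be held as a list of tuples and interpreted directly. This is a minimal sketch; the tuple layout and the input values are assumptions made for the illustration, not the representation the course uses.

```python
import operator

# Hypothetical encoding: (destination, operator, left operand, right operand).
OPS = {"*": operator.mul, "+": operator.add}
block = [
    ("t1", "*", "a", "b"),
    ("t2", "+", "c", "d"),
    ("t3", "+", "t1", "c"),
    ("out", "*", "t2", "t3"),
]


def run(block, inputs):
    """Evaluate the block in program order and return all variable values."""
    env = dict(inputs)
    for dest, op, left, right in block:
        env[dest] = OPS[op](env[left], env[right])
    return env


result = run(block, {"a": 2, "b": 3, "c": 4, "d": 5})["out"]
print(result)
```

Nothing in this form mentions registers or functional units; binding the operations to a particular datapath is the job of code selection, register allocation and scheduling.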

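  33. Code selection example: ADSP [Analog Devices]
      [Figure: datapath. Data memory (d memory) and program memory (p memory) feed the ALU registers ax, ay, af and the MAC registers mx, my, mf; the ALU (+, -) takes inputs x and y and writes ar; the MAC (+, -, *) takes inputs x and y and writes mr.]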

  34. Example of code selection = covering of the intermediate representation with RTPs
      Register transfers placed over the data-flow graph (inputs a, b, c, d; nodes t1, t2, t3 and the final multiply):
        mx := dmem
        my := pmem
        ax := dmem
        ay := pmem
        mr := dmem
        ar := ax + ay
        mr := mr + (mx * my)
        my := ar
        mr := mr * my

  35. Problems
      • local decisions which have a global impact
      • phase coupling: example
      • ASAP schedule
      • maximal freedom for scheduling
      • code selection during scheduling
      • register allocation comes afterwards
      • can lead to infeasible solutions

  36. Phase coupling: discussion
      It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
      Current DSP practice = programming in assembler.
      Solution:
      1. Solve code generation for DSPs.
      2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler.
      Efficiency = exploit instruction level parallelism (ILP).
      Compilation = systematic positioning of registers and regular interconnect.
      = VLIW = Very Long Instruction Word
