Platform-based Design

Platform-based Design Digital Signal Processors TU/e 5kk70 Henk Corporaal Bart Mesman Processor Architectures and Program Mapping H. Corporaal and B. Mesman

flexibility efficiency DSP Programmable CPU Programmable DSP Application specific instruction set processor (ASIP) Application specific processor Processor Architectures and Program Mapping H. Corporaal and B. Mesman

x4 x3 x2 x1 x0 Z-1 Z-1 Z-1 Z-1 c4 c3 c2 c1 c0 * * * * * + y Application examples (1) Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Application examples (1) 19 instructions per tap!! Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Application examples (2) Bit level operations: finite field arithmetic 10 instructions!! Very simple in hardware Processor Architectures and Program Mapping H. Corporaal and B. Mesman

source register ($2) 27 26 25 23 22 20 srl $13, $2, 20 andi $25, $13, 1 srl $14, $2, 21 andi $24, $14, 6 or $15, $25, $24 srl $13, $2, 22 andi $14, $13, 56 or $25, $15, $14 sll $24, $25, 2 7 6 5 4 3 2 destination register ($24) Application examples (2) Bit level operations : DES example Processor Architectures and Program Mapping H. Corporaal and B. Mesman

18 17 16 13 $5 srl $24, $5, 18 srl $25, $5, 17 xor $8, $24, $25 srl $9, $5, 16 xor xor $10, $8, $9 srl $11, $5, 13 xor $12, $10, $11 andi $13, $12, 1 … 0 ... 1 $13 Application examples (2) Bit level operations : A5 example (GSM encryption) Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Application examples: conclusions • CPUs offer flexibility, but… • not efficient in performance • not efficient in code size • not efficient in power consumption Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Power Consumption in microprocessors Power consumption is (becoming) the limiting factor in processor design Solution in direction of • Hardware acceleration • Instruction Level Parallelism instead of clock speed • Code size efficiency source: ISSCC2001, Patrick Gelsinger, Intel Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Amdahl’s law • Impact of an improvement on the execution time of a program depends on 2 parameters: • f = fraction of the original computation time that is affected by the improvement • s = speedup factor (local) • exec_time_new = exec_time_old * (1-f) + exec_time_old * f / s • speedup_overall = exec_time_old / exec_time_new = 1 / ( 1 – f + f / s) • if s >> 1 then speedup_overall = 1 / ( 1 – f ) • Example: 40 % of program can be executed 10 x faster speedup_overall = 1 / ( 0.6 + 0.4 / 10 ) = 1.56 Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Conclusions • Programmable CPU cores are important for the control parts of the application. • They are well supported with tools to support the development of end-user software. ( vs. deeply embedded sw) • Keep it Simple heuristic (RISC vs. CISC) • Make frequent cases fast and rare cases correct. • Regular (orthogonal) instruction set • No special features that match a high level language construct. • At least 16 registers to ease register allocation. • Embedded cores are often light cores which are a compromise between performance, area and power dissipation. (vs. stand-alone CPU cores which are optimised for performance) Processor Architectures and Program Mapping H. Corporaal and B. Mesman

real-time worst-case processing = need for more compute power • sec instr cycles sec • prog prog instr cycle CPI = 1 Programmable Digital Signal Processors • instruction level parallelism (ILP) • hardware support for loop control • attention for high level data types e.g. arrays, delaylines • (vs. scalars for CPUs) • difficult to compare architectures • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, • shuffling, intialisation … can be included or forgotten • benchmarking (Berkeley Design Technology Inc (BDTi)) • (compare to SpecInt benchmarks for CPs) Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Outline • architectures for programmable DSPs • multiplier-accumulator • modified Harvard architecture • extension with an ALU (decision making) • controller architectures • examples: TI, Motorola, Philips • code generation • developments: VLIW (Very Long Instruction Word) • examples: C6 and TM Processor Architectures and Program Mapping H. Corporaal and B. Mesman

DSP data types • not every signal requires 32 bits • 2 types of DSP: floating point and integer • advantages FP: most specs are in FP • (conversion to int is time consuming since the behavior • may change) • disadvantage FP: cost (area, speed, power) • integer multiplication doubles the number of bits: n * n => 2n Processor Architectures and Program Mapping H. Corporaal and B. Mesman

c(i) x(i) control MPY (Booth, Wallace..) clock P_reg PR ADDER clock ACR P_reg SHIFT ROUND TRUNCATE Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Prog/data memory prog mem. data mem. prog mem. data mem. 1 data mem. 2 EXU EXU EXU Harvard Modified Harvard Von Neumann (sequential)  c(i) * x(i) Goal = 1 cycle per iteration Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Interrupt address Reset ACU_A ACU_B AR_A AR_B Stack +1 PC Program Memory DR_A DR_B IR Control Bus Rfile RAM_A RAM_B MAC Processor Architectures and Program Mapping H. Corporaal and B. Mesman

time loop 1 cycle/tap ?  ci * xi filter loop i How updating the delayline ? x5 x4 x3 x2 x1 Z-1 Z-1 Z-1 Z-1 c5 c4 c3 c2 c1 * * * * * + y Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Solution 2: indirect adressing • use of a pointer to mark the begin of the delay line • problem: trashing of the whole memory • solution: modulo addressing • need for a register to store the pointer Processor Architectures and Program Mapping H. Corporaal and B. Mesman

ACU architecture and Instruction set A S Output reg A reg S Read_A A A S Read_S S A S incA A+1 A+1 S decA A-1 A-1 S Step A+S A+S S Inc_step S+1 A S+1 Modulo 16 10 000 23 10 111 mask =hold Modulo can be implemented as a mask operation if the size is 2k output to RAM Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Addressing modes • register ADD R4, R3 R[R4] = R[R4] + R[R3] • immediate ADD R4, #3 R[R4] = R[R4] + #3 • direct ADD R4, (100) R[R4] = R[R4] + Mem[100] • indirect ADD R4, (R3) R[R4] = R[R4] + Mem[R[R3]] • w. inc/dec ADD R4, (R3)± R[R4] = R[R4] + Mem[R[R3]] • R[R3] = R[R3] ± 1 • indexed ADD R4, (R3±R2) R[R4] = R[R4] + Mem[R[R3]] • R[R3] = R[R3] ± R[R2] • Remarks • direct = for static data • indirect = for arrays • inc/dec = for stepping through arrays e.g.  xn • index = for stepping through arrays e.g.  x2n Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Addressing modes: extra for DSP • 8 ARs (address or auxiliary register) available • extra indirect modes • circular *ARn ± % post inc/dec by 1 - circular • *ARn ± AR0 % post inc/dec by AR0 - circular • bit reverse *ARn ± AR0 B post inc/dec by AR0 - bit rev. Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Interrupt address Reset ACU_A ACU_B AR_A AR_B Stack +1 PC RAM_A RAM_B Program Memory DR_A DR_B IR MAC ALU Control Bus Rfile Processor Architectures and Program Mapping H. Corporaal and B. Mesman

first solution  c(i) * x(i) Not shown coefficient RAM+ACU resources 6 clockcycles/sample limit pipelines in the controller time (cc) Processor Architectures and Program Mapping H. Corporaal and B. Mesman

ai a0 f f f f f for i = 0 to n bi = f(ai) ci = g(bi) di = h(ci) a1 g g g g g bi b0 ci-2 bi-1 ai for i = 2 to n bi = f(ai) ci-1 = g(bi-1) di-2 = h(ci-2) a2 h h h h h ci c0 b1 c1 di d0 b2 di-2 ci-1 bi c2 d1 d2 Loopfolding (software pipelining) Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Loopfolding (software pipelining)  c(i) * x(i) Pre- and postamble 4 clockcycles /sample Processor Architectures and Program Mapping H. Corporaal and B. Mesman

hardware support for loop control  c(i) * x(i) 1 clockcycles/sample repeat instruction and repeat block Processor Architectures and Program Mapping H. Corporaal and B. Mesman

TMS320C5000 T register E D P C D T B A T C B A C D A D Sign ctr Sign ctr A(40) B(40) Sign ctr Sign ctr Sign ctr Multiplier (17*17) MUX A ALU (40) M U A B B 0 A B Barrer shifter fractional MUX MUX COMP Adder (40) MSW/LSW select TRN ZERO SAT ROUND TC Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Address bus 16 bits Motorola 56K family EXTERNAL ADRESS SWITCH P Address Y Address X Address 2,048-by-24-bit PROGRAM MEMORY ROM X memory 256-by-24-bit RAM 256-by-24-bit ROM Address ALU Y memory 256-by-24-bit RAM 256-by-24-bit ROM INTERNAL DATA-BUS SWITCH X-DATA EXTERNAL DATA-BUS SWITCH 24 BITS Y DATA DATA BUS P DATA GLOBAL DATA ON CHIP PERIPHERALS, HOST, SYNCHRONOUS SERIAL INTERFACE SERIAL COMMU- NICATIONS INTERFACE, PROGRAMMED I/O, BUS CONTROL DATA ALU 24-by-24 bit MULTIPLIER- ACCUMULATOR PRODUCING 56 BIT RESULT 24 BITS PROGRAM CONTROLLER I/O PORTS 2 BITS 3 BITS 7 BITS CLOCK INTERRUPT Processor Architectures and Program Mapping H. Corporaal and B. Mesman

X Y Two address Compution units X data memory Y data memory 16 bit bus 16 bit bus Two 16-by-16 bit multipliers Program control unit Y0 Y0 Y1 Y1 X X PO P1 scale scale 96-bit instructions Program memory (Z data) Instruction decoder Two 40 bit arithmic- logic units shift Saturation Saturation Four 40 bit accumulators 16-bit bus Saturation/scale R.E.A.L. X data Buses for Y data Z data Processor Architectures and Program Mapping H. Corporaal and B. Mesman

source lexical analysis syntax analysis Front end semantic analysis Intermediate machine independent representation Code selection Register allocation Code generation scheduling 1 instr = // ops order of instr code Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Intermediate machine independent representation BBi BBk BBj a b c d * c + t1 := a * b t2 := c + d t3 := t1 + c out := t2 * t3 t1 t2 + t3 * Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Code selection example d memory p memory ADSP [Analog Devices] ax ay af mx my mf x y x y + - * MAC ALU + - ar mr Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Example of code selection = covering of intermediate representation with RTPs mx := dmem my := pmem ax := dmem ay := pmem a b c d mr := dmem * c + ar := ax + ay 3: t1 t2 + 2: Mr := mr + (mx * my) t3 * 1: my := ar mr = mr * my Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Problems • local decisions which have a global impact • phase coupling: example • asap schedule • maximal freedom for scheduling • code selection during scheduling • register allocation comes afterwards • can lead to infeasible solutions Processor Architectures and Program Mapping H. Corporaal and B. Mesman

phase coupling: discussion It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler Solution: 1. Solve code generation for DSPs 2. Step back and rethink the architecture develop an architecture which is still efficient but also a good model for building a compiler Efficiency = exploit instruction level parallelism (ILP) compilation = systematic positioning of registers and regular interconnect = VLIW = Very Long Instruction Word Processor Architectures and Program Mapping H. Corporaal and B. Mesman

Platform-based Design