400 likes | 607 Views
Embedded Processor Architecture. 5kk73. flexibility. efficiency. DSP. Programmable CPU. Programmable DSP. Application specific instruction set processor (ASIP). Application specific processor. x4. x3. x2. x1. x0. Z -1. Z -1. Z -1. Z -1. c 4. c 3. c 2.
E N D
flexibility efficiency DSP Programmable CPU Programmable DSP Application specific instruction set processor (ASIP) Application specific processor Embedded Processor Architecture Henk Corporaal / Bart Mesman
x4 x3 x2 x1 x0 Z-1 Z-1 Z-1 Z-1 c4 c3 c2 c1 c0 * * * * * + y Application examples (1) Embedded Processor Architecture Henk Corporaal / Bart Mesman
Application examples (1) 19 instructions per tap!! Embedded Processor Architecture Henk Corporaal / Bart Mesman
Application examples (2) Bit level operations: finite field arithmetic 10 instructions!! Very simple in hardware Embedded Processor Architecture Henk Corporaal / Bart Mesman
source register ($2) 27 26 25 23 22 20 srl $13, $2, 20 andi $25, $13, 1 srl $14, $2, 21 andi $24, $14, 6 or $15, $25, $24 srl $13, $2, 22 andi $14, $13, 56 or $25, $15, $14 sll $24, $25, 2 7 6 5 4 3 2 destination register ($24) Application examples (2) Bit level operations : DES example Embedded Processor Architecture Henk Corporaal / Bart Mesman
18 17 16 13 $5 srl $24, $5, 18 srl $25, $5, 17 xor $8, $24, $25 srl $9, $5, 16 xor xor $10, $8, $9 srl $11, $5, 13 xor $12, $10, $11 andi $13, $12, 1 … 0 ... 1 $13 Application examples (2) Bit level operations : A5 example (GSM encryption) Embedded Processor Architecture Henk Corporaal / Bart Mesman
Application examples: conclusions • CPUs offer flexibility, but… • not efficient in performance • not efficient in code size • not efficient in power consumption Embedded Processor Architecture Henk Corporaal / Bart Mesman
Power Consumption in microprocessors Power consumption is (becoming) the limiting factor in processor design Solution in direction of • Hardware acceleration • Instruction Level Parallelism instead of clock speed • Code size efficiency source: ISSCC2001, Patrick Gelsinger, Intel Embedded Processor Architecture Henk Corporaal / Bart Mesman
Amdahl’s law • Impact of an improvement on the execution time of a program depends on 2 parameters: • f = fraction of the original computation time that is affected by the improvement • s = speedup factor (local) • exec_time_new = exec_time_old * (1-f) + exec_time_old * f / s • speedup_overall = exec_time_old / exec_time_new = 1 / ( 1 – f + f / s) • if s >> 1 then speedup_overall = 1 / ( 1 – f ) • Example: 40 % of program can be executed 10 x faster speedup_overall = 1 / ( 0.6 + 0.4 / 10 ) = 1.56 Embedded Processor Architecture Henk Corporaal / Bart Mesman
Conclusions • Programmable CPU cores are important for the control parts of the application. • They are well supported with tools to support the development of end-user software. ( vs. deeply embedded sw) • Keep it Simple heuristic (RISC vs. CISC) • Make frequent cases fast and rare cases correct. • Regular (orthogonal) instruction set • No special features that match a high level language construct. • At least 16 registers to ease register allocation. • Embedded cores are often light cores which are a compromise between performance, area and power dissipation. (vs. stand-alone CPU cores which are optimised for performance) Embedded Processor Architecture Henk Corporaal / Bart Mesman
real-time worst-case processing = need for more compute power • sec instr cycles sec • prog prog instr cycle CPI = 1 Programmable Digital Signal Processors • instruction level parallelism (ILP) • hardware support for loop control • attention for high level data types e.g. arrays, delaylines • (vs. scalars for CPUs) • difficult to compare architectures • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, • shuffling, intialisation … can be included or forgotten • benchmarking (Berkeley Design Technology Inc (BDTi)) • (compare to SpecInt benchmarks for CPs) Embedded Processor Architecture Henk Corporaal / Bart Mesman
Outline • architectures for programmable DSPs • multiplier-accumulator • modified Harvard architecture • extension with an ALU (decision making) • controller architectures • examples: TI, Motorola, Philips • code generation • developments: VLIW (Very Long Instruction Word) • examples: C6 and TM Embedded Processor Architecture Henk Corporaal / Bart Mesman
DSP data types • not every signal requires 32 bits • 2 types of DSP: floating point and integer • advantages FP: most specs are in FP • (conversion to int is time consuming since the behavior • may change) • disadvantage FP: cost (area, speed, power) • integer multiplication doubles the number of bits: n * n => 2n Embedded Processor Architecture Henk Corporaal / Bart Mesman
c(i) x(i) control MPY (Booth, Wallace..) clock P_reg PR ADDER clock ACR P_reg SHIFT ROUND TRUNCATE Embedded Processor Architecture Henk Corporaal / Bart Mesman
Prog/data memory prog mem. data mem. prog mem. data mem. 1 data mem. 2 EXU EXU EXU Harvard Modified Harvard Von Neumann (sequential) c(i) * x(i) Goal = 1 cycle per iteration Embedded Processor Architecture Henk Corporaal / Bart Mesman
Interrupt address Reset ACU_A ACU_B AR_A AR_B Stack +1 PC Program Memory DR_A DR_B IR Control Bus Rfile RAM_A RAM_B MAC Embedded Processor Architecture Henk Corporaal / Bart Mesman
time loop 1 cycle/tap ? ci * xi filter loop i How updating the delayline ? x5 x4 x3 x2 x1 Z-1 Z-1 Z-1 Z-1 c5 c4 c3 c2 c1 * * * * * + y Embedded Processor Architecture Henk Corporaal / Bart Mesman
Solution 2: indirect adressing • use of a pointer to mark the begin of the delay line • problem: trashing of the whole memory • solution: modulo addressing • need for a register to store the pointer Embedded Processor Architecture Henk Corporaal / Bart Mesman
ACU architecture and Instruction set A S Output reg A reg S Read_A A A S Read_S S A S incA A+1 A+1 S decA A-1 A-1 S Step A+S A+S S Inc_step S+1 A S+1 Modulo 16 10 000 23 10 111 mask =hold Modulo can be implemented as a mask operation if the size is 2k output to RAM Embedded Processor Architecture Henk Corporaal / Bart Mesman
Addressing modes • register ADD R4, R3 R[R4] = R[R4] + R[R3] • immediate ADD R4, #3 R[R4] = R[R4] + #3 • direct ADD R4, (100) R[R4] = R[R4] + Mem[100] • indirect ADD R4, (R3) R[R4] = R[R4] + Mem[R[R3]] • w. inc/dec ADD R4, (R3)± R[R4] = R[R4] + Mem[R[R3]] • R[R3] = R[R3] ± 1 • indexed ADD R4, (R3±R2) R[R4] = R[R4] + Mem[R[R3]] • R[R3] = R[R3] ± R[R2] • Remarks • direct = for static data • indirect = for arrays • inc/dec = for stepping through arrays e.g. xn • index = for stepping through arrays e.g. x2n Embedded Processor Architecture Henk Corporaal / Bart Mesman
Addressing modes: extra for DSP • 8 ARs (address or auxiliary register) available • extra indirect modes • circular *ARn ± % post inc/dec by 1 - circular • *ARn ± AR0 % post inc/dec by AR0 - circular • bit reverse *ARn ± AR0 B post inc/dec by AR0 - bit rev. Embedded Processor Architecture Henk Corporaal / Bart Mesman
Interrupt address Reset ACU_A ACU_B AR_A AR_B Stack +1 PC RAM_A RAM_B Program Memory DR_A DR_B IR MAC ALU Control Bus Rfile Embedded Processor Architecture Henk Corporaal / Bart Mesman
first solution c(i) * x(i) Not shown coefficient RAM+ACU resources 6 clockcycles/sample limit pipelines in the controller time (cc) Embedded Processor Architecture Henk Corporaal / Bart Mesman
ai a0 f f f f f for i = 0 to n bi = f(ai) ci = g(bi) di = h(ci) a1 g g g g g bi b0 ci-2 bi-1 ai for i = 2 to n bi = f(ai) ci-1 = g(bi-1) di-2 = h(ci-2) a2 h h h h h ci c0 b1 c1 di d0 b2 di-2 ci-1 bi c2 d1 d2 Loopfolding (software pipelining) Embedded Processor Architecture Henk Corporaal / Bart Mesman
Loopfolding (software pipelining) c(i) * x(i) Pre- and postamble 4 clockcycles /sample Embedded Processor Architecture Henk Corporaal / Bart Mesman
hardware support for loop control c(i) * x(i) 1 clockcycles/sample repeat instruction and repeat block Embedded Processor Architecture Henk Corporaal / Bart Mesman
TMS320C5000 T register E D P C D T B A T C B A C D A D Sign ctr Sign ctr A(40) B(40) Sign ctr Sign ctr Sign ctr Multiplier (17*17) MUX A ALU (40) M U A B B 0 A B Barrer shifter fractional MUX MUX COMP Adder (40) MSW/LSW select TRN ZERO SAT ROUND TC Embedded Processor Architecture Henk Corporaal / Bart Mesman
Address bus 16 bits Motorola 56K family EXTERNAL ADRESS SWITCH P Address Y Address X Address 2,048-by-24-bit PROGRAM MEMORY ROM X memory 256-by-24-bit RAM 256-by-24-bit ROM Address ALU Y memory 256-by-24-bit RAM 256-by-24-bit ROM INTERNAL DATA-BUS SWITCH X-DATA EXTERNAL DATA-BUS SWITCH 24 BITS Y DATA DATA BUS P DATA GLOBAL DATA ON CHIP PERIPHERALS, HOST, SYNCHRONOUS SERIAL INTERFACE SERIAL COMMU- NICATIONS INTERFACE, PROGRAMMED I/O, BUS CONTROL DATA ALU 24-by-24 bit MULTIPLIER- ACCUMULATOR PRODUCING 56 BIT RESULT 24 BITS PROGRAM CONTROLLER I/O PORTS 2 BITS 3 BITS 7 BITS CLOCK INTERRUPT Embedded Processor Architecture Henk Corporaal / Bart Mesman
X Y Two address Compution units X data memory Y data memory 16 bit bus 16 bit bus Two 16-by-16 bit multipliers Program control unit Y0 Y0 Y1 Y1 X X PO P1 scale scale 96-bit instructions Program memory (Z data) Instruction decoder Two 40 bit arithmic- logic units shift Saturation Saturation Four 40 bit accumulators 16-bit bus Saturation/scale R.E.A.L. X data Buses for Y data Z data Embedded Processor Architecture Henk Corporaal / Bart Mesman
source lexical analysis syntax analysis Front end semantic analysis Intermediate machine independent representation Code selection Register allocation Code generation scheduling 1 instr = // ops order of instr code Embedded Processor Architecture Henk Corporaal / Bart Mesman
Intermediate machine independent representation BBi BBk BBj a b c d * c + t1 := a * b t2 := c + d t3 := t1 + c out := t2 * t3 t1 t2 + t3 * Embedded Processor Architecture Henk Corporaal / Bart Mesman
Code selection example d memory p memory ADSP [Analog Devices] ax ay af mx my mf x y x y + - * MAC ALU + - ar mr Embedded Processor Architecture Henk Corporaal / Bart Mesman
Example of code selection = covering of intermediate representation with RTPs mx := dmem my := pmem ax := dmem ay := pmem a b c d mr := dmem * c + ar := ax + ay 3: t1 t2 + 2: Mr := mr + (mx * my) t3 * 1: my := ar mr = mr * my Embedded Processor Architecture Henk Corporaal / Bart Mesman
Problems • local decisions which have a global impact • phase coupling: example • asap schedule • maximal freedom for scheduling • code selection during scheduling • register allocation comes afterwards • can lead to infeasible solutions Embedded Processor Architecture Henk Corporaal / Bart Mesman
phase coupling: discussion It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler Solution: 1. Solve code generation for DSPs 2. Step back and rethink the architecture develop an architecture which is still efficient but also a good model for building a compiler Efficiency = exploit instruction level parallelism (ILP) compilation = systematic positioning of registers and regular interconnect = VLIW = Very Long Instruction Word Embedded Processor Architecture Henk Corporaal / Bart Mesman