Embedded Processor Architecture

Embedded Processor Architecture 5kk73

flexibility efficiency DSP Programmable CPU Programmable DSP Application specific instruction set processor (ASIP) Application specific processor Embedded Processor Architecture Henk Corporaal / Bart Mesman

x4 x3 x2 x1 x0 Z-1 Z-1 Z-1 Z-1 c4 c3 c2 c1 c0 * * * * * + y Application examples (1) Embedded Processor Architecture Henk Corporaal / Bart Mesman

Application examples (1) 19 instructions per tap!! Embedded Processor Architecture Henk Corporaal / Bart Mesman

Application examples (2) Bit level operations: finite field arithmetic 10 instructions!! Very simple in hardware Embedded Processor Architecture Henk Corporaal / Bart Mesman

source register ($2) 27 26 25 23 22 20 srl $13, $2, 20 andi $25, $13, 1 srl $14, $2, 21 andi $24, $14, 6 or $15, $25, $24 srl $13, $2, 22 andi $14, $13, 56 or $25, $15, $14 sll $24, $25, 2 7 6 5 4 3 2 destination register ($24) Application examples (2) Bit level operations : DES example Embedded Processor Architecture Henk Corporaal / Bart Mesman

18 17 16 13 $5 srl $24, $5, 18 srl $25, $5, 17 xor $8, $24, $25 srl $9, $5, 16 xor xor $10, $8, $9 srl $11, $5, 13 xor $12, $10, $11 andi $13, $12, 1 … 0 ... 1 $13 Application examples (2) Bit level operations : A5 example (GSM encryption) Embedded Processor Architecture Henk Corporaal / Bart Mesman

Application examples: conclusions • CPUs offer flexibility, but… • not efficient in performance • not efficient in code size • not efficient in power consumption Embedded Processor Architecture Henk Corporaal / Bart Mesman

Power Consumption in microprocessors Power consumption is (becoming) the limiting factor in processor design Solution in direction of • Hardware acceleration • Instruction Level Parallelism instead of clock speed • Code size efficiency source: ISSCC2001, Patrick Gelsinger, Intel Embedded Processor Architecture Henk Corporaal / Bart Mesman

Amdahl’s law • Impact of an improvement on the execution time of a program depends on 2 parameters: • f = fraction of the original computation time that is affected by the improvement • s = speedup factor (local) • exec_time_new = exec_time_old * (1-f) + exec_time_old * f / s • speedup_overall = exec_time_old / exec_time_new = 1 / ( 1 – f + f / s) • if s >> 1 then speedup_overall = 1 / ( 1 – f ) • Example: 40 % of program can be executed 10 x faster speedup_overall = 1 / ( 0.6 + 0.4 / 10 ) = 1.56 Embedded Processor Architecture Henk Corporaal / Bart Mesman

Conclusions • Programmable CPU cores are important for the control parts of the application. • They are well supported with tools to support the development of end-user software. ( vs. deeply embedded sw) • Keep it Simple heuristic (RISC vs. CISC) • Make frequent cases fast and rare cases correct. • Regular (orthogonal) instruction set • No special features that match a high level language construct. • At least 16 registers to ease register allocation. • Embedded cores are often light cores which are a compromise between performance, area and power dissipation. (vs. stand-alone CPU cores which are optimised for performance) Embedded Processor Architecture Henk Corporaal / Bart Mesman

real-time worst-case processing = need for more compute power • sec instr cycles sec • prog prog instr cycle CPI = 1 Programmable Digital Signal Processors • instruction level parallelism (ILP) • hardware support for loop control • attention for high level data types e.g. arrays, delaylines • (vs. scalars for CPUs) • difficult to compare architectures • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, • shuffling, intialisation … can be included or forgotten • benchmarking (Berkeley Design Technology Inc (BDTi)) • (compare to SpecInt benchmarks for CPs) Embedded Processor Architecture Henk Corporaal / Bart Mesman

Outline • architectures for programmable DSPs • multiplier-accumulator • modified Harvard architecture • extension with an ALU (decision making) • controller architectures • examples: TI, Motorola, Philips • code generation • developments: VLIW (Very Long Instruction Word) • examples: C6 and TM Embedded Processor Architecture Henk Corporaal / Bart Mesman

DSP data types • not every signal requires 32 bits • 2 types of DSP: floating point and integer • advantages FP: most specs are in FP • (conversion to int is time consuming since the behavior • may change) • disadvantage FP: cost (area, speed, power) • integer multiplication doubles the number of bits: n * n => 2n Embedded Processor Architecture Henk Corporaal / Bart Mesman

c(i) x(i) control MPY (Booth, Wallace..) clock P_reg PR ADDER clock ACR P_reg SHIFT ROUND TRUNCATE Embedded Processor Architecture Henk Corporaal / Bart Mesman

Prog/data memory prog mem. data mem. prog mem. data mem. 1 data mem. 2 EXU EXU EXU Harvard Modified Harvard Von Neumann (sequential)  c(i) * x(i) Goal = 1 cycle per iteration Embedded Processor Architecture Henk Corporaal / Bart Mesman

Interrupt address Reset ACU_A ACU_B AR_A AR_B Stack +1 PC Program Memory DR_A DR_B IR Control Bus Rfile RAM_A RAM_B MAC Embedded Processor Architecture Henk Corporaal / Bart Mesman

time loop 1 cycle/tap ?  ci * xi filter loop i How updating the delayline ? x5 x4 x3 x2 x1 Z-1 Z-1 Z-1 Z-1 c5 c4 c3 c2 c1 * * * * * + y Embedded Processor Architecture Henk Corporaal / Bart Mesman

Solution 2: indirect adressing • use of a pointer to mark the begin of the delay line • problem: trashing of the whole memory • solution: modulo addressing • need for a register to store the pointer Embedded Processor Architecture Henk Corporaal / Bart Mesman

ACU architecture and Instruction set A S Output reg A reg S Read_A A A S Read_S S A S incA A+1 A+1 S decA A-1 A-1 S Step A+S A+S S Inc_step S+1 A S+1 Modulo 16 10 000 23 10 111 mask =hold Modulo can be implemented as a mask operation if the size is 2k output to RAM Embedded Processor Architecture Henk Corporaal / Bart Mesman

Addressing modes • register ADD R4, R3 R[R4] = R[R4] + R[R3] • immediate ADD R4, #3 R[R4] = R[R4] + #3 • direct ADD R4, (100) R[R4] = R[R4] + Mem[100] • indirect ADD R4, (R3) R[R4] = R[R4] + Mem[R[R3]] • w. inc/dec ADD R4, (R3)± R[R4] = R[R4] + Mem[R[R3]] • R[R3] = R[R3] ± 1 • indexed ADD R4, (R3±R2) R[R4] = R[R4] + Mem[R[R3]] • R[R3] = R[R3] ± R[R2] • Remarks • direct = for static data • indirect = for arrays • inc/dec = for stepping through arrays e.g.  xn • index = for stepping through arrays e.g.  x2n Embedded Processor Architecture Henk Corporaal / Bart Mesman

Addressing modes: extra for DSP • 8 ARs (address or auxiliary register) available • extra indirect modes • circular *ARn ± % post inc/dec by 1 - circular • *ARn ± AR0 % post inc/dec by AR0 - circular • bit reverse *ARn ± AR0 B post inc/dec by AR0 - bit rev. Embedded Processor Architecture Henk Corporaal / Bart Mesman

Interrupt address Reset ACU_A ACU_B AR_A AR_B Stack +1 PC RAM_A RAM_B Program Memory DR_A DR_B IR MAC ALU Control Bus Rfile Embedded Processor Architecture Henk Corporaal / Bart Mesman

first solution  c(i) * x(i) Not shown coefficient RAM+ACU resources 6 clockcycles/sample limit pipelines in the controller time (cc) Embedded Processor Architecture Henk Corporaal / Bart Mesman

ai a0 f f f f f for i = 0 to n bi = f(ai) ci = g(bi) di = h(ci) a1 g g g g g bi b0 ci-2 bi-1 ai for i = 2 to n bi = f(ai) ci-1 = g(bi-1) di-2 = h(ci-2) a2 h h h h h ci c0 b1 c1 di d0 b2 di-2 ci-1 bi c2 d1 d2 Loopfolding (software pipelining) Embedded Processor Architecture Henk Corporaal / Bart Mesman

Loopfolding (software pipelining)  c(i) * x(i) Pre- and postamble 4 clockcycles /sample Embedded Processor Architecture Henk Corporaal / Bart Mesman

hardware support for loop control  c(i) * x(i) 1 clockcycles/sample repeat instruction and repeat block Embedded Processor Architecture Henk Corporaal / Bart Mesman

TMS320C5000 T register E D P C D T B A T C B A C D A D Sign ctr Sign ctr A(40) B(40) Sign ctr Sign ctr Sign ctr Multiplier (17*17) MUX A ALU (40) M U A B B 0 A B Barrer shifter fractional MUX MUX COMP Adder (40) MSW/LSW select TRN ZERO SAT ROUND TC Embedded Processor Architecture Henk Corporaal / Bart Mesman

Address bus 16 bits Motorola 56K family EXTERNAL ADRESS SWITCH P Address Y Address X Address 2,048-by-24-bit PROGRAM MEMORY ROM X memory 256-by-24-bit RAM 256-by-24-bit ROM Address ALU Y memory 256-by-24-bit RAM 256-by-24-bit ROM INTERNAL DATA-BUS SWITCH X-DATA EXTERNAL DATA-BUS SWITCH 24 BITS Y DATA DATA BUS P DATA GLOBAL DATA ON CHIP PERIPHERALS, HOST, SYNCHRONOUS SERIAL INTERFACE SERIAL COMMU- NICATIONS INTERFACE, PROGRAMMED I/O, BUS CONTROL DATA ALU 24-by-24 bit MULTIPLIER- ACCUMULATOR PRODUCING 56 BIT RESULT 24 BITS PROGRAM CONTROLLER I/O PORTS 2 BITS 3 BITS 7 BITS CLOCK INTERRUPT Embedded Processor Architecture Henk Corporaal / Bart Mesman

X Y Two address Compution units X data memory Y data memory 16 bit bus 16 bit bus Two 16-by-16 bit multipliers Program control unit Y0 Y0 Y1 Y1 X X PO P1 scale scale 96-bit instructions Program memory (Z data) Instruction decoder Two 40 bit arithmic- logic units shift Saturation Saturation Four 40 bit accumulators 16-bit bus Saturation/scale R.E.A.L. X data Buses for Y data Z data Embedded Processor Architecture Henk Corporaal / Bart Mesman

source lexical analysis syntax analysis Front end semantic analysis Intermediate machine independent representation Code selection Register allocation Code generation scheduling 1 instr = // ops order of instr code Embedded Processor Architecture Henk Corporaal / Bart Mesman

Intermediate machine independent representation BBi BBk BBj a b c d * c + t1 := a * b t2 := c + d t3 := t1 + c out := t2 * t3 t1 t2 + t3 * Embedded Processor Architecture Henk Corporaal / Bart Mesman

Code selection example d memory p memory ADSP [Analog Devices] ax ay af mx my mf x y x y + - * MAC ALU + - ar mr Embedded Processor Architecture Henk Corporaal / Bart Mesman

Example of code selection = covering of intermediate representation with RTPs mx := dmem my := pmem ax := dmem ay := pmem a b c d mr := dmem * c + ar := ax + ay 3: t1 t2 + 2: Mr := mr + (mx * my) t3 * 1: my := ar mr = mr * my Embedded Processor Architecture Henk Corporaal / Bart Mesman

Problems • local decisions which have a global impact • phase coupling: example • asap schedule • maximal freedom for scheduling • code selection during scheduling • register allocation comes afterwards • can lead to infeasible solutions Embedded Processor Architecture Henk Corporaal / Bart Mesman

phase coupling: discussion It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler Solution: 1. Solve code generation for DSPs 2. Step back and rethink the architecture develop an architecture which is still efficient but also a good model for building a compiler Efficiency = exploit instruction level parallelism (ILP) compilation = systematic positioning of registers and regular interconnect = VLIW = Very Long Instruction Word Embedded Processor Architecture Henk Corporaal / Bart Mesman

Embedded Processor Architecture

Embedded Processor Architecture

Presentation Transcript

Embedded Processor Architecture

ARM Processor Architecture

Processor Architecture

Reconfigurable Embedded Processor Peripherals

Processor Architecture

Idempotent Processor Architecture

Embedded Computer Architecture

Embedded Computer Architecture

Basic Processor Architecture

Embedded Computer Architecture

Processor Architecture

Embedded Processor 의 설계

ARM Processor Architecture

Embedded Processor Architecture 5kk73

Embedded Systems: Hardware Computer Processor Basics ISA (Instruction Set Architecture)

Embedded Computer Architecture

80x86 Processor Architecture

80x86 Processor Architecture

Processor System Architecture

Embedded Computer Architecture

Embedded Systems Architecture