Embedded Processor Architecture Bart Mesman Henk Corporaal 5kk73 2010
[Figure: spectrum of processor types between flexibility and efficiency: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]
Processor Architectures and Program Mapping, H. Corporaal and B. Mesman
[Figure: efficiency (low, medium, high) plotted against flexibility (low, medium, high) for ASIC, ASIP, DSP, GP proc and FPGA]
Programmable CPU cores
• introduction
• architecture of the MIPS core (discussed as an example)
• pipelining
• application examples
• software issues
• comparison between different CPU cores
• towards application specific architectures
• discussion
Introduction
• rationale: general-purpose -> large market
• consequence: often handcrafted design optimised for clock rate
• problem: fast changes in the IC process technology
• examples embedded:
• MIPS (first one, licensing the instruction set architecture)
• ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, also licensing the micro-architecture as hard or soft IP)
• derivatives from general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC
Introduction: instruction set architectures
• implicit operands: stack machines (e.g. ST20), accumulator machines
• explicit operands: general purpose registers, either register-register (= load-store) or register-memory
Architecture of the MIPS core [Hennessy & Patterson]
[Figure: datapath with PC, instruction memory, a register file of 32 32-bit registers (ports Rw, Ra, Rb addressed by the 5-bit Rd, Rt, Rs fields; 16-bit Imm field) and data memory with 32-bit address, data in and data out, all clocked by Clk]
MIPS instruction formats (32 bits) [Hennessy & Patterson]
R-type: Op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
I-type: Op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
J-type: Op (bits 31-26, 6 bits) | target address (25-0, 26 bits)
• op: operation of the instruction
• rs, rt, rd: source and destination registers
• shamt: shift amount
• funct: operation of the instruction, part 2
• imm: for program constants
• addr: target address of a jump
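The field boundaries listed above can be checked with a few shifts and masks. The following is a minimal Python sketch, not part of the original slides; the function name `decode` and the dictionary keys are illustrative assumptions.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its raw fields.

    All fields are extracted regardless of format; which ones are
    meaningful depends on whether the word is R-, I- or J-type.
    """
    return {
        "op": (word >> 26) & 0x3F,      # bits 31-26
        "rs": (word >> 21) & 0x1F,      # bits 25-21
        "rt": (word >> 16) & 0x1F,      # bits 20-16
        "rd": (word >> 11) & 0x1F,      # bits 15-11 (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10-6  (R-type)
        "funct": word & 0x3F,           # bits 5-0   (R-type)
        "imm": word & 0xFFFF,           # bits 15-0  (I-type)
        "target": word & 0x3FFFFFF,     # bits 25-0  (J-type)
    }


# add $3, $1, $2 encoded as op=0, rs=1, rt=2, rd=3, shamt=0, funct=0x20
fields = decode(0x00221820)
print(fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```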
Example 1: R-type: add instruction [Hennessy & Patterson]
Format: Op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
• add rd, rs, rt
• mem[PC] (fetch the instruction)
• R[rd] = R[rs] + R[rt]
• PC = PC + 4
[Figure: register file of 32 32-bit registers (Rw, Ra, Rb selected by Rd, Rs, Rt) driving BusA and BusB into the ALU (ALUctr); the Result returns on BusW, written under RegWr and Clk]
Critical path R-type operation [Hennessy & Patterson]
[Figure: the datapath of the MIPS core (PC, instruction memory, register file, data memory) with the critical path of an R-type operation marked]
Example 2: I-type: load word [Hennessy & Patterson]
Format: Op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
• lw rs, rt, imm16
• mem[PC] (fetch the instruction)
• addr = R[rs] + ext(imm16)
• R[rt] = mem[addr]
• PC = PC + 4
[Figure: register file, 16-to-32-bit Extender, ALU and data memory, with control signals RegDst, RegWr, ALUctr, ALUSrc, ExtOp, MemWr and MemtoReg]
Example 3: I-type: branch [Hennessy & Patterson]
Format: Op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
• beq rs, rt, imm16
• mem[PC] (fetch the instruction)
• cond = R[rs] - R[rt]
• if cond = 0: PC = PC + 4 + ext(imm16)*4
• else: PC = PC + 4
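The next-PC rule for beq can be written out directly. This is a small Python sketch under the assumption that ext() is a sign extension of the 16-bit immediate and that the PC wraps at 32 bits; the helper names `sign_extend16` and `next_pc` are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's complement value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, rs_val, rt_val, imm16):
    """Next PC for beq: branch when R[rs] - R[rt] is zero."""
    cond = (rs_val - rt_val) & 0xFFFFFFFF
    if cond == 0:
        # taken: the offset counts instructions, so it is scaled by 4
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(next_pc(0x100, 5, 5, 3)))       # taken, forward
print(hex(next_pc(0x100, 5, 6, 3)))       # not taken
print(hex(next_pc(0x100, 1, 1, 0xFFFF)))  # taken, offset -1
```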
Example 3: I-type: branch [Hennessy & Patterson]
[Figure: branch datapath: the register file drives BusA and BusB into the ALU, whose Zero output together with the Branch signal and Imm16 feeds the Next Address Logic that updates the PC sent to the instruction memory; control signals RegDst, RegWr, ALUctr, ALUSrc, ExtOp]
Example 3: I-type: branch [Hennessy & Patterson]
[Figure: next address logic: a 30-bit PC forms Addr<31:2> with Addr<1:0> fixed to 00; one adder increments the PC by 1, a second adds the sign-extended Instruction<15:0> (Imm16), and a multiplexer controlled by Branch and Zero selects the new PC]
Architecture of the MIPS core
• problem: long critical path
• defined by the slowest instruction (load)
• solution? pipelining
• break the instruction into smaller steps
• all steps have about the same critical path
E.g. load, 5 stages (cycles 1 to 5): Ifetch, RF read, ALU, dmem, RF write
Pipelining lw instructions [Hennessy & Patterson]
Each lw passes through Ifetch, RF read, ALU, dmem, RF write:
• first lw: cycles 1 to 5
• second lw: cycles 2 to 6
• third lw: cycles 3 to 7
• One instruction enters the pipeline every clock cycle
• One instruction leaves the pipeline every clock cycle
• => CPI = 1 (Cycles per Instruction)
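The cycle count behind CPI = 1 can be sketched in a few lines of Python. The sketch assumes an ideal 5-stage pipeline without stalls; the function names are hypothetical.

```python
STAGES = 5  # Ifetch, RF read, ALU, dmem, RF write


def pipelined_cycles(n_instructions, stages=STAGES):
    """Cycles to finish n instructions in an ideal pipeline without stalls."""
    if n_instructions == 0:
        return 0
    # the first instruction needs `stages` cycles, every later one adds 1
    return stages + n_instructions - 1


def cpi(n_instructions, stages=STAGES):
    """Average cycles per instruction; tends to 1 for long programs."""
    return pipelined_cycles(n_instructions, stages) / n_instructions


print(pipelined_cycles(3))  # the three lw instructions of the slide
print(cpi(1000))
```

With three lw instructions the sketch gives 7 cycles, which matches the seven cycles drawn on the slide; the fill time of 4 cycles becomes negligible for long instruction streams.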
Pipelining lw instructions
[Figure: five overlapping instructions, each passing through the stages I, R, A, M, W; instructions and data shown for the current CPU cycle]
4 stages of R-type instruction [Hennessy & Patterson]
E.g. ADD (cycles 1 to 4): Ifetch, RF read, ALU, RF write
Pipelining lw and R-type instructions [Hennessy & Patterson]
• lw (5 stages): Ifetch, RF read, ALU, dmem, RF write
• add (4 stages), issued one cycle later: Ifetch, RF read, ALU, RF write
• Resource conflict on the write port of the Rfile: both instructions reach RF write in the same cycle
Solution: stretch R-type to 5 stages [Hennessy & Patterson]
• R-type becomes Ifetch, RF read, ALU, dmem, RF write, where dmem is a dummy op (noop)
• lw: cycles 1 to 5
• first add: cycles 2 to 6
• second add: cycles 3 to 7
[Figure: pipelined datapath [Hennessy & Patterson] with stages Ifetch, Reg/dec, exec, mem and wr: next-PC logic (+4, branch, flags), program memory, Rfile (Ra, Rb, Rw; BusA, BusB), Imm16 extender, data memory, and control signals RegWr, RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg]
Data dependencies: R-type instructions [Hennessy & Patterson]
R1 = ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
[Figure: pipeline diagram (IM, RF, DM, RF per instruction) showing the four following instructions reading R1]
Data dependencies: R-type instructions [Hennessy & Patterson]
[Figure: the same pipeline diagram for R1 = ... followed by four instructions reading R1, with forwarding paths drawn in]
Solution: bypasses
Bypasses [Hennessy & Patterson]
[Figure: pipelined datapath with bypass paths added around the data memory stage]
Data dependencies: load instruction [Hennessy & Patterson]
R1 = lw...
… = R1 + ...
… = R1 + ...
… = R1 + ...
[Figure: pipeline diagram (IM, RF, DM, RF per instruction) showing the three following instructions reading R1]
Data dependencies: load instruction [Hennessy & Patterson]
R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...
Bypass is no solution for the + instruction, which directly follows the load.
Data dependencies: load instruction [Hennessy & Patterson]
R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...
[Figure: pipeline diagram in which the instructions after the load are delayed]
Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared
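The interlock rule can be illustrated with a toy hazard detector. This Python sketch assumes full bypassing, so that only an instruction directly following a load and reading its destination register needs a stall, and that one stall cycle clears the hazard; the tuple layout `(op, dest, sources)` and the name `count_load_use_stalls` are hypothetical.

```python
def count_load_use_stalls(program):
    """Count stall cycles caused by load-use hazards.

    `program` is a list of (op, dest, sources) tuples. A stall is charged
    when an instruction reads the register written by the load directly
    before it; later readers are assumed to be served by the bypasses.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


program = [
    ("lw", "R1", ("R2",)),
    ("add", "R3", ("R1", "R4")),  # reads R1 right after the load: stall
    ("sub", "R5", ("R1", "R6")),  # bypass is sufficient here
    ("sub", "R7", ("R1", "R8")),
]
print(count_load_use_stalls(program))
```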
Application examples (1)
[Figure: 5-tap filter: samples x4 ... x0 held in a delay line of four Z^-1 elements, each multiplied by its coefficient c4 ... c0 and summed into the output y]
Application examples (1)
19 instructions per tap!!
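For reference, the arithmetic of the filter in this example is a sum of products over a delay line. The following Python sketch shows that computation only; the function name `fir` is hypothetical and the code says nothing about how the 19 instructions per tap are spent on the CPU.

```python
def fir(samples, coeffs):
    """Filter `samples` with `coeffs` using an explicit delay line.

    For every input sample the delay line is shifted by one position
    (the Z^-1 elements) and the output is sum(c[k] * x[n - k]).
    """
    delay = [0] * len(coeffs)
    outputs = []
    for x in samples:
        delay = [x] + delay[:-1]  # shift the delay line
        outputs.append(sum(c * d for c, d in zip(coeffs, delay)))
    return outputs


# an impulse at the input reproduces the coefficients at the output
print(fir([1, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5]))
```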
Application examples (2)
Bit level operations: finite field arithmetic
• 10 instructions!!
• Very simple in hardware
Application examples (2)
Bit level operations: DES example
Bits 27, 26, 25, 23, 22, 20 of the source register ($2) are gathered into bits 7, 6, 5, 4, 3, 2 of the destination register ($24):
srl $13, $2, 20
andi $25, $13, 1
srl $14, $2, 21
andi $24, $14, 6
or $15, $25, $24
srl $13, $2, 22
andi $14, $13, 56
or $25, $15, $14
sll $24, $25, 2
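The shift-and-mask sequence of the DES example can be mirrored one step at a time in Python to check which source bit lands where. This is an illustrative sketch; the function name `des_gather` is hypothetical and register values are assumed to be 32-bit unsigned.

```python
def des_gather(src):
    """Mirror the srl/andi/or/sll sequence of the DES example."""
    t = (src >> 20) & 1        # source bit 20         -> bit 0
    t |= (src >> 21) & 6       # source bits 23, 22    -> bits 2, 1
    t |= (src >> 22) & 56      # source bits 27, 26, 25 -> bits 5, 4, 3
    return (t << 2) & 0xFFFFFFFF  # final sll by 2     -> bits 7..2


all_six = sum(1 << b for b in (27, 26, 25, 23, 22, 20))
print(bin(des_gather(all_six)))
```

Source bits outside the six selected positions (for example bits 21 and 24) are masked away and do not reach the destination.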
Application examples (2)
Bit level operations: A5 example (GSM encryption)
Bits 18, 17, 16 and 13 of $5 are XORed into a single bit (0 or 1) in $13:
srl $24, $5, 18
srl $25, $5, 17
xor $8, $24, $25
srl $9, $5, 16
xor $10, $8, $9
srl $11, $5, 13
xor $12, $10, $11
andi $13, $12, 1
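The eight instructions of the A5 example reduce to one expression in Python. This is a sketch of the bit arithmetic only; the function name `a5_feedback` is hypothetical.

```python
def a5_feedback(reg):
    """XOR of bits 18, 17, 16 and 13 of `reg`, returned as 0 or 1."""
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1


print(a5_feedback(1 << 18))                # one tap set
print(a5_feedback((1 << 18) | (1 << 17)))  # two taps set cancel out
```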
Application examples: conclusions
• CPUs offer flexibility, but…
• not efficient in performance
• not efficient in code size
• not efficient in power consumption
Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design.
Solutions go in the direction of:
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency
source: ISSCC2001, Patrick Gelsinger, Intel
Amdahl's law
• Impact of an improvement on the execution time of a program depends on 2 parameters:
• f = fraction of the original computation time that is affected by the improvement
• s = speedup factor (local)
• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall ≈ 1 / (1 - f)
• Example: 40 % of the program can be executed 10 x faster: speedup_overall = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 = 1.56
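The formula is easy to evaluate for other values of f and s. A minimal Python sketch, with the hypothetical function name `overall_speedup`:

```python
def overall_speedup(f, s):
    """Amdahl's law: f = affected fraction of the time, s = local speedup."""
    return 1.0 / ((1.0 - f) + f / s)


print(round(overall_speedup(0.4, 10), 2))   # the example of the slide
print(round(overall_speedup(0.4, 1e9), 2))  # s >> 1: close to 1 / (1 - f)
```

Even an arbitrarily large local speedup on 40 % of the program cannot push the overall speedup beyond 1 / 0.6, which is about 1.67.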
Conclusions
• Programmable CPU cores are important for the control parts of the application.
• They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
• Keep it Simple heuristic (RISC vs. CISC)
• Make frequent cases fast and rare cases correct.
• Regular (orthogonal) instruction set
• No special features that match a high level language construct.
• At least 16 registers to ease register allocation.
• Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).
Programmable Digital Signal Processors
• real-time worst-case processing = need for more compute power
• sec/prog = (instr/prog) x (cycles/instr) x (sec/cycle), with CPI = cycles/instr = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures: e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
• benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)
Outline
• architectures for programmable DSPs
• multiplier-accumulator
• modified Harvard architecture
• extension with an ALU (decision making)
• controller architectures
• examples: TI, Motorola, Philips
• code generation
• recent developments: VLIW (Very Long Instruction Word)
• examples: C6 and TM
Sum of products = basic operation for correlation, filtering, spectral analysis ...
Goal = 1 cycle per iteration of c(i) * x(i)
[Figure: multiplier-accumulator: c(i) and x(i) feed MPY (Booth, Wallace..), followed by the clocked product register P_reg (PR), the ADDER and the accumulator ACR, under control of a linear expression]
Modifications:
• position ACR (1 or 2)
• adder/subtractor
• extra pipelines
• asymmetric inputs
• multi-precision
• extra inputs/outputs
DSP data types
• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
• disadvantage FP: cost (area, speed, power)
• wanted: type of output of an operation = type of input (because both are stored in RAM)
• no problem for FP, but a problem for integer
• integer multiplication doubles the number of bits: n * n => 2n
• What about fractional numbers? E.g. 0.9 x 0.9 = 0.81
DSP data types
• integer and fractional numbers are a special case of fixed point
• fix <p,q> (ART designer & SystemC): p bits in total, q of them fractional
• example fix <8,3>, scale factor 1/8, 2's complement (the msb has a negative weight):
  weight: -2^4  2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
  bit:      1    1    1    0    1    1     0     1
  value: -16 + 8 + 4 + 1 + 0.5 + 0.125 = -2.375 = -19/8
• quantization error: set by the weight of the lsb (2^-3)
• Same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
• if q = 0 then integer, e.g. int <8,0>
• if q = p-1 then fractional, e.g. int <8,7>
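The fix <8,3> example can be reproduced with a short conversion routine. The sketch below assumes p is the total word length and q the number of fractional bits, as in the example; the function names `to_fix` and `from_fix` are hypothetical and are not the ART Designer or SystemC API.

```python
def to_fix(value, p, q):
    """Encode `value` as a p-bit two's complement word with q fractional bits."""
    raw = round(value * (1 << q))  # apply the scale factor 2^q
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    if not lo <= raw <= hi:
        raise OverflowError("value does not fit in fix<%d,%d>" % (p, q))
    return raw & ((1 << p) - 1)    # keep the p-bit pattern


def from_fix(bits, p, q):
    """Decode a p-bit two's complement word with q fractional bits."""
    if bits & (1 << (p - 1)):      # msb carries the negative weight
        bits -= 1 << p
    return bits / (1 << q)


word = to_fix(-2.375, 8, 3)
print(format(word, "08b"), from_fix(word, 8, 3))
```

The same bit pattern decoded with q = 0 gives the integer -19, which illustrates why one ALU can serve all values of q.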
DSP data types
• continue (after multiplication) with msb only
• represents the limit of the accuracy of the result (cannot be larger than the accuracy of the inputs)
• more efficient solution: continue with msb + lsb
• sum-of-product operations then generate accumulative noise at the 32nd vs. the 16th bit
• still overflow for addition = overflow bits
• double precision accumulator + extra overflow bits + shift, round, truncate unit
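The idea of accumulating full-precision products and rounding only once at the end can be sketched as follows. The sketch assumes 16-bit fractional inputs (fix <16,15>, often called Q15), an accumulator wide enough never to overflow (a Python integer), rounding to nearest and saturation on the final result; all of these are illustrative choices, and the name `mac_q15` is hypothetical.

```python
def mac_q15(coeffs, samples):
    """Sum of products of Q15 values with a single final rounding step."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x                   # full double-precision product
    rounded = (acc + (1 << 14)) >> 15  # shift back to Q15 with rounding
    # saturate to the 16-bit range instead of wrapping around
    return max(-32768, min(32767, rounded))


half = 1 << 14  # 0.5 in Q15
print(mac_q15([half], [half]))              # 0.25 in Q15
print(mac_q15([half, half], [half, half]))  # 0.5 in Q15
```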
[Figure: multiply-accumulate datapath: c(i) and x(i) feed MPY (Booth, Wallace..), then the clocked P_reg (PR), the ADDER and the clocked ACR, followed by a SHIFT / ROUND / TRUNCATE unit, all under control]
Memory architectures: goal = 1 cycle per iteration of c(i) * x(i)
• Von Neumann (sequential): one prog/data memory connected to the EXU
• Harvard: prog mem. and data mem. connected to the EXU
• Modified Harvard: prog mem., data mem. 1 and data mem. 2 connected to the EXU
[Figure: DSP architecture: controller with PC (+1, Stack, Interrupt address, Reset), Program Memory and IR; two address units ACU_A and ACU_B with registers AR_A and AR_B addressing RAM_A and RAM_B; data registers DR_A and DR_B, Rfile and MAC on the bus; Control]
Filter loop: 1 cycle/tap?
• time loop around a filter loop over i computing ci * xi
• How to update the delay line?
[Figure: 5-tap filter: samples x5 ... x1 held in a delay line of four Z^-1 elements, multiplied by the coefficients c5 ... c1 and summed into y]
Solution 2: indirect addressing
• use of a pointer to mark the beginning of the delay line
• update the pointer instead of moving the data
• problem: trashing of the whole memory, since the pointer keeps advancing
• solution: modulo addressing
• need for a register to store the pointer
ACU architecture and instruction set
The ACU holds two registers, A and S; its output goes to the RAM.
Instruction | Output | reg A | reg S
Read_A | A | A | S
Read_S | S | A | S
incA | A+1 | A+1 | S
decA | A-1 | A-1 | S
Step | A+S | A+S | S
Inc_step | S+1 | A | S+1
Modulo: e.g. a buffer from address 16 (10 000) to 23 (10 111); the mask holds the upper bits (10) and only the lower bits count.
Modulo can be implemented as a mask operation if the size is 2^k.
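The mask-based modulo update can be sketched in Python. The sketch assumes the buffer size is a power of two and the buffer is aligned to its size, as in the 16 to 23 example; the function name `modulo_step` is hypothetical.

```python
def modulo_step(addr, step, size):
    """Advance `addr` by `step` inside an aligned circular buffer.

    `size` must be a power of two (2^k); the upper address bits are held
    and only the lower k bits take part in the addition.
    """
    mask = size - 1
    base = addr & ~mask  # held upper bits
    return base | ((addr + step) & mask)


print(modulo_step(23, 1, 8))   # wraps from the end of the buffer to 16
print(modulo_step(16, -1, 8))  # wraps from the start of the buffer to 23
```

Because the wrap is a plain bitwise AND, no comparison against the buffer bounds is needed, which is what makes this form cheap in an address unit.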
Addressing modes
• register: ADD R4, R3 => R[R4] = R[R4] + R[R3]
• immediate: ADD R4, #3 => R[R4] = R[R4] + #3
• direct: ADD R4, (100) => R[R4] = R[R4] + Mem[100]
• indirect: ADD R4, (R3) => R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec: ADD R4, (R3)± => R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
• indexed: ADD R4, (R3±R2) => R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]
Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x[n]
• index = for stepping through arrays, e.g. x[2n]
Addressing modes: extra for DSP
• 8 ARs (address or auxiliary registers) available
• extra indirect modes
• circular: *ARn ± % (post inc/dec by 1, circular)
• circular: *ARn ± AR0 % (post inc/dec by AR0, circular)
• bit reverse: *ARn ± AR0 B (post inc/dec by AR0, bit reversed)
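Bit-reversed addressing is typically used to reorder the data of a radix-2 FFT. The Python sketch below only shows the resulting index order; it does not model the exact register semantics of the *ARn ± AR0 B mode, and the function name `bit_reverse` is hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the lsb to the other end
        index >>= 1
    return result


# access order for an 8-point transform (3 address bits)
print([bit_reverse(i, 3) for i in range(8)])
```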