
Embedded Processor Architecture

Bart Mesman, Henk Corporaal. 5kk73, 2010.


Presentation Transcript


  1. Embedded Processor Architecture. Bart Mesman, Henk Corporaal. 5kk73, 2010.

  2. [Figure: trade-off between flexibility and efficiency across programmable CPU, programmable DSP, application specific instruction set processor (ASIP) and application specific processor.] Processor Architectures and Program Mapping, H. Corporaal and B. Mesman

  3. [Figure: efficiency (low, medium, high) plotted against flexibility (low, medium, high) for ASIC, ASIP, DSP, GP proc and FPGA.]

  4. Programmable CPU cores
     • introduction
     • architecture of the MIPS core (discussed as an example)
     • pipelining
     • application examples
     • software issues
     • comparison between different CPU cores
     • towards application specific architectures
     • discussion

  5. Introduction
     • rationale: general-purpose -> large market
     • consequence: often handcrafted design optimised for clock rate
     • problem: fast changes in the IC process technology
     • embedded examples:
       • MIPS (the first one; licenses the instruction set architecture)
       • ARM (Advanced RISC Machines; telecom, low power, small code size, the most popular one; also licenses the micro-architecture as hard or soft IP)
       • derivatives of general purpose CPUs: Intel, NEC, Hitachi, National, PowerPC

  6. Introduction: instruction set architectures
     • implicit operands
       • stack machines (e.g. ST20)
       • accumulator machines
     • explicit operands
       • general purpose registers
         • register-register = load-store
         • register-memory

  7. Architecture of the MIPS core [Hennessy & Patterson]
     [Figure: single-cycle datapath. The PC supplies the instruction address to the instruction memory; the instruction fields Rs, Rt, Rd (5 bits each) and Imm (16 bits) address a register file of 32 32-bit registers (ports Rw, Ra, Rb); the data memory has a data address, data in and data out (32 bits each); PC, register file and data memory are clocked by Clk.]

  8. MIPS instruction formats (32 bits) [Hennessy & Patterson]
     R-type:  op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | rd [15:11], 5 bits | shamt [10:6], 5 bits | funct [5:0], 6 bits
     I-type:  op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | immediate [15:0], 16 bits
     J-type:  op [31:26], 6 bits | target address [25:0], 26 bits
     Fields:
     • op: operation of the instruction
     • rs, rt, rd: source and destination registers
     • shamt: shift amount
     • funct: operation of the instruction, part 2
     • imm: program constants
     • addr: target address of a jump
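
The field boundaries above translate directly into shifts and masks. The following is a minimal sketch, not part of the original slides; the function name `decode_mips` and the example word are illustrative assumptions. It extracts every field regardless of format, so the caller picks the ones that apply to the R-, I- or J-type at hand.

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into the fields of slide 8.

    All fields are extracted unconditionally; which ones are meaningful
    depends on the format (R, I or J) selected by the opcode.
    """
    word &= 0xFFFFFFFF
    return {
        "op": (word >> 26) & 0x3F,      # bits 31..26, 6 bits
        "rs": (word >> 21) & 0x1F,      # bits 25..21, 5 bits
        "rt": (word >> 16) & 0x1F,      # bits 20..16, 5 bits
        "rd": (word >> 11) & 0x1F,      # bits 15..11, 5 bits (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6, 5 bits (R-type)
        "funct": word & 0x3F,           # bits 5..0, 6 bits (R-type)
        "imm": word & 0xFFFF,           # bits 15..0, 16 bits (I-type)
        "target": word & 0x3FFFFFF,     # bits 25..0, 26 bits (J-type)
    }


# Hypothetical example word: add $3, $1, $2 (op = 0, funct = 0x20).
ADD_3_1_2 = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_mips(ADD_3_1_2)
```

In hardware this decoding costs nothing: the fields are simply wires taken from fixed bit positions, which is one reason for the regular format.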

  9. Example 1: R-type: add instruction [Hennessy & Patterson]
     R-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
     • add rd, rs, rt
     • mem[PC] (fetch the instruction)
     • R[rd] = R[rs] + R[rt]
     • PC = PC + 4
     [Figure: register file of 32 32-bit registers with ports Rw, Ra, Rb addressed by Rd, Rs, Rt (5 bits each); BusA and BusB (32 bits) feed the ALU, controlled by ALUctr; the 32-bit Result returns over BusW, controlled by RegWr and Clk.]

  10. Critical path of an R-type operation [Hennessy & Patterson]
     [Figure: single-cycle datapath with PC, instruction memory, register file of 32 32-bit registers (ports Rw, Ra, Rb) and data memory.]

  11. Example 2: I-type: load word [Hennessy & Patterson]
     I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
     • lw rs, rt, imm16 (standard assembler syntax: lw rt, imm16(rs))
     • mem[PC] (fetch the instruction)
     • addr = R[rs] + ext[imm16]
     • R[rt] = mem[addr]
     • PC = PC + 4
     [Figure: register file, 16-to-32-bit Extender, ALU and data memory (WrEn, Adr, Data In), with control signals RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr and MemtoReg.]

  12. Example 3: I-type: branch [Hennessy & Patterson]
     I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
     • beq rs, rt, imm16
     • mem[PC] (fetch the instruction)
     • cond = R[rs] - R[rt]
     • if cond = 0
       • PC = PC + 4 + ext(imm16)*4
     • else
       • PC = PC + 4
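
The branch semantics above can be sketched in a few lines. This is an illustrative model only: the helper names are assumptions, and `ext` is taken to be sign extension, as the SignExt block on the next-address-logic slide suggests.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a two's complement number."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def beq_next_pc(pc, rs_val, rt_val, imm16):
    """Next PC for 'beq rs, rt, imm16', following the slide's semantics."""
    cond = (rs_val - rt_val) & 0xFFFFFFFF
    if cond == 0:
        # Taken: the offset counts instructions, hence the factor 4.
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    # Not taken: fall through to the next instruction.
    return (pc + 4) & 0xFFFFFFFF
```

For example, `beq_next_pc(0x100, 5, 5, 3)` gives `0x110`, while an immediate of `0xFFFF` (that is, -1) branches back to the branch instruction itself.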

  13. Example 3: I-type: branch [Hennessy & Patterson]
     I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
     [Figure: branch datapath. The register file outputs BusA and BusB are compared in the ALU, which produces the Zero flag; the Next Address Logic combines PC, Imm16, Branch and Zero to form the address sent to the instruction memory; control signals RegDst, RegWr, ALUctr, ExtOp and ALUSrc.]

  14. Example 3: I-type: branch [Hennessy & Patterson]
     [Figure: next address logic. The PC holds the 30-bit word address Addr<31:2>, with Addr<1:0> fixed to "00"; an adder adds "1", a second adder adds the sign-extended (SignExt) 16-bit immediate Instruction<15:0>, and a multiplexer controlled by Branch and Zero selects the next PC.]

  15. Architecture of the MIPS core
     E.g. load, 5 stages: cycle 1 Ifetch | cycle 2 RF read | cycle 3 ALU | cycle 4 dmem | cycle 5 RF write
     • problem: long critical path
       • defined by the slowest instruction (load)
     • solution: pipelining
       • break the instruction into smaller steps
       • all steps have about the same critical path

  16. Pipelining lw instructions [Hennessy & Patterson]
            cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
     lw     Ifetch    RF read   ALU       dmem      RF write
     lw               Ifetch    RF read   ALU       dmem      RF write
     lw                         Ifetch    RF read   ALU       dmem      RF write
     • One instruction enters the pipeline every clock cycle
     • One instruction leaves the pipeline every clock cycle
     • => CPI = 1 (Cycles per Instruction)
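
The cycle counts in the diagram follow a simple rule, sketched below. The function names are illustrative, and the unpipelined baseline assumes each instruction occupies all five short cycles before the next one starts, which is an assumption made here for comparison rather than something the slides state.

```python
def unpipelined_cycles(n_instr, n_stages=5):
    """Cycles when every instruction runs all stages before the next starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr, n_stages=5):
    """Cycles for an ideal pipeline without stalls.

    The first instruction needs n_stages cycles to fill the pipeline;
    every further instruction completes one cycle later.
    """
    if n_instr == 0:
        return 0
    return n_stages + (n_instr - 1)


def cpi(n_instr, n_stages=5):
    """Average cycles per instruction of the ideal pipeline."""
    return pipelined_cycles(n_instr, n_stages) / n_instr
```

Three lw instructions take 7 cycles, as in the diagram, and the average CPI approaches 1 as the instruction count grows.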

  17. Pipelining lw instructions
     [Figure: five overlapping instructions, each passing through the stages I, R, A, M, W; labels: Instructions, Data, Current CPU cycle.]

  18. 4 stages of an R-type instruction [Hennessy & Patterson]
     E.g. ADD: cycle 1 Ifetch | cycle 2 RF read | cycle 3 ALU | cycle 4 RF write

  19. Pipelining lw and R-type instructions [Hennessy & Patterson]
            cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
     lw     Ifetch    RF read   ALU       dmem      RF write
     add              Ifetch    RF read   ALU       RF write
     Resource conflict on the write port of the Rfile: both instructions write the register file in cycle 5.

  20. Solution: stretch R-type to 5 stages [Hennessy & Patterson]
     R-type: Ifetch | RF read | ALU | dmem (dummy op, noop) | RF write
            cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
     lw     Ifetch    RF read   ALU       dmem      RF write
     add              Ifetch    RF read   ALU       dmem      RF write
     add                        Ifetch    RF read   ALU       dmem      RF write

  21. [Figure: pipelined datapath with the stages Ifetch, Reg/dec, exec, mem and wr [Hennessy & Patterson]. Program memory (Prog mem), Next PC logic (+4, branch, flags), register file (Rfile; ports Ra, Rb, Rw; buses BusA, BusB), extender for Imm16 (ext.) and data memory (Data mem; adr, Din, Dout); control signals RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg and RegWr.]

  22. Data dependencies: R-type instructions [Hennessy & Patterson]
     [Figure: pipeline diagram (IM, RF, DM, RF per instruction) of R1 = ... followed by four instructions … = R1 + ... that read R1.]

  23. Data dependencies: R-type instructions [Hennessy & Patterson]
     [Figure: pipeline diagram (IM, RF, DM, RF per instruction) of R1 = ... followed by four instructions … = R1 + ... that read R1.]
     Solution: bypasses

  24. Bypasses [Hennessy & Patterson]
     [Figure: datapath with bypasses; surviving labels: adr, Data mem.]

  25. Data dependencies: load instruction [Hennessy & Patterson]
     [Figure: pipeline diagram (IM, RF, DM, RF per instruction) of R1 = lw... followed by three instructions … = R1 + ... that read R1.]

  26. Data dependencies: load instruction [Hennessy & Patterson]
     [Figure: pipeline diagram (IM, RF, DM, RF per instruction) of R1 = lw... followed by … = R1 + ..., … = R1 - ... and … = R1 - ...]
     Bypass is no solution for the + instruction, which directly follows the load.

  27. Data dependencies: load instruction [Hennessy & Patterson]
     [Figure: pipeline diagram (IM, RF, DM, RF per instruction) of R1 = lw... followed by … = R1 + ..., … = R1 - ... and … = R1 - ..., with the dependent instructions delayed.]
     Solution: pipeline interlock = detects a data hazard and stalls the pipeline until the hazard is cleared
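
A pipeline interlock of this kind can be modelled as a check on adjacent instructions. The sketch below is a simplification under stated assumptions: a 5-stage pipeline with full bypassing, so the only remaining hazard is an instruction that reads a register loaded by the instruction directly before it, costing one stall cycle. The tuple layout `(opcode, destination, sources)` and the function names are hypothetical.

```python
def count_load_use_stalls(program):
    """Count stall cycles caused by load-use hazards.

    program is a list of (opcode, dest_reg, source_regs) tuples.
    Assumes full bypassing, so only a load followed immediately by a
    consumer of its result needs a one-cycle stall.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


def total_cycles(program, n_stages=5):
    """Pipeline fill time plus one cycle per further instruction plus stalls."""
    if not program:
        return 0
    return n_stages + (len(program) - 1) + count_load_use_stalls(program)


# The sequence from the slide: a load followed by three users of R1.
# Register names other than R1 are placeholders.
example = [
    ("lw", "R1", ("R2",)),
    ("add", "R3", ("R1", "R4")),
    ("sub", "R5", ("R1", "R6")),
    ("sub", "R7", ("R1", "R8")),
]
```

Only the first consumer stalls; the later ones are far enough behind the load for the bypass (or the register file) to deliver R1 in time.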

  28. Application examples (1)
     [Figure: 5-tap FIR filter. A delay line of four Z^-1 elements holds x0 to x4; each sample is multiplied by its coefficient c0 to c4 and the five products are summed into y.]

  29. Application examples (1). 19 instructions per tap!!
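
To make the per-tap work concrete, here is a minimal direct-form FIR sketch matching the 5-tap figure. It is illustrative only (the function name and the zero initial state are assumptions) and does not reproduce the code the slide's instruction count was measured on.

```python
def fir(coeffs, samples):
    """Direct-form FIR: y[n] = sum over k of c[k] * x[n-k], with x[n] = 0 for n < 0."""
    n_taps = len(coeffs)
    delay = [0] * n_taps              # delay line, newest sample first
    out = []
    for x in samples:
        delay = [x] + delay[:-1]      # shift the delay line by one sample
        acc = 0
        for c, d in zip(coeffs, delay):
            acc += c * d              # one multiply-accumulate per tap
        out.append(acc)
    return out


impulse_response = fir([1, 2, 3, 4, 5], [1, 0, 0, 0, 0, 0])
```

Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check. On a general purpose CPU each pass through the inner loop expands into loads, a multiply, an add, pointer updates and loop control, which is where the instruction count per tap comes from.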

  30. Application examples (2). Bit level operations: finite field arithmetic. 10 instructions!! Very simple in hardware.

  31. Application examples (2). Bit level operations: DES example
     Bits 27, 26, 25, 23, 22, 20 of the source register ($2) are moved to bits 7, 6, 5, 4, 3, 2 of the destination register ($24):
         srl  $13, $2, 20
         andi $25, $13, 1
         srl  $14, $2, 21
         andi $24, $14, 6
         or   $15, $25, $24
         srl  $13, $2, 22
         andi $14, $13, 56
         or   $25, $15, $14
         sll  $24, $25, 2
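
The instruction sequence can be checked by emulating it step by step and comparing against a bit-by-bit description of the same permutation. This is a verification sketch; the function names and the explicit bit mapping are derived from the slide's shift amounts and masks, not taken from the DES specification.

```python
MASK32 = 0xFFFFFFFF


def des_gather(src):
    """Emulate the nine MIPS instructions on a 32-bit source register ($2)."""
    r13 = (src & MASK32) >> 20      # srl  $13, $2, 20
    r25 = r13 & 1                   # andi $25, $13, 1
    r14 = (src & MASK32) >> 21      # srl  $14, $2, 21
    r24 = r14 & 6                   # andi $24, $14, 6
    r15 = r25 | r24                 # or   $15, $25, $24
    r13 = (src & MASK32) >> 22      # srl  $13, $2, 22
    r14 = r13 & 56                  # andi $14, $13, 56
    r25 = r15 | r14                 # or   $25, $15, $14
    return (r25 << 2) & MASK32      # sll  $24, $25, 2


def des_gather_reference(src):
    """The same permutation written as one wire per bit (source -> destination)."""
    mapping = {20: 2, 22: 3, 23: 4, 25: 5, 26: 6, 27: 7}
    out = 0
    for s, d in mapping.items():
        out |= ((src >> s) & 1) << d
    return out
```

The reference version is what the hardware does: six wires and no logic at all, against nine instructions in software.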

  32. Application examples (2). Bit level operations: A5 example (GSM encryption)
     Bits 18, 17, 16 and 13 of $5 are XORed into a single bit (0 or 1) in $13:
         srl  $24, $5, 18
         srl  $25, $5, 17
         xor  $8, $24, $25
         srl  $9, $5, 16
         xor  $10, $8, $9
         srl  $11, $5, 13
         xor  $12, $10, $11
         andi $13, $12, 1
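
The eight instructions compute a single feedback bit. A minimal sketch of the same computation follows; the function name is an assumption and only the tap positions shown on the slide are used.

```python
def a5_feedback_bit(r5):
    """XOR of bits 18, 17, 16 and 13 of r5, as computed by the MIPS sequence."""
    t = (r5 >> 18) ^ (r5 >> 17) ^ (r5 >> 16) ^ (r5 >> 13)
    return t & 1    # andi $13, $12, 1 keeps only the lowest bit
```

In hardware this is a 4-input XOR on four wires, which illustrates the slide's point that bit level operations are cheap in hardware and costly on a CPU.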

  33. Application examples: conclusions
     • CPUs offer flexibility, but…
       • not efficient in performance
       • not efficient in code size
       • not efficient in power consumption

  34. Power consumption in microprocessors
     Power consumption is (becoming) the limiting factor in processor design.
     Solutions go in the direction of:
     • Hardware acceleration
     • Instruction Level Parallelism instead of clock speed
     • Code size efficiency
     Source: ISSCC 2001, Patrick Gelsinger, Intel

  35. Amdahl’s law
     • The impact of an improvement on the execution time of a program depends on 2 parameters:
       • f = fraction of the original computation time that is affected by the improvement
       • s = speedup factor (local)
     • exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
     • speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
     • if s >> 1 then speedup_overall ≈ 1 / (1 - f)
     • Example: 40 % of the program can be executed 10 x faster: speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
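
The formula is easy to evaluate for other values of f and s. A small sketch (the function name is an assumption):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the run time is accelerated by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


slide_example = amdahl_speedup(0.4, 10)    # the 40 %, 10 x case from the slide
upper_bound = 1.0 / (1.0 - 0.4)            # limit for s >> 1
```

For the slide's example the result is 1.5625, and even an infinitely fast accelerator for the same 40 % cannot exceed 1 / 0.6, roughly 1.67.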

  36. Conclusions
     • Programmable CPU cores are important for the control parts of the application.
     • They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
     • Keep it Simple heuristic (RISC vs. CISC)
       • Make frequent cases fast and rare cases correct.
       • Regular (orthogonal) instruction set
       • No special features that match a high level language construct.
       • At least 16 registers to ease register allocation.
     • Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).

  37. Programmable Digital Signal Processors
     • real-time worst-case processing = need for more compute power
     • sec/prog = (instr/prog) * (cycles/instr) * (sec/cycle), with CPI = 1
     • instruction level parallelism (ILP)
     • hardware support for loop control
     • attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
     • difficult to compare architectures
       • e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
     • benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)

  38. Outline
     • architectures for programmable DSPs
       • multiplier-accumulator
       • modified Harvard architecture
       • extension with an ALU (decision making)
       • controller architectures
       • examples: TI, Motorola, Philips
     • code generation
     • recent developments: VLIW (Very Long Instruction Word)
       • examples: C6 and TM

  39. Sum of products = basic operation for correlation, filtering, spectral analysis ...
     Σ c(i) * x(i); goal = 1 cycle per iteration
     [Figure: multiplier-accumulator. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace..); the product register PR (P_reg) feeds the ADDER, whose result is stored in the accumulator register ACR; labels: control, linear expr., clock.]
     • Modifications
       • position ACR (1 or 2)
       • adder/subtractor
       • extra pipelines
       • asymmetric inputs
       • multi-precision
       • extra inputs/outputs

  40. DSP data types
     • not every signal requires 32 bits
     • 2 types of DSP: floating point and integer
     • advantage of FP: most specs are in FP (conversion to int is time consuming since the behaviour may change)
     • disadvantage of FP: cost (area, speed, power)
     • wanted: type of the output of an operation = type of the input (because both are stored in RAM)
       • no problem for FP, but a problem for integer
       • integer multiplication doubles the number of bits: n * n => 2n
     • What about fractional numbers? E.g. 0.9 x 0.9 = 0.81

  41. DSP data types
     • integer and fractional numbers are a special case of fixed point
     • fix <p,q> (ART designer & SystemC): p bits in total, of which q are fractional
     • example: fix <8,3>, -19/8 = -2.375
         bit:      1      1      1      0      1      1      0      1
         weight:  -2^4    2^3    2^2    2^1    2^0    2^-1   2^-2   2^-3
       labels: scale factor 1/8, negative weight (msb), 2's complement, quantization error
     • the same ALU handles fix <8,1>, fix <8,2>, fix <8,3>, ...
     • if q=0 then integer, e.g. int <8,0>
     • if q=p-1 then fractional, e.g. int <8,7>
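
The fix <p,q> interpretation can be sketched as a pair of conversion helpers. These are illustrative assumptions (names, rounding to nearest and saturation on overflow are choices made here), not the ART Designer or SystemC API.

```python
def fix_to_float(raw, p, q):
    """Value of a p-bit two's complement pattern with q fractional bits."""
    raw &= (1 << p) - 1
    if raw & (1 << (p - 1)):       # msb set: it carries the negative weight
        raw -= 1 << p
    return raw / (1 << q)          # scale factor 1 / 2^q


def float_to_fix(value, p, q):
    """Nearest p-bit pattern for value; saturates on overflow (an assumption)."""
    scaled = round(value * (1 << q))
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1
    scaled = max(lo, min(hi, scaled))
    return scaled & ((1 << p) - 1)


pattern = 0b11101101               # the fix <8,3> example from the slide
value = fix_to_float(pattern, 8, 3)
```

The same bit pattern read as fix <8,0> is the integer -19, which is why one ALU can serve every choice of q: only the interpretation of the result changes.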

  42. DSP data types
     • continue (after multiplication) with msb only
       • represents the limit of the accuracy of the result (can not be larger than the accuracy of the inputs)
     • more efficient solution: continue with msb + lsb
       • sum-of-product operations generate accumulative noise at the 32nd vs. the 16th bit
     • still overflow for addition = overflow bits
     • double precision accumulator
       • + extra overflow bits
       • + shift, round, truncate unit
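
The effect of a double precision accumulator followed by a round unit can be sketched for 16-bit fractional data. The Q15 format (fix <16,15>), the helper names and round-to-nearest with saturation are assumptions chosen for illustration; Python integers stand in for the accumulator with extra overflow bits.

```python
Q = 15                                   # fractional bits of the 16-bit operands


def to_q15(x):
    """Quantize a value in [-1, 1) to a 16-bit fractional integer."""
    return int(round(x * (1 << Q)))


def mac_q15(coeffs, samples):
    """Sum of products kept at full precision (msb + lsb of every product)."""
    acc = 0                              # wide accumulator, never truncated
    for c, x in zip(coeffs, samples):
        acc += c * x                     # Q15 * Q15 gives a Q30 product: n * n => 2n bits
    return acc


def q30_to_q15(acc):
    """Shift, round and saturate the accumulator back to the operand format."""
    rounded = (acc + (1 << (Q - 1))) >> Q
    return max(-(1 << Q), min((1 << Q) - 1, rounded))


a = to_q15(0.9)
product = q30_to_q15(mac_q15([a], [a]))  # 0.9 x 0.9 in fractional arithmetic
```

Rounding once at the end, instead of after every product, keeps the accumulated noise at the level of the accumulator's lsb rather than the operand's lsb.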

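  43. [Figure: multiplier-accumulator with double precision accumulator. Inputs c(i) and x(i) feed the multiplier MPY (Booth, Wallace..); the product register PR (P_reg) feeds the ADDER and the accumulator register ACR, followed by a SHIFT, ROUND, TRUNCATE unit; labels: control, clock.]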

  44. Memory architectures for Σ c(i) * x(i); goal = 1 cycle per iteration
     • Von Neumann (sequential): one prog/data memory, EXU
     • Harvard: prog mem. and data mem., EXU
     • Modified Harvard: prog mem., data mem. 1 and data mem. 2, EXU

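  45. [Figure: DSP architecture. Controller with PC (+1, Stack, Interrupt address, Reset), Program Memory and IR driving the Control Bus; two address computation units ACU_A and ACU_B with address registers AR_A and AR_B; two data memories RAM_A and RAM_B with data registers DR_A and DR_B; Rfile and MAC.]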

  46. Filter loop: Σ ci * xi over i (filter loop) for every sample (time loop). 1 cycle/tap? How to update the delay line?
     [Figure: 5-tap FIR filter. A delay line of four Z^-1 elements holds x1 to x5; each sample is multiplied by its coefficient c1 to c5 and the five products are summed into y.]

  47. Solution 2: indirect addressing
     • use of a pointer to mark the beginning of the delay line
     • update the pointer instead of moving the data
     • problem: trashing of the whole memory
     • solution: modulo addressing
     • need for a register to store the pointer

  48. ACU architecture and instruction set
     Instruction   Output   reg A   reg S
     Read_A        A        A       S
     Read_S        S        A       S
     incA          A+1      A+1     S
     decA          A-1      A-1     S
     Step          A+S      A+S     S
     Inc_step      S+1      A       S+1
     Modulo: e.g. addresses 16 (10 000) to 23 (10 111); the mask marks the bits that are held.
     Modulo can be implemented as a mask operation if the size is 2^k.
     The output goes to the RAM.
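
The mask trick can be sketched as follows. The function name is an assumption, and the sketch requires the buffer to start at an address that is a multiple of its size, as in the slide's example (base 16, size 8).

```python
def modulo_step(addr, step, size):
    """Advance addr by step inside a circular buffer of `size` words.

    size must be a power of two and the buffer must be aligned to it,
    so the upper address bits can simply be held while the lower bits wrap.
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    mask = size - 1
    held = addr & ~mask                 # upper bits: unchanged ("hold")
    wrapped = (addr + step) & mask      # lower bits: add, carry-out discarded
    return held | wrapped


# Walk once around the buffer at addresses 16..23, starting at 23.
addr = 23
visited = []
for _ in range(8):
    addr = modulo_step(addr, 1, 8)
    visited.append(addr)
```

No comparison or division is needed: the wrap-around falls out of discarding the carry into the held bits, which is why it fits in the ACU without lengthening the cycle.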

  49. Addressing modes
     • register      ADD R4, R3         R[R4] = R[R4] + R[R3]
     • immediate     ADD R4, #3         R[R4] = R[R4] + #3
     • direct        ADD R4, (100)      R[R4] = R[R4] + Mem[100]
     • indirect      ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
     • w. inc/dec    ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
     • indexed       ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]
     Remarks
     • direct = for static data
     • indirect = for arrays
     • inc/dec = for stepping through arrays, e.g. Σ x_n
     • index = for stepping through arrays, e.g. Σ x_2n

  50. Addressing modes: extra for DSP
     • 8 ARs (address or auxiliary registers) available
     • extra indirect modes
       • circular       *ARn ± %        post inc/dec by 1, circular
       • circular       *ARn ± AR0 %    post inc/dec by AR0, circular
       • bit reverse    *ARn ± AR0 B    post inc/dec by AR0, bit reversed
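
Bit-reversed addressing produces the data order that radix-2 FFTs need. The sketch below is a software model of that index order only; the function name is an assumption and it does not model how any particular DSP implements the mode in its address unit.

```python
def bit_reverse(index, n_bits):
    """Reverse the lowest n_bits bits of index."""
    out = 0
    for _ in range(n_bits):
        out = (out << 1) | (index & 1)   # move the current lsb to the other end
        index >>= 1
    return out


# Access order for an 8-point transform (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

For 8 points the order is 0, 4, 2, 6, 1, 5, 3, 7. Doing this in the addressing hardware removes a separate shuffling pass over the data.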
