Processor Architectures and Program Mapping (5kk10)
H. Corporaal, J. van Meerbergen, and B. Mesman

[Figure: spectrum from flexibility to efficiency: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]

[Figure: efficiency (low, medium, high) versus flexibility (low, medium, high) for ASIC, ASIP, DSP, GP proc and FPGA]

Programmable CPU cores
• introduction
• architecture of the MIPS core
  • discussed as an example
• pipelining
• application examples
• software issues
• comparison between different CPU cores
• towards application specific architectures
• discussion

Introduction
• rationale: make the multiplex factor R as high as possible
• consequence: often a manual, handcrafted design optimised for clock rate
• problem: fast changes in IC process technology
• embedded examples:
  • MIPS (the first one; licenses the instruction set architecture)
  • ARM (Advanced RISC Machines; telecom, low power, small code size; the most popular one; also licenses the micro-architecture as hard or soft IP)
  • SPARC
• derivatives of general purpose CPUs:
  • Intel, NEC, Hitachi, National, PowerPC

Introduction: instruction set architectures
• implicit operands
  • stack machines (e.g. ST20)
  • accumulator machines
• explicit operands
  • general purpose registers
    • register-register = load-store
    • register-memory

Introduction: C = A + B

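
The statement C = A + B is the usual vehicle for comparing the instruction set classes listed on the previous slide. The following is a minimal Python sketch, assuming the textbook-style instruction sequences for each class; the mnemonics (`push`, `lda`, `addm`, ...) and the tiny interpreter are illustrative and do not come from the slides.

```python
# Hypothetical mini-interpreter: executes C = A + B on four ISA styles.
# Mnemonics are invented for illustration only.

def run(program, mem):
    """Execute a list of (op, *args) tuples against a memory dict."""
    stack, acc, regs = [], 0, {}
    for op, *args in program:
        if op == "push":      # stack machine: push mem[x]
            stack.append(mem[args[0]])
        elif op == "pop":     # stack machine: pop into mem[x]
            mem[args[0]] = stack.pop()
        elif op == "add0":    # stack machine: operands are implicit
            stack.append(stack.pop() + stack.pop())
        elif op == "lda":     # accumulator: acc = mem[x]
            acc = mem[args[0]]
        elif op == "adda":    # accumulator: acc = acc + mem[x]
            acc += mem[args[0]]
        elif op == "sta":     # accumulator: mem[x] = acc
            mem[args[0]] = acc
        elif op == "load":    # registers: r = mem[x]
            regs[args[0]] = mem[args[1]]
        elif op == "store":   # registers: mem[x] = r
            mem[args[1]] = regs[args[0]]
        elif op == "addm":    # register-memory: r = r + mem[x]
            regs[args[0]] += mem[args[1]]
        elif op == "addr":    # load-store: rd = rs + rt
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        else:
            raise ValueError(f"unknown op {op}")
    return mem

PROGRAMS = {
    "stack": [("push", "A"), ("push", "B"), ("add0",), ("pop", "C")],
    "accumulator": [("lda", "A"), ("adda", "B"), ("sta", "C")],
    "register-memory": [("load", "R1", "A"), ("addm", "R1", "B"),
                        ("store", "R1", "C")],
    "load-store": [("load", "R1", "A"), ("load", "R2", "B"),
                   ("addr", "R3", "R1", "R2"), ("store", "R3", "C")],
}

# Run every variant on a fresh memory image and collect the value of C.
results = {name: run(prog, {"A": 2, "B": 3, "C": 0})["C"]
           for name, prog in PROGRAMS.items()}
```

All four variants compute the same result; they differ in how many operands each instruction names explicitly and in how many memory accesses are needed.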
Architecture of the MIPS core [Hennessy & Patterson]
[Figure: datapath with PC, instruction memory (instruction address in, 32-bit instruction out), register file of 32 32-bit registers (ports Rw, Ra, Rb; 5-bit fields Rd, Rt, Rs; 16-bit Imm) and data memory (32-bit data address, data in, data out); PC, register file and data memory are clocked by Clk]

MIPS instruction formats (32 bits) [Hennessy & Patterson]
R-type: op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | rd [15:11], 5 bits | shamt [10:6], 5 bits | funct [5:0], 6 bits
I-type: op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | immediate [15:0], 16 bits
J-type: op [31:26], 6 bits | target address [25:0], 26 bits
op          operation of the instruction
rs, rt, rd  source and destination registers
shamt       shift amount
funct       operation of the instruction, part 2
imm         for program constants
addr        target address of a jump

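
The field boundaries above translate directly into shifts and masks. The sketch below is an illustration only: the helper names `encode_r` and `decode` are assumptions, while the bit positions follow the format table and the `add` encoding (op = 0, funct = 0x20) is the standard MIPS one.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields into one 32-bit instruction word."""
    return ((op & 0x3F) << 26 | (rs & 0x1F) << 21 | (rt & 0x1F) << 16 |
            (rd & 0x1F) << 11 | (shamt & 0x1F) << 6 | (funct & 0x3F))

def decode(word):
    """Split a 32-bit word into every field of the R, I and J formats."""
    return {
        "op": (word >> 26) & 0x3F,      # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21
        "rt": (word >> 16) & 0x1F,      # bits 20..16
        "rd": (word >> 11) & 0x1F,      # bits 15..11 (R-type)
        "shamt": (word >> 6) & 0x1F,    # bits 10..6  (R-type)
        "funct": word & 0x3F,           # bits 5..0   (R-type)
        "imm": word & 0xFFFF,           # bits 15..0  (I-type)
        "target": word & 0x3FFFFFF,     # bits 25..0  (J-type)
    }

# add $3, $1, $2  (rd = 3, rs = 1, rt = 2)
add_word = encode_r(op=0, rs=1, rt=2, rd=3, shamt=0, funct=0x20)
fields = decode(add_word)
```

Which of the decoded fields are meaningful depends on the format selected by `op`; the hardware extracts all of them in parallel and lets the control logic pick.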
Example 1: R-type: add instruction [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
• add rd, rs, rt
• fetch the instruction from mem[PC]
• R[rd] = R[rs] + R[rt]
• PC = PC + 4
[Figure: register file of 32 32-bit registers with 5-bit ports Rw, Ra, Rb driven by Rd, Rs, Rt and write enable RegWr; BusA and BusB (32 bits) feed the ALU (control ALUctr); the 32-bit Result returns on BusW; clocked by Clk]

Critical path R-type operation [Hennessy & Patterson]
[Figure: the MIPS datapath (PC, instruction memory, register file, data memory) with the path taken by an R-type operation]

Critical path R-type operation
[Timing diagram: after the clock edge each signal changes from its old to its new value in turn]
• Clock-to-Q: PC
• Instruction memory access time: rs, rt, rd, op, funct
• RFile access time: Bus A, B
• ALU delay: Bus W
• Set up + skew: write into RFile

Example 2: I-type: load word [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
• lw rs, rt, imm16 (standard MIPS assembler syntax: lw rt, imm16(rs))
• fetch the instruction from mem[PC]
• addr = R[rs] + ext[imm16]
• R[rt] = mem[addr]
• PC = PC + 4
[Figure: load datapath with register file (Rw, Ra, Rb), ALU, 16-to-32-bit Extender and data memory (WrEn, Adr, Data In); control signals RegDst, RegWr, ALUctr, ExtOp, ALUSrc, MemWr, MemtoReg; Rd is a don't care, Rt is the write register]

Critical path load operation
[Timing diagram: after the clock edge each signal changes from its old to its new value in turn]
• Clock-to-Q: PC
• Instruction memory access time: rs, rt, rd, op, funct
• RFile access time: Bus A, B
• ALU delay: address
• Mem access time: Bus W
• Set up + skew

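
The two timing diagrams differ only in the data memory access, which is why the load defines the clock period of a single-cycle design. A small sketch of that sum follows; the delay values are hypothetical (the slides give none) and are expressed in picoseconds to keep the arithmetic exact.

```python
# Hypothetical component delays in picoseconds; NOT taken from the slides.
DELAY_PS = {
    "clock_to_q": 500,
    "imem_access": 2000,
    "rfile_access": 1000,
    "alu": 2000,
    "dmem_access": 2000,
    "setup_plus_skew": 500,
}

# Stages on the critical path, in the order shown in the timing diagrams.
R_TYPE_PATH = ["clock_to_q", "imem_access", "rfile_access", "alu",
               "setup_plus_skew"]
LOAD_PATH = ["clock_to_q", "imem_access", "rfile_access", "alu",
             "dmem_access", "setup_plus_skew"]

def path_delay(path):
    """Sum the delays of all stages along one critical path."""
    return sum(DELAY_PS[stage] for stage in path)

r_type_ps = path_delay(R_TYPE_PATH)
load_ps = path_delay(LOAD_PATH)

# Without pipelining, the clock period must cover the slowest instruction.
cycle_ps = max(r_type_ps, load_ps)
```

Whatever the actual numbers are, the load path is longer than the R-type path by exactly the data memory access time.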
Example 3: I-type: branch [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
• beq rs, rt, imm16
• fetch the instruction from mem[PC]
• cond = R[rs] - R[rt]
• if cond = 0
  • PC = PC + 4 + ext(imm16)*4
• else
  • PC = PC + 4

Example 3: I-type: branch [Hennessy & Patterson]
Format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Figure: branch datapath: the register file outputs BusA and BusB are compared and produce Zero; the Next Address Logic takes PC, Imm16, Branch and Zero and drives the instruction memory address; control signals RegDst, RegWr, ALUctr, ExtOp, ALUSrc, Branch; Rd is a don't care]

Example 3: I-type: branch [Hennessy & Patterson]
[Figure: next address logic with a 30-bit PC: Addr<31:2> comes from the PC and Addr<1:0> is "00"; one adder adds "1" to the PC, a second adds the sign-extended Imm16 (Instruction<15:0>); a multiplexer controlled by Branch and Zero selects the next PC]

Example 3: I-type: branch [Hennessy & Patterson]
[Figure: alternative next address logic with a 30-bit PC: Addr<31:2> comes from the PC and Addr<1:0> is "00"; the "1" enters through the adder carry input (c_in) and a multiplexer controlled by Branch and Zero selects "0" or the sign-extended Imm16 (Instruction<15:0>) as the other operand]

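
The branch semantics of Example 3 can be written out as a short sketch. The function names are assumptions made for illustration; `next_pc30` mirrors the 30-bit word-address variant drawn in the figures, where the two low address bits are always "00".

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc(pc, rs_val, rt_val, imm16, is_beq):
    """Byte-address next PC: PC + 4, plus ext(imm16) * 4 if beq is taken."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0   # cond = R[rs] - R[rt]
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    if is_beq and zero:
        return (pc_plus_4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return pc_plus_4

def next_pc30(pc30, imm16, taken):
    """Word-address variant: a 30-bit PC incremented by 1 instead of 4."""
    offset = sign_extend16(imm16) if taken else 0
    return (pc30 + 1 + offset) & 0x3FFFFFFF

taken_target = next_pc(0x100, 5, 5, 3, is_beq=True)        # forward branch
not_taken = next_pc(0x100, 5, 6, 3, is_beq=True)           # falls through
backward = next_pc(0x100, 1, 1, 0xFFFF, is_beq=True)       # offset of -1
```

Working with a 30-bit PC removes the multiply by 4: the byte address is recovered by appending "00", which is what the figures do.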
Architecture of the MIPS core
• problem: long critical path
  • defined by the slowest instruction (load)
• solution: pipelining
  • break the instruction into smaller steps
  • all steps have about the same critical path
E.g. load, 5 stages: Ifetch (cycle 1), RF read (cycle 2), ALU (cycle 3), dmem (cycle 4), RF write (cycle 5)

Pipelining lw instructions [Hennessy & Patterson]
cycle   1         2         3         4         5         6         7
lw      Ifetch    RF read   ALU       dmem      RF write
lw                Ifetch    RF read   ALU       dmem      RF write
lw                          Ifetch    RF read   ALU       dmem      RF write
• One instruction enters the pipeline every clock cycle
• One instruction leaves the pipeline every clock cycle
• => CPI = 1 (cycles per instruction)

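
The schedule above can be generated mechanically, which also shows why CPI only approaches 1: the pipeline needs a few extra cycles to fill. This is a sketch of an ideal pipeline without hazards; the function names are illustrative.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]

def schedule(n_instructions):
    """Return one {cycle: stage} dict per instruction, issuing one per cycle."""
    return [{issue + s + 1: STAGES[s] for s in range(len(STAGES))}
            for issue in range(n_instructions)]

def total_cycles(n_instructions):
    """Cycles until the last instruction has left an ideal pipeline."""
    return n_instructions + len(STAGES) - 1

def cpi(n_instructions):
    """Average cycles per instruction, including the pipeline fill time."""
    return total_cycles(n_instructions) / n_instructions

three_loads = schedule(3)
```

For the three loads on the slide this gives 7 cycles in total; for long instruction streams the fill time is amortised and CPI tends to 1.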
Pipelining lw instructions
[Figure: five overlapping instructions, each passing through the stages I, R, A, M, W; labels: instructions, data, current CPU cycle]

4 stages of R-type instruction [Hennessy & Patterson]
E.g. ADD: Ifetch (cycle 1), RF read (cycle 2), ALU (cycle 3), RF write (cycle 4)

Pipelining lw and R-type instructions [Hennessy & Patterson]
cycle   1         2         3         4         5         6         7
lw      Ifetch    RF read   ALU       dmem      RF write
add               Ifetch    RF read   ALU       RF write
• Resource conflict on the write port of the Rfile: a 5-stage lw and a 4-stage add issued one cycle later reach RF write in the same cycle

Solution: stretch R-type to 5 stages [Hennessy & Patterson]
R-type stages: Ifetch, RF read, ALU, dmem (dummy op, noop), RF write
cycle   1         2         3         4         5         6         7
lw      Ifetch    RF read   ALU       dmem      RF write
add               Ifetch    RF read   ALU       dmem      RF write
add                         Ifetch    RF read   ALU       dmem      RF write

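
The write-port conflict and its fix can be checked with a few lines of code. This sketch assumes one instruction is issued per cycle starting in cycle 1; the helper name `write_port_conflicts` is illustrative.

```python
from collections import Counter

LW = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]
ADD_4_STAGE = ["Ifetch", "RF read", "ALU", "RF write"]
ADD_5_STAGE = ["Ifetch", "RF read", "ALU", "dmem (noop)", "RF write"]

def write_port_conflicts(instructions):
    """Return the cycles in which more than one instruction writes the Rfile.

    `instructions` is a list of stage lists, issued in consecutive cycles.
    """
    writes = Counter()
    for issue_cycle, stages in enumerate(instructions, start=1):
        writes[issue_cycle + stages.index("RF write")] += 1
    return sorted(cycle for cycle, count in writes.items() if count > 1)

conflicts_short_add = write_port_conflicts([LW, ADD_4_STAGE])
conflicts_stretched = write_port_conflicts([LW, ADD_5_STAGE, ADD_5_STAGE])
```

With the 4-stage add both instructions write in cycle 5; once every instruction has the same five stages, each stage is used by at most one instruction per cycle.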
[Figure: pipelined datapath [Hennessy & Patterson] with stages Ifetch, Reg/dec, exec, mem, wr: Next PC logic (+4, branch, flags), program memory (adr, Di), register file (Rs, Rt, Ra, Rb, Rw, BusA, BusB), extender (Imm16) and data memory (Din, Dout); control signals RegWr, MemtoReg, MemWr, RegDst, ALUSrc, ExtOp, ALUop]

Data dependencies: R-type instructions [Hennessy & Patterson]
R1 = ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
… = R1 + ...
[Figure: pipeline diagram with one row per instruction (IM, RF, DM, RF), showing the instructions that read R1 overlapping the instruction that writes it]

Data dependencies: R-type instructions [Hennessy & Patterson]
[Figure: the pipeline diagram for R1 = ... followed by four instructions of the form … = R1 + ...]
Solution: bypasses

Bypasses [Hennessy & Patterson]
[Figure: pipelined datapath with bypass paths; only the labels adr and Data mem survive]

Data dependencies: load instruction [Hennessy & Patterson]
R1 = lw...
… = R1 + ...
… = R1 + ...
… = R1 + ...
[Figure: pipeline diagram with one row per instruction (IM, RF, DM, RF)]

Data dependencies: load instruction [Hennessy & Patterson]
R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...
[Figure: pipeline diagram with one row per instruction (IM, RF, DM, RF)]
A bypass is no solution for the + instruction, which directly follows the load.

Data dependencies: load instruction [Hennessy & Patterson]
R1 = lw...
… = R1 + ...
… = R1 - ...
… = R1 - ...
[Figure: pipeline diagram with one row per instruction (IM, RF, DM, RF), with the dependent instructions delayed]
Solution: pipeline interlock: hardware that detects a data hazard and stalls the pipeline until the hazard is cleared

Instructions:
i1) lw r10, r2, r0
i2) add r8, r9, r10
[Figure: i1 passes through I, R, A, M, W and its data becomes available from the data cache; i2 passes through I, R (interlocked), A, M, W]

Instructions:
i1) MULT r3, r2, r1
i2) ADD r5, r4, r3
[Figure: i1 passes through I, R, A, M, W; i2 passes through I, R (interlocked), A, M, W]

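
The interlock decision in the two examples above can be sketched as a small calculation. The model below is an assumption, not the PR3930 implementation: stages I, R, A, M, W are numbered 0 to 4, all results are bypassed to the ALU input, an ALU result is ready at the end of A and a load result at the end of M. The MULT latency is not given on the slide, so the value used for it is hypothetical.

```python
# Stage index after which a result can be bypassed (assumed model).
READY_STAGE = {
    "alu": 2,    # ready at the end of A
    "load": 3,   # ready at the end of M
    "mult": 3,   # HYPOTHETICAL: multiplier latency is not in the slides
}
ALU_STAGE = 2    # the consumer needs its operands at the start of A

def stall_cycles(producer_kind, distance):
    """Cycles the consumer is interlocked.

    `distance` is how many instructions later the consumer is issued
    (1 means it directly follows the producer).
    """
    return max(0, READY_STAGE[producer_kind] + 1 - ALU_STAGE - distance)

alu_then_use = stall_cycles("alu", 1)        # bypass hides the dependency
load_then_use = stall_cycles("load", 1)      # lw followed directly by add
load_gap_then_use = stall_cycles("load", 2)  # one independent instr between
```

Under these assumptions a compiler can remove the load stall by scheduling one independent instruction between the load and its first use.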
Control hazards [Hennessy & Patterson]
[Figure: pipelined datapath (Next PC logic with +4, branch and flags; program memory; register file; extender; data memory) with five overlapping instructions in stages I, R, A, M, W]

Control hazards [Hennessy & Patterson]
[Figure: pipelined datapath with a zero test (0?) feeding the Next PC logic; three overlapping instructions in stages I, R, A, M, W]

Control hazards
i1) beq r10, r2, 1b
i2) nop/independent instructions
i3) add r8, r9, r10
[Figure: i1, i2 and i3 overlap in stages I, R, A, M, W; the branch address is available for the instruction fetch of i3]
Solution: compiler action, possibly filling the branch delay slot

PR3930 CPU
[Figure: block diagram with PR3930 IU (including MAD), MMU, DSU, PIO, 8K I$ with Itag and 4K D$ with dtag]

TCP chip: TV controller
• PR3930 + peripherals
  • Gfx, SDRAM controller,
  • serial interconnect bus,
  • I2C, UART, timers
• PI bus architecture
• 80 mm²
• 352 pins
• 0.35 micron process
• 48 MHz (96 MHz for gfx)
[Figure: chip layout showing D$ and I$]

Application examples (1)
[Figure: 5-tap FIR filter: samples x0 ... x4 along a chain of four Z^-1 delays, each multiplied by a coefficient c0 ... c4, and the products summed into y]

Application examples (1)
19 instructions per tap!!

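
For reference, the filter in the figure computes y = c0*x0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 over a sliding window of samples. A minimal sketch follows; the function name `fir` and the coefficient values are illustrative, not taken from the slides.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: one output per input sample.

    The delay line starts at zero; delay[0] is the newest sample (x0).
    """
    delay = [0] * len(coeffs)
    outputs = []
    for x in samples:
        delay = [x] + delay[:-1]          # shift: one Z^-1 per tap
        outputs.append(sum(c * d for c, d in zip(coeffs, delay)))
    return outputs

# Hypothetical coefficients c0..c4; an impulse reproduces them at the output.
impulse_response = fir([1, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5])
```

Each tap costs one multiply, one add and one delay-line move. On a general-purpose core these become separate loads, stores, address updates and loop overhead, which is where the instruction count per tap comes from.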
Application examples (2)
Bit level operations: finite field arithmetic
10 instructions!!
Very simple in hardware

Application examples (2)
Bit level operations: DES example
Bits 27, 26, 25, 23, 22, 20 of the source register ($2) move to bits 7, 6, 5, 4, 3, 2 of the destination register ($24):
    srl  $13, $2, 20
    andi $25, $13, 1
    srl  $14, $2, 21
    andi $24, $14, 6
    or   $15, $25, $24
    srl  $13, $2, 22
    andi $14, $13, 56
    or   $25, $15, $14
    sll  $24, $25, 2

Application examples (2)
Bit level operations: A5 example (GSM encryption)
Bits 18, 17, 16 and 13 of register $5 are XORed into a single bit (0 or 1) in $13:
    srl  $24, $5, 18
    srl  $25, $5, 17
    xor  $8, $24, $25
    srl  $9, $5, 16
    xor  $10, $8, $9
    srl  $11, $5, 13
    xor  $12, $10, $11
    andi $13, $12, 1

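
Both listings can be mirrored line by line in a higher-level language to check what they compute. In the sketch below `des_gather` and `a5_feedback` follow the shift and mask constants of the assembly above; the function names and the bit-by-bit reference `des_gather_reference` are illustrative additions.

```python
def des_gather(src):
    """Mirror of the DES listing: shifts and masks as in the assembly."""
    t = (src >> 20) & 1          # srl 20, andi 1  -> source bit 20
    t |= (src >> 21) & 6         # srl 21, andi 6  -> source bits 22, 23
    t |= (src >> 22) & 56        # srl 22, andi 56 -> source bits 25, 26, 27
    return (t << 2) & 0xFFFFFFFF # sll 2           -> destination bits 7..2

def des_gather_reference(src):
    """Bit-by-bit statement of the same permutation (source -> destination)."""
    mapping = {20: 2, 22: 3, 23: 4, 25: 5, 26: 6, 27: 7}
    return sum(((src >> s) & 1) << d for s, d in mapping.items())

def a5_feedback(reg):
    """Mirror of the A5 listing: XOR of bits 18, 17, 16 and 13."""
    return ((reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13)) & 1

all_ones = des_gather(0xFFFFFFFF)
```

In hardware both functions are pure wiring plus, for A5, three XOR gates, which is the contrast the slides draw with the 8 to 10 instructions needed in software.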
Application examples (3)
Video conferencing H263
CIF format = 352 * 288 px, 2:1:1, 8 bits/sample
QCIF = 1/4 CIF
SQCIF = 96 * 128
Process = 0.25 micron
Power consumption = 100 mW @ 10 Hz
96 * 128 * 1.5 * 10 Hz = 180 KB/s; divided by 72: 20 Kb/s
Compare: 852 * 576 * 2 B/px * 50 Hz = 49 MB/s

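
The data rates on this slide can be reproduced with a few multiplications. The sketch reads ":72" as a compression factor of 72, which is an interpretation of the slide, and uses 1 KB = 1024 bytes for the SQCIF figure and 1 MB = 10^6 bytes for the comparison figure; the variable names are illustrative.

```python
# SQCIF: 96 * 128 pixels, 1.5 bytes per pixel on average, 10 frames/s.
sqcif_pixels = 96 * 128
raw_bytes_per_s = sqcif_pixels * 3 // 2 * 10      # 1.5 B/px as 3/2
raw_kib_per_s = raw_bytes_per_s / 1024

# Assumed reading of ":72": the encoder compresses by a factor of 72.
compressed_kib_per_s = raw_kib_per_s / 72
compressed_kbit_per_s = compressed_kib_per_s * 8

# Comparison figure from the slide: 852 * 576 px, 2 B/px, 50 Hz.
tv_bytes_per_s = 852 * 576 * 2 * 50
tv_mb_per_s = tv_bytes_per_s / 1_000_000
```

The uncompressed SQCIF stream is about 270 times smaller than the comparison stream, which is why the encoder fits on a single embedded core at all.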
Application examples (3)
H.263 video encoder
[Figure: encoder block diagram: in, subtraction of the prediction, DCT, Q, VLC, out; reconstruction loop with IQ, IDCT, adder and frame store; motion estimation (best match) produces motion vectors for motion compensation]

Application examples (3)
[Figure: PR3940 with I$ and D$, connected to memory]
10 Hz => 140 MHz CPU

Application examples (3)
In which process can the H263 video encoder be executed on a single MIPS processor?
Conclusion: power consumption is the limiting factor!!

Application examples: conclusions
• CPUs offer flexibility, but…
  • not efficient in performance
  • not efficient in code size
  • not efficient in power consumption

Input C code:
    func() {
        a=x.value & 0x3;
        if (a != 0) {
            b = a * c + d;
        } else {
            b = … ;
        }
        y.post(b);
    }
parser: split the code into basic blocks (BB)
    BB1: a=x.value & 0x3;
    BB2 (a != 0): b = a * c + d;
    BB3 (a == 0): b = … ;
    BB4: y.post(b);
compile each BB to instructions, e.g.
    ldi #0x3, R5
    and R4,R5,R6
    cmp R0,R6,R7
    br R7,true
    ba false
Arch. Model: ldi = 2 cycles, nop = 1 cycle, ...
generate new C with delay counts:
    func() {
        a=x.value & 0x3;
        DelayCycles(7);
        if (a != 0) {
            b = a * c + d;
            DelayCycles(8);
        } else {
            b = … ;
            DelayCycles(5);
        }
        y.post(b);
        DelayCycles(4);
    }
compile and run
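
The annotation step in this flow amounts to summing per-instruction costs over each basic block. The sketch below illustrates that idea only. The slide gives just two entries of the architecture model (ldi = 2, nop = 1); the costs for `and`, `cmp`, `br` and `ba` are hypothetical values chosen so that the first block sums to the 7 cycles shown, and the helper names are assumptions.

```python
ARCH_MODEL = {
    "ldi": 2, "nop": 1,                      # from the slide
    "and": 1, "cmp": 1, "br": 2, "ba": 1,    # HYPOTHETICAL costs
}

# Mnemonics of the instructions compiled for BB1 on the slide.
BB1_INSTRUCTIONS = ["ldi", "and", "cmp", "br", "ba"]

def block_delay(mnemonics, model=ARCH_MODEL):
    """Sum the cycle cost of every instruction in one basic block."""
    return sum(model[m] for m in mnemonics)

def annotate(c_statement, mnemonics):
    """Append a DelayCycles call carrying the block's cycle count."""
    return f"{c_statement} DelayCycles({block_delay(mnemonics)});"

bb1_cycles = block_delay(BB1_INSTRUCTIONS)
bb1_annotated = annotate("a=x.value & 0x3;", BB1_INSTRUCTIONS)
```

The annotated C code then runs natively on the host, so timing is accumulated per basic block rather than by simulating every target instruction.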