Structure of Computer Systems Course 4 The Central Processing Unit - CPU
CPU - Central Processing Unit • “Classic (idyllic) view” • Incorporates 2 of the 5 components of the classical von Neumann model: • ALU • CU – Control Unit • It is the brain (intelligent part) of a computer • Fetches (reads) an instruction, decodes/interprets it, reads the data, executes the instruction and stores the result • Does its job in a synchronized and sequential way – “one thing at a time”
CPU - Central Processing Unit • Today’s view: • Contains all kinds of computer components: • Multiple CPUs: • symmetric, asymmetric • multiple cores • multiple ALUs, specialized ALUs (e.g. floating point, multimedia – MMX, SSE2) • Memory – multiple levels of cache memory (L0, L1, L2, Trace cache) • Interfaces and peripheral devices (in the case of microcontrollers and DSPs): • Serial channels • Parallel interfaces • Timers, counters • Converters (ADC, DAC) • Network interfaces • Interrupt system • Bus controller(s) and arbiter(s) • Memory management units • Executes instructions in parallel and in speculative order • Intelligence may be distributed in memories and interfaces as well • Where is that nice idyllic image?
Starting with the beginning … • A simple computer • Attributes: sequential, one (accumulator) register, one memory for instructions and data • [Block diagram: memory with data/address inputs, PC and IR feeding an address MUX (Sel), ALU (Op_sel) and accumulator, decoder and command/control unit (Dec&CC), clock generator and phase generator; control signals: IR_ld, PC_ld, Inc, wr, Acc_ld, Acc_shr, Acc_shl, Acc_clr, Rst] • Legend: CG – clock generator, PhG – phase generator, PC – program counter, IR – instruction register, Acc – accumulator
A simple computer • How does it work? • 4 phases: • IF – instruction fetch – read the instruction into IR • Dec – decode the instruction – generate the control signals • PreEx – prepare execution – e.g. read the data from memory • Exe – execute – e.g. addition, subtraction
A simple computer • Example 1 – ADD Acc, M[100h] • IF: Sel=0 => Address = PC; IR_ld pulse => IR = ADD 100h • Dec: Sel=1 => Address = IR_addr (100h); Inc=1 => increment PC • PreEx: Op_sel = code_add => the ALU performs the addition • Exe: Acc_ld => Acc = Acc + M[100h]
A simple computer • Example 2 – JMP 200h • IF: Sel=0 => Address = PC; IR_ld pulse => IR = JMP 200h • Dec: Inc=1 => increment PC • PreEx: PC_ld=1 => PC = IR_addr = 200h • Exe: – • Example 3 – SHR Acc • IF and Dec: the same • PreEx: – • Exe: Acc_shr=1 => shift the accumulator one position to the right
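Below is a minimal Python sketch of this simple accumulator machine, collapsing the four phases into one loop iteration; the tuple-based instruction encoding and the separate data dictionary are assumptions made only to keep the illustration short.

    # Minimal simulator of the simple accumulator machine described above.
    def run(program, data, max_steps=100):
        pc, acc = 0, 0
        for _ in range(max_steps):
            if pc >= len(program):
                break
            op, arg = program[pc]          # IF: fetch the instruction into "IR"
            pc += 1                        # Dec: increment PC
            if op == "ADD":                # PreEx/Exe: read the operand, add it to Acc
                acc += data[arg]
            elif op == "SHR":              # Exe: shift the accumulator right
                acc >>= 1
            elif op == "JMP":              # Exe: load PC (arg is an instruction index here)
                pc = arg
            elif op == "HLT":
                break
        return acc

    # Example: Acc = (M[100h] + M[101h]) shifted right once.
    data = {0x100: 6, 0x101: 10}
    prog = [("ADD", 0x100), ("ADD", 0x101), ("SHR", None), ("HLT", None)]
    print(run(prog, data))                 # prints 8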
A simple computer • Homework: try to implement: • MOV M[addr], Acc • MOV Acc, M[addr] • Conditional jump (e.g. if Acc=0, >0, <0) • MOV Acc, 0
A simple computer • Issues: • Every instruction is executed in a fixed number of steps (4) • Too many for simple instructions • Too few for complex instructions (e.g. multiply) • Only one internal register – hard to operate with data • No input and output devices • Limited number of possible operations – small instruction set • Possible improvements: • Variable number of phases -> the phase generator should depend on the instruction code • Multiple internal registers -> 2 buses: input data; output data • Front panel with 7-segment LEDs and switches • Increase the number of instructions -> more complex decoder and command and control unit
A more sophisticated computer, but still simple – the MIPS architecture • Attributes: • Sequential • 32 internal registers of 32 bits each • Instructions: fixed length, variable content • Harvard memory architecture: separate instruction and data memory • An instruction is executed in 5 phases: • IF – instruction fetch • ID – decode the instruction and prepare (read) the data • Ex – execute the instruction • M – memory access • Wb – write back – store the result • Instruction types: • “R” Register, e.g. ADD $RD, $RS, $RT • “I” Immediate, e.g. ADDI $RT, $RS, constant; LW $RT, offset($RS) • “J” Jump, e.g. J target
MIPS architecture • Instruction formats: • Fixed length (4 bytes) but multiple content • “R” – register type instructions: <instr> rd, rs, rt • Encoding: | opcode (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shift (5 bits) | funct (6 bits) | • rd – destination register • rs – source register • rt – target register • Ex: add $s1, $s2, $s3 ; $s1 = $s2 + $s3
MIPS architecture • Instruction formats (cont.) • “I” – immediate type instructions (with immediate value/constant): <instr> rt, rs, IMM • Encoding: | opcode (6 bits) | rs (5 bits) | rt (5 bits) | IMM/Addr (16 bits) | • rs – source register • rt – target register • Ex: addi $s1, $s2, 55 ; $s1 = $s2 + 55 • “J” – jump type instructions: <instr> LABEL • Encoding: | opcode (6 bits) | target address (26 bits) | • Ex: j et1 ; jump to label et1
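As an illustration of these formats, the following Python sketch packs the fields into 32-bit words using the widths shown above; the opcode/funct values and register numbers in the examples are the standard MIPS ones (add, addi, $s1=17, $s2=18, $s3=19), but treat the snippet as illustrative rather than a reference encoder.

    def encode_r(opcode, rs, rt, rd, shamt, funct):
        # R-format: opcode(6) | rs(5) | rt(5) | rd(5) | shift(5) | funct(6)
        return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

    def encode_i(opcode, rs, rt, imm):
        # I-format: opcode(6) | rs(5) | rt(5) | immediate(16)
        return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

    def encode_j(opcode, target):
        # J-format: opcode(6) | target address(26)
        return (opcode << 26) | (target & 0x3FFFFFF)

    # add $s1, $s2, $s3  ->  opcode=0, rs=$s2(18), rt=$s3(19), rd=$s1(17), funct=0x20
    print(hex(encode_r(0, 18, 19, 17, 0, 0x20)))   # 0x2538820
    # addi $s1, $s2, 55  ->  opcode=8, rs=$s2(18), rt=$s1(17)
    print(hex(encode_i(8, 18, 17, 55)))            # 0x22510037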
MIPS architecture • Address generation and instruction fetch • [Block diagram: the PC addresses the program memory; the fetched instruction code is loaded into IR (IR_ld); a “+4” adder and two multiplexers (PC_MUX_Sel1, PC_MUX_Sel2) select the next PC from PC+4, an absolute jump address or a PC-relative jump address; PC_ld loads the PC] • PC = PC + 4 – increment the PC • PC = Jump_Address – absolute jump • PC = PC + Jump_Address – relative jump
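A tiny sketch of the next-PC selection described above (purely illustrative; the two boolean arguments stand in for the PC_MUX_Sel1/PC_MUX_Sel2 selects in the figure).

    def next_pc(pc, jump_address=0, absolute_jump=False, relative_jump=False):
        # Default: sequential execution, PC = PC + 4 (one 4-byte instruction)
        if absolute_jump:
            return jump_address          # PC = Jump_Address
        if relative_jump:
            return pc + jump_address     # PC = PC + Jump_Address
        return pc + 4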
MIPS architecture • Decode and data preparation • [Block diagram: the operand address fields of IR (op1_ad, op2_ad) select two registers from the register block (reg. 0 … reg. 31) through MUX A and MUX B, producing operands A and B; the immediate value I is taken directly from IR; the decoder (DEC) turns the op_code into execute, memory and write-back commands]
MIPS architecture • Execute and memory access • [Block diagram: the ALU combines operand A with either operand B or the immediate value I (selected by Sel_ALU) according to ex_op_code, producing the result; the result is also used as the address for the data memory (Din/Dout, write enabled by Wr_mem); the data output is either the ALU result or the value read from memory]
MIPS architecture • Write back the result • [Block diagram: a multiplexer (Sel_rez) chooses between the ALU result and the data read from memory; the destination register field of IR is decoded (DEC, Wr_R0..31) and, if Wr_reg is active, the selected value is written into the destination register of the register block (reg. 0 … reg. 31)]
MIPS architecture • The whole picture • [Block diagram: the clock generator and phase generator drive the datapath; the PC and a “+4” adder address the instruction memory, the instruction is loaded into IR and decoded; the register block supplies the operands to the ALU, the data memory is accessed and the result is written back into the registers]
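To tie the previous datapath slides together, here is a compact Python sketch of a single-cycle interpreter for a MIPS-like subset (add, addi, lw, sw, j); the dictionary-based register file, word-addressed memory and tuple instruction format are simplifications assumed for the example, not the real MIPS encoding.

    def step(state, instr):
        # One "clock cycle": decode the instruction, execute it, write back.
        regs, mem = state["regs"], state["mem"]
        op = instr[0]
        if op == "add":                     # add rd, rs, rt
            _, rd, rs, rt = instr
            regs[rd] = regs[rs] + regs[rt]
        elif op == "addi":                  # addi rt, rs, imm
            _, rt, rs, imm = instr
            regs[rt] = regs[rs] + imm
        elif op == "lw":                    # lw rt, offset(rs)
            _, rt, offset, rs = instr
            regs[rt] = mem[regs[rs] + offset]
        elif op == "sw":                    # sw rt, offset(rs)
            _, rt, offset, rs = instr
            mem[regs[rs] + offset] = regs[rt]
        elif op == "j":                     # j target (absolute instruction index)
            state["pc"] = instr[1]
            return
        state["pc"] += 1                    # default: fall through to the next instruction

    def run(program, state):
        while state["pc"] < len(program):
            step(state, program[state["pc"]])

    state = {"pc": 0, "regs": {"$zero": 0, "$s1": 0, "$s2": 3, "$s3": 4}, "mem": {100: 0}}
    run([("add", "$s1", "$s2", "$s3"), ("sw", "$s1", 100, "$zero")], state)
    print(state["mem"][100])                # prints 7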
Pipeline execution • What does it mean? • Work as “an assembly line” • idea borrowed from the automotive assembly lines of the early 1900s • How to do it? • Specialized components (units) for every phase of instruction execution • Memorize the partial results in temporary buffers • What can we achieve? • Higher execution speed at the same clock frequency • CPI ≈ 1
Sequential vs. pipelined execution • Sequential execution, CPI = 5: Instr. 1: IF ID Ex M Wb; Instr. 2: IF ID Ex M Wb; Instr. 3: IF ID Ex M Wb; … • Pipelined execution, CPI = 1 (in the ideal case):
      T1  T2  T3  T4  T5  T6  T7  T8  T9
  i1  IF  ID  Ex  M   Wb
  i2      IF  ID  Ex  M   Wb
  i3          IF  ID  Ex  M   Wb
  i4              IF  ID  Ex  M   Wb
  i5                  IF  ID  Ex  M   Wb
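The ideal speedup can be checked with a short calculation (a sketch that ignores hazards, so it gives the best case only): a pipelined machine needs 5 cycles to fill the pipeline and then finishes one instruction per clock.

    def execution_cycles(n_instructions, n_stages=5):
        sequential = n_instructions * n_stages            # CPI = 5
        pipelined = n_stages + (n_instructions - 1)       # fill the pipe, then 1 per clock
        return sequential, pipelined

    seq, pipe = execution_cycles(100)
    print(seq, pipe, seq / pipe)   # 500 cycles vs. 104 cycles -> speedup of about 4.8x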
Superscalar and superpipeline architectures • Superscalar: • multiple pipelines • 2 instructions are fetched every clock • CPI = 1/2
        T1  T2  T3  T4  T5  T6
  i     IF  ID  Ex  M   Wb
  i+1   IF  ID  Ex  M   Wb
  i+2       IF  ID  Ex  M   Wb
  i+3       IF  ID  Ex  M   Wb
• Superpipeline: • the phases require only half a clock period, so a new instruction can start every half clock • CPI = 1/2
        T1  T2  T3  T4  T5  T6
  i     IF  ID  Ex  M   Wb
  i+1     IF  ID  Ex  M   Wb        (starts half a clock later)
  i+2       IF  ID  Ex  M   Wb
  i+3         IF  ID  Ex  M   Wb    (starts half a clock later)
Pipelined MIPS architecture • [Block diagram: the five stages IF, ID, Ex, M, Wb are separated by pipeline (buffer) registers C1, C2, C3, …; the instruction memory and PC (with the “+4” adder) belong to the IF stage, the register block and decoder to the ID stage, the ALU to the Ex stage, the data memory to the M stage, and the write-back path returns the result to the register block; the decoded ex/m/wb commands travel along the pipeline together with the data]
Pipeline architecture • There is no free lunch! • Hazard cases: • Data hazard • Data dependency between consecutive instructions • Control hazard • Jump/branch instructions change the normal (sequential) order of instruction execution • Structural hazard • Instructions in different phases use the same structural component (e.g. ALU, registers, memory, bus, etc.) • Result: the speed and the efficiency of the pipeline architecture are reduced
Hazard cases in pipeline architectures • Data hazard • Example: MOV AX, 5 / ADD BX, AX / SUB CX, 5 / MOV DX, CX – ADD BX, AX must be stalled until MOV AX, 5 has written AX, and MOV DX, CX must be stalled until SUB CX, 5 has written CX • Data hazard types: • RAW – read after write • occurs very often; avoided through forwarding (see common data bus) • WAR – write after read • rare in a classic pipeline; more frequent in superscalar pipelines • WAW – write after write • RAR – not a hazard
Hazard cases in pipeline architectures • Data hazard (cont.) • Solutions: • Detection and Stall phases • instruction with unsolved data dependency waits in the “instruction fetch” stage until the data is available • the next instructions are also stalled • Register renaming • multiple copies of a register (see alias registers for Pentium Pro) • instructions with no logical dependency between them can get different copies of the same register • avoid artificial data dependency caused by the limited number of internal registers • Forwarding (see Common data bus) • transfer a result in advance before it is written in the final place (register or memory location) • Out-of-order execution • speculative execution (see Pentium Pro architecture)
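A small sketch of RAW-hazard detection between consecutive instructions, with a flag that models forwarding; the (destination, sources) tuple representation and the two-stall penalty without forwarding are assumptions chosen for a classic 5-stage pipeline, not a description of any specific processor.

    def stall_cycles(instructions, forwarding=False):
        # instructions: list of (dest_reg, [source_regs]) in program order.
        # Without forwarding we assume the result is usable only after write-back,
        # i.e. 2 stall cycles when the very next instruction needs it.
        stalls = 0
        for prev, curr in zip(instructions, instructions[1:]):
            dest, _ = prev
            _, sources = curr
            if dest in sources:            # RAW dependency on the previous instruction
                stalls += 0 if forwarding else 2
        return stalls

    prog = [("AX", []),          # MOV AX, 5
            ("BX", ["AX"]),      # ADD BX, AX   <- RAW on AX
            ("CX", ["CX"]),      # SUB CX, 5
            ("DX", ["CX"])]      # MOV DX, CX   <- RAW on CX
    print(stall_cycles(prog), stall_cycles(prog, forwarding=True))   # 4 0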
Hazard cases in pipeline architectures • Structural hazard • Example: an instruction with no memory phase (IF ID Ex Wb) reaches its write-back early, so two instructions in different phases try to use the register block in the same clock cycle • Solutions: • Detection and stall phases • Redundant functional units – see Pentium processors • Harvard memory organization – separate code and data memory – see microcontrollers • Multiple buses – see DSPs • Out-of-order execution
Hazard cases in pipeline architectures • Control hazard • Example: JE et1 / ADD AX, BX / SUB CX, DX / … / et1: MOV SI, 1234h – the instructions fetched after the conditional jump (ADD, SUB) may have to be discarded if the jump is taken and execution continues at et1 • Solutions: • Stall phases • Branch prediction • Out-of-order execution
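A widely used branch-prediction scheme keeps a 2-bit saturating counter per branch; the sketch below illustrates the generic idea only and is not the predictor of any particular processor.

    class TwoBitPredictor:
        # Counter states 0..3: 0 and 1 predict "not taken", 2 and 3 predict "taken".
        def __init__(self):
            self.counters = {}             # branch address -> 2-bit counter

        def predict(self, branch_pc):
            return self.counters.get(branch_pc, 1) >= 2

        def update(self, branch_pc, taken):
            c = self.counters.get(branch_pc, 1)
            c = min(c + 1, 3) if taken else max(c - 1, 0)
            self.counters[branch_pc] = c

    p = TwoBitPredictor()
    outcomes = [True, True, False, True, True, True]   # loop-like branch behaviour
    hits = 0
    for taken in outcomes:
        hits += (p.predict(0x400100) == taken)
        p.update(0x400100, taken)
    print(hits, "of", len(outcomes), "predicted correctly")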
Pipeline architecture – hazard cases • Solving hazard cases: • Detect hazard cases and introduce “stall” phases • Rearrange instructions: • re-arrange the instructions in order to reduce the dependencies between consecutive instructions • Methods: • Static scheduling – done before program execution – optimization made by the compiler or by the user • Dynamic scheduling – done during program execution – optimization made by the processor – out-of-order execution • Branch prediction techniques
Static vs. dynamic scheduling • Static scheduling: • The optimal order of instructions is established by the compiler, based on information about the structure of the pipeline • Advantage: it is done once and benefits every execution of the code • Drawbacks: the compiler must know the structure of the hardware (e.g. pipeline stages, phases of every instruction); the compiler must be changed when the processor version changes • Dynamic scheduling: • The hardware has the capacity to reorder instructions in order to avoid or reduce the effect of hazard cases • Advantages: the processor knows its own structure best; the optimization can be better matched to the hardware; some dependencies are revealed only at run-time • Drawbacks: reordering decisions are made every time the code is executed; more complex hardware is needed
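As a tiny illustration of static scheduling, the sketch below moves an independent instruction between a load and the first instruction that consumes its result, hiding the load-use delay; the instruction representation and the simplified dependency checks are assumptions made only for this example.

    def schedule_load_use(instrs):
        # instrs: list of (name, dest, sources) in program order. If a load is
        # immediately followed by a consumer of its result, try to move a later
        # independent instruction between them (a very small piece of what a
        # compiler scheduler really does).
        out = list(instrs)
        for i in range(len(out) - 1):
            name, dest, _ = out[i]
            if name == "lw" and dest in out[i + 1][2]:          # load-use pair
                for j in range(i + 2, len(out)):
                    nm, d, srcs = out[j]
                    independent = (dest not in srcs and
                                   d not in out[i + 1][2] and
                                   out[i + 1][1] not in srcs)
                    if independent:
                        out.insert(i + 1, out.pop(j))            # hoist it between them
                        break
        return out

    prog = [("lw",  "$t0", ["$s0"]),          # lw  $t0, 0($s0)
            ("add", "$t1", ["$t0", "$t2"]),   # add $t1, $t0, $t2   <- uses the load result
            ("sub", "$t3", ["$t4", "$t5"])]   # sub $t3, $t4, $t5   <- independent
    print([i[0] for i in schedule_load_use(prog)])   # ['lw', 'sub', 'add']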