Computer Systems

Computer Systems – the processor architecture

Basic Knowledge
• Relative timing of the elements is important

Programmer-visible state
• Program registers: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp
• Condition codes (CC)
• Program counter (PC)
• Memory
• Von Neumann architecture: both instructions and data are in memory

Program counter
• The program counter holds the address of the instruction currently being executed
• The next instruction has to be fetched from memory (slow!)

Figure: virtual address space of the hello program, from high to low addresses (the PC points into this memory)
  0xffffffff
  Kernel virtual memory (memory invisible to user code)
  0xc0000000
  User stack (created at runtime)
  Memory-mapped region for shared libraries (printf() function)
  0x40000000
  Run-time heap (created at runtime by malloc)
  Read/write data (loaded from the hello executable file)
  Read-only code and data (loaded from the hello executable file)
  0x08048000
  Unused
  0

Processing a single instruction
• Fetch
  • Read the instruction (1-6 bytes) from memory
• Decode
  • Read the values from the registers
• Execute
  • Perform an arithmetic/logic operation OR test the jump conditions
• Memory
  • Read/write to memory
• Write back
  • Update the registers
• PC update
  • Set the address of the next instruction

Seq. architecture
• Hardware connected with named wires (word & bytes, byte & bits, bit)

Figure: SEQ hardware stages, bottom to top: Fetch (PC, instruction memory, PC increment; Split and Align turn Byte 0 and Bytes 1-5 into icode, ifun, rA, rB, valC and valP, with the signals Instr valid, Need regids and Need valC), Decode and Write back (register file with ports A, B, M, E), Execute (ALU, CC), Memory (data memory), PC.

Stage Computation: ALU operation
• Formulate instruction execution as a sequence of simple steps
• Use the same general form for all instructions

OPl rA, rB
Stage       Computation               Comment
Fetch       icode:ifun ← M1[PC]       Read instruction byte
            rA:rB ← M1[PC+1]          Read register byte
            valP ← PC+2               Compute next PC
Decode      valA ← R[rA]              Read operand A
            valB ← R[rB]              Read operand B
Execute     valE ← valB OP valA       Perform ALU operation (selected by ifun)
            Set CC                    Set condition code register
Memory
Write back  R[rB] ← valE              Write back result
PC update   PC ← valP                 Update PC
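The OPl stage computation can be traced in a few lines of Python. This is only a minimal sketch, not the book's simulator: the function name `step_opl`, the list-based register file and the reduced condition codes (ZF and SF only) are illustrative assumptions, and the encoding (icode 6 for OPl, ifun 0-3 for add, sub, and, xor, register ids in the high and low nibble of the second byte) is assumed to follow the Y86 convention.

```python
# Sketch of the OPl rA, rB stage computation (assumed Y86-style encoding).
MASK = 0xFFFFFFFF  # keep results in 32 bits

# ifun -> ALU function; the operand order matches valE <- valB OP valA
ALU_OPS = {
    0: lambda b, a: b + a,  # addl
    1: lambda b, a: b - a,  # subl
    2: lambda b, a: b & a,  # andl
    3: lambda b, a: b ^ a,  # xorl
}

def step_opl(pc, mem, regs):
    """Run one OPl instruction; updates regs in place, returns (new_pc, cc)."""
    # Fetch
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2
    assert icode == 6, "this sketch only handles OPl"
    # Decode
    valA, valB = regs[rA], regs[rB]
    # Execute
    valE = ALU_OPS[ifun](valB, valA) & MASK
    cc = {"ZF": valE == 0, "SF": bool(valE >> 31)}  # OF omitted for brevity
    # Memory: nothing to do for OPl
    # Write back
    regs[rB] = valE
    # PC update
    return valP, cc
```

For example, with `regs[2] = 3` and `regs[3] = 10`, executing the two bytes `0x61 0x23` (a subtract with rA = 2, rB = 3) leaves 7 in `regs[3]` and returns a new PC of 2.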

Stage Computation: procedure call
• Use ALU to decrement stack pointer
• Store incremented PC

call Dest
Stage       Computation               Comment
Fetch       icode:ifun ← M1[PC]       Read instruction byte
            valC ← M4[PC+1]           Read destination address
            valP ← PC+5               Compute return point
Decode      valB ← R[%esp]            Read stack pointer
Execute     valE ← valB + (-4)        Decrement stack pointer
Memory      M4[valE] ← valP           Write return value on stack
Write back  R[%esp] ← valE            Update stack pointer
PC update   PC ← valC                 Set PC to destination
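The call sequence can be sketched the same way. The function name `step_call`, the `bytearray` memory and the list-based register file are assumptions made for illustration; icode 8 for call, register id 4 for %esp and little-endian 4-byte words are assumed to follow the Y86 convention.

```python
import struct

ESP = 4  # assumed register id of %esp

def step_call(pc, mem, regs):
    """Run one call instruction on a bytearray memory; returns the new PC."""
    # Fetch: one instruction byte followed by a 4-byte destination
    icode = mem[pc] >> 4
    assert icode == 8, "this sketch only handles call"
    (valC,) = struct.unpack_from("<I", mem, pc + 1)
    valP = pc + 5  # return point: address of the next instruction
    # Decode
    valB = regs[ESP]
    # Execute: the ALU adds -4 to the stack pointer
    valE = (valB - 4) & 0xFFFFFFFF
    # Memory: push the return address
    struct.pack_into("<I", mem, valE, valP)
    # Write back
    regs[ESP] = valE
    # PC update
    return valC
```

Note that the return address is written at the already decremented stack pointer (valE), which is why the decrement happens in Execute, before the Memory stage.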

Stage Computation: jump
• Compute both addresses
• Choose based on the setting of the condition codes and the branch condition XX/ifun

jXX Dest
Stage       Computation               Comment
Fetch       icode:ifun ← M1[PC]       Read instruction byte
            valC ← M4[PC+1]           Read destination address
            valP ← PC+5               Fall-through address
Decode
Execute     Bch ← Cond(CC, ifun)      Take branch?
Memory
Write back
PC update   PC ← Bch ? valC : valP    Update PC

Branch conditions jXX
Instruction  icode  ifun
jmp          7      0
jle          7      1
jl           7      2
je           7      3
jne          7      4
jge          7      5
jg           7      6
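How the ifun code and the condition codes combine into the branch decision can be sketched as follows. The flag names ZF, SF and OF (zero, sign, overflow) and the signed-comparison formulas are the usual x86-style conventions and are assumed here; the helper names `cond` and `next_pc` are hypothetical.

```python
def cond(zf, sf, of, ifun):
    """Return True if the jXX variant selected by ifun is taken."""
    table = {
        0: True,                   # jmp: always
        1: (sf != of) or zf,       # jle: less or equal
        2: sf != of,               # jl:  less
        3: zf,                     # je:  equal
        4: not zf,                 # jne: not equal
        5: sf == of,               # jge: greater or equal
        6: (sf == of) and not zf,  # jg:  greater
    }
    return table[ifun]

def next_pc(zf, sf, of, ifun, valC, valP):
    """PC update for jXX: destination if taken, fall-through otherwise."""
    return valC if cond(zf, sf, of, ifun) else valP
```

Both candidate addresses (valC and valP) are available before the condition is evaluated; the condition only selects between them.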

Execute Logic
Datapaths & Control Logic
• ALU fun: select function
• ALU A: select input A
• ALU B: select input B
• Set CC: should the condition code register be loaded?

Control logic: ALU A

Instruction        Execute computation       Comment
OPl rA, rB         valE ← valB OP valA       Perform ALU operation
rmmovl rA, D(rB)   valE ← valB + valC        Compute effective address
popl rA            valE ← valB + 4           Increment stack pointer
jXX Dest           (none)                    No operation
call Dest          valE ← valB + (-4)        Decrement stack pointer
ret                valE ← valB + 4           Increment stack pointer

    int aluA = [
        icode in { IRRMOVL, IOPL } : valA;
        icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
        icode in { ICALL, IPUSHL } : -4;
        icode in { IRET, IPOPL } : 4;
        # Other instructions don't need ALU
    ];
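The case expression for aluA selects the first alternative whose condition holds. A rough Python equivalent, assuming nothing about the numeric instruction codes (the `Icode` enum and the function name `alu_a` are illustrative, not part of the HCL description), could look like this:

```python
from enum import Enum, auto

class Icode(Enum):
    """Symbolic instruction codes; the numeric values are irrelevant here."""
    RRMOVL = auto()
    IRMOVL = auto()
    RMMOVL = auto()
    MRMOVL = auto()
    OPL = auto()
    JXX = auto()
    CALL = auto()
    RET = auto()
    PUSHL = auto()
    POPL = auto()

def alu_a(icode, valA, valC):
    """First-match selection of ALU input A, mirroring the case expression."""
    if icode in (Icode.RRMOVL, Icode.OPL):
        return valA
    if icode in (Icode.IRMOVL, Icode.RMMOVL, Icode.MRMOVL):
        return valC
    if icode in (Icode.CALL, Icode.PUSHL):
        return -4
    if icode in (Icode.RET, Icode.POPL):
        return 4
    return None  # other instructions don't need the ALU
```

In hardware this selection is a multiplexer rather than a chain of tests; the sequential `if` statements only model its priority order.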

Hardware structure
• This can be translated into silicon

Figure: SEQ datapath. Fetch (PC, instruction memory, PC increment) produces icode, ifun, rA, rB, valC and valP; Decode and Write back (register file with ports A, B, M, E, selected by srcA, srcB, dstE, dstM) produce valA and valB; Execute (ALU with inputs ALU A and ALU B, ALU fun., CC) produces valE and Bch; Memory (Mem. control, Addr, Data, data memory) produces valM; New PC computes newPC.


Sequential is too slow
• The clock has to be slow enough to let the signal propagate through all wires and transistors
• Critical path: the slowest path between any two storage devices

Figure: storage devices driven by a common clock (Clk).

Pipelining
• Divide the operation into stages, and let the next operation start as soon as the previous one has finished the first stage
• Increases the throughput, but also increases the latency

Figure: three blocks of combinational logic (A, B, C; 100 ps each), each followed by a register (20 ps), all driven by the same clock.
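The effect on throughput and latency can be checked with the numbers from the figure (100 ps of combinational logic per stage, 20 ps per register). The sketch below assumes the usual simple model: the clock period is the slowest stage plus one register delay, and one instruction completes per cycle. The function name `pipeline_timing` is made up for this example.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps):
    """Return (clock period in ps, latency in ps, throughput in GIPS)."""
    # The slowest stage plus its pipeline register sets the clock period.
    clock_ps = max(stage_delays_ps) + reg_delay_ps
    # Every instruction passes through all stages, one cycle each.
    latency_ps = clock_ps * len(stage_delays_ps)
    # One instruction per cycle; 1000 ps per ns gives giga-instructions/s.
    throughput_gips = 1000.0 / clock_ps
    return clock_ps, latency_ps, throughput_gips

# Unpipelined: all 300 ps of logic in one stage, one register.
unpipelined = pipeline_timing([300], 20)
# Three-stage pipeline as in the figure.
pipelined = pipeline_timing([100, 100, 100], 20)
```

Under these assumptions the unpipelined version has a 320 ps clock and 320 ps latency, while the three-stage pipeline has a 120 ps clock (roughly 8.3 instead of 3.1 GIPS) but a latency of 360 ps.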

Insert registers between stages
• Pipeline registers mean extra silicon and delay

Figure: pipeline diagram. Instructions I1 to I5 each pass through F, D, E, M, W, each starting one cycle after the previous one (cycles 1 to 9); in cycle 5, I1 is in W, I2 in M, I3 in E, I4 in D and I5 in F.

Figure: pipelined datapath with pipeline registers F (predPC), D (icode, ifun, rA, rB, valC, valP), E (valA, valB), M (Bch, valE) and W (valM) inserted between the Fetch, Decode, Execute, Memory and Write back stages. Signal names carry the stage they come from as a prefix, e.g. f_PC, d_srcA, d_srcB, M_icode, M_Bch, M_valA, W_icode, W_valE, W_valM, W_dstE, W_dstM.

Data hazards
Additional pipeline control is needed to prevent unintended interactions between instructions
• Stalling (wait a few cycles until the hazard is gone)
• Data forwarding (passing a value to E before M/W)
Pipeline architecture already used for the i386: http://www.pcmech.com/show/processors/35/

Pipeline efficiency
Pipeline control can prevent many, but not all, interactions between instructions → bubbles
For the model described in the book:
• Load/use hazards (20% of load instr. → 1 bubble)
• Mispredicted branches (40% of jmp instr. → 2 bubbles)
• Return from procedure calls (100% of ret instr. → 3 bubbles)
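The bubble counts translate into an average number of cycles per instruction (CPI) once an instruction mix is known. The hazard rates and bubble counts below are the ones listed above; the instruction frequencies (25% loads, 20% jumps, 2% returns) are not given on the slide and are assumed purely for illustration, as is the function name `cpi`.

```python
def cpi(instr_freq, hazard_rate, bubbles):
    """CPI = 1 + sum over hazard types of frequency * rate * bubbles."""
    penalty = sum(
        instr_freq[kind] * hazard_rate[kind] * bubbles[kind]
        for kind in instr_freq
    )
    return 1.0 + penalty

# From the slide: fraction of instructions of each kind that cause bubbles.
hazard_rate = {"load": 0.20, "jump": 0.40, "ret": 1.00}
bubbles = {"load": 1, "jump": 2, "ret": 3}
# ASSUMPTION: illustrative instruction mix, not taken from the slide.
instr_freq = {"load": 0.25, "jump": 0.20, "ret": 0.02}

average_cpi = cpi(instr_freq, hazard_rate, bubbles)
```

With this assumed mix the penalty is 0.05 + 0.16 + 0.06 = 0.27 bubbles per instruction, so the CPI is about 1.27; mispredicted branches dominate the loss.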

Today's architectures
• Superscalar (Pentium) (often two instructions/cycle)
• Dynamic execution (P6) (three instructions out-of-order/cycle)
• Explicit parallelism (Itanium) (six execution units)

Hyper-Threading
http://or1cedar.intel.com/media/training/detect_ht_dt_v1/tutorial/ch6/topic04.htm

Metrics of performance
Level                      Metric
Application                Answers per month; scaling of algorithms
Programming language
Compiler
ISA                        (Millions of) instructions per second: MIPS
                           (Millions of) (F.P.) operations per second: MFLOP/s
Datapath, Control          Megabytes per second
Function units             Cycles per second (clock rate)
Transistors, wires, pins
Each metric has a place and a purpose, and each can be optimized

Summary
• Shown that an instruction set architecture can be translated onto multiple processor architectures
• Complicated control logic on datapaths
• Compilers have to optimize the control logic for multiple machines/targets
• A programmer can aid or frustrate the compiler

Assignment
• Practice Problem 4.26 (page 430)
Calculate the throughput and latency of an n-stage pipeline for the given 6 blocks

Figure: six combinational blocks A to F (delays shown: 80 ps, 70 ps, 30 ps, 10 ps, 60 ps, 50 ps) followed by a register (20 ps).

Computer Systems