Instruction Set Architecture & Pipelining

CS 505: Computer Architecture, Spring 2005, Thu D. Nguyen

Instruction Set Architecture (ISA)
[Figure: layers labeled software, instruction set, hardware]
CS 505: Computer Structures

Instruction Set Architecture (ISA)
[Figure: layers labeled Higher-Level Languages, compiler, software, instruction set, hardware]

Classes of ISAs

Review: Basic ISA Classes
• Accumulator:
  • 1 address: add A          acc ← acc + mem[A]
  • 1+x address: addx A       acc ← acc + mem[A + x]
• Stack:
  • 0 address: add            tos ← tos + next
• General Purpose Register:
  • 2 address: add A B        EA(A) ← EA(A) + EA(B)
  • 3 address: add A B C      EA(A) ← EA(B) + EA(C)
• Load/Store:
  • 3 address: add Ra Rb Rc   Ra ← Rb + Rc
  • load Ra Rb                Ra ← mem[Rb]
  • store Ra Rb               mem[Rb] ← Ra
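As a rough illustration of how the same computation, C = A + B, looks in three of these classes, here is a minimal Python sketch. The `add`, `load`, and `store` semantics for the load/store machine and the `add` semantics for the accumulator and stack machines follow the slide; the accumulator `load`/`store` and the stack `push`/`pop` operations, the tuple encoding of programs, and the addresses `A`, `B`, `C` are assumptions made for the example.

```python
# Hypothetical mini-interpreters, for illustration only.
# Memory is a list of ints; A, B, C are made-up addresses.
A, B, C = 0, 1, 2

def run_accumulator(program, mem):
    acc = 0
    for op, addr in program:
        if op == "load":        # assumed helper: acc <- mem[A]
            acc = mem[addr]
        elif op == "add":       # from the slide: acc <- acc + mem[A]
            acc = acc + mem[addr]
        elif op == "store":     # assumed helper: mem[A] <- acc
            mem[addr] = acc
        else:
            raise ValueError(f"unknown op {op!r}")
    return mem

def run_stack(program, mem):
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":        # assumed helper: push mem[A]
            stack.append(mem[instr[1]])
        elif op == "add":       # from the slide: tos <- tos + next
            tos, nxt = stack.pop(), stack.pop()
            stack.append(tos + nxt)
        elif op == "pop":       # assumed helper: mem[A] <- tos
            mem[instr[1]] = stack.pop()
        else:
            raise ValueError(f"unknown op {op!r}")
    return mem

def run_load_store(program, mem, regs):
    for op, *operands in program:
        if op == "add":         # Ra <- Rb + Rc
            ra, rb, rc = operands
            regs[ra] = regs[rb] + regs[rc]
        elif op == "load":      # Ra <- mem[Rb]
            ra, rb = operands
            regs[ra] = mem[regs[rb]]
        elif op == "store":     # mem[Rb] <- Ra
            ra, rb = operands
            mem[regs[rb]] = regs[ra]
        else:
            raise ValueError(f"unknown op {op!r}")
    return mem

acc_prog = [("load", A), ("add", B), ("store", C)]
stack_prog = [("push", A), ("push", B), ("add",), ("pop", C)]
ls_prog = [("load", "r4", "r1"), ("load", "r5", "r2"),
           ("add", "r6", "r4", "r5"), ("store", "r6", "r3")]
```

The load/store program is the longest of the three but touches memory only in its explicit loads and stores, which is the trade-off the later slides return to.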

Evolution of Instruction Sets
• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation
  • High-level Language Based (B5000 1963)
  • Concept of a Family (IBM 360 1964)
• General Purpose Register Machines
  • Complex Instruction Sets (Vax, Intel 432 1977-80)
  • Load/Store Architecture (CDC 6600, Cray 1 1963-76)
• RISC (Mips, Sparc, HP-PA, IBM RS6000, ... 1987)

Issues in Instruction Set Design
• Opcodes
• Memory addressing
• Type and size of operands
• Encoding
• Implementation (pipelining, exploiting ILP)

Addressing Modes
• Too many to count?
• Register, Immediate, Displacement, Register indirect, Indexed, Direct, Memory indirect, Autoincrement, Autodecrement, Scaled

Usage of Addressing Modes

Displacement

Immediate Usage

Size of Immediate Operands

Quantitative Design Methodology
• The preceding study is an example of quantitative design methodology
• The rest of the data and results are left for you to read in the book
• These data led to many of the design decisions embodied in today’s RISC processors
• So how come Intel (and AMD) are so successful when going against this trend?

VAX-11: the canonical CISC
• Variable format, 2 and 3 address instructions
• Rich set of orthogonal address modes
  • immediate, offset, indexed, autoinc/dec, indirect, indirect+offset
  • applied to any operand
• Simple and complex instructions
  • synchronization instructions
  • data structure operations (queues)
  • polynomial evaluation

Review: Load/Store Architectures
• 3 address GPR
• Register to register arithmetic
• Load and store with simple addressing modes (reg + immediate)
• Simple conditionals
  • compare ops + branch z
  • compare&branch
  • condition code + branch on condition
• Simple fixed-format encoding
  • op r r r
  • op r r immed
  • op offset
[Figure: blocks labeled MEM and reg]
• Substantial increase in instructions
• Decrease in data BW (due to many registers)
• Even more significant decrease in CPI (pipelining)
• Cycle time, Real estate, Design time, Design complexity

Case Study: MIPS
• Simple load-store instruction set
• Designed for pipelining efficiency
• Efficient as a compiler target

MIPS
• 32 64-bit GPRs
  • R0 is always 0
• 32 FPRs (capable of holding double-precision 64-bit values)
• Data types: 8-bit bytes, 16-bit half words, 32-bit words, 64-bit double words, 32-bit and 64-bit single/double precision floating point
• Addressing modes: immediate & displacement
  • 16-bit fields
  • Register Indirect? Absolute addressing?
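One conventional reading of the two closing questions is that displacement addressing already covers both cases: a zero displacement behaves like register indirect, and using R0 (always 0) as the base behaves like absolute addressing within the 16-bit field. The Python sketch below illustrates this under stated assumptions; the function names, the register-file list, and the 64-bit wraparound are made up for the example.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit address arithmetic

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def effective_address(regs, base, disp16):
    """Displacement addressing: EA = regs[base] + sign-extended 16-bit field.

    regs is a hypothetical 32-entry register file; register 0 reads as 0.
    """
    base_value = 0 if base == 0 else regs[base]
    return (base_value + sign_extend16(disp16)) & MASK64

regs = [0] * 32
regs[2] = 0x1000

displacement = effective_address(regs, 2, 8)           # 8(r2)
register_indirect = effective_address(regs, 2, 0)      # 0(r2)
absolute = effective_address(regs, 0, 0x0100)          # 0x100(r0)
negative = effective_address(regs, 2, 0xFFFC)          # -4(r2)
```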

MIPS Instruction Format

MIPS Instruction Set
• Arithmetic/logical
  • Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU
  • AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
  • SLL, SRL, SRA, SLLV, SRLV, SRAV
• Memory Access
  • LB, LBU, LH, LHU, LW, LWL, LWR
  • SB, SH, SW, SWL, SWR
• Control
  • J, JAL, JR, JALR
  • BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

Instruction Usage

Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use

What’s a Clock Cycle?
[Figure: combinational logic between latches or registers]
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays
  • clock propagation, wire lengths, etc.

Fast, Pipelined Instruction Interpretation
[Figure: Instruction Fetch → Instruction Register → Decode & Operand Fetch → Operand Registers → Execute → Result Registers → Store Results → Registers or Mem; alongside, a timing diagram of five instructions overlapping through the IF, D, E, W stages over time]

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, run one after another, each with stages of 30, 40, and 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped; segments of 30, 40, 40, 40, 40, and 20 minutes along the time axis]
• Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons
[Figure: pipelined laundry timeline from 6 PM; loads A, B, C, D in task order; segments of 30, 40, 40, 40, 40, and 20 minutes]
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
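The laundry numbers can be checked with a few lines of Python. This is a sketch that assumes every load is identical and passes through the three stages (30, 40, and 20 minutes) in order; the function names are made up for the example.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(stage_minutes, loads):
    """Identical loads, overlapped: the first load fills the pipeline,
    then one load completes every max(stage) minutes (the slowest stage)."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)

STAGES = [30, 40, 20]   # stage times in minutes, from the laundry slides
LOADS = 4

seq = sequential_minutes(STAGES, LOADS)    # 360 minutes = 6 hours
pipe = pipelined_minutes(STAGES, LOADS)    # 210 minutes = 3.5 hours
speedup = seq / pipe                       # below the 3-stage ideal of 3
```

The speedup comes out well short of 3 because the stages are unbalanced and because four loads are too few to amortize the fill and drain time.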

Instruction Pipelining
• Execute billions of instructions, so throughput is what matters
• What is desirable in instruction sets for pipelining?
  • Variable length instructions vs. all instructions same length?
  • Memory operands part of any operation vs. memory operands only in loads or stores?
  • Register operand many places in instruction format vs. registers located in same place?

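Example: MIPS (Note register location)
• Register-Register
  • bits 31-26: Op
  • bits 25-21: Rs1
  • bits 20-16: Rs2
  • bits 15-11: Rd
  • bits 10-0: Opx (the slide also marks a boundary between bits 6 and 5)
• Register-Immediate
  • bits 31-26: Op
  • bits 25-21: Rs1
  • bits 20-16: Rd
  • bits 15-0: immediate
• Branch
  • bits 31-26: Op
  • bits 25-21: Rs1
  • bits 20-16: Rs2/Opx
  • bits 15-0: immediate
• Jump / Call
  • bits 31-26: Op
  • bits 25-0: target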
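Because the register fields sit at the same bit positions in every format, a decoder can extract them before it knows which format applies. The Python sketch below shows this for the Register-Register and Register-Immediate layouts; the helper names and the dictionary output are assumptions made for the example, and Opx is treated as the single field in bits 10-0.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_register_register(word):
    return {
        "op":  bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2": bits(word, 20, 16),
        "rd":  bits(word, 15, 11),
        "opx": bits(word, 10, 0),
    }

def decode_register_immediate(word):
    imm = bits(word, 15, 0)
    if imm & 0x8000:            # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "op":  bits(word, 31, 26),
        "rs1": bits(word, 25, 21),  # same position as in Register-Register
        "rd":  bits(word, 20, 16),
        "imm": imm,
    }

# Made-up example words, assembled field by field.
rr_word = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 0x20
ri_word = (8 << 26) | (2 << 21) | (1 << 16) | 0xFFFC

rr_fields = decode_register_register(rr_word)
ri_fields = decode_register_immediate(ri_word)
```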

5 Steps of MIPS Datapath
[Figure 3.1, Page 130, CA:AQA 2e: datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, Adder (+4), Next SEQ PC, Inst Memory, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero?, Data Memory, LMD, WB Data]

5 Steps of MIPS Datapath
[Figure 3.4, Page 134, CA:AQA 2e: the same five steps separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; Next SEQ PC and RD are carried along through the pipeline registers]

Visualizing Pipelining
[Figure 3.3, Page 133, CA:AQA 2e: instructions in program order (vertical axis) against time in clock cycles 1 to 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous one]
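The staircase shape of such a diagram can be generated programmatically. The following Python sketch assumes an ideal five-stage pipeline with no stalls, one instruction entering per cycle; the stage names and the `pipeline_diagram` helper are made up for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in
    cycle c (0-based), or "." when the instruction is not in the pipeline."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["."] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name   # instruction i enters IF in cycle i
        rows.append(row)
    return rows

rows = pipeline_diagram(4)
for row in rows:
    print(" ".join(f"{cell:>3}" for cell in row))
```

With n instructions and k stages and no stalls, the last instruction finishes in cycle n + k - 1, which is why throughput approaches one instruction per cycle as n grows.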

It’s Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  • Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  • Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Example: One Memory Port / Structural Hazard
[Figure 3.6, Page 142, CA:AQA 2e: pipeline diagram over clock cycles 1 to 7 for Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order; with a single memory port, the Load’s data access (DMem) and a later instruction’s fetch (Ifetch) need memory in the same cycle, which is the structural hazard]

Resolving structural hazards
• Structural hazards: attempt to use same hardware for two different things at the same time
• Solution 1: Wait
  • must detect the hazard
  • must have mechanism to stall
• Solution 2: Throw more hardware at the problem

Detecting and Resolving Structural Hazard
[Figure: pipeline diagram over clock cycles 1 to 7 for Load, Instr 1, Instr 2, Stall, Instr 3 in program order; a row of bubbles occupies the stalled slot and delays Instr 3 by one cycle]

Eliminating Structural Hazards at Design Time
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with a separate Instr Cache and Data Cache; the Datapath and Control Path are labeled]

Role of Instruction Set Design in Structural Hazard Resolution
• Simple to determine the sequence of resources used by an instruction
  • opcode tells it all
• Uniformity in the resource usage
• MIPS approach => all instructions flow through the same 5-stage pipeline

Data Hazards
[Figure 3.9, page 147, CA:AQA 2e: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time in clock cycles for the sequence below, in program order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes
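The three cases can be told apart purely from the register names two instructions read and write. The Python sketch below classifies the dependences between an earlier instruction I and a later instruction J; the `Instr` tuple and `dependences` function are hypothetical names for this example. Note that it reports dependences only; whether one becomes an actual hazard depends on the pipeline, as the slides point out for the 5-stage MIPS pipe.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    op: str
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read

def dependences(i, j):
    """Classify dependences from earlier instruction i to later instruction j."""
    kinds = set()
    if i.dest in j.srcs:
        kinds.add("RAW")    # j reads what i writes (true data dependence)
    if j.dest in i.srcs:
        kinds.add("WAR")    # j overwrites what i reads (anti-dependence)
    if i.dest == j.dest:
        kinds.add("WAW")    # both write the same register (output dependence)
    return kinds

# The I/J pairs from the three slides.
raw = dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3")))
war = dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3")))
waw = dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3")))
```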

Forwarding to Avoid Data Hazard
[Figure 3.10, Page 149, CA:AQA 2e: pipeline diagram over time in clock cycles for the sequence below, in program order, with results forwarded between stages]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

HW Change for Forwarding
[Figure 3.20, Page 161, CA:AQA 2e: muxes on the ALU inputs select among Registers, Immediate, NextPC, and values fed back from the ID/EX, EX/MEM, and MEM/WR pipeline registers; Data Memory follows the ALU]

Data Hazard Even with Forwarding
[Figure 3.12, Page 153, CA:AQA 2e: pipeline diagram over time in clock cycles for the sequence below, in program order]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

Resolving this load hazard
• Adding hardware? ... not
• Detection?
• Compilation techniques?
• What is the cost of load delays?

Resolving the Load Data Hazard
[Figure: pipeline diagram over time in clock cycles for the sequence below, in program order; a bubble is inserted after the load, delaying the following instructions by one cycle]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.
• Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
• Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
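The difference between the two schedules can be counted mechanically. The Python sketch below assumes a 5-stage pipeline with forwarding in which the only remaining stall is one cycle when an instruction uses a register loaded by the instruction immediately before it; the tuple encoding and the `load_use_stalls` name are made up for the example.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: a load immediately followed by a user of its result.

    Each instruction is (opcode, destination register or None, source registers).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

slow_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

slow_stalls = load_use_stalls(slow_code)   # ADD after LW Rc, SUB after LW Rf
fast_stalls = load_use_stalls(fast_code)   # every load is separated from its use
```

Under these assumptions the slow schedule stalls twice and the fast one not at all, with the same eight instructions in both.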

Control Hazard on Branches => Three Stage Stall
[Figure: pipeline diagram for the sequence below, in program order]
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two part solution:
  • Determine branch taken or not sooner, AND
  • Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  • Move Zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch versus 3
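To put a number on "significant", a back-of-the-envelope CPI estimate helps. The Python sketch below assumes an ideal base CPI of 1 and that every branch pays the full stall penalty; both are simplifying assumptions for the example, and the function name is made up.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, base_cpi=1.0):
    """Effective CPI = base CPI + (fraction of branches) * (stall cycles per branch)."""
    return base_cpi + branch_fraction * stall_cycles

cpi_three_cycle = cpi_with_branch_stalls(0.30, 3)   # about 1.9
cpi_one_cycle = cpi_with_branch_stalls(0.30, 1)     # about 1.3
relative_speedup = cpi_three_cycle / cpi_one_cycle  # gain from resolving branches in ID/RF
```

Under these assumptions the 3-cycle penalty nearly doubles the CPI, while moving the zero test and the target adder into ID/RF brings it back to roughly 1.3.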

Pipelined MIPS Datapath
[Figure 3.22, page 163, CA:AQA 2/e: five-stage pipelined datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; the Zero? test and an additional Adder for the branch target are placed in the Instr. Decode / Reg. Fetch stage]
