Single issue machine with multiple pipes

• Motivation: single issue, but different pipes for integer and f-p operations
• We still fetch only 1 instruction/cycle
• We still decode only 1 instruction/cycle
• But we might have several pipelines, and some units that are not pipelined
• The decision on which pipe to use is made at the decode stage
• When a unit is pipelined, an operation can be initiated every cycle; if it is not pipelined, the next operation must wait for the unit's full latency
CSE 471 Multiple pipes

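[Figure: pipeline diagram. IF and ID feed four parallel execution paths: EX (integer), M1 to M7 (multiplier), A1 to A4 (f-p adder), and Div (divider). All paths merge into Me and WB.]
• Two sets of registers: integer and f-p. Loads and stores of f-p registers go through the integer pipe, hence conflicts in the WB stage
• Figure annotation: both needed at beginning of cycle and ready at end of cycle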

Unit latencies
• Pipelines might have an EXE stage that takes multiple cycles, for example:
  • EXE integer: latency 0 (pipelined)
  • FP adder: latency 3 (pipelined)
  • FP multiply (and integer multiply): latency 6 (pipelined)
  • FP divide (and integer divide): latency 25 (not pipelined)
• The result of instruction I can be forwarded to instruction I + 1 + latency
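
The forwarding rule above can be turned into a small stall calculator. This is a minimal sketch, assuming full forwarding and no other hazards; the unit names and the function `raw_stall_cycles` are illustrative and not part of the slides.

```python
# Latencies as listed on the slide; the dictionary keys are illustrative names.
LATENCY = {
    "int_alu": 0,
    "fp_add": 3,
    "fp_mul": 6,   # also integer multiply
    "fp_div": 25,  # also integer divide (not pipelined)
}


def raw_stall_cycles(producer_unit, distance):
    """Stall cycles seen by a consumer `distance` instructions after its producer.

    distance=1 means the consumer is the very next instruction.
    Assumption: full forwarding and no other hazards.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return max(0, LATENCY[producer_unit] - (distance - 1))


print(raw_stall_cycles("fp_add", 1))  # 3: the next instruction waits out the latency
print(raw_stall_cycles("fp_mul", 7))  # 0: I + 1 + latency needs no stall
```

A consumer placed at I + 1 + latency or later sees no stall, which matches the last bullet of the slide.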

Hazards in example multiple cycle pipeline
• Structural: yes
  • The divide unit is not pipelined. Any divides separated by fewer than 25 cycles will stall the pipe
• RAW: yes
  • Essentially handled as in the integer pipe, but with a higher frequency of stalls and more forwarding paths
• Several writes might be “ready” at the same time
• WAW: yes (see in a few slides)
• Out-of-order completion: yes (see in a few slides)
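
The structural hazard on the divider can be sketched as follows. This is an illustration only: `schedule_divides` and its parameter names are hypothetical, and the model assumes the unit is busy for 25 cycles per divide, as the slide's separation rule suggests.

```python
def schedule_divides(request_cycles, busy_cycles=25):
    """Return the cycle at which each divide actually starts.

    request_cycles: cycles at which each divide would like to start.
    busy_cycles: how long the non-pipelined divider is occupied per divide
                 (assumed to be 25, following the slide).
    """
    starts = []
    free_at = 0  # first cycle at which the divider is available
    for request in sorted(request_cycles):
        start = max(request, free_at)  # wait if the unit is still busy
        starts.append(start)
        free_at = start + busy_cycles
    return starts


print(schedule_divides([0, 10, 40]))  # [0, 25, 50]
print(schedule_divides([0, 30]))      # [0, 30]: far enough apart, no stall
```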

RAW Example (st = stall)
F4 <- Load: IF ID EX Me WB
F0 <- F4 * F6: IF ID st M1 M2 M3 M4 M5 M6 M7 Me WB
F2 <- F0 + F8: IF ID st st st st st A1 A2 A3 A4 Me WB
Store <- F2: IF ID EX st st st st st st st Me WB

Conflict in using the WB stage
• Several instructions might want to use the WB stage at the same time
  • E.g., a multd issued at time t and an addd issued at time t + 3
• Solution: reserve the WB stage at the ID stage (a scheme already used in the CRAY-1, a supercomputer built in 1976)
  • Keep track of WB stage usage in a shift register
  • Reserve the right slot. If it is busy, stall for a cycle and repeat
  • Shift every clock cycle

WAW Hazards
• Instruction I writes f-p register Fx at time t
• Instruction I + k writes f-p register Fx at time t - m
• And no instruction I + 1, I + 2, ..., I + k uses Fx (otherwise there would be a stall)
• Seems unlikely, but it can occur as a result of optimizations and will happen when we look at OOO execution
• The only requirement is that I + k’s result not be overwritten
• Solutions (besides register renaming, which we’ll see later):
  • Squash I: difficult to know where it is in the pipe
  • At the ID stage, check that the result register is not a result register in any subsequent stage of the other units. If it is, stall appropriately.
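
One possible reading of "stall appropriately" is to delay the newer instruction just long enough that it writes after every older in-flight write to the same register. The sketch below assumes that reading; `waw_stall_cycles` and the way in-flight instructions are represented are hypothetical.

```python
def waw_stall_cycles(in_flight, dest, cycles_to_write):
    """Stall cycles needed at ID so the new instruction writes `dest` last.

    in_flight: list of (dest_register, cycles_until_write) for older instructions.
    dest: result register of the instruction now in ID.
    cycles_to_write: cycles from ID to WB for the new instruction if not stalled.
    """
    stalls = 0
    for other_dest, remaining in in_flight:
        # Hazard if an older instruction would write the same register
        # at the same time as, or later than, the new instruction.
        if other_dest == dest and remaining >= cycles_to_write:
            stalls = max(stalls, remaining - cycles_to_write + 1)
    return stalls


# An older multiply still needs 6 cycles to write F0; a newer instruction
# would write F0 after only 3 cycles, so it must wait.
print(waw_stall_cycles([("F0", 6)], "F0", 3))  # 4
print(waw_stall_cycles([("F0", 2)], "F0", 3))  # 0: the older write lands first
```

A stricter version would stall whenever the register is a pending result anywhere in the other units, which is simpler to build but stalls more often.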

Out-of-order completion
• Problem with exceptions/interrupts
• Instruction I finishes at time t
• Instruction I + k finishes at time t - m
• No hazard, etc.
• What happens if instruction I causes an exception at a time in [t-m+1, t] and instruction I + k has already written its result?

Exception handling
• Solutions
  • Do nothing (imprecise exceptions; bad with virtual memory)
  • Have a precise mode (by use of testing instructions) and an imprecise mode; restricts concurrency of f-p operations
  • Buffer results until previous (in order) instructions have completed; can be costly when there are large differences in latencies, but the same technique is used for OOO execution
  • Restrict concurrency of f-p operations and, on an exception, “simulate in software” the instructions in between the faulting one and the finished one
  • Flag early those operations that might result in an exception and stall accordingly
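
The buffering solution can be illustrated with a small calculation: a result may update the register file only once every earlier instruction has finished. This is a sketch under simplifying assumptions (any number of results can be released in the same cycle); `commit_times` is a hypothetical helper, and the finish times are made up.

```python
def commit_times(finish_times):
    """Earliest cycle at which each result may be written, in program order.

    finish_times[i] is the cycle at which instruction i finishes executing.
    A result is held in the buffer until all earlier instructions have finished.
    """
    commits = []
    latest = 0
    for finish in finish_times:
        latest = max(latest, finish)  # cannot commit before any predecessor
        commits.append(latest)
    return commits


# Program order: a long divide followed by two short operations (made-up times).
print(commit_times([25, 5, 3]))  # [25, 25, 25]
print(commit_times([3, 10, 6]))  # [3, 10, 10]
```

The first example shows the cost mentioned on the slide: two short operations wait about 20 cycles in the buffer behind a long-latency divide.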

MIPS pipelines (R4000)
• R4000 (about 1993; first 64-bit architecture)
• 8-stage integer pipe
  • Load delay 2 cycles
  • Branch delay: 1 delay slot + 2 cycles; no branch prediction hardware (default prediction of branch not taken)
• 8-stage f-p pipe
  • 3 functional units: adder, multiplier, and divider
  • Stages can be used in any order, multiple times
  • Thus potential conflicts between independent instructions (structural hazards)
  • There exists a whole theory on how to deal with this (reservation tables)
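
As a taste of the reservation-table theory: a reservation table records which stage an operation occupies in each cycle, and two operations started d cycles apart collide if some stage is used at two times that differ by d. The sketch below computes these forbidden distances. The table is a made-up example, not the R4000's, and `forbidden_latencies` is a hypothetical name.

```python
def forbidden_latencies(table):
    """Return the set of issue distances that cause a stage collision.

    table maps a stage name to the list of distinct cycles (relative to issue)
    in which one operation occupies that stage.
    """
    forbidden = set()
    for cycles in table.values():
        for i, first in enumerate(cycles):
            for second in cycles[i + 1:]:
                # A second operation issued this many cycles later would
                # need the same stage in the same cycle.
                forbidden.add(abs(second - first))
    return forbidden


# Made-up table: stage S1 is used in cycles 0 and 4, S2 in cycles 1 and 2, S3 in cycle 3.
EXAMPLE = {"S1": [0, 4], "S2": [1, 2], "S3": [3]}
print(sorted(forbidden_latencies(EXAMPLE)))  # [1, 4]
```

For this table, a second operation may follow the first after 2 or 3 cycles, but not after 1 or 4.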

Alpha pipelines
• Alpha 21064 (2-way superscalar @1993) and Alpha 21164 (4-way superscalar @1995)
  • Fastest in clock rate at the time of introduction
• 21064
  • Ibox (Ifetch and decode: 4 cycles) common to:
    • Ebox (integer execution unit: 3 stages)
    • Fbox (floating-point execution unit: 6 stages)
    • Abox (load-store unit: 3 stages)
  • Stalls can occur only in the first 4 stages.

Alpha 21164
• Two integer pipelines (1 of them also used for load-store)
• Two floating-point pipelines
• Still 7 stages for integer and memory pipelines, but
  • Load delay only 1 cycle instead of 2 in the 21064 (faster check for TLB)
  • Mispredict penalty 5 cycles instead of 4 (better branch prediction though)

MIPS pipelines (R10000)
• MIPS R10000 (4-way out-of-order issue)
• 5 pipelines
  • Common first 2 stages (IF, ID)
  • 2 integer ALUs with 3 more stages (one ALU used for compares; apparently a 3-cycle branch-taken penalty, but … less because of the resume buffer)
  • 1 load-store unit with 4 more stages (1-cycle load delay)
  • 2 FP units with 5 more stages (1 for Add; 1 for Mpy and long-latency ops such as Div and Sqrt)

[Figure: MIPS R10000 pipelines. Common IF and ID stages, followed by:]
• 2 int ALUs: RF EX WB (mispredict penalty: 3 cycles)
• 1 load/store: RF Addr Mem WB (load delay: 1 cycle)
• 1 FP add, 1 FP mpy: RF EX1 EX2 EX3 WB

Power PC
• 601: 2-way issue; slower than the Alpha 21064, but OOO
  • Branch unit (based on condition codes), integer/load/store, f-p
• 620: 4-way issue; OOO (@1995)
  • “Traditional” 5-stage pipeline
  • 2 integer ALUs + 1 mult/div
  • 1 load/store unit
  • 1 FPU
  • 1 branch unit (sophisticated), with branching based on CCs
    • If the instruction setting the CCs and the branch are separated by at least 2 cycles, the prediction is always correct
    • 2K-entry BPT (called BHT) and 256-entry BTB (called BTAC, branch target address cache)
    • Misprediction penalty 2 or 3 cycles

Pentium
• 2-way superscalar (@1992)
• 2 integer ALUs of the 5-stage variety (not quite, since more stages are needed for fetch/align and decode: 2 1/2 stages)
• First 2 stages common to both pipes
• F-P unit has 8 stages (including the common 2); latency of 3 cycles
• Branch penalty: no delay if the BTB prediction is correct or the branch is not taken; otherwise 3 or 4 cycles

Pentium Pro
• OOO issue and completion (@1995)
• Separation between
  • Fetch/decode unit
  • Functional units
  • Retire unit
• CISC instructions are transformed into RISC uops
